A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Sachin Gupta and Nitish Lakhanpal
November 1, 2010
Introduction
Scenario: a gambler wants to let his fellow gamblers make bets on his behalf
He will wager a fixed sum of money on every race, but will apportion his money among his friends based on how well they do
Obviously, if he knew which of his friends would do the best, he would just give him all of the money
He wants an allocation strategy that will mimic the payoff he would have attained had he bet with the luckiest friend
The paper discusses a simple algorithm for solving these dynamic allocation problems
On-line Allocation Model
Agent A has N options or strategies to choose from
Number the strategies with the integers 1, ..., N
At each time step t = 1, 2, ..., T:
Allocator A decides on a distribution p^t over the strategies: p^t_i ≥ 0 is the amount allocated to strategy i, and ∑_{i=1}^{N} p^t_i = 1
Each strategy suffers a loss l^t_i, which is decided by the environment
The loss suffered by A is ∑_i p^t_i · l^t_i, i.e. the average loss of the strategies with respect to A's chosen allocation rule
This loss function is called the mixture loss
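The paper's Hedge(β) algorithm solves this allocation problem with multiplicative weights: keep one weight per strategy, allocate proportionally to the weights, and shrink each weight by β^loss after the round. A minimal sketch in Python (the function and variable names are ours, not the paper's):

```python
def hedge(losses, beta=0.9):
    """Run Hedge(beta) over a sequence of loss vectors, one per round.

    losses: list of rounds, each a list of N per-strategy losses in [0, 1].
    Weights start equal; after each round every weight is multiplied by
    beta ** loss, so persistently bad strategies lose influence.
    Returns the cumulative mixture loss sum_t p^t . l^t.
    """
    n = len(losses[0])
    w = [1.0] * n
    total = 0.0
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]                        # p^t_i = w^t_i / sum_j w^t_j
        total += sum(pi * li for pi, li in zip(p, l))   # mixture loss this round
        w = [wi * beta ** li for wi, li in zip(w, l)]   # multiplicative update
    return total
```

With losses [[0.0, 1.0]] repeated (strategy 0 is always best), the cumulative mixture loss stays bounded by roughly ln(N)/(1 − β) on top of the best strategy's loss, rather than growing like T/2 as a fixed uniform allocation would.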
Bounds on Losses
The loss suffered by any strategy is bounded, so that without loss of generality l^t_i ∈ [0, 1]
No assumption is made about the form of the loss vectors l^t or the manner in which they are generated
The adversary's choice of l^t may even depend on the allocator's chosen mixture p^t
The goal of algorithm A is to minimize its cumulative loss relative to the loss suffered by the best strategy
A attempts to minimize the net loss L_A − min_i L_i, where
L_A = ∑_{t=1}^{T} p^t · l^t, the total cumulative loss suffered by algorithm A on the first T trials
L_i = ∑_{t=1}^{T} l^t_i, the cumulative loss of strategy i
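These quantities are mechanical to compute; a tiny helper (ours, exercised on a made-up loss sequence) makes the definitions concrete:

```python
def net_loss(allocations, losses):
    """Net loss L_A - min_i L_i of an allocator.

    allocations: the allocator's p^t for each round (each sums to 1).
    losses: the environment's l^t for each round, entries in [0, 1].
    L_A = sum_t p^t . l^t and L_i = sum_t l^t_i, as on the slide.
    """
    L_A = sum(sum(pi * li for pi, li in zip(p, l))
              for p, l in zip(allocations, losses))
    per_strategy = [sum(l[i] for l in losses) for i in range(len(losses[0]))]
    return L_A - min(per_strategy)
```

For a uniform allocator over two strategies with losses [[0, 1], [1, 0], [0, 1]], the mixture loss is L_A = 1.5 while the best strategy has L = 1, so the net loss is 0.5.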
Repeated n Decision Game
The Hedge algorithm can be represented as a repeated n-decision game in a very natural way.
Let there be payoff vectors p^t, with entries p^t_i, equivalent to our loss vectors l^t
Let there be a decision space ∆_n consisting of decisions x^t, equivalent to our (normalized) weight vectors w^t
Proof of Regret Bound for Exponential Updating
We prove this in the repeated n-decision game framework (equivalent to the paper's formulation)
Lemma 1: For any ε > 0, the x^t of the weighted-majority algorithm Hedge(ε) maximizes

\[ \frac{1}{\varepsilon} H(x) + p^1 \cdot x + \dots + p^{t-1} \cdot x \]

over x ∈ ∆_n
Proof of Lemma 1 (Regularity)
By definition, we have for any x ∈ ∆_n (writing P^t_i = ∑_{s<t} p^s_i for the cumulative payoff of decision i),

\[ \frac{1}{\varepsilon} H(x) + p^1 \cdot x + \dots + p^{t-1} \cdot x = \sum_i \left( \frac{1}{\varepsilon}\, x_i \log \frac{1}{x_i} + P^t_i\, x_i \right) \]

By simple algebra, the above is equal to

\[ \sum_i \frac{1}{\varepsilon} \left( x_i \log \frac{1}{x_i} + \varepsilon\, x_i P^t_i \right) = \frac{1}{\varepsilon} \sum_i x_i \log \frac{e^{\varepsilon P^t_i}}{x_i} \]

For the x_i chosen by the algorithm, x_i = e^{εP^t_i} / Z^t with Z^t = ∑_i e^{εP^t_i}, so the above expression is (1/ε) ∑ x_i log Z^t = (1/ε) log Z^t, because ∑ x_i = 1. Using Jensen's inequality, for any x ∈ ∆_n we have:

\[ \frac{1}{\varepsilon} \sum_i x_i \log \frac{e^{\varepsilon P^t_i}}{x_i} \le \frac{1}{\varepsilon} \log \sum_i x_i\, \frac{e^{\varepsilon P^t_i}}{x_i} = \frac{1}{\varepsilon} \log Z^t \]
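Lemma 1 is easy to sanity-check numerically: the exponential-weights point x_i ∝ e^{εP_i} should score at least as high on (1/ε)H(x) + P · x as any other point of the simplex, and its value should equal (1/ε) log Z. A small check in pure Python (all names ours):

```python
import math
import random

def objective(x, P, eps):
    """(1/eps) * H(x) + P . x, where H(x) = sum_i x_i log(1/x_i)."""
    H = sum(-xi * math.log(xi) for xi in x if xi > 0)
    return H / eps + sum(xi * Pi for xi, Pi in zip(x, P))

def hedge_point(P, eps):
    """The exponential-weights choice: x_i proportional to exp(eps * P_i)."""
    w = [math.exp(eps * Pi) for Pi in P]
    Z = sum(w)
    return [wi / Z for wi in w]

random.seed(0)
P, eps = [0.3, -1.2, 0.8], 0.5
x_star = hedge_point(P, eps)
best = objective(x_star, P, eps)
# Lemma 1: no other point of the simplex scores higher, and the
# maximum value is (1/eps) * log Z^t (the Jensen equality case).
for _ in range(1000):
    u = [random.random() for _ in P]
    s = sum(u)
    x = [ui / s for ui in u]          # a random point of the simplex
    assert objective(x, P, eps) <= best + 1e-12
```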
Proof of Lemma 2 (Stability)
Lemma 2: For any ε, M > 0, t ≥ 1 and p^1, p^2, ..., p^t ∈ [−M, M]^n,

\[ \left| p^t \cdot x^{t+1} - p^t \cdot x^t \right| \le 4\varepsilon M^2 \]

Proof
Note first that P^{t+1}_i − M ≤ P^t_i ≤ P^{t+1}_i + M, and hence

\[ e^{\varepsilon P^{t+1}_i} e^{-\varepsilon M} \le e^{\varepsilon P^t_i} \le e^{\varepsilon P^{t+1}_i} e^{\varepsilon M} \]

Summing the right-hand inequality over i gives Z^t ≤ Z^{t+1} e^{εM}; combined with the left-hand inequality, this gives

\[ x^t_i = \frac{e^{\varepsilon P^t_i}}{Z^t} \ge e^{-2\varepsilon M}\, \frac{e^{\varepsilon P^{t+1}_i}}{Z^{t+1}} = e^{-2\varepsilon M}\, x^{t+1}_i \qquad \forall i \le n \]

Finally, since e^{−s} ≥ 1 − s for s ≥ 0, we have that x^t_i ≥ (1 − 2εM) x^{t+1}_i
Proof of Lemma 2 (Stability)
Let λ = 2εM. First, if λ > 1, notice that the lemma is trivial, because then 4εM² > 2M, and the difference in payoff between two decisions can never be greater than 2M. Hence, WLOG, we may assume that λ ∈ [0, 1]. Let z^t ∈ R^n be the unique vector such that

\[ x^t = (1 - \lambda)\, x^{t+1} + \lambda\, z^t \]

Then we claim that z^t ∈ ∆_n. The fact that z^t_i ≥ 0 follows directly from the bound x^t_i ≥ (1 − λ) x^{t+1}_i above. The fact that ∑_i z^t_i = 1 follows from ∑_i x^t_i = 1 and ∑_i x^{t+1}_i = 1: summing the defining equation gives 1 = (1 − λ) + λ ∑_i z^t_i
Proof of Lemma 2 (Stability)
Finally,

\[ x^t \cdot p^t - x^{t+1} \cdot p^t = -\lambda\, x^{t+1} \cdot p^t + \lambda\, z^t \cdot p^t \]

Since y · p^t ∈ [−M, M] for all y ∈ ∆_n, the magnitude of the above quantity is at most 2λM = 4εM², as required by the lemma
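The stability bound can also be checked empirically: run the exponential-weights update against random payoffs in [−M, M]^n and confirm that consecutive decisions never shift the payoff by more than 4εM². A sketch (ours, with hypothetical parameter values):

```python
import math
import random

def exp_weights(P, eps):
    """x_i proportional to exp(eps * P_i) over cumulative payoffs P."""
    w = [math.exp(eps * Pi) for Pi in P]
    Z = sum(w)
    return [wi / Z for wi in w]

random.seed(1)
eps, M, n = 0.05, 1.0, 4
P = [0.0] * n                                      # cumulative payoffs P^t_i
for t in range(200):
    p = [random.uniform(-M, M) for _ in range(n)]  # adversarial payoff in [-M, M]^n
    x_t = exp_weights(P, eps)                      # x^t, before seeing p^t
    P = [Pi + pi for Pi, pi in zip(P, p)]
    x_next = exp_weights(P, eps)                   # x^{t+1}
    drift = abs(sum(pi * (a - b) for pi, a, b in zip(p, x_next, x_t)))
    assert drift <= 4 * eps * M * M + 1e-12        # Lemma 2's bound: 4*eps*M^2
```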
Regret Bound
Theorem: For any n, T ≥ 1, M ≥ 0, and for any p^1, p^2, ..., p^T ∈ [−M, M]^n, the weighted-majority algorithm achieves

\[ \mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} = 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} \]

Proof
The stability-regularization lemma, combined with Lemma 1, implies that

\[ \mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} \left| x^{t+1} \cdot p^t - x^t \cdot p^t \right| + \frac{1}{T} \max_{x, x' \in \Delta_n} \left| \frac{1}{\varepsilon} H(x) - \frac{1}{\varepsilon} H(x') \right| \]
Regret Bound
Since we have shown 0 ≤ H(x) ≤ log(n), we have

\[ \frac{1}{T} \max_{x, x' \in \Delta_n} \left| \frac{1}{\varepsilon} H(x) - \frac{1}{\varepsilon} H(x') \right| \le \frac{\log n}{T\varepsilon} \]

Lemma 2 bounds the “stability” term. Putting these together gives

\[ \mathrm{regret} \le \frac{1}{T} \sum_{t=1}^{T} 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} = 4\varepsilon M^2 + \frac{\log n}{T\varepsilon} \]
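The slides leave ε free. Minimizing 4εM² + log(n)/(Tε) over ε (a step not on the slides, but immediate from calculus: set the derivative to zero) gives ε* = √(log n / (4M²T)), at which the two terms balance and the regret is at most 4M√(log n / T). A quick numeric confirmation:

```python
import math

def bound(eps, M, n, T):
    """The slide's regret bound 4*eps*M^2 + log(n)/(T*eps)."""
    return 4 * eps * M * M + math.log(n) / (T * eps)

M, n, T = 1.0, 10, 1000
eps_star = math.sqrt(math.log(n) / (4 * M * M * T))  # balances the two terms
optimal = 4 * M * math.sqrt(math.log(n) / T)         # resulting O(sqrt(log n / T)) regret
assert abs(bound(eps_star, M, n, T) - optimal) < 1e-9
for eps in (eps_star / 2, eps_star * 2):             # nearby choices are worse
    assert bound(eps, M, n, T) >= optimal
```

Note that the balanced choice of ε requires knowing M, n, and the horizon T in advance.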
Applications
The framework is quite general and can be applied to a number of learning problems
Take a decision space ∆, a space of outcomes Ω, and a bounded loss function λ : ∆ × Ω → [0, 1]
At each time step t, the learning algorithm selects a decision δ^t from ∆ and receives an outcome ω^t from Ω, suffering loss λ(δ^t, ω^t). In general, we allow the learner to select a distribution D^t over the space of decisions, suffering the expected loss of a decision randomly selected according to D^t
Expected loss: Λ(D^t, ω^t) = E_{δ∼D^t}[λ(δ, ω^t)]
Assuming the learner has access to a set of N experts, we can decide on an appropriate distribution D^t
At each time step t, each expert i produces its own distribution E^t_i on ∆ and suffers loss Λ(E^t_i, ω^t)
The learner needs to combine the distributions produced by the experts so as to suffer an expected loss “not much worse” than that of the best expert
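One natural way (the paper's construction, up to notation) to combine the experts is to run Hedge over the experts themselves, using l^t_i = Λ(E^t_i, ω^t) as expert i's loss, and to play the mixture D^t = ∑_i p^t_i E^t_i. By linearity of expectation the mixture's loss equals the Hedge mixture loss, so the regret bound carries over. A sketch for a finite decision space (names ours):

```python
def combine(expert_dists, weights):
    """Learner's mixture D^t = sum_i p^t_i * E^t_i over a finite decision space.

    expert_dists: one distribution per expert, all over the same decisions.
    weights: Hedge's current distribution p^t over the experts.
    The result is itself a distribution, and its expected loss is the
    weighted average of the experts' expected losses.
    """
    k = len(expert_dists[0])
    return [sum(p * E[j] for p, E in zip(weights, expert_dists)) for j in range(k)]
```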
Applications
Framework quite general and can be applied to a number oflearning problemsTake decision space ∆, a space of outcomes Ω, and abounded loss function λ : ∆× Ω→ [0, 1]At each time step t, learning algorithm selects δt from ∆ andreceived an outcome ωt from Ω. Suffers loss λ(δt , ωt). Ingeneral, we allow the learner to select a distribution Dt overthe space of decisions. Suffers expected loss of a decisionrandomly selected according to Dt .Expected loss Λ(Dt , ωt) = Eδ∼D [λ(δ, ω)]Assuming that learner has access to a set of N experts, wecan decide an appropriate distribution Dt .At each time step t, each expert i will produce its owndistribution E t
i on ∆. Suffers loss Λ(E ti , ω
t)
Learner needs to combine the distributions produced by theexperts in order to suffer an expect loss “not much worse”than that of the expert
Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Applications
Framework quite general and can be applied to a number oflearning problemsTake decision space ∆, a space of outcomes Ω, and abounded loss function λ : ∆× Ω→ [0, 1]At each time step t, learning algorithm selects δt from ∆ andreceived an outcome ωt from Ω. Suffers loss λ(δt , ωt). Ingeneral, we allow the learner to select a distribution Dt overthe space of decisions. Suffers expected loss of a decisionrandomly selected according to Dt .Expected loss Λ(Dt , ωt) = Eδ∼D [λ(δ, ω)]Assuming that learner has access to a set of N experts, wecan decide an appropriate distribution Dt .At each time step t, each expert i will produce its owndistribution E t
i on ∆. Suffers loss Λ(E ti , ω
t)Learner needs to combine the distributions produced by theexperts in order to suffer an expect loss “not much worse”than that of the expert
Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Applications
Framework quite general and can be applied to a number oflearning problemsTake decision space ∆, a space of outcomes Ω, and abounded loss function λ : ∆× Ω→ [0, 1]At each time step t, learning algorithm selects δt from ∆ andreceived an outcome ωt from Ω. Suffers loss λ(δt , ωt). Ingeneral, we allow the learner to select a distribution Dt overthe space of decisions. Suffers expected loss of a decisionrandomly selected according to Dt .Expected loss Λ(Dt , ωt) = Eδ∼D [λ(δ, ω)]Assuming that learner has access to a set of N experts, wecan decide an appropriate distribution Dt .At each time step t, each expert i will produce its owndistribution E t
i on ∆. Suffers loss Λ(E ti , ω
t)Learner needs to combine the distributions produced by theexperts in order to suffer an expect loss “not much worse”than that of the expert
Sachin Gupta and Nitish Lakhanpal A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
Applications
Run the algorithm Hedge(β), treating each expert as a strategy
Hedge(β) produces a distribution p_t on the set of experts that is used to construct the mixture distribution D_t = Σ_{i=1}^N p_i^t E_i^t
For any outcome ω_t, the loss suffered by Hedge(β) is then Λ(D_t, ω_t) = Σ_{i=1}^N p_i^t Λ(E_i^t, ω_t)
If we define l_i^t = Λ(E_i^t, ω_t), then the loss suffered by the learner is p_t · l_t
It follows that for any loss function λ, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge(β), used as described above, is at most
Σ_{t=1}^T Λ(D_t, ω_t) ≤ min_i Σ_{t=1}^T Λ(E_i^t, ω_t) + √(2L ln N) + ln N
where L ≤ T is an assumed bound on the expected loss of the best expert and β = g(L/ln N)
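A minimal sketch of this allocation loop (my own illustration of the multiplicative update w_i ← w_i · β^{l_i}; the list-of-lists input format is an assumption):

```python
def hedge(losses, beta):
    """Hedge(beta): keep one weight per expert and shrink each weight
    multiplicatively by beta**loss after every trial.

    losses: T rounds, each a list of N expert losses l_i^t in [0, 1]
    beta:   learning-rate parameter in (0, 1)
    Returns the learner's cumulative expected loss, sum_t p_t . l_t.
    """
    n = len(losses[0])
    w = [1.0] * n                                    # uniform initial weights
    total = 0.0
    for l in losses:
        s = sum(w)
        p = [wi / s for wi in w]                     # distribution p_t over experts
        total += sum(pi * li for pi, li in zip(p, l))  # expected loss p_t . l_t
        w = [wi * beta ** li for wi, li in zip(w, l)]  # multiplicative update
    return total
```

With one perfect expert among three (loss 0 every round) and β = 1/2, the learner's cumulative loss stays below the bound ln(3)/(1 − β) ≈ 2.2 no matter how many rounds are played.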
Rock, Paper, Scissors
The loss function λ can represent an arbitrary matrix game, such as Rock, Paper, Scissors
Here ∆ = Ω = {R, P, S}
Loss function:

             ω
          R    P    S
   δ  R  1/2   1    0
      P   0   1/2   1
      S   1    0   1/2

δ represents the learner's play; ω is the adversary's play
λ(δ, ω) is the learner's loss: 1 if he loses, 0 if he wins, 1/2 if he ties
The cumulative loss is simply the expected number of losses in a series of rounds of game play
The results of the study show that the paper's algorithm achieves expected loss nearly identical to that of the best expert
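The matrix above translates directly into a lookup table; this small sketch (my own illustration, not code from the paper) accumulates the learner's loss over a series of rounds:

```python
# Rock, Paper, Scissors loss matrix from the learner's perspective:
# loss is 1 if the learner loses, 0 if he wins, 1/2 on a tie.
LOSS = {
    ('R', 'R'): 0.5, ('R', 'P'): 1.0, ('R', 'S'): 0.0,
    ('P', 'R'): 0.0, ('P', 'P'): 0.5, ('P', 'S'): 1.0,
    ('S', 'R'): 1.0, ('S', 'P'): 0.0, ('S', 'S'): 0.5,
}

def cumulative_loss(plays):
    """Total loss over a sequence of (delta, omega) = (learner, adversary) rounds."""
    return sum(LOSS[d, w] for d, w in plays)
```

For instance, playing R against S, P, R in turn gives losses 0 + 1 + 1/2 = 1.5.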
Boosting
Bias-Variance Tradeoff
Expected error of learner = variance + squared bias = sensitivity of prediction + systematic error
Low bias → high variance
Example: 1st-order curves vs. 9th-order curves
Boosting
Can we reduce variance without increasing bias?
Use committees
Bias
The committee's performance is at least as good as the average performance of its members
Variance
Spurious patterns outvoted
Boosting
Intuition
Break the training set up into samples focusing on the "hardest" parts of the space
This forces the weak learners to generate new hypotheses → fewer mistakes are made on these parts
Boosting
Strong Learning
For a given ε, δ > 0 and access to random examples, outputs a hypothesis with error at most ε with probability 1 − δ.
A strong learner is a classifier that is arbitrarily well correlated with the true classification.
Weak Learning
Same as above, except that ε ≥ 1/2 − γ, where γ > 0 or decreases as 1/p for a polynomial p
Intuitively, a weak learner is a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing)
Boosting
The Algorithm
Given a distribution over the training set, calculate weights over the examples
Compute a normalized distribution
Feed the normalized distribution to a weak learning algorithm
Generate a new set of weights
Output a hypothesis combining the outputs of the weak hypotheses by a weighted majority vote
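The final step, the weighted majority vote, can be sketched as follows. In AdaBoost each round's vote is weighted by log(1/β_t), so rounds with lower error count more (the function name and list-based interface here are my own choices):

```python
import math

def final_hypothesis(betas, weak_outputs):
    """Combine T weak hypotheses on one example by a weighted majority vote.

    betas:        the beta_t computed in each boosting round, 0 < beta_t < 1
    weak_outputs: h_t(x) in {0, 1} for a single example x, one entry per round
    Output 1 iff the log(1/beta_t)-weighted vote reaches half the total weight.
    """
    weights = [math.log(1.0 / b) for b in betas]
    vote = sum(a * h for a, h in zip(weights, weak_outputs))
    return 1 if vote >= 0.5 * sum(weights) else 0
```

With three rounds of equal β_t, two out of three weak hypotheses voting 1 yields output 1, while only one vote yields 0.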
Boosting
Rules for generating a new set of weights
Calculate the error of the weak hypothesis
ε_t = Σ_{i=1}^N p_i^t |h_t(x_i) − y_i|
Set β_t = ε_t / (1 − ε_t)
Calculate the new weights as:
w_i^{t+1} = w_i^t · β_t^{1 − |h_t(x_i) − y_i|}
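One round of these reweighting rules can be sketched directly from the formulas (the function name and list-based interface are my own; the arithmetic follows the slide):

```python
def adaboost_reweight(w, preds, labels):
    """One round of AdaBoost's reweighting.

    w:      current weights w_i^t over the N training examples
    preds:  weak hypothesis outputs h_t(x_i) in [0, 1]
    labels: true labels y_i in {0, 1}
    Returns (eps_t, beta_t, new weights w_i^{t+1}).
    """
    total = sum(w)
    p = [wi / total for wi in w]                  # normalized distribution p_i^t
    eps = sum(pi * abs(h - y) for pi, h, y in zip(p, preds, labels))
    beta = eps / (1.0 - eps)                      # beta_t = eps_t / (1 - eps_t)
    # Correctly predicted examples shrink by a full factor of beta_t;
    # badly predicted examples keep (most of) their weight.
    new_w = [wi * beta ** (1.0 - abs(h - y))
             for wi, h, y in zip(w, preds, labels)]
    return eps, beta, new_w
```

For example, with uniform weights over four examples and one mistake, ε_t = 1/4, β_t = 1/3, the three correct examples are down-weighted to 1/3 and the misclassified one keeps weight 1.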
Relationship Between Boosting and Weighted Majority
Dual relationship between Hedge and AdaBoost
“There is a direct mapping or reduction of the boosting problem to the on-line allocation problem”
The examples used in AdaBoost can be treated as strategies and the weak hypotheses as trials, giving a rehashing of AdaBoost in the terminology of Hedge
We also need to reverse the loss function, because in AdaBoost our loss is large if we encounter a bad prediction, while in Hedge it is large if we encounter a good prediction
Loss is kept large for bad predictions in AdaBoost in order to focus on the "hard" examples
Boosting — Critique
Not enough data
Can’t produce something out of nothing
Base learner too weak
Committee may not be able to come up with a good hypothesis
Base learner too strong
Individual learner may overfit on its own
Susceptibility to noisy data
Works too hard to find patterns in outliers
Boosting — A Bayesian Perspective
In a Bayesian framework with examples generated from some distribution, we are given a set of binary hypotheses $h_i$, $i \in \{1, \ldots, T\}$, with the goal of combining them in an optimal way. We should predict the label $y$ with the highest likelihood.
Predict 1 if $\Pr[y = 1 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 0 \mid h_1(x), \ldots, h_T(x)]$
And predict 0 if $\Pr[y = 0 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 1 \mid h_1(x), \ldots, h_T(x)]$
Assuming that the event $h_t(x) \ne y$ is conditionally independent of the actual label $y$ and of the predictions of the other hypotheses, we can use Bayes' rule to rewrite our selection criterion as:
Predict 1 if
\[
\Pr[y = 1] \prod_{t : h_t(x) = 0} \varepsilon_t \prod_{t : h_t(x) = 1} (1 - \varepsilon_t)
\;>\;
\Pr[y = 0] \prod_{t : h_t(x) = 0} (1 - \varepsilon_t) \prod_{t : h_t(x) = 1} \varepsilon_t
\]
and 0 otherwise, where $\varepsilon_t = \Pr[h_t(x) \ne y]$
Adding in the trivial hypothesis $h_0$, which always predicts 1, replacing $\Pr[y = 0]$ with $\varepsilon_0$, and taking the log of both sides yields a rule identical to the combination rule generated by AdaBoost
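A small numeric check of this equivalence (the error rates, priors, and predictions are assumed values for illustration): the product-form criterion agrees with a weighted vote whose weights are $\log\frac{1-\varepsilon_t}{\varepsilon_t} = \log\frac{1}{\beta_t}$, as in AdaBoost's combination rule.

```python
import math

eps = [0.1, 0.3, 0.4]          # assumed eps_t = Pr[h_t(x) != y]
prior1, prior0 = 0.5, 0.5      # Pr[y = 1], Pr[y = 0]
preds = [1, 0, 1]              # h_1(x), ..., h_T(x) for some input x

# Left/right sides of the product-form criterion.
lhs, rhs = prior1, prior0
for e, h in zip(eps, preds):
    lhs *= e if h == 0 else (1 - e)
    rhs *= (1 - e) if h == 0 else e

# Equivalent log-domain vote: predict 1 iff the weighted vote for 1 wins.
vote = math.log(prior1 / prior0)
for e, h in zip(eps, preds):
    w = math.log((1 - e) / e)          # = log(1/beta_t)
    vote += w if h == 1 else -w

assert (lhs > rhs) == (vote > 0)       # the two criteria agree
print("predict", 1 if vote > 0 else 0)
```

Taking logs turns each product ratio $\frac{1-\varepsilon_t}{\varepsilon_t}$ into an additive vote, which is why the two rules must always agree.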
Multiclass Extensions of AdaBoost
We modify our setup so that the set of labels is now $Y = \{1, 2, \ldots, k\}$
Correspondingly, our weak learners return hypotheses $h_t : X \to Y$
We still start with a set of weights over examples and get a hypothesis based on this set from the weak learner
We calculate the error of the hypothesis as
\[
\varepsilon_t = \sum_{i=1}^{N} p_i^t \, [[h_t(x_i) \ne y_i]]
\]
where $[[z]]$ is 1 if $z$ holds and 0 otherwise
Let $\beta_t = \dfrac{\varepsilon_t}{1 - \varepsilon_t}$
The weights are updated as $w_i^{t+1} = w_i^t \, \beta_t^{\,1 - [[h_t(x_i) \ne y_i]]}$
The final hypothesis is
\[
h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left( \log \frac{1}{\beta_t} \right) [[h_t(x) = y]]
\]
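The update above can be sketched end to end in a few lines (AdaBoost.M1-style). The toy data set and the pool of threshold rules used as the weak learner are assumptions made for illustration, not part of the paper.

```python
import numpy as np

X = np.array([0.1, 0.4, 0.6, 0.9, 0.2, 0.8])
y = np.array([0, 0, 1, 2, 0, 2])           # labels from Y = {0, 1, 2}
K, T = 3, 5

# Weak learner: the threshold rule with lowest weighted error from a pool.
def weak_learn(p):
    best = None
    for thr in np.unique(X):
        for lo in range(K):
            for hi in range(K):
                h = np.where(X < thr, lo, hi)
                err = p[h != y].sum()
                if best is None or err < best[0]:
                    best = (err, h)
    return best

w = np.ones(len(X))
hyps, betas = [], []
for t in range(T):
    p = w / w.sum()                        # p_i^t
    eps, h = weak_learn(p)
    if eps >= 0.5:                         # M1 requires eps_t < 1/2
        break
    beta = eps / (1 - eps)
    w = w * beta ** (1 - (h != y))         # w_i^{t+1} = w_i^t * beta_t^(1-[[h_t(x_i)!=y_i]])
    hyps.append(h)
    betas.append(max(beta, 1e-12))         # guard so log(1/beta) stays finite

# Final hypothesis: weighted plurality vote with weights log(1/beta_t).
votes = np.zeros((len(X), K))
for h, b in zip(hyps, betas):
    for i, lab in enumerate(h):
        votes[i, lab] += np.log(1 / b)
pred = votes.argmax(axis=1)
print(pred)
```

Note that each weak rule here only separates two groups of labels, yet the log-weighted vote combines several such rules into a full multiclass hypothesis.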
Notes on the Multiclass Extensions of AdaBoost
Has no way of forcing the weak learner to discriminate between labels that are especially hard to distinguish
More sophisticated extensions exist, some of which are discussed in the paper
Can also approach the problem by converting it into many binary problems, boosting on each of those, and then patching the results together
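The binary-reduction approach in the last bullet can be sketched as one-vs-all: train one binary booster per label (here, binary AdaBoost over threshold stumps) and patch the results together by picking the label whose booster returns the highest real-valued score. The toy 1-D data set is an assumption for illustration.

```python
import numpy as np

def boost_binary(X, z, T=10):
    """Binary AdaBoost over threshold stumps; returns a real-valued scorer."""
    w = np.ones(len(X)) / len(X)
    stumps = []
    for _ in range(T):
        best = None
        for thr in np.unique(X):
            for s in (1, -1):                  # stump: -s if x < thr else s
                h = np.where(X < thr, -s, s)
                err = w[h != z].sum()
                if best is None or err < best[0]:
                    best = (err, thr, s)
        err, thr, s = best
        err = min(max(err, 1e-9), 1 - 1e-9)    # clamp to keep alpha finite
        alpha = 0.5 * np.log((1 - err) / err)
        h = np.where(X < thr, -s, s)
        w = w * np.exp(-alpha * z * h)         # reweight toward mistakes
        w /= w.sum()
        stumps.append((alpha, thr, s))
    return lambda x: sum(a * (-s if x < thr else s) for a, thr, s in stumps)

def one_vs_all(X, y, labels):
    # One binary problem per label: "this label" (+1) versus the rest (-1).
    scorers = {lab: boost_binary(X, np.where(y == lab, 1.0, -1.0))
               for lab in labels}
    return lambda x: max(labels, key=lambda lab: scorers[lab](x))

X = np.array([-2.0, -1.5, 0.1, 0.2, 1.8, 2.2])
y = np.array([0, 0, 1, 1, 2, 2])
predict = one_vs_all(X, y, [0, 1, 2])
preds = [predict(x) for x in X]
print(preds)
```

Comparing real-valued scores rather than hard ±1 outputs is what makes the patched-together predictor coherent when several binary boosters (or none) claim an input.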