Online Learning, Games, Boosting
Arindam Banerjee
Predicting with Expert Advice
Basic setting: n experts, boolean prediction
Prediction proceeds in trials
Each expert makes a boolean prediction
A "master algorithm" combines the predictions
The true label is revealed
No explicit notion of input
Master algorithms with “good” cumulative/relative loss
The Halving Algorithm
If one of the experts is known to always be correct:
Always vote based on the majority
If the master is wrong, get rid of the experts that were wrong
Proposition: The master makes at most log2 n mistakes
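The elimination scheme above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides; the function name and data layout are assumptions:

```python
def halving(predictions_per_trial, labels):
    """Halving algorithm: vote with the majority of surviving experts;
    when the master errs, eliminate every expert that was wrong."""
    n = len(predictions_per_trial[0])
    alive = set(range(n))            # experts not yet eliminated
    mistakes = 0
    for preds, label in zip(predictions_per_trial, labels):
        votes_for_1 = sum(1 for i in alive if preds[i] == 1)
        guess = 1 if votes_for_1 >= len(alive) - votes_for_1 else 0
        if guess != label:
            mistakes += 1
            # Keep only the experts that predicted the true label
            alive = {i for i in alive if preds[i] == label}
    return mistakes
```

Each mistake at least halves the number of surviving experts, which is where the log_2 n bound comes from.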
Weighted Majority (Simple)
Initialize the weights w_i = 1 for i = 1, …, n
Each expert makes a boolean prediction z_i ∈ {0, 1} for i = 1, …, n
The master predicts 1 if
    Σ_{i: z_i = 1} w_i ≥ Σ_{i: z_i = 0} w_i
and 0 otherwise
The true label ℓ is received
Modify the weights as follows: if z_i = ℓ, no change; if z_i ≠ ℓ, w_i ← w_i/2, for i = 1, …, n
Theorem: The master makes at most 2.41(m + log_2 n) mistakes, where m is the number of mistakes by the best expert so far
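A minimal sketch of this update rule in Python (names are illustrative; ties in the weighted vote are broken toward 1, matching the inequality above):

```python
def weighted_majority(predictions_per_trial, labels):
    """Deterministic Weighted Majority with the halving penalty."""
    n = len(predictions_per_trial[0])
    w = [1.0] * n
    mistakes = 0
    for preds, label in zip(predictions_per_trial, labels):
        weight_for_1 = sum(w[i] for i in range(n) if preds[i] == 1)
        weight_for_0 = sum(w[i] for i in range(n) if preds[i] == 0)
        guess = 1 if weight_for_1 >= weight_for_0 else 0
        if guess != label:
            mistakes += 1
        # Halve the weight of every expert that was wrong
        for i in range(n):
            if preds[i] != label:
                w[i] /= 2.0
    return mistakes
```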
Weighted Majority (Randomized)
Initialize the weights w_i = 1 for i = 1, …, n
Each expert makes a boolean prediction z_i ∈ {0, 1} for i = 1, …, n
The master predicts z_i with probability w_i/W, where W = Σ_i w_i
The true label ℓ is received
Modify the weights as follows: if z_i = ℓ, no change; if z_i ≠ ℓ, w_i ← βw_i, for i = 1, …, n, where β ∈ [0, 1]
Theorem: The expected number of mistakes the master makes is at most
    (m ln(1/β) + ln n) / (1 − β)
where m is the number of mistakes by the best expert so far
Corollary: If β is adjusted dynamically, or if the total number of trials is known upfront, the expected number of mistakes is at most m + ln n + O(√(m ln n))
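The randomized master can be sketched as follows; the sampling scheme and all names are assumptions for illustration only:

```python
import random

def randomized_wm(predictions_per_trial, labels, beta=0.5, seed=0):
    """Follow expert i with probability w_i / W; penalize wrong experts by beta."""
    rng = random.Random(seed)
    n = len(predictions_per_trial[0])
    w = [1.0] * n
    mistakes = 0
    for preds, label in zip(predictions_per_trial, labels):
        # Sample an expert in proportion to its weight and copy its prediction
        r, acc, guess = rng.random() * sum(w), 0.0, preds[-1]
        for i in range(n):
            acc += w[i]
            if r < acc:
                guess = preds[i]
                break
        if guess != label:
            mistakes += 1
        for i in range(n):
            if preds[i] != label:
                w[i] *= beta       # multiplicative penalty, beta in [0, 1]
    return mistakes
```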
Extensions
The range of predictions is more general, i.e., not necessarily boolean
An explicit notion of input exists
The adversary is not arbitrary, e.g., it chooses labels from a function, from a fixed distribution, etc.
Labels from a function with random noise
Learning (boolean) functions
Realizable learning
Agnostic learning
Random Noise
Game Theory
Rock Paper Scissors loss matrix M for the row player:

              Rock   Paper   Scissors
    Rock      1/2    1       0
    Paper     0      1/2     1
    Scissors  1      0       1/2

The actual game has payoffs 2(M − 1/2)
Pure strategies: Rock, Paper, Scissors
Randomized play
Row player chooses distribution P
Column player chooses distribution Q
The expected loss for the row player
    M(P, Q) = P^T M Q = Σ_{i,j} P(i) Q(j) M(i, j)
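A small check of the expected-loss formula for the rock-paper-scissors matrix above (the function name is illustrative):

```python
# Loss matrix M for the row player; rows and columns: Rock, Paper, Scissors
M = [[0.5, 1.0, 0.0],
     [0.0, 0.5, 1.0],
     [1.0, 0.0, 0.5]]

def expected_loss(P, Q):
    """M(P, Q) = sum over i, j of P(i) Q(j) M(i, j)."""
    return sum(P[i] * Q[j] * M[i][j] for i in range(3) for j in range(3))
```

Playing the uniform distribution guarantees the row player an expected loss of exactly 1/2 against any column strategy, which is the value of this game.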
Sequential Play
Row player chooses a (randomized) strategy P
Given P, the column player chooses a strategy Q such that the loss of the row player is maximized, i.e., max_Q M(P, Q)
Row player chooses P upfront to minimize the maximum loss:
    min_P max_Q M(P, Q)
If the column player chooses first, then the loss of the row player is
    max_Q min_P M(P, Q)
"Obviously" max_Q min_P M(P, Q) ≤ min_P max_Q M(P, Q)
Von Neumann says the following:
    max_Q min_P M(P, Q) = min_P max_Q M(P, Q)
Repeated Play
Row is the learner, Column is the adversary
The game is played repeatedly for t = 1, …, T
On round t, learner chooses P_t
Adversary chooses Q_t
Learner gets losses M(i, Q_t) for individual strategies
Learner suffers loss M(P_t, Q_t)
Learner's algorithm (Hedging)
Initialize the weights w_1(i) = 1 for i = 1, …, n
At round t, choose
    P_t(i) = w_t(i) / Σ_i w_t(i)
Upon receiving loss M(i, Q_t), update
    w_{t+1}(i) = w_t(i) β^{M(i, Q_t)}
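The Hedging algorithm above, sketched against a fixed sequence of loss vectors (losses assumed to lie in [0, 1]; names are illustrative):

```python
def hedge(loss_rows, beta=0.9):
    """Play Hedge against a fixed loss sequence.

    loss_rows[t][i] is the loss M(i, Q_t) of strategy i at round t.
    Returns (total expected loss, final weights).
    """
    n = len(loss_rows[0])
    w = [1.0] * n
    total = 0.0
    for losses in loss_rows:
        W = sum(w)
        P = [wi / W for wi in w]                          # P_t(i) = w_t(i) / W
        total += sum(P[i] * losses[i] for i in range(n))  # M(P_t, Q_t)
        w = [w[i] * beta ** losses[i] for i in range(n)]  # w_{t+1}(i)
    return total, w
```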
Main Results
Theorem: The sequence of strategies P_1, …, P_T chosen by the algorithm satisfies
    Σ_{t=1}^T M(P_t, Q_t) ≤ a_β min_P Σ_{t=1}^T M(P, Q_t) + c_β ln n
where a_β = ln(1/β) / (1 − β) and c_β = 1 / (1 − β)
Corollary: With β = 1 / (1 + √(2 ln n / T)), the average per-trial loss satisfies
    (1/T) Σ_{t=1}^T M(P_t, Q_t) ≤ min_P (1/T) Σ_{t=1}^T M(P, Q_t) + ∆_T
where ∆_T = O(√(ln n / T))
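As a quick numeric sanity check of the Corollary (an illustration, not from the slides): the tuned β approaches 1 and the per-trial regret term √(ln n / T) vanishes as T grows.

```python
import math

def tuned_beta(n, T):
    """beta = 1 / (1 + sqrt(2 ln n / T)), the tuning from the Corollary."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / T))

def regret_rate(n, T):
    """The sqrt(ln n / T) term inside Delta_T = O(sqrt(ln n / T))."""
    return math.sqrt(math.log(n) / T)
```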
Learning in Games
Corollary: If v is the value of the game,
    (1/T) Σ_{t=1}^T M(P_t, Q_t) ≤ v + ∆_T
A (new) proof of the minimax theorem:
    min_P max_Q P^T M Q ≤ max_Q (1/T) Σ_t P_t^T M Q
        ≤ (1/T) Σ_t max_Q P_t^T M Q = (1/T) Σ_t P_t^T M Q_t
        ≤ min_P (1/T) Σ_t P^T M Q_t + ∆_T
        ≤ max_Q min_P P^T M Q + ∆_T
Since ∆_T → 0 as T → ∞, this gives min_P max_Q P^T M Q ≤ max_Q min_P P^T M Q; the reverse inequality is the "obvious" one
Weak Learning
A weak learner predicts slightly better than random
The PAC setting (basics)
Let X be an instance space, c : X → {0, 1} be a target concept, H be a hypothesis space (h : X → {0, 1})
Let Q be a fixed but unknown distribution on X
An algorithm, after training on (x_i, c(x_i)) for i = 1, …, m, selects h ∈ H such that
    P_{x∼Q}[h(x) ≠ c(x)] ≤ 1/2 − γ
The algorithm is called a γ-weak learner
We assume the existence of such a learner
The Boosting Model
Boosting converts a weak learner to a strong learner
Boosting proceeds in rounds
Booster constructs a distribution D_t on X, the training set
Weak learner produces a hypothesis h_t ∈ H so that P_{x∼D_t}[h_t(x) ≠ c(x)] ≤ 1/2 − γ
After T rounds, the weak hypotheses h_t for t = 1, …, T are combined into a final hypothesis h_fin
We need procedures
for obtaining D_t at each step
for combining the weak hypotheses
Minimax and Majority Votes
Consider the mistake matrix M(h, x) = 1 if h(x) ≠ c(x) and 0 otherwise
The row player can choose P over H
The column player can choose Q over X
Now min_P max_x M(P, x) = v = max_Q min_h M(h, Q)
However M(h, Q) = P_{x∼Q}[h(x) ≠ c(x)]
Now ∃Q*, M(h, Q*) = P_{x∼Q*}[h(x) ≠ c(x)] ≥ v, ∀h
From weak learning, ∃h, P_{x∼Q*}[h(x) ≠ c(x)] ≤ 1/2 − γ
Hence v ≤ 1/2 − γ
Also ∃P* such that ∀x,
    M(P*, x) = P_{h∼P*}[h(x) ≠ c(x)] ≤ v ≤ 1/2 − γ < 1/2
Majority voting according to P* is equivalent to c
Putting it together
Boosting maintains a distribution over the data
The minimax argument requires a distribution over hypotheses
Change the mistake matrix to M′ = 1 − M^T so that M′(x, h) = 1 − M(h, x) = 1 if h(x) = c(x) and 0 otherwise
On round t of boosting
Let D_t = P_t from hedging over the rows (data points)
Weak learner returns h_t with P_{x∼D_t}[h_t(x) = c(x)] ≥ 1/2 + γ
Update the weights using Q_t = h_t as the adversary's play
Let h_fin be the majority vote of h_t for t = 1, …, T
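A toy end-to-end sketch of this reduction: Hedge maintains the distribution over training examples, the "weak learner" is a one-coordinate stump chosen to minimize weighted error, and the final hypothesis is a majority vote. Everything here (names, the stump learner, the dataset layout) is illustrative, not from the slides.

```python
def best_stump(xs, labels, D):
    """Weak learner: the single-coordinate stump with smallest weighted error."""
    best, best_err = None, 2.0
    for j in range(len(xs[0])):
        for flip in (0, 1):
            pred = lambda x, j=j, flip=flip: x[j] ^ flip
            err = sum(D[i] for i in range(len(xs)) if pred(xs[i]) != labels[i])
            if err < best_err:
                best, best_err = pred, err
    return best

def boost(xs, labels, weak_learner, T=20, beta=0.7):
    """Boosting via Hedge on M'(x, h) = 1 if h(x) = c(x), else 0."""
    m = len(xs)
    w = [1.0] * m
    hyps = []
    for _ in range(T):
        W = sum(w)
        D = [wi / W for wi in w]           # D_t = P_t from Hedge over examples
        h = weak_learner(xs, labels, D)    # h_t, accurate under D_t
        hyps.append(h)
        # Hedge update with Q_t = h_t: correctly classified examples lose
        # weight, so hard examples get relatively more mass next round
        w = [w[i] * (beta if h(xs[i]) == labels[i] else 1.0) for i in range(m)]
    def h_fin(x):                          # majority vote of h_1, ..., h_T
        return 1 if 2 * sum(h(x) for h in hyps) > len(hyps) else 0
    return h_fin
```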
Analysis of Boosting
We have M′(P_t, h_t) = P_{x∼P_t}[h_t(x) = c(x)] ≥ 1/2 + γ
Hence, from the Corollary,
    1/2 + γ ≤ (1/T) Σ_t M′(P_t, h_t) ≤ min_x (1/T) Σ_t M′(x, h_t) + ∆_T
Therefore, ∀x (and sufficiently large T),
    (1/T) Σ_t M′(x, h_t) ≥ 1/2 + γ − ∆_T > 1/2
Hence, the majority vote h_fin(x) = c(x), ∀x
But what happens on unseen data?