Online Learning and Regret - University of Edinburgh
TRANSCRIPT
Decision Making in Robots and Autonomous Agents
Online Learning and Regret
Subramanian Ramamoorthy, School of Informatics
3 March, 2015
Recap: Interpretation of MAB-Type Problems
Related to ‘rewards’
Recap: MAB as Special Case of Online Learning
Recap: How to Evaluate an Online Algorithm – Regret
• After you have played for T rounds, you experience a regret:
  Regret = [reward sum of the optimal strategy] − [sum of actually collected rewards]
• If the average regret per round goes to zero with probability 1, asymptotically, we say the strategy has the no-regret property ~ guaranteed to converge to an optimal strategy
• ε-greedy is sub-optimal (so has some regret).
ρ = T·μ* − Σ_{t=1}^{T} E[ r̂(t) ],   where μ* = max_k μ_k

(the expectation is over randomness in the draw of rewards & the player's strategy)
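To make the definition concrete, here is a small simulation sketch (the arm means, ε, and horizon are made-up illustration values) that estimates ρ for ε-greedy on a Bernoulli bandit; with a fixed ε the average per-round regret stays bounded away from zero:

```python
import random

def eps_greedy_regret(means, eps=0.1, T=10000, seed=0):
    """Estimate rho = T*mu_star - (sum of actually collected rewards)."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    total = 0.0
    for t in range(T):
        if t < K:
            arm = t                            # initialize: play each arm once
        elif rng.random() < eps:
            arm = rng.randrange(K)             # explore uniformly
        else:
            arm = max(range(K), key=lambda a: sums[a] / counts[a])  # exploit
        r = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward
        counts[arm] += 1
        sums[arm] += r
        total += r
    return T * max(means) - total              # rho as defined above

rho = eps_greedy_regret([0.3, 0.5, 0.7])
print(rho / 10000)  # average per-round regret
```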
Solving MAB: Interval Estimation
• Attribute to each arm an “optimistic initial estimate” within a certain confidence interval
• Greedily choose the arm with the highest optimistic mean (upper bound of the confidence interval)
• An infrequently observed arm will have an over-valued reward mean, leading to exploration
• Frequent usage pushes the optimistic estimate towards the true value
Interval Estimation Procedure
• Associate to each arm a 100(1 − α)% upper band on the reward mean
• Assume, e.g., that rewards are normally distributed
• An arm is observed n times to yield an empirical mean & std dev
• α-upper bound:
• If α is actively controlled, a zero-regret strategy is possible
  – For general distributions, we don’t know
u = μ̂ + (σ̂/√n)·c⁻¹(1 − α)

c(t) = (1/√(2π)) ∫_{−∞}^{t} exp(−x²/2) dx   (cumulative distribution function)
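A minimal sketch of the procedure, assuming normally distributed rewards and using Python's `statistics.NormalDist` for the inverse CDF c⁻¹ (the per-arm reward data are made-up illustration values):

```python
from statistics import NormalDist, mean, stdev

def ie_upper_bound(rewards, alpha=0.05):
    """alpha-upper bound: u = mu_hat + (sigma_hat / sqrt(n)) * c^{-1}(1 - alpha)."""
    n = len(rewards)
    mu_hat = mean(rewards)
    sigma_hat = stdev(rewards) if n > 1 else float("inf")  # barely-seen arms look great
    z = NormalDist().inv_cdf(1 - alpha)  # c^{-1}(1 - alpha), c = standard normal CDF
    return mu_hat + sigma_hat * z / n ** 0.5

# Greedy choice over the optimistic upper bounds:
arms = {"a": [0.4, 0.6, 0.5, 0.7], "b": [0.9, 0.1]}
best = max(arms, key=lambda k: ie_upper_bound(arms[k]))
```

Note how arm "b", observed less reliably (high variance, few samples), gets the wider interval and hence the more optimistic bound, which is exactly the exploration mechanism described above.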
Solving MAB: UCB Strategy
• Again, based on the notion of an upper confidence bound, but more generally applicable
• Algorithm:
  – Play each arm once
  – At time t > K, play the arm i_t maximizing
r̄_j(t) + √( 2 ln t / T_{j,t} )

T_{j,t} : number of times j has been played so far
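A sketch of the algorithm as stated (the Bernoulli arm means are made-up illustration values):

```python
import math, random

def ucb_run(means, T=5000, seed=1):
    """UCB: play each arm once, then pick argmax of r_bar_j + sqrt(2 ln t / T_j)."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                      # play each arm once
        else:
            arm = max(range(K), key=lambda j:
                      sums[j] / counts[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += r
    return counts

counts = ucb_run([0.2, 0.5, 0.8])
# the best arm (index 2) ends up played far more often than the others
```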
UCB Strategy
Reminder: Chernoff-Hoeffding Bound
UCB Strategy – Behaviour
We will not try to prove the following result but I quote (only FYI) the final result to tell you why UCB may be a desirable strategy – regret is bounded.
K = number of arms
Variation on SoftMax:
• It is possible to drive regret down by annealing τ
• Exp3: Exponential-weight algorithm for exploration and exploitation
• The probability of choosing arm k (of K) at time t is
P(a) = e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}

Exp3:

P_k(t) = (1 − γ)·w_k(t) / Σ_{j=1}^{K} w_j(t) + γ/K

w_j(t+1) = w_j(t)·exp( γ·r_j(t) / (P_j(t)·K) )   if arm j is pulled at t
w_j(t+1) = w_j(t)                                otherwise

Regret ≈ O(√(T·K·log K))

γ is a user-defined open parameter
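A sketch of Exp3 following the updates above (the reward function and the value of γ are made-up illustration values; rewards are assumed to lie in [0, 1]):

```python
import math, random

def exp3(reward_fn, K, T, gamma=0.1, seed=0):
    """Exp3: exponential weights mixed with uniform exploration at rate gamma."""
    rng = random.Random(seed)
    w = [1.0] * K
    total_reward = 0.0
    for t in range(T):
        W = sum(w)
        p = [(1 - gamma) * w[k] / W + gamma / K for k in range(K)]
        k = rng.choices(range(K), weights=p)[0]
        r = reward_fn(k, t)                        # reward in [0, 1]
        total_reward += r
        w[k] *= math.exp(gamma * r / (p[k] * K))   # update only the pulled arm
    return total_reward

# sanity check: arm 1 always pays 1, arm 0 pays 0
got = exp3(lambda k, t: float(k == 1), K=2, T=2000)
```

The importance-weighting by 1/P_j(t) is what lets Exp3 cope with adversarial rewards: each arm's weight update is an unbiased estimate of its reward even though only the pulled arm is observed.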
Solving MAB: Gittins Index
• Each arm delivers reward with a probability
• This probability may change through time, but only when the arm is pulled
• The goal is to maximize discounted rewards – the future is discounted by an exponential discount factor δ < 1
• The structure of the problem is such that all you need to do is compute an “index” for each arm and play the one with the highest index (rich theory to explain why)
• The index is of the form:
ν_i = sup_{T>0} [ Σ_{t=0}^{T} δ^t r(t) ] / [ Σ_{t=0}^{T} δ^t ]
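The true Gittins index involves expectations over stopping times; purely as an illustration of the ratio above, for a known deterministic reward sequence it can be evaluated directly:

```python
def index_of(rewards, delta=0.9):
    """sup over horizons T of (sum_t delta^t r(t)) / (sum_t delta^t),
    for a known, deterministic reward sequence (an illustrative special case)."""
    best = float("-inf")
    num = den = 0.0
    for t, r in enumerate(rewards):
        num += delta ** t * r
        den += delta ** t
        best = max(best, num / den)   # sup over stopping horizons seen so far
    return best

# front-loaded rewards give a high index; back-loaded ones are discounted away
print(index_of([1, 0, 0]), index_of([0, 0, 1]))
```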
Gittins Index – Intuition
• Proving optimality isn’t within our scope; it is based on:
• Stopping time: the point where you should terminate a bandit
• Nice property: the Gittins index for any given bandit is independent of the expected outcome of all other bandits
  – Once you have a good arm, keep playing until there is a better one
  – If you add/remove machines, the computation doesn’t really change
• BUT:
  – hard to compute, even when you know the distributions
  – exploration issues; arms aren’t updated unless used (restless bandits?)
Numerous Applications!
Equilibria
• “… as central to the study of social systems as it is to the analysis of physical phenomena. In the physical world, equilibrium results from the balancing of forces. In societies it results from the balancing of intentions.” (H. Peyton Young)
• Classical mechanics has both an equilibrium and a non-equilibrium description of motion
• What about a non-equilibrium study of strategic interactions?
Non-Equilibrium Study of Strategic Interactions
• Perhaps the nearest thing is Bayesian decision theory. If individuals can imagine:
  – Future states of the world
  – All possible changes in behaviour, by all individuals, over all possible sequences of states
• As conditions unfold, they update beliefs and optimize expected future payoffs
• If their beliefs put positive probability on the strategies their opponents are actually using, then beliefs and behaviours will gradually come into alignment, and equilibrium, or something close to it, will obtain
Points to Ponder
• The issue with this high-rationality viewpoint:
  – Individuals need sophistication, i.e., reasoning power
  – Can all possible futures actually be anticipated?
• A peculiarity of social systems (versus physical systems): individuals are learning about a process in which others are also learning (self-referential)
• When the observer is part of the system, the act of learning changes the thing to be learned
Example Application: Choosing Interfaces
Choose parameters that users can work with …
… a continual process!
Example Application: Adapting Interfaces
Some tasks permit incredible variety…
… which can be used to adapt online to individual users.
[Source: http://spyrestudios.com]
Example Application: Optimizing with a Moving Target
User performance is highly contingent on their experiences – on the paths they take in an interface landscape.
Simple(st) Example – Uncertain Game
• Soda Game

      L                  R
  L   Coke, Coke         Sprite, Seven-up
  R   Seven-up, Sprite   Pepsi, Pepsi

• Players know their own payoff but have no knowledge of the other player (not even, as in Bayesian games, distributions).
• Imagine you are the row player and you have observed:

  (Payoff)  0  0  0  1  1  0  0  0  1  0  0
  Row       L  R  L  L  R  R  L  R  R  R  R  ?
  Column    R  L  R  L  R  L  R  L  R  L  L  ?

What should you do in the next time period?
What is the Nature of Uncertainty Here?
• We do not know what kind of game we are facing
• If both of us prefer “dark” drinks to “light” drinks, or vice versa, it is a coordination game (three equilibria: two pure and one mixed)
• If one of us prefers dark and the other prefers light, it is like matching pennies: a unique mixed equilibrium
A Thorny Problem (Foster & Young, 2001)

Imagine a game was constructed as follows: entries in the payoff matrix are determined by independent draws from a normal distribution – once, at the beginning
• With rational Bayesian players who have a prior over the opponent’s strategy space guided by a commonly known payoff distribution
• It can be shown that, under any pair of priors, the players will fail to learn the Nash equilibrium with positive probability
• There may be no priors that satisfy the necessary condition of “absolute continuity” (i.e., that players’ prior beliefs capture the set of actual play paths with positive probability)
  – Need great care in analyzing learning procedures…
… and we have not even mentioned computational cost yet
Model: “Reinforcement” Learning
• Firstly, note that here the term is used slightly differently from what you may be used to!
• At each time period t, a subject chooses action x from a finite set X; Nature/an external subject chooses action y from a set Y
• The realized payoff is u(x_t, y_t); this is assumed time-independent
• We define another variable, θ, to model the subject’s propensity to play action x at time t. So, the probability of an action is,
  Let q_t and θ_t represent k-dim vectors
• Learning: how do the propensities evolve over time?
“Matching” Payoffs
• Define a random unit vector that acts as an indicator variable,
• A linear updating model for propensities is (u is payoff), with a discount factor, random perturbations, and a payoff term
• A simpler update rule is,
Cumulative Payoff Matching
• Cumulative payoffs up to time t:
• The sum of initial propensities is:
• Define a new quantity:
• So that the change in probability of an action, per period, is:
• The denominator is unbounded, so eventually this curve flattens out – the power law of practice
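A sketch of cumulative payoff matching as described above, assuming (as the slide's description suggests) that the choice probability is proportional to the propensity and that the played action's propensity is incremented by its realized payoff; the stationary environment here is a made-up illustration:

```python
import random

def cpm_play(payoff_fn, actions, T=1000, theta0=1.0, seed=0):
    """Cumulative payoff matching: propensity(x) = theta0 + cumulative payoff of x;
    actions are chosen with probability proportional to their propensities."""
    rng = random.Random(seed)
    theta = {x: theta0 for x in actions}
    for t in range(T):
        x = rng.choices(actions, weights=[theta[a] for a in actions])[0]
        theta[x] += payoff_fn(x, t)   # reinforce the played action by its payoff
    total = sum(theta.values())
    return {x: theta[x] / total for x in theta}  # current mixed strategy q_t

# stationary environment: action "L" pays 1 each period, "R" pays 0.2
q = cpm_play(lambda x, t: 1.0 if x == "L" else 0.2, ["L", "R"])
```

Because the normalizing denominator keeps growing, each single period moves q_t less and less, which is the flattening ("power law of practice") noted above.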
Roth-Erev RL Model
• Past payoffs are discounted at a constant geometric rate λ < 1, and in each period there are random perturbations or “trembles”
• The marginal impact of period-by-period payoffs levels off eventually, as the denominator is bounded above
• Another interpretation – in terms of aspiration levels (reinforce an action if its current payoff exceeds the aspiration)
Empirical Plausibility
• Many predictions of such models are observed in practice
  – Recency phenomena: recent payoffs tend to matter more than long-past ones
  – Habit forming: cumulative payoffs matter in addition to the average payoff of an action
• However, real human behaviour may not be restricted to simple rules like this.
• On a hierarchy of learning rules, these “RL” rules fall on the lower end of the spectrum
  – Behaviour depends solely on summary statistics of players’ payoffs
What is Captured in this type of RL
Despite their simplicity, they already capture some important qualitative features that are shared with other learning methods as well:
1. Probabilistic choice: subjects’ choices depend on history and a random component, which could be due to
  • Unmodeled behaviour
  • Deliberate experimentation
  • An intentional strategy to keep the opponent guessing
2. Sluggish adaptation: strong serial correlation between probability distributions in successive periods
What Other Ways Are There to Learn?
• Examples: no-regret learning, smoothed fictitious play, hypothesis testing with smoothed better responses
• Bayesian rational learning does not share all of the similarities from the previous slide:
  – Unless perfectly indifferent between actions, a Bayesian should prefer pure over mixed strategies
  – Optimal behaviour is sensitive to small changes in beliefs, so one can see frequent and radical changes in behaviour
Test: Learning in Stationary Environments
• RL presumes no mental model of the world and other agents
• Does it still lead to optimal behaviour against the subject’s environment?
  – Convergence to Nash equilibrium may be a tall order
  – What happens in a stationary (stochastic) environment?
• History:
• Behaviour strategy or ‘response rule’:
• This gives the conditional probability of an action:
• Assume Nature plays according to a fixed rule,
Learning in a Stationary Environment
• The combination of g and q* leads to a stochastic process, with realizations from Ω
• Let B(q*) denote the subset of actions in X that maximize the player’s expected payoff against q*
• We say that g is optimal against q* if
• Rule g is optimal against a stationary distribution if the above holds for every q*
  – similar to the equilibrium definition (but against a fixed distribution)
Result: Stationary Environment
Theorem: Given any finite action sets X and Y, cumulative payoff matching on X is optimal against every stationary distribution on Y

• In general games, this kind of statement is hard to make
  – The proof of this seemingly simple statement relies on stochastic approximation theory
  – Analysis under varying distributions is hard!
• In zero-sum games, CPM converges, with probability 1, to a Nash equilibrium
What Next?
• Simple reinforcement rules such as CPM omit any mention of the cognitive process
• What other kinds of criteria might subjects bring in?
  1. Pattern of past play: predict the opponent’s next action based on what has happened so far and choose actions to maximize expected payoffs
  2. Past payoffs: could we have done better by playing differently in the past?
    • No predictive behavioural model; subjects simply want to minimize ex post regret
Regret
• Consider the simple game of choosing soft drinks:

  (Payoff)  0  0  0  1  1  0  0  0  1  0  0
  Row       L  R  L  L  R  R  L  R  R  R  R  ?
  Column    R  L  R  L  R  L  R  L  R  L  L  ?

• Imagine you are allowed to replay the game, but you must do so by choosing the same action in every period (the hypothesis class from which you evaluate).
• We do not really know what the opponent would have done if we had changed our play, but we do have realized performance, so we ask with respect to this
  – If you just play R, the payoff is 5 (for L, the payoff is 6); the foregone payoff was 3
  – Average regret from not playing all L: 3/11; against all R: 2/11
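These averages can be checked directly. From the observed triples one can infer that the row payoff was 1 exactly when the two choices matched, so the fixed-action counterfactuals are computable:

```python
rows    = list("LRLLRRLRRRR")
cols    = list("RLRLRLRLRLL")
payoffs = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]

# Inferred from the observed triples: u(L,L) = u(R,R) = 1, u(L,R) = u(R,L) = 0
def u(x, y):
    return 1 if x == y else 0

T = len(rows)
realized = sum(payoffs)  # the 3 units actually collected
avg_regret = {x: (sum(u(x, y) for y in cols) - realized) / T for x in "LR"}
print(avg_regret)  # ≈ {'L': 3/11, 'R': 2/11}
```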
Regret
• Average payoff through to time t:
• For each action x, define the average regret from not having played x as,
• We have a vector of regrets,
• A given realization of play has no regret if,
• A behavioural rule g has no regret if, given a pre-specified infinite sequence of play by Nature, (y1, y2, …), almost all realizations ω generated by g satisfy the above condition
Regret Matching
• Many variations on learning using regret exist. A simple and appealing rule due to Hart and Mas-Colell is the following
• In each period t+1, the decision maker plays each action with probability proportional to the non-negative part of his regret up to that time,
• If the regret for R is 2/11 and for L is 3/11, then under regret matching the Row player chooses R or L with probability 2/5 and 3/5 respectively (at t = 12 in our previous example)
  – As per CPM, R would have been chosen more than L
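A sketch of the regret-matching choice rule (with the usual convention, assumed here, of playing uniformly when no regret is positive):

```python
def regret_matching_probs(regrets):
    """Play each action with probability proportional to the
    non-negative part of its regret (uniform if all regrets <= 0)."""
    pos = [max(r, 0.0) for r in regrets]
    s = sum(pos)
    if s == 0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / s for p in pos]

# the slide's example at t = 12: regret 3/11 for L, 2/11 for R
probs = regret_matching_probs([3/11, 2/11])
print(probs)  # ≈ [0.6, 0.4], i.e. 3/5 and 2/5
```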
Regret Matching with ε-Experimentation
Can one do this without even knowing the opponent's actions?
  – The subject experiments, randomly, with small probability ε
  – When not experimenting, he employs regret matching with the following modification:
    • The “estimated” regret for action x is its average payoff in previous periods when he experimented and chose x MINUS the average realized payoff over all actions in all previous periods

Theorem: In a finite game against Nature, given δ > 0, for all sufficiently small ε > 0, regret matching with ε-experimentation has at most δ-regret against every sequence of play by Nature.
Why Does Regret Matching Work?
• Player X has two actions, {1, 2}
• Average per-period payoff,
• If he had just played action 1,
• Regret:
• We want to have, , almost surely, where the non-negative part of the regret is being denoted as
How Does Regret Matching Work?
• In period t+1, the opponent takes an unforeseen action
  – Irrespective of what the action will be, the next-period regret from playing action 1 is the negative of that corresponding to action 2
  – The incremental regret is of the form (α_{t+1}, −α_{t+1}) for an unknown α_{t+1}
• Let us say one is following a mixed strategy,
• Expected incremental regret with respect to this strategy,
• Weighted over time,
Regret Matching Procedure
• The goal is to choose a probability to make
• This is the same as making sure that is orthogonal to the current
• This implies,
• which implies,
Conditional or Internal Regret
• There exists a pair of actions x, y such that playing x would have yielded a higher total payoff over all the periods when the subject actually played y
  – e.g., one may not have done better with an action such as ‘wear blue’
  – The conditional statement is that she could have done better by always wearing blue whenever she had instead worn black
• Given a play path ω, the player’s conditional regret matrix at time t is a matrix R_t(ω) such that
Shapley’s Game
Consider the following game:

      R    Y    B
  R   1,0  0,0  0,1
  Y   0,1  1,0  0,0
  B   0,0  0,1  1,0

Suppose we have a history of play over 10 periods:

  (Payoff)  1  0  0  0  1  0  0  0  0  0
  Row       R  R  B  B  B  Y  Y  R  Y  R
  Column    R  B  Y  Y  B  R  R  Y  R  Y
Shapley’s Game – Conditional Regret
• Adopt the perspective of the Row player; at the end of ten periods his conditional regret matrix is

      R      Y     B
  R   0      0.1   0
  Y   0.3    0     0
  B   -0.1   0.1   0

• If Row had played R in the three periods when he actually played Y, his total for those periods would have been 3 instead of 0. So, the average conditional regret is 3/10 in cell (Y, R)
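The matrix above can be reproduced from the game and the 10-period history:

```python
# Row player's payoff in Shapley's game: u[x][y]
u = {"R": {"R": 1, "Y": 0, "B": 0},
     "Y": {"R": 0, "Y": 1, "B": 0},
     "B": {"R": 0, "Y": 0, "B": 1}}
rows = list("RRBBBYYRYR")
cols = list("RBYYBRRYRY")
T = len(rows)

# Cell (y, x): average gain had Row played x in every period he actually played y
regret = {y: {x: sum(u[x][c] - u[r][c] for r, c in zip(rows, cols) if r == y) / T
              for x in "RYB"} for y in "RYB"}
print(regret["Y"]["R"])  # 0.3, matching cell (Y, R) on the slide
```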
Learning with Conditional Regret
Fact: There exist learning rules that eliminate conditional regrets no matter what Nature does (Foster & Vohra, 1997)

These rules are of the general form: reinforcement increments δ_t are computed (e.g., linear-algebraically) from the conditional regret matrix.

The proof is based on a celebrated result called Blackwell’s Approachability Theorem.
Calibration
(P. Dawid) A sequence of binary forecasts is calibrated if, in all those periods when the forecaster predicts that event “1” will occur with probability p, the empirical frequency of 1’s in all of those periods is in fact p

A similar definition applies to arbitrary symbols being forecast, e.g., real-valued predictions, but the definition is more intricate in its formulation…
Example: Bridge Contracts (Keren 1987)
Example: Physicians (Christensen-Szalanski et al.)
Why might this bias make sense?
Random Forecasting Rules
• The forecast equivalent of a randomized action choice
• A rule of the form (z is a random variable):
• F is calibrated if, for every ω, the following calibration score goes to zero almost surely on the player’s sequence of forecasts
  [equation annotations: number of times p was forecast up to t; empirical distribution of outcomes when prediction p was forecast]
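The score itself is an image on the slide; one common form consistent with the annotations (weight each forecast value p by how often it was used, times the distance between p and the empirical frequency of 1's on those periods) can be sketched as:

```python
from collections import defaultdict

def calibration_score(forecasts, outcomes):
    """Sum over forecast values p of (n_p / t) * |freq of 1's when p was forecast - p|."""
    t = len(forecasts)
    n = defaultdict(int)      # number of times p was forecast up to t
    ones = defaultdict(int)   # number of 1-outcomes on those periods
    for p, y in zip(forecasts, outcomes):
        n[p] += 1
        ones[p] += y
    return sum(n[p] / t * abs(ones[p] / n[p] - p) for p in n)

# perfectly calibrated: "0.5" is forecast 4 times and 1 occurs in exactly 2 of them
print(calibration_score([0.5] * 4, [1, 0, 1, 0]))  # 0.0
```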
Calibrated Forecasters
Given any finite set Z and ε > 0, there exist random forecasting rules that are ε-calibrated for all sequences on Z

Theorem: Let G be a finite game. Suppose every player uses a calibrated forecasting rule and chooses a myopic best response to his forecast. Then the empirical frequency distribution of play converges with probability one to the set of correlated equilibria of G
Takeaway Messages
• Equilibrium is a nice concept, but in life a lot of the real action is off equilibrium
• How do people get to equilibria?
• What happens if everyone is learning, groping their way towards some notion of ‘equilibrium’?
• This area has many counter-intuitive results
  – ‘Perfect’ Bayesian learning is not always so
  – Simple learning rules give surprisingly useful behaviour
  – Notions such as regret enable learning despite limits to modelling of the underlying process
• Many algorithms, such as regret matching and calibrated forecasts, represent ways to get to equilibrium