Bandits: Part I
Stochastic, Finite-Armed Bandits
Csaba Szepesvari
Department of Computing Science & AICML, University of Alberta (& DeepMind since August)
August 31, 2017
Summer School @ DS3
Outline

1 Overview of Talks
2 Introduction: Learning Objectives · Brief History · Why Should We Care? · Exploration vs. Exploitation · Applications
3 Finite-Armed Stochastic Bandits: Bandits · Stochastic Bandits · Basic Properties of the Regret
4 Measure Concentration
5 Explore-then-Commit (ETC): Algorithm · Regret Upper Bound · Tuning ETC · Exercise/Illustration
6 Upper Confidence Bound (UCB): Optimism Principle · The UCB Algorithm · Regret Upper Bounds for UCB · Empirical Illustration · Asymptopia · Zoo of UCBs and Risk Management
7 Summary
8 Further Reading
Overview of Talks
• Talk 1: Basics
  • What, why, applications
  • Bandits: problem definition
  • Stochastic finite-armed bandits: basics
  • Measure concentration; subgaussianity
  • Explore-then-commit
  • UCB, optimism, optimality
• Talk 2: Adversarial and linear bandits
  • Adversarial finite-armed bandits
  • Contextual bandits
  • Exp4
  • Stochastic linear bandits
  • Adversarial linear bandits
• Talk 3: Bandits in the wild
Learning Material
• This lecture (and more): http://banditalgs.com
• Joint effort with Tor Lattimore
  • Book to be published by early next year: stay tuned!
  • Tor's lightweight C++ bandit library
• I will share some exercises later
• Sébastien Bubeck's tutorial
  • Blog post 1
  • Blog post 2
• Bubeck and Cesa-Bianchi's book (Bubeck and Cesa-Bianchi, 2012)
![Page 7: Bandits: Part I Stochastic, Finite-Armed Bandits · Basic Properties of the Regret 4 Measure Concentration 2/108. Outline 5 Explore-then-Commit (ETC) Algorithm Regret Upper Bound](https://reader035.vdocument.in/reader035/viewer/2022063016/5fd52f4d0ffdb846f9068b4b/html5/thumbnails/7.jpg)
Learning Objectives: Knowledge
The goal is to gain knowledge about:
• Bandit problems: What are they?
• Types of bandit problems: How do they differ?
• Key ideas: Explore vs. exploit; why significant? What to do?
• Basic solution techniques
• Core results: How far did we get? How far can we go?
• Peek into contemporary research
Learning Objectives: Skills
Skills to be acquired. Ability to . . .
• . . . recognize bandit problems;
• . . . recognize types/variants of bandits;
• . . . write code for bandits (algorithms, environment, . . . );
• . . . recognize insurmountable tradeoffs; limits of what is possible;
• . . . get around in the literature.
Origin of the Name
Mouse learning in a T-maze.
1953: Frederick Mosteller and Robert Bush, psychologists, "A Stochastic Model with Applications to Learning".

Generalization to humans: "two-armed bandits".
Classics
• First paper on bandits: Thompson (1933)
• Major effort by Herbert Robbins in the 50s; earliest is (Robbins, 1952)
• Another pioneer: Herman Chernoff, e.g., (Bather and Chernoff, 1967)
• Breakthrough in Bayesian bandits (computation): (Gittins, 1979)
• A breakthrough in stochastic bandits: (Lai and Robbins, 1985), asymptotic regret-optimality in a frequentist setting
• UCB as we know it: (Auer et al., 2002), finite-time optimality
• Exp3, adversarial bandits: (Auer et al., 1995)
• Early books (sequential design, Bayesian setting): (Chernoff, 1959; Berry and Fristedt, 1985; Gittins et al., 2011).
The Number Game
Number of papers published in 5-year periods (present = 2016) on bandits, as reported by Google Scholar.
Why Do People Care?
• Decision making under uncertainty is a significant challenge
• Bandits capture a key aspect of this challenge: the exploration-exploitation dilemma

Some examples:

• Which drugs should a patient receive?
• How should I allocate my study time between courses?
• Which version of a website will generate the most revenue?
• What move should be considered next when playing chess/go?
• . . .
The Exploration-Exploitation Dilemma

Payoffs after 5 pulls of each arm:

Left arm: 0, 10, 0, 0, 10
Right arm: 10, 0, 0, 0, 0

Left arm average payoff: 4 dollars per round.
Right arm average payoff: 2 dollars per round.
Budget: 20 more trials (pulls).
• Shall we pull left only? “Exploit”?
• Shall we pull the right arm at all? “Explore”?
• Why?
• How many times to pull each arm?
• “Good luck/bad luck?”
Play time!
Applications, applications...

1 A/B testing
2 Drug testing
3 Advert placement
4 Network routing (packets, planes, cars)
5 Tree search (MCTS)
6 Recommendation services (for example, news or movies)
7 Ranking (for example, search)
8 Educational games
9 Resource allocation (memory, bandwidth, manufacturing space)
10 Waiting problems (hard-disk shutdown, auto logout, waiting for a bus)
11 Dynamic pricing (for example, on Amazon)
12 A core component of RL
Questions?
Bandits: Interaction Protocol and Goal
For rounds t = 1, 2, ..., n:

1 The learner chooses an action At from a set A of available actions. The chosen action is sent to the environment.

2 The environment (E) generates a response in the form of a real-valued reward Xt ∈ R, which is sent back to the learner.

The goal of the learner is to maximize the sum of rewards it receives, ∑_{t=1}^n Xt.
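This protocol can be sketched as a small Python loop (an illustrative sketch only; the toy environment, reward probabilities, and uniform policy below are made-up placeholders, not part of the slides):

```python
import random

def run_bandit(policy, environment, n):
    """Run the learner-environment interaction for n rounds; return the total reward."""
    history = []            # H_t = (A_1, X_1, ..., A_t, X_t)
    total = 0.0
    for t in range(n):
        a = policy(history)          # learner picks A_t based on the history so far
        x = environment(a)           # environment responds with reward X_t
        history.append((a, x))
        total += x
    return total

# Toy example: two actions; action 1 pays 1 with probability 0.7, action 0 with 0.4
random.seed(0)
env = lambda a: float(random.random() < (0.4, 0.7)[a])
uniform_policy = lambda h: random.randrange(2)   # ignores the history entirely
print(run_bandit(uniform_policy, env, 1000))
```

A policy that actually uses `history` (rather than ignoring it) is exactly what the rest of the lecture is about.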
Bandits: Interaction Protocol and Goal: Concepts
• History at the end of round t: Ht = (A1, X1, ..., At, Xt).
• The learner can base its action At+1 in round t + 1 on the history.
• The learner "uses" a "policy": a map from all possible histories to actions.
• The learner is also allowed to randomize.
Bandits and Regret
Definition (Regret – Informal definition)
The regret of a learner relative to action a = (total reward gained when a is used for all n rounds) − (total reward gained by the learner in n rounds using its chosen actions).
• Maximize reward ⇔ minimize regret.
• Advantage? Normalizes the scale so that zero has special meaning (kills meaningless shifts).
Questions:
• What does it mean that a learner has positive regret?
• (What does it mean that a learner has positive reward?)
• What does it mean that a learner has zero regret?
Regret – II.
Definition (Zero-regret learner)
A learner is "zero-regret" if its regret Rn at time n satisfies Rn/n → 0; a.k.a. "vanishing regret" or sublinear regret: Rn = o(n).

In general we care about how fast Rn/n → 0 happens, i.e., how slowly Rn grows.

Examples:

• Rn = O(√n) (so Rn/n = O(1/√n)).
• Rn = O(log(n)) (so Rn/n = O(log(n)/n)).
Discussion
• Why compare with fixed actions? Is this limiting? What does this capture?
• Stationarity.
• Does the policy know the "horizon"? Is it "anytime"?
• Good learner: "small" regret (small worst-case regret!?) over a large class of environments.
• Instance-dependent regret.
• Regret lower and upper bounds: "worst-case" vs. instance-dependent.
Types of Environments
• Stochastic
• Adversarial (“unconstrained”)
• Unstructured vs. structured payoff
• Contextual
• . . .
Stochastic Bandits
Definition
A K-armed stochastic bandit environment is a tuple of distributions ν = (P1, P2, ..., PK), where Pi is a distribution over the reals for each i ∈ [K] := {1, 2, ..., K}.

Interaction
For rounds t = 1, 2, 3, ...

1 Based on its past observations (if any), the learner chooses an action At ∈ [K] following some policy π: At ∼ π(· | A1, X1, ..., At−1, Xt−1). The chosen action is sent to the environment.

2 The environment generates a random reward Xt whose distribution is P_At (in notation: Xt ∼ P_At). The generated reward is sent back to the learner.
A note
• In the environment, no joint distribution (over all arms) is specified. Why?
• The learner sees only Xt, not the rewards of the other arms. Even if those exist, they are latent; the learner cannot access them, hence the properties of the joint (even if it were specified) do not matter.
• Precise answer: all information about an interconnected environment-policy pair (ν, π) is in the joint distribution of the action-reward sequences.
Typical Environments
Name                      Symbol         Definition
Bernoulli                 E^K_B          {(B(µi))_i : µ ∈ [0, 1]^K}
Uniform                   E^K_U          {(U(ai, bi))_i : a, b ∈ R^K, a ≤ b}
Gaussian (known var.)     E^K_N(σ²)      {(N(µi, σ²))_i : µ ∈ R^K}
Gaussian (unknown var.)   E^K_N          {(N(µi, σi²))_i : µ ∈ R^K, σ² ∈ R^K_+}
Finite variance           E^K_V(σ²)      {(Pi)_i : V_{X∼Pi}[X] ≤ σ² for all i}
Finite kurtosis           E^K_Kurt(κ)    {(Pi)_i : Kurt_{X∼Pi}[X] ≤ κ for all i}
Bounded support           E^K_[a,b]      {(Pi)_i : Supp(Pi) ⊆ [a, b]}
Subgaussian               E^K_SG(σ²)     {(Pi)_i : Pi is σ-subgaussian for all i}

Supp(P) is the support of distribution P; Kurt(X) = E[(X − E[X])⁴] / V[X]².

Table: Typical environment classes for stochastic bandits
Expected Reward/Regret
• Sn = ∑_{t=1}^n Xt: total reward. Random!!!
• Possible goal: maximize E[Sn], the expected reward.
• OK?
• Same as minimizing Rn, the (expected) regret, where

Rn := nµ* − E[Sn],

µ* = max_{i∈[K]} µi,  µi = ∫_{−∞}^{+∞} x Pi(dx).
Basic Properties of Regret – I.
• Let Rn(π, ν) be the (expected!) regret of policy π on environment ν.
Lemma (Warmup – Exercise 0)
(a) Rn(π, ν) ≥ 0 for all policies π.

(b) The policy π choosing At ∈ argmax_i µi for all t satisfies Rn(π, ν) = 0.

(c) If Rn(π, ν) = 0 for some policy π, then for all t, At ∈ [K] is optimal with probability one:

P(µ_At = µ*) = 1.
Regret Decomposition – Important!!!
• Let ν = (P1, ..., PK) be a bandit environment.
• Let ∆i(ν) = µ*(ν) − µi(ν): the sub-optimality gap (or action gap, or immediate regret) of action i.
• Usage count for action i:

Ti(t) = ∑_{s=1}^t I{As = i}.

• Note: ∆i := ∆i(ν) is non-random, while Ti(t) is random! (Why?)

Lemma (Regret Decomposition Lemma)

For any policy π, K-armed stochastic bandit environment ν, and horizon n ∈ N, the regret Rn of policy π in ν satisfies

Rn = ∑_{i=1}^K ∆i E[Ti(n)].
Proof of the Regret Decomposition Lemma
Proof.
For any fixed t we have ∑_k I{At = k} = 1. Hence Sn = ∑_t Xt = ∑_t ∑_k Xt I{At = k}, and thus

Rn = nµ* − E[Sn] = ∑_{k=1}^K ∑_{t=1}^n E[(µ* − Xt) I{At = k}].

Now, knowing At, the expected reward is µ_At. Thus we have

E[(µ* − Xt) I{At = k} | At] = I{At = k} E[µ* − Xt | At]
                            = I{At = k} (µ* − µ_At)
                            = I{At = k} (µ* − µk).

Take expectations, sum both sides over t = 1, ..., n, and use E[∑_t I{At = k}] = E[Tk(n)].
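The regret decomposition lemma lends itself to a numerical sanity check: simulate a policy, estimate both Rn = nµ* − E[Sn] and ∑_i ∆i E[Ti(n)] from the same runs, and compare. A small sketch with made-up Bernoulli means and a uniformly random policy (neither is from the slides):

```python
import random

def simulate(means, n, runs=2000, seed=1):
    """Estimate both sides of the regret decomposition for a uniformly random policy."""
    rng = random.Random(seed)
    K = len(means)
    mu_star = max(means)
    avg_counts = [0.0] * K          # estimates of E[T_i(n)]
    avg_reward = 0.0                # estimate of E[S_n]
    for _ in range(runs):
        for _t in range(n):
            a = rng.randrange(K)    # the policy: pick an arm uniformly at random
            avg_counts[a] += 1.0 / runs
            avg_reward += (1.0 if rng.random() < means[a] else 0.0) / runs
    regret = n * mu_star - avg_reward                  # R_n = n*mu_star - E[S_n]
    decomposed = sum((mu_star - m) * c for m, c in zip(means, avg_counts))
    return regret, decomposed

r, d = simulate([0.3, 0.5, 0.7], n=50)
print(r, d)  # the two estimates agree up to Monte Carlo error
```

For the uniform policy E[Ti(n)] = n/K exactly, so both sides should concentrate around n·(∑_i ∆i)/K.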
Questions?
Exercise 1: (get code)
Implement a Bernoulli bandit environment in Python using the code snippet below:

class BernoulliBandit:
    # accepts a list of K >= 2 floats, each lying in [0, 1]
    def __init__(self, means):
        pass

    # should return the number of arms
    def K(self):
        pass

    # accepts a parameter 0 <= a <= K-1 and returns the
    # realisation of a random variable X with P(X = 1) being
    # the mean of the (a+1)th arm
    def pull(self, a):
        pass

    # returns the regret incurred so far
    def regret(self):
        pass
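One possible way to fill in the stubs (an illustrative sketch, not the official solution; note that `regret` returns the random regret t·µ* − ∑ Xs, which the environment can compute only because it knows its own true means):

```python
import random

class BernoulliBandit:
    # accepts a list of K >= 2 floats, each lying in [0, 1]
    def __init__(self, means):
        assert len(means) >= 2 and all(0.0 <= m <= 1.0 for m in means)
        self.means = list(means)
        self.best_mean = max(self.means)   # mu_star, used for regret bookkeeping
        self.total_reward = 0.0
        self.pulls = 0

    # returns the number of arms
    def K(self):
        return len(self.means)

    # accepts 0 <= a <= K-1 and returns a Bernoulli sample with the arm's mean
    def pull(self, a):
        x = 1.0 if random.random() < self.means[a] else 0.0
        self.total_reward += x
        self.pulls += 1
        return x

    # returns the (random) regret incurred so far: t * mu_star - sum of rewards
    def regret(self):
        return self.pulls * self.best_mean - self.total_reward
```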
Exercise 2: Follow-the-Leader (get code )
Implement the following simple algorithm called 'Follow-the-Leader' (FTL), which chooses each action once and subsequently chooses the action with the largest average observed so far. Ties should be broken randomly.

def FollowTheLeader(bandit, n):
    # implement the Follow-the-Leader algorithm by replacing
    # the code below, which just plays the first arm in every round
    for t in range(n):
        bandit.pull(0)

Note: Depending on the literature you are reading, Follow-the-Leader may be called 'stay with the winner' or the 'greedy algorithm'.
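A sketch of one possible FTL implementation against the `BernoulliBandit` interface of Exercise 1 (illustrative only; ties are broken uniformly at random, as required):

```python
import random

def FollowTheLeader(bandit, n):
    """Play each arm once, then always play the arm with the best empirical mean."""
    K = bandit.K()
    counts = [0] * K      # number of pulls per arm
    sums = [0.0] * K      # total reward per arm
    for t in range(n):
        if t < K:
            a = t                                      # initial round-robin
        else:
            means = [s / c for s, c in zip(sums, counts)]
            best = max(means)
            leaders = [i for i, m in enumerate(means) if m == best]
            a = random.choice(leaders)                 # break ties randomly
        x = bandit.pull(a)
        counts[a] += 1
        sums[a] += x
```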
Exercise 3: Distribution of (random) regret
Consider a Bernoulli bandit with two arms and means µ1 = 0.5 and µ2 = 0.6.

(a) Using a horizon of n = 100, run 1000 simulations of your implementation of Follow-the-Leader on the Bernoulli bandit above and record the (random) regret, nµ* − Sn, in each simulation.

(b) Plot the results using a histogram (see the figure).

(c) Explain the results in the figure.

Figure: Histogram of regret for FTL over 1000 trials on a Bernoulli bandit with means µ1 = 0.5, µ2 = 0.6.
Exercise 4: Regret Over Time
Consider the same Bernoulli bandit as used in the previous question.

(a) Run 1000 simulations of your implementation of FTL for each horizon n ∈ {100, 200, 300, ..., 1000}.

(b) Plot the average regret obtained as a function of n (see the figure). Include error bars.

(c) Explain the plot. Do you think FTL is a good algorithm? Why/why not?

Figure: Expected regret of FTL as a function of the horizon n, averaged over 1000 trials.
Statistics 101
• X1, ..., Xn independent, identically distributed (∼ P), real-valued random variables (think rewards).
• What is µ := E[X1] (= E[Xt])?
• Estimate: µ̂ = (1/n) ∑_{t=1}^n Xt.
• "Statistic" of the data: any function of X1, ..., Xn.
• Notice that µ̂ is random, but µ̂ ≈ µ.
• The distribution of µ̂ depends on P.
• "How close?": characterize the distribution of |µ̂ − µ|.
• A priori characterization: the characterization depends on P, where all we know is that P ∈ P.
• A posteriori characterization: the characterization depends on X1, ..., Xn.
Tail probabilities
Imagine X := µ̂.

We care about the probability masses P(µ̂ < µ − ε) and P(µ̂ > µ + ε).

Either, given ε, give lower or upper bounds on these ("probability bound"), or, for a given probability mass, give upper and/or lower bounds on ε ("deviation bound").
Markov and Chebyshev
Lemma

For any random variable X with finite mean and ε > 0 it holds that:

(a) (Markov): P(|X| ≥ ε) ≤ E[|X|] / ε.

(b) (Chebyshev): P(|X − E[X]| ≥ ε) ≤ V[X] / ε².

• Exercise 1: Prove Markov. Hint: prove it for nonnegative r.v.s, use E[X] = ∫_0^∞ x P(dx) and split the integral.
• Exercise 2: Prove Chebyshev. Hint: apply Markov.
• Chebyshev applied to µ̂: V[µ̂] = σ²/n. Hence

P(|µ̂ − µ| ≥ ε) ≤ σ² / (nε²).

• Note: Chebyshev is more precise. You can further increase precision by applying Markov to |X − E[X]|^k, k big.
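A quick numeric illustration of how loose Chebyshev can be (a sketch; the Bernoulli(1/2) example and the chosen numbers are mine, not from the slides). For the sample mean of n = 100 Bernoulli(1/2) draws and ε = 0.1, Chebyshev gives σ²/(nε²), while the exact two-sided tail can be computed from the binomial distribution:

```python
import math
from fractions import Fraction

def binom_pmf(n, k, p):
    # P(Binomial(n, p) = k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p, eps = 100, Fraction(1, 2), Fraction(1, 10)
var = float(p * (1 - p))                   # variance of one Bernoulli(1/2) draw
chebyshev = var / (n * float(eps)**2)      # Chebyshev: P(|mean - p| >= eps) <= var/(n*eps^2)
# exact two-sided tail; Fractions avoid floating-point trouble at the boundary k = 60
exact = sum(binom_pmf(n, k, float(p)) for k in range(n + 1)
            if abs(Fraction(k, n) - p) >= eps)
print(chebyshev, exact)   # 0.25 vs about 0.057: the bound is valid but loose
```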
Central Limit Theorem (CLT)
Theorem (CLT)

Let Xt be iid with σ² = V[X1] < ∞, and let Sn = ∑_{t=1}^n (Xt − µ), Zn = Sn/√(σ²n). Then F_{Zn} → F_Z as n → ∞, where Z ∼ N(0, 1).

Note: here F_Z(u) = P(Z ≥ u) = ∫_u^∞ (1/√(2π)) exp(−x²/2) dx.

Bounding F_Z(u):

∫_u^∞ (1/√(2π)) exp(−x²/2) dx ≤ (1/(u√(2π))) ∫_u^∞ x exp(−x²/2) dx = √(1/(2πu²)) exp(−u²/2).

Hence

P(µ̂ ≥ µ + ε) = P(Sn/√(σ²n) ≥ ε√(n/σ²)) ≈ P(Z ≥ ε√(n/σ²)) ≤ √(σ²/(2πnε²)) exp(−nε²/(2σ²)).
How good is the CLT?
P(µ̂n ≥ µ + ε) ≲ √(σ²/(2πnε²)) exp(−nε²/(2σ²)).

Question: can we safely replace ≲ with ≤, e.g., when Xt ∼ P and P is supported on [0, 1]?

(a) Yes: when n ≥ 30, the error will be very small for these distributions.

(b) Yes: when n ≥ 1000, the error will be very small for these distributions.

(c) No: no n will make the error uniformly small when P ranges over all distributions with support in [0, 1].
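A small computation hints at why (c) is the case: for a sufficiently skewed Bernoulli, the exact tail exceeds the Gaussian-approximation bound by an order of magnitude. This is a sketch; the parameters n = 30, p = 0.01, ε = 0.05 are my choice, not from the slides:

```python
import math

def exact_tail(n, p, eps):
    # P(mean >= p + eps) for n iid Bernoulli(p) draws, computed exactly
    k_min = math.ceil(n * (p + eps))
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def clt_tail_bound(n, var, eps):
    # the Gaussian-approximation bound sqrt(var/(2*pi*n*eps^2)) * exp(-n*eps^2/(2*var))
    return math.sqrt(var / (2 * math.pi * n * eps**2)) * math.exp(-n * eps**2 / (2 * var))

n, p, eps = 30, 0.01, 0.05
var = p * (1 - p)
print(exact_tail(n, p, eps), clt_tail_bound(n, var, eps))
```

Here the true tail is roughly ten times larger than the bound, so swapping ≲ for ≤ at n = 30 would be unsafe for this distribution.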
Subgaussianity
Definition (Subgaussianity)
A random variable X is σ-subgaussian if for all λ ∈ R it holds that E[exp(λX)] ≤ exp(λ²σ²/2).
The moment/cumulant generating functions of X, M_X, ψ_X : R → R:
M_X(λ) = E[exp(λX)], ψ_X(λ) = log M_X(λ), λ ∈ R.

Lemma
X is σ-subgaussian iff ψ_X(λ) ≤ λ²σ²/2 for all λ ∈ R.

Example: Z ∼ N(0, σ²). Then M_Z(λ) = exp(λ²σ²/2), so Z is σ-subgaussian.

Do M_X (or ψ_X) always exist? No: e.g., for X ∼ Exp(1), M_X(λ) = ∞ for λ ≥ 1.
Why the Name?
Theorem

If X is σ-subgaussian, then for any ε ≥ 0,

P(X ≥ ε) ≤ exp(−ε²/(2σ²)).   (1)
Proof.
We take a generic approach called the Cramér-Chernoff method. Let λ > 0 be some constant to be tuned later. Then

P(X ≥ ε) = P(exp(λX) ≥ exp(λε))
         ≤ E[exp(λX)] exp(−λε)       (Markov's inequality)
         ≤ exp(λ²σ²/2 − λε).         (def. of subgaussianity)

Now λ was an arbitrary positive constant, and in particular may be chosen to minimize the bound above, which is achieved by λ = ε/σ². Substituting this value gives exp(−ε²/(2σ²)).
Variations
Union bound: P (A ∪ B) ≤ P (A) + P (B).
Corollary: P(|X| ≥ ε) ≤ 2 exp(−ε²/(2σ²)).

Equivalent "deviation" forms:

P(X ≥ √(2σ² log(1/δ))) ≤ δ and P(|X| ≥ √(2σ² log(2/δ))) ≤ δ,

or, with probability 1 − δ, X ∈ (−√(2σ² log(2/δ)), √(2σ² log(2/δ))).
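As a quick numerical sanity check of the tail bound (1) (an illustrative sketch, not part of the slides; the function names are mine), one can compare the subgaussian bound with the exact Gaussian tail, computed via the complementary error function:

```python
import math

def subgaussian_tail_bound(eps, sigma):
    """Right-hand side of (1): exp(-eps^2 / (2 sigma^2))."""
    return math.exp(-eps**2 / (2 * sigma**2))

def gaussian_tail(eps, sigma):
    """Exact P(Z >= eps) for Z ~ N(0, sigma^2), via erfc."""
    return 0.5 * math.erfc(eps / (sigma * math.sqrt(2)))

# The bound dominates the exact Gaussian tail for every threshold.
for eps in [0.5, 1.0, 2.0, 3.0]:
    assert gaussian_tail(eps, 1.0) <= subgaussian_tail_bound(eps, 1.0)
```

The gap between the two is a constant factor of order 1/ε, which is why the bound is called tight "up to polynomial factors".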
Algebra with Subgaussian Random Variables
Lemma
Suppose that X is σ-subgaussian and that X₁ and X₂ are independent and σ₁- and σ₂-subgaussian, respectively. Then:

(a) E[X] = 0 and Var[X] ≤ σ².

(b) cX is |c|σ-subgaussian for all c ∈ R.

(c) X₁ + X₂ is √(σ₁² + σ₂²)-subgaussian.

Note
No matter what, X₁ + X₂ is (σ₁ + σ₂)-subgaussian. Independence improves this to √(σ₁² + σ₂²).
Concentration of the Mean
Corollary
Assume that Xᵢ − µ, i ∈ [n], are independent, σ-subgaussian random variables. Then, for any ε ≥ 0,

P(µ̂ ≥ µ + ε) ≤ exp(−nε²/(2σ²)) and P(µ̂ ≤ µ − ε) ≤ exp(−nε²/(2σ²)), (2)

where µ̂ = (1/n) ∑_{t=1}^n X_t.

Exercise 5
Using exp(−x) ≤ 1/(ex) (which holds for all x ≥ 0), show that, except for very small ε, the above inequality is strictly stronger than what we obtained via Chebyshev's inequality, and exponentially smaller (tighter) when nε² is large relative to σ².
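A small Monte Carlo sketch of (2) (illustrative; the seed, sample sizes, and function name are arbitrary choices of mine): for Gaussian data the empirical tail frequency should sit below the subgaussian bound.

```python
import math
import random

def empirical_tail(n, eps, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of P(mean of n N(0, sigma^2) samples >= eps)."""
    rng = random.Random(seed)
    hits = sum(
        sum(rng.gauss(0.0, sigma) for _ in range(n)) / n >= eps
        for _ in range(trials)
    )
    return hits / trials

n, eps = 20, 0.5
bound = math.exp(-n * eps**2 / 2)  # right-hand side of (2) with sigma = 1
assert empirical_tail(n, eps) <= bound
```

For n = 20 and ε = 0.5 the bound is exp(−2.5) ≈ 0.082, while the true tail probability is about an order of magnitude smaller, consistent with the exercise.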
Deviation Form
Corollary
For any δ ∈ (0, 1], with probability at least 1 − δ,

µ̂ ≤ µ + √(2σ² log(1/δ)/n). (3)

Symmetrically, it also follows that with probability at least 1 − δ,

µ̂ ≥ µ − √(2σ² log(1/δ)/n). (4)
Further Examples
• If X is distributed like a Gaussian with zero mean and variance σ², then X is σ-subgaussian.
• If X is bounded and zero-mean (i.e., E[X] = 0 and |X| ≤ B almost surely for some B ≥ 0), then X is B-subgaussian.
• Specifically, if X is a shifted Bernoulli with P(X = 1 − p) = p and P(X = −p) = 1 − p, then X is 1/2-subgaussian.

Extension of the Definition
• X is σ-subgaussian if the noise X − E[X] is σ-subgaussian.
• A distribution is called σ-subgaussian if a random variable drawn from that distribution is σ-subgaussian.
Bibliography
• Basic reference: Boucheron et al. (2013) (concentration under independence).
• Matrix versions of many standard results: Tropp (2015).
• Survey of classical results: McDiarmid (1998).
• Self-normalization (by standard deviation/variance): Peña et al. (2008).
• Empirical process theory: van de Geer (2000) or Dudley (2014).
Questions?
Outline
5 Explore-then-Commit (ETC): Algorithm · Regret Upper Bound · Tuning ETC · Exercise/Illustration

6 Upper Confidence Bound (UCB): Optimism Principle · The UCB Algorithm · Regret Upper Bounds for UCB · Empirical Illustration · Asymptopia · Zoo of UCBs and Risk Management

7 Summary

8 Further Reading
Standing Assumption (until further notice)
Assumptions
All bandit instances are in E_K^SG(1), i.e., the reward distribution of every arm is 1-subgaussian.
Is this restrictive?
1 All the algorithms that follow rely on the knowledge of σ.
2 Unequal subgaussianity constant across arms.
Notation
Empirical mean of arm i :
µ̂_i(t) = (1/T_i(t)) ∑_{s=1}^t I{A_s = i} X_s,

where

T_i(t) = ∑_{s=1}^t I{A_s = i}

and A_s ∈ [K] = {1, . . . , K} is the index of the arm chosen in round s.
Explore-then-Commit (ETC)
• Explore all the K arms m times.
• Go with the winner for the remaining rounds.
1: Input m ∈ N.
2: In round t choose action

A_t = i, if (t mod K) + 1 = i and t ≤ mK;
A_t = argmax_i µ̂_i(mK), if t > mK

(ties in the argmax are broken arbitrarily).
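In code, the scheme above can be sketched as follows (a minimal illustration assuming unit-variance Gaussian rewards; the function name and interface are mine, not from the slides):

```python
import random

def etc(arm_means, n, m, seed=0):
    """Explore-then-Commit: pull each of the K arms m times in round-robin
    order, then commit to the empirically best arm for the remaining rounds.
    Rewards are Gaussian with unit variance; returns the realized
    pseudo-regret sum_t (mu* - mu_{A_t})."""
    rng = random.Random(seed)
    K = len(arm_means)
    totals = [0.0] * K        # summed rewards per arm
    best = max(arm_means)
    committed = None
    regret = 0.0
    for t in range(n):
        if t < m * K:
            i = t % K          # exploration: arms in turn
        else:
            if committed is None:  # commit once, to the empirical winner
                committed = max(range(K), key=lambda j: totals[j] / m)
            i = committed
        totals[i] += rng.gauss(arm_means[i], 1.0)
        regret += best - arm_means[i]
    return regret
```

For example, `etc([0.0, -0.5], n=1000, m=50)` explores for 100 rounds and then commits; the exploration phase alone contributes m∆ = 25 to the pseudo-regret.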
History and Related Algorithms
• ε-greedy and friends: choose the winner in every round with probability 1 − ε, explore all arms uniformly at random with probability ε (origin lost in history);
• "Certainty equivalence with forcing": Robbins (1952);
• In more complex bandits:
  • "epoch-greedy": Langford and Zhang (2008);
  • "Forced Exploration": Abbasi-Yadkori et al. (2009); Abbasi-Yadkori (2009);
  • "Phased exploration and greedy exploitation" (PEGE): Rusmevichientong and Tsitsiklis (2010).
Notes
• Looks silly: explore for m steps, then exploit? Why should we care?
• Simplicity is great! Educational!
• ε-greedy and friends: choose the winner in every round with probability 1 − ε, explore all arms uniformly at random with probability ε.
• In n rounds, a particular arm i will be chosen on average about nε/K times. So m ≈ nε/K, or ε = mK/n.
• Will the additional randomness help ε-greedy (for the environments considered)?
• Does it make sense to intermix exploration and exploitation steps, rather than exploring first and then exploiting?
• How to choose m? (How to choose ε?)
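For comparison, the ε-greedy scheme discussed above can be sketched like this (again assuming unit-variance Gaussian rewards; the interface is mine):

```python
import random

def epsilon_greedy(arm_means, n, eps, seed=0):
    """epsilon-greedy: with probability eps pull a uniformly random arm,
    otherwise pull the arm with the highest empirical mean. Returns the
    realized pseudo-regret."""
    rng = random.Random(seed)
    K = len(arm_means)
    counts, totals = [0] * K, [0.0] * K
    best = max(arm_means)
    regret = 0.0
    for t in range(n):
        if t < K:                       # pull each arm once to initialize
            i = t
        elif rng.random() < eps:
            i = rng.randrange(K)        # explore
        else:                           # exploit the current winner
            i = max(range(K), key=lambda j: totals[j] / counts[j])
        counts[i] += 1
        totals[i] += rng.gauss(arm_means[i], 1.0)
        regret += best - arm_means[i]
    return regret
```

With ε = mK/n, each arm is explored about m times in expectation, matching the heuristic correspondence above.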
How Big Can the Regret Be?
Let a ∧ b = min(a, b) and (x)_+ = max(x, 0) for a, b, x ∈ R.

Basic regret decomposition identity:

R_n = ∑_{i=1}^K ∆_i E[T_i(n)].

Thus, it is enough to bound E[T_i(n)]:

E[T_i(n)] ≤ m ∧ ⌈n/K⌉ + (n − mK)_+ P(i = A_{mK+1})
         ≤ m ∧ ⌈n/K⌉ + (n − mK)_+ P(µ̂_i(mK) ≥ max_{j≠i} µ̂_j(mK)).
Bounding the Probability
We have

P(µ̂_i(mK) ≥ max_{j≠i} µ̂_j(mK)) ≤ P(µ̂_i(mK) ≥ µ̂_1(mK))
= P(µ̂_i(mK) − µ_i − (µ̂_1(mK) − µ_1) ≥ ∆_i).

Claim: µ̂_i(mK) − µ_i − (µ̂_1(mK) − µ_1) is √(2/m)-subgaussian.

Hence, by (2) we have

P(µ̂_i(mK) − µ_i − µ̂_1(mK) + µ_1 ≥ ∆_i) ≤ exp(−m∆_i²/4).
ETC Regret Upper Bound
Theorem (Instance-Dependent Bound)
After n rounds, the expected regret R_n of the ETC policy satisfies

R_n ≤ (m ∧ ⌈n/K⌉) ∑_{i=1}^K ∆_i + (n − mK)_+ ∑_{i=1}^K ∆_i exp(−m∆_i²/4). (5)

Instance-dependent: the bound depends on the ∆_i (properties of the bandit environment instance). AKA: gap-dependent, problem-dependent.

How to choose m?!
Optimal choice of m.
Take K = 2. WLOG ∆_1 = 0 and ∆ := ∆_2. Then

R_n ≤ m∆ + (n − 2m)_+ ∆ exp(−m∆²/4) ≤ m∆ + n∆ exp(−m∆²/4). (6)

Assume n is reasonably large. Then the optimal choice for m is

m*(n, ∆) = max{0, ⌈(4/∆²) log(n∆²/4)⌉} (7)

and we get

R_n ≤ ∆ + (4/∆)(1 + max{0, log(n∆²/4)}). (8)
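Formulas (7) and (8) are straightforward to evaluate numerically (a direct transcription into code; the function names are my own):

```python
import math

def m_star(n, delta):
    """Exploration length (7) for K = 2 arms and gap delta > 0."""
    return max(0, math.ceil((4 / delta**2) * math.log(n * delta**2 / 4)))

def etc_regret_bound(n, delta):
    """Regret bound (8) for ETC run with m = m_star(n, delta)."""
    return delta + (4 / delta) * (1 + max(0.0, math.log(n * delta**2 / 4)))
```

For instance, `m_star(1000, 0.5)` returns 67, so with n = 1000 and ∆ = 0.5 each arm should be explored 67 times; for small n∆² the formula returns 0, i.e., committing immediately is best.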
A Simple but Important Improvement
Regardless of the policy, R_n = ∆ E[T_2(n)] ≤ ∆n. Combining with (8), we get

R_n ≤ min{n∆, ∆ + (4/∆)(1 + log(n∆²/4))}. (9)

Corollary ((Infeasible) Worst-Case Bound)

Consider ETC with m = ⌈n/K⌉ ∧ m*(∆, n). Then there exists C > 0 such that for any ∆ > 0 and n > 0, R_n ≤ C√n.
Instance-Adaptive Algorithms
Is there a good, non-cheating choice for m (dependence on n is allowed, but not on ∆)?

Claim: the best such choice gives R_n = Θ(n^{2/3}).

Earlier we had R_n = O(n^{1/2}), which is much better! Is there some other algorithm that achieves this without knowing ∆? "Adaptation to ∆." (E.g., Auer and Ortner (2010) and Garivier et al. (2016).)

Notice that a worst-case bound reveals(!) whether adaptation to individual instances happens.
Exercise 6

Gaussian bandit: K = 2, σ² = 1, µ_1 = 0 and µ_2 = −∆. Set n = 1000, repeat the simulation N = 10⁴ times and report the average.

Plot, as a function of ∆ ∈ [0, 1]:

(a) the theoretical upper bound given in (9) (blue);

(b) the regret of the ETC algorithm with m set as suggested in (7) (green);

(c) the regret of the ETC algorithm with "the" optimal m (yellow; calculated numerically, using that the noise is exactly Gaussian).

What can we conclude?
[Plot: expected regret as a function of ∆ ∈ [0, 1] for (a)–(c).]
Exercise 7
Let R_n = ∑_{t=1}^n ∆_{A_t} (random; the "pseudo-regret").

(a) Fix ∆ = 1/10 and plot R_n as a function of m with n = 2000. See the upper plot on the right.

(b) Plot the variance V[R_n] as a function of m for the same bandit as above. See the lower plot on the right.

(c) Explain the curves and reconcile them with theory.

(d) Did it make sense to plot V[R_n]? Why or why not?
[Plots: expected regret (top) and variance of the regret (bottom) as functions of m.]
Questions?
Optimism Principle
Optimism in the Face of Uncertainty (OFU) Principle
One should choose one's actions as if the environment is as nice as plausibly possible.
Illustration
Visiting a new country.
Shall I try local cuisine/beer/. . . ?
Or stick to what I know?
Optimism: Yes.
Pessimism: No.
Optimism leads to exploration, pessimism prevents exploration.
Exploration is necessary: one can be unlucky with one's priors! Hence, optimism is good.

How much??
On What is Plausible
Recall: if X_1, X_2, . . . , X_n are independent and 1-subgaussian with mean µ and µ̂ = ∑_{t=1}^n X_t / n, then for any δ ∈ (0, 1],

P(µ̂ ≥ µ + √(2 log(1/δ)/n)) ≤ δ. (10)

Round t: how big can µ_i plausibly be? Data: µ̂_i(t − 1) (empirical mean), based on T_i(t − 1) observations. Define

UCB_i(t − 1, δ) = µ̂_i(t − 1) + √(2 log(1/δ)/T_i(t − 1)). (11)

Caveat: T_i(t − 1) is random, hence it is not clear whether P(µ_i ≥ UCB_i(t − 1, δ)) ≤ δ holds.
The UCB(δ) Algorithm
1: Input K and δ.
2: Choose each action once.
3: For rounds t > K choose action

A_t = argmax_i UCB_i(t − 1, δ).

Note
Although there are many versions of the UCB algorithm, we often do not distinguish them by name and hope the context makes it clear. For the rest of this tutorial we will usually call UCB(δ) just UCB.
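A minimal runnable sketch of UCB(δ) (assuming unit-variance Gaussian rewards; the function name and interface are mine, not from the slides):

```python
import math
import random

def ucb(arm_means, n, delta, seed=0):
    """UCB(delta): play each arm once, then pull the arm maximizing
    mu_hat_i + sqrt(2 log(1/delta) / T_i). Returns the realized
    pseudo-regret."""
    rng = random.Random(seed)
    K = len(arm_means)
    counts, totals = [0] * K, [0.0] * K
    best = max(arm_means)
    regret = 0.0
    width = 2 * math.log(1 / delta)    # numerator of the squared bonus
    for t in range(n):
        if t < K:
            i = t                       # initialization: each arm once
        else:
            i = max(range(K), key=lambda j: totals[j] / counts[j]
                    + math.sqrt(width / counts[j]))
        counts[i] += 1
        totals[i] += rng.gauss(arm_means[i], 1.0)
        regret += best - arm_means[i]
    return regret
```

For example, `ucb([0.0, -0.5], n=1000, delta=1/1000**2)` uses the tuning δ = 1/n² analyzed on the following slides.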
Notes
• The algorithm first chooses each arm once, which is necessary because the term inside the square root is undefined when T_i(t − 1) = 0.
• The value inside the argmax is called the index of arm i.
• An index algorithm chooses, in each round, the arm that maximizes some value (the index), which usually depends only on the current time step and the samples from that arm.
• In the case of UCB, the index is the sum of the empirical mean of the rewards experienced so far and the exploration bonus (also known as the confidence width).
• The algorithm will work so that the UCB indices are approximately the same all the time (why?).
Demo: http://downloads.tor-lattimore.com/bandits/
Instance-Dependent Bound for UCB
Theorem
Consider UCB as shown earlier on a stochastic K-armed 1-subgaussian bandit problem. For any horizon n, if δ = 1/n², then

R_n ≤ 3 ∑_{i=1}^K ∆_i + ∑_{i:∆_i>0} 16 log(n)/∆_i.
Proof: Main Ideas
Regret decomposition identity:

R_n = ∑_{i=1}^K ∆_i E[T_i(n)].

Take i so that ∆_i > 0; bound E[T_i(n)].

Key observation: after initialization, action i can only be chosen if its index is higher than that of an optimal arm.

This can only happen if at least one of the following is true:

(a) the index of action i is larger than the true mean of a specific optimal arm;

(b) the index of a specific optimal arm is smaller than its true mean.

Both of these have low probability. In particular, E[T_i(n)] ≤ c_1 + c_2 log(n)/∆_i². Q.e.d.
Proof: The Reward Consumption Model
For i ∈ [K], let (Z_{i,s})_s be an i.i.d. sequence with Z_{i,s} ∼ P_i.

Define X_t to be the T_{A_t}(t)-th element of the sequence (Z_{A_t,s})_s:

X_t = Z_{A_t, T_{A_t}(t)}. (12)

Is there any loss of generality?

No: the interaction protocol ⇔ a constraint on the distribution of (A_1, X_1, . . . , A_n, X_n). This constraint clearly holds in this case.

Benefit: defining

µ̂_{i,s} = (1/s) ∑_{u=1}^s Z_{i,u}, s ∈ [n]

(the usual sample means), we have

µ̂_i(t) = µ̂_{i,T_i(t)}.
Proof: Some Details II.
WLOG µ_1 = µ*. Fix i such that ∆_i > 0. Let G_i be the "good" event defined by

G_i = {µ_1 < min_{t∈[n]} UCB_1(t)} ∩ {µ̂_{i,u_i} + √((2/u_i) log(1/δ)) < µ_1},

where u_i ∈ [n] is a constant to be chosen later. Then:

1 If G_i occurs, then T_i(n) ≤ u_i.

2 The complement event G_i^c occurs with low probability (governed, in some way yet to be discovered, by u_i).

Because T_i(n) ≤ n no matter what, this will mean that

E[T_i(n)] = E[I{G_i} T_i(n)] + E[I{G_i^c} T_i(n)] ≤ u_i + P(G_i^c) n. (13)

For details, see our website.
Worst-case Bound
Theorem
If δ = 1/n², then the regret of UCB(δ) on any environment ν ∈ E_K^SG(1) is bounded by

R_n ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K ∆_i.
Notes
Recall: for all ν ∈ E_K^SG(1),

R_n(ν) ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K ∆_i.

• Same tuning as in the other result! Good!
• Not anytime. Hmm..
• The additive ∑_i ∆_i term is unavoidable: all reasonable algorithms must play each arm once (exercise: what if not??).
• The bound is close to optimal: no algorithm can enjoy regret smaller than const · √(nK) over all problems in E_K^SG(1).
• A more complicated variant of UCB(δ) shaves the logarithmic term from the upper bound given above.
Proof
The previous proof actually gives

E[T_i(n)] ≤ 3 + 16 log(n)/∆_i².

Using the basic regret decomposition and choosing some ∆ > 0,

R_n = ∑_{i=1}^K ∆_i E[T_i(n)] = ∑_{i:∆_i<∆} ∆_i E[T_i(n)] + ∑_{i:∆_i≥∆} ∆_i E[T_i(n)]
    ≤ n∆ + ∑_{i:∆_i≥∆} (3∆_i + 16 log(n)/∆_i)
    ≤ n∆ + 16K log(n)/∆ + 3 ∑_i ∆_i
    ≤ 8√(nK log(n)) + 3 ∑_{i=1}^K ∆_i,

where the first inequality follows because ∑_{i:∆_i<∆} T_i(n) ≤ n and the last line by choosing ∆ = √(16K log(n)/n). Q.e.d.
Demo/Exercise 8
Setup: n = 1000, K = 2, P_1 = N(0, 1), P_2 = N(−∆, 1). Plot estimates of the expected regret of UCB and of ETC with m ∈ {25, 50, 75, 100, Optimum} as ∆ ranges over [0, 1]. Your plot should resemble this:

[Plot: expected regret vs. ∆ for ETC (m = 25), ETC (m = 50), ETC (m = 75), ETC (m = 100), ETC (optimal m), and UCB.]

What can we conclude?
Exercise 9/Demo
Compare the regret histogram of UCB and ETC.
Demo
Literature
• Confidence bounds and the OFU principle are due to Lai and Robbins (1985) (context: parametric bandits, asymptotics).
• UCB algorithm: Katehakis and Robbins (1995) (Gaussian bandits) and Agrawal (1995). Still asymptotics. Agrawal (1995)'s analysis is modular: all that is needed is appropriate UCBs (the form is irrelevant).
• Independently, Kaelbling (1993) also discovered UCB. No regret analysis, nor clear advice on how to tune the confidence parameter.
• "This" UCB is most similar to UCB1 of Auer et al. (2002), except that n is t in UCB1. Finite-time regret bound, [0, 1]-bounded payoffs (i.e., 1/2-subgaussian).
• Worst-case bound: Bubeck and Cesa-Bianchi (2012), which focuses on the subgaussian setup.
Questions?
Asymptotically Optimal UCB (AO-UCB)
1: Input K.
2: Choose each arm once.
3: Subsequently choose

A_t = argmax_i ( µ̂_i(t − 1) + √(2 log f(t)/T_i(t − 1)) ),

where f(t) = 1 + t log²(t).
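The AO-UCB index is a one-liner (sketch; the function name and argument order are mine):

```python
import math

def ao_ucb_index(mean_hat, count, t):
    """AO-UCB index for one arm: empirical mean plus the exploration bonus
    sqrt(2 log f(t) / T_i(t-1)), with f(t) = 1 + t * log(t)**2."""
    f = 1.0 + t * math.log(t) ** 2
    return mean_hat + math.sqrt(2 * math.log(f) / count)
```

The bonus shrinks with the arm's pull count and grows only slowly with t, which is what drives the asymptotic optimality below.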
Asymptotic Regret, Asymptotic Optimality
Theorem (Upper Bound)
The regret of AO-UCB satisfies

lim sup_{n→∞} R_n(ν)/log(n) ≤ ∑_{i:∆_i>0} 2/∆_i. (14)

Theorem (Lower Bound)

For any policy π that has subpolynomial regret on all 1-subgaussian environments (i.e., R_n(ν, π) = o(n^p) for all p > 0 and all ν), and for any instance ν with gaps ∆ = ∆(ν),

lim inf_{n→∞} R_n(ν, π)/log(n) ≥ ∑_{i:∆_i>0} 2/∆_i. (15)
Finite-time Regret for AO-UCB
Corollary

There exists a universal constant C > 0 such that the regret of AO-UCB is bounded by

R_n ≤ C ∑_{i:∆_i>0} (∆_i + log(n)/∆_i),

and, in particular,

R_n ≤ C ∑_{i=1}^K ∆_i + 2√(CnK log(n)).
Zoo of UCBs
| Strategy | Conf. | Dist.-free regret | Asy. opt. | Inst. opt. | Anytime |
| --- | --- | --- | --- | --- | --- |
| UCB, Auer et al. (2002) | t | √(kn log(n)) | ✓ | ✓ | ✓ |
| UCB*, Lai (1987) | n/T | √(kn log(k)) | ✓ | ✓ | ✗ |
| UCB+, Garivier and Cappé (2011) | t/T | √(kn log(k)) | ✓ | ✓ | ✓ |
| MOSS, Audibert and Bubeck (2009) | n/(kT) | √(kn) | ✓ | ✗ | ✗ |
| Anytime MOSS, Degenne and Perchet (2016) | t/(kT) | √(kn) | ✓ | ✗ | ✓ |
| OCUCB, Lattimore (2015) | (n/t)^{1+ε} | √(kn) | ✗ | ✓ | ✗ |
| UCB†, Lattimore (2017) | see ref. | √(kn) | ✓ | ✓ | ✗ |
| UCB‡, Lattimore (2017) | see ref. | √(kn log log(n)) | ✓ | ✓ | ✓ |

All strategies choose A_t = t for 1 ≤ t ≤ k and subsequently

A_t = argmax_i [ µ̂_i(t − 1) + √(2 log̃(Conf.)/T_i(t − 1)) ], (16)

where the square-root term is the exploration bonus and log̃(x) ∼ log(x) is approximately logarithmic. In the table, T stands for T_i(t − 1).

Demo with UCB and OCUCB: http://downloads.tor-lattimore.com/bandits/
100 / 108
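A minimal sketch of the index rule (16), instantiated (my choices, not the slides') with Gaussian rewards, the classic confidence sequence Conf. = $t$ of Auer et al. (2002), and $\overline{\log} = \log$:

```python
import math
import random

def ucb(means, n, sigma=1.0, seed=0):
    """Run the index policy A_t = argmax_i mu_i(t-1) + sqrt(2 log(t) / T_i(t-1))
    on Gaussian arms with the given means; returns total reward and pull counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k            # T_i(t-1): number of pulls of arm i
    sums = [0.0] * k            # running reward sums, so mu_i = sums[i]/counts[i]
    total = 0.0
    for t in range(1, n + 1):
        if t <= k:              # play each arm once first (A_t = t for t <= k)
            a = t - 1
        else:                   # then pick the arm with the largest index (16)
            a = max(range(k), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = rng.gauss(means[a], sigma)
        counts[a] += 1
        sums[a] += r
        total += r
    return total, counts

total, counts = ucb([0.5, 0.0], n=2000)
# The optimal arm (index 0) should be pulled far more often than arm 1.
```

Swapping in the other rows of the table only changes the argument of $\overline{\log}$ in the exploration bonus.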
Dealing with Risk

Setup: $K = 2$, Gaussian noise, gap $\Delta$. If $\delta$ is the probability that some (reasonable) algorithm misses the optimal arm, then

$$R_n = O\!\left( n \Delta \delta + \frac{1}{\Delta} \log\frac{1}{\delta} \right).$$

Optimizing over $\delta$ gives $\delta = 1/(n\Delta^2)$ and

$$R_n = O\!\left( \frac{1}{\Delta}\left( 1 + \log(n\Delta^2) \right) \right).$$

How about $\mathbb{E}[R_n^2]$? We get $\mathbb{E}[R_n^2] \approx \delta (n\Delta)^2 = n$. Too big!! Choosing $\delta = (n\Delta)^{-2}$ instead gives

$$R_n = O\!\left( \frac{1}{\Delta}\left( \frac{1}{n} + \log(n^2\Delta^2) \right) \right)$$

and $\mathbb{E}[R_n^2] \approx \log^2(n)$!! For further info, see Audibert et al. (2007).
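The trade-off above is simple arithmetic; a hypothetical numeric check (the instance $n = 10{,}000$, $\Delta = 0.1$ is my choice, and all constants hidden by the $O(\cdot)$ are dropped):

```python
import math

n, gap = 10_000, 0.1

def regret_bound(delta):
    # R_n = O( n*gap*delta + (1/gap) * log(1/delta) ), constants dropped.
    return n * gap * delta + (1.0 / gap) * math.log(1.0 / delta)

def second_moment(delta):
    # E[R_n^2] is dominated by delta * (n*gap)^2 when the optimal arm is missed.
    return delta * (n * gap) ** 2

d1 = 1.0 / (n * gap ** 2)     # minimizes the expected-regret bound
d2 = 1.0 / (n * gap) ** 2     # keeps the second moment small
print(regret_bound(d1), second_moment(d1))  # smaller regret, but E[R^2] ~ n
print(regret_bound(d2), second_moment(d2))  # slightly larger regret, E[R^2] ~ 1
```

The more conservative choice $\delta = (n\Delta)^{-2}$ pays only a constant-factor increase in the logarithmic term while shrinking the second moment from order $n$ to order one.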
Summary

• Bandits: the Drosophila of "explore-exploit" problems.
• The learner interacts with its environment, which is initially unknown.
• We care about the total reward/regret.
  • Exploration is necessary to avoid a large loss due to a (plausibly) unlucky start.
  • Exploitation is necessary to avoid a large loss due to being curious all the time.
  • What is the right amount of each?
• Stochastic, finite-armed bandits:
  • Explore-then-commit: ideal (but infeasible) tuning shows what can be achieved.
  • Optimism does it: UCB.
  • A carefully tuned UCB can satisfy many optimality criteria simultaneously.
• We left out lower-bound proofs and many other topics.
Questions?
References I
Abbasi-Yadkori, Y. (2009). Forced-exploration based algorithms for playing in bandits with large action sets. Master's thesis, University of Alberta, Department of Computing Science.

Abbasi-Yadkori, Y., Antos, A., and Szepesvari, C. (2009). Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback.

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of Conference on Learning Theory (COLT), pages 217–226.

Audibert, J.-Y., Munos, R., and Szepesvari, C. (2007). Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory (ALT), pages 150–165. Springer.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE.

Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.
References II
Bather, J. and Chernoff, H. (1967). Sequential decisions in the control of a spaceship. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 181–207.

Berry, D. and Fristedt, B. (1985). Bandit problems: Sequential allocation of experiments. Chapman and Hall, London.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning. Now Publishers.

Chernoff, H. (1959). Sequential design of experiments. The Annals of Mathematical Statistics, 30(3):755–770.

Degenne, R. and Perchet, V. (2016). Anytime optimal algorithms in stochastic multi-armed bandits. In Proceedings of International Conference on Machine Learning (ICML).

Dudley, R. M. (2014). Uniform central limit theorems, volume 142. Cambridge University Press.

Garivier, A. and Cappe, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of Conference on Learning Theory (COLT).

Garivier, A., Kaufmann, E., and Lattimore, T. (2016). On explore-then-commit strategies. In Advances in Neural Information Processing Systems (NIPS).

Gittins, J. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological), 41(2):148–177.
References III
Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed bandit allocation indices. John Wiley & Sons.

Kaelbling, L. P. (1993). Learning in embedded systems. MIT Press.

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.

Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, pages 1091–1114.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, pages 817–824.

Lattimore, T. (2015). Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint arXiv:1507.07880.

Lattimore, T. (2017). Auto-tuning the confidence level for optimistic bandit strategies. Technical report.

McDiarmid, C. (1998). Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, pages 195–248. Springer.

Pena, V. H., Lai, T. L., and Shao, Q.-M. (2008). Self-normalized processes: Limit theory and statistical applications. Springer Science & Business Media.
References IV
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.

Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.

Tropp, J. A. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230.

van de Geer, S. (2000). Empirical Processes in M-estimation, volume 6. Cambridge University Press.