Bandit Algorithms
Tor Lattimore & Csaba Szepesvári


Page 1:

Bandit Algorithms
Tor Lattimore & Csaba Szepesvári

Page 2:

Outline

1 From Contextual to Linear Bandits
2 Stochastic Linear Bandits
3 Confidence Bounds for Least-Squares Estimators
4 Improved Regret for Fixed, Finite Action Sets

Page 3:

Outline

5 Sparse Stochastic Linear Bandits
6 Minimax Regret
7 Asymptopia
8 Summary

Page 4:

Outline

1 From Contextual to Linear Bandits
  On the Choice of Features
2 Stochastic Linear Bandits
  Setting
  Optimism and LinUCB
  Generic Regret Analysis
  Miscellaneous Remarks
3 Confidence Bounds for Least-Squares Estimators
  Least Squares: Recap
  Fixed Design
  Sequential Design
  Completing the Regret Bound
4 Improved Regret for Fixed, Finite Action Sets
  Challenge!
  G-Optimal Designs
  The PEGOE Algorithm

Page 5:

Stochastic Contextual Bandits

Set of contexts C, set of actions [K], distributions (P_{c,a}).

Interaction: for rounds t = 1, 2, 3, . . . :
1 Context C_t ∈ C is revealed to the learner.
2 Based on its past observations (including C_t), the learner chooses an action A_t ∈ [K]. The chosen action is sent to the environment.
3 The environment sends the reward X_t ∼ P_{C_t, A_t} to the learner.

Page 6:

Regret Definition

Definition: expected reward for action a under context c:

r(c, a) = ∫ x P_{c,a}(dx) .

Regret:

R_n = E[ Σ_{t=1}^n max_{a∈[K]} r(C_t, a) − Σ_{t=1}^n X_t ] .

Page 7:

Poor Man’s Contextual Bandit Algorithm

Assumption: C is finite.
Idea: Assign a bandit to each context.
Worst-case regret: R_n = Θ(√(nMK)), where M = |C|.
Problem: M (and K) can be very large.
How to save this? Assume structure.
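A sketch of the per-context idea, pairing an independent UCB1 instance with each observed context; UCB1 is not spelled out on the slides, so the constants below are the usual textbook ones. It plugs into the interaction loop sketched earlier.

```python
import math
from collections import defaultdict

class UCB1:
    def __init__(self, k):
        self.counts = [0] * k
        self.means = [0.0] * k
        self.t = 0

    def choose(self):
        self.t += 1
        for a, n in enumerate(self.counts):
            if n == 0:
                return a  # play every arm once first
        ucb = [m + math.sqrt(2 * math.log(self.t) / n)
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, a, x):
        self.counts[a] += 1
        self.means[a] += (x - self.means[a]) / self.counts[a]

class PerContextBandit:
    # one independent bandit per context: regret Theta(sqrt(n*M*K))
    def __init__(self, k):
        self.bandits = defaultdict(lambda: UCB1(k))

    def choose(self, c):
        return self.bandits[c].choose()

    def update(self, c, a, x):
        self.bandits[c].update(a, x)
```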

Page 8:

Linear Models

Assumption:

r(c, a) = 〈ψ(c, a), θ∗〉 ,   ∀(c, a) ∈ C × [K] ,

where ψ : C × [K] → R^d and θ∗ ∈ R^d.
• ψ: feature map;
• H_ψ := span(ψ(c, k) : c ∈ C, k ∈ [K]) ⊂ R^d: feature space;
• S_ψ := { r_θ : C × [K] → R, θ ∈ R^d, where r_θ(c, k) = 〈ψ(c, k), θ〉 }: space of linear reward functions.

Note: RKHSs let us deal with d = ∞ (or d very large).

Page 9:

Outline

Page 10:

Choosing ψ ⇒ betting on smoothness of r

Let ‖·‖ be a norm on R^d and ‖·‖∗ its dual (e.g., both 2-norms).
Hölder’s inequality:

|r(c, a) − r(c′, a′)| ≤ ‖θ∗‖ ‖ψ(c, a) − ψ(c′, a′)‖∗ .

r = 〈ψ, θ∗〉 ⇒ r is (ρ, ‖θ∗‖)-smooth:

|r(z) − r(z′)| ≤ ‖θ∗‖ ρ(z, z′) ,

where z = (c, a), z′ = (c′, a′), and ρ(z, z′) = ‖ψ(z) − ψ(z′)‖∗.

Choice of ψ + preference for small ‖θ∗‖ ⇔ preference for ρ-smooth r.
The influence of ψ is largest when d = +∞.

Page 11:

Sparsity

Concern: Is r ∈ S_ψ really true?
Thinking of free lunches: can’t we just add a lot of features to make sure r ∈ S_ψ?

• Feature i unused: θ∗,i = 0;
• Small ‖θ∗‖₀ = Σ_{i=1}^d I{θ∗,i ≠ 0};
• the ‘0-norm’.

Page 12:

Outline

Page 13:

Features Are All You Need

Given C_1, A_1, . . . , C_t, A_t, the reward X_t in round t satisfies

E[X_t | C_1, A_1, . . . , C_t, A_t] = 〈ψ(C_t, A_t), θ∗〉 ,

for some known ψ and unknown θ∗.

⇔

At the beginning of round t, observe an action set 𝒜_t ⊂ R^d. If A_t ∈ 𝒜_t,

E[X_t | 𝒜_1, A_1, . . . , 𝒜_t, A_t] = 〈A_t, θ∗〉 ,

with some unknown θ∗.

Why? Let 𝒜_t = { ψ(C_t, a) : a ∈ [K] }.

Page 14:

Stochastic Linear Bandits

1 In round t, observe the action set 𝒜_t ⊂ R^d.
2 The learner chooses A_t ∈ 𝒜_t and receives X_t, satisfying

E[X_t | 𝒜_1, A_1, . . . , 𝒜_t, A_t] = 〈A_t, θ∗〉 ,

with some unknown θ∗.

Goal: keep the regret

R_n = E[ Σ_{t=1}^n ( max_{a∈𝒜_t} 〈a, θ∗〉 − X_t ) ]

small.

Additional assumptions: (i) 𝒜_1, . . . , 𝒜_n is any fixed sequence; (ii) X_t − 〈A_t, θ∗〉 is light-tailed, given 𝒜_1, A_1, . . . , 𝒜_t, A_t.

Page 15:

Finite-armed bandits

Case (a): 𝒜_t always has the same number of vectors in it: “finite-armed stochastic contextual bandit”.
Case (b): In addition, 𝒜_t does not change: 𝒜_t = {a_1, . . . , a_K}: “finite-armed stochastic linear bandit”.
Case (c): If the vectors in 𝒜_t are also orthogonal to each other: “finite-armed stochastic bandit”.

Difference between cases (c) and (b):
• Case (c): learn about the mean of arm i ⇔ choose action i;
• Case (b): learn about the mean of arm i ⇔ choose any action j with 〈a_j, a_i〉 ≠ 0.

Page 16:

Outline

Page 17:

Once an Optimist, Always an Optimist

Optimism in the Face of Uncertainty Principle: “Choose the best action in the best environment amongst the plausible ones.”

Environment ⇔ θ ∈ R^d.
Plausible environments: C_t ⊂ R^d s.t. P(θ∗ ∉ C_t) ∼ 1/t.
Best environment: θ̃_t = argmax_{θ∈C_t} max_{a∈𝒜} 〈a, θ〉.
Best action: argmax_{a∈𝒜} 〈a, θ̃_t〉.

Page 18:

Choosing the Confidence Set

Say the reward in round t is X_t and the action is A_t ∈ R^d:

X_t = 〈A_t, θ∗〉 + η_t ,   η_t is noise.

Regularized least-squares estimator:

θ̂_t = V_t^{-1} Σ_{s=1}^t A_s X_s ,   V_0 = λI ,   V_t = V_0 + Σ_{s=1}^t A_s A_s^⊤ .

Choice of C_t:

C_t ⊂ E_t := { θ ∈ R^d : ‖θ − θ̂_{t−1}‖²_{V_{t−1}} ≤ β_t } .

Here (β_t)_t is nondecreasing with β_t ≥ 1, and for A positive definite, ‖x‖²_A = x^⊤Ax.

Page 19:

LinUCB

Choose C_t = E_t with suitable (β_t)_t and let

A_t = argmax_{a∈𝒜} max_{θ∈C_t} 〈a, θ〉 .

Then

A_t = argmax_a 〈a, θ̂_{t−1}〉 + √β_t ‖a‖_{V_{t−1}^{-1}} .

LinUCB (a.k.a. LinRel, OFUL, ConfEllips, . . . )
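A sketch of LinUCB as just defined: the ridge estimate plus the ellipsoidal bonus √β_t ‖a‖_{V_{t−1}^{-1}}. The default β schedule below is a placeholder of ours; the sequential-design section later derives a principled choice.

```python
import numpy as np

class LinUCB:
    def __init__(self, d, lam=1.0, beta=None):
        self.V = lam * np.eye(d)                 # V_0 = lambda * I
        self.b = np.zeros(d)                     # running sum of A_s X_s
        self.beta = beta or (lambda t: 2.0 * np.log(1.0 + t))  # placeholder
        self.t = 0

    def choose(self, actions):                   # actions: (K, d) array
        self.t += 1
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b               # ridge estimate
        widths = np.sqrt(np.einsum("kd,de,ke->k", actions, V_inv, actions))
        ucb = actions @ theta_hat + np.sqrt(self.beta(self.t)) * widths
        return int(np.argmax(ucb))

    def update(self, a, x):                      # chosen feature row a, reward x
        self.V += np.outer(a, a)
        self.b += x * a
```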

Page 20:

Outline

Page 21:

Regret Analysis

Assumptions:
1 Bounded scalar mean reward: |〈a, θ∗〉| ≤ 1 for any a ∈ ∪_t 𝒜_t.
2 Bounded actions: for any a ∈ ∪_t 𝒜_t, ‖a‖₂ ≤ L.
3 Honest confidence intervals: there exists δ ∈ (0, 1) such that with probability 1 − δ, for all t ∈ [n], θ∗ ∈ C_t, where C_t satisfies C_t ⊂ E_t with (β_t)_t ≥ 1 nondecreasing, as on the previous slide.

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then with probability 1 − δ the regret of LinUCB satisfies

R_n^pseudo ≤ √( 8nβ_n log( det V_n / det V_0 ) ) ≤ √( 8dnβ_n log( (trace(V_0) + nL²) / (d det^{1/d}(V_0)) ) ) .

Page 22:

Proof

Assume θ∗ ∈ C_t for all t ∈ [n]. Let A_t^∗ := argmax_{a∈𝒜_t} 〈a, θ∗〉, r_t = 〈A_t^∗ − A_t, θ∗〉, and let θ̃_t ∈ C_t be such that 〈A_t, θ̃_t〉 = UCB_t(A_t).
From θ∗ ∈ C_t and the definition of LinUCB,

〈A_t^∗, θ∗〉 ≤ UCB_t(A_t^∗) ≤ UCB_t(A_t) = 〈A_t, θ̃_t〉 .

Then

r_t ≤ 〈A_t, θ̃_t − θ∗〉 ≤ ‖A_t‖_{V_{t−1}^{-1}} ‖θ̃_t − θ∗‖_{V_{t−1}} ≤ 2 √β_t ‖A_t‖_{V_{t−1}^{-1}} .

From 〈a, θ∗〉 ≤ 1, r_t ≤ 2. This combined with β_n ≥ max{1, β_t} gives

r_t ≤ 2 ∧ 2√β_t ‖A_t‖_{V_{t−1}^{-1}} ≤ 2√β_n ( 1 ∧ ‖A_t‖_{V_{t−1}^{-1}} ) .

Jensen’s inequality shows that

R_n^pseudo = Σ_{t=1}^n r_t ≤ √( n Σ_{t=1}^n r_t² ) ≤ 2 √( nβ_n Σ_{t=1}^n ( 1 ∧ ‖A_t‖²_{V_{t−1}^{-1}} ) ) .

Page 23:

Lemma

Let x_1, . . . , x_n ∈ R^d, V_t = V_0 + Σ_{s=1}^t x_s x_s^⊤ for t ∈ [n], v_0 = trace(V_0), and L ≥ max_t ‖x_t‖₂. Then

Σ_{t=1}^n ( 1 ∧ ‖x_t‖²_{V_{t−1}^{-1}} ) ≤ 2 log( det V_n / det V_0 ) ≤ 2d log( (v_0 + nL²) / (d det^{1/d}(V_0)) ) .

Page 24:

Outline

Page 25:

LinUCB and Finite-Armed Bandits

Recall that if 𝒜_t = {e_1, . . . , e_d}, we get back finite-armed bandits.
LinUCB:

A_t = argmax_a 〈a, θ̂_{t−1}〉 + √β_t ‖a‖_{V_{t−1}^{-1}} .

If we set λ = 0, then 〈e_i, θ̂_{t−1}〉 = μ̂_{i,t−1}, the empirical mean of arm i, and V_{t−1} = diag(T_1(t−1), . . . , T_d(t−1)). Hence

√β_t ‖e_i‖_{V_{t−1}^{-1}} = √( β_t / T_i(t−1) ) ,

and

A_t = argmax_i μ̂_{i,t−1} + √( β_t / T_i(t−1) ) .

We recover UCB when β_t = 2 log(·).

Page 26:

History

• Abe and Long (1999) introduced stochastic linear bandits into the machine learning literature.
• Auer (2002) was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: |𝒜_t| < +∞.
• Confidence ellipsoids: Dani et al. (2008) (ConfidenceBall₂), Rusmevichientong and Tsitsiklis (2010) (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. (2011) (OFUL).
• The name LinUCB comes from Chu et al. (2011).
• Alternative routes:
  • Explore-then-commit for action sets with smooth boundary: Abbasi-Yadkori et al. (2009); Abbasi-Yadkori (2009); Rusmevichientong and Tsitsiklis (2010).
  • Phased elimination.
  • Thompson sampling.

Page 27:

Extensions of Linear Bandits

• Generalized linear model (Filippi et al., 2009):

X_t = g^{-1}( 〈A_t, θ∗〉 + η ) ,   (1)

where g : R → R is called the link function. Common choice: g(p) = log(p/(1 − p)), with g^{-1}(x) = 1/(1 + exp(−x)) (sigmoid).
• Spectral bandits: Spectral Eliminator (Valko et al., 2014).
• Kernelised UCB: Valko et al. (2013).
• (Nonlinear) structured bandits: r : C × [K] → [0, 1] belongs to some known set (Anantharam et al., 1987; Russo and Van Roy, 2013; Lattimore and Munos, 2014).

Page 28:

Outline

Page 29:

Setting

1 Subgaussian rewards: the reward is X_t = 〈A_t, θ∗〉 + η_t, where η_t is conditionally 1-subgaussian (η_t | F_{t−1} ∼ subG(1)):

E[exp(λη_t) | F_{t−1}] ≤ exp(λ²/2) almost surely for all λ ∈ R ,

where F_t = σ(A_1, η_1, . . . , A_{t−1}, η_{t−1}, A_t).
2 Bounded parameter vector: ‖θ∗‖₂ ≤ S with S > 0 known.

Page 30:

Least Squares: Recap

Linear model:

X_t = 〈A_t, θ∗〉 + η_t .

Regularized squared loss:

L_t(θ) = Σ_{s=1}^t ( X_s − 〈A_s, θ〉 )² + λ ‖θ‖₂² .

Least-squares estimate θ̂_t = argmin_θ L_t(θ):

θ̂_t = V_t(λ)^{-1} Σ_{s=1}^t X_s A_s   with   V_t(λ) = λI + Σ_{s=1}^t A_s A_s^⊤ .   (2)

Abbreviation: V_t = V_t(0).

Main difficulty: the actions (A_s)_{s<t} are neither fixed nor independent, but are intricately correlated via the rewards (X_s)_{s<t}.

Page 31:

Outline

Page 32:

Fixed Design

1 Nonsingular Gram matrix: λ = 0 and V_t is invertible.
2 Independent subgaussian noise: (η_s)_s are independent and 1-subgaussian.
3 Fixed design: A_1, . . . , A_t are deterministically chosen without knowledge of X_1, . . . , X_t.

Notation: A_1, . . . , A_t replaced by a_1, . . . , a_t, so

V_t = Σ_{s=1}^t a_s a_s^⊤   and   θ̂_t = V_t^{-1} Σ_{s=1}^t a_s X_s .

Note: (a_s)_{s=1}^t must span R^d for V_t^{-1} to exist ⇒ t ≥ d.

Page 33:

First Steps

Fix x ∈ R^d. From X_s = a_s^⊤ θ∗ + η_s,

θ̂_t = V_t^{-1} Σ_{s=1}^t a_s X_s = V_t^{-1} V_t θ∗ + V_t^{-1} Σ_{s=1}^t a_s η_s ,

so

〈x, θ̂_t − θ∗〉 = Σ_{s=1}^t 〈x, V_t^{-1} a_s〉 η_s .

Since (η_s)_s are independent and 1-subgaussian: with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 Σ_{s=1}^t 〈x, V_t^{-1} a_s〉² log(1/δ) ) = √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .

Page 34:

Bounding ‖θ̂_t − θ∗‖_{V_t}

For x ∈ R^d fixed, with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .   (*)

We have ‖u‖²_{V_t} = (V_t u)^⊤ u.
Idea 1: apply (*) to X = V_t(θ̂_t − θ∗).
Problem: (*) holds for non-random x only!
Idea 2: take a finite C_ε s.t. X ≈_ε x for some x ∈ C_ε. Make (*) hold for each x ∈ C_ε and combine.
Refinement: let X = V_t^{1/2}(θ̂_t − θ∗)/‖θ̂_t − θ∗‖_{V_t}. Then X ∈ S^{d−1} = { x ∈ R^d : ‖x‖₂ = 1 }, and we can choose a finite C_ε ⊂ S^{d−1} s.t. for any u ∈ S^{d−1} there is x ∈ C_ε with ‖u − x‖ ≤ ε.

Page 35:

Bounding ‖θ̂_t − θ∗‖_{V_t}: Part II

Let

X = V_t^{1/2}(θ̂_t − θ∗) / ‖θ̂_t − θ∗‖_{V_t} ,   X∗ = argmin_{x∈C_ε} ‖x − X‖ .

Then, with probability 1 − δ,

‖θ̂_t − θ∗‖_{V_t} = 〈X, V_t^{1/2}(θ̂_t − θ∗)〉
  = 〈X − X∗, V_t^{1/2}(θ̂_t − θ∗)〉 + 〈V_t^{1/2} X∗, θ̂_t − θ∗〉
  ≤ ε ‖θ̂_t − θ∗‖_{V_t} + √( 2 ‖V_t^{1/2} X∗‖²_{V_t^{-1}} log(|C_ε|/δ) ) ,

or

‖θ̂_t − θ∗‖_{V_t} ≤ (1/(1 − ε)) √( 2 log(|C_ε|/δ) ) .

We can choose C_ε so that |C_ε| ≤ (5/ε)^d. Taking ε = 1/2, so that |C_ε| ≤ 10^d:

‖θ̂_t − θ∗‖_{V_t} < 2 √( 2 ( d log(10) + log(1/δ) ) ) .

Page 36:

Outline

Page 37:

Bounding ‖θ̂_t − θ∗‖_{V_t}: Sequential Design

X_s = 〈A_s, θ∗〉 + η_s, with η_s | F_{s−1} ∼ subG(1) for F_s = σ(A_1, η_1, . . . , A_s, η_s).

• The previous bound exploited that A_1, . . . , A_t are fixed and non-random: a fixed design.
• When A_1, . . . , A_t are i.i.d., we have a random design.
• Bandits: A_s is chosen based on A_1, X_1, . . . , A_{s−1}, X_{s−1}: a sequential design!

How do we bound ‖θ̂_t − θ∗‖_{V_t} in this case?

Page 38:

Bounding ‖θ̂_t − θ∗‖_{V_t}: Sequential Design

• Linearization trick
• Vector Chernoff
• Laplace method

Page 39:

A Start: Linearization & Vector Chernoff

Let S_t = Σ_{s=1}^t η_s A_s. “Linearization” of the quadratic:

(1/2) ‖θ̂_t − θ∗‖²_{V_t} = (1/2) ‖S_t‖²_{V_t^{-1}} = max_{x∈R^d} 〈x, S_t〉 − (1/2) ‖x‖²_{V_t} .

Let

M_t(x) = exp( 〈x, S_t〉 − (1/2) ‖x‖²_{V_t} ) .

One can show that E[M_t(x)] ≤ 1 for any x ∈ R^d. Chernoff’s method:

P( (1/2) ‖θ̂_t − θ∗‖²_{V_t} ≥ u ) = P( exp( max_x log M_t(x) ) ≥ exp(u) )
  ≤ E[ exp( max_x log M_t(x) ) ] exp(−u) = E[ max_x M_t(x) ] exp(−u) .

Can we control E[max_x M_t(x)]?

Page 40:

Controlling E[max_x M_t(x)]: Covering Argument

Recall:

M_t(x) = exp( 〈x, S_t〉 − (1/2) ‖x‖²_{V_t} ) .

Let C_ε ⊂ R^d be finite, to be chosen later, and

X = argmax_{x∈R^d} M_t(x) ,   Y = argmin_{y∈C_ε} ‖X − y‖ .

Then

max_{x∈R^d} M_t(x) = M_t(X) = M_t(X) − M_t(Y) + M_t(Y) ≤ ε + Σ_{y∈C_ε} M_t(y) .

Challenge: ensure M_t(X) − M_t(Y) ≤ ε!

Page 41:

Laplace: One Step Back, Two Forward

The need to control E[max_x M_t(x)] comes from the identity exp( (1/2) ‖θ̂_t − θ∗‖²_{V_t} ) = max_x M_t(x).
Laplace: the integral of exp(s f(x)) is dominated by exp(s max_x f(x)):

∫_a^b e^{s f(x)} dx ∼ e^{s f(x₀)} √( 2π / (s |f″(x₀)|) ) ,   x₀ = argmax_{x∈[a,b]} f(x) ∈ (a, b) .

Idea: replace max_x M_t(x) with ∫ M_t(x) h(x) dx for an appropriate h:
1 ∫ M_t(x) h(x) dx ≈ max_x M_t(x) (in a way);
2 E[ ∫ M_t(x) h(x) dx ] = ∫ E[M_t(x)] h(x) dx ≤ 1.

Choose h(x) as the density of N(0, H^{-1}) for H ≻ 0.

Page 42:

Step 2: Finishing

∫ M_t(x) h(x) dx = ( det(H) / det(H + V_t) )^{1/2} exp( (1/2) ‖S_t‖²_{(H+V_t)^{-1}} ) .

Choose H = λI. Then, with probability 1 − e^{−u},

(1/2) ‖S_t‖²_{V_t(λ)^{-1}} < u + (1/2) log( det(V_t(λ)) / λ^d )   (**)

and from θ̂_t − θ∗ = V_t(λ)^{-1} S_t − λ V_t(λ)^{-1} θ∗,

‖θ̂_t − θ∗‖_{V_t(λ)} ≤ ‖V_t(λ)^{-1} S_t‖_{V_t(λ)} + λ ‖V_t(λ)^{-1} θ∗‖_{V_t(λ)}
  ≤ ‖S_t‖_{V_t(λ)^{-1}} + λ^{1/2} ‖θ∗‖ .

Page 43:

Confidence Ellipsoid for Sequential Design

Assumptions: ‖θ∗‖ ≤ S, and (A_s)_s, (η_s)_s are such that for any 1 ≤ s ≤ t, η_s | F_{s−1} ∼ subG(1), where F_s = σ(A_1, η_1, . . . , A_{s−1}, η_{s−1}, A_s).

Fix δ ∈ (0, 1). Let

β_{t+1} = √λ S + √( 2 log(1/δ) + log( det V_t(λ) / λ^d ) )

and

C_{t+1} = { θ ∈ R^d : ‖θ̂_t − θ‖_{V_t(λ)} ≤ β_{t+1} } .

Theorem
C_{t+1} is a confidence set for θ∗ at level 1 − δ:

P(θ∗ ∈ C_{t+1}) ≥ 1 − δ .

Note: β_{t+1} is a function of (A_s)_{s≤t}.
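A sketch of the radius β_{t+1} and the resulting membership test (the helper names are ours):

```python
import numpy as np

def beta_radius(V_lam, lam, S, delta):
    # sqrt(lam)*S + sqrt(2 log(1/delta) + log(det V_t(lam) / lam^d))
    d = V_lam.shape[0]
    logdet = np.linalg.slogdet(V_lam)[1]
    return np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta)
                                      + logdet - d * np.log(lam))

def in_confidence_set(theta, theta_hat, V_lam, radius):
    diff = theta_hat - theta
    return diff @ V_lam @ diff <= radius**2  # ||theta_hat - theta||_{V(lam)} <= beta
```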

Page 44:

Freedman’s Stopping Trick

We want θ∗ ∈ C_t to hold with probability 1 − δ, simultaneously for all 1 ≤ t ≤ n.
Can we avoid the union bound over time?

Page 45:

Freedman’s Stopping Trick: II

Let

E_t = { ‖θ̂_{t−1} − θ∗‖_{V_{t−1}(λ)} ≥ √λ S + √( 2u + log( det V_{t−1}(λ) / λ^d ) ) } ,   t ∈ [n] .

Define τ ∈ [n] as the smallest round index t ∈ [n] such that E_t holds, or n when none of E_1, . . . , E_n hold.
Note: because θ̂_{t−1} and V_{t−1}(λ) are functions of H_{t−1} = (A_1, η_1, . . . , A_{t−1}, η_{t−1}), whether E_t holds can be decided based on H_{t−1}.
⇒ τ is an (H_t)_t stopping time ⇒ E[M_τ(x)] ≤ 1, and also ∫ E[M_τ(x)] h(x) dx ≤ 1, and thus P(∪_{t∈[n]} E_t) ≤ P(E_τ) ≤ e^{−u}. Finally, let n → ∞.

Corollary
P(∃ t ≥ 0 such that θ∗ ∉ C_{t+1}) ≤ δ.

Page 46:

Historical Remarks

• The presentation mostly follows Abbasi-Yadkori et al. (2011).
• Auer (2002); Chu et al. (2011) avoided the need to construct ellipsoidal confidence sets.
• Previous ellipsoidal constructions by Dani et al. (2008) and Rusmevichientong and Tsitsiklis (2010) used covering arguments.
• The improvement from using Laplace’s method over the earlier covering-based ellipsoidal constructions is substantial.
• Laplace’s method is also called the “method of mixtures” (Peña et al., 2008); its use goes back to the work of Robbins and Siegmund in the 1970s (Robbins and Siegmund, 1970, 1971).
• Freedman’s stopping trick is due to Freedman (1975).

Page 47:

Outline

Page 48:

Regret for LinUCB: Final Steps

Previously we saw that for β_t ≥ 1 nondecreasing, using LinUCB with V_0 = λI, w.p. 1 − δ,

R_n^pseudo ≤ √( 8nβ_n log( det V_n / det V_0 ) ) ≤ √( 8dnβ_n log( (λd + nL²) / (λd) ) ) .

Now,

β_n = √λ S + √( 2 log(1/δ) + log( det V_{n−1}(λ) / λ^d ) )
  ≤ √λ S + √( 2 log(1/δ) + d log( (λd + nL²) / (λd) ) ) ,

from ‖A_s‖ ≤ L and log det V ≤ d log(trace(V)/d).

⇒ R_n^pseudo ≤ C₁ d √n log(n) + C₂ √( nd log(1/δ) ) + C₃ .

Page 49:

Summary

R_n^pseudo ≤ C₁ d √n log(n) + C₂ √( nd log(1/δ) ) + C₃ .

• Optimism, confidence ellipsoid.
• Getting the ellipsoid is tricky because the bandit algorithm makes (A_s)_s and (η_s)_s interdependent: a “sequential design”.
• Hsu et al. (2012): random design, fixed design.
• Kernelization: one can directly kernelize the proof presented here; see Abbasi-Yadkori (2012). Gaussian process bandits effectively do the same (Srinivas et al., 2010).
• This presentation followed mostly Abbasi-Yadkori et al. (2011); Abbasi-Yadkori (2012).

Page 50:

Outline

Page 51:

Challenge!

The previous bound is O(d√n) even for 𝒜 = {e_1, . . . , e_d}, i.e., finite-armed stochastic bandits, where UCB already achieves O(√(dn log n)).

Can we do better?

Page 52:

Setting

1 Fixed finite action set: the set of actions available in round t is 𝒜 ⊂ R^d with |𝒜| = K for some natural number K.
2 Subgaussian rewards: the reward is X_t = 〈θ∗, A_t〉 + η_t, where η_t is conditionally 1-subgaussian: η_t | F_{t−1} ∼ subG(1), where F_t = σ(A_1, η_1, . . . , A_{t−1}, η_{t−1}, A_t).
3 Bounded mean rewards: Δ_a = max_{b∈𝒜} 〈θ∗, b − a〉 ≤ 1 for all a ∈ 𝒜.

Key difference to the previous setting: finite, fixed action set.

Page 53:

Avoiding Sequential Designs

Recall the result for fixed design: for x ∈ R^d fixed, with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .   (*)

Goal: use this result! How?
Idea: use a phased elimination algorithm!

• 𝒜 = 𝒜₁ ⊃ 𝒜₂ ⊃ 𝒜₃ ⊃ . . .
• In phase ℓ, use actions in 𝒜_ℓ to collect enough data so that, by the end of the phase, the data collected in the phase suffices to rule out all ε_ℓ := 2^{−ℓ}-suboptimal actions.

Which actions to use, and how many times, in phase ℓ?

Page 54:

How to Collect Data?

Recall: if V_t = Σ_{s=1}^t a_s a_s^⊤, then for any x ∈ R^d, with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .   (*)

If we need to know whether 〈x, θ̂_t − θ∗〉 ≤ 2^{−ℓ} for every x ∈ 𝒜, we had better choose a_1, a_2, . . . , a_t so that

max_{x∈𝒜} ‖x‖²_{V_t^{-1}}

is minimized (and make t big enough).
⇒ Experimental design.

Minimizing max_{x∈𝒜} ‖x‖²_{V_t^{-1}} is known as the G-optimal design problem.

Page 55:

Outline

Page 56:

G-optimal Design

Let π : 𝒜 → [0, 1] be a distribution on 𝒜: Σ_{a∈𝒜} π(a) = 1. Define

V(π) = Σ_{a∈𝒜} π(a) a a^⊤ ,   g(π) = max_{a∈𝒜} ‖a‖²_{V(π)^{-1}} .

A G-optimal design π∗ satisfies

g(π∗) = min_π g(π) .

How to use this?

Page 57:

Using a Design π

Given a design π, for a ∈ Supp(π), set

n_a = ⌈ ( π(a) g(π) / ε² ) log(1/δ) ⌉ .

Choose each action a ∈ Supp(π) exactly n_a times. Then

V = Σ_{a∈Supp(π)} n_a a a^⊤ ≥ ( g(π) / ε² ) log(1/δ) V(π) ,

and so for any a ∈ 𝒜, w.p. 1 − δ,

〈θ̂ − θ∗, a〉 ≤ √( ‖a‖²_{V^{-1}} log(1/δ) ) ≤ ε .

How big is n?

n = Σ_{a∈Supp(π)} n_a = Σ_{a∈Supp(π)} ⌈ ( π(a) g(π) / ε² ) log(1/δ) ⌉ ≤ |Supp(π)| + ( g(π) / ε² ) log(1/δ) .

Page 58:

Bounding g(π) and |Supp(π)|

Theorem (Kiefer–Wolfowitz)
The following are equivalent:
1 π∗ is a minimizer of g.
2 π∗ is a minimizer of f(π) = − log det V(π).
3 g(π∗) = d.

Note: designs minimizing f are known as D-optimal designs. Kiefer–Wolfowitz says that G-optimality is the same as D-optimality.
Combining this with John’s theorem for minimum-volume enclosing ellipsoids (John, 1948), an optimal design can be chosen with |Supp(π∗)| ≤ d(d + 3)/2.
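By Kiefer–Wolfowitz, a G-optimal design can be found by minimizing −log det V(π); here is a sketch using the classical Frank–Wolfe (Fedorov–Wynn) iteration with its exact line search. The tolerance and iteration cap are our choices, not from the slides.

```python
import numpy as np

def g_optimal_design(actions, n_iters=1000, tol=1e-3):
    # actions: (K, d) array whose rows span R^d
    K, d = actions.shape
    pi = np.full(K, 1.0 / K)                     # start from the uniform design
    for _ in range(n_iters):
        V = actions.T @ (pi[:, None] * actions)  # V(pi)
        V_inv = np.linalg.inv(V)
        norms = np.einsum("kd,de,ke->k", actions, V_inv, actions)
        k = int(np.argmax(norms))                # worst-covered action
        g = norms[k]
        if g <= d * (1 + tol):                   # KW optimality: g(pi*) = d
            break
        step = (g / d - 1) / (g - 1)             # exact line search for log det
        pi = (1 - step) * pi
        pi[k] += step
    return pi, g
```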

Page 59:

Outline

Page 60:

PEGOE Algorithm¹

Input: 𝒜 ⊂ R^d and δ. Set 𝒜₁ = 𝒜, ℓ = 1, t = 1.

1 Let t_ℓ = t (current round). Find the G-optimal design π_ℓ : 𝒜_ℓ → [0, 1] that maximizes log det V(π_ℓ) subject to Σ_{a∈𝒜_ℓ} π_ℓ(a) = 1.
2 Let ε_ℓ = 2^{−ℓ} and

N_ℓ(a) = ⌈ ( 2π_ℓ(a) / ε_ℓ² ) log( Kℓ(ℓ + 1) / δ ) ⌉   and   N_ℓ = Σ_{a∈𝒜_ℓ} N_ℓ(a) .

3 Choose each action a ∈ 𝒜_ℓ exactly N_ℓ(a) times.
4 Calculate the estimate θ̂_ℓ = V_ℓ^{-1} Σ_{t=t_ℓ}^{t_ℓ+N_ℓ} A_t X_t.
5 Eliminate poor arms:

𝒜_{ℓ+1} = { a ∈ 𝒜_ℓ : max_{b∈𝒜_ℓ} 〈θ̂_ℓ, b − a〉 ≤ 2ε_ℓ } ,

then increment ℓ and return to step 1.

¹Phased Elimination with G-Optimal Exploration
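A sketch of PEGOE, reusing `g_optimal_design` from the earlier sketch; `sample_reward` is a hypothetical environment callback, and the phase lengths follow the slide's formula.

```python
import numpy as np

def pegoe(actions, delta, n_rounds, sample_reward):
    K, d = actions.shape
    active = np.arange(K)
    t, ell = 0, 1
    while t < n_rounds:
        A = actions[active]
        pi, _ = g_optimal_design(A)              # G-optimal design on A_ell
        eps = 2.0 ** (-ell)
        counts = np.ceil(2 * pi / eps**2
                         * np.log(K * ell * (ell + 1) / delta)).astype(int)
        V, b = np.zeros((d, d)), np.zeros(d)
        for a, n_a in zip(A, counts):            # pull each arm N_ell(a) times
            for _ in range(n_a):
                if t >= n_rounds:
                    break
                x = sample_reward(a)
                V += np.outer(a, a)
                b += x * a
                t += 1
        theta_hat = np.linalg.pinv(V) @ b
        est = A @ theta_hat
        active = active[est.max() - est <= 2 * eps]  # keep near-optimal arms
        ell += 1
    return active
```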

Page 61:

The Regret of PEGOE

Theorem
With probability at least 1 − δ, the pseudo-regret of PEGOE is at most

R_n^pseudo ≤ C √( nd log( K log(n) / δ ) ) ,

where C > 0 is a universal constant. If δ = O(1/n), then

E[R_n] ≤ C √( nd log(Kn) )

for an appropriately chosen universal constant C > 0.

Page 62:

Summary and Historical Remarks

• Phased exploration allows one to use methods developed for fixed designs.
• PEGOE: exploration tuned to maximize information gain.
• Finding a (1 + ε)-optimal design is sufficient, and this is a convex problem.
• This algorithm and analysis, in this form, are new.
• “Phased elimination” is well known: Even-Dar et al. (2006) (pure exploration), Auer and Ortner (2010) (finite-armed bandits), Valko et al. (2014) (linear bandits, spanners instead of G-optimality).
• Finite but changing action sets: PEGOE cannot be applied! SupLinRel and SupLinUCB get the same bound (Auer, 2002; Chu et al., 2011). Sadly, these algorithms are very conservative.

Page 63:

Outline

5 Sparse Stochastic Linear Bandits
  Warmup (Hypercube)
  LinUCB with Sparsity
  Confidence Sets & Online Linear Regression
  Summary
6 Minimax Regret
  A Minimax Lower Bound
7 Asymptopia
  Lower Bound
  What About Optimism?
8 Summary

Page 64:

General Setting

1 Sparse parameter: there exist known constants M₀ and M₂ such that ‖θ∗‖₀ ≤ M₀ and ‖θ∗‖₂ ≤ M₂.
2 Bounded mean rewards: 〈a, θ∗〉 ≤ 1 for all a ∈ 𝒜_t and all rounds t.
3 Subgaussian noise: the reward is X_t = 〈A_t, θ∗〉 + η_t, where η_t | F_{t−1} ∼ subG(1) for F_t = σ(A_1, η_1, . . . , A_t, η_t).

Page 65:

The Case of the Hypercube

𝒜 = [−1, 1]^d ,   θ := θ∗ ,   X_t = 〈A_t, θ〉 + η_t .

Assumptions:
1 Bounded mean rewards: ‖θ‖₁ ≤ 1, which ensures that |〈a, θ〉| ≤ 1 for all a ∈ 𝒜.
2 Subgaussian noise: the reward is X_t = 〈A_t, θ〉 + η_t, where η_t | F_{t−1} ∼ subG(1) for F_t = σ(A_1, η_1, . . . , A_t, η_t).

Page 66:

Selective Explore-Then-Commit (SETC)

Recall: θ = θ∗.

For any i ∈ [d] such that A_ti is randomized (a uniformly random sign in {−1, +1}):

A_ti ( A_t^⊤ θ + η_t ) = θ_i + [ A_ti Σ_{j≠i} A_tj θ_j + A_ti η_t ] ,

where the bracketed term plays the role of “noise”.

Page 67:

Regret of SETC

Theorem
There exists a universal constant C > 0 such that the regret of SETC satisfies

R_n ≤ 2 ‖θ‖₁ + C Σ_{i:θ_i≠0} log(n) / |θ_i| .

Furthermore, R_n ≤ C ‖θ‖₀ √( n log(n) ).

SETC adapts to ‖θ‖₀!

Page 68:

Outline

Page 69:

(General) LinUCB: Recap

GLinUCB: choose C_t ⊂ R^d and let

A_t = argmax_{a∈𝒜} max_{θ∈C_t} 〈a, θ〉 .

The previous choice leads to regret O(d√n).
How should we choose C_t, knowing that ‖θ∗‖₀ ≤ p, so that the regret gets smaller?

Page 70:

Outline

Page 71:

Online Linear Regression (OLR)

Learner–environment interaction:
1 The environment chooses X_t ∈ R and A_t ∈ R^d in an arbitrary fashion.
2 The value of A_t is revealed to the learner (but not X_t).
3 The learner produces a real-valued prediction X̂_t in some way.
4 The environment reveals X_t to the learner; the loss is (X̂_t − X_t)².

Goal: compete with the total loss of the best linear predictor in some set Θ ⊂ R^d.

Regret against θ ∈ Θ:

ρ_n(θ) = Σ_{t=1}^n (X̂_t − X_t)² − Σ_{t=1}^n (X_t − 〈A_t, θ〉)² .

Page 72:

From OLR to Confidence Sets

Let L be a learner that enjoys a regret guarantee B_n = B_n(A_1, X_1, . . . , A_n, X_n) relative to Θ: for any strategy of the environment,

sup_{θ∈Θ} ρ_n(θ) ≤ B_n .

Combine

ρ_n(θ) = Σ_{t=1}^n (X̂_t − X_t)² − Σ_{t=1}^n (X_t − 〈A_t, θ〉)²

and X_t = 〈A_t, θ∗〉 + η_t to get

Q_t := Σ_{s=1}^t (X̂_s − 〈A_s, θ∗〉)² = ρ_t(θ∗) + 2 Σ_{s=1}^t η_s (X̂_s − 〈A_s, θ∗〉)
  ≤ B_t + 2 Σ_{s=1}^t η_s (X̂_s − 〈A_s, θ∗〉) .

Page 73:

From OLR to Confidence Sets: II

Q_t ≤ B_t + 2Z_t ,   Z_t = Σ_{s=1}^t η_s (X̂_s − 〈A_s, θ∗〉) .   (*)

Goal: bound Z_t for t ≥ 0.
X̂_t, chosen by the OLR learner L, is F_{t−1}-measurable, so

(Z_t − Z_{t−1}) | F_{t−1} ∼ subG(σ_t) ,   where σ_t² = (X̂_t − 〈A_t, θ∗〉)² .

The previous self-normalized bound (**): with probability 1 − δ,

|Z_t| < √( (1 + Q_t) log( (1 + Q_t) / δ² ) ) ,   t = 0, 1, . . . .

Combining with (*) and solving for Q_t:

Q_t ≤ β_t(δ) ,   β_t(δ) = 1 + 2B_t + 32 log( ( √8 + √(1 + B_t) ) / δ ) .

Page 74:

OLR to Confidence Sets: III

Theorem
Let δ ∈ (0, 1) and assume that θ∗ ∈ Θ and sup_{θ∈Θ} ρ_t(θ) ≤ B_t. If

C_{t+1} = { θ ∈ R^d : ‖θ‖₂² + Σ_{s=1}^t (X̂_s − 〈A_s, θ〉)² ≤ M₂² + β_t(δ) } ,

then P(there exists t ∈ N such that θ∗ ∉ C_{t+1}) ≤ δ .

Page 75:

Sparse LinUCB

1: Input: OLR learner L, regret bound B_t, confidence parameter δ ∈ (0, 1)
2: for t = 1, . . . , n:
3:   Receive action set 𝒜_t
4:   Compute the confidence set
       C_t = { θ ∈ R^d : ‖θ‖₂² + Σ_{s=1}^{t−1} (X̂_s − 〈A_s, θ〉)² ≤ M₂² + β_t(δ) }
5:   Calculate the optimistic action A_t = argmax_{a∈𝒜_t} max_{θ∈C_t} 〈a, θ〉
6:   Feed A_t to L and obtain the prediction X̂_t
7:   Play A_t and receive the reward X_t
8:   Feed X_t to L as feedback

Page 76:

Regret of OLR-UCB

Theorem
With probability at least 1 − δ, the pseudo-regret of OLR-UCB satisfies

R_n^pseudo ≤ √( 8dn ( M₂² + β_{n−1}(δ) ) log( 1 + n/d ) ) .

Page 77:

The Regret of OLR-UCB(π)

Theorem (Sparse OLR Algorithm)
There exists a strategy π for the learner such that for any θ ∈ R^d, the regret ρ_n(θ) of π against any strategic environment with max_{t∈[n]} ‖A_t‖₂ ≤ L and max_{t∈[n]} |X_t| ≤ X satisfies

ρ_n(θ) ≤ c X² ‖θ‖₀ ( log(e + n^{1/2}L) + C_n log( 1 + ‖θ‖₁/‖θ‖₀ ) ) + (1 + X²) C_n ,

where c > 0 is a universal constant and C_n = 2 + log₂ log(e + n^{1/2}L).

Corollary
The expected regret of OLR-UCB when using the strategy π from above satisfies R_n = O(√(dpn)).

Page 78:

Outline

Page 79:

Summary

• An OLR algorithm is used inside OLR-UCB to construct the center.
• The regret guarantee of the OLR learner controls the “width” of the confidence ellipsoid.
• Regret: O(√(dpn)), with p known.
• Hypercube: p√n, with p unknown!
• In general, the regret can be as high as Ω(√(pdn)) (p = 1: think of 𝒜 = {e_1, . . . , e_d}).
• Under parameter noise (X_t = 〈A_t, θ∗ + η_t〉), for “rounded” action sets, O(p√n) is possible!
• Very much unlike the “passive” case: a major conflict between exploration and exploitation!

Page 80:

Historical Notes

• The Selective Explore-Then-Commit algorithm is due to Lattimore et al. (2015).
• OLR-UCB is from Abbasi-Yadkori et al. (2012).
• The Sparse OLR algorithm is due to Gerchinovitz (2013).
• Rakhlin and Sridharan (2015) also discuss the relationship between online learning regret bounds and self-normalized tail bounds of the type given here.

Page 81:

Outline

Page 82:

Minimax Lower Bound

Theorem
Let the action set be 𝒜 = {−1, 1}^d and Θ = {−n^{−1/2}, n^{−1/2}}^d. Then for any policy π there exists a θ ∈ Θ such that

R_n^π(𝒜, θ) ≥ C d √n

for some universal constant C > 0.

Page 83:

Some Thoughts

• LinUCB with our confidence-set construction is “nearly” worst-case optimal.
• The theorem is “new”, but the proof is standard; see Shamir (2015).
• Similar results hold for some other action sets: Rusmevichientong and Tsitsiklis (2010) (ℓ₂-ball), Dani et al. (2008) (products of 2D balls).
• Some action sets will have smaller minimax regret! Can you think of one?

Page 84:

Outline

Page 85:

Lower Bound

Setting:
1 Actions: 𝒜 ⊂ R^d finite, K = |𝒜|.
2 Reward: X_t = 〈A_t, θ〉 + η_t, where θ ∈ R^d and (η_t)_t is a sequence of independent standard Gaussian variables.

Regret of policy π:

R_n^π(𝒜, θ) = E_{θ,π}[ Σ_{t=1}^n Δ_{A_t} ] ,   Δ_a = max_{a′∈𝒜} 〈a′ − a, θ〉 .

Recall: a policy π is consistent on some class of bandits E if the regret is subpolynomial for any bandit in that class:

R_n^π(𝒜, θ) = o(n^p) for all p > 0 and θ ∈ R^d .

Page 86:

Lower Bound: II

Theorem
Assume that 𝒜 ⊂ R^d is finite and spans R^d, and suppose π is consistent. Let θ ∈ R^d be any parameter such that there is a unique optimal action, and let G_n = E_{θ,π}[ Σ_{t=1}^n A_t A_t^⊤ ] be the expected Gram matrix. Then lim inf_{n→∞} λ_min(G_n)/log(n) > 0. Furthermore, for any a ∈ 𝒜 it holds that

lim sup_{n→∞} log(n) ‖a‖²_{G_n^{-1}} ≤ Δ_a² / 2 .

Page 87:

Lower Bound: III

Corollary
Let 𝒜 ⊂ R^d be a finite set that spans R^d and let θ ∈ R^d be such that there is a unique optimal action. Then for any consistent policy π,

lim inf_{n→∞} R_n^π(𝒜, θ) / log(n) ≥ c(𝒜, θ) ,

where c(𝒜, θ) is defined as

c(𝒜, θ) = inf_{α∈[0,∞)^𝒜} Σ_{a∈𝒜} α(a) Δ_a
subject to ‖a‖²_{H_α^{-1}} ≤ Δ_a² / 2 for all a ∈ 𝒜 with Δ_a > 0 ,

where H_α = Σ_{a∈𝒜} α(a) a a^⊤.

Page 88:

Outline

Page 89:

Poor Outlook for Optimism

Page 90:

Poor Outlook for Optimism

Actions: 𝒜 = {a₁, a₂, a₃}, with a₁ = e₁, a₂ = e₂, a₃ = (1 − ε, γε), where ε > 0 is small and γ ≥ 1.
Let θ = (1, 0), so a∗ = a₁.
Solving for the lower bound: α(a₂) = 2γ² and α(a₃) = 0, so c(𝒜, θ) = 2γ² and

lim inf_{n→∞} R_n^π(𝒜, θ) / log(n) = 2γ² .

Moreover, for γ large, ε sufficiently small, and π “optimistic”,

lim sup_{n→∞} R_n^π(𝒜, θ) / log(n) = Ω(1/ε) .

Page 91:

Instance-Optimal Asymptotic Regret

Theorem
There exists a policy π that is consistent and satisfies

lim sup_{n→∞} R_n^π(𝒜, θ) / log(n) = c(𝒜, θ) ,

where c(𝒜, θ) was defined in the lower bound.

Page 92:

Illustration: LinUCB

Page 93:

Summary

The instance-optimal regret of consistent algorithms is asymptotically c(𝒜, θ) log(n).

Optimistic algorithms fail to achieve this: their regret can be worse by an arbitrarily large constant factor.

Remember the finite-armed cases:
Case (a): 𝒜_t always has the same number of vectors in it: “finite-armed stochastic contextual bandit”.
Case (b): In addition, 𝒜_t does not change: 𝒜_t = {a_1, . . . , a_K}: “finite-armed stochastic linear bandit”.
Case (c): If the vectors in 𝒜_t are also orthogonal to each other: “finite-armed stochastic bandit”.
Difference between cases (c) and (b):
• Case (c): learn about the mean of arm i ⇔ choose action i;
• Case (b): learn about the mean of arm i ⇔ choose any action j with 〈a_j, a_i〉 ≠ 0.

Page 94:

Departing Thoughts

• These results are from Lattimore and Szepesvári (2016).
• The asymptotically optimal algorithm is given there (the algorithm solves for the optimal allocation while monitoring whether things went wrong).
• Combes et al. (2017) refine the algorithm and generalize it to other settings.
• Soare et al. (2014), in best-arm identification with linear payoff functions, gave essentially the same example that we use to argue for the large regret of optimistic algorithms.
• Open questions:
  • A simultaneously finite-time near-optimal and asymptotically optimal algorithm.
  • Changing, or infinite, action sets?

Page 95:

Summary

Page 96:

Summary of This Talk

• Contextual vs. linear bandits: changing action sets can model contextual bandits.
• Optimistic algorithms:
  • Optimism can achieve minimax optimality.
  • Optimism can be expensive.
  • Optimistic algorithms require a careful design of the underlying confidence sets.
• Sparsity: exploiting sparsity is sometimes at odds with the requirement to collect rewards.

Page 97:

What’s Next for Bandits?

• Today: finite-armed and linear stochastic bandits.
• But bandits come in all forms and shapes!
  • Adversarial (finite, linear, . . . )
  • Combinatorial action sets: from shortest path to ranking
  • Continuous action sets, continuous time, delays
  • Resourceful, nonstationary, various structures (low-rank), . . .
• Nearby problems:
  • Reinforcement learning / Markov decision processes
  • Partial monitoring

Page 98:

Learning Material

• Bandit Visualizer: https://github.com/alexrutar/banditvis
• Online bandit simulator: http://downloads.tor-lattimore.com/bandits/
• Most of this tutorial (and more): http://banditalgs.com
• Book to be published by early next year: looking for reviewers!
• Tor’s lightweight C++ bandit library
• Sébastien Bubeck’s tutorial
  • Blog post 1
  • Blog post 2
• Bubeck and Cesa-Bianchi’s book (Bubeck and Cesa-Bianchi, 2012)

banditalgs.com

Page 99:

References I

Abbasi-Yadkori, Y. (2009). Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta.
Abbasi-Yadkori, Y. (2012). Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta.
Abbasi-Yadkori, Y., Antos, A., and Szepesvári, C. (2009). Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback.
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9.
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320.
Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3–11.
Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part I: IID rewards. IEEE Transactions on Automatic Control, 32(11):968–976.
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.

Page 100:

References II

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers.
Chu, W., Li, L., Reyzin, L., and Schapire, R. E. (2011). Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214.
Combes, R., Magureanu, S., and Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems 30, pages 1761–1769. Curran Associates, Inc.
Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pages 355–366.
Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105.
Filippi, S., Cappé, O., Garivier, A., and Szepesvári, Cs. (2009). Parametric bandits: The generalized linear case. In NIPS-22, pages 586–594.
Freedman, D. (1975). On tail probabilities for martingales. The Annals of Probability, 3(1):100–118.
Gerchinovitz, S. (2013). Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769.
Hsu, D., Kakade, S. M., and Zhang, T. (2012). Random design analysis of ridge regression. In Conference on Learning Theory, pages 9.1–9.24.

Page 101:

References III

John, F. (1948). Extremum problems with inequalities as subsidiary conditions. Courant Anniversary Volume, Interscience.
Lattimore, T., Crammer, K., and Szepesvári, C. (2015). Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972.
Lattimore, T. and Munos, R. (2014). Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558.
Lattimore, T. and Szepesvári, C. (2016). The end of optimism? An asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491.
Peña, V. H., Lai, T. L., and Shao, Q.-M. (2008). Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media.
Rakhlin, A. and Sridharan, K. (2015). On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv preprint arXiv:1510.03925.
Robbins, H. and Siegmund, D. (1970). Boundary crossing probabilities for the Wiener process and sample sums. Annals of Mathematical Statistics, 41:1410–1429.
Robbins, H. and Siegmund, D. (1971). A convergence theorem for non-negative almost supermartingales and some applications. In Rustagi, J., editor, Optimizing Methods in Statistics, pages 235–257. Academic Press, New York.
Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.
Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. In NIPS, pages 2256–2264.

Page 102:

References IV

Shamir, O. (2015). On the complexity of bandit linear optimization. In Conference on Learning Theory, pages 1523–1551.
Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. W. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, pages 1015–1022.
Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869.
Valko, M., Munos, R., Kveton, B., and Kocák, T. (2014). Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46–54.