Bandit Algorithms
Tor Lattimore & Csaba Szepesvári


Page 1:

Bandit Algorithms
Tor Lattimore & Csaba Szepesvári

Page 2:

Outline

1 From Contextual to Linear Bandits
2 Stochastic Linear Bandits
3 Confidence Bounds for Least-Squares Estimators
4 Improved Regret for Fixed, Finite Action Sets

Page 3:

Outline

5 Sparse Stochastic Linear Bandits
6 Minimax Regret
7 Asymptopia
8 Summary

Page 4:

Outline

1 From Contextual to Linear Bandits
  On the Choice of Features
2 Stochastic Linear Bandits
  Setting
  Optimism and LinUCB
  Generic Regret Analysis
  Miscellaneous Remarks
3 Confidence Bounds for Least-Squares Estimators
  Least Squares: Recap
  Fixed Design
  Sequential Design
  Completing the Regret Bound
4 Improved Regret for Fixed, Finite Action Sets
  Challenge!
  G-Optimal Designs
  The PEGOE Algorithm

Page 5:

Stochastic Contextual Bandits

Set of contexts C, set of actions [K], distributions (P_{c,a}).

Interaction: for rounds t = 1, 2, 3, . . . :
1 Context C_t ∈ C is revealed to the learner.
2 Based on its past observations (including C_t), the learner chooses an action A_t ∈ [K]. The chosen action is sent to the environment.
3 The environment sends the reward X_t ∼ P_{C_t, A_t} to the learner.

Page 6:

Regret Definition

Definition: expected reward for action a under context c:

r(c, a) = ∫ x P_{c,a}(dx) .

Regret:

R_n = E[ Σ_{t=1}^n max_{a∈[K]} r(C_t, a) − Σ_{t=1}^n X_t ] .

Page 7:

Poor Man’s Contextual Bandit Algorithm

Assumption: C is finite.
Idea: Assign a bandit to each context.
Worst-case regret: R_n = Θ(√(nMK)), where M = |C|.
Problem: M (and K) can be very large.
How to save this? Assume structure.
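A sketch of the per-context idea, pairing an independent UCB1 instance with each observed context; UCB1 is not spelled out on the slides, so the constants below are the usual textbook ones. It plugs into the interaction loop sketched earlier.

```python
import math
from collections import defaultdict

class UCB1:
    def __init__(self, k):
        self.counts = [0] * k
        self.means = [0.0] * k
        self.t = 0

    def choose(self):
        self.t += 1
        for a, n in enumerate(self.counts):
            if n == 0:
                return a  # play every arm once first
        ucb = [m + math.sqrt(2 * math.log(self.t) / n)
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, a, x):
        self.counts[a] += 1
        self.means[a] += (x - self.means[a]) / self.counts[a]

class PerContextBandit:
    # one independent bandit per context: regret Theta(sqrt(n*M*K))
    def __init__(self, k):
        self.bandits = defaultdict(lambda: UCB1(k))

    def choose(self, c):
        return self.bandits[c].choose()

    def update(self, c, a, x):
        self.bandits[c].update(a, x)
```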

Page 8:

Linear Models

Assumption:

r(c, a) = 〈ψ(c, a), θ∗〉 ,   ∀(c, a) ∈ C × [K] ,

where ψ : C × [K] → R^d and θ∗ ∈ R^d.
• ψ: feature map;
• H_ψ := span(ψ(c, k) : c ∈ C, k ∈ [K]) ⊂ R^d: feature space;
• S_ψ := { r_θ : C × [K] → R, θ ∈ R^d, where r_θ(c, k) = 〈ψ(c, k), θ〉 }: space of linear reward functions.

Note: RKHSs let us deal with d = ∞ (or d very large).

Page 9:

Outline

Page 10:

Choosing ψ ⇒ betting on smoothness of r

Let ‖·‖ be a norm on R^d and ‖·‖∗ its dual (e.g., both 2-norms).
Hölder’s inequality:

|r(c, a) − r(c′, a′)| ≤ ‖θ∗‖ ‖ψ(c, a) − ψ(c′, a′)‖∗ .

r = 〈ψ, θ∗〉 ⇒ r is (ρ, ‖θ∗‖)-smooth:

|r(z) − r(z′)| ≤ ‖θ∗‖ ρ(z, z′) ,

where z = (c, a), z′ = (c′, a′), and ρ(z, z′) = ‖ψ(z) − ψ(z′)‖∗.

Choice of ψ + preference for small ‖θ∗‖ ⇔ preference for ρ-smooth r.
The influence of ψ is largest when d = +∞.

Page 11:

Sparsity

Concern: Is r ∈ S_ψ really true?
Thinking of free lunches: can’t we just add a lot of features to make sure r ∈ S_ψ?

• Feature i unused: θ∗,i = 0;
• Small ‖θ∗‖₀ = Σ_{i=1}^d I{θ∗,i ≠ 0};
• the ‘0-norm’.

Page 12:

Outline

Page 13:

Features Are All You Need

Given C_1, A_1, . . . , C_t, A_t, the reward X_t in round t satisfies

E[X_t | C_1, A_1, . . . , C_t, A_t] = 〈ψ(C_t, A_t), θ∗〉 ,

for some known ψ and unknown θ∗.

⇔

At the beginning of round t, observe an action set 𝒜_t ⊂ R^d. If A_t ∈ 𝒜_t,

E[X_t | 𝒜_1, A_1, . . . , 𝒜_t, A_t] = 〈A_t, θ∗〉 ,

with some unknown θ∗.

Why? Let 𝒜_t = { ψ(C_t, a) : a ∈ [K] }.

Page 14:

Stochastic Linear Bandits

1 In round t, observe the action set 𝒜_t ⊂ R^d.
2 The learner chooses A_t ∈ 𝒜_t and receives X_t, satisfying

E[X_t | 𝒜_1, A_1, . . . , 𝒜_t, A_t] = 〈A_t, θ∗〉 ,

with some unknown θ∗.

Goal: keep the regret

R_n = E[ Σ_{t=1}^n ( max_{a∈𝒜_t} 〈a, θ∗〉 − X_t ) ]

small.

Additional assumptions: (i) 𝒜_1, . . . , 𝒜_n is any fixed sequence; (ii) X_t − 〈A_t, θ∗〉 is light-tailed, given 𝒜_1, A_1, . . . , 𝒜_t, A_t.

Page 15:

Finite-armed bandits

Case (a): 𝒜_t always has the same number of vectors in it: “finite-armed stochastic contextual bandit”.
Case (b): In addition, 𝒜_t does not change: 𝒜_t = {a_1, . . . , a_K}: “finite-armed stochastic linear bandit”.
Case (c): If the vectors in 𝒜_t are also orthogonal to each other: “finite-armed stochastic bandit”.

Difference between cases (c) and (b):
• Case (c): learn about the mean of arm i ⇔ choose action i;
• Case (b): learn about the mean of arm i ⇔ choose any action j with 〈a_j, a_i〉 ≠ 0.

Page 16:

Outline

Page 17:

Once an Optimist, Always an Optimist

Optimism in the Face of Uncertainty Principle: “Choose the best action in the best environment amongst the plausible ones.”

Environment ⇔ θ ∈ R^d.
Plausible environments: C_t ⊂ R^d s.t. P(θ∗ ∉ C_t) ∼ 1/t.
Best environment: θ̃_t = argmax_{θ∈C_t} max_{a∈𝒜} 〈a, θ〉.
Best action: argmax_{a∈𝒜} 〈a, θ̃_t〉.

Page 18:

Choosing the Confidence Set

Say the reward in round t is X_t and the action is A_t ∈ R^d:

X_t = 〈A_t, θ∗〉 + η_t ,   η_t is noise.

Regularized least-squares estimator:

θ̂_t = V_t^{-1} Σ_{s=1}^t A_s X_s ,   V_0 = λI ,   V_t = V_0 + Σ_{s=1}^t A_s A_s^⊤ .

Choice of C_t:

C_t ⊂ E_t := { θ ∈ R^d : ‖θ − θ̂_{t−1}‖²_{V_{t−1}} ≤ β_t } .

Here (β_t)_t is nondecreasing with β_t ≥ 1, and for A positive definite, ‖x‖²_A = x^⊤Ax.

Page 19:

LinUCB

Choose C_t = E_t with suitable (β_t)_t and let

A_t = argmax_{a∈𝒜} max_{θ∈C_t} 〈a, θ〉 .

Then

A_t = argmax_a 〈a, θ̂_{t−1}〉 + √β_t ‖a‖_{V_{t−1}^{-1}} .

LinUCB (a.k.a. LinRel, OFUL, ConfEllips, . . . )
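A sketch of LinUCB as just defined: the ridge estimate plus the ellipsoidal bonus √β_t ‖a‖_{V_{t−1}^{-1}}. The default β schedule below is a placeholder of ours; the sequential-design section later derives a principled choice.

```python
import numpy as np

class LinUCB:
    def __init__(self, d, lam=1.0, beta=None):
        self.V = lam * np.eye(d)                 # V_0 = lambda * I
        self.b = np.zeros(d)                     # running sum of A_s X_s
        self.beta = beta or (lambda t: 2.0 * np.log(1.0 + t))  # placeholder
        self.t = 0

    def choose(self, actions):                   # actions: (K, d) array
        self.t += 1
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b               # ridge estimate
        widths = np.sqrt(np.einsum("kd,de,ke->k", actions, V_inv, actions))
        ucb = actions @ theta_hat + np.sqrt(self.beta(self.t)) * widths
        return int(np.argmax(ucb))

    def update(self, a, x):                      # chosen feature row a, reward x
        self.V += np.outer(a, a)
        self.b += x * a
```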

Page 20:

Outline

Page 21:

Regret Analysis

Assumptions:
1 Bounded scalar mean reward: |〈a, θ∗〉| ≤ 1 for any a ∈ ∪_t 𝒜_t.
2 Bounded actions: for any a ∈ ∪_t 𝒜_t, ‖a‖₂ ≤ L.
3 Honest confidence intervals: there exists δ ∈ (0, 1) such that with probability 1 − δ, for all t ∈ [n], θ∗ ∈ C_t, where C_t satisfies C_t ⊂ E_t with (β_t)_t ≥ 1 nondecreasing, as on the previous slide.

Theorem (LinUCB Regret)
Let the conditions listed above hold. Then with probability 1 − δ the regret of LinUCB satisfies

R_n^pseudo ≤ √( 8nβ_n log( det V_n / det V_0 ) ) ≤ √( 8dnβ_n log( (trace(V_0) + nL²) / (d det^{1/d}(V_0)) ) ) .

Page 22:

Proof

Assume θ∗ ∈ C_t for all t ∈ [n]. Let A_t^∗ := argmax_{a∈𝒜_t} 〈a, θ∗〉, r_t = 〈A_t^∗ − A_t, θ∗〉, and let θ̃_t ∈ C_t be such that 〈A_t, θ̃_t〉 = UCB_t(A_t).
From θ∗ ∈ C_t and the definition of LinUCB,

〈A_t^∗, θ∗〉 ≤ UCB_t(A_t^∗) ≤ UCB_t(A_t) = 〈A_t, θ̃_t〉 .

Then

r_t ≤ 〈A_t, θ̃_t − θ∗〉 ≤ ‖A_t‖_{V_{t−1}^{-1}} ‖θ̃_t − θ∗‖_{V_{t−1}} ≤ 2 √β_t ‖A_t‖_{V_{t−1}^{-1}} .

From 〈a, θ∗〉 ≤ 1, r_t ≤ 2. This combined with β_n ≥ max{1, β_t} gives

r_t ≤ 2 ∧ 2√β_t ‖A_t‖_{V_{t−1}^{-1}} ≤ 2√β_n ( 1 ∧ ‖A_t‖_{V_{t−1}^{-1}} ) .

Jensen’s inequality shows that

R_n^pseudo = Σ_{t=1}^n r_t ≤ √( n Σ_{t=1}^n r_t² ) ≤ 2 √( nβ_n Σ_{t=1}^n ( 1 ∧ ‖A_t‖²_{V_{t−1}^{-1}} ) ) .

Page 23:

Lemma

Let x_1, . . . , x_n ∈ R^d, V_t = V_0 + Σ_{s=1}^t x_s x_s^⊤ for t ∈ [n], v_0 = trace(V_0), and L ≥ max_t ‖x_t‖₂. Then

Σ_{t=1}^n ( 1 ∧ ‖x_t‖²_{V_{t−1}^{-1}} ) ≤ 2 log( det V_n / det V_0 ) ≤ 2d log( (v_0 + nL²) / (d det^{1/d}(V_0)) ) .

Page 24:

Outline

Page 25:

LinUCB and Finite-Armed Bandits

Recall that if 𝒜_t = {e_1, . . . , e_d}, we get back finite-armed bandits.
LinUCB:

A_t = argmax_a 〈a, θ̂_{t−1}〉 + √β_t ‖a‖_{V_{t−1}^{-1}} .

If we set λ = 0, then 〈e_i, θ̂_{t−1}〉 = μ̂_{i,t−1}, the empirical mean of arm i, and V_{t−1} = diag(T_1(t−1), . . . , T_d(t−1)). Hence

√β_t ‖e_i‖_{V_{t−1}^{-1}} = √( β_t / T_i(t−1) ) ,

and

A_t = argmax_i μ̂_{i,t−1} + √( β_t / T_i(t−1) ) .

We recover UCB when β_t = 2 log(·).

Page 26:

History

• Abe and Long (1999) introduced stochastic linear bandits into the machine learning literature.
• Auer (2002) was the first to consider optimism for linear bandits (LinRel, SupLinRel). Main restriction: |𝒜_t| < +∞.
• Confidence ellipsoids: Dani et al. (2008) (ConfidenceBall₂), Rusmevichientong and Tsitsiklis (2010) (Uncertainty Ellipsoid Policy), Abbasi-Yadkori et al. (2011) (OFUL).
• The name LinUCB comes from Chu et al. (2011).
• Alternative routes:
  • Explore-then-commit for action sets with smooth boundary: Abbasi-Yadkori et al. (2009); Abbasi-Yadkori (2009); Rusmevichientong and Tsitsiklis (2010).
  • Phased elimination.
  • Thompson sampling.

Page 27:

Extensions of Linear Bandits

• Generalized linear model (Filippi et al., 2009):

X_t = g^{-1}( 〈A_t, θ∗〉 + η ) ,   (1)

where g : R → R is called the link function. Common choice: g(p) = log(p/(1 − p)), with g^{-1}(x) = 1/(1 + exp(−x)) (sigmoid).
• Spectral bandits: Spectral Eliminator (Valko et al., 2014).
• Kernelised UCB: Valko et al. (2013).
• (Nonlinear) structured bandits: r : C × [K] → [0, 1] belongs to some known set (Anantharam et al., 1987; Russo and Van Roy, 2013; Lattimore and Munos, 2014).

Page 28:

Outline

Page 29:

Setting

1 Subgaussian rewards: the reward is X_t = 〈A_t, θ∗〉 + η_t, where η_t is conditionally 1-subgaussian (η_t | F_{t−1} ∼ subG(1)):

E[exp(λη_t) | F_{t−1}] ≤ exp(λ²/2) almost surely for all λ ∈ R ,

where F_t = σ(A_1, η_1, . . . , A_{t−1}, η_{t−1}, A_t).
2 Bounded parameter vector: ‖θ∗‖₂ ≤ S with S > 0 known.

Page 30:

Least Squares: Recap

Linear model:

X_t = 〈A_t, θ∗〉 + η_t .

Regularized squared loss:

L_t(θ) = Σ_{s=1}^t ( X_s − 〈A_s, θ〉 )² + λ ‖θ‖₂² .

Least-squares estimate θ̂_t = argmin_θ L_t(θ):

θ̂_t = V_t(λ)^{-1} Σ_{s=1}^t X_s A_s   with   V_t(λ) = λI + Σ_{s=1}^t A_s A_s^⊤ .   (2)

Abbreviation: V_t = V_t(0).

Main difficulty: the actions (A_s)_{s<t} are neither fixed nor independent, but are intricately correlated via the rewards (X_s)_{s<t}.

Page 31:

Outline

Page 32:

Fixed Design

1 Nonsingular Gram matrix: λ = 0 and V_t is invertible.
2 Independent subgaussian noise: (η_s)_s are independent and 1-subgaussian.
3 Fixed design: A_1, . . . , A_t are deterministically chosen without knowledge of X_1, . . . , X_t.

Notation: A_1, . . . , A_t replaced by a_1, . . . , a_t, so

V_t = Σ_{s=1}^t a_s a_s^⊤   and   θ̂_t = V_t^{-1} Σ_{s=1}^t a_s X_s .

Note: (a_s)_{s=1}^t must span R^d for V_t^{-1} to exist ⇒ t ≥ d.

Page 33:

First Steps

Fix x ∈ R^d. From X_s = a_s^⊤ θ∗ + η_s,

θ̂_t = V_t^{-1} Σ_{s=1}^t a_s X_s = V_t^{-1} V_t θ∗ + V_t^{-1} Σ_{s=1}^t a_s η_s ,

so

〈x, θ̂_t − θ∗〉 = Σ_{s=1}^t 〈x, V_t^{-1} a_s〉 η_s .

Since (η_s)_s are independent and 1-subgaussian: with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 Σ_{s=1}^t 〈x, V_t^{-1} a_s〉² log(1/δ) ) = √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .

Page 34:

Bounding ‖θ̂_t − θ∗‖_{V_t}

For x ∈ R^d fixed, with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .   (*)

We have ‖u‖²_{V_t} = (V_t u)^⊤ u.
Idea 1: apply (*) to X = V_t(θ̂_t − θ∗).
Problem: (*) holds for non-random x only!
Idea 2: take a finite C_ε s.t. X ≈_ε x for some x ∈ C_ε. Make (*) hold for each x ∈ C_ε and combine.
Refinement: let X = V_t^{1/2}(θ̂_t − θ∗)/‖θ̂_t − θ∗‖_{V_t}. Then X ∈ S^{d−1} = { x ∈ R^d : ‖x‖₂ = 1 }, and we can choose a finite C_ε ⊂ S^{d−1} s.t. for any u ∈ S^{d−1} there is x ∈ C_ε with ‖u − x‖ ≤ ε.

Page 35:

Bounding ‖θ̂_t − θ∗‖_{V_t}: Part II

Let

X = V_t^{1/2}(θ̂_t − θ∗) / ‖θ̂_t − θ∗‖_{V_t} ,   X∗ = argmin_{x∈C_ε} ‖x − X‖ .

Then, with probability 1 − δ,

‖θ̂_t − θ∗‖_{V_t} = 〈X, V_t^{1/2}(θ̂_t − θ∗)〉
  = 〈X − X∗, V_t^{1/2}(θ̂_t − θ∗)〉 + 〈V_t^{1/2} X∗, θ̂_t − θ∗〉
  ≤ ε ‖θ̂_t − θ∗‖_{V_t} + √( 2 ‖V_t^{1/2} X∗‖²_{V_t^{-1}} log(|C_ε|/δ) ) ,

or

‖θ̂_t − θ∗‖_{V_t} ≤ (1/(1 − ε)) √( 2 log(|C_ε|/δ) ) .

We can choose C_ε so that |C_ε| ≤ (5/ε)^d. Taking ε = 1/2, so that |C_ε| ≤ 10^d:

‖θ̂_t − θ∗‖_{V_t} < 2 √( 2 ( d log(10) + log(1/δ) ) ) .

Page 36:

Outline

Page 37:

Bounding ‖θ̂_t − θ∗‖_{V_t}: Sequential Design

X_s = 〈A_s, θ∗〉 + η_s, with η_s | F_{s−1} ∼ subG(1) for F_s = σ(A_1, η_1, . . . , A_s, η_s).

• The previous bound exploited that A_1, . . . , A_t are fixed and non-random: a fixed design.
• When A_1, . . . , A_t are i.i.d., we have a random design.
• Bandits: A_s is chosen based on A_1, X_1, . . . , A_{s−1}, X_{s−1}: a sequential design!

How do we bound ‖θ̂_t − θ∗‖_{V_t} in this case?

Page 38:

Bounding ‖θ̂_t − θ∗‖_{V_t}: Sequential Design

• Linearization trick
• Vector Chernoff
• Laplace method

Page 39:

A Start: Linearization & Vector Chernoff

Let S_t = Σ_{s=1}^t η_s A_s. “Linearization” of the quadratic:

(1/2) ‖θ̂_t − θ∗‖²_{V_t} = (1/2) ‖S_t‖²_{V_t^{-1}} = max_{x∈R^d} 〈x, S_t〉 − (1/2) ‖x‖²_{V_t} .

Let

M_t(x) = exp( 〈x, S_t〉 − (1/2) ‖x‖²_{V_t} ) .

One can show that E[M_t(x)] ≤ 1 for any x ∈ R^d. Chernoff’s method:

P( (1/2) ‖θ̂_t − θ∗‖²_{V_t} ≥ u ) = P( exp( max_x log M_t(x) ) ≥ exp(u) )
  ≤ E[ exp( max_x log M_t(x) ) ] exp(−u) = E[ max_x M_t(x) ] exp(−u) .

Can we control E[max_x M_t(x)]?

Page 40:

Controlling E[max_x M_t(x)]: Covering Argument

Recall:

M_t(x) = exp( 〈x, S_t〉 − (1/2) ‖x‖²_{V_t} ) .

Let C_ε ⊂ R^d be finite, to be chosen later, and

X = argmax_{x∈R^d} M_t(x) ,   Y = argmin_{y∈C_ε} ‖X − y‖ .

Then

max_{x∈R^d} M_t(x) = M_t(X) = M_t(X) − M_t(Y) + M_t(Y) ≤ ε + Σ_{y∈C_ε} M_t(y) .

Challenge: ensure M_t(X) − M_t(Y) ≤ ε!

Page 41:

Laplace: One Step Back, Two Forward

The need to control E[max_x M_t(x)] comes from the identity exp( (1/2) ‖θ̂_t − θ∗‖²_{V_t} ) = max_x M_t(x).
Laplace: the integral of exp(s f(x)) is dominated by exp(s max_x f(x)):

∫_a^b e^{s f(x)} dx ∼ e^{s f(x₀)} √( 2π / (s |f″(x₀)|) ) ,   x₀ = argmax_{x∈[a,b]} f(x) ∈ (a, b) .

Idea: replace max_x M_t(x) with ∫ M_t(x) h(x) dx for an appropriate h:
1 ∫ M_t(x) h(x) dx ≈ max_x M_t(x) (in a way);
2 E[ ∫ M_t(x) h(x) dx ] = ∫ E[M_t(x)] h(x) dx ≤ 1.

Choose h(x) as the density of N(0, H^{-1}) for H ≻ 0.

Page 42:

Step 2: Finishing

∫ M_t(x) h(x) dx = ( det(H) / det(H + V_t) )^{1/2} exp( (1/2) ‖S_t‖²_{(H+V_t)^{-1}} ) .

Choose H = λI. Then, with probability 1 − e^{−u},

(1/2) ‖S_t‖²_{V_t(λ)^{-1}} < u + (1/2) log( det(V_t(λ)) / λ^d )   (**)

and from θ̂_t − θ∗ = V_t(λ)^{-1} S_t − λ V_t(λ)^{-1} θ∗,

‖θ̂_t − θ∗‖_{V_t(λ)} ≤ ‖V_t(λ)^{-1} S_t‖_{V_t(λ)} + λ ‖V_t(λ)^{-1} θ∗‖_{V_t(λ)}
  ≤ ‖S_t‖_{V_t(λ)^{-1}} + λ^{1/2} ‖θ∗‖ .

Page 43:

Confidence Ellipsoid for Sequential Design

Assumptions: ‖θ∗‖ ≤ S, and (A_s)_s, (η_s)_s are such that for any 1 ≤ s ≤ t, η_s | F_{s−1} ∼ subG(1), where F_s = σ(A_1, η_1, . . . , A_{s−1}, η_{s−1}, A_s).

Fix δ ∈ (0, 1). Let

β_{t+1} = √λ S + √( 2 log(1/δ) + log( det V_t(λ) / λ^d ) )

and

C_{t+1} = { θ ∈ R^d : ‖θ̂_t − θ‖_{V_t(λ)} ≤ β_{t+1} } .

Theorem
C_{t+1} is a confidence set for θ∗ at level 1 − δ:

P(θ∗ ∈ C_{t+1}) ≥ 1 − δ .

Note: β_{t+1} is a function of (A_s)_{s≤t}.
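A sketch of the radius β_{t+1} and the resulting membership test (the helper names are ours):

```python
import numpy as np

def beta_radius(V_lam, lam, S, delta):
    # sqrt(lam)*S + sqrt(2 log(1/delta) + log(det V_t(lam) / lam^d))
    d = V_lam.shape[0]
    logdet = np.linalg.slogdet(V_lam)[1]
    return np.sqrt(lam) * S + np.sqrt(2 * np.log(1 / delta)
                                      + logdet - d * np.log(lam))

def in_confidence_set(theta, theta_hat, V_lam, radius):
    diff = theta_hat - theta
    return diff @ V_lam @ diff <= radius**2  # ||theta_hat - theta||_{V(lam)} <= beta
```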

Page 44:

Freedman’s Stopping Trick

We want θ∗ ∈ C_t to hold with probability 1 − δ, simultaneously for all 1 ≤ t ≤ n.
Can we avoid the union bound over time?

Page 45:

Freedman’s Stopping Trick: II

Let

E_t = { ‖θ̂_{t−1} − θ∗‖_{V_{t−1}(λ)} ≥ √λ S + √( 2u + log( det V_{t−1}(λ) / λ^d ) ) } ,   t ∈ [n] .

Define τ ∈ [n] as the smallest round index t ∈ [n] such that E_t holds, or n when none of E_1, . . . , E_n hold.
Note: because θ̂_{t−1} and V_{t−1}(λ) are functions of H_{t−1} = (A_1, η_1, . . . , A_{t−1}, η_{t−1}), whether E_t holds can be decided based on H_{t−1}.
⇒ τ is an (H_t)_t stopping time ⇒ E[M_τ(x)] ≤ 1, and also ∫ E[M_τ(x)] h(x) dx ≤ 1, and thus P(∪_{t∈[n]} E_t) ≤ P(E_τ) ≤ e^{−u}. Finally, let n → ∞.

Corollary
P(∃ t ≥ 0 such that θ∗ ∉ C_{t+1}) ≤ δ.

Page 46:

Historical Remarks

• The presentation mostly follows Abbasi-Yadkori et al. (2011).
• Auer (2002); Chu et al. (2011) avoided the need to construct ellipsoidal confidence sets.
• Previous ellipsoidal constructions by Dani et al. (2008) and Rusmevichientong and Tsitsiklis (2010) used covering arguments.
• The improvement from using Laplace’s method over the earlier covering-based ellipsoidal constructions is substantial.
• Laplace’s method is also called the “method of mixtures” (Peña et al., 2008); its use goes back to the work of Robbins and Siegmund in the 1970s (Robbins and Siegmund, 1970, 1971).
• Freedman’s stopping trick is due to Freedman (1975).

Page 47:

Outline

Page 48:

Regret for LinUCB: Final Steps

Previously we saw that for β_t ≥ 1 nondecreasing, using LinUCB with V_0 = λI, w.p. 1 − δ,

R_n^pseudo ≤ √( 8nβ_n log( det V_n / det V_0 ) ) ≤ √( 8dnβ_n log( (λd + nL²) / (λd) ) ) .

Now,

β_n = √λ S + √( 2 log(1/δ) + log( det V_{n−1}(λ) / λ^d ) )
  ≤ √λ S + √( 2 log(1/δ) + d log( (λd + nL²) / (λd) ) ) ,

from ‖A_s‖ ≤ L and log det V ≤ d log(trace(V)/d).

⇒ R_n^pseudo ≤ C₁ d √n log(n) + C₂ √( nd log(1/δ) ) + C₃ .

Page 49:

Summary

R_n^pseudo ≤ C₁ d √n log(n) + C₂ √( nd log(1/δ) ) + C₃ .

• Optimism, confidence ellipsoid.
• Getting the ellipsoid is tricky because the bandit algorithm makes (A_s)_s and (η_s)_s interdependent: a “sequential design”.
• Hsu et al. (2012): random design, fixed design.
• Kernelization: one can directly kernelize the proof presented here; see Abbasi-Yadkori (2012). Gaussian process bandits effectively do the same (Srinivas et al., 2010).
• This presentation followed mostly Abbasi-Yadkori et al. (2011); Abbasi-Yadkori (2012).

Page 50:

Outline

Page 51:

Challenge!

The previous bound is O(d√n) even for 𝒜 = {e_1, . . . , e_d}, i.e., finite-armed stochastic bandits, where UCB already achieves O(√(dn log n)).

Can we do better?

Page 52:

Setting

1 Fixed finite action set: the set of actions available in round t is 𝒜 ⊂ R^d with |𝒜| = K for some natural number K.
2 Subgaussian rewards: the reward is X_t = 〈θ∗, A_t〉 + η_t, where η_t is conditionally 1-subgaussian: η_t | F_{t−1} ∼ subG(1), where F_t = σ(A_1, η_1, . . . , A_{t−1}, η_{t−1}, A_t).
3 Bounded mean rewards: Δ_a = max_{b∈𝒜} 〈θ∗, b − a〉 ≤ 1 for all a ∈ 𝒜.

Key difference to the previous setting: finite, fixed action set.

Page 53:

Avoiding Sequential Designs

Recall the result for fixed design: for x ∈ R^d fixed, with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .   (*)

Goal: use this result! How?
Idea: use a phased elimination algorithm!

• 𝒜 = 𝒜₁ ⊃ 𝒜₂ ⊃ 𝒜₃ ⊃ . . .
• In phase ℓ, use actions in 𝒜_ℓ to collect enough data so that, by the end of the phase, the data collected in the phase suffices to rule out all ε_ℓ := 2^{−ℓ}-suboptimal actions.

Which actions to use, and how many times, in phase ℓ?

Page 54:

How to Collect Data?

Recall: if V_t = Σ_{s=1}^t a_s a_s^⊤, then for any x ∈ R^d, with probability 1 − δ,

〈x, θ̂_t − θ∗〉 < √( 2 ‖x‖²_{V_t^{-1}} log(1/δ) ) .   (*)

If we need to know whether 〈x, θ̂_t − θ∗〉 ≤ 2^{−ℓ} for every x ∈ 𝒜, we had better choose a_1, a_2, . . . , a_t so that

max_{x∈𝒜} ‖x‖²_{V_t^{-1}}

is minimized (and make t big enough).
⇒ Experimental design.

Minimizing max_{x∈𝒜} ‖x‖²_{V_t^{-1}} is known as the G-optimal design problem.

Page 55:

Outline

Page 56:

G-optimal Design

Let π : 𝒜 → [0, 1] be a distribution on 𝒜: Σ_{a∈𝒜} π(a) = 1. Define

V(π) = Σ_{a∈𝒜} π(a) a a^⊤ ,   g(π) = max_{a∈𝒜} ‖a‖²_{V(π)^{-1}} .

A G-optimal design π∗ satisfies

g(π∗) = min_π g(π) .

How to use this?

Page 57:

Using a Design π

Given a design π, for a ∈ Supp(π), set

n_a = ⌈ ( π(a) g(π) / ε² ) log(1/δ) ⌉ .

Choose each action a ∈ Supp(π) exactly n_a times. Then

V = Σ_{a∈Supp(π)} n_a a a^⊤ ≥ ( g(π) / ε² ) log(1/δ) V(π) ,

and so for any a ∈ 𝒜, w.p. 1 − δ,

〈θ̂ − θ∗, a〉 ≤ √( ‖a‖²_{V^{-1}} log(1/δ) ) ≤ ε .

How big is n?

n = Σ_{a∈Supp(π)} n_a = Σ_{a∈Supp(π)} ⌈ ( π(a) g(π) / ε² ) log(1/δ) ⌉ ≤ |Supp(π)| + ( g(π) / ε² ) log(1/δ) .

Page 58:

Bounding g(π) and |Supp(π)|

Theorem (Kiefer–Wolfowitz)
The following are equivalent:
1 π∗ is a minimizer of g.
2 π∗ is a minimizer of f(π) = − log det V(π).
3 g(π∗) = d.

Note: designs minimizing f are known as D-optimal designs. Kiefer–Wolfowitz says that G-optimality is the same as D-optimality.
Combining this with John’s theorem for minimum-volume enclosing ellipsoids (John, 1948), an optimal design can be chosen with |Supp(π∗)| ≤ d(d + 3)/2.
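By Kiefer–Wolfowitz, a G-optimal design can be found by minimizing −log det V(π); here is a sketch using the classical Frank–Wolfe (Fedorov–Wynn) iteration with its exact line search. The tolerance and iteration cap are our choices, not from the slides.

```python
import numpy as np

def g_optimal_design(actions, n_iters=1000, tol=1e-3):
    # actions: (K, d) array whose rows span R^d
    K, d = actions.shape
    pi = np.full(K, 1.0 / K)                     # start from the uniform design
    for _ in range(n_iters):
        V = actions.T @ (pi[:, None] * actions)  # V(pi)
        V_inv = np.linalg.inv(V)
        norms = np.einsum("kd,de,ke->k", actions, V_inv, actions)
        k = int(np.argmax(norms))                # worst-covered action
        g = norms[k]
        if g <= d * (1 + tol):                   # KW optimality: g(pi*) = d
            break
        step = (g / d - 1) / (g - 1)             # exact line search for log det
        pi = (1 - step) * pi
        pi[k] += step
    return pi, g
```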

Page 59:

Outline

Page 60:

PEGOE Algorithm¹

Input: 𝒜 ⊂ R^d and δ. Set 𝒜₁ = 𝒜, ℓ = 1, t = 1.

1 Let t_ℓ = t (current round). Find the G-optimal design π_ℓ : 𝒜_ℓ → [0, 1] that maximizes log det V(π_ℓ) subject to Σ_{a∈𝒜_ℓ} π_ℓ(a) = 1.
2 Let ε_ℓ = 2^{−ℓ} and

N_ℓ(a) = ⌈ ( 2π_ℓ(a) / ε_ℓ² ) log( Kℓ(ℓ + 1) / δ ) ⌉   and   N_ℓ = Σ_{a∈𝒜_ℓ} N_ℓ(a) .

3 Choose each action a ∈ 𝒜_ℓ exactly N_ℓ(a) times.
4 Calculate the estimate θ̂_ℓ = V_ℓ^{-1} Σ_{t=t_ℓ}^{t_ℓ+N_ℓ} A_t X_t.
5 Eliminate poor arms:

𝒜_{ℓ+1} = { a ∈ 𝒜_ℓ : max_{b∈𝒜_ℓ} 〈θ̂_ℓ, b − a〉 ≤ 2ε_ℓ } ,

then increment ℓ and return to step 1.

¹Phased Elimination with G-Optimal Exploration
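A sketch of PEGOE, reusing `g_optimal_design` from the earlier sketch; `sample_reward` is a hypothetical environment callback, and the phase lengths follow the slide's formula.

```python
import numpy as np

def pegoe(actions, delta, n_rounds, sample_reward):
    K, d = actions.shape
    active = np.arange(K)
    t, ell = 0, 1
    while t < n_rounds:
        A = actions[active]
        pi, _ = g_optimal_design(A)              # G-optimal design on A_ell
        eps = 2.0 ** (-ell)
        counts = np.ceil(2 * pi / eps**2
                         * np.log(K * ell * (ell + 1) / delta)).astype(int)
        V, b = np.zeros((d, d)), np.zeros(d)
        for a, n_a in zip(A, counts):            # pull each arm N_ell(a) times
            for _ in range(n_a):
                if t >= n_rounds:
                    break
                x = sample_reward(a)
                V += np.outer(a, a)
                b += x * a
                t += 1
        theta_hat = np.linalg.pinv(V) @ b
        est = A @ theta_hat
        active = active[est.max() - est <= 2 * eps]  # keep near-optimal arms
        ell += 1
    return active
```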

Page 61:

The Regret of PEGOE

Theorem
With probability at least 1 − δ, the pseudo-regret of PEGOE is at most

R_n^pseudo ≤ C √( nd log( K log(n) / δ ) ) ,

where C > 0 is a universal constant. If δ = O(1/n), then

E[R_n] ≤ C √( nd log(Kn) )

for an appropriately chosen universal constant C > 0.

Page 62:

Summary and Historical Remarks

• Phased exploration allows one to use methods developed for fixed designs.
• PEGOE: exploration tuned to maximize information gain.
• Finding a (1 + ε)-optimal design is sufficient, and this is a convex problem.
• This algorithm and analysis, in this form, are new.
• “Phased elimination” is well known: Even-Dar et al. (2006) (pure exploration), Auer and Ortner (2010) (finite-armed bandits), Valko et al. (2014) (linear bandits, spanners instead of G-optimality).
• Finite but changing action sets: PEGOE cannot be applied! SupLinRel and SupLinUCB get the same bound (Auer, 2002; Chu et al., 2011). Sadly, these algorithms are very conservative.

Page 63:

Outline

5 Sparse Stochastic Linear Bandits
  Warmup (Hypercube)
  LinUCB with Sparsity
  Confidence Sets & Online Linear Regression
  Summary
6 Minimax Regret
  A Minimax Lower Bound
7 Asymptopia
  Lower Bound
  What About Optimism?
8 Summary

Page 64:

General Setting

1 Sparse parameter: there exist known constants M₀ and M₂ such that ‖θ∗‖₀ ≤ M₀ and ‖θ∗‖₂ ≤ M₂.
2 Bounded mean rewards: 〈a, θ∗〉 ≤ 1 for all a ∈ 𝒜_t and all rounds t.
3 Subgaussian noise: the reward is X_t = 〈A_t, θ∗〉 + η_t, where η_t | F_{t−1} ∼ subG(1) for F_t = σ(A_1, η_1, . . . , A_t, η_t).

Page 65:

The Case of the Hypercube

𝒜 = [−1, 1]^d ,   θ := θ∗ ,   X_t = 〈A_t, θ〉 + η_t .

Assumptions:
1 Bounded mean rewards: ‖θ‖₁ ≤ 1, which ensures that |〈a, θ〉| ≤ 1 for all a ∈ 𝒜.
2 Subgaussian noise: the reward is X_t = 〈A_t, θ〉 + η_t, where η_t | F_{t−1} ∼ subG(1) for F_t = σ(A_1, η_1, . . . , A_t, η_t).

Page 66:

Selective Explore-Then-Commit (SETC)

Recall: θ = θ∗.

For any i ∈ [d] such that A_ti is randomized (a uniformly random sign in {−1, +1}):

A_ti ( A_t^⊤ θ + η_t ) = θ_i + [ A_ti Σ_{j≠i} A_tj θ_j + A_ti η_t ] ,

where the bracketed term plays the role of “noise”.

Page 67:

Regret of SETC

Theorem
There exists a universal constant C > 0 such that the regret of SETC satisfies

R_n ≤ 2 ‖θ‖₁ + C Σ_{i:θ_i≠0} log(n) / |θ_i| .

Furthermore, R_n ≤ C ‖θ‖₀ √( n log(n) ).

SETC adapts to ‖θ‖₀!

Page 68:

Outline

Page 69:

(General) LinUCB: Recap

GLinUCB: choose C_t ⊂ R^d and let

A_t = argmax_{a∈𝒜} max_{θ∈C_t} 〈a, θ〉 .

The previous choice leads to regret O(d√n).
How should we choose C_t, knowing that ‖θ∗‖₀ ≤ p, so that the regret gets smaller?

Page 70:

Outline

Page 71:

Online Linear Regression (OLR)

Learner–environment interaction:
1 The environment chooses X_t ∈ R and A_t ∈ R^d in an arbitrary fashion.
2 The value of A_t is revealed to the learner (but not X_t).
3 The learner produces a real-valued prediction X̂_t in some way.
4 The environment reveals X_t to the learner; the loss is (X̂_t − X_t)².

Goal: compete with the total loss of the best linear predictor in some set Θ ⊂ R^d.

Regret against θ ∈ Θ:

ρ_n(θ) = Σ_{t=1}^n (X̂_t − X_t)² − Σ_{t=1}^n (X_t − 〈A_t, θ〉)² .

Page 72:

From OLR to Confidence Sets

Let L be a learner that enjoys a regret guarantee B_n = B_n(A_1, X_1, . . . , A_n, X_n) relative to Θ: for any strategy of the environment,

sup_{θ∈Θ} ρ_n(θ) ≤ B_n .

Combine

ρ_n(θ) = Σ_{t=1}^n (X̂_t − X_t)² − Σ_{t=1}^n (X_t − 〈A_t, θ〉)²

and X_t = 〈A_t, θ∗〉 + η_t to get

Q_t := Σ_{s=1}^t (X̂_s − 〈A_s, θ∗〉)² = ρ_t(θ∗) + 2 Σ_{s=1}^t η_s (X̂_s − 〈A_s, θ∗〉)
  ≤ B_t + 2 Σ_{s=1}^t η_s (X̂_s − 〈A_s, θ∗〉) .

Page 73:

From OLR to Confidence Sets: II

Q_t ≤ B_t + 2Z_t ,   Z_t = Σ_{s=1}^t η_s (X̂_s − 〈A_s, θ∗〉) .   (*)

Goal: bound Z_t for t ≥ 0.
X̂_t, chosen by the OLR learner L, is F_{t−1}-measurable, so

(Z_t − Z_{t−1}) | F_{t−1} ∼ subG(σ_t) ,   where σ_t² = (X̂_t − 〈A_t, θ∗〉)² .

The previous self-normalized bound (**): with probability 1 − δ,

|Z_t| < √( (1 + Q_t) log( (1 + Q_t) / δ² ) ) ,   t = 0, 1, . . . .

Combining with (*) and solving for Q_t:

Q_t ≤ β_t(δ) ,   β_t(δ) = 1 + 2B_t + 32 log( ( √8 + √(1 + B_t) ) / δ ) .

Page 74:

OLR to Confidence Sets: III

Theorem
Let δ ∈ (0, 1) and assume that θ∗ ∈ Θ and sup_{θ∈Θ} ρ_t(θ) ≤ B_t. If

C_{t+1} = { θ ∈ R^d : ‖θ‖₂² + Σ_{s=1}^t (X̂_s − 〈A_s, θ〉)² ≤ M₂² + β_t(δ) } ,

then P(there exists t ∈ N such that θ∗ ∉ C_{t+1}) ≤ δ .

Page 75:

Sparse LinUCB

1: Input: OLR learner L, regret bound B_t, confidence parameter δ ∈ (0, 1)
2: for t = 1, . . . , n:
3:   Receive action set 𝒜_t
4:   Compute the confidence set
       C_t = { θ ∈ R^d : ‖θ‖₂² + Σ_{s=1}^{t−1} (X̂_s − 〈A_s, θ〉)² ≤ M₂² + β_t(δ) }
5:   Calculate the optimistic action A_t = argmax_{a∈𝒜_t} max_{θ∈C_t} 〈a, θ〉
6:   Feed A_t to L and obtain the prediction X̂_t
7:   Play A_t and receive the reward X_t
8:   Feed X_t to L as feedback

Page 76:

Regret of OLR-UCB

Theorem
With probability at least 1 − δ, the pseudo-regret of OLR-UCB satisfies

R_n^pseudo ≤ √( 8dn ( M₂² + β_{n−1}(δ) ) log( 1 + n/d ) ) .

Page 77:

The Regret of OLR-UCB(π)

Theorem (Sparse OLR Algorithm)
There exists a strategy π for the learner such that for any θ ∈ R^d, the regret ρ_n(θ) of π against any strategic environment with max_{t∈[n]} ‖A_t‖₂ ≤ L and max_{t∈[n]} |X_t| ≤ X satisfies

ρ_n(θ) ≤ c X² ‖θ‖₀ ( log(e + n^{1/2}L) + C_n log( 1 + ‖θ‖₁/‖θ‖₀ ) ) + (1 + X²) C_n ,

where c > 0 is a universal constant and C_n = 2 + log₂ log(e + n^{1/2}L).

Corollary
The expected regret of OLR-UCB when using the strategy π from above satisfies R_n = O(√(dpn)).

Page 78:

Outline

Page 79:

Summary

• An OLR algorithm is used inside OLR-UCB to construct the center.
• The regret guarantee of the OLR learner controls the “width” of the confidence ellipsoid.
• Regret: O(√(dpn)), with p known.
• Hypercube: p√n, with p unknown!
• In general, the regret can be as high as Ω(√(pdn)) (p = 1: think of 𝒜 = {e_1, . . . , e_d}).
• Under parameter noise (X_t = 〈A_t, θ∗ + η_t〉), for “rounded” action sets, O(p√n) is possible!
• Very much unlike the “passive” case: a major conflict between exploration and exploitation!

Page 80:

Historical Notes

• The Selective Explore-Then-Commit algorithm is due to Lattimore et al. (2015).
• OLR-UCB is from Abbasi-Yadkori et al. (2012).
• The Sparse OLR algorithm is due to Gerchinovitz (2013).
• Rakhlin and Sridharan (2015) also discuss the relationship between online learning regret bounds and self-normalized tail bounds of the type given here.

Page 81:

Outline

Page 82:

Minimax Lower Bound

Theorem
Let the action set be 𝒜 = {−1, 1}^d and Θ = {−n^{−1/2}, n^{−1/2}}^d. Then for any policy π there exists a θ ∈ Θ such that

R_n^π(𝒜, θ) ≥ C d √n

for some universal constant C > 0.

Page 83:

Some Thoughts

• LinUCB with our confidence-set construction is “nearly” worst-case optimal.
• The theorem is “new”, but the proof is standard; see Shamir (2015).
• Similar results hold for some other action sets: Rusmevichientong and Tsitsiklis (2010) (ℓ₂-ball), Dani et al. (2008) (products of 2D balls).
• Some action sets will have smaller minimax regret! Can you think of one?

Page 84:

Outline

Page 85:

Lower Bound

Setting:
1 Actions: 𝒜 ⊂ R^d finite, K = |𝒜|.
2 Reward: X_t = 〈A_t, θ〉 + η_t, where θ ∈ R^d and (η_t)_t is a sequence of independent standard Gaussian variables.

Regret of policy π:

R_n^π(𝒜, θ) = E_{θ,π}[ Σ_{t=1}^n Δ_{A_t} ] ,   Δ_a = max_{a′∈𝒜} 〈a′ − a, θ〉 .

Recall: a policy π is consistent on some class of bandits E if the regret is subpolynomial for any bandit in that class:

R_n^π(𝒜, θ) = o(n^p) for all p > 0 and θ ∈ R^d .

Page 86:

Lower Bound: II

Theorem
Assume that 𝒜 ⊂ R^d is finite and spans R^d, and suppose π is consistent. Let θ ∈ R^d be any parameter such that there is a unique optimal action, and let G_n = E_{θ,π}[ Σ_{t=1}^n A_t A_t^⊤ ] be the expected Gram matrix. Then lim inf_{n→∞} λ_min(G_n)/log(n) > 0. Furthermore, for any a ∈ 𝒜 it holds that

lim sup_{n→∞} log(n) ‖a‖²_{G_n^{-1}} ≤ Δ_a² / 2 .

Page 87:

Lower Bound: III

Corollary
Let 𝒜 ⊂ R^d be a finite set that spans R^d and let θ ∈ R^d be such that there is a unique optimal action. Then for any consistent policy π,

lim inf_{n→∞} R_n^π(𝒜, θ) / log(n) ≥ c(𝒜, θ) ,

where c(𝒜, θ) is defined as

c(𝒜, θ) = inf_{α∈[0,∞)^𝒜} Σ_{a∈𝒜} α(a) Δ_a
subject to ‖a‖²_{H_α^{-1}} ≤ Δ_a² / 2 for all a ∈ 𝒜 with Δ_a > 0 ,

where H_α = Σ_{a∈𝒜} α(a) a a^⊤.

Page 88:

Outline

Page 89:

Poor Outlook for Optimism

Page 90:

Poor Outlook for Optimism

Actions: 𝒜 = {a₁, a₂, a₃}, with a₁ = e₁, a₂ = e₂, a₃ = (1 − ε, γε), where ε > 0 is small and γ ≥ 1.
Let θ = (1, 0), so a∗ = a₁.
Solving for the lower bound: α(a₂) = 2γ² and α(a₃) = 0, so c(𝒜, θ) = 2γ² and

lim inf_{n→∞} R_n^π(𝒜, θ) / log(n) = 2γ² .

Moreover, for γ large, ε sufficiently small, and π “optimistic”,

lim sup_{n→∞} R_n^π(𝒜, θ) / log(n) = Ω(1/ε) .

Page 91:

Instance-Optimal Asymptotic Regret

Theorem
There exists a policy π that is consistent and satisfies

lim sup_{n→∞} R_n^π(𝒜, θ) / log(n) = c(𝒜, θ) ,

where c(𝒜, θ) was defined in the lower bound.

Page 92:

Illustration: LinUCB

Page 93:

Summary

The instance-optimal regret of consistent algorithms is asymptotically c(𝒜, θ) log(n).

Optimistic algorithms fail to achieve this: their regret can be worse by an arbitrarily large constant factor.

Remember the finite-armed cases:
Case (a): 𝒜_t always has the same number of vectors in it: “finite-armed stochastic contextual bandit”.
Case (b): In addition, 𝒜_t does not change: 𝒜_t = {a_1, . . . , a_K}: “finite-armed stochastic linear bandit”.
Case (c): If the vectors in 𝒜_t are also orthogonal to each other: “finite-armed stochastic bandit”.
Difference between cases (c) and (b):
• Case (c): learn about the mean of arm i ⇔ choose action i;
• Case (b): learn about the mean of arm i ⇔ choose any action j with 〈a_j, a_i〉 ≠ 0.

Page 94:

Departing Thoughts

• These results are from Lattimore and Szepesvári (2016).
• The asymptotically optimal algorithm is given there (the algorithm solves for the optimal allocation while monitoring whether things went wrong).
• Combes et al. (2017) refine the algorithm and generalize it to other settings.
• Soare et al. (2014), in best-arm identification with linear payoff functions, gave essentially the same example that we use to argue for the large regret of optimistic algorithms.
• Open questions:
  • A simultaneously finite-time near-optimal and asymptotically optimal algorithm.
  • Changing, or infinite, action sets?

Page 95:

Summary

Page 96:

Summary of This Talk

• Contextual vs. linear bandits: changing action sets can model contextual bandits.
• Optimistic algorithms:
  • Optimism can achieve minimax optimality.
  • Optimism can be expensive.
  • Optimistic algorithms require a careful design of the underlying confidence sets.
• Sparsity: exploiting sparsity is sometimes at odds with the requirement to collect rewards.

Page 97:

What’s Next for Bandits?

• Today: finite-armed and linear stochastic bandits.
• But bandits come in all forms and shapes!
  • Adversarial (finite, linear, . . . )
  • Combinatorial action sets: from shortest path to ranking
  • Continuous action sets, continuous time, delays
  • Resourceful, nonstationary, various structures (low-rank), . . .
• Nearby problems:
  • Reinforcement learning / Markov decision processes
  • Partial monitoring

Page 98:

Learning Material

• Bandit Visualizer: https://github.com/alexrutar/banditvis
• Online bandit simulator: http://downloads.tor-lattimore.com/bandits/
• Most of this tutorial (and more): http://banditalgs.com
• Book to be published by early next year: looking for reviewers!
• Tor’s lightweight C++ bandit library
• Sébastien Bubeck’s tutorial
  • Blog post 1
  • Blog post 2
• Bubeck and Cesa-Bianchi’s book (Bubeck and Cesa-Bianchi, 2012)

banditalgs.com

Page 99:

References I

Abbasi-Yadkori, Y. (2009). Forced-exploration based algorithms for playing in bandits with large action sets. PhD thesis, University of Alberta.
Abbasi-Yadkori, Y. (2012). Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta.
Abbasi-Yadkori, Y., Antos, A., and Szepesvári, C. (2009). Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback.
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9.
Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320.
Abe, N. and Long, P. M. (1999). Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3–11.
Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part I: IID rewards. IEEE Transactions on Automatic Control, 32(11):968–976.
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65.

Page 100:

References II

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers.
Chu, W., Li, L., Reyzin, L., and Schapire, R. E. (2011). Contextual bandits with linear payoff functions. In AISTATS, volume 15, pages 208–214.
Combes, R., Magureanu, S., and Proutiere, A. (2017). Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems 30, pages 1761–1769. Curran Associates, Inc.
Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. In Proceedings of the Conference on Learning Theory (COLT), pages 355–366.
Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105.
Filippi, S., Cappé, O., Garivier, A., and Szepesvári, Cs. (2009). Parametric bandits: The generalized linear case. In NIPS-22, pages 586–594.
Freedman, D. (1975). On tail probabilities for martingales. The Annals of Probability, 3(1):100–118.
Gerchinovitz, S. (2013). Sparsity regret bounds for individual sequences in online linear regression. Journal of Machine Learning Research, 14(Mar):729–769.
Hsu, D., Kakade, S. M., and Zhang, T. (2012). Random design analysis of ridge regression. In Conference on Learning Theory, pages 9.1–9.24.

Page 101:

References III

John, F. (1948). Extremum problems with inequalities as subsidiary conditions. Courant Anniversary Volume, Interscience.
Lattimore, T., Crammer, K., and Szepesvári, C. (2015). Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972.
Lattimore, T. and Munos, R. (2014). Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, pages 550–558.
Lattimore, T. and Szepesvári, C. (2016). The end of optimism? An asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491.
Peña, V. H., Lai, T. L., and Shao, Q.-M. (2008). Self-Normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media.
Rakhlin, A. and Sridharan, K. (2015). On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv preprint arXiv:1510.03925.
Robbins, H. and Siegmund, D. (1970). Boundary crossing probabilities for the Wiener process and sample sums. Annals of Mathematical Statistics, 41:1410–1429.
Robbins, H. and Siegmund, D. (1971). A convergence theorem for non-negative almost supermartingales and some applications. In Rustagi, J., editor, Optimizing Methods in Statistics, pages 235–257. Academic Press, New York.
Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.
Russo, D. and Van Roy, B. (2013). Eluder dimension and the sample complexity of optimistic exploration. In NIPS, pages 2256–2264.

Page 102:

References IV

Shamir, O. (2015). On the complexity of bandit linear optimization. In Conference on Learning Theory, pages 1523–1551.
Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. W. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, pages 1015–1022.
Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869.
Valko, M., Munos, R., Kveton, B., and Kocák, T. (2014). Spectral bandits for smooth graph functions. In International Conference on Machine Learning, pages 46–54.