
Solutions to Selected Exercises in Bandit Algorithms

Tor Lattimore and Csaba Szepesvári

Draft of Friday 11th September, 2020

Contents

2 Foundations of Probability (2.1, 2.3, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10, 2.11, 2.12, 2.14, 2.15, 2.16, 2.18, 2.19)

3 Stochastic Processes and Markov Chains (3.1, 3.5, 3.8, 3.9)

4 Stochastic Bandits (4.9, 4.10)

5 Concentration of Measure (5.10, 5.12, 5.13, 5.14, 5.15, 5.16, 5.17, 5.18, 5.19)

6 The Explore-then-Commit Algorithm (6.2, 6.3, 6.5, 6.6, 6.8)

7 The Upper Confidence Bound Algorithm (7.1)

8 The Upper Confidence Bound Algorithm: Asymptotic Optimality (8.1)

9 The Upper Confidence Bound Algorithm: Minimax Optimality (9.1, 9.4)

10 The Upper Confidence Bound Algorithm: Bernoulli Noise (10.1, 10.3, 10.4, 10.5)

11 The Exp3 Algorithm (11.2, 11.5, 11.6, 11.7)

12 The Exp3-IX Algorithm (12.1, 12.4)

13 Lower Bounds: Basic Ideas (13.2)

14 Foundations of Information Theory (14.4, 14.10, 14.11)

15 Minimax Lower Bounds (15.1)

16 Instance-Dependent Lower Bounds (16.2, 16.7)

17 High-Probability Lower Bounds (17.1)


18 Contextual Bandits (18.1, 18.6, 18.7, 18.8, 18.9)

19 Stochastic Linear Bandits (19.3, 19.4, 19.5, 19.6, 19.7, 19.8)

20 Confidence Bounds for Least Squares Estimators (20.1, 20.2, 20.3, 20.4, 20.5, 20.8, 20.9, 20.10, 20.11)

21 Optimal Design for Least Squares Estimators (21.1, 21.2, 21.3, 21.5)

22 Stochastic Linear Bandits for Finitely Many Arms

23 Stochastic Linear Bandits with Sparsity (23.2)

24 Minimax Lower Bounds for Stochastic Linear Bandits (24.1)

25 Asymptotic Lower Bounds for Stochastic Linear Bandits (25.3)

26 Foundations of Convex Analysis (26.2, 26.3, 26.9, 26.13, 26.14, 26.15)

27 Exp3 for Adversarial Linear Bandits (27.1, 27.4, 27.6, 27.8, 27.9, 27.11)

28 Follow-the-Regularised-Leader and Mirror Descent (28.1, 28.5, 28.10, 28.11, 28.12, 28.13, 28.14, 28.15, 28.16, 28.17)

29 The Relation between Adversarial and Stochastic Linear Bandits (29.2, 29.4)

30 Combinatorial Bandits (30.4, 30.5, 30.6, 30.8)

31 Non-stationary Bandits (31.1, 31.3)

32 Ranking (32.2, 32.6)

33 Pure Exploration (33.3, 33.4, 33.5, 33.6, 33.7, 33.9)

34 Foundations of Bayesian Learning (34.4, 34.5, 34.13, 34.14, 34.15, 34.16)

35 Bayesian Bandits (35.1, 35.2, 35.3, 35.6, 35.7)

36 Thompson Sampling (36.3, 36.5, 36.6, 36.13)

37 Partial Monitoring (37.3, 37.10, 37.12, 37.13, 37.14)

38 Markov Decision Processes (38.2, 38.4, 38.5, 38.7, 38.8, 38.9, 38.10, 38.11, 38.12, 38.13, 38.14, 38.15, 38.16, 38.17, 38.19, 38.21, 38.22, 38.23, 38.24)


Bibliography



Chapter 2 Foundations of Probability

2.1 Let h = g ∘ f and let A ∈ H. We need to show that h−1(A) ∈ F. We claim that h−1(A) = f−1(g−1(A)). Because g is G/H-measurable, g−1(A) ∈ G, and because f is F/G-measurable, f−1(g−1(A)) ∈ F, which completes the proof once the claim is established. To show the claim we prove two-sided containment. For h−1(A) ⊂ f−1(g−1(A)), let x ∈ h−1(A). Then h(x) = g(f(x)) ∈ A. Hence f(x) ∈ g−1(A) and thus x ∈ f−1(g−1(A)). For the other direction let x ∈ f−1(g−1(A)). This implies that f(x) ∈ g−1(A), which implies that h(x) = g(f(x)) ∈ A, that is, x ∈ h−1(A).

2.3 Since X(u) ∈ V for all u ∈ U we have X−1(V) = U. Therefore U ∈ ΣX. Suppose that U ∈ ΣX. Then by definition there exists a V ∈ Σ such that X−1(V) = U. Because Σ is a σ-algebra we have V^c ∈ Σ, and by the definition of ΣX we have U^c = X−1(V^c) ∈ ΣX. Therefore ΣX is closed under complements. Finally, let (Ui)i be a countable sequence with Ui ∈ ΣX and choose Vi ∈ Σ with Ui = X−1(Vi). Then ⋃i Ui = X−1(⋃i Vi) ∈ ΣX, which means that ΣX is closed under countable unions, and the proof is complete.

2.5

(a) Let A be the set of all σ-algebras that contain G and define

F∗ = ⋂_{F ∈ A} F .

We claim that F∗ is the smallest σ-algebra containing G. Clearly F∗ contains G and is contained in every σ-algebra containing G. It remains to show that F∗ is a σ-algebra. Since Ω ∈ F for all F ∈ A it follows that Ω ∈ F∗. Now suppose that A ∈ F∗. Then A ∈ F for all F ∈ A and hence A^c ∈ F for all F ∈ A. Therefore A^c ∈ F∗ and F∗ is closed under complements. Finally, suppose that (Ai)i is a countable family in F∗. Then for each F ∈ A all the Ai lie in F, so ⋃i Ai ∈ F for all F ∈ A and hence ⋃i Ai ∈ F∗. Therefore F∗ is a σ-algebra.

(b) Define H = {A : X−1(A) ∈ F}. Then Ω ∈ H, and for A ∈ H we have X−1(A^c) = X−1(A)^c ∈ F, so A^c ∈ H. Furthermore, for (Ai)i with Ai ∈ H we have

X−1(⋃i Ai) = ⋃i X−1(Ai) ∈ F .

Therefore H is a σ-algebra and by definition σ(G) ⊆ H. Now for any A ∈ H we have X−1(A) ∈ F by the definition of H, and therefore X−1(A) ∈ F for all A ∈ σ(G).

(c) We need to show that I_A^{−1}(B) ∈ F for all B ∈ B(R). There are four cases. If {0, 1} ⊆ B, then I_A^{−1}(B) = Ω ∈ F. If 1 ∈ B and 0 ∉ B, then I_A^{−1}(B) = A ∈ F. If 0 ∈ B and 1 ∉ B, then I_A^{−1}(B) = A^c ∈ F. Finally, if {0, 1} ∩ B = ∅, then I_A^{−1}(B) = ∅ ∈ F. Therefore I_A is F-measurable.

2.6 Trivially, σ(X) = {∅, R}. Hence Y is not σ(X)/B(R)-measurable because Y−1([0, 1]) = [0, 1] ∉ σ(X).

2.7 First P(∅ | B) = P(∅ ∩ B)/P(B) = 0 and P(Ω | B) = P(Ω ∩ B)/P(B) = 1. Let (Ei)i be a countable collection of disjoint sets with Ei ∈ F. Then

P(⋃i Ei | B) = P(B ∩ ⋃i Ei)/P(B) = P(⋃i (B ∩ Ei))/P(B) = ∑i P(B ∩ Ei)/P(B) = ∑i P(Ei | B) .

Therefore P(· | B) satisfies the countable additivity property and the proof is complete.

2.8 Using the definition of conditional probability and the assumption that P(A) > 0 and P(B) > 0 we have

P(A | B) = P(A ∩ B)/P(B) = P(B | A) P(A)/P(B) .

2.9 For part (a),

P(X1 < 2 | X2 is even) = P(X1 < 2 and X2 is even)/P(X2 is even) = (3/6²)/(18/6²) = 1/6 = P(X1 < 2) .

Therefore {X1 < 2} is independent of {X2 is even}. For part (b) note that σ(X1) = {C × [6] : C ∈ 2^[6]} and σ(X2) = {[6] × C : C ∈ 2^[6]}. It follows that |A ∩ B| = |A||B|/6² for A ∈ σ(X1) and B ∈ σ(X2), and so

P(A | B) = P(A ∩ B)/P(B) = (|A ∩ B|/6²)/(|B|/6²) = |A|/6² = P(A) .

Therefore A and B are independent.
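The counting in part (a) is easy to confirm by brute force. The following check is ours, not part of the original solution (the helper `prob` is a hypothetical name); it enumerates all 36 equally likely outcomes of two fair dice:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two independent fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event, given as a predicate on (x1, x2)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_joint = prob(lambda o: o[0] < 2 and o[1] % 2 == 0)   # = 3/36
p_even = prob(lambda o: o[1] % 2 == 0)                 # = 18/36
p_x1_lt_2 = prob(lambda o: o[0] < 2)                   # = 6/36

# Conditioning on {X2 even} does not change P(X1 < 2): independence.
assert p_joint / p_even == p_x1_lt_2 == Fraction(1, 6)
```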

2.10

(a) Let A ∈ F . Then P (A ∩ Ω) = P (A) = P (A)P (Ω) and P (A ∩ ∅) = 0 = P (∅)P (A). Intuitively,Ω and ∅ happen surely/never respectively, so the occurrence or not of any other event cannotalter their likelihood.

(b) Let A ∈ F satisfy P(A) = 1 and let B ∈ F be arbitrary. Then P(B ∩ A^c) ≤ P(A^c) = 0. Therefore P(A ∩ B) = P(A ∩ B) + P(A^c ∩ B) = P(B) = P(A)P(B). When P(A) = 0 we have P(A ∩ B) ≤ P(A) = 0 = P(A)P(B).

(c) If A and A^c are independent, then 0 = P(∅) = P(A ∩ A^c) = P(A)P(A^c) = P(A)(1 − P(A)). Therefore P(A) ∈ {0, 1}. This makes sense because knowledge of A provides knowledge of A^c, so the two events can only be independent if A occurs with probability zero or one.


(d) If A is independent of itself, then P(A) = P(A ∩ A) = P(A)². Therefore P(A) ∈ {0, 1} as before. The intuition is the same as in the previous part.

(e) Ω = {(0, 0), (0, 1), (1, 0), (1, 1)} and F = 2^Ω. The independent pairs are:

Ω and A for all A ∈ F (16 pairs)
∅ and A for all A ∈ F \ {Ω} (15 pairs)
{(1, 0), (1, 1)} and {(0, 0), (1, 0)}
{(1, 0), (1, 1)} and {(0, 1), (1, 1)}
{(0, 0), (0, 1)} and {(0, 0), (1, 0)}
{(0, 0), (0, 1)} and {(0, 1), (1, 1)}

(f) P(X1 ≤ 2, X1 = X2) = P(X1 = X2 = 1) + P(X1 = X2 = 2) = 2/9 = P(X1 ≤ 2)P(X1 = X2), because P(X1 ≤ 2) = 2/3 and P(X1 = X2) = 1/3.

(g) If A and B are independent, then |A ∩ B|/n = P(A ∩ B) = P(A)P(B) = |A||B|/n². Rearranging shows that n|A ∩ B| = |A||B|. All steps can be reversed, showing the reverse direction.

(h) Assume that n is prime. By the previous part, n|A ∩ B| = |A||B| must hold if A and B are independent. If |A ∩ B| = 0, then independence forces |A||B| = 0, so one of the events is empty and hence trivial. Assume therefore that |A ∩ B| > 0. Since n is prime, it must then be a factor of |A| or of |B|. Without loss of generality, assume that it is a factor of |A|. This implies n ≤ |A|. But |A| ≤ n also holds, hence |A| = n, i.e., A is a trivial event.

(i) Let X1 and X2 be independent Rademacher random variables and X3 = X1X2. Clearly these random variables are not mutually independent, since X3 takes multiple values with nonzero probability and is fully determined by X1 and X2. And yet X3 and Xi are independent for i ∈ {1, 2}, so pairwise independence holds.

(j) No. Let Ω = [6], F = 2^Ω and let P be the uniform measure. Define events A = {1, 3, 4}, B = {1, 3, 5} and C = {3, 4, 5, 6}. Then A and B are clearly dependent and yet

P(A ∩ B ∩ C) = P(A ∩ B | C) P(C) = P(A) P(B) P(C) .
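The counterexample in part (j) can be verified directly. The snippet below is our own illustrative check, not part of the text; it confirms both the dependence of A and B and the triple product identity:

```python
from fractions import Fraction

A, B, C = {1, 3, 4}, {1, 3, 5}, {3, 4, 5, 6}  # events in Omega = [6]

def P(E):
    """Uniform measure on {1, ..., 6}."""
    return Fraction(len(E), 6)

# A and B are dependent: P(A ∩ B) = 1/3 while P(A)P(B) = 1/4 ...
assert P(A & B) != P(A) * P(B)
# ... yet the triple product identity still holds (both sides equal 1/6).
assert P(A & B & C) == P(A) * P(B) * P(C)
```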

2.11

(a) σ(X) = {∅, Ω} is trivial. Let Y be another random variable. Then X and Y are independent if and only if for all A ∈ σ(X) and B ∈ σ(Y) it holds that P(A ∩ B) = P(A)P(B), which is trivially true when A ∈ {∅, Ω}.

(b) Let A be an event with P(A) > 0. Then P(X = x | A) = P({X = x} ∩ A)/P(A) = 1 = P(X = x). Similarly, P(X ≠ x | A) = 0 = P(X ≠ x). Therefore X is independent of all events, including those generated by Y.


(c) Suppose that A and B are independent. Then P(A^c | B) = 1 − P(A | B) = 1 − P(A) = P(A^c). Therefore A^c and B are independent, and by the same argument so are A^c and B^c, as well as A and B^c. The 'if' direction follows by noting that σ(X) = {∅, A, A^c, Ω} and σ(Y) = {∅, B, B^c, Ω} and recalling that every event is independent of Ω and of the empty set. For the 'only if' direction note that independence of X and Y means that any pair of events taken from σ(X) × σ(Y) are independent, which by the above includes the pair (A, B).

(d) Let (Ai)i be a countable family of events and Xi(ω) = I{ω ∈ Ai} be the indicator of the i-th event. When the random variables/events are pairwise independent, the above argument goes through unchanged for each pair. In the case of mutual independence the 'only if' direction is again the same. For the 'if' direction, suppose that (Ai)i are mutually independent. Then for any finite subset K ⊂ N we have

P(⋂_{i∈K} Ai) = ∏_{i∈K} P(Ai) .

The same argument as in the previous part shows that for any disjoint finite sets K, J ⊂ N we have

P(⋂_{i∈K} Ai ∩ ⋂_{i∈J} A_i^c) = ∏_{i∈K} P(Ai) ∏_{i∈J} P(A_i^c) .

Therefore for any finite set K ⊂ N and (Vi)_{i∈K} with Vi ∈ σ(Xi) = {∅, Ai, A_i^c, Ω} it holds that

P(⋂_{i∈K} Vi) = ∏_{i∈K} P(Vi) ,

which completes the proof that (Xi)i are mutually independent.

2.12

(a) Let A ⊂ R be an open set. By definition, since f is continuous it holds that f−1(A) is open.But the Borel σ-algebra is generated by all open sets and so f−1(A) ∈ B(R) as required.

(b) Since | · | : R→ R is continuous and by definition a random variable X on measurable space(Ω,F) is F/B(R)-measurable it follows by the previous part that |X| is F/B(R)-measurableand therefore a random variable.

(c) Recall that (X)+ = max{0, X} and (X)− = −min{0, X}. Therefore (|X|)+ = |X| = (X)+ + (X)− and (|X|)− = 0. Recall that E[X] = E[(X)+] − E[(X)−] exists if and only if both expectations are finite. Therefore if X is integrable, then |X| is integrable because E[|X|] = E[(X)+] + E[(X)−]. Conversely, if |X| is integrable, then X is integrable because (X)+ ≤ |X| and (X)− ≤ |X| and expectation is monotone.

2.14 Assume without (much) loss of generality that Xi ≥ 0 for all i. The general case follows by


considering positive and negative parts, as usual. First we claim that for any n it holds that

E[∑_{i=1}^n Xi] = ∑_{i=1}^n E[Xi] .

To show this, note that by definition

E[∑_{i=1}^n Xi] = sup{∫_Ω h dP : h is simple and 0 ≤ h ≤ ∑_{i=1}^n Xi} = ∑_{i=1}^n sup{∫_Ω h dP : h is simple and 0 ≤ h ≤ Xi} .

Next let Sn = ∑_{i=1}^n Xi and X = ∑_{i=1}^∞ Xi, and note that by the monotone convergence theorem lim_{n→∞} E[Sn] = E[X], which means that

E[∑_{i=1}^∞ Xi] = lim_{n→∞} E[Sn] = lim_{n→∞} ∑_{i=1}^n E[Xi] = ∑_{i=1}^∞ E[Xi] .

2.15 Suppose that X(ω) = ∑_{i=1}^n αi I{ω ∈ Ai} is simple and c > 0. Then cX is also simple and

E[cX] = ∑_{i=1}^n c αi P(Ai) = c ∑_{i=1}^n αi P(Ai) = c E[X] .

Now suppose that X is nonnegative (but maybe not simple) and c > 0. Then cX is also nonnegative and

E[cX] = sup{E[h] : h is simple and h ≤ cX}
= sup{E[ch] : h is simple and h ≤ X}
= sup{c E[h] : h is simple and h ≤ X}
= c E[X] .

Finally for arbitrary random variables and c > 0 we have

E[cX] = E[(cX)+]− E[(cX)−] = cE[(X)+]− cE[(X)−] = cE[X] .

For negative c simply note that (cX)+ = −c(X)− and (cX)− = −c(X)+ and repeat the above argument.


2.16 Suppose X = ∑_{i=1}^N αi I{Ai} and Y = ∑_{j=1}^N βj I{Bj} are simple functions with Ai ∈ σ(X) and Bj ∈ σ(Y). Then

E[XY] = E[(∑_{i=1}^N αi I{Ai})(∑_{j=1}^N βj I{Bj})]
= ∑_{i=1}^N ∑_{j=1}^N αi βj P(Ai ∩ Bj)
= ∑_{i=1}^N ∑_{j=1}^N αi βj P(Ai) P(Bj)
= E[X] E[Y] ,

where the third equality used the independence of X and Y.

Now suppose that X and Y are arbitrary non-negative independent random variables. Then

E[XY] = sup{E[h] : h is simple and h ≤ XY}
= sup{E[hg] : h, g simple, h σ(X)-measurable, g σ(Y)-measurable, h ≤ X, g ≤ Y}
= sup{E[h]E[g] : h, g simple, h σ(X)-measurable, g σ(Y)-measurable, h ≤ X, g ≤ Y}
= sup{E[h] : h σ(X)-measurable simple, h ≤ X} sup{E[g] : g σ(Y)-measurable simple, g ≤ Y}
= E[X] E[Y] .

Finally, for arbitrary random variables we have via the previous display and the linearity of expectation that

E[XY] = E[((X)+ − (X)−)((Y)+ − (Y)−)]
= E[(X)+(Y)+] − E[(X)+(Y)−] − E[(X)−(Y)+] + E[(X)−(Y)−]
= E[(X)+]E[(Y)+] − E[(X)+]E[(Y)−] − E[(X)−]E[(Y)+] + E[(X)−]E[(Y)−]
= E[(X)+ − (X)−] E[(Y)+ − (Y)−] = E[X] E[Y] .

2.18 Let X be a standard Rademacher random variable and Y = X. Then E[X]E[Y ] = 0 andE[XY ] = 1.
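Both the factorisation in 2.16 and the counterexample in 2.18 are easy to confirm by exact enumeration over the Rademacher support. The snippet below is an illustrative check of ours, not part of the solutions:

```python
from fractions import Fraction
from itertools import product

vals = (-1, 1)  # support of a Rademacher variable, each value with probability 1/2

# 2.16: independent X and Y -- the expectation of the product factorises.
E_XY_indep = Fraction(sum(x * y for x, y in product(vals, vals)), 4)
assert E_XY_indep == 0          # equals E[X]E[Y] = 0 * 0

# 2.18: Y = X -- now E[XY] = E[X^2] = 1 while E[X]E[Y] = 0.
E_XY_equal = Fraction(sum(x * x for x in vals), 2)
assert E_XY_equal == 1
```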

2.19 Using the fact that ∫_0^a 1 dx = a for a ≥ 0 and the non-negativity of X we have

X(ω) = ∫_0^∞ I{x ∈ [0, X(ω)]} dx .

Then by Fubini's theorem,

E[X] = E[∫_0^∞ I{x ∈ [0, X(ω)]} dx] = ∫_0^∞ E[I{x ∈ [0, X(ω)]}] dx = ∫_0^∞ P(X ≥ x) dx .


Chapter 3 Stochastic Processes and Markov Chains

3.1

(a) We have F1(x) = I{x ∈ [1/2, 1]} and F2(x) = I{x ∈ [1/4, 2/4) ∪ [3/4, 1]}. More generally, Ft(x) = I{x ∈ Ut} where

Ut = {1} ∪ ⋃_{1 ≤ s ≤ 2^{t−1}} [(2s − 1)/2^t, 2s/2^t) .

Since Ut ∈ B([0, 1]), the Ft are random variables (see Fig. 3.1).

Figure 3.1: Illustration of the events (Ut)_{t=1}^∞.

(b) We have P(Ut) = λ(Ut) = ∑_{s=1}^{2^{t−1}} 2^{−t} = 1/2.

(c) Given a finite index set K ⊂ N+ we need to show that {Fk : k ∈ K} are independent, or equivalently that

P(⋂_{k∈K} Uk) = ∏_{k∈K} P(Uk) = 2^{−|K|} . (3.1)

Let k = max K. Since Uk splits each dyadic interval of length 2^{−(k−1)} into two halves of equal measure, and ⋂_{j∈K∖{k}} Uj is a union of such intervals,

λ(Uk ∩ ⋂_{j∈K∖{k}} Uj) = (1/2) λ(⋂_{j∈K∖{k}} Uj) .

Eq. (3.1) then follows by induction on |K|.
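The measures in parts (b) and (c) can be verified by exact interval arithmetic. The sketch below is our own (the helpers `U`, `measure` and `intersect` are hypothetical names); it checks λ(Ut) = 1/2 and the product rule for small index sets:

```python
from fractions import Fraction

def U(t):
    """Intervals where the t-th binary digit of x in [0, 1) equals one:
    [(2s - 1)/2^t, 2s/2^t) for s = 1, ..., 2^(t-1). (The point {1} in the
    text has Lebesgue measure zero and is ignored here.)"""
    w = Fraction(1, 2 ** t)
    return [((2 * s - 1) * w, 2 * s * w) for s in range(1, 2 ** (t - 1) + 1)]

def measure(intervals):
    """Total length of a list of disjoint half-open intervals."""
    return sum(b - a for a, b in intervals)

def intersect(I, J):
    """Pairwise intersection of two lists of intervals."""
    out = []
    for a, b in I:
        for c, d in J:
            lo, hi = max(a, c), min(b, d)
            if lo < hi:
                out.append((lo, hi))
    return out

# Part (b): lambda(U_t) = 1/2 for every t.
assert all(measure(U(t)) == Fraction(1, 2) for t in range(1, 6))
# Part (c): lambda(U_s ∩ U_t) = 1/4 for s != t, and 1/8 for three indices.
assert measure(intersect(U(1), U(2))) == Fraction(1, 4)
assert measure(intersect(U(2), U(4))) == Fraction(1, 4)
assert measure(intersect(intersect(U(1), U(2)), U(3))) == Fraction(1, 8)
```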

(d) It follows directly from the definition of independence that any subsequence of an independent sequence is also an independent sequence. That P(X_{m,t} = 0) = P(X_{m,t} = 1) = 1/2 follows from Part (b).


(e) By the previous parts, Xm = ∑_{t=1}^∞ X_{m,t} 2^{−t} is a weighted sum of an independent sequence of uniform Bernoulli random variables. Therefore Xm has the same law as Y = ∑_{t=1}^∞ Ft 2^{−t}. But Y(x) = x is the identity. Hence Y is uniformly distributed and so too is Xm.

(f) This follows from the definition of (X_{m,t})_{t=1}^∞ as disjoint subsets of the independent random variables (Ft)_{t=1}^∞ and the 'grouping' result that whenever (Ft)_{t∈T} is a collection of independent σ-algebras and T1, T2 are disjoint subsets of T, then σ(∪_{t∈T1} Ft) and σ(∪_{t∈T2} Ft) are independent [Kallenberg, 2002, Corollary 3.7]. This latter result is a good exercise; use a monotone class argument.

3.5 Let A ∈ G and suppose that X = I_A. Then ∫_X X(x) K(ω, dx) = K(ω, A), which is F-measurable by the definition of a probability kernel. The result extends to simple functions by linearity. For nonnegative X let Xn ↑ X be a monotone increasing sequence of simple functions converging pointwise to X [Kallenberg, 2002, Lemma 1.11]. Then Un(ω) = ∫_X Xn(x) K(ω, dx) is F-measurable, and monotone convergence ensures that

lim_{n→∞} Un(ω) = ∫_X lim_{n→∞} Xn(x) K(ω, dx) = ∫_X X(x) K(ω, dx) = U(ω) .

Since a pointwise limit of measurable functions is measurable, it follows that U is F-measurable. The result for arbitrary X follows by decomposing into positive and negative parts.

3.8 Let (Xt)_{t=0}^n be F = (Ft)_{t=0}^n-adapted and τ = min{t : Xt ≥ ε}. By the submartingale property, E[Xn | Ft] ≥ Xt. Therefore

I{τ = t} Xt ≤ I{τ = t} E[Xn | Ft] = E[I{τ = t} Xn | Ft] ,

and taking expectations, E[Xt I{τ = t}] ≤ E[Xn I{τ = t}]. Hence

P(max_{t ∈ {0,1,...,n}} Xt ≥ ε) = ∑_{t=0}^n P(τ = t)
= ∑_{t=0}^n P(Xt I{τ = t} ≥ ε)
≤ (1/ε) ∑_{t=0}^n E[Xt I{τ = t}] (Markov's inequality)
≤ (1/ε) ∑_{t=0}^n E[Xn I{τ = t}]
≤ E[Xn]/ε .
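The inequality can be stress-tested by exact enumeration for a small nonnegative submartingale, e.g. Xt = St² for a simple random walk (E[X_{t+1} | Ft] = Xt + 1 ≥ Xt). The following check is our own sketch, not part of the solution:

```python
from fractions import Fraction
from itertools import product

n = 8
paths = list(product([-1, 1], repeat=n))  # all 2^8 equally likely sign paths

def check(eps):
    """Exact P(max_t X_t >= eps) and the bound E[X_n]/eps for X_t = S_t^2."""
    hits = 0
    total_xn = Fraction(0)
    for path in paths:
        s = 0
        xs = [0]                 # X_0 = S_0^2 = 0
        for step in path:
            s += step
            xs.append(s * s)     # X_t = S_t^2, a nonnegative submartingale
        if max(xs) >= eps:
            hits += 1
        total_xn += xs[-1]
    p_max = Fraction(hits, len(paths))
    bound = (total_xn / len(paths)) / eps   # E[X_n]/eps; here E[S_n^2] = n
    return p_max, bound

for eps in [1, 4, 9, 16]:
    p_max, bound = check(eps)
    assert p_max <= bound
```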

3.9 Let ΣX (respectively, ΣY) be the σ-algebra underlying X (respectively, Y). It suffices to verify that for


A ∈ ΣX and B ∈ ΣY, P_{(X,Y)}(A × B) = (PY ⊗ PX|Y)(B × A). We have

P_{(X,Y)}(A × B) = P(X ∈ A, Y ∈ B)
= E[E[I{X ∈ A} I{Y ∈ B} | Y]] (tower rule)
= E[I{Y ∈ B} E[I{X ∈ A} | Y]] (I{Y ∈ B} is σ(Y)-measurable)
= E[I{Y ∈ B} P(X ∈ A | Y)] (relation of expectation and probability)
= E[I{Y ∈ B} PX|Y(A | Y)] (definition of PX|Y)
= ∫_B PX|Y(A | y) PY(dy) (pushforward property)
= (PY ⊗ PX|Y)(B × A) . (definition of ⊗)

Chapter 4 Stochastic Bandits

4.9

(a) The statement is true. Let i be a suboptimal arm. By Lemma 4.5 we have

0 = lim_{n→∞} Rn(π, ν)/n = lim sup_{n→∞} ∑_{j=1}^k E[Tj(n)] ∆j / n ≥ lim sup_{n→∞} (E[Ti(n)]/n) ∆i .

Hence lim sup_{n→∞} E[Ti(n)]/n ≤ 0 ≤ lim inf_{n→∞} E[Ti(n)]/n, and so lim_{n→∞} E[Ti(n)]/n = 0 for every suboptimal arm i. Since ∑_{i=1}^k E[Ti(n)]/n = 1, it follows that lim_{n→∞} ∑_{i : ∆i = 0} E[Ti(n)]/n = 1.

(b) The statement is false. Consider a two-armed bandit for which the second arm is suboptimal and an algorithm that chooses the second arm in rounds t ∈ {1, 2, 4, 8, 16, . . .} and the first arm otherwise.

4.10

(a) (Sketch) Fix policy π and horizon n and assume without loss of generality the reward-stack model. We turn policy π into a retirement policy π′ by reshuffling the order in which π uses the two arms during the n rounds, so that if π uses action 1, say, m times out of the n rounds, then π′ uses action 1 in the first m rounds and then switches to arm 2. By the regret decomposition, this suffices to show that π′ achieves no more regret than π (in fact, it achieves the same regret).

So what is policy π′? Policy π′ queries policy π at most n times. If π proposes action 1, π′ plays this action, receives the reward from the environment and feeds the obtained reward back to π. If π proposes action 2, π′ does not play this action for now and just feeds π a zero reward. After π has been queried n times, action 2 is played in the remaining rounds out of the total n rounds.

(b) Assume that arm 1 has a Bernoulli payoff with parameter p ∈ [0, 1] and arm 2 has a fixed payoff of 0.5 (so µ1 = p and µ2 = 0.5). Note that whether π ever retires on these Bernoulli environments depends on whether there exist some t > 0 and x1, . . . , x_{t−1} ∈ {0, 1} such that πt(2 | 1, x1, . . . , 1, x_{t−1}) > 0, or equivalently whether

sup_{t>0} sup_{x1,...,x_{t−1} ∈ {0,1}} πt(2 | 1, x1, . . . , 1, x_{t−1}) > 0 . (4.1)

There are two cases. When (4.1) does not hold, π never plays arm 2 and hence suffers linear regret when p < 0.5. When (4.1) does hold, take t > 0 and x1, . . . , x_{t−1} ∈ {0, 1} such that ρ = πt(2 | 1, x1, . . . , 1, x_{t−1}) > 0 (these must exist), and assume that t is smallest possible, so that πs(1 | 1, x′1, . . . , 1, x′_{s−1}) = 1 for any s < t and x′1, . . . , x′_{s−1} ∈ {0, 1}. Now take an environment with p > 0.5 (so arm 1 is the optimal arm) and let Rn denote the regret of π in this environment. Then, letting ∆ = p − 0.5 > 0, we have

Rn = ∆ E[T2(n)]
≥ ∆ E[I{A1 = 1, X1 = x1, . . . , A_{t−1} = 1, X_{t−1} = x_{t−1}, At = 2} T2(n)]
= ∆ E[I{A1 = 1, X1 = x1, . . . , A_{t−1} = 1, X_{t−1} = x_{t−1}, At = 2} (n − t + 1)]
= ∆ P(A1 = 1, X1 = x1, . . . , A_{t−1} = 1, X_{t−1} = x_{t−1}, At = 2) (n − t + 1)
= ∆ (n − t + 1) ρ ∏_{s=1}^{t−1} p^{xs} (1 − p)^{1−xs}
≥ c(n − t + 1) ,

where c = ∆ρ ∏_{s=1}^{t−1} p^{xs}(1 − p)^{1−xs} > 0. It follows that lim inf_{n→∞} Rn/n ≥ c > 0.

Chapter 5 Concentration of Measure

5.10

(a) The Cramér-Chernoff method gives that for any λ ≥ 0,

P(µ̂n ≥ ε) = P(exp(λnµ̂n) ≥ exp(nλε)) ≤ exp(−nλε) E[exp(λ ∑_{t=1}^n Xt)] = exp(−nλε) MX(λ)^n .

Taking the logarithm of both sides and reordering gives

(1/n) log P(µ̂n ≥ ε) ≤ −(λε − log MX(λ)) .

Since this holds for any λ ≥ 0, taking the supremum over λ ≥ 0 gives the desired inequality (for ε ≥ E[X1] the supremum over all λ ∈ R is attained at some λ ≥ 0, since λ < 0 only makes λε − log MX(λ) smaller, so the restriction is harmless).

(b) Let X be a Rademacher variable. We have ψX(λ) = (1/2)(exp(−λ) + exp(λ)) = cosh(λ). To get the Fenchel dual of log ψX, we find the maximum value of f(λ) = λε − log ψX(λ) = λε − log cosh(λ). We have (d/dλ) log cosh(λ) = (e^λ − e^{−λ})/(e^λ + e^{−λ}) = tanh(λ) ∈ [−1, 1]. Hence, sup_λ f(λ) = +∞ when |ε| > 1. In the other case, the maximum is

ψ*X(ε) = f(tanh^{−1}(ε)) = ε tanh^{−1}(ε) − log cosh(tanh^{−1}(ε)) .

Using tanh^{−1}(ε) = (1/2) log((1+ε)/(1−ε)) we find that e^{tanh^{−1}(ε)} = ((1+ε)/(1−ε))^{1/2} and e^{−tanh^{−1}(ε)} = ((1−ε)/(1+ε))^{1/2}, hence

cosh(tanh^{−1}(ε)) = (1/2)(((1+ε)/(1−ε))^{1/2} + ((1−ε)/(1+ε))^{1/2}) = (1/2) ((1+ε) + (1−ε))/(1−ε²)^{1/2} = 1/√(1−ε²) .

Therefore,

ψ*X(ε) = (ε/2) log((1+ε)/(1−ε)) + (1/2) log(1−ε²) = ((1+ε)/2) log(1+ε) + ((1−ε)/2) log(1−ε) .
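The closed form for ψ*X can be cross-checked by maximising f(λ) = λε − log cosh(λ) numerically. The ternary search below is an illustrative sketch of ours (relying on the concavity of f), not part of the solution:

```python
import math

def psi_star_closed_form(eps):
    """((1+eps)/2) log(1+eps) + ((1-eps)/2) log(1-eps)."""
    return ((1 + eps) / 2 * math.log(1 + eps)
            + (1 - eps) / 2 * math.log(1 - eps))

def psi_star_numeric(eps, lo=-20.0, hi=20.0, iters=300):
    """Maximise the concave f(l) = l*eps - log cosh(l) by ternary search."""
    def f(l):
        return l * eps - math.log(math.cosh(l))
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return f((lo + hi) / 2)

for eps in (0.1, 0.5, 0.9):
    assert abs(psi_star_numeric(eps) - psi_star_closed_form(eps)) < 1e-9
```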

(c) We have f(λ) = λ(p + ε) − log ψX(λ) = λ(p + ε) − log(1 − p + pe^λ). The maximiser of f is λ* = log((1 − p)(p + ε)/(p(1 − (p + ε)))), provided that p + ε < 1. Plugging in this value, after some algebra, gives the desired result ψ*X(ε) = d(p + ε, p). The result also extends to p + ε = 1: in this case f is increasing and lim_{λ→∞} λ − log(1 − p + pe^λ) = lim_{λ→∞} λ − log(pe^λ) = log(1/p) = d(1, p). For ε > 0 such that p + ε > 1, ψ*X(ε) = +∞ because as λ → ∞, λ(p + ε) − log(1 − p + pe^λ) ∼ λ(p + ε − 1) → ∞.

(d) Set σ = 1 for simplicity. We have

MX(λ) = (1/√(2π)) ∫ exp(−(x² − 2λx)/2) dx = (1/√(2π)) ∫ exp(−(x − λ)²/2) exp(λ²/2) dx = exp(λ²/2) .

Hence, f(λ) = λε − log MX(λ) = λε − λ²/2 and sup_λ f(λ) = f(ε) = ε²/2.

(e) We need to calculate lim_{n→∞} (1/n) log(1 − Φ(ε√(n/σ²))). By Eq. (5.3) we have 1 − Φ(x) ≤ √(1/(2πx²)) exp(−x²/2). Further, by Eq. (13.4), 1 − Φ(x) ≥ exp(−x²/2)/(√π (x/√2 + √(x²/2 + 2))). Taking logarithms, plugging in x = ε√(n/σ²), dividing by n and letting n → ∞ gives

lim_{n→∞} (1/n) log(1 − Φ(ε√(n/σ²))) = −ε²/(2σ²) .

When X is a Rademacher random variable, ψ*X(ε) = ((1+ε)/2) log(1+ε) + ((1−ε)/2) log(1−ε) ≥ ε²/2 for any ε ∈ R, with equality holding only at ε = 0. Hence, the question-marked equality cannot hold. (In fact, this is very easy to see also by noting that if X is supported on [−1, 1] then µ̂n ∈ [−1, 1] almost surely and thus P(µ̂n > ε) = 0 for any ε ≥ 1, while the approximation from the CLT gives a positive value: the CLT can significantly overestimate tail probabilities. What goes wrong with the careless application of the strong form of the CLT is that lim_{n→∞} sup_x |fn(x) − f(x)| = 0 does not imply |log fn(xn) − log f(xn)| = o(n) for all choices of xn. For example, one can take f(x) = exp(−x) and fn(x) = exp(−x(1 + 1/n)), so that log fn(x) − log f(x) = −x/n. Then |log fn(n²) − log f(n²)| = n ≠ o(n). The same problem happens in the specific case that was investigated.)

5.12 Part (d) The plots of p 7→ Q(p) and p 7→√p(1− p) are shown below:


[Figure: plots of p ↦ Q(p) and p ↦ √(p(1 − p)) for p ∈ [0, 1].]

As can be seen,√p(1− p) ≤ Q(p) for p ∈ [0, 1].

Part (e): Consider 0 ≤ λ < 4 and λ ≥ 4 separately. In the latter case use λ² ≥ 4λ. For the former case consider the extremes p = 1 and p = 1/2 and then use convexity. The general conclusion is that the subgaussianity constant may be misleadingly large when it comes to studying the tails of distributions: tail bounds (for the upper tail) only need bounds on the MGF for nonnegative values of λ!

5.13

(a) Using linearity of expectation, E[p̂n] = E[∑_{t=1}^n Xt/n] = ∑_{t=1}^n E[Xt]/n = p. Similarly, V[p̂n] = p(1 − p)/n.

(b) The central limit theorem says that

lim_{n→∞} [P(√n(p̂n − p) ≥ x) − P(√n Zn ≥ x)] = 0 for all x ∈ R . (5.1)

(c) This is an empirical question, the solution to which we omit. You should do this calculationdirectly using the binomial distribution.

(d.i) Let d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) be the relative entropy between Bernoulli distributions with means p and q. By large deviation theory (see Exercise 5.10),

P(p̂n ≥ p + ∆) = exp(−n(dBer + εn)) and
P(Zn ≥ p + ∆) = exp(−n(dGauss + ξn)) ,

where dBer = d(p + ∆, p), dGauss = ∆²/(2p(1 − p)), and (εn)_{n=1}^∞ and (ξn)_{n=1}^∞ satisfy lim_{n→∞} εn = 0 and lim_{n→∞} ξn = 0. It should be clear that lim_{δ→0} ni(δ, p, ∆) = ∞ for i ∈ {Ber, Gauss}. Hence, by inverting the above displays we arrive at

ni(δ, p, ∆) = (1/di + o(1)) log(1/δ) , (5.2)


where the o(1) term vanishes as δ tends to zero (see below for the precise argument). Therefore when ∆ = p = 1/10,

lim_{δ→0} nBer(δ, p, ∆)/nGauss(δ, p, ∆) = dGauss/dBer = (∆²/(2p(1 − p)))/d(p + ∆, p) ≈ 1.2512 .

It remains to see the validity of Eq. (5.2). This follows from an elementary but somewhat tedious argument. The precise claim is as follows: let (pn) be a sequence taking values in [0, 1] and n(δ) = min{n ≥ 1 : pn ≤ δ} be such that n(δ) → ∞ as δ → 0 and log(1/pn) = n(d + o(1)). We claim that it follows that n(δ) = (1/d + o(1)) log(1/δ) as δ → 0. To show this it suffices to prove that for any ε > 0 and any δ > 0 small enough, n(δ) ∈ [log(1/δ)/(d + ε), log(1/δ)/(d − ε) + 1]. Fix ε > 0. By our assumption on log(1/pn) there exists some n0 > 0 such that for any n ≥ n0, log(1/pn) ∈ [n(d − ε), n(d + ε)]. Further, by our assumption on n(δ), there exists δ0 > 0 such that for any δ < δ0, n(δ) − 1 ≥ n0. Take some δ < δ0 and let n′ = n(δ). By definition, p_{n′} ≤ δ < p_{n′−1} and hence log(1/p_{n′}) ≥ log(1/δ) > log(1/p_{n′−1}). Since n′ ≥ n′ − 1 ≥ n0, we also have (n′ − 1)(d − ε) ≤ log(1/p_{n′−1}) < log(1/δ) ≤ log(1/p_{n′}) ≤ n′(d + ε), from which it follows that log(1/δ)/(d + ε) ≤ n′ < log(1/δ)/(d − ε) + 1, finishing the proof.
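The numerical value 1.2512 is easy to reproduce. The quick computation below is ours, evaluating dGauss/dBer at ∆ = p = 1/10:

```python
import math

def d(p, q):
    """Bernoulli relative entropy d(p, q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p = 0.1
Delta = 0.1
d_gauss = Delta ** 2 / (2 * p * (1 - p))   # = 1/18
d_ber = d(p + Delta, p)                    # = d(0.2, 0.1)
ratio = d_gauss / d_ber
assert abs(ratio - 1.2512) < 1e-3
```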

(d.ii) The central limit theorem only shows Eq. (5.1). In particular, you cannot choose x to dependon n. A second try is to use Berry-Esseen (Exercise 5.5) which warrants that

|P (pn − p ≥ ∆)− P (Zn ≥ ∆) | = O(1/√n) .

The problem is that this provides very little information in the regime where ∆ is fixed and n tends to infinity, where both probabilities tend to zero exponentially fast and the error term washes away the comparison. In particular, for the inversion process to work one needs nontrivial lower and upper bounds on P(p̂n − p ≥ ∆), and the central limit theorem only asserts that this probability lies in the range [0, O(1/√n)] (irrespective of the values of p and ∆), which does not lead to nontrivial bounds on nBer(δ, p, ∆).

To summarise, the study of ni(δ, p, ∆) as δ tends to zero is a question about the large deviation regime, where the central limit theorem and Berry-Esseen do not provide meaningful information. To make use of them, one needs to choose the deviation level x so that the probability P(√n(p̂n − p) ≥ x) is of a larger magnitude than O(1/√n), which is the range of 'small deviations'.

As an aside, comparisons between normal and binomial distributions have been studied extensively. If you are interested, the most relevant lower bound for this discussion is Slud's inequality [Slud, 1977].

5.14

(a) We have g′(x) = (x²(exp(x) − 1) − 2x(exp(x) − 1 − x))/x⁴, so that x³g′(x) = h(x) = xe^x − 2e^x + 2 + x. We have h′(x) = xe^x − e^x + 1 and h′′(x) = xe^x. Hence, h′ is decreasing on (−∞, 0) and increasing on (0, ∞), and since h′(0) = 0 it follows that h′ ≥ 0 everywhere, so h is increasing. Since h(0) = 0, sign(h(x)) = sign(x) and thus g′(x) > 0 for x ≠ 0.


(b) We have exp(x) = 1 + x + g(x)x². Therefore, E[exp(X)] = 1 + E[g(X)X²] ≤ 1 + E[g(b)X²] = 1 + g(b)V[X], where the inequality used that g is increasing and X ≤ b.

(c) Calculation – left to the reader.

(d) Let Zt = Xt − E[Xt], so that S = ∑_{t=1}^n Zt. By the Cramér-Chernoff method, for any λ ≥ 0,

P(S ≥ ε) ≤ exp(−λε) ∏_{t=1}^n E[exp(λZt)] .

Using E[exp(λZt)] ≤ 1 + g(λb)λ²V[Zt] ≤ exp(g(λb)λ²V[Zt]), we get

P(∑_t Zt ≥ ε) ≤ exp(−λε + g(λb)λ²v) . (5.3)

Differentiation shows that the exponent is minimised by λ = (1/b) log(1 + α), where recall that α = bε/v. Plugging in this value we get (5.10), and then using the bound in Part (c) we get (5.11).

(e) We need to solve δ = exp(−ε²/(2v(1 + bε/(3v)))) for ε ≥ 0. With the abbreviation L = log(1/δ), algebra shows that this is a quadratic equation in ε, namely ε² − (2/3)bLε − 2vL = 0. The positive root is

ε = (1/2)((2/3)bL + √(((2/3)bL)² + 8vL)) .

Hence, with probability 1 − δ, S ≤ ε. Further upper bounding ε using √(|a| + |b|) ≤ √|a| + √|b| gives that with probability 1 − δ, S ≤ (2/3)bL + √(2vL), which is the desired inequality.
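As a sanity check on this inversion (a sketch, not part of the text; the parameter grids below are arbitrary test values), one can verify numerically that the tail bound evaluated at ε = (2/3)bL + √(2vL) never exceeds δ:

```python
import math

def bernstein_tail(eps, b, v):
    # Right-hand side of the Bernstein-type bound from Part (e).
    return math.exp(-eps ** 2 / (2 * v * (1 + b * eps / (3 * v))))

def inverted_eps(b, v, delta):
    # Upper bound on the positive root: (2/3) b L + sqrt(2 v L).
    L = math.log(1 / delta)
    return (2 / 3) * b * L + math.sqrt(2 * v * L)

# Over a grid of parameters, the tail at the inverted epsilon is <= delta,
# because the tail is decreasing in eps and equals delta at the exact root.
for b in (0.5, 1.0, 3.0):
    for v in (0.1, 1.0, 10.0):
        for delta in (0.2, 0.05, 0.001):
            eps = inverted_eps(b, v, delta)
            assert bernstein_tail(eps, b, v) <= delta
print("ok")
```

The assertion holds because the upper bound on ε dominates the exact positive root of the quadratic.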

(f) We start by modifying the Cramér-Chernoff method. In particular, consider the problem of bounding the probability of an event A, where, for a random vector X ∈ R^d and a fixed vector x ∈ R^d, A takes the form A = {X ≥ x}. Notice that for f : R^d → R with f ≥ 0 on A,

P(A) ≤ E[I{A} e^{f(X)}] ≤ E[e^{f(X)}] .

We use this with X = (S, −V) and x = (ε, −v), so that A = {S ≥ ε, V ≤ v} = {X ≥ x}. For λ > 0, letting h(S, V) = λS − g(λb)λ²V, on A we have h(S, V) ≥ h(ε, v), and so f(S, V) = h(S, V) − h(ε, v) ≥ 0 on A and P(A) ≤ e^{−h(ε,v)} E[e^{h(S,V)}].

We have e^{h(S,V)} = U_1 ⋯ U_n, where U_t = e^{λZ_t − λ²g(λb)E_{t−1}[Z_t²]} and Z_t = X_t − µ_t = X_t − E_{t−1}[X_t]. Furthermore, owing to λ > 0, λZ_t ≤ λb, hence

E_{s−1}[U_s] = e^{−λ²g(λb)E_{s−1}[Z_s²]} E_{s−1}[e^{λZ_s}]
≤ e^{−λ²g(λb)E_{s−1}[Z_s²]} (1 + g(λb)E_{s−1}[(λZ_s)²])
≤ e^{−λ²g(λb)E_{s−1}[Z_s²]} e^{λ²g(λb)E_{s−1}[Z_s²]} = 1 ,

and thus

E[e^{h(S,V)}] = E[∏_{t=1}^n U_t] = E[U_1 ⋯ U_{n−1} E_{n−1}[U_n]] ≤ E[U_1 ⋯ U_{n−1}] ≤ ⋯ ≤ 1 .

Thus, P(A) ≤ e^{−h(ε,v)}. Notice that the expression on the right-hand side is the same as in Eq. (5.3), finishing the proof.

5.15 Let α_t = ηE_{t−1}[(X_t − µ_t)²]. We use the Cramér-Chernoff method:

P(∑_{t=1}^n (X_t − µ_t − α_t) ≥ (1/η) log(1/δ)) = P(exp(η ∑_{t=1}^n (X_t − µ_t − α_t)) ≥ 1/δ)
≤ δ E[exp(η ∑_{t=1}^n (X_t − µ_t − α_t))] .

All that remains is to show that the process inside the expectation is a supermartingale. Using the fact that exp(x) ≤ 1 + x + x² for x ≤ 1 and 1 + x ≤ exp(x) for all x ∈ R, we have

E_{t−1}[exp(η(X_t − µ_t − α_t))] = exp(−ηα_t) E_{t−1}[exp(η(X_t − µ_t))]
≤ exp(−ηα_t)(1 + η²E_{t−1}[(X_t − µ_t)²])
≤ exp(−ηα_t + η²E_{t−1}[(X_t − µ_t)²]) = 1 .

Therefore (exp(η ∑_{t=1}^s (X_t − µ_t − α_t)))_{s=0}^n is a supermartingale, which completes the proof of Part (a). The proof of Part (b) follows in the same fashion.

5.16 By assumption, P(X_t ≤ x) ≤ x, which means that for λ ∈ (0, 1),

E[exp(λ log(1/X_t))] = ∫_0^∞ P(exp(λ log(1/X_t)) ≥ x) dx
= 1 + ∫_1^∞ P(X_t ≤ x^{−1/λ}) dx ≤ 1 + ∫_1^∞ x^{−1/λ} dx = 1/(1 − λ) .

Applying the Cramér-Chernoff method,

P(∑_{t=1}^n log(1/X_t) ≥ ε) = P(exp(λ ∑_{t=1}^n log(1/X_t)) ≥ exp(λε))
≤ exp(−λε) E[exp(λ ∑_{t=1}^n log(1/X_t))] ≤ (1/(1 − λ))^n exp(−λε) .

Choosing λ = (ε − n)/ε completes the claim.
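A quick Monte Carlo sanity check (a sketch, not from the text; it takes X_t uniform on (0, 1], the extreme case of the assumption P(X_t ≤ x) ≤ x, with arbitrary n and ε): the empirical tail of ∑_t log(1/X_t) sits below (1/(1 − λ))^n exp(−λε) at λ = (ε − n)/ε.

```python
import math
import random

random.seed(0)

n, eps, trials = 10, 20.0, 100_000
lam = (eps - n) / eps  # the optimising choice from the solution
bound = (1 / (1 - lam)) ** n * math.exp(-lam * eps)

hits = 0
for _ in range(trials):
    # 1 - random() lies in (0, 1], so log(1/u) is always defined.
    s = sum(math.log(1 / (1.0 - random.random())) for _ in range(n))
    hits += s >= eps

emp = hits / trials
assert emp <= bound  # empirical tail below the Chernoff bound
print(f"empirical {emp:.4f} <= bound {bound:.4f}")
```

Here ∑_t log(1/X_t) is Gamma(n, 1)-distributed, so the true tail is far below the bound and the check passes with a wide margin.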

5.17 Let p̂ = (1/n) ∑_{t=1}^n e_{X_t} be the empirical distribution. The 1-norm can be rewritten as

‖p̂ − p‖₁ = max_{λ∈{−1,1}^m} ⟨λ, p̂ − p⟩ .

Next, let λ ∈ {−1, 1}^m be fixed. Then

⟨λ, p̂ − p⟩ = (1/n) ∑_{t=1}^n ⟨λ, e_{X_t} − p⟩ .

Now, |⟨λ, e_{X_t} − p⟩| ≤ ‖λ‖_∞ ‖e_{X_t} − p‖₁ ≤ 2 and E[⟨λ, e_{X_t} − p⟩] = 0. Then, by Hoeffding's bound,

P(⟨λ, p̂ − p⟩ ≥ √((2/n) log(1/δ))) ≤ δ .

Taking a union bound over all λ ∈ {−1, 1}^m shows that

P(max_{λ∈{−1,1}^m} ⟨λ, p̂ − p⟩ ≥ √((2/n) log(2^m/δ))) ≤ δ .
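The bound is easy to probe by simulation (a sketch; the distribution `p`, sample size and δ below are arbitrary choices, not from the text): ‖p̂ − p‖₁ should exceed √((2/n) log(2^m/δ)) in at most a δ fraction of runs.

```python
import math
import random

random.seed(1)

p = [0.5, 0.25, 0.15, 0.1]  # a fixed distribution on m = 4 outcomes
m, n, delta, trials = len(p), 200, 0.05, 2000
threshold = math.sqrt((2 / n) * math.log(2 ** m / delta))

violations = 0
for _ in range(trials):
    counts = [0] * m
    for _ in range(n):
        counts[random.choices(range(m), weights=p)[0]] += 1
    l1 = sum(abs(c / n - q) for c, q in zip(counts, p))
    violations += l1 >= threshold

assert violations / trials <= delta
print(f"violation rate {violations / trials:.4f} <= delta {delta}")
```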

5.18 Let λ > 0. Then

exp(λE[Z]) ≤ E[exp(λZ)] ≤ ∑_{t=1}^n E[exp(λX_t)] ≤ n exp(λ²σ²/2) ,

where the first inequality is Jensen's and the second holds because Z = max_t X_t implies exp(λZ) ≤ ∑_t exp(λX_t). Rearranging shows that

E[Z] ≤ log(n)/λ + λσ²/2 .

Choosing λ = (1/σ)√(2 log(n)) shows that E[Z] ≤ √(2σ² log(n)). For Part (b), a union bound in combination with Theorem 5.3 suffices.
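A numerical illustration (a sketch assuming Z = max_t X_t with X_t independent standard normals, so σ = 1; the sample sizes are arbitrary): the average of the maximum over many trials stays below √(2 log n).

```python
import math
import random

random.seed(2)

n, trials = 1000, 500
bound = math.sqrt(2 * math.log(n))  # sqrt(2 sigma^2 log n) with sigma = 1

avg_max = sum(
    max(random.gauss(0, 1) for _ in range(n)) for _ in range(trials)
) / trials

assert avg_max <= bound  # E[max] is below the subgaussian maximal bound
print(f"average max {avg_max:.3f} <= bound {bound:.3f}")
```

For n = 1000 the expected maximum is around 3.2 while the bound is about 3.7, so the slack of the bound is also visible.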

5.19 Let P be the set of probability measures on ([0, 1], B([0, 1])) and for q ∈ P let µ_q be its mean. The theorem will be established by induction over n. The claim is immediate when x > n or n = 1. Assume that n ≥ 2 and x ∈ (1, n] and that the theorem holds for n − 1. Then

P(∑_{t=1}^n E[X_t | F_{t−1}] ≥ x) = E[P(∑_{t=2}^n E[X_t | F_{t−1}] ≥ x − E[X_1 | F_0] | F_0)]
≤ E[f_{n−1}((x − E[X_1 | F_0])/(1 − X_1))]
= E[E[f_{n−1}((x − E[X_1 | F_0])/(1 − X_1)) | F_0]]
≤ sup_{q∈P} ∫_0^1 f_{n−1}((x − µ_q)/(1 − y)) dq(y) ,

where the first inequality follows from the inductive hypothesis and the fact that ∑_{t=2}^n X_t/(1 − X_1) ≤ 1 almost surely. The result is completed by proving that for all q ∈ P,

F_n(q) := ∫_0^1 f_{n−1}((x − µ_q)/(1 − y)) dq(y) ≤ f_n(x) .   (5.4)

Let q ∈ P have mean µ and let y₀ = max(0, 1 − x + µ). In Lemma 5.1 below it is shown that

f_{n−1}((x − µ)/(1 − y)) ≤ ((1 − y)/(1 − y₀)) f_{n−1}((x − µ)/(1 − y₀)) ,

which after integrating implies that

F_n(q) ≤ ((1 − µ)/(1 − y₀)) f_{n−1}((x − µ)/(1 − y₀)) .

Consider two cases. First, when y₀ = 0 the display shows that F_n(q) ≤ (1 − µ)f_{n−1}(x − µ). On the other hand, if y₀ > 0, then x − 1 < µ ≤ 1 and F_n(q) ≤ (1 − µ)/(x − µ) ≤ (1 − (x − 1))f_{n−1}(x − (x − 1)). Combining the two cases we have

F_n(q) ≤ sup_{µ∈[0,1]} (1 − µ)f_{n−1}(x − µ) = f_n(x) .

Lemma 5.1. Suppose that n ≥ 1, u ∈ (0, n] and y₀ = max(0, 1 − u). Then

f_n(u/(1 − y)) ≤ ((1 − y)/(1 − y₀)) f_n(u/(1 − y₀)) for all y ∈ [0, 1] .

Proof. The lemma is equivalent to the claim that the line connecting (y₀, f_n(u/(1 − y₀))) and (1, 0) lies above f_n(u/(1 − y)) for all y ∈ [0, 1] (see the figure below). This is immediate for n = 1, when f_n(u/(1 − y)) = I{y ≤ 1 − u}. For larger n, basic calculus shows that f_n(u/(1 − y)) is concave as a function of y on [1 − u, 1 − u/n] and

∂_y f_n(u/(1 − y)) |_{y=1−u} = −1/u .

Since f_n(1) = 1, this means that the line connecting (1 − u, 1) and (1, 0) lies above f_n(u/(1 − y)). This completes the proof when y₀ = 1 − u. Otherwise y₀ ∈ [1 − u, 1 − u/n] and the result follows by concavity of f_n(u/(1 − y)) on this interval.

[Figure: f_n(u/(1 − y)) plotted as a function of y ∈ [0, 1], together with the chord from (y₀, f_n(u/(1 − y₀))) to (1, 0), which lies above the curve.]

Chapter 6 The Explore-then-Commit Algorithm

6.2 If ∆ ≤ 1/√n then from R_n ≤ n∆ we get R_n ≤ √n. Now, if ∆ > 1/√n, then

R_n ≤ ∆ + (4/∆)(1 + log⁺(n∆²/4)) ≤ ∆ + 4√n + max_{x>0} (4/x) log⁺(nx²/4) .

A simple calculation shows that max_{x>0} (1/x) log⁺(nx²/4) = √n/e: the maximiser satisfies log(nx²/4) = 2, so x = 2e/√n and the maximum is 2/x = √n/e. Putting things together we get that R_n ≤ ∆ + (4 + 4/e)√n holds no matter the value of ∆ > 0.
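The maximisation used above can be confirmed with a crude grid search (a sketch; the horizon n is an arbitrary test value, and log⁺ is implemented as max(0, log)):

```python
import math

n = 10_000

def objective(x):
    # (1/x) * log^+(n x^2 / 4), with log^+(y) = max(0, log(y)).
    return max(0.0, math.log(n * x * x / 4)) / x

# Grid covering (0.001, 1], containing the predicted maximiser 2e/sqrt(n).
grid_max = max(objective(0.001 + i * 1e-5) for i in range(100_000))
predicted = math.sqrt(n) / math.e  # the value sqrt(n)/e from the solution

assert abs(grid_max - predicted) < 1e-3 * predicted
print(f"grid max {grid_max:.4f} ~ predicted {predicted:.4f}")
```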

6.3 Assume for simplicity that n is even, and let ∆ = max{∆₁, ∆₂} and

m = min{n/2, (4/∆²) log(1/δ)} .

When 2m = n the pseudo-regret is bounded by R_n ≤ m∆. Now suppose that 2m < n. Then

P(T₂(n) > m) ≤ P(µ̂₂(2m) − µ₂ − µ̂₁(2m) + µ₁ ≥ ∆) ≤ exp(−m∆²/4) ≤ δ .

Hence with probability at least 1 − δ the pseudo-regret is bounded by

R_n ≤ m∆ = min{n∆/2, (4/∆) log(1/δ)} .
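The key concentration step can be simulated (a sketch, not from the text: Gaussian arms with unit variance and an arbitrary gap stand in for the 1-subgaussian rewards): the probability that the empirical means are misordered after m pulls of each arm is at most exp(−m∆²/4).

```python
import math
import random

random.seed(3)

gap, m, trials = 1.0, 20, 20_000
bound = math.exp(-m * gap ** 2 / 4)

errors = 0
for _ in range(trials):
    mean1 = sum(random.gauss(0.0, 1.0) for _ in range(m)) / m
    mean2 = sum(random.gauss(-gap, 1.0) for _ in range(m)) / m
    # If arm 2 looks better after exploration, ETC commits to the wrong arm.
    errors += mean2 >= mean1

assert errors / trials <= bound
print(f"error rate {errors / trials:.5f} <= bound {bound:.5f}")
```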

6.5 Slightly abusing notation to reduce clutter, we abbreviate R_n(ν) to R_n and ∆_ν to ∆.

(a) For the first part we need to show that R_n ≤ (∆ + C)n^{2/3} when m = f(n) for a suitably chosen function f and a universal constant C > 0. By Eq. (6.4),

R_n ≤ m∆ + n∆ exp(−m∆²/4) ≤ m∆ + n max_{∆>0} ∆ exp(−m∆²/4) = m∆ + n√(2/m) exp(−1/2) ,

where the equality follows because max_x x exp(−cx²) is attained at x* = √(1/(2c)) and equals √(1/(2c)) exp(−1/2), as a simple calculation shows (so, with c = m/4 and m = n^{2/3}, ∆* = √(4/(2m)) = √(2/m) = √(2/n^{2/3}) = √2 n^{−1/3}). Choosing m = n^{2/3} gives R_n ≤ (∆ + √2 e^{−1/2})n^{2/3}.

That R_n ≤ ∆ + Cn^{2/3} cannot hold follows because m∆/2 ≤ R_n. Hence, if R_n ≤ ∆ + Cn^{2/3} were also true, we would get that m∆/2 ≤ ∆ + Cn^{2/3} holds for any ∆ > 0. Dividing both sides by ∆ and letting ∆ → ∞, this would imply that m ≤ 2. However, if m ≤ 2 then R_n = Ω(n) on some instances: in particular, there is a positive probability that the arm chosen after trying both arms at most twice is the suboptimal arm.

(b) For any fixed m, µ̂_i(2m) − µ_i is √(1/m)-subgaussian. Hence, defining the event G = {|µ̂_i(2m) − µ_i| ≤ √(2 log(n/δ)/m) for i = 1, 2 and m = 1, 2, …, ⌊n/2⌋} and using n ≥ 2⌊n/2⌋ union bounds, we have that P(G) ≥ 1 − δ. Introduce w(m) = √(2 log(n/δ)/m). Let M = min{1 ≤ m ≤ ⌊n/2⌋ : |µ̂₁(2m) − µ̂₂(2m)| > 2w(m)} (note that M = ∞ if the condition is never met). On G, if M < +∞ and, say, 1 = argmax_i µ̂_i(2M), then µ₁ ≥ µ̂₁(2M) − w(M) > µ̂₂(2M) + 2w(M) − w(M) ≥ µ₂, where the first and last inequalities used that we are on G and the middle one used the stopping condition and that at stopping arm one has the highest empirical mean. Hence, R_n ≤ P(G^c)n∆/2 + E[M I{G}]∆/2 ≤ δn + E[M I{G}]∆/2. We now show a bound on M on G. To reduce clutter, assume that µ₁ > µ₂. Assume G holds and let m < M. Then 2w(m) ≥ |µ̂₁(2m) − µ̂₂(2m)| ≥ µ̂₁(2m) − µ̂₂(2m) ≥ (µ₁ − w(m)) − (µ₂ + w(m)) = ∆ − 2w(m). Reordering, we see that 4w(m) ≥ ∆, which, using the definition of w(m), is equivalent to m ≤ (4/∆)²·2 log(n/δ). Hence, on G, M = 1 + max{m : 2w(i) ≥ |µ̂₁(2i) − µ̂₂(2i)| for i = 1, 2, …, m} ≤ 1 + (4/∆)²·2 log(n/δ). Plugging this in and setting δ = 1/n, we get R_n ≤ ∆ + (16/∆) log(n).

(c) In addition to the said inequality, of course, R_n ≤ n∆ also holds. If ∆ ≤ √(log(n)/n), we thus have R_n ≤ n∆ ≤ √(n log(n)). If ∆ > √(log(n)/n), then R_n ≤ ∆ + C√(n log(n)). Combining the inequalities, we have R_n ≤ ∆ + (C ∨ 1)√(n log(n)).

(d) Change the definition of w(m) from Part (b) to w(m) = √(2 log(n/(mδ))/m). Then P(G^c) ≤ (δ/n) ∑_{m=1}^{n/2} m ≤ cδn for a suitable universal constant c > 0. We will choose δ = 1/(cn²) so that P(G^c) ≤ 1/n. Hence, w(m) = √(c′ log(n/m)/m) with a suitable universal constant c′ > 0. With the same reasoning as in Part (b), we find that M ≤ 1 + m*, where m* = max{m ≥ 1 : m ≤ c′ log(n/m)/∆²}. A case analysis then gives that m* ≤ c″ log(e ∨ (∆²n))/∆² for a suitable universal constant c″ > 0. Finishing as in Part (b),

R_n ≤ P(G^c)n∆/2 + E[M I{G}]∆/2 ≤ δn + E[M I{G}]∆/2 ≤ ∆ + c″ log(e ∨ (∆²n))/∆ .

(e) See the solution to Exercise 6.2.

6.6

(a) Let N₀ = 0 and for ℓ ≥ 1 let N_ℓ = min(N_{ℓ−1} + n_ℓ, n) and T_ℓ = {N_{ℓ−1} + 1, …, N_ℓ}. The intervals (T_ℓ)_{ℓ=1}^{ℓmax} are non-overlapping and policy π is used with horizon n_ℓ on interval T_ℓ. Since ν is a stochastic environment,

R_n(π*, ν) = ∑_{ℓ=1}^{ℓmax} R_{|T_ℓ|}(π(n_ℓ), ν) ≤ ∑_{ℓ=1}^{ℓmax} max_{1≤t≤n_ℓ} R_t(π(n_ℓ), ν) ≤ ∑_{ℓ=1}^{ℓmax} f_{n_ℓ}(ν) ,   (6.1)

where the first inequality uses that |T_ℓ| ≤ n_ℓ (in fact, for ℓ < ℓmax, |T_ℓ| = n_ℓ), and the second inequality uses (6.10).

(b) We have ∑_{i=1}^ℓ n_i = ∑_{i=0}^{ℓ−1} 2^i = 2^ℓ − 1, hence ℓmax = ⌈log₂(n + 1)⌉ and 2^{ℓmax} ≤ 2(n + 1). By Eq. (6.1), the assumption on f_n and the choice of (n_ℓ)_ℓ,

R_n(π*, ν) ≤ ∑_{ℓ=1}^{ℓmax} √(2^{ℓ−1}) ≤ (1/(√2 − 1))√(2^{ℓmax}) ≤ (1/(√2 − 1))√(2n)√(1 + 1/n) ≤ 2(1 + √2)√n .

(c) By Eq. (6.1) we have

R_n(π*, ν) ≤ g(ν) ∑_{ℓ=1}^{ℓmax} log(2^{ℓ−1}) = log(2)g(ν) ∑_{ℓ=0}^{ℓmax−1} ℓ = log(2)g(ν)(ℓmax − 1)ℓmax/2 ≤ Cg(ν) log²(n + 1)

with some universal constant C > 0. Hence, the regret significantly worsens in this case. A better choice is n_ℓ = 2^{2^{ℓ−1}}. With this,

R_n(π*, ν) ≤ g(ν) ∑_{ℓ=1}^{ℓmax} log(2^{2^{ℓ−1}}) = log(2)g(ν) ∑_{ℓ=0}^{ℓmax−1} 2^ℓ ≤ log(2)g(ν)2^{ℓmax} ≤ Cg(ν) log(n) ,

with another universal constant C > 0.

(d) The power and advantage of the doubling trick is its generality. It shows that, as long as we do not mind losing a constant factor, under mild conditions adapting to an unknown horizon poses no real challenge. The first disadvantage is that π* loses a constant factor even when there may be no need to lose one. The second disadvantage is that oftentimes one can design algorithms whose immediate expected regret at time t decreases as t increases. This is a highly desirable property that can often be achieved, as was explained in the chapter. Yet, by applying the doubling trick, this monotone decrease of the immediate expected regret is lost, which may be hard to justify to a user.

6.8

(a) Using the definition of the algorithm and concentration for subgaussian random variables:

P(1 ∉ A_{ℓ+1}, 1 ∈ A_ℓ) ≤ P(1 ∈ A_ℓ and there exists i ∈ A_ℓ \ {1} such that µ̂_{i,ℓ} ≥ µ̂_{1,ℓ} + 2^{−ℓ})
= P(1 ∈ A_ℓ and there exists i ∈ A_ℓ \ {1} such that µ̂_{i,ℓ} − µ̂_{1,ℓ} ≥ 2^{−ℓ})
≤ k exp(−m_ℓ 2^{−2ℓ}/4) ,

where in the final inequality we used Part (c) of Lemma 5.4 and Theorem 5.3.

(b) Again, concentration and the definition of the algorithm show that:

P(i ∈ A_{ℓ+1}, 1 ∈ A_ℓ, i ∈ A_ℓ) ≤ P(1 ∈ A_ℓ, i ∈ A_ℓ, µ̂_{i,ℓ} + 2^{−ℓ} ≥ µ̂_{1,ℓ})
= P(1 ∈ A_ℓ, i ∈ A_ℓ, (µ̂_{i,ℓ} − µ_i) − (µ̂_{1,ℓ} − µ₁) ≥ ∆_i − 2^{−ℓ})
≤ exp(−m_ℓ(∆_i − 2^{−ℓ})²/4) .

(c) Let δ ∈ (0, 1) be some constant to be chosen later and

m_ℓ = 2^{4+2ℓ} log(ℓ/δ) .

Then by Part (a),

P(there exists ℓ such that 1 ∉ A_ℓ) ≤ ∑_{ℓ=1}^∞ P(1 ∉ A_{ℓ+1}, 1 ∈ A_ℓ)
≤ k ∑_{ℓ=1}^∞ exp(−m_ℓ 2^{−2ℓ}/4)
≤ kδ ∑_{ℓ=1}^∞ 1/ℓ² = kπ²δ/6 .

Furthermore, by Part (b),

P(i ∈ A_{ℓ_i+1}) ≤ P(i ∈ A_{ℓ_i+1}, i ∈ A_{ℓ_i}, 1 ∈ A_{ℓ_i}) + P(1 ∉ A_{ℓ_i})
≤ exp(−m_{ℓ_i}(∆_i − 2^{−ℓ_i})²/4) + kπ²δ/6
≤ exp(−m_{ℓ_i} 2^{−2ℓ_i}/16) + kπ²δ/6
≤ δ(1 + kπ²/6) .

Choosing δ = n^{−1}(1 + kπ²/6)^{−1} completes the result.

(d) For n < k the result is trivial because each action is tried at most once and hence R_n ≤ ∑_{i:∆_i>0} ∆_i, which is below the desired bound provided that C > 1. Hence, assume that n ≥ k. If there is no suboptimal action, the statement is again trivial. Otherwise, let i be a suboptimal action. Notice that 2^{−ℓ_i} ≥ ∆_i/4 and hence 2^{2ℓ_i} ≤ 16/∆_i². Furthermore, m_ℓ ≥ m₁ ≥ 1 for ℓ ≥ 1. Hence,

E[T_i(n)] ≤ nP(i ∈ A_{ℓ_i+1}) + ∑_{ℓ=1}^{ℓ_i∧n} m_ℓ
≤ 1 + ∑_{ℓ=1}^{ℓ_i∧n} 2^{4+2ℓ} log(n/δ)
≤ 1 + C2^{2ℓ_i} log(nk)
≤ 1 + (16C/∆_i²) log(nk) ,

where C > 1 is a suitably large universal constant derived by naively bounding the logarithmic term and the geometric series. The result follows from upper bounding log(nk) ≤ 2 log(n), which follows from k ≤ n, and from the standard regret decomposition (Lemma 4.5).

(e) Briefly, the idea is to choose

m_ℓ = C2^{2ℓ} log max(e, kn2^{−2ℓ}) ,

where C is a suitably large universal constant chosen so that

P(1 ∉ A_ℓ) ≤ 2^{2ℓ}/n and P(i ∈ A_{ℓ_i+1}) ≤ 2^{2ℓ_i}/n .

From this it follows that

E[T_i(n)] ≤ 16/∆_i² + C ∑_{ℓ=1}^{ℓ_i} 2^{2ℓ} log max(e, kn2^{−2ℓ}) .

Bounding the sum by an integral and some algebraic gymnastics eventually leads to the desired result. Note that you have to justify that the k in the logarithm is a lower-order term; argue by splitting the suboptimal arms into those with ∆_i ≤ √(k/n) and the rest.

(f) Using the analysis in Part (e) and letting ∆ = √(k/n),

R_n = ∑_{i=1}^k ∆_i E[T_i(n)]
≤ n∆ + ∑_{i:∆_i≥∆} (C′/∆_i) log max(e, nk∆_i²)
≤ C″√(nk log(k)) ,

where the constant C′ is derived from the C in Part (e) and the last inequality follows by considering the monotonicity properties of x ↦ (1/x) log max(e, nx²).

Chapter 7 The Upper Confidence Bound Algorithm

7.1

(a) We have

P(µ̂ − µ ≥ √(2 log(1/δ)/T)) = ∑_{n=1}^∞ E[I{T = n} I{∑_{t=1}^n (X_t − µ) ≥ √(2n log(1/δ))}]
= ∑_{n=1}^∞ E[E[I{T = n} I{∑_{t=1}^n (X_t − µ) ≥ √(2n log(1/δ))} | T]]
= ∑_{n=1}^∞ E[I{T = n} E[I{∑_{t=1}^n (X_t − µ) ≥ √(2n log(1/δ))} | T]]
≤ ∑_{n=1}^∞ E[I{T = n} δ]
= δ .

(b) Let T = min{n : ∑_{t=1}^n (X_t − µ) ≥ √(2n log(1/δ))}. By the law of the iterated logarithm, T < ∞ almost surely. The result follows.

(c) Note that

P(µ̂ − µ ≥ √(2 log(T(T + 1)/δ)/T)) ≤ P(there exists n such that ∑_{t=1}^n (X_t − µ) ≥ √(2n log(n(n + 1)/δ)))
≤ ∑_{n=1}^∞ δ/(n(n + 1)) = δ .
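The anytime bound from Part (c) can be probed by simulation (a sketch; independent standard normal increments and a finite horizon stand in for the general 1-subgaussian sequence, so the check covers a necessary consequence of the bound):

```python
import math
import random

random.seed(4)

horizon, delta, trials = 300, 0.1, 2000

violations = 0
for _ in range(trials):
    s = 0.0
    for n in range(1, horizon + 1):
        s += random.gauss(0, 1)  # partial sum of X_t - mu
        if s >= math.sqrt(2 * n * math.log(n * (n + 1) / delta)):
            violations += 1
            break

assert violations / trials <= delta  # the confidence sequence holds
print(f"violation rate {violations / trials:.4f} <= delta {delta}")
```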

Chapter 8 The Upper Confidence Bound Algorithm: Asymptotic Optimality

8.1 Following the hint, F ≤ exp(−a)/(1 − exp(−a)), where a = ε²/2. Rearranging exp(−a)/(1 − exp(−a)) ≤ 1/a gives 1 + a ≤ exp(a), which is well known (and easy to prove). Then

∑_{t=1}^n 1/f(t) ≤ ∑_{t=1}^{20} 1/f(t) + ∫_{20}^∞ dt/f(t) ≤ ∑_{t=1}^{20} 1/f(t) + ∫_{20}^∞ dt/(t log(t)²) = ∑_{t=1}^{20} 1/f(t) + 1/log(20) ≤ 5/2 .
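Numerically the series indeed stays below 5/2; the concrete form f(t) = 1 + t log²(t) used below is an assumption recalled from the chapter rather than stated in this solution:

```python
import math

def f(t):
    # Assumed form from the chapter: f(t) = 1 + t log^2(t).
    return 1 + t * math.log(t) ** 2

partial = sum(1 / f(t) for t in range(1, 1_000_000))
tail = 1 / math.log(1_000_000)  # integral bound on all remaining terms

assert partial + tail <= 5 / 2
print(f"sum bounded by {partial + tail:.3f} <= 2.5")
```

The check is tight: the total comes out just under 5/2, matching how sharp the solution's estimate is.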

Chapter 9 The Upper Confidence Bound Algorithm: Minimax Optimality

9.1 Clearly (M_t)_t is F-adapted. By Jensen's inequality and convexity of the exponential function,

E[M_t | F_{t−1}] = exp(λ ∑_{s=1}^{t−1} X_s) E[exp(λX_t) | F_{t−1}]
≥ exp(λ ∑_{s=1}^{t−1} X_s) exp(λE[X_t | F_{t−1}])
= exp(λ ∑_{s=1}^{t−1} X_s) = M_{t−1} a.s. ,

where the last equality uses that E[X_t | F_{t−1}] = 0. Hence (M_t)_t is an F-submartingale.

9.4

(i) Consider the policy that plays each arm once and subsequently chooses

A_t = argmax_{i∈[k]} µ̂_i(t − 1) + √((2/T_i(t − 1)) log(h(n/(T_i(t − 1)k)))) .

The requested experimental validation is omitted from this solution.

(j) The proof more or less follows the proof of Theorem 9.1, but uses the new concentration inequalities. Define the random variable

∆̂ = min{∆ ≥ 0 : µ̂_{1s} ≥ µ₁ − ∆ for all s ∈ [n]} .

Let ε > 0. By Part (g), P(∆̂ ≥ ε) ≤ 5/(4nε²), where we used the fact that 2c/√π ≤ 5/4 when c = 11/10. For a suboptimal arm i define

κ_i = ∑_{s=1}^n I{µ̂_{is} + √((2/s) log(h(n/(ks)))) ≥ µ₁ − ε} .

By the definition of the algorithm, κ_i and ∆̂,

T_i(n) ≤ κ_i + nI{∆̂ ≥ ε} .

The expectation of κ_i is bounded using the same technique as in Theorem 9.1, but the tighter confidence bound leads to an improved bound:

E[κ_i] ≤ 1/∆_i² + ∑_{s=1}^n P(µ̂_{is} + √((2/s) log(h(n∆_i²/k))) ≥ µ₁ − ε) = 2 log(n)/(∆_i − ε)² + o(log(n)) .

Combining the two parts shows that for all ε > 0,

E[T_i(n)] ≤ 2 log(n)/(∆_i − ε)² + o(log(n)) .

The result follows from the fundamental regret decomposition lemma (Lemma 4.5) and by taking the limit as n tends to infinity and ε tends to zero at appropriate rates.

Chapter 10 The Upper Confidence Bound Algorithm: Bernoulli Noise

10.1 Let g be as in the hint. We have

g′(x) = x(1/((p + x)(1 − (p + x))) − 4) .

Clearly, g′(0) = 0. Further, since q(1 − q) ≤ 1/4 for any q ∈ [0, 1], g′(x) ≥ 0 for x > 0 and g′(x) ≤ 0 for x < 0. Hence, g is increasing for positive x and decreasing for negative x. Thus, x = 0 is a minimiser of g. Since g(0) = 0, it follows that g(x) ≥ 0 over [−p, 1 − p].

10.3 We have g(λ, µ) = −λµ + log(1 + µ(e^λ − 1)). Taking derivatives,

(d/dµ)g(λ, µ) = −λ + (e^λ − 1)/(1 + µ(e^λ − 1)) and (d²/dµ²)g(λ, µ) = −(e^λ − 1)²/(1 + µ(e^λ − 1))² ≤ 0 ,

showing that g(λ, ·) is concave, as suggested in the hint. Now let S_p = ∑_{t=1}^p (X_t − µ_t) for p ∈ [n] and let S₀ = 0. Then for p ∈ [n],

E[exp(λS_p)] = E[exp(λS_{p−1}) E[exp(λ(X_p − µ_p)) | F_{p−1}]] ,

and, by Note 2, E[exp(λ(X_p − µ_p)) | F_{p−1}] ≤ exp(g(λ, µ_p)). Hence, using that µ_p is not random,

E[exp(λS_p)] ≤ E[exp(λS_{p−1})] exp(g(λ, µ_p)) .

Chaining these inequalities, using that S₀ = 0 together with the concavity of g(λ, ·), we get

E[exp(λS_n)] ≤ (exp((1/n) ∑_{t=1}^n g(λ, µ_t)))^n ≤ exp(ng(λ, µ̄)) .

Thus,

P(µ̂ − µ̄ ≥ ε) = P(exp(λS_n) ≥ exp(λnε)) ≤ E[exp(λS_n)] exp(−λnε)
≤ (µ̄ exp(λ(1 − µ̄ − ε)) + (1 − µ̄) exp(−λ(µ̄ + ε)))^n .

From this point, repeat the proof of Lemma 10.3 word for word.

10.4 When the exponential family is in canonical form, the mean of P_θ is µ(θ) = E_θ[S] = A′(θ). Since A is strictly convex by the assumption that M is nonsingular, it follows that µ(θ) is strictly increasing and hence invertible. Let µ_sup = sup_{θ∈Θ} µ(θ) and µ_inf = inf_{θ∈Θ} µ(θ) and define

θ(x) = sup Θ if x ≥ µ_sup ; inf Θ if x ≤ µ_inf ; µ^{−1}(x) otherwise .

The function θ is the bridge between the empirical mean and the maximum likelihood estimator of θ. Precisely, let X₁, …, X_n be independent and identically distributed from P_θ and µ̂_n = (1/n) ∑_{t=1}^n X_t. Then, provided that θ̂_n = θ(µ̂_n) ∈ Θ, θ̂_n is the maximum likelihood estimator of θ:

θ̂_n = argmax_{θ∈Θ} ∏_{t=1}^n (dP_θ/dh)(X_t) .

There is an irritating edge case in which µ̂_n does not lie in the range of µ : Θ → R. When this occurs there is no maximum likelihood estimator.

Part I: Algorithm

Define d̄(x, y) = I{x ≤ y} lim_{z↓x} d(z, y) and d̲(x, y) = I{x ≥ y} lim_{z↑x} d(z, y). The algorithm chooses A_t = t for the first k rounds and subsequently A_t = argmax_i U_i(t), where

U_i(t) = sup{θ ∈ Θ : d(θ̂_i(t − 1), θ) ≤ log(f(T_i(t − 1)))/T_i(t − 1)} .

Part II: Concentration

Fix θ ∈ Θ, let X₁, …, X_n be independent random variables sampled from P_θ, and let ŝ_t = (1/t) ∑_{u=1}^t S(X_u) and θ̂_t = θ(ŝ_t). Let θ′ ∈ Θ with θ′ > θ be such that d(θ′, θ) = ε > 0. Then

P(d̄(θ̂_t, θ) ≥ ε) ≤ P(ŝ_t ≥ s(θ′)) ≤ exp(−td(θ′, θ)) = exp(−tε) ,   (10.1)

where the second inequality follows from Part (e) of Exercise 34.5. Similarly, for θ′ < θ with d(θ′, θ) = ε,

P(d̲(θ̂_t, θ) ≥ ε) ≤ exp(−tε) .   (10.2)

Define the random variable τ by

τ = min{t : d(θ̂_s, θ − ε) < log(f(t))/s for all s ∈ [n]} .

In order to bound the expectation of τ we need a connection between d(θ̂_s, θ − ε) and d(θ̂_s, θ). Let x ≤ y − ε and g(z) = d(x, z). Then

g(y) = g(y − ε) + ∫_{y−ε}^y g′(z) dz
= g(y − ε) + ∫_{y−ε}^y (z − x)A″(z) dz
≥ g(y − ε) + inf_{z∈[y−ε,y]} A″(z) ∫_{y−ε}^y (z − x) dz
= g(y − ε) + (1/2) inf_{z∈[y−ε,y]} A″(z) ε(2y − 2x − ε)
≥ g(y − ε) + ε² inf_{z∈[y−ε,y]} A″(z)/2 .

Note that inf_{z∈[y−ε,y]} A″(z) > 0 is guaranteed because A″ is continuous, [y − ε, y] is compact and M was assumed to be nonsingular. Using this, the expectation of τ is bounded by

E[τ] = ∑_{t=1}^n P(τ ≥ t)
≤ ∑_{t=1}^n ∑_{s=1}^n P(d(θ̂_s, θ − ε) ≥ log(f(t))/s)
≤ ∑_{t=1}^n ∑_{s=1}^n P(d(θ̂_s, θ) ≥ ε² inf_{z∈[θ−ε,θ]} A″(z)/2 + log(f(t))/s)
≤ ∑_{t=1}^n ∑_{s=1}^n exp(−s inf_{z∈[θ−ε,θ]} A″(z)ε²/2)/f(t) = O(1) ,   (10.3)

where the last inequality follows from Eq. (10.1) and the final equality is the same calculation as in the proof of Lemma 10.7. Next let

κ = min{s ≥ 1 : θ̂_u − θ < ε for all u ≥ s} .

The expectation of κ is easily bounded using Eq. (10.2),

E[κ] ≤ ∑_{s=1}^n ∑_{u=s}^∞ exp(−ud(θ + ε, θ)) = O(1) ,   (10.4)

where we used the fact that M is nonsingular to ensure strict positivity of the divergences.

Part III: Bounding E[T_i(n)]

For each arm i let θ̂_{is} = θ(µ̂_{is}). Now fix a suboptimal arm i, let ε < (θ₁ − θ_i)/2 and

τ = min{t : d(θ̂_{1s}, θ₁ − ε) < log(f(t))/s for all s ∈ [n]} .

Then define

κ = min{s ≥ 1 : θ̂_{iu} < θ_i + ε for all u ≥ s} .

Then by Eq. (10.3) and Eq. (10.4), E[τ] = O(1) and E[κ] = O(1). Suppose that t ≥ τ, T_i(t − 1) ≥ κ and A_t = i. Then U_i(t) ≥ U₁(t) ≥ θ₁ − ε and hence

d(θ_i + ε, θ₁ − ε) < log(f(n))/T_i(t − 1) .

From this we conclude that

T_i(n) ≤ 1 + τ + κ + log(f(n))/d(θ_i + ε, θ₁ − ε) .

Taking expectations and limits shows that

limsup_{n→∞} E[T_i(n)]/log(n) ≤ 1/d(θ_i + ε, θ₁ − ε) .

Since the above holds for all sufficiently small ε > 0 and the divergence d is continuous, it follows that

limsup_{n→∞} E[T_i(n)]/log(n) ≤ 1/d(θ_i, θ₁)

for all suboptimal arms i. The result follows from the fundamental regret decomposition lemma (Lemma 4.5).

10.5 For simplicity we assume the first arm is uniquely optimal. Define

d̄(x, y) = I{x ≥ y} lim_{z↑x} d(z, y) , d̲(x, y) = I{x ≤ y} lim_{z↓x} d(z, y) .

Let µ(θ) = ∫_R x dP_θ(x) and s(θ) = E_θ[S] = A′(θ), and let S = {s(θ) : θ ∈ Θ}. Define θ : R → cl(Θ) by

θ(x) = s^{−1}(x) if x ∈ S ; sup Θ if x ≥ sup S ; inf Θ if x ≤ inf S .

The algorithm is a generalisation of KL-UCB. Let

ŝ_i(t) = (1/T_i(t)) ∑_{s=1}^t I{A_s = i} S(X_s) and θ̂_i(t) = θ(ŝ_i(t)) ,

which is the empirical estimate of the sufficient statistic. Like UCB, the algorithm plays A_t = t for t ∈ [k] and subsequently A_t = argmax_i U_i(t), where

U_i(t) = sup{µ(θ) : d(θ̂_i(t − 1), θ) ≤ log(f(T_i(t − 1))f(t))/T_i(t − 1)} ,

and ties in the argmax are broken by choosing the arm with the largest number of plays.

Part I: Concentration. Fix θ ∈ Θ, let X₁, …, X_n be independent random variables sampled from P_θ, and let ŝ_t = (1/t) ∑_{u=1}^t S(X_u) and θ̂_t = θ(ŝ_t). Let θ′ ∈ Θ with θ′ > θ be such that d(θ′, θ) = ε > 0. Then

P(d̄(θ̂_t, θ) ≥ ε) ≤ P(ŝ_t ≥ s(θ′)) ≤ exp(−td(θ′, θ)) = exp(−tε) .   (10.5)

Using an identical argument, for θ′ < θ with d(θ′, θ) = ε,

P(d̲(θ̂_t, θ) ≥ ε) ≤ exp(−tε) .   (10.6)

Define the random variable τ by

τ = min{t : d(θ̂_s, θ) < log(f(s)f(t))/s for all s ∈ [n]} .

Then the expectation of τ is bounded by

E[τ] = ∑_{t=1}^n P(τ ≥ t)
≤ ∑_{t=1}^n ∑_{s=1}^n P(d(θ̂_s, θ) ≥ log(f(s)f(t))/s)
≤ ∑_{t=1}^n ∑_{s=1}^n 2/(f(s)f(t)) = O(1) ,   (10.7)

where the final equality follows from Eqs. (10.5) and (10.6) and the same calculation as in the proof of Lemma 10.7. Next let

κ = min{s ≥ 1 : |θ̂_u − θ| < ε for all u ≥ s} .

The expectation of κ is easily bounded using Eq. (10.5) and Eq. (10.6):

E[κ] ≤ ∑_{s=1}^n ∑_{u=s}^∞ (exp(−ud(θ + ε, θ)) + exp(−ud(θ − ε, θ))) = O(1) ,   (10.8)

where we used the fact that M is nonsingular to ensure strict positivity of the divergences.

Part II: Bounding E[T_i(n)]. Choose ε > 0 sufficiently small that for all suboptimal arms i,

sup_{φ∈[θ_i−ε, θ_i+ε]} µ(φ) < µ*

and define

d_{i,min}(ε) = min{d(θ_i + x, φ) : φ ∈ Θ, µ(φ) = µ*, x = ±ε} ,
d_{i,inf}(ε) = inf{d(θ_i + x, φ) : φ ∈ Θ, µ(φ) > µ*, x = ±ε} .

Let θ̂_{is} be the empirical estimate of θ_i based on the first s samples of arm i, which means that θ̂_i(t) = θ̂_{iT_i(t)}. Let τ be the smallest t such that

d(θ̂_{1s}, θ₁) < log(f(s)f(t))/s for all s ∈ [n] ,

which means that U₁(t) ≥ µ* for all t ≥ τ. For suboptimal arms i let κ_i be the random variable

κ_i = min{s : |θ̂_{iu} − θ_i| < ε for all u ≥ s} .

Now suppose that t ≥ τ, T_i(t − 1) ≥ κ_i and A_t = i. Then U_i(t) ≥ U₁(t) ≥ µ*, which implies that

d_{i,min}(ε) ≤ log(f(T_i(t − 1))f(t))/T_i(t − 1) .

This means that

∑_{i>1} T_i(t) ≤ τ + ∑_{i>1} (1 + max{κ_i, log(t⁴)/d_{i,min}(ε)}) .   (10.9)

Then let

Λ = max{t : T₁(t − 1) ≤ max_{i>1} T_i(t − 1)} ,

which by Eq. (10.9), Eq. (10.7) and Eq. (10.8) satisfies E[Λ] = O(1). Suppose now that t ≥ Λ. Then T₁(t − 1) > max_{i>1} T_i(t − 1), and by the definition of the algorithm A_t = i implies that U_i(t) > µ* and so

T_i(t − 1) ≤ log(T_i(t − 1)²f(t))/d_{i,inf}(ε) .

Hence

T_i(n) ≤ 1 + Λ + log(f(T_i(n))f(n))/d_{i,inf}(ε) .

Since E[Λ] = O(1) we conclude that

limsup_{n→∞} E[T_i(n)]/log(n) ≤ 1/d_{i,inf}(ε) .

The result follows because lim_{ε→0} d_{i,inf}(ε) = d_{i,inf} and from the fundamental regret decomposition (Lemma 4.5).

Chapter 11 The Exp3 Algorithm

11.2 Let π be a deterministic policy, so that A_t is a function of x_{1,A₁}, …, x_{t−1,A_{t−1}}. We define (x_t)_{t=1}^n inductively by

x_{ti} = 0 if A_t = i ; 1 otherwise .

Clearly the policy collects zero reward, and yet

max_{i∈[k]} ∑_{t=1}^n x_{ti} ≥ (1/k) ∑_{t=1}^n ∑_{i=1}^k x_{ti} = n(k − 1)/k .

Therefore the regret is at least n(k − 1)/k = n(1 − 1/k), as required.
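The inductive construction can be run against any concrete deterministic policy (a sketch; the "follow the leader" policy below is an arbitrary stand-in, not from the text):

```python
def follow_the_leader(history):
    # A deterministic policy (arbitrary example): play the arm with the
    # highest total observed reward so far, arm 0 before any feedback.
    totals = [0] * k
    for arm, reward in history:
        totals[arm] += reward
    return max(range(k), key=lambda a: totals[a])

def adversarial_rewards(policy, n, k):
    # Inductive construction from the solution: the chosen arm gets
    # reward 0 in that round, every other arm gets reward 1.
    history, table, chosen = [], [], []
    for _ in range(n):
        a = policy(history)
        row = [0 if i == a else 1 for i in range(k)]
        table.append(row)
        chosen.append(a)
        history.append((a, row[a]))
    return table, chosen

n, k = 100, 5
table, chosen = adversarial_rewards(follow_the_leader, n, k)
collected = sum(table[t][chosen[t]] for t in range(n))      # always 0
best = max(sum(row[i] for row in table) for i in range(k))  # >= n(k-1)/k
assert collected == 0 and best >= n * (k - 1) / k
print(f"policy collects {collected}; best arm collects {best}")
```

Any other deterministic policy is defeated the same way, since the adversary only needs to know the chosen arm before assigning rewards.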

11.5 Let X be as stated in the problem and let f(x) = ∑_i P_i X(i, x_i). Let x ∈ R^k be arbitrary and let x′ ∈ R^k be such that x′_i = x_i except possibly for component j > 1. Note that f(x) = x₁ = x′₁ = f(x′) and thus 0 = f(x) − f(x′) = P_j(X(j, x_j) − X(j, x′_j)). Dividing by P_j > 0 implies that X(j, x_j) = X(j, x′_j). Since x_j and x′_j were arbitrary, X(j, ·) is constant; call the value it takes a_j. Now, let x, x′ ∈ R^k be such that they agree on all components except possibly component one, and let x′₁ = 0. Further, let a₁ = X(1, 0). Then x₁ − 0 = f(x) − f(x′) = P₁(X(1, x₁) − a₁). Reordering gives that for any x₁ ∈ R, X(1, x₁) = a₁ + x₁/P₁. Finally, let x be such that x₁ = 0. Then 0 = f(x) = ∑_i P_i a_i, finishing the proof.

11.6 The first two parts are purely algebraic and are omitted.

(c) Let G₀, …, G_s be a sequence of independent geometrically distributed random variables with E[G_u] = 1/q_u(α). Then

P(T₂(n/2) ≥ s + 1) = P(∑_{u=0}^s G_u ≤ n/2)
≥ P(∑_{u=0}^{s−1} G_u ≤ n/4) P(G_s ≤ n/4)
= (1 − P(∑_{u=0}^{s−1} G_u > n/4))(1 − (1 − 1/(8n))^{n/4})
≥ (1/2)(1 − exp(−1/32))
≥ 1/65 .

(d) Suppose that T₂(n/2) ≥ s + 1. Then for all t > n/2,

L̃_{t2} = ∑_{u=1}^{t−1} Ŷ_{u2} ≥ 8αn ≥ 2n .

On the other hand,

L̃_{t1} = ∑_{u=1}^{t−1} Ŷ_{u1} = ∑_{u=n/2+1}^{t−1} 1/P_{u1} .

Using induction and the fact that P_{t1} ≥ 1/2 as long as L̃_{t1} ≤ L̃_{t2}, it follows that on the event E = {T₂(n/2) ≥ s + 1}, P_{t1} ≥ 1/2 for all t. Therefore

P_{t2} ≤ exp(−η ∑_{s=1}^{t−1} (Ŷ_{s2} − Ŷ_{s1})) ≤ exp(η(∑_{s=1}^{t−1} Ŷ_{s1} − 2n)) ≤ exp(−nη) ,

which combined with the previous part means that

P(A_t = 1 for all t > n/2) ≥ (1/65)(1 − (n/2) exp(−nη)) .

The result follows because on the event {A_t = 1 for all t > n/2} the regret satisfies

R_n ≥ n/2 − αn/2 ≥ n/4 .

(e) Markov's inequality need not hold for random variables that can take negative values, and R_n can be negative. For this problem it even holds that E[R_n] < 0.

(f) Since for n = 10⁴ the probability of seeing a large regret is about 1/65 by the previous part, Exp3 was run m = 500 times, which gives a good margin to encounter large regrets. The results are shown in Fig. 11.4. As predicted by the theory, in a significant fraction of the cases the regret is above n/4 = 2500. As seen from the figure, the mean regret is negative.

11.7 First, note that if G = −log(−log(U)) with U uniform on [0, 1], then

P(G ≤ g) = e^{−exp(−g)} .

Now the result follows from a long sequence of equalities:

P(log a_i + G_i ≥ max_{j∈[k]} log a_j + G_j) = E[∏_{j≠i} P(log a_j + G_j ≤ log a_i + G_i | G_i)]
= E[∏_{j≠i} exp(−(a_j/a_i) exp(−G_i))]
= E[U_i^{∑_{j≠i} a_j/a_i}]
= 1/(1 + ∑_{j≠i} a_j/a_i)
= a_i/∑_{j=1}^k a_j .
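This is the Gumbel-max trick, easy to confirm by simulation (a sketch; the weights a_i below are arbitrary): the arm with the largest Gumbel-perturbed log-weight is selected with frequency a_i/∑_j a_j.

```python
import math
import random

random.seed(5)

a = [3.0, 1.0, 0.5, 0.5]
trials = 200_000

wins = [0] * len(a)
for _ in range(trials):
    # Perturb each log-weight with independent standard Gumbel noise.
    scores = [math.log(w) - math.log(-math.log(random.random())) for w in a]
    wins[scores.index(max(scores))] += 1

freqs = [w / trials for w in wins]
target = [w / sum(a) for w in a]
assert all(abs(f - t) < 0.01 for f, t in zip(freqs, target))
print(freqs, target)
```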

Chapter 12 The Exp3-IX Algorithm

12.1

(a) We have µ_t := E_{t−1}[Ŷ_{ti}] = P_{ti}y_{ti}/(P_{ti} + γ). Further,

V_{t−1}[Ŷ_{ti}] = E_{t−1}[(Ŷ_{ti} − µ_t)²] = P_{ti}(1 − P_{ti})y_{ti}²/(P_{ti} + γ)² ≤ (P_{ti} + γ)y_{ti}/(P_{ti} + γ)² = y_{ti}/(P_{ti} + γ) .

For any η > 0 such that η(Ŷ_{ti} − µ_t) = η(A_{ti} − P_{ti})y_{ti}/(P_{ti} + γ) ≤ 1 almost surely for all t ∈ [n], Exercise 5.15 gives that with probability at least 1 − δ,

L̃_{ni} − ∑_t P_{ti}y_{ti}/(P_{ti} + γ) ≤ η ∑_t y_{ti}/(P_{ti} + γ) + (1/η) log(1/δ) .

Choosing η = γ, the constraints η(Ŷ_{ti} − µ_t) ≤ 1 are satisfied for all t ∈ [n]. Plugging in this value and reordering gives the desired inequality.

(b) We have µ_t := E_{t−1}[∑_i Ŷ_{ti}] = ∑_i P_{ti}y_{ti}/(P_{ti} + γ). Further,

V_{t−1}[∑_i Ŷ_{ti}] ≤ E_{t−1}[(∑_i Ŷ_{ti})²] = ∑_i E_{t−1}[Ŷ_{ti}²] = ∑_i P_{ti}y_{ti}²/(P_{ti} + γ)² ≤ ∑_i y_{ti}/(P_{ti} + γ) .

To satisfy the constraint on η we calculate η(∑_i Ŷ_{ti} − µ_t) ≤ η ∑_i Ŷ_{ti} = η ∑_i A_{ti}y_{ti}/(P_{ti} + γ) ≤ (η/γ) ∑_i A_{ti} = η/γ. Hence, any η ≤ γ is suitable. Choosing η = γ, we get

∑_i L̃_{ni} − ∑_t ∑_i P_{ti}y_{ti}/(P_{ti} + γ) ≤ ∑_t ∑_i γy_{ti}/(P_{ti} + γ) + (1/γ) log(1/δ) .

Reordering as before gives the desired result.

12.4 We proceed in five steps. Throughout, Z_{ta} is the true quantity, Z̃_{ta} its estimate satisfying E_{t−1}[Z̃_{ta}] = Z_{ta}, and Ẑ_{ta} = Z̃_{ta} − β_{ta} the shifted estimate used by the algorithm (assumptions (a)-(d) of the exercise).

Step 1: Decomposition. Using that ∑_a P_{ta} = 1 and some algebra we get

∑_{t=1}^n ∑_{a=1}^k P_{ta}(Z_{ta} − Z_{tA*}) = ∑_{t=1}^n ∑_{a=1}^k P_{ta}(Ẑ_{ta} − Ẑ_{tA*})   (A)
+ ∑_{t=1}^n ∑_{a=1}^k P_{ta}(Z_{ta} − Ẑ_{ta})   (B)
+ ∑_{t=1}^n (Ẑ_{tA*} − Z_{tA*})   (C) .

Step 2: Bounding (A). By assumption (c) we have β_{ta} ≥ 0, which by assumption (a) means that ηẐ_{ta} ≤ ηZ̃_{ta} ≤ η|Z̃_{ta}| ≤ 1 for all a. A straightforward modification of the analysis in the last chapter shows that (A) is bounded by

(A) ≤ log(k)/η + η ∑_{t=1}^n ∑_{a=1}^k P_{ta} Ẑ_{ta}²
= log(k)/η + η ∑_{t=1}^n ∑_{a=1}^k P_{ta}(Z̃_{ta}² + β_{ta}²) − 2η ∑_{t=1}^n ∑_{a=1}^k P_{ta} Z̃_{ta} β_{ta}
≤ log(k)/η + η ∑_{t=1}^n ∑_{a=1}^k P_{ta} Z̃_{ta}² + 3 ∑_{t=1}^n ∑_{a=1}^k P_{ta} β_{ta} ,

where in the last step we used the assumptions that ηβ_{ta} ≤ 1 and η|Z̃_{ta}| ≤ 1.

Step 3: Bounding (B). For (B) we have

(B) = ∑_{t=1}^n ∑_{a=1}^k P_{ta}(Z_{ta} − Ẑ_{ta}) = ∑_{t=1}^n ∑_{a=1}^k P_{ta}(Z_{ta} − Z̃_{ta} + β_{ta}) .

We prepare to use Exercise 5.15. By assumptions (c) and (d) respectively we have ηE_{t−1}[Z̃_{ta}²] ≤ β_{ta} and E_{t−1}[Z̃_{ta}] = Z_{ta}. By Jensen's inequality,

ηE_{t−1}[(∑_{a=1}^k P_{ta}(Z_{ta} − Z̃_{ta}))²] ≤ η ∑_{a=1}^k P_{ta} E_{t−1}[Z̃_{ta}²] ≤ ∑_{a=1}^k P_{ta} β_{ta} .

Therefore, by Exercise 5.15, with probability at least 1 − δ,

(B) ≤ 2 ∑_{t=1}^n ∑_{a=1}^k P_{ta} β_{ta} + log(1/δ)/η .

Step 4: Bounding (C). For (C) we have
\[
(C) = \sum_{t=1}^n (\hat Z_{tA^*} - Z_{tA^*}) = \sum_{t=1}^n (\tilde Z_{tA^*} - Z_{tA^*} - \beta_{tA^*})\,.
\]
Because $A^*$ is random we cannot directly apply Exercise 5.15, but need a union bound over all actions. Let $a$ be fixed. Then by Exercise 5.15 and the assumptions that $\eta|\tilde Z_{ta}| \le 1$, $\mathbb{E}_{t-1}[\tilde Z_{ta}] = Z_{ta}$ and $\eta\,\mathbb{E}_{t-1}[\tilde Z_{ta}^2] \le \beta_{ta}$, with probability at least $1-\delta$,
\[
\sum_{t=1}^n (\tilde Z_{ta} - Z_{ta} - \beta_{ta}) \le \frac{\log(1/\delta)}{\eta}\,.
\]
Therefore by a union bound, with probability at least $1 - k\delta$,
\[
(C) \le \frac{\log(1/\delta)}{\eta}\,.
\]

Step 5: Putting it together. Combining the bounds on (A), (B) and (C) in the last three steps with the decomposition in the first step shows that with probability at least $1 - (k+1)\delta$,
\[
R_n \le \frac{3\log(1/\delta)}{\eta} + \eta\sum_{t=1}^n\sum_{a=1}^k P_{ta}\tilde Z_{ta}^2 + 5\sum_{t=1}^n\sum_{a=1}^k P_{ta}\beta_{ta}\,,
\]
where we used the assumption that $\delta \le 1/k$, so that $\log(k) \le \log(1/\delta)$.

Chapter 13 Lower Bounds: Basic Ideas

13.2 Notice that a policy has zero regret on all bandits for which the first arm is optimal if and only if $P_{\nu\pi}(A_t = 1) = 1$ for all $t \in [n]$. Hence the policy that always plays the first arm is optimal.

Chapter 14 Foundations of Information Theory

14.4 Let $\mu = P - Q$, which is a signed measure on $(\Omega, \mathcal F)$. By the Hahn decomposition theorem there exist disjoint sets $A, B \subset \Omega$ such that $A \cup B = \Omega$ and $\mu(E) \ge 0$ for all measurable $E \subseteq A$ and $\mu(E) \le 0$ for all measurable $E \subseteq B$. Then
\[
\int_\Omega X\,dP - \int_\Omega X\,dQ = \int_A X\,d\mu + \int_B X\,d\mu \le b\mu(A) + a\mu(B) = (b-a)\mu(A) \le (b-a)\delta(P,Q)\,,
\]
where we used the fact that $\mu(B) = P(B) - Q(B) = Q(A) - P(A) = -\mu(A)$.
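A numerical illustration of the conclusion of 14.4 (an addition, using an arbitrary three-point sample space): for distributions $P, Q$ and a bounded variable $X \in [a,b]$, the difference of expectations is at most $(b-a)\,\delta(P,Q)$, where $\delta(P,Q) = \frac12\sum_\omega |P(\omega)-Q(\omega)|$ is the total variation distance.

```python
# Check  ∫X dP − ∫X dQ ≤ (b − a) δ(P, Q)  on a small finite space.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
X = [2.0, -0.5, 1.0]   # arbitrary bounded random variable
a, b = min(X), max(X)
lhs = sum(x * p for x, p in zip(X, P)) - sum(x * q for x, q in zip(X, Q))
tv = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))
assert lhs <= (b - a) * tv + 1e-12
```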

14.10 Dobrushin's theorem says that for any $\omega$,
\[
D(P(\cdot\,|\,\omega), Q(\cdot\,|\,\omega)) = \sup_{(A_i)} \sum_i P(A_i\,|\,\omega)\log\Big(\frac{P(A_i\,|\,\omega)}{Q(A_i\,|\,\omega)}\Big)\,,
\]
where the supremum is taken over all finite partitions of $\mathbb R$ with rational-valued end-points. By the definition of a probability kernel it follows that the quantity inside the supremum on the right-hand side is $\mathcal F$-measurable as a function of $\omega$ for any finite partition. Since the supremum is over a countable set, the whole right-hand side is $\mathcal F$-measurable as required.

14.11 First assume that $P \ll Q$. Then let $P^t$ and $Q^t$ be the restrictions of $P$ and $Q$ to $(\mathbb R^t, \mathcal B(\mathbb R^t))$ given by
\[
P^t(A) = P(A \times \mathbb R^{n-t}) \quad\text{and}\quad Q^t(A) = Q(A \times \mathbb R^{n-t})\,.
\]
You should check that $P \ll Q$ implies that $P^t \ll Q^t$ and hence there exists a Radon-Nikodym derivative $dP^t/dQ^t$. Define
\[
F(x_t\,|\,x_1,\dots,x_{t-1}) = \frac{dP^t}{dQ^t}(x_1,\dots,x_t) \Big/ \frac{dP^{t-1}}{dQ^{t-1}}(x_1,\dots,x_{t-1})\,,
\]
which is well defined for all $(x_1,\dots,x_{t-1}) \in \mathbb R^{t-1}$ except for a set of $P^{t-1}$-measure zero. Then for any $A \in \mathcal B(\mathbb R^{t-1})$ and $B \in \mathcal B(\mathbb R)$,
\[
\int_A \int_B F(x_t\,|\,\omega)\,Q_t(dx_t\,|\,\omega)\,P^{t-1}(d\omega)
= \int_A \int_B \frac{dP^t}{dQ^t}(\omega, x_t)\,Q_t(dx_t\,|\,\omega)\,Q^{t-1}(d\omega)
= \int_{A\times B} \frac{dP^t}{dQ^t}\,dQ^t
= P(A\times B)\,.
\]
A monotone class argument shows that $F(x_t\,|\,\omega)$ is $P^{t-1}$-almost surely the Radon-Nikodym derivative of $P_t(\cdot\,|\,\omega)$ with respect to $Q_t(\cdot\,|\,\omega)$. Hence
\[
D(P, Q) = \mathbb E_P\Big[\log\Big(\frac{dP}{dQ}\Big)\Big]
= \sum_{t=1}^n \mathbb E_P[\log(F(X_t\,|\,X_1,\dots,X_{t-1}))]
= \sum_{t=1}^n \mathbb E_P[D(P_t(\cdot\,|\,X_1,\dots,X_{t-1}), Q_t(\cdot\,|\,X_1,\dots,X_{t-1}))]\,.
\]
Now suppose that $P \not\ll Q$. Then by definition $D(P,Q) = \infty$. We need to show this implies there exists a $t \in [n]$ such that $D(P_t(\cdot\,|\,\omega), Q_t(\cdot\,|\,\omega)) = \infty$ with nonzero probability. Proving the contrapositive, let
\[
U_t = \{\omega : D(P_t(\cdot\,|\,\omega), Q_t(\cdot\,|\,\omega)) < \infty\}
\]
and assume that $P(U_t) = 1$ for all $t$. Then $U = \cap_{t=1}^n U_t$ satisfies $P(U) = 1$. On $U_t$ let $F(x_t\,|\,x_1,\dots,x_{t-1}) = \frac{dP_t(\cdot\,|\,x_1,\dots,x_{t-1})}{dQ_t(\cdot\,|\,x_1,\dots,x_{t-1})}(x_t)$ and otherwise let $F(x_t\,|\,x_1,\dots,x_{t-1}) = 0$. Iterating applications of Fubini's theorem shows that for any $(A_t)_{t=1}^n$ with $A_t \in \mathcal B(\mathbb R)$ it holds that
\[
\int_{A_1\times\cdots\times A_n} \prod_{t=1}^n F(x_t\,|\,x_1,\dots,x_{t-1})\,Q(dx_1,\dots,dx_n) = P(A_1\times\cdots\times A_n)\,.
\]
Hence $\prod_{t=1}^n F(x_t\,|\,x_1,\dots,x_{t-1})$ behaves like the Radon-Nikodym derivative of $P$ with respect to $Q$ on rectangles. Another monotone class argument extends this to all measurable sets and the existence of $dP/dQ$ guarantees that $P \ll Q$.
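The chain rule just proved can be checked numerically on a tiny discrete example (an addition with made-up transition tables): for a two-step chain, the joint relative entropy equals the relative entropy of the first marginal plus the expected relative entropy of the conditionals.

```python
from math import log

# Two-step discrete chain: P1, Q1 are first marginals; P2[x][y], Q2[x][y]
# are the conditional distributions of the second coordinate given x.
P1, Q1 = [0.6, 0.4], [0.5, 0.5]
P2 = [[0.7, 0.3], [0.2, 0.8]]
Q2 = [[0.5, 0.5], [0.4, 0.6]]

def kl(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

joint = kl([P1[x] * P2[x][y] for x in range(2) for y in range(2)],
           [Q1[x] * Q2[x][y] for x in range(2) for y in range(2)])
chain = kl(P1, Q1) + sum(P1[x] * kl(P2[x], Q2[x]) for x in range(2))
assert abs(joint - chain) < 1e-12   # D(P,Q) = sum of expected conditional divergences
```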

Chapter 15 Minimax Lower Bounds

15.1 Abbreviate $\hat\theta = \hat\theta(X_1,\dots,X_n)$ and let $R(P) = \mathbb E_P[d(\hat\theta, P)]$. By the triangle inequality
\[
d(\hat\theta, P_0) + d(\hat\theta, P_1) \ge d(P_0, P_1) = \Delta\,.
\]
Let $E = \{d(\hat\theta, P_0) \le \Delta/2\}$. On $E^c$ it holds that $d(\hat\theta, P_0) \ge \Delta/2$ and on $E$ it holds that $d(\hat\theta, P_1) \ge \Delta - d(\hat\theta, P_0) \ge \Delta/2$. Hence, by Markov's inequality and the Bretagnolle-Huber inequality,
\[
R(P_0) + R(P_1) \ge \frac{\Delta}{2}\big(P_0(E^c) + P_1(E)\big) \ge \frac{\Delta}{4}\exp(-D(P_0, P_1))\,.
\]
The result follows because $\max\{a, b\} \ge (a+b)/2$.
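The Bretagnolle-Huber step used above can be verified exhaustively in a toy setting (an addition, with a single Bernoulli observation and arbitrary parameters): for every event $E$, $P_0(E^c) + P_1(E) \ge \frac12\exp(-D(P_0,P_1))$.

```python
from math import exp, log

# Exhaustive check of P0(E^c) + P1(E) >= exp(-D(P0,P1))/2 over all events
# of the two-point outcome space {0, 1}.
p0, p1 = 0.3, 0.7
D = p0 * log(p0 / p1) + (1 - p0) * log((1 - p0) / (1 - p1))
for E in ([], [0], [1], [0, 1]):
    P1E = sum(p1 if x == 1 else 1 - p1 for x in E)
    P0Ec = sum(p0 if x == 1 else 1 - p0 for x in [0, 1] if x not in E)
    assert P0Ec + P1E >= 0.5 * exp(-D)
```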

Chapter 16 Instance-Dependent Lower Bounds

16.2

(a) Suppose that $\mu \ne \mu'$. Then $D(\mathcal R(\mu), \mathcal R(\mu')) = \infty$, since $\mathcal R(\mu)$ is not absolutely continuous with respect to $\mathcal R(\mu')$. Therefore $d_{\inf}(\mathcal R(\mu), \mu^*, \mathcal M) = \infty$.

(b) Notice that each arm returns exactly two possible rewards and once both have been observed, the mean is known. Consider the algorithm that plays each arm until it has observed both possible rewards from that arm and subsequently plays optimally. The expected number of trials before both rewards from an arm are observed is $\sum_{i=2}^\infty i\,2^{1-i} = 3$. Hence
\[
R_n \le 3\sum_{i=1}^k \Delta_i\,.
\]

(c) Let $P_\mu$ be the shifted Rademacher distribution with mean $\mu$. Then $D(P_\mu, P_{\mu+\Delta})$ is not differentiable as a function of $\Delta$.

16.7 Fix any policy $\pi$. If $\pi$ is not a consistent policy for $\mathcal E_k^{[0,b]}$ then $\pi$ cannot have logarithmic regret, which contradicts both Eq. (16.7) and Eq. (16.8). Hence, we may assume that $\pi$ is consistent for this environment class. We can thus apply Theorem 16.2. Let $\mathcal M$ be the set of probability distributions supported on $[0,b]$ and let $\mathcal M'$ be the set of scaled Bernoulli distributions supported on $[0,b]$: $P \in \mathcal M'$ if $P = (1-p)\delta_0 + p\delta_b$ for some $p \in [0,1]$, where $\delta_x$ is the Dirac distribution supported on $x$. Then, thanks to Theorem 16.2, for any $\nu \in \mathcal E_k^{[0,b]}$,
\[
\liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \ge \sum_{i:\Delta_i>0} \frac{\Delta_i}{d_{\inf}(P_i, \mu^*, \mathcal M)} \ge \sum_{i:\Delta_i>0} \frac{\Delta_i}{d_{\inf}(P_i, \mu^*, \mathcal M')}\,,
\]
where the latter inequality holds thanks to $\mathcal M' \subset \mathcal M$. Note that any $P \in \mathcal M'$ is uniquely determined by its Bernoulli parameter $p$. Choose some $\nu \in (\mathcal M')^k$, $\nu = (P_i)$, and let $p_i$ be the Bernoulli parameter underlying $P_i$. Introduce $p^* = \max_i p_i$. Then $d_{\inf}(P_i, \mu^*, \mathcal M') = d(p_i, p^*)$, where $d(p,q)\, (= D(\mathcal B(p), \mathcal B(q))) = p\log(p/q) + (1-p)\log((1-p)/(1-q))$ (cf. Definition 10.1). Furthermore, $\Delta_i = b(p^* - p_i)$. Hence, we conclude that
\[
\limsup_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \ge \liminf_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \ge \sum_{i: p_i < p^*} \frac{b(p^* - p_i)}{d(p_i, p^*)}\,.
\]
We will consider environments $\nu = \nu_\delta$ given by the Bernoulli parameters $((1+\delta)/2, \dots, (1+\delta)/2, (1-\delta)/2)$ for some $\delta \in [0,1]$. Thus,
\[
\limsup_{n\to\infty} \frac{R_n(\pi,\nu)}{\log(n)} \ge \sum_{i: p_i<p^*} \frac{b(p^*-p_i)}{d(p_i,p^*)} = \frac{b\delta}{d((1-\delta)/2, (1+\delta)/2)} = \frac{b}{\log((1+\delta)/(1-\delta))}\,.
\]
Denote the right-hand side by $f(\delta)$. Noticing that $\lim_{\delta\to 0} f(\delta) = \infty$ we immediately see that Eq. (16.7) cannot hold: as the action gap gets small, if we keep the variance constant (as in this example) then the regret blows up with the inverse action gap. To show that Eq. (16.8) cannot hold either, consider the case when $\delta \to 1$. The right-hand side of Eq. (16.8) is $\sigma_k^2/\Delta_k = b(1-\delta^2)/(4\delta)$. Now,
\[
\frac{f(\delta)\Delta_k}{\sigma_k^2} = \frac{4\delta}{(1-\delta^2)\log((1+\delta)/(1-\delta))} \to \infty \quad\text{as } \delta \to 1\,.
\]
(Note that in this case the variance decreases to zero, while the gap is maintained. Since the algorithm does not know that the variance is zero, it has to pay a logarithmic cost.)
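The simplification $d((1-\delta)/2, (1+\delta)/2) = \delta\log((1+\delta)/(1-\delta))$ used above can be checked numerically (an illustrative addition):

```python
from math import log

# Bernoulli relative entropy (Definition 10.1) and the closed form used in 16.7.
def d(p, q):
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

for delta in [0.1, 0.3, 0.5, 0.9]:
    p, q = (1 - delta) / 2, (1 + delta) / 2
    assert abs(d(p, q) - delta * log((1 + delta) / (1 - delta))) < 1e-12
```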

Chapter 17 High-Probability Lower Bounds

17.1

Proof of Claim 17.5. We have
\[
\delta \le P_Q(R_n \ge u) = \int_{[0,1]^{n\times k}} P_{\delta_x}(R_n \ge u)\,dQ(x)\,.
\]
Therefore there exists an $x$ with $P_{\delta_x}(R_n \ge u) \ge \delta$.

Proof of Claim 17.6. Abbreviate $P_i$ to be the law of $A_1, X_1, \dots, A_n, X_n$ induced by $P_{Q_i}$ and let $E_i$ be the corresponding expectation operator. Following the standard argument, let $j$ be the arm that minimises $E_1[T_j(n)]$, which satisfies
\[
E_1[T_j(n)] \le \frac{n}{k-1}\,.
\]
Therefore by Theorem 14.2 and Lemma 15.1,
\[
\max\{P_1(T_1(n) \le n/2),\, P_j(T_j(n) \le n/2)\}
\ge \frac12\big(P_1(T_1(n) \le n/2) + P_j(T_1(n) > n/2)\big)
\ge \frac14\exp(-D(P_1, P_j))
\ge \frac14\exp\big(-2\Delta^2 E_1[T_j(n)]\big)
\ge \frac14\exp\Big(-\frac{(k-1)E_1[T_j(n)]}{n}\log\Big(\frac{1}{8\delta}\Big)\Big)
\ge 2\delta\,,
\]
where the first inequality used that $T_1(n) + T_j(n) \le n$, so that $\{T_1(n) > n/2\} \subseteq \{T_j(n) \le n/2\}$. Therefore there exists an $i$ such that $P_i(T_i(n) \le n/2) \ge 2\delta$.


Proof of Claim 17.7. Notice that if $\eta_t + 2\Delta < 1$ and $\eta_t > 0$, then $X_{tj} \in (0,1)$ for all $j \in [k]$. Now,
\[
P_i(\eta_t + 2\Delta \ge 1 \text{ or } \eta_t \le 0) \le \exp\Big(-\frac{(1/2-2\Delta)^2}{2\sigma^2}\Big) + \exp\Big(-\frac{(1/2)^2}{2\sigma^2}\Big) \le \exp\Big(-\frac{100}{32}\Big) + \exp\Big(-\frac{25}{2}\Big) \le \frac18\,.
\]
Let $M = \sum_{t=1}^n \mathbb I\{\eta_t \le 0 \text{ or } \eta_t + 2\Delta \ge 1\}$, which is an upper bound on the number of rounds where clipping occurs. By Hoeffding's bound,
\[
P_i\Bigg(M \ge E_i[M] + \sqrt{\frac{n\log(1/\delta)}{2}}\Bigg) \le \delta\,,
\]
which means that with probability at least $1-\delta$,
\[
M \le \frac{n}{8} + \sqrt{\frac{n\log(1/\delta)}{2}} \le \frac{n}{4}\,,
\]
where we used the fact that $n \ge 32\log(1/\delta)$.
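Two small arithmetic facts are used in the proof above; both can be checked directly (an illustrative addition, with the second check run at the boundary case $n = 32\log(1/\delta)$):

```python
from math import exp, log, sqrt

# exp(-100/32) + exp(-25/2) <= 1/8
assert exp(-100 / 32) + exp(-25 / 2) <= 1 / 8

# n/8 + sqrt(n log(1/delta)/2) <= n/4 whenever n >= 32 log(1/delta)
for delta in [0.1, 0.01, 0.001]:
    n = 32 * log(1 / delta)   # boundary case, where the inequality is tight
    assert n / 8 + sqrt(n * log(1 / delta) / 2) <= n / 4 + 1e-9
```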

Chapter 18 Contextual Bandits

18.1

(a) By Jensen's inequality,
\[
\sum_{c\in\mathcal C} \sqrt{\sum_{t=1}^n \mathbb I\{c_t = c\}} = |\mathcal C| \sum_{c\in\mathcal C} \frac{1}{|\mathcal C|}\sqrt{\sum_{t=1}^n \mathbb I\{c_t = c\}} \le |\mathcal C|\sqrt{\sum_{c\in\mathcal C}\frac{1}{|\mathcal C|}\sum_{t=1}^n \mathbb I\{c_t=c\}} = \sqrt{|\mathcal C|\, n}\,,
\]
where the inequality follows from Jensen's inequality and the concavity of $\sqrt{\cdot}$, and the last equality follows since $\sum_{c\in\mathcal C}\sum_{t=1}^n \mathbb I\{c_t=c\} = n$.

(b) When each context occurs $n/|\mathcal C|$ times we have
\[
\sum_{c\in\mathcal C}\sqrt{\sum_{t=1}^n \mathbb I\{c_t=c\}} = \sqrt{n|\mathcal C|}\,,
\]
which matches the upper bound proven in the previous part.
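The two parts above can be illustrated numerically (an addition with randomly drawn context counts): the sum of square roots of the counts never exceeds $\sqrt{|\mathcal C|n}$, with equality for uniform counts.

```python
import random
from math import sqrt

# Random context counts: Jensen gives sum_c sqrt(N_c) <= sqrt(|C| n).
random.seed(2)
C, n = 10, 1000
counts = [0] * C
for _ in range(n):
    counts[random.randrange(C)] += 1
assert sum(sqrt(v) for v in counts) <= sqrt(C * n) + 1e-9

# Equality when every context occurs n/|C| times.
equal = [n // C] * C
assert abs(sum(sqrt(v) for v in equal) - sqrt(C * n)) < 1e-9
```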

18.6

(a) This follows because $X_i \le \max_j X_j$ for any family of random variables $(X_i)_i$. Hence $\mathbb E[X_i] \le \mathbb E[\max_j X_j]$.

(b) Modify the proof by proving a bound on
\[
\mathbb E\Bigg[\sum_{t=1}^n E^{(t)}_{m^*}x_t - \sum_{t=1}^n X_t\Bigg]
\]
for an arbitrarily fixed $m^*$. By the definition of learning experts, $\mathbb E_t[\hat X_t] = x_t$ and so Eq. (18.10) also remains valid. Note this would not be true in general if $E^{(t)}$ were allowed to depend on $A_t$. The rest follows the same way as in the oblivious case.

18.7 The inequality $E^*_n \le nk$ is trivial since $\max_m E^{(t)}_{mi} \le 1$. To prove $E^*_n \le nM$, let $m^*_{ti} = \operatorname{argmax}_m E^{(t)}_{mi}$. Then
\[
E^*_n = \sum_{t=1}^n \sum_{i=1}^k \sum_{m=1}^M E^{(t)}_{mi}\,\mathbb I\{m = m^*_{ti}\} \le \sum_{t=1}^n\sum_{m=1}^M\sum_{i=1}^k E^{(t)}_{mi} = nM\,,
\]
where in the last step we used the fact that $\sum_{i=1}^k E^{(t)}_{mi} = 1$.

18.8 Let $\hat X_{t\phi} = k\,\mathbb I\{A_t = \phi(C_t)\}X_t$. Assume that $t \le m$. Since $A_t$ is chosen uniformly at random,
\[
\mathbb E[\hat X_{t\phi}] = \mathbb E[k\,\mathbb I\{A_t = \phi(C_t)\}X_t]
= \mathbb E[k\,\mathbb I\{A_t = \phi(C_t)\}\mathbb E[X_t\,|\,A_t, C_t]]
= \mathbb E[k\,\mathbb I\{A_t = \phi(C_t)\}\mu(C_t, A_t)]
= \mathbb E[\mu(C_t, \phi(C_t))]
= \mu(\phi)\,.
\]
The variance satisfies
\[
\mathbb E[(\hat X_{t\phi} - \mu(\phi))^2] \le \mathbb E[\hat X_{t\phi}^2] \le k^2\,\mathbb E[\mathbb I\{A_t = \phi(C_t)\}] = k\,.
\]
Finally note that $\hat X_{t\phi} \in [0,k]$. Using the technique from Exercise 5.14,
\[
\mathbb E[\exp(\lambda(\mu(\phi) - \hat\mu(\phi)))] \le \exp\Big(\frac{k\lambda^2 g(\lambda k/m)}{m}\Big)\,,
\]
where $g(x) = (\exp(x) - 1 - x)/x^2$, which for $x \in (0,1]$ satisfies $g(x) \le 1/2 + x/4$. Suppose that $m \ge 2k\log(2|\Phi|)$. Then by following the argument in Exercise 5.18,
\[
\mathbb E\Big[\max_\phi |\hat\mu(\phi) - \mu(\phi)|\Big] \le \inf_{\lambda>0}\Big(\frac{\log(2|\Phi|)}{\lambda} + \frac{k\lambda}{2m} + \frac{k^2\lambda^2}{4m^2}\Big) \le \sqrt{\frac{2k\log(2|\Phi|)}{m}} + \frac{k\log(2|\Phi|)}{2m}\,,
\]
where the second inequality follows by choosing
\[
\lambda = \sqrt{\frac{2m\log(2|\Phi|)}{k}} \le \frac{m}{k}\,.
\]
Therefore the regret is bounded by
\[
R_n \le m + 2n\sqrt{\frac{2k\log(2|\Phi|)}{m}} + \frac{nk\log(2|\Phi|)}{m}\,.
\]
By tuning $m$ it follows that
\[
R_n = O\big(n^{2/3}(k\log(|\Phi|))^{1/3}\big)\,.
\]

18.9 Let $C_1,\dots,C_n \in \mathcal C$ be the i.i.d. sequence of contexts and for $k \in [n]$ let $C_{1:k} = (C_1,\dots,C_k)$. The algorithm is as follows: following the hint, for the first $m$ rounds the algorithm selects arms in an arbitrary fashion. The regret from this period is bounded by $m$. The algorithm then picks $M = |\Phi_{C_{1:m}}|$ functions $\Phi' = \{\phi_1,\dots,\phi_M\}$ from $\Phi$ so that $\Phi'|_{C_{1:m}} = \Phi_{C_{1:m}}$ and for the remaining $n-m$ rounds uses Exp4 with the set $\Phi'$.

The regret of Exp4 for competing against $\Phi'$ is $\sqrt{4n\log(M)} \le \sqrt{4n\log((em/d)^d)} = \sqrt{4nd\log(em/d)}$, where the inequality follows from Sauer's lemma. It remains to show that the best expert in $\Phi'$ achieves almost as much reward as the best expert in $\Phi$. For this, it suffices to show that for any $\phi^* \in \Phi$, the expert $\phi'_* \in \Phi'$ that agrees with $\phi^*$ on $C_{1:m}$ agrees with $\phi^*$ on most of the rest of the rounds. Let $d^*$ be a positive integer to be chosen later. We will show that with high probability $\phi'_*$ and $\phi^*$ agree except for at most $d^*$ rounds.

We need a few definitions: for finite sequences $a, b$ of equal length $k$ let $d(a,b) = \sum_{i=1}^k \mathbb I\{a_i \ne b_i\}$ be their Hamming distance. For a sequence $c \in \mathcal C^k$, let $\phi(c) = (\phi(c_1),\dots,\phi(c_k)) \in \{1,2\}^k$. For $\phi, \phi' \in \Phi$, let $d^{(k)}_C(\phi,\phi') = d(\phi(C_{1:k}), \phi'(C_{1:k}))$. For a permutation $\pi$ on $[n]$, let $C^\pi = (C_{\pi(1)}, C_{\pi(2)}, \dots, C_{\pi(n)})$.

Let $\pi$ be a random permutation on $[n]$, chosen uniformly from the set of all permutations on $[n]$, independently of $C$. We have
\[
p = P\big(d^{(n)}_C(\phi^*, \phi'_*) \ge d^*\big)
\le P\big(\exists \phi, \phi' \in \Phi : d^{(n)}_C(\phi,\phi') \ge d^*,\ d^{(m)}_C(\phi,\phi') = 0\big)
= P\big(\exists \phi,\phi' \in \Phi : d^{(n)}_{C^\pi}(\phi,\phi') \ge d^*,\ d^{(m)}_{C^\pi}(\phi,\phi') = 0\big)
= \mathbb E\big[P\big(\exists \phi,\phi'\in\Phi : d^{(n)}_{C^\pi}(\phi,\phi') \ge d^*,\ d^{(m)}_{C^\pi}(\phi,\phi') = 0\,\big|\,C\big)\big]\,,
\]
where the first inequality holds because $\phi^*$ and $\phi'_*$ are a qualifying pair, the next equality by the exchangeability of $(C_1,\dots,C_n)$, and the last by the tower rule. Now, by a union bound and because $\pi$ and $C$ are independent,
\[
P\big(\exists \phi,\phi'\in\Phi : d^{(n)}_{C^\pi}(\phi,\phi') \ge d^*,\ d^{(m)}_{C^\pi}(\phi,\phi') = 0\,\big|\,C\big) \le \sum_{a,b\in\Phi_{C_{1:n}}} P\big(d(a^\pi, b^\pi) \ge d^*,\ d(a^\pi_{1:m}, b^\pi_{1:m}) = 0\big)\,.
\]
By Sauer's lemma, $|\Phi_{C_{1:n}}| \le (en/d)^d$. For any fixed $a, b \in \{1,2\}^n$, $p(a,b) = P(d(a^\pi,b^\pi) \ge d^*,\ d(a^\pi_{1:m}, b^\pi_{1:m}) = 0)$ is the probability that a random subsequence of length $m$ of a bit sequence of length $n$ with at least $d^*$ one bits contains only zero bits. As the probability of a randomly chosen bit being zero is $1 - d^*/n$,
\[
p(a,b) \le \Big(1 - \frac{d^*}{n}\Big)^m \le \exp(-d^* m/n)\,.
\]
Choosing $d^* = \big\lceil \frac{n}{m}\log\big(\frac{(en/d)^{2d}}{\delta}\big)\big\rceil$ we find that $p \le \delta$.

Choosing $\delta = 1/n$, we get that
\[
R_n \le 1 + m + d^* + \sqrt{4nd\log\Big(\frac{em}{d}\Big)} \le 2 + m + \frac{n}{m}\Big(\log(n) + 2d\log\Big(\frac{en}{d}\Big)\Big) + \sqrt{4nd\log\Big(\frac{em}{d}\Big)}\,.
\]
Letting $m = \sqrt{dn}$ gives the desired bound.

Chapter 19 Stochastic Linear Bandits

19.3 Let $T$ be the set of rounds $t$ when $\|a_t\|_{V_{t-1}^{-1}} \ge 1$ and $G_t = V_0 + \sum_{s=1}^t \mathbb I_T(s)a_sa_s^\top$. Then
\[
\Big(\frac{d\lambda + |T|L^2}{d}\Big)^d \ge \Big(\frac{\operatorname{trace}(G_n)}{d}\Big)^d \ge \det(G_n) = \det(V_0)\prod_{t\in T}\big(1 + \|a_t\|^2_{G_{t-1}^{-1}}\big) \ge \det(V_0)\prod_{t\in T}\big(1 + \|a_t\|^2_{V_{t-1}^{-1}}\big) \ge \lambda^d 2^{|T|}\,.
\]
Rearranging and taking the logarithm shows that
\[
|T| \le \frac{d}{\log(2)}\log\Big(1 + \frac{|T|L^2}{d\lambda}\Big)\,.
\]
Abbreviate $x = d/\log(2)$ and $y = L^2/(d\lambda)$, which are both positive. Then
\[
x\log(1 + y(3x\log(1+xy))) \le x\log(1 + 3x^2y^2) \le x\log((1+xy)^3) = 3x\log(1+xy)\,.
\]
Since $z \mapsto z - x\log(1+yz)$ is increasing for $z \ge 3x\log(1+xy)$, it follows that
\[
|T| \le 3x\log(1+xy) = \frac{3d}{\log(2)}\log\Big(1 + \frac{L^2}{\lambda\log(2)}\Big)\,.
\]

19.4

(a) Let $A = B + uu^\top$. Then, for $x \ne 0$,
\[
\frac{\|x\|^2_A}{\|x\|^2_B} = 1 + \frac{(x^\top u)^2}{\|x\|^2_B} \le 1 + \frac{\|x\|^2_B\|u\|^2_{B^{-1}}}{\|x\|^2_B} = 1 + \|u\|^2_{B^{-1}} = \frac{\det A}{\det B}\,, \tag{19.1}
\]
where the inequality follows by Cauchy-Schwarz, and the last equality follows because $\det(A) = \det(B)\det(I + B^{-1/2}uu^\top B^{-1/2}) = \det(B)(1 + \|u\|^2_{B^{-1}})$ per the proof of Lemma 19.4. In the general case, $C = A - B \succeq 0$ can be written using its eigendecomposition as $C = \sum_{i=1}^k u_iu_i^\top$ with $0 \le k \le d$. Letting $A_j = B + \sum_{i\le j}u_iu_i^\top$ for $0 \le j \le k$, note that $A = A_k$ and $B = A_0$ and thus by (19.1),
\[
\frac{\|x\|^2_A}{\|x\|^2_B} = \frac{\|x\|^2_{A_k}}{\|x\|^2_{A_{k-1}}}\cdot\frac{\|x\|^2_{A_{k-1}}}{\|x\|^2_{A_{k-2}}}\cdots\frac{\|x\|^2_{A_1}}{\|x\|^2_{A_0}} \le \frac{\det A_k}{\det A_{k-1}}\cdot\frac{\det A_{k-1}}{\det A_{k-2}}\cdots\frac{\det A_1}{\det A_0} = \frac{\det A}{\det B}\,.
\]

(b) We need some notation. As before, for $t \ge 0$ we let $V_t(\lambda) = V_0 + \sum_{s\le t} A_sA_s^\top$, where $A_s$ is the action taken in round $s$. Let $\tau_0 = 0$ and for $t \ge 1$ let $\tau_t \in [t]$ denote the round index when the phase that contains round $t$ starts. That is,
\[
\tau_t = \begin{cases} \tau_{t-1}\,, & \text{if } \det V_{t-1}(\lambda) \le (1+\epsilon)\det V_{\tau_{t-1}-1}(\lambda)\,; \\ t\,, & \text{otherwise}\,. \end{cases}
\]
Further,
\[
A_t = \begin{cases} A_{t-1}\,, & \text{if } \tau_t = \tau_{t-1}\,; \\ \operatorname{argmax}_{a\in\mathcal A} \mathrm{UCB}_t(a)\,, & \text{otherwise}\,. \end{cases}
\]
Here, $\mathrm{UCB}_t(a)$ is the upper confidence bound based on all the data available up to the beginning of round $t$.

Let the event $\theta_* \in \cap_{t\in[n]}\mathcal C_t$ hold. Define $\tilde\theta_t \in \mathcal C_t$ so that if $A_t = \operatorname{argmax}_{a\in\mathcal A}\mathrm{UCB}_t(a)$ then $\langle\tilde\theta_t, A_t\rangle = \max_{a\in\mathcal A}\mathrm{UCB}_t(a)$. Letting $a^* = \operatorname{argmax}_{a\in\mathcal A}\langle\theta_*, a\rangle$ and noting that $\max_a \mathrm{UCB}_{\tau_t}(a) = \mathrm{UCB}_{\tau_t}(A_{\tau_t})$ and that $A_t = A_{\tau_t}$,
\[
\langle\theta_*, a^*\rangle \le \mathrm{UCB}_{\tau_t}(a^*) \le \mathrm{UCB}_{\tau_t}(A_{\tau_t}) = \langle\tilde\theta_{\tau_t}, A_{\tau_t}\rangle = \langle\tilde\theta_{\tau_t}, A_t\rangle\,.
\]
Hence,
\[
r_t = \langle\theta_*, a^* - A_t\rangle \le \langle\tilde\theta_{\tau_t} - \theta_*, A_t\rangle \le \|\tilde\theta_{\tau_t} - \theta_*\|_{V_{t-1}}\|A_t\|_{V_{t-1}^{-1}}\,.
\]
Now, by Part (a), thanks to $V_{\tau_t-1} \preceq V_{t-1}$ and that $\theta_*, \tilde\theta_{\tau_t} \in \mathcal C_{\tau_t}$ and that $\mathcal C_{\tau_t}$ is contained in an ellipsoid of radius $\beta_{\tau_t} \le \beta_t$ and "shape" determined by $V_{\tau_t-1}$,
\[
\|\tilde\theta_{\tau_t} - \theta_*\|_{V_{t-1}} \le \|\tilde\theta_{\tau_t} - \theta_*\|_{V_{\tau_t-1}}\Big(\frac{\det V_{t-1}}{\det V_{\tau_t-1}}\Big)^{1/2} \le 2\sqrt{(1+\epsilon)\beta_t}\,.
\]
Combined with the previous inequality, we see that
\[
r_t \le 2\sqrt{(1+\epsilon)\beta_t}\,\|A_t\|_{V_{t-1}^{-1}}\,,
\]
which is the same as Eq. (19.10) except that $\beta_t$ is replaced with $(1+\epsilon)\beta_t$. This implies in turn that $R_n = \sum_{t=1}^n r_t \le R_n((1+\epsilon)\beta_n)$.

19.5 We only present in detail the solution for the first part.

(a) Partition $\mathcal C$ into $m$ equal-length subintervals, call these $\mathcal C_1,\dots,\mathcal C_m$. Associate a bandit algorithm with each of these subintervals. In round $t$, upon seeing $C_t \in \mathcal C$, play the bandit algorithm associated with the unique subinterval that $C_t$ belongs to. For example, one could use Exp3 as in the previous chapter, or UCB. The regret is
\[
R_n = \mathbb E\Bigg[\sum_{t=1}^n \max_{a\in[k]} r(C_t, a) - \sum_{t=1}^n X_t\Bigg]
= \underbrace{\mathbb E\Bigg[\sum_{t=1}^n \max_{a\in[k]} r(C_t,a) - \max_a r([C_t], a)\Bigg]}_{(I)} + \underbrace{\mathbb E\Bigg[\sum_{t=1}^n \max_{a\in[k]} r([C_t], a) - X_t\Bigg]}_{(II)}\,,
\]
where for $c \in \mathcal C$, $[c]$ is the index of the unique part $\mathcal C_i$ that $c$ belongs to and for $i \in [m]$,
\[
r(i, a) = \begin{cases} \mathbb E[r(C_t, a)\,|\,[C_t] = i]\,, & \text{if } P([C_t] = i) > 0\,; \\ 0\,, & \text{otherwise}\,. \end{cases}
\]
The first term in the regret decomposition is called the approximation error, and the second is the error due to learning. The approximation error is bounded using the Lipschitz assumption and the definition of the discretisation:
\[
\max_{a\in[k]} r(c,a) - \max_{a\in[k]} r([c],a) \le \max_{a\in[k]}\big(r(c,a) - r([c],a)\big) = \max_{a\in[k]} \mathbb E\big[r(c,a) - r(C_t,a)\,\big|\,[C_t]=[c]\big] \le \max_{a\in[k]} \mathbb E\big[L|c - C_t|\,\big|\,[C_t]=[c]\big] \le L/m\,,
\]
where in the second-last inequality we used the assumption that $r$ is Lipschitz and in the last that when $[C_t] = [c]$ it holds that $|C_t - c| \le 1/m$. It remains to bound the error due to learning. Note that $\mathbb E[X_t\,|\,[C_t]=i] = r(i, A_t)$. As a result, the data experienced by the bandit associated with $\mathcal C_i$ satisfies the conditions of a stochastic bandit environment. Consider the case when Exp3 is used with the adaptive learning rate as described in Exercise 28.13. For $i \in [m]$, let $T_i = \{t\in[n] : [C_t]=i\}$, $N_i = |T_i|$ and
\[
R_{ni} = \sum_{t\in T_i}\Big(\max_{a\in[k]} r([C_t],a) - X_t\Big)\,.
\]
Then, by Eq. (11.2), $\mathbb E[R_{ni}] \le C\,\mathbb E[\sqrt{k\log(k)N_i}]$ and thus
\[
(II) \le \sum_{i=1}^m \mathbb E[R_{ni}] \le C\,\mathbb E\Bigg[(k\log(k))^{1/2}\sum_{i=1}^m N_i^{1/2}\Bigg] \le C\sqrt{k\log(k)mn}\,,
\]
where the last inequality follows from Cauchy-Schwarz. Hence,
\[
R_n \le \frac{Ln}{m} + C\sqrt{k\log(k)mn}\,.
\]
Optimizing $m$ gives $m = (L/C)^{2/3}(n/(k\log(k)))^{1/3}$ and
\[
R_n \le C'n^{2/3}(Lk\log(k))^{1/3}
\]
for some constant $C'$ that depends only on $C$. The same argument works with no change if the bandit algorithm is switched to UCB; just a little extra work is needed to deal with the fact that UCB will be run for a random number of rounds. Luckily, the number of rounds is independent of the rewards experienced and the actions taken.

(b) Consider a lower bound that hides a ‘spike’ at one of m positions.

(c) Make an argument that $\mathcal C$ can be partitioned into $m = (3dL/\epsilon)^d$ parts to guarantee that for fixed $a$ the function $r(c,a)$ varies by at most $\epsilon$ within each part. Then bound the regret by
\[
R_n \le n\epsilon + \sqrt{2nk\Big(\frac{3dL}{\epsilon}\Big)^d\log(k)}\,.
\]
By optimizing $\epsilon$ you should arrive at a bound that depends on the horizon like $O(n^{(d+1)/(d+2)})$, which is really quite bad, but also not improvable without further assumptions. You might find the results of Exercise 20.3 to be useful.

19.6

(a) We need to show that $L_t(\theta_*) \le \beta_t^{1/2}$ for all $t$ with high probability. By definition,
\[
L_t(\theta_*) = \Bigg\|g_t(\theta_*) - \sum_{s=1}^t X_sA_s\Bigg\|_{V_t^{-1}}
= \Bigg\|\lambda\theta_* + \sum_{s=1}^t\big(\mu(\langle\theta_*,A_s\rangle)A_s - (\mu(\langle\theta_*,A_s\rangle) + \eta_s)A_s\big)\Bigg\|_{V_t^{-1}}
= \Bigg\|\lambda\theta_* - \sum_{s=1}^t A_s\eta_s\Bigg\|_{V_t^{-1}}
\le \Bigg\|\sum_{s=1}^t A_s\eta_s\Bigg\|_{V_t^{-1}} + \sqrt{\lambda}\|\theta_*\|_2\,.
\]
The result now follows by Theorem 20.4.

(b) Let $g'_t$ denote the derivative of $g_t$ (which exists by assumption). By the mean value theorem, there exists a $\xi$ on the segment connecting $\theta$ and $\theta'$ such that
\[
g_t(\theta) - g_t(\theta') = g'_t(\xi)(\theta - \theta') = \Big(\lambda I + \sum_{s=1}^t \mu'(\langle\xi, A_s\rangle)A_sA_s^\top\Big)(\theta - \theta') = M_t(\theta - \theta')\,,
\]
where $M_t = \lambda I + \sum_{s=1}^t \mu'(\langle\xi,A_s\rangle)A_sA_s^\top \succeq c_1V_t$. Hence,
\[
\|g_t(\theta) - g_t(\theta')\|_{V_t^{-1}} = \|\theta - \theta'\|_{M_tV_t^{-1}M_t} \ge c_1\|\theta - \theta'\|_{V_t}\,.
\]

(c) Let $\tilde\theta_t$ be such that $\mu(\langle\tilde\theta_t, A_t\rangle) = \max_{a\in\mathcal A_t}\max_{\theta\in\mathcal C_t}\mu(\langle\theta, a\rangle)$. Using the fact that $\theta_* \in \mathcal C_t$,
\[
r_t = \mu(\langle\theta_*, A_t^*\rangle) - \mu(\langle\theta_*, A_t\rangle)
\le \mu(\langle\tilde\theta_t, A_t\rangle) - \mu(\langle\theta_*, A_t\rangle)
\le c_2\langle\tilde\theta_t - \theta_*, A_t\rangle
\le c_2\|\tilde\theta_t - \theta_*\|_{V_{t-1}}\|A_t\|_{V_{t-1}^{-1}}
\le \frac{c_2}{c_1}\|g_{t-1}(\tilde\theta_t) - g_{t-1}(\theta_*)\|_{V_{t-1}^{-1}}\|A_t\|_{V_{t-1}^{-1}}
\le \frac{c_2}{c_1}\big(L_{t-1}(\tilde\theta_t) + L_{t-1}(\theta_*)\big)\|A_t\|_{V_{t-1}^{-1}}
\le \frac{2c_2\beta_{t-1}^{1/2}}{c_1}\|A_t\|_{V_{t-1}^{-1}}\,.
\]


(d) By Part (a), with probability at least $1-\delta$, $\theta_* \in \mathcal C_t$ for all $t$. Hence, by Part (c), with probability at least $1-\delta$,
\[
R_n = \sum_{t=1}^n r_t \le \frac{2c_2\beta_n^{1/2}}{c_1}\sum_{t=1}^n\big(1 \wedge \|A_t\|_{V_{t-1}^{-1}}\big) \le \frac{2c_2}{c_1}\sqrt{2nd\beta_n\log\Big(1 + \frac{nL^2}{d\lambda}\Big)}\,,
\]
where we used Lemma 19.4 and the same argument as in the proof of Theorem 19.2.

19.7

(a) Let $\xi_1,\dots,\xi_d$ be the eigenvalues of $V_t$ in increasing order. By the Courant-Fischer min-max theorem, $\delta_i = \xi_i - \lambda_i \ge 0$. Hence,
\[
\frac{\det(V_t)}{\det(V_0)} = \prod_{i=1}^d \frac{\xi_i}{\lambda_i} = \prod_{i=1}^d\Big(1 + \frac{\delta_i}{\lambda_i}\Big)\,.
\]
A trace argument shows that $\sum_{i=1}^d \delta_i \le nL^2$, which, assuming that $d_{\mathrm{eff}} < d$, implies that
\[
\log\Big(\frac{\det(V_t)}{\det(V_0)}\Big) = \sum_{i=1}^d \log\Big(1+\frac{\delta_i}{\lambda_i}\Big)
\le \sum_{i=1}^{d_{\mathrm{eff}}}\log\Big(1+\frac{\delta_i}{\lambda_i}\Big) + \sum_{i=d_{\mathrm{eff}}+1}^d \frac{\delta_i}{\lambda_{d_{\mathrm{eff}}+1}}
\le \sum_{i=1}^{d_{\mathrm{eff}}}\log\Big(1+\frac{nL^2}{\lambda}\Big) + \frac{nL^2}{\lambda_{d_{\mathrm{eff}}+1}}
\le 2d_{\mathrm{eff}}\log\Big(1+\frac{nL^2}{\lambda}\Big)\,,
\]
where the first inequality uses $\log(1+x) \le x$ for $x \ge 0$ and that the $\lambda_i$ are increasing. When $d_{\mathrm{eff}} = d$, the desired inequality trivially holds.

(b) By Theorem 20.4 in the next chapter, it holds with probability at least $1-\delta$ that for all $t$,
\[
\Bigg\|\sum_{s=1}^t A_s\eta_s\Bigg\|_{V_t^{-1}} \le \sqrt{2\log\Big(\frac1\delta\Big) + \log\Big(\frac{\det(V_t)}{\det(V_0)}\Big)}\,.
\]
On this event,
\[
\|\hat\theta_t - \theta\|_{V_t} = \Bigg\|V_t^{-1}\sum_{s=1}^t A_sA_s^\top\theta - \theta + V_t^{-1}\sum_{s=1}^t A_s\eta_s\Bigg\|_{V_t}
\le \big\|V_t^{-1}V_0\theta\big\|_{V_t} + \Bigg\|\sum_{s=1}^t A_s\eta_s\Bigg\|_{V_t^{-1}}
\le \|\theta\|_{V_0} + \sqrt{2\log\Big(\frac1\delta\Big) + \log\Big(\frac{\det(V_t)}{\det(V_0)}\Big)}
\le m + \sqrt{2\log\Big(\frac1\delta\Big) + \log\Big(\frac{\det(V_t)}{\det(V_0)}\Big)}\,.
\]

(c) Using again the argument as in the proof of Theorem 19.2,
\[
r_t = \langle A_t^* - A_t, \theta_*\rangle \le 2\beta_n^{1/2}\min\{1, \|A_t\|_{V_{t-1}^{-1}}\}\,.
\]
Then, using Lemma 19.4,
\[
\sum_{t=1}^n r_t \le 2\beta_n^{1/2}\sqrt{n\sum_{t=1}^n \min\{1, \|A_t\|^2_{V_{t-1}^{-1}}\}} \le 2\beta_n^{1/2}\sqrt{2n\log\Big(\frac{\det V_n}{\det V_0}\Big)} \le 2\beta_n^{1/2}\sqrt{2nd_{\mathrm{eff}}\log\Big(1+\frac{nL^2}{\lambda}\Big)}\,.
\]

(d) The result follows by combining the result of the last part and that of Part (a) while choosing $\delta = 1/n$.

(e) The bound improves when $d_{\mathrm{eff}} \ll d$. For this, $V_0$ should have a large number of large eigenvalues while maintaining $\sum_i \lambda_i\langle v_i, \theta_*\rangle^2 = \|\theta_*\|^2_{V_0} \le m^2$: choosing $V_0$ is betting on the directions $(v_i)$ in which $\theta_*$ will have small components, as in those directions we can choose $\lambda_i$ to be large. This way, one can have a finer control than just choosing features of some fixed dimension: the added flexibility is the main advantage of non-uniform regularisation. However, this is not for free: non-uniform regularisation may increase $\|\theta_*\|_{V_0}$ for some $\theta_*$, which may lead to worse regret. In particular, when $\|\theta_*\|_{V_0} \le m$ is not satisfied, the confidence set will still contain $\theta_*$, but the coverage probability $1-\delta$ will worsen to about $1-\delta'$ where $\delta' \approx \delta\exp((\|\theta_*\|_{V_0}-m)^2_+)$, which increases the $\delta n$ term in the regret to $\delta' n$. Thus, the degradation is smooth, but can be quite harsh. Since the confidence radius is probably conservative, the degradation may be less noticeable than what one would expect based on this calculation.

19.8 For the last part, note that it suffices to store the inverse matrix $U_t = V_t^{-1}(\lambda)$ and $S_t = \sum_{s\le t}A_sX_s$, together with $d_t = \log\frac{\det V_t(\lambda)}{\det V_0(\lambda)}$. Choosing $V_0 = \lambda I$, we have $U_0 = \lambda^{-1}I$, $S_0 = 0$ and $d_0 = 0$. Then, for $t \ge 0$ and $a \in \mathbb R^d$ we have
\[
\mathrm{UCB}_t(a) = \langle\hat\theta_t, a\rangle + \sqrt{\beta_t}\,\|a\|_{U_{t-1}}\,,
\]
where
\[
\hat\theta_t = U_{t-1}S_{t-1}\,, \qquad \sqrt{\beta_t} = m_2\sqrt{\lambda} + \sqrt{2\log(1/\delta) + d_t}\,,
\]
and where after receiving $(A_t, X_t)$, the following updates need to be executed:
\[
U_t = U_{t-1} - \frac{(U_{t-1}A_t)(U_{t-1}A_t)^\top}{1 + A_t^\top U_{t-1}A_t}\,, \qquad
d_t = d_{t-1} + \log(1 + A_t^\top U_{t-1}A_t)\,, \qquad
S_t = S_{t-1} + A_tX_t\,.
\]
Here, the update of $d_t$ follows from Eq. (19.9). Note that the $O(d^2)$ operations are the calculation of $\hat\theta_t$ and the update of $U_t$.
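The bookkeeping above can be sketched in code. This is a minimal illustration (with $V_0 = \lambda I$ and hypothetical random actions and rewards, not the book's experiments): it performs the rank-one (Sherman-Morrison) updates of $U_t$, $d_t$ and $S_t$ and checks that $U_n$ really is the inverse of $V_n = \lambda I + \sum_s A_sA_s^\top$.

```python
import math
import random

d, lam = 4, 0.5
U = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
dt = 0.0            # running log det(V_t)/det(V_0)
S = [0.0] * d       # running sum of A_s * X_s

random.seed(3)
actions = []
for t in range(20):
    A = [random.gauss(0, 1) for _ in range(d)]   # hypothetical action
    X = random.gauss(0, 1)                       # hypothetical reward
    actions.append(A)
    UA = [sum(U[i][j] * A[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(a * ua for a, ua in zip(A, UA))
    for i in range(d):                           # U <- U - (UA)(UA)^T / denom
        for j in range(d):
            U[i][j] -= UA[i] * UA[j] / denom
    dt += math.log(denom)                        # d_t update, cf. Eq. (19.9)
    S = [s + a * X for s, a in zip(S, A)]

# Consistency check: U * (lam*I + sum A A^T) should be the identity.
V = [[(lam if i == j else 0.0) + sum(A[i] * A[j] for A in actions)
      for j in range(d)] for i in range(d)]
prod = [[sum(U[i][k] * V[k][j] for k in range(d)) for j in range(d)]
        for i in range(d)]
max_err = max(abs(prod[i][j] - (1.0 if i == j else 0.0))
              for i in range(d) for j in range(d))
assert max_err < 1e-8
assert dt > 0   # each denom = 1 + ||A||_U^2 > 1
```

Each round costs $O(d^2)$ arithmetic, as noted above, because only matrix-vector products and a rank-one correction are needed.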

Chapter 20 Confidence Bounds for Least Squares Estimators

20.1 From the definition of the design, $V_n = mI$ and the $i$th coordinate of $\hat\theta_n - \theta_*$ is the average of $m$ independent standard Gaussian random variables. Hence
\[
\mathbb E\big[\|\hat\theta_n - \theta_*\|^2_{V_n}\big] = m\sum_{i=1}^d \frac1m = d\,.
\]
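A Monte Carlo illustration of this identity (an addition with small made-up dimensions): playing each of $d$ basis vectors $m$ times with standard Gaussian noise gives $V_n = mI$, and $\mathbb E\|\hat\theta_n - \theta_*\|^2_{V_n} = d$.

```python
import random

random.seed(4)
d, m, reps = 3, 5, 20_000
total = 0.0
for _ in range(reps):
    err2 = 0.0
    for _ in range(d):
        coord_err = sum(random.gauss(0, 1) for _ in range(m)) / m  # avg of m noises
        err2 += m * coord_err ** 2                                 # ||.||^2 with V_n = mI
    total += err2
assert abs(total / reps - d) < 0.1
```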

20.2 Let $m = n/d$ and assume for simplicity that $m$ is a whole number. The actions $(A_t)_{t=1}^n$ are chosen in blocks of size $m$ with the $i$th block starting in round $t_i = (i-1)m+1$. For all $i \in \{1,\dots,d\}$ we set $A_{t_i} = e_i$. For $t \in \{t_i+1,\dots,t_i+m-1\}$ we set $A_t = e_i$ if $\eta_{t_i} > 0$ and $A_t = 0$ otherwise. Clearly $V_n \succeq I$ and hence $\|\mathbf 1\|^2_{V_n^{-1}} \le d$. The choice of $(A_t)_{t=1}^n$ ensures that $\mathbb E[\hat\theta_{n,i}]$ is independent of $i$. Furthermore,
\[
\mathbb E[\hat\theta_{n,1}] = \mathbb E\Bigg[\mathbb I\{\eta_1 > 0\}\frac{\sum_{t=1}^m \eta_t}{m} + \mathbb I\{\eta_1 < 0\}\eta_1\Bigg] = \frac{1}{\sqrt{2\pi}}\Big(\frac1m - 1\Big)\,.
\]
Therefore
\[
\mathbb E\Bigg[\frac{\langle\hat\theta_n, \mathbf 1\rangle^2}{\|\mathbf 1\|^2_{V_n^{-1}}}\Bigg] \ge \frac1d\,\mathbb E\big[\langle\hat\theta_n,\mathbf 1\rangle^2\big] \ge \frac1d\,\mathbb E\big[\langle\hat\theta_n,\mathbf 1\rangle\big]^2 = \frac{d}{2\pi}\Big(\frac{m-1}{m}\Big)^2\,.
\]

20.3

(a) If $C \subset A$ is an $\epsilon$-covering then it is also an $\epsilon'$-covering for any $\epsilon' \ge \epsilon$. Hence $\epsilon \mapsto N(\epsilon)$ is a decreasing function of $\epsilon$.

(b) The inequality $M(2\epsilon) \le N(\epsilon)$ amounts to showing that any $2\epsilon$-packing has cardinality at most the cardinality of any $\epsilon$-covering. Assume this does not hold, that is, there is a $2\epsilon$-packing $P \subset A$ and an $\epsilon$-covering $C \subset A$ such that $|P| \ge |C|+1$. By the pigeonhole principle, there are $c \in C$ and distinct $x, y \in P$ such that $x, y \in B(c,\epsilon)$. Then $\|x-y\| \le \|x-c\| + \|c-y\| \le 2\epsilon$, which contradicts that $P$ is a $2\epsilon$-packing.

If $M(\epsilon) = \infty$, the inequality $N(\epsilon) \le M(\epsilon)$ is trivially true. Otherwise take a maximum $\epsilon$-packing $P$ of $A$. This packing is automatically an $\epsilon$-covering as well (otherwise $P$ would not be a maximum packing), hence the result.

(c) We show the inequalities going left to right. For the first inequality, if $N \doteq N(\epsilon) = \infty$ then there is nothing to be shown. Otherwise let $C$ be a minimum cardinality $\epsilon$-cover of $A$. Then from the definition of cover and the subadditivity of volume, $\operatorname{vol}(A) \le \sum_{x\in C}\operatorname{vol}(B(x,\epsilon)) = N\epsilon^d\operatorname{vol}(B)$. Reordering gives the inequality.

The next inequality, namely that $N(\epsilon) \le M(\epsilon)$, has already been shown.

Consider now the inequality bounding $M \doteq M(\epsilon)$. Let $P$ be a maximum cardinality $\epsilon$-packing of $A$. Then, for any distinct $x, y \in P$, $B(x,\epsilon/2)\cap B(y,\epsilon/2) = \emptyset$. Further, for $x \in P$, $B(x,\epsilon/2) \subset A + \frac\epsilon2 B$ and thus $\cup_{x\in P}B(x,\epsilon/2) \subset A + \frac\epsilon2 B$, hence, by the additivity of volume, $M\operatorname{vol}(\frac\epsilon2 B) \le \operatorname{vol}(A + \frac\epsilon2 B)$.

For the next inequality note that $\epsilon B \subset A$ immediately implies that $A + \frac\epsilon2 B \subset A + \frac12 A$ (check the containment using the definitions), while the convexity of $A$ implies that $A + \frac12 A \subset \frac32 A$. For this second claim let $u \in A + \frac12 A$. Then $u = x + \frac12 y$ for some $x, y \in A$. By the convexity of $A$, $\frac23 u = \frac23 x + \frac13 y \in A$ and hence $u = \frac32(\frac23 u) \in \frac32 A$. For the final inequality note that for measurable $X$ and $c > 0$ we have $\operatorname{vol}(cX) = c^d\operatorname{vol}(X)$. This is true because $cX$ is the image of $X$ under the linear mapping represented by a diagonal matrix with $c$ on the diagonal, and this matrix has determinant $c^d$.

(d) Let $A$ be bounded, say $A \subset rB$ for some $r > 0$. Then $\operatorname{vol}(A+\frac\epsilon2 B) \le \operatorname{vol}(rB + \frac\epsilon2 B) = \operatorname{vol}((r+\frac\epsilon2)B) < \infty$, hence the previous part gives that $N(\epsilon) \le M(\epsilon) < \infty$. Now assume that $N(\epsilon) < \infty$ and let $C$ be a minimum cover of $A$. Then $A \subset \cup_{x\in C}B(x,\epsilon) \subset \max_{x\in C}(\|x\|+\epsilon)B$, hence $A$ is bounded.

20.4 The result follows from Part (c) of Exercise 20.3 by taking $A = B = \{x \in \mathbb R^d : \|x\|_2 \le 1\}$, which shows that the covering number satisfies $N(B,\epsilon) \le (3/\epsilon)^d$.
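The packing/covering relationship can be illustrated with a small experiment (an addition, not the book's argument): a greedily built maximal $\epsilon$-packing of the unit ball in $\mathbb R^2$ is $\epsilon$-separated and its size respects the $(3/\epsilon)^d$ volume bound.

```python
import random
from math import sqrt

random.seed(5)
d, eps = 2, 0.5

def sample_ball():
    while True:
        x = [random.uniform(-1, 1) for _ in range(d)]
        if sum(v * v for v in x) <= 1:
            return x

def dist(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

centers = []
for _ in range(20_000):          # greedy: keep a point if eps-far from all kept points
    x = sample_ball()
    if all(dist(x, c) > eps for c in centers):
        centers.append(x)

# Every pair of centers is eps-separated (a packing) ...
assert all(dist(centers[i], centers[j]) > eps
           for i in range(len(centers)) for j in range(i + 1, len(centers)))
# ... and its size obeys the volume bound (3/eps)^d.
assert len(centers) <= (3 / eps) ** d
```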

20.5 Proving that $\bar M_t$ is $\mathcal F_t$-measurable is actually not trivial. It follows because $M_t(\cdot)$ is measurable and by the 'sections' lemma [Kallenberg, 2002, Lemma 1.26]. It remains to show that $\mathbb E[\bar M_t\,|\,\mathcal F_{t-1}] \le \bar M_{t-1}$ almost surely. Proceeding by contradiction, suppose that $P(\mathbb E[\bar M_t\,|\,\mathcal F_{t-1}] - \bar M_{t-1} > 0) > 0$. Then there exists an $\epsilon > 0$ such that the set $A = \{\omega : \mathbb E[\bar M_t\,|\,\mathcal F_{t-1}](\omega) - \bar M_{t-1}(\omega) > \epsilon\} \in \mathcal F_{t-1}$ satisfies $P(A) > 0$. Then
\[
0 < \int_A\big(\mathbb E[\bar M_t\,|\,\mathcal F_{t-1}] - \bar M_{t-1}\big)dP = \int_A(\bar M_t - \bar M_{t-1})\,dP = \int_A\int_{\mathbb R^d}(M_t(x) - M_{t-1}(x))\,dh(x)\,dP = \int_{\mathbb R^d}\int_A(M_t(x)-M_{t-1}(x))\,dP\,dh(x) \le 0\,,
\]
where the first equality follows from the definition of conditional expectation, the second by substituting the definition of $\bar M_t$ and the third from the Fubini-Tonelli theorem. The last inequality follows from Lemma 20.2 and the definition of conditional expectation again. The proof is completed by noting the deep result that $0 \not< 0$. In this proof it is necessary to be careful to avoid integrating over conditional expectations $\mathbb E[M_t(x)\,|\,\mathcal F_{t-1}]$, which are only defined for each $x$ almost surely and need not be measurable as a function of $x$ (though a measurable choice can be constructed using the separability of $\mathbb R^d$ and the continuity of $x \mapsto M_t(x)$).

20.8 Let $f(\lambda) = \frac{1}{\sqrt{2\pi}}\exp(-\lambda^2/2)$ be the density of the standard Gaussian and define the supermartingale $\bar M_t$ by
\[
\bar M_t = \int_{\mathbb R} f(\lambda)\exp\Big(\lambda S_t - \frac{t\sigma^2\lambda^2}{2}\Big)d\lambda = \frac{1}{\sqrt{t\sigma^2+1}}\exp\Big(\frac{S_t^2}{2(t\sigma^2+1)}\Big)\,.
\]
Since $\bar M_0 = 1$, the maximal inequality shows that $P(\sup_t \bar M_t \ge 1/\delta) \le \delta$, which after rearranging the previous display completes the result.
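The Gaussian mixture integral above has the stated closed form; this can be verified numerically with a simple trapezoidal rule (an addition with arbitrary illustrative values of $S$, $t$ and $\sigma^2$):

```python
from math import exp, pi, sqrt

# Check: ∫ f(λ) exp(λ S − t σ² λ²/2) dλ = exp(S²/(2(tσ²+1))) / sqrt(tσ²+1)
S, t, sigma2 = 1.7, 10, 0.8
f = lambda lam: exp(-lam * lam / 2) / sqrt(2 * pi)
g = lambda lam: f(lam) * exp(lam * S - t * sigma2 * lam * lam / 2)

lo, hi, steps = -10.0, 10.0, 200_000
h = (hi - lo) / steps
integral = h * (sum(g(lo + k * h) for k in range(1, steps)) + (g(lo) + g(hi)) / 2)
closed_form = exp(S * S / (2 * (t * sigma2 + 1))) / sqrt(t * sigma2 + 1)
assert abs(integral - closed_form) < 1e-6
```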

20.9

(a) This follows from straightforward calculus.

(b) The result is trivial for $\Lambda < 0$. For $\Lambda \ge 0$ we have
\[
\bar M_n = \int_{\mathbb R} f(\lambda)\exp\Big(\lambda S_n - \frac{\lambda^2 n}{2}\Big)d\lambda
\ge \int_\Lambda^{\Lambda(1+\epsilon)} f(\lambda)\exp\Big(\lambda S_n - \frac{\lambda^2n}{2}\Big)d\lambda
\ge \epsilon\Lambda f(\Lambda(1+\epsilon))\exp\Big(\Lambda(1+\epsilon)S_n - \frac{\Lambda^2(1+\epsilon)^2n}{2}\Big)
= \epsilon\Lambda f(\Lambda(1+\epsilon))\exp\Big(\frac{(1-\epsilon^2)S_n^2}{2n}\Big)\,.
\]

(c) Let $n \in \mathbb N$. Since $\bar M_t$ is a supermartingale with $\bar M_0 = 1$ it follows that
\[
P_n = P(\text{exists } t \le n : \bar M_t \ge 1/\delta) \le \delta\,.
\]
Hence $P(\text{exists } t : \bar M_t \ge 1/\delta) \le \delta$. Substituting the result from the previous part and rearranging completes the proof.

(d) A suitable choice of $f$ is
\[
f(\lambda) = \frac{\mathbb I\{\lambda \le e^{-e}\}}{\lambda\log\big(\frac1\lambda\big)\big(\log\log\big(\frac1\lambda\big)\big)^2}\,.
\]

(e) Let $\epsilon_n = \min\{1/2, 1/\log\log(n)\}$ and let $\delta \in [0,1]$ be the largest (random) value such that $S_n$ never exceeds
\[
\sqrt{\frac{2n}{1-\epsilon_n^2}\Big(\log\Big(\frac1\delta\Big) + \log\Big(\frac{1}{\epsilon_n\Lambda_n f(\Lambda_n(1+\epsilon_n))}\Big)\Big)}\,.
\]
By Part (c) we have $P(\delta > 0) = 1$. Furthermore, $\limsup_{n\to\infty} S_n/n = 0$ almost surely by the strong law of large numbers, so that $\Lambda_n \to 0$ almost surely. On the intersection of these almost sure events we have
\[
\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log(n)}} \le 1\,.
\]

20.10 We first show a bound on the right tail of S_t; a symmetric argument suffices for the left tail. Let Y_s = X_s − μ_s|X_s| and M_t(λ) = exp( Σ_{s=1}^t (λY_s − λ²|X_s|/2) ). Define the filtration G_1 ⊂ ··· ⊂ G_n by G_t = σ(F_{t−1}, |X_t|). Using the fact that X_s ∈ {−1, 0, 1} we have for any λ > 0 that

    E[ exp(λY_s − λ²|X_s|/2) | G_s ] ≤ 1 .

Therefore M_t(λ) is a supermartingale for any λ > 0. The next step is to use the method of mixtures with a uniform distribution on [0, 2]. Let M_t = (1/2) ∫_0^2 M_t(λ) dλ. Then Markov's inequality shows that for any stopping time τ with respect to (G_t) with τ ≤ n almost surely, P(M_τ ≥ 1/δ) ≤ δ. Next we need a bound on M_τ. The following holds whenever S_t ≥ 0 (here N_t = Σ_{s=1}^t |X_s|):

    M_t = (1/2) ∫_0^2 M_t(λ) dλ
        = (1/2) √(π/(2N_t)) ( erf(S_t/√(2N_t)) + erf((2N_t − S_t)/√(2N_t)) ) exp(S_t²/(2N_t))
        ≥ (erf(√2)/2) √(π/(2N_t)) exp(S_t²/(2N_t)) .

The bound on the upper tail is completed via a stopping time argument, which shows that

    P( there exists t ≤ n : S_t ≥ √( 2N_t log( (2/(δ erf(√2))) √(2N_t/π) ) ) and N_t > 0 ) ≤ δ .

The result follows by symmetry and a union bound.
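The Gaussian integral over [0, 2] behind the second equality can be checked numerically (our own sketch; `lhs`/`rhs` are hypothetical names and S, N stand for S_t, N_t):

```python
import math

# Check that (1/2) * int_0^2 exp(l*S - l^2*N/2) dl equals
# (1/2)*sqrt(pi/(2N)) * (erf(S/sqrt(2N)) + erf((2N-S)/sqrt(2N))) * exp(S^2/(2N)).
def lhs(S, N, n=200001):
    h = 2.0 / (n - 1)
    total = 0.0
    for i in range(n):
        lam = i * h
        w = 0.5 if i in (0, n - 1) else 1.0   # trapezoid weights
        total += w * math.exp(lam * S - lam * lam * N / 2)
    return 0.5 * total * h

def rhs(S, N):
    return (0.5 * math.sqrt(math.pi / (2 * N))
            * (math.erf(S / math.sqrt(2 * N)) + math.erf((2 * N - S) / math.sqrt(2 * N)))
            * math.exp(S * S / (2 * N)))
```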

20.11


(a) Following the hint, we show that exp(L_t(θ*)) is a martingale. Indeed, letting F_t = σ(X_1, …, X_t),

    E[ exp(L_t(θ*)) | F_{t−1} ] = E[ p_{θ_{t−1}}(X_t)/p_{θ*}(X_t) | F_{t−1} ] exp(L_{t−1}(θ*))
     = exp(L_{t−1}(θ*)) ∫ p_{θ*}(x) ( p_{θ_{t−1}}(x)/p_{θ*}(x) ) dμ(x)
     = exp(L_{t−1}(θ*)) ∫ p_{θ_{t−1}}(x) dμ(x)
     = exp(L_{t−1}(θ*)) .

Then, applying the Cramér–Chernoff trick,

    P( L_t(θ*) ≥ log(1/δ) for some t ≥ 1 ) = P( sup_{t∈ℕ} exp(L_t(θ*)) ≥ 1/δ ) ≤ δ E[exp(L_0(θ*))] = δ ,

where the inequality is due to Theorem 3.9, the maximal inequality for nonnegative supermartingales.

(b) This follows from the definition of Ct and Part (a).

Chapter 21 Optimal Design for Least Squares Estimators

21.1 Following the hint,

    ∇f(π)_a = trace( adj(V(π)) a a^⊤ )/det(V(π)) = a^⊤ adj(V(π)) a / det(V(π)) = a^⊤ V(π)^{−1} a = ‖a‖²_{V(π)^{−1}} ,

where in the third equality we used that adj(V(π)) is symmetric since V(π) is symmetric, and hence, following the hint, adj(V(π))/det(V(π)) = V(π)^{−1}.

21.2 By the determinant product rule,

    log det(H + tZ) = log det( H^{1/2}(I + tH^{−1/2}ZH^{−1/2})H^{1/2} )
     = log det(H) + log det(I + tH^{−1/2}ZH^{−1/2})
     = log det(H) + Σ_i log(1 + tλ_i) ,

where λ_i are the eigenvalues of H^{−1/2}ZH^{−1/2}. Since t ↦ log(1 + tλ_i) is concave, their sum is also concave, proving that t ↦ log det(H + tZ) is concave.
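This concavity can be spot-checked numerically (our own illustration with an assumed 2×2 example, hand-coded determinants):

```python
import math

# Midpoint-concavity check of t -> log det(H + t*Z) for a fixed positive
# definite 2x2 H and a positive definite 2x2 Z (a spot check, not a proof).
def logdet2(m):
    return math.log(m[0][0] * m[1][1] - m[0][1] * m[1][0])

H = [[2.0, 0.3], [0.3, 1.0]]
Z = [[1.0, -0.2], [-0.2, 0.5]]

def g(t):
    shifted = [[H[i][j] + t * Z[i][j] for j in range(2)] for i in range(2)]
    return logdet2(shifted)
```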

21.3 Let A be a compact subset of ℝ^d and let (A_n)_n be a sequence of finite subsets with A_n ⊂ A_{n+1}, span(A_n) = ℝ^d and lim_{n→∞} d(A, A_n) = 0, where d is the Hausdorff metric. Then let π_n be a G-optimal design for A_n with support of size at most d(d+1)/2 and let V_n = V(π_n). Given any a ∈ A we have

    ‖a‖_{V_n^{−1}} ≤ min_{b∈A_n} ( ‖a − b‖_{V_n^{−1}} + ‖b‖_{V_n^{−1}} ) ≤ √d + min_{b∈A_n} ‖a − b‖_{V_n^{−1}} .

Let W ∈ ℝ^{d×d} be the matrix with columns w_1, …, w_d in A_1 that span ℝ^d. The operator norm of V_n^{−1/2} is bounded by

    ‖V_n^{−1/2}‖ = ‖W^{−1} W V_n^{−1/2}‖
     ≤ ‖W^{−1}‖ ‖V_n^{−1/2} W‖
     = ‖W^{−1}‖ sup{ ‖Wx‖_{V_n^{−1}} : ‖x‖_2 = 1 }
     ≤ ‖W^{−1}‖ sup{ Σ_{i=1}^d |x_i| ‖w_i‖_{V_n^{−1}} : ‖x‖_2 = 1 }
     ≤ ‖W^{−1}‖ √d sup{ Σ_{i=1}^d |x_i| : ‖x‖_2 = 1 }
     ≤ d ‖W^{−1}‖ .

Taking the limit as n tends to infinity shows that

    lim sup_{n→∞} ‖a‖_{V_n^{−1}} ≤ √d + lim sup_{n→∞} min_{b∈A_n} ‖a − b‖_{V_n^{−1}}
     ≤ √d + d‖W^{−1}‖ lim sup_{n→∞} min_{b∈A_n} ‖a − b‖_2
     ≤ √d .

Since ‖·‖_{V_n^{−1}} : A → ℝ is continuous and A is compact it follows that

    lim sup_{n→∞} sup_{a∈A} ‖a‖²_{V_n^{−1}} ≤ d .

Notice that π_n may be represented as a tuple of vector/probability pairs with at most d(d+1)/2 entries, where the vectors lie in A. Since the set of all such tuples with the obvious topology forms a compact set, it follows that (π_n) has a cluster point π*, which represents a distribution on A with support of size at most d(d+1)/2. The previous display shows that g(π*) ≤ d. The fact that g(π*) ≥ d follows from the same argument as in the proof of Theorem 21.1.

21.5 Let π be the Dirac at a and let π(t) = π* + t(π* − π). Since π*(a) > 0 it follows for sufficiently small t > 0 that π(t) is a distribution over A. Because π* is a maximiser of f,

    0 ≥ (d/dt) f(π(t))|_{t=0} = ⟨∇f(π*), π* − π⟩ = d − ‖a‖²_{V(π*)^{−1}} .


Rearranging shows that ‖a‖²_{V(π*)^{−1}} ≥ d. The other direction follows by Theorem 21.1.
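The Kiefer–Wolfowitz condition g(π*) = d that appears here can also be observed numerically. The following sketch (our own illustration, not the book's algorithm) runs a simple Frank–Wolfe/Wynn-style iteration towards the D-optimal design on an assumed small action set in ℝ²:

```python
# For a finite A spanning R^2, iterate towards the D-optimal design and check
# that min over iterations of max_a ||a||^2_{V(pi)^{-1}} approaches d = 2.
A = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, -1.0)]

def design_matrix(pi):
    v = [[0.0, 0.0], [0.0, 0.0]]
    for p, (x, y) in zip(pi, A):
        v[0][0] += p * x * x
        v[0][1] += p * x * y
        v[1][0] += p * y * x
        v[1][1] += p * y * y
    return v

def norm_sq_inv(a, v):
    # a^T V^{-1} a via the 2x2 adjugate
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    x, y = a
    return (x * (v[1][1] * x - v[0][1] * y) + y * (v[0][0] * y - v[1][0] * x)) / det

pi = [1.0 / len(A)] * len(A)
best_g = float("inf")
for t in range(20000):
    v = design_matrix(pi)
    scores = [norm_sq_inv(a, v) for a in A]
    best_g = min(best_g, max(scores))   # max_a norm is always >= d
    i = max(range(len(A)), key=scores.__getitem__)
    gamma = 1.0 / (t + 2)
    pi = [(1 - gamma) * p for p in pi]
    pi[i] += gamma
```

Since Σ_a π_a ‖a‖²_{V(π)^{−1}} = d for every design, the maximum is always at least d, and it converges down to d at the optimum.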

Chapter 22 Stochastic Linear Bandits for Finitely Many Arms

Chapter 23 Stochastic Linear Bandits with Sparsity

23.2 The usual idea does the trick. Recall that R_n = Σ_{i=1}^d R_{ni} where

    R_{ni} = n|θ_i| − E[ Σ_{t=1}^n A_{ti} θ_i ] .

We proved that

    R_{ni} ≤ 3|θ_i| + C log(n)/|θ_i| .

Clearly R_{ni} ≤ 2n|θ_i|. Let Δ > 0 be a constant to be tuned later. Then

    R_n ≤ Σ_{i:|θ_i|>Δ} ( 3|θ_i| + C log(n)/|θ_i| ) + Σ_{i:|θ_i|∈(0,Δ]} 2nΔ
     ≤ 3‖θ‖_1 + C‖θ‖_0 log(n)/Δ + 2‖θ‖_0 nΔ .

Choosing Δ = √(log(n)/n) completes the result.

Chapter 24 Minimax Lower Bounds for Stochastic Linear Bandits

24.1 Assume without loss of generality that i = 1 and let θ^{(−1)} ∈ Θ^{p−1}. The objective is to prove that

    (1/|Θ|) Σ_{θ^{(1)}∈Θ} R_{n1}(θ) ≥ √(kn)/8 .

For j ∈ [k] let T_j(n) = Σ_{t=1}^n I{B_{t1} = j} be the number of times base action j is played in the first bandit. Define ψ_0 ∈ ℝ^d to be the vector with ψ_0^{(−1)} = θ^{(−1)} and ψ_0^{(1)} = 0. For j ∈ [k] let ψ_j ∈ ℝ^d be given by ψ_j^{(−1)} = θ^{(−1)} and ψ_j^{(1)} = Δe_j. Abbreviate P_j = P_{ψ_j} and E_j[·] = E_{P_j}[·]. With this notation we have

    (1/|Θ|) Σ_{θ^{(1)}∈Θ} R_{n1}(θ) = (1/k) Σ_{j=1}^k Δ( n − E_j[T_j(n)] ) .   (24.1)


Lemma 15.1 gives that

    D(P_0, P_j) = (1/2) E_0[ Σ_{t=1}^n ⟨A_t, ψ_0 − ψ_j⟩² ] = (Δ²/2) E_0[T_j(n)] .

Choosing Δ = √(k/n)/2 and applying Pinsker's inequality yields

    Σ_{j=1}^k E_j[T_j(n)] ≤ Σ_{j=1}^k E_0[T_j(n)] + n Σ_{j=1}^k √( (1/2) D(P_0, P_j) )
     = n + n Σ_{j=1}^k √( (Δ²/4) E_0[T_j(n)] )
     ≤ n + n √( (kΔ²/4) Σ_{j=1}^k E_0[T_j(n)] )   (Cauchy–Schwarz)
     = n + n √(kΔ²n/4)
     ≤ 3nk/4 .   (since k ≥ 2)

Combining the above display with Eq. (24.1) completes the proof:

    (1/|Θ|) Σ_{θ^{(1)}∈Θ} R_{n1}(θ) = (1/k) Σ_{j=1}^k Δ( n − E_j[T_j(n)] ) ≥ nΔ/4 = (1/8)√(kn) .
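The final arithmetic (the choice of Δ and the resulting constant 1/8) can be double-checked mechanically; this is our own sanity check with hypothetical helper names:

```python
# With Delta = sqrt(k/n)/2 and sum_j E_0[T_j(n)] = n, verify the Pinsker bound
# n + n*sqrt(k*Delta^2*n/4) <= 3nk/4 for k >= 2, and that
# (1/k)*Delta*(kn - 3nk/4) equals sqrt(kn)/8.
def pinsker_bound(k, n):
    delta = (k / n) ** 0.5 / 2
    return n + n * (k * delta ** 2 * n / 4) ** 0.5

def average_regret(k, n):
    delta = (k / n) ** 0.5 / 2
    return (1 / k) * delta * (k * n - 3 * n * k / 4)
```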

Chapter 25 Asymptotic Lower Bounds for Stochastic Linear Bandits

25.3 For (a) let θ_1 = Δ and θ_i = 0 for i > 1 and let A = {e_1, …, e_{d−1}}. Then adding e_d increases the asymptotic regret. For (b) let θ_1 = Δ, θ_i = 0 for 1 < i < d, θ_d = 1 and A = {e_1, …, e_{d−1}}. Then for small values of Δ adding e_d decreases the asymptotic regret.

Chapter 26 Foundations of Convex Analysis

26.2 Let P be the measure on the space on which X is defined. Following the hint, let x_0 = E[X] ∈ ℝ^d. Then let a ∈ ℝ^d and b ∈ ℝ be such that ⟨a, x_0⟩ + b = f(x_0) and ⟨a, x⟩ + b ≤ f(x) for all x ∈ ℝ^d. The hyperplane {x : ⟨a, x⟩ + b − f(x_0) = 0} is guaranteed to exist by the supporting hyperplane theorem. Then

    ∫ f(X) dP ≥ ∫ (⟨a, X⟩ + b) dP = ⟨a, x_0⟩ + b = f(x_0) = f(E[X]) .


An alternative is of course to follow the ideas next to the picture in the main text. As you may recall, that proof is given for the case when X is discrete. To extend the proof to the general case, one can use the 'standard machinery' of building up the integral from simple functions, but the resulting proof, originally due to Needham [1993], is much longer than the one given above.

26.3

(a) Using the definition,

    f**(x) = sup_{u∈ℝ^d} ⟨x, u⟩ − f*(u)
     = sup_{u∈ℝ^d} ⟨x, u⟩ − ( sup_{y∈ℝ^d} ⟨y, u⟩ − f(y) )
     ≤ sup_{u∈ℝ^d} ⟨x, u⟩ − ( ⟨x, u⟩ − f(x) )
     = f(x) .

(b) We only need to show that f**(x) ≥ f(x):

    f**(x) = sup_{u∈ℝ^d} ⟨x, u⟩ − f*(u)
     ≥ ⟨x, ∇f(x)⟩ − f*(∇f(x))
     = ⟨x, ∇f(x)⟩ − ( sup_{y∈ℝ^d} ⟨y, ∇f(x)⟩ − f(y) )
     ≥ ⟨x, ∇f(x)⟩ − ( sup_{y∈ℝ^d} ⟨y, ∇f(x)⟩ − f(x) − ⟨y − x, ∇f(x)⟩ )
     = f(x) ,

where in the second inequality we used the definition of convexity to ensure that f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩.
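A one-dimensional grid computation (our own illustration, not part of the solution) makes both parts concrete: for the convex f(x) = x², the conjugate is f*(u) = u²/4 and the biconjugate recovers f on the interior of the grid:

```python
# Grid-based Fenchel conjugation: f*(u) = sup_x <x,u> - f(x) and
# f**(x) = sup_u <x,u> - f*(u); for f(x) = x^2 we expect f* = u^2/4, f** = f.
xs = [i / 100.0 for i in range(-300, 301)]   # grid on [-3, 3]
us = [i / 100.0 for i in range(-300, 301)]

def f(x):
    return x * x

def f_star(u):
    return max(x * u - f(x) for x in xs)

def f_star_star(x):
    return max(x * u - f_star(u) for u in us)
```

Note that f**(x) ≤ f(x) always holds (Part (a)), while equality needs the supremum over u to be attained inside the grid.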

26.9

(a) Fix u ∈ ℝ^d. By definition f*(u) = sup_x ⟨x, u⟩ − f(x). To find this value we solve for x where the derivative of ⟨x, u⟩ − f(x) in x is equal to zero. As calculated before, ∇f(x) = log(x). Thus, we need to find the solution to u = log(x), giving x = exp(u). Plugging in this value, we get f*(u) = ⟨exp(u), u⟩ − f(exp(u)). Now, f(exp(u)) = ⟨exp(u), log(exp(u))⟩ − ⟨exp(u), 1⟩ = ⟨exp(u), u⟩ − ⟨exp(u), 1⟩. Hence, f*(u) = ⟨exp(u), 1⟩ and ∇f*(u) = exp(u).

(b) From our calculation, dom(∇f∗) = Rd.

(c) Df∗(u, v) = f∗(u)− f∗(v)− 〈∇f∗(v), u− v〉 = 〈exp(u)− exp(v),1〉 − 〈exp(v), u− v〉.

(d) To check Part (a) of Theorem 26.6 note that ∇f(x) = log(x) and ∇f*(u) = exp(u), which are indeed inverses of each other and their respective domains match that of int(dom(f)) and


int(dom(f*)), respectively. To check Part (b) of Theorem 26.6, we calculate D_{f*}(∇f(y), ∇f(x)):

    D_{f*}(∇f(y), ∇f(x)) = ⟨exp(log(y)) − exp(log(x)), 1⟩ − ⟨exp(log(x)), log(y) − log(x)⟩
     = ⟨y − x, 1⟩ − ⟨x, log(y) − log(x)⟩ ,

which is indeed equal to D_f(x, y).
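The identity D_{f*}(∇f(y), ∇f(x)) = D_f(x, y) for f(x) = Σ_i (x_i log x_i − x_i) can also be confirmed numerically (our own check; the function names are hypothetical):

```python
import math

# f(x) = sum_i (x_i log x_i - x_i), grad f(x) = log(x),
# f*(u) = <exp(u), 1>, grad f*(v) = exp(v).
def grad_f(x):
    return [math.log(xi) for xi in x]

def D_f(x, y):
    return sum(xi * (math.log(xi) - math.log(yi)) - xi + yi for xi, yi in zip(x, y))

def D_f_star(u, v):
    return sum(math.exp(ui) - math.exp(vi) - math.exp(vi) * (ui - vi) for ui, vi in zip(u, v))

x = [0.2, 1.5, 0.7]
y = [0.9, 0.4, 2.0]
```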

26.13

(a) Let g(z) = f(z) − ⟨z − y, ∇f(y)⟩. Then by definition

    z ∈ argmin_{a∈A} g(a) ,

which exists by the assumption that A is compact. By convexity of A and the first-order optimality condition it follows that

    ∇_{x−z} f(z) − ⟨x − z, ∇f(y)⟩ = ∇_{x−z} g(z) ≥ 0 .

Therefore

    ∇_{x−z} f(z) ≥ ⟨x − z, ∇f(y)⟩ = ⟨x − y, ∇f(y)⟩ − ⟨z − y, ∇f(y)⟩ = ∇_{x−y} f(y) − ∇_{z−y} f(y) .

Substituting the definition of the Bregman divergence shows that

    D_f(x, y) ≥ D_f(x, z) + D_f(z, y) .

The proof fails when f is not differentiable at y because the map v ↦ ∇_v f(y) need not be linear.

(b) Consider the function f(x, y) = −(xy)^{1/4} and let y = (0, 0), x = (1, 0) and A = {(t, 1 − t) : t ∈ [0, 1]}. Then D_f(x, y) = D_f(z, y) = 0, but D_f(x, z) = ∞.

26.14 Parts (a) and (b) are immediate from convexity and the definitions. For Part (c), we have

    D_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩ ,

which is the sum of the convex function f and a function that is linear in x, and hence convex in x. Note that here we used differentiability of f at y. An example showing that differentiability at y is necessary occurs when f : ℝ² → ℝ is given by

    f(x) = max{1, ‖x‖_1} .

Consider y = (2, 0), x = (0, 2) and z = (0, −2). Then D_f(z, y) = D_f(x, y) = 0, but D_f((x + z)/2, y) = D_f(0, y) = 1 > (D_f(x, y) + D_f(z, y))/2 = 0.
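The numbers in this counterexample can be verified directly, using one-sided directional derivatives in place of the gradient (our own check; the piecewise-linear f makes the finite difference exact for small step sizes):

```python
# f(x) = max{1, ||x||_1}; D(p, y) = f(p) - f(y) - directional derivative of f
# at y in the direction p - y.  Expect D(x,y) = D(z,y) = 0 and D(mid,y) = 1.
def f(p):
    return max(1.0, abs(p[0]) + abs(p[1]))

def D(p, y, t=1e-6):
    v = (p[0] - y[0], p[1] - y[1])
    dir_deriv = (f((y[0] + t * v[0], y[1] + t * v[1])) - f(y)) / t
    return f(p) - f(y) - dir_deriv

y, x, z = (2.0, 0.0), (0.0, 2.0), (0.0, -2.0)
mid = ((x[0] + z[0]) / 2, (x[1] + z[1]) / 2)
```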


26.15 The first part follows immediately from Taylor's theorem. The second part takes a little work. To begin, abbreviate ‖x − y‖_z = ‖x − y‖_{∇²f(z)} and for t ∈ (0, 1) let

    δ(x, y, t) = D_f(x, y) − (1/2)‖x − y‖²_{tx+(1−t)y} ,

which is continuous on int(dom(f)) × int(dom(f)) × [0, 1]. Let (a, b) ⊂ (0, 1) and consider

    A = ⋃_{δ∈(0,b−a)∩ℚ} ⋂_{ε∈(0,1)∩ℚ} ⋃_{t∈U∩ℚ} { (x, y) : δ(x, y, t) ≤ ε } .

As we mentioned already, Taylor's theorem ensures that for all (x, y) ∈ int(dom(f)) × int(dom(f)) there exists a t ∈ [0, 1] such that δ(x, y, t) = 0. By the continuity of δ it follows that (x, y) ∈ A if and only if there exists a t ∈ (a, b) such that δ(x, y, t) = 0. Since (x, y) ↦ δ(x, y, t) is measurable for each t it follows that A is measurable. Let T(x, y) = {t : δ(x, y, t) = 0}. Then by the Kuratowski–Ryll-Nardzewski measurable selection theorem (Theorem 6.9.4, Bogachev [2007]) there exists a measurable function τ : int(dom(f)) × int(dom(f)) → (0, 1) such that τ(x, y) ∈ T(x, y) for all (x, y) ∈ dom(f) × dom(f). Therefore g(x, y) = τ(x, y)x + (1 − τ(x, y))y is measurable and the result is complete.

Chapter 27 Exp3 for Adversarial Linear Bandits

27.1 Let

    P̃_t(a) = exp( −η Σ_{s=1}^{t−1} Y_s(a) ) / Σ_{a'∈A} exp( −η Σ_{s=1}^{t−1} Y_s(a') ) .

Then

    P_t = (1 − γ)P̃_t + γπ .   (27.1)

Let L_n(a) = Σ_{t=1}^n Y_t(a), L̂_n = Σ_{t=1}^n ⟨P_t, Y_t⟩ and L̃_n = Σ_{t=1}^n ⟨P̃_t, Y_t⟩, where we abuse ⟨·, ·⟩ by defining ⟨p, y⟩ = Σ_{a∈A} p(a)y(a) for p, y : A → ℝ. Then

    R_n = max_{a∈A} R_n(a) , where R_n(a) = E[ Σ_{t=1}^n ⟨A_t, y_t⟩ − ⟨a, y_t⟩ ] .

As in the proof of Theorem 11.1,

    R_n(a) = E[ L̂_n − L_n(a) ] .


Now, by (27.1),

    L̂_n = (1 − γ)L̃_n + γ Σ_{t=1}^n ⟨π, Y_t⟩ .

Repeating the steps of the proof of Theorem 11.1 shows that, thanks to ηY_t(a) ≥ −1,

    L̃_n ≤ L_n(a) + log(k)/η + η Σ_{t=1}^n ⟨P̃_t, Y_t²⟩   (27.2)
     ≤ L_n(a) + log(k)/η + (η/(1 − γ)) Σ_{t=1}^n ⟨P_t, Y_t²⟩ ,

where Y_t² denotes the function a ↦ Y_t²(a) and the second inequality used that P̃_t = (P_t − γπ)/(1 − γ) ≤ P_t/(1 − γ). Now,

    L̂_n − L_n(a) ≤ γ Σ_{t=1}^n ⟨π, Y_t⟩ + (1 − γ)L_n(a) + log(k)/η + η Σ_{t=1}^n ⟨P_t, Y_t²⟩ − L_n(a)
     = log(k)/η + η Σ_{t=1}^n ⟨P_t, Y_t²⟩ + γ Σ_{t=1}^n ⟨π − e_a, Y_t⟩ ,

where e_a(a') = I{a = a'}. Now, thanks to −1 ≤ ⟨a, y_t⟩ ≤ 1,

    E[ ⟨π − e_a, Y_t⟩ ] = ⟨π − e_a, y_t⟩ ≤ 2 .

Putting things together,

    R_n ≤ max_a R_n(a) ≤ log(k)/η + 2γn + η Σ_{t=1}^n E[ ⟨P_t, Y_t²⟩ ] ,

thus finishing the proof.

27.4 Note that it suffices to show that ‖x‖²_{B^{−1}} − ‖x‖²_{A^{−1}} = ‖x‖²_{B^{−1}−A^{−1}} ≥ 0 for any x ∈ ℝ^d. Let x ∈ ℝ^d. Then, by the Cauchy–Schwarz inequality,

    ‖x‖²_{A^{−1}} = ⟨x, A^{−1}x⟩ ≤ ‖x‖_{B^{−1}} ‖A^{−1}x‖_B ≤ ‖x‖_{B^{−1}} ‖A^{−1}x‖_A = ‖x‖_{B^{−1}} ‖x‖_{A^{−1}} .

Hence ‖x‖_{A^{−1}} ≤ ‖x‖_{B^{−1}} for all x, which completes the claim.
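A 2×2 spot check of the conclusion (our own illustration with an assumed pair of matrices where A − B is positive semidefinite):

```python
# If A - B is positive semidefinite then x^T A^{-1} x <= x^T B^{-1} x.
def quad_inv(m, p):
    # p^T m^{-1} p via the 2x2 adjugate
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    a, b = p
    return (a * (m[1][1] * a - m[0][1] * b) + b * (m[0][0] * b - m[1][0] * a)) / det

B = [[1.0, 0.2], [0.2, 2.0]]
A = [[1.5, 0.1], [0.1, 2.5]]   # A - B = [[0.5, -0.1], [-0.1, 0.5]] is positive definite
```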

27.6

(a) A straightforward calculation shows that L = { y ∈ ℝ^d : ‖y‖_V ≤ 1 }. Let Tx = V^{1/2}x and note that T^{−1}L = TA = B = { u ∈ ℝ^d : ‖u‖_2 ≤ 1 }. Then let U be an ε-cover of B with respect to ‖·‖_2 with |U| ≤ (3/ε)^d and let C = T^{−1}U. Given x ∈ A let u = Tx and u' ∈ U be such that


‖u − u'‖_2 ≤ ε and x' = T^{−1}u'. Then

    ‖x − x'‖ = sup_{y∈L} ⟨x − x', y⟩ = sup_{y∈L} ⟨T^{−1}u − T^{−1}u', y⟩ = sup_{y∈L} ⟨u − u', T^{−1}y⟩ ≤ ε .

(b) Notice that L is convex, symmetric, bounded and span(L) = ℝ^d. Let E = { y ∈ ℝ^d : ‖y‖_V ≤ 1 } be the ellipsoid of maximum volume contained in cl(L). Then let

    E* = { y ∈ ℝ^d : ‖y‖_{V^{−1}} ≤ 1 } ,

which satisfies A ⊆ E*. Since span(L) = ℝ^d the matrix V^{−1} is positive definite. By the previous result there exists a C ⊂ E* of size at most (3d/ε)^d such that

    sup_{x∈E*} inf_{x'∈C} ‖x − x'‖_E ≤ ε/d .

Using the fact that L ⊆ dE we have

    sup_{x∈E*} inf_{x'∈C} ‖x − x'‖_L ≤ ε .

We are nearly done. The problem is that C may contain elements not in A. To resolve this issue let C̃ = { Π(x) : x ∈ C }, where Π(x) ∈ argmin_{x'∈A} ‖x − x'‖_E. Then note that

    ‖x − Π(x')‖_L ≤ d‖x − Π(x')‖_E ≤ d‖x − x'‖_E .

(c) Let C be an (ε/2)-cover of cl(co(A)), which by the previous part has size at most (6d/ε)^d. The result follows by choosing C̃ = { Π(x) : x ∈ C }, where Π is the projection onto A with respect to ‖·‖_E and E is the maximum volume ellipsoid contained in cl(co(A)).

27.8 Consider the case when d = 1, k = n and A = {1, −1, ε, ε/2, ε/4, …} for suitably small ε. Then, for t = 1, P_t is uniform on A and hence Q_t ≈ 2/k and, with probability 1 − 2/k, |Y_t| ≈ 1/k = 1/n. If η is small, then the algorithm will barely learn. If η is large, then it learns quickly that either 1 or −1 is optimal, but is too unstable to have small regret.

27.9 We can copy the proof presented for the finite-action case in the solution to Exercise 27.1 in an almost verbatim manner: the minor change is that we need to replace the sums over the action space with integrals. In particular, here we have

    P̃_t(a) = exp( −η Σ_{s=1}^{t−1} Y_s(a) ) / ∫_A exp( −η Σ_{s=1}^{t−1} Y_s(a') ) da'

and ⟨p, y⟩ = ∫_A p(a)y(a) da for p, y : A → ℝ. Now up to (27.2) everything is the same. Recall that the inequality in this display was obtained by following the steps of the proof of Theorem 11.1. Here we need to add a little detail because we need to change this inequality slightly.


We argue as follows. Define (W_t)_{t=0}^n by

    W_t = ∫_A exp( −η Σ_{s=1}^t Y_s(a) ) da ,

which means that W_0 = vol(A) and

    W_n = vol(A) ∏_{t=1}^n ( W_t/W_{t−1} ) .

Following the proof of Theorem 11.1, thanks to −ηY_t(a) ≤ 1,

    W_t/W_{t−1} = ∫_A exp(−ηY_t(a)) P̃_t(a) da
     ≤ ∫_A ( 1 − ηY_t(a) + η²Y_t²(a) ) P̃_t(a) da
     ≤ exp( −η⟨P̃_t, Y_t⟩ + η²⟨P̃_t, Y_t²⟩ ) .

Therefore,

    log W_n ≤ log(vol(A)) − η Σ_{t=1}^n ⟨P̃_t, Y_t⟩ + η² Σ_{t=1}^n ⟨P̃_t, Y_t²⟩ .

Recalling that L̃_n = Σ_{t=1}^n ⟨P̃_t, Y_t⟩, a rearrangement of the previous display gives

    L̃_n ≤ (1/η) log( vol(A)/W_n ) + η Σ_{t=1}^n ⟨P̃_t, Y_t²⟩ .   (27.3)

Let a* = argmin_{a∈A} Σ_{t=1}^n ⟨y_t, a⟩. Note that

    L_n(a*) = Σ_{t=1}^n Y_t(a*) = −(1/η) log( 1/exp( η Σ_{t=1}^n Y_t(a*) ) ) .

By adding and subtracting L_n(a*) on the right-hand side of Eq. (27.3) and using the last identity and the definition of W_n, we get

    L̃_n ≤ L_n(a*) + (1/η) log(K_n) + η Σ_{t=1}^n ⟨P̃_t, Y_t²⟩ ,

where

    K_n = vol(A) / ∫_A exp( −η Σ_{t=1}^n (Y_t(a) − Y_t(a*)) ) da ,

which is the inequality that replaces (27.2). In particular, the only difference between (27.2) and the above display is that log(k) is replaced by log(K_n). From here we can follow the steps of the proof of Exercise 27.1 up to the end to get

    R_n ≤ E[log(K_n)]/η + 2γn + η E[ Σ_{t=1}^n ⟨P_t, Y_t²⟩ ] .

The result is completed by noting that γ = ηd and using the same argument as in the proof of Theorem 27.1 to bound

    η E[ Σ_{t=1}^n ⟨P_t, Y_t²⟩ ] ≤ ηdn .

27.11 Throughout, ‖·‖ = ‖·‖_2 is the standard Euclidean norm. By translating K we may assume that x* = 0. Note that sup_{x,y∈K} ⟨x − y, u⟩ = sup_{x∈K} ⟨x, u⟩ (the last equality uses that x* = 0). Clearly, the claim is equivalent to

    ∫_K exp(−⟨x, u⟩) dx / vol(K) ≥ exp( −d( 1 + log₊( sup_{x∈K} ⟨x, u⟩/d ) ) ) =: g( sup_{x∈K} ⟨x, u⟩ ) .

We claim that it suffices to show the result for the case when ‖u‖ = 1. Indeed, if the claim were true for ‖u‖ = 1, then for any u ≠ 0 it would follow that

    ∫_K exp(−⟨x, u⟩) dx / vol(K) = ∫_K exp(−⟨x‖u‖, u/‖u‖⟩) dx / vol(K)
     = ∫_{‖u‖K} exp(−⟨y, u/‖u‖⟩) dy / ( ‖u‖^d vol(K) )
     = ∫_{‖u‖K} exp(−⟨y, u/‖u‖⟩) dy / vol(‖u‖K)
     ≥ g( sup_{x∈‖u‖K} ⟨x, u/‖u‖⟩ ) = g( sup_{x∈K} ⟨x, u⟩ ) .

Hence it remains to show the claim for vectors u with ‖u‖ = 1. With an entirely similar rescaling argument we can show that it suffices to consider the case when vol(K) = 1. From now on we assume both.

Introduce α = sup_{x∈K} ⟨x, u⟩. For t ∈ [0, α], define f(t) to be the volume of the slice K_t = { x ∈ K : ⟨x, u⟩ = t } with respect to the (d − 1)-dimensional Lebesgue measure. Since vol(K) = 1, f(t) > 0 for 0 < t < α and 1 = ∫_0^α f(t) dt. Clearly, we also have ∫_K exp(−⟨x, u⟩) dx = ∫_0^α f(t) exp(−t) dt. Now, since K is convex, for any t ∈ (0, α) we have f(t) ≥ vol_{d−1}((t/q)K_q) for any t ≤ q ≤ α.

Since e^{−t} is decreasing, a rearrangement argument shows that the function f that minimises ∫_0^α f(t) exp(−t) dt subject to the above properties is f(t) = (ct)^{d−1} for a suitable constant c > 0 chosen so that ∫_0^α f(t) dt = 1 (we want the function to increase as fast as possible). Note that f(t) then gives the volume of tK_α for a suitable set K_α, as shown in the figure below:

[Figure: the whole triangle is K = ⋃_{t∈[0,α]} tK_α with x* = 0 at the bottom corner; the thin lines represent tK_α for different values of t, which are (d − 1)-dimensional subsets of K lying in affine spaces with normal vector u.]

From the constraint ∫_0^α f(t) dt = 1 and f(t) = (ct)^{d−1} we get c^{d−1} = ( ∫_0^α t^{d−1} dt )^{−1}. We calculate

    ∫_K exp(−⟨x, u⟩) dx ≥ ∫_0^α exp(−t)(ct)^{d−1} dt = ∫_0^α exp(−t) t^{d−1} dt / ∫_0^α t^{d−1} dt ≥ min{1, (d/α)^d}/e^d = g(α) ,

where the final inequality follows because ∫_0^α t^{d−1} dt = α^d/d and

    ∫_0^α t^{d−1} exp(−t) dt ≥ (1/e^d) ∫_0^{α∧d} t^{d−1} dt = min(α, d)^d/(e^d d) .
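The last inequality can be spot-checked by quadrature for a few values of α and d (our own check, not a proof; `lhs`/`rhs` are hypothetical names):

```python
import math

# Check int_0^alpha t^{d-1} e^{-t} dt >= min(alpha, d)^d / (e^d * d).
def lhs(alpha, d, n=100001):
    h = alpha / (n - 1)
    total = 0.0
    for i in range(n):
        t = i * h
        w = 0.5 if i in (0, n - 1) else 1.0   # trapezoid weights
        total += w * t ** (d - 1) * math.exp(-t)
    return total * h

def rhs(alpha, d):
    return min(alpha, d) ** d / (math.exp(d) * d)
```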

Chapter 28 Follow-the-Regularised-Leader and Mirror Descent

28.1 The mapping a ↦ D_F(a, b) is the sum of the Legendre function F and a linear function, which is clearly Legendre. Hence, Φ is Legendre. Suppose now that c ∈ ∂ int(D) and let d ∈ A ∩ int(D) be arbitrary. Then the map α ↦ Φ(αc + (1 − α)d) must be decreasing. And yet,

    (d/dα) Φ(αc + (1 − α)d) = ⟨∇Φ(αc + (1 − α)d), c − d⟩ = ⟨y + ∇F(αc + (1 − α)d), c − d⟩ ,

which converges to infinity as α tends to one by Proposition 26.7, a contradiction.

28.5 The first step is the same as in the proof of Theorem 28.4:

    R_n(a) = Σ_{t=1}^n ⟨a_t − a, y_t⟩ = Σ_{t=1}^n ⟨a_t − a_{t+1}, y_t⟩ + Σ_{t=1}^n ⟨a_{t+1} − a, y_t⟩ .


Next let Φ_t(a) = F(a)/η + Σ_{s=1}^t ⟨a, y_s⟩, so that

    Σ_{t=1}^n ⟨a_{t+1} − a, y_t⟩ = Σ_{t=1}^n ⟨a_{t+1}, y_t⟩ − Φ_n(a) + F(a)/η
     = Σ_{t=1}^n ( Φ_t(a_{t+1}) − Φ_{t−1}(a_{t+1}) ) − Φ_n(a) + F(a)/η
     = −Φ_0(a_1) + Σ_{t=0}^{n−1} ( Φ_t(a_{t+1}) − Φ_t(a_{t+2}) ) + Φ_n(a_{n+1}) − Φ_n(a) + F(a)/η
     ≤ (F(a) − F(a_1))/η + Σ_{t=0}^{n−1} ( Φ_t(a_{t+1}) − Φ_t(a_{t+2}) ) .   (28.1)

Now D_{Φ_t}(a, b) = (1/η)D_F(a, b). Therefore,

    Φ_t(a_{t+1}) − Φ_t(a_{t+2}) = −∇_{a_{t+2}−a_{t+1}} Φ_t(a_{t+1}) − (1/η)D_F(a_{t+2}, a_{t+1}) ≤ −(1/η)D_F(a_{t+2}, a_{t+1}) ,

where the inequality follows by the first-order optimality condition applied to a_{t+1} = argmin_{a∈A∩dom(F)} Φ_t(a) and a_{t+2}. Substituting this into Eq. (28.1) completes the proof.

28.10

(a) For the first relation, direct calculation shows that P_{t+1,i} = P_{ti} exp(−ηY_{ti}) and

    D_F(P_t, P_{t+1}) = Σ_{i=1}^k P_{ti} log( P_{ti}/P_{t+1,i} ) − Σ_{i=1}^k P_{ti} + Σ_{i=1}^k P_{t+1,i}
     = Σ_{i=1}^k P_{ti} ( exp(−ηY_{ti}) − 1 + ηY_{ti} ) .

The second relation follows from the inequality exp(x) ≤ 1 + x + x²/2 for x ≤ 0.

(b) Using Part (a), we have

    (1/η) E[ Σ_{t=1}^n D_F(P_t, P_{t+1}) ] ≤ (η/2) E[ Σ_{t=1}^n Σ_{i=1}^k P_{ti} Y_{ti}² ] ≤ (η/2) E[ Σ_{t=1}^n Σ_{i=1}^k I{A_t = i}/P_{ti} ] = ηnk/2 .

(c) Simple calculus shows that for p ∈ P_{k−1}, F(p) ≥ −log(k) − 1, and F(p) ≤ −1 is obvious. Therefore diam_F(P_{k−1}) = max_{p,q∈P_{k−1}} F(p) − F(q) ≤ log(k).

(d) By the previous exercise, Exp3 chooses A_t sampled from P_t. Then applying the second bound of Theorem 28.4 together with Parts (b) and (c) and choosing η = √(log(k)/(2nk)) yields the result.


28.11 Abbreviate D(x, y) = D_F(x, y). By the definition of ã_{t+1} and the first-order optimality conditions we have η_t y_t = ∇F(a_t) − ∇F(ã_{t+1}). Therefore

    ⟨a_t − a, y_t⟩ = (1/η_t) ⟨a_t − a, ∇F(a_t) − ∇F(ã_{t+1})⟩
     = (1/η_t) ( −⟨a − a_t, ∇F(a_t)⟩ − ⟨a_t − ã_{t+1}, ∇F(ã_{t+1})⟩ + ⟨a − ã_{t+1}, ∇F(ã_{t+1})⟩ )
     = (1/η_t) ( D(a, a_t) − D(a, ã_{t+1}) + D(a_t, ã_{t+1}) ) .

Summing completes the proof. For the second part use the generalised Pythagorean theorem (Exercise 26.13) and positivity of the Bregman divergence to argue that D(a, ã_{t+1}) ≥ D(a, a_{t+1}).

28.12

(a) We use the same argument as in the solution to Exercise 28.5. First,

    R_n(a) = Σ_{t=1}^n ⟨a_t − a, y_t⟩ = Σ_{t=1}^n ⟨a_t − a_{t+1}, y_t⟩ + Σ_{t=1}^n ⟨a_{t+1} − a, y_t⟩ .

The next step also mirrors that in Exercise 28.5, but now we have to keep track of the changing potentials:

    Σ_{t=1}^n ⟨a_{t+1} − a, y_t⟩ = Σ_{t=1}^n ⟨a_{t+1}, y_t⟩ − Φ_{n+1}(a) + F_{n+1}(a)
     = Σ_{t=1}^n ( Φ_{t+1}(a_{t+1}) − Φ_t(a_{t+1}) ) + Σ_{t=1}^n ( F_t(a_{t+1}) − F_{t+1}(a_{t+1}) ) − Φ_{n+1}(a) + F_{n+1}(a)
     = −Φ_1(a_1) + Σ_{t=0}^{n−1} ( Φ_{t+1}(a_{t+1}) − Φ_{t+1}(a_{t+2}) ) + Φ_{n+1}(a_{n+1}) − Φ_{n+1}(a)
       + F_{n+1}(a) + Σ_{t=1}^n ( F_t(a_{t+1}) − F_{t+1}(a_{t+1}) )
     ≤ F_{n+1}(a) − F_1(a_1) + Σ_{t=0}^{n−1} ( Φ_{t+1}(a_{t+1}) − Φ_{t+1}(a_{t+2}) ) + Σ_{t=1}^n ( F_t(a_{t+1}) − F_{t+1}(a_{t+1}) ) .

Now D_{Φ_t}(a, b) = D_{F_t}(a, b). Therefore

    Φ_{t+1}(a_{t+1}) − Φ_{t+1}(a_{t+2}) = −∇_{a_{t+2}−a_{t+1}} Φ_{t+1}(a_{t+1}) − D_{F_{t+1}}(a_{t+2}, a_{t+1}) ≤ −D_{F_{t+1}}(a_{t+2}, a_{t+1}) ,

which combined with the previous big display completes the proof.

(b) Note that adding a constant to the potential changes neither the policy nor the Bregman divergence. Applying the previous part with F_t(a) = (F(a) − min_{b∈A} F(b))/η_t immediately gives the result.


28.13

(a) Apply your solution to Exercise 26.11.

(b) Since Y_t is unbiased we have

    R_n = max_{i∈[k]} E[ Σ_{t=1}^n (y_{tA_t} − y_{ti}) ] = max_{i∈[k]} E[ Σ_{t=1}^n ⟨P_t − e_i, Y_t⟩ ] .

Then apply the result from Exercise 28.12 combined with the fact that diam_F(A) = log(k).

(c) Consider two cases. First, if P_{t+1,A_t} ≥ P_{tA_t}, then

    ⟨P_t − P_{t+1}, Y_t⟩ = (P_{tA_t} − P_{t+1,A_t}) Y_{tA_t} ≤ 0 .

On the other hand, if P_{t+1,A_t} ≤ P_{tA_t}, then by Theorem 26.13 with H = ∇²f(q) = diag(1/q) for some q ∈ [P_t, P_{t+1}] we have

    ⟨P_t − P_{t+1}, Y_t⟩ − D_F(P_{t+1}, P_t)/η_t ≤ (η_t/2) ‖Y_t‖²_{H^{−1}} .

Since P_{t+1,A_t} ≤ P_{tA_t} we have

    (η_t/2) ‖Y_t‖²_{H^{−1}} = η_t q_{A_t} Y²_{tA_t}/2 ≤ η_t P_{tA_t} Y²_{tA_t}/2 ≤ η_t y²_{tA_t}/(2P_{tA_t}) .

Therefore

    ⟨P_t − P_{t+1}, Y_t⟩ − D_F(P_{t+1}, P_t)/η_t ≤ η_t y²_{tA_t}/(2P_{tA_t}) ≤ η_t/(2P_{tA_t}) .

(d) Continuing from the previous part and using the fact that E[1/P_{tA_t}] = k shows that

    R_n ≤ E[ log(k)/η_n + (1/2) Σ_{t=1}^n η_t/P_{tA_t} ] = log(k)/η_n + (k/2) Σ_{t=1}^n η_t .

(e) Choose η_t = √(log(k)/(kt)) and use the fact that Σ_{t=1}^n √(1/t) ≤ 2√n.

28.14

(a) The result is obvious for any algorithm when n < k. Assume for the remainder that n ≥ k. The learning rate is chosen to be

    η_t = √( k log(n/k) / ( 1 + Σ_{s=1}^{t−1} y²_{sA_s} ) ) ,


which is obviously decreasing. By noting that ∫ f'(x)/√f(x) dx = 2√f(x) and making a simple approximation,

    Σ_{t=1}^n η_t y²_{tA_t} ≤ 2 √( k Σ_{t=1}^n y²_{tA_t} log(n/k) ) .   (28.2)

Define R_n(p) = Σ_{t=1}^n ⟨P_t − p, Y_t⟩. Then

    R_n = sup_{p∈P_{k−1}} E[R_n(p)] ≤ k + sup_{p∈[1/n,1]^k∩P_{k−1}} E[R_n(p)] .   (28.3)

For the remainder of the proof let p ∈ [1/n, 1]^k ∩ P_{k−1} be arbitrary. Notice that F(p) − min_{q∈P_{k−1}} F(q) ≤ k log(n/k). By the result in Exercise 28.12,

    R_n(p) ≤ k log(n/k)/η_n + Σ_{t=1}^n ( ⟨P_t − P_{t+1}, Y_t⟩ − D_F(P_{t+1}, P_t)/η_t ) .   (28.4)

If P_{t+1,A_t} ≥ P_{tA_t}, then ⟨P_t − P_{t+1}, Y_t⟩ ≤ 0. Now suppose that P_{t+1,A_t} ≤ P_{tA_t}. By Theorem 26.12, there exists a ξ ∈ [P_t, P_{t+1}] such that

    ⟨P_t − P_{t+1}, Y_t⟩ − D_F(P_{t+1}, P_t)/η_t ≤ (η_t/2) ‖Y_t‖²_{∇²F(ξ)^{−1}} = (η_t/2) ξ²_{A_t} Y²_{tA_t} ≤ (η_t/2) P²_{tA_t} Y²_{tA_t} = η_t y²_{tA_t}/2 .

By the definition of η_n and Eq. (28.2),

    R_n(p) ≤ k log(n/k)/η_n + (1/2) Σ_{t=1}^n η_t y²_{tA_t} ≤ 2 √( k ( 1 + Σ_{t=1}^{n−1} y²_{tA_t} ) log(n/k) ) .

Therefore by Eq. (28.3),

    R_n ≤ k + 2 E[ √( k ( 1 + Σ_{t=1}^{n−1} y²_{tA_t} ) log(n/k) ) ]
     ≤ k + 2 √( k ( 1 + E[ Σ_{t=1}^{n−1} y²_{tA_t} ] ) log(n/k) ) ,

where the second line follows from Jensen's inequality.


(b) Combining the previous result with the fact that y_t ∈ [0, 1]^k shows that

    R_n ≤ k + 2 √( k ( 1 + E[ Σ_{t=1}^n y_{tA_t} ] ) log(n/k) )
     = k + 2 √( k ( 1 + R_n + min_{a∈[k]} Σ_{t=1}^n y_{ta} ) log(n/k) ) .

Solving the quadratic in R_n shows that for a suitably large universal constant C,

    R_n ≤ k( 1 + log(n/k) ) + C √( k ( 1 + min_{a∈[k]} Σ_{t=1}^n y_{ta} ) log(n/k) ) .

28.15 The first parts are mechanical and are skipped.

(e) By Part (c), Theorem 28.5 and Theorem 26.13,

    R_n ≤ 2√k/η + (η/2) E[ Σ_{t=1}^n ‖Y_t‖²_{∇²F(Z_t)^{−1}} ] ,   (28.5)

where Z_t = αP_t + (1 − α)P_{t+1} ∈ P_{k−1} for some α ∈ [0, 1]. Then

    E[ Σ_{t=1}^n ‖Y_t‖²_{∇²F(Z_t)^{−1}} ] = 2 E[ Σ_{t=1}^n Σ_{i=1}^k A_{ti} y²_{ti} Z^{3/2}_{ti}/P²_{ti} ]
     ≤ 2 E[ Σ_{t=1}^n Σ_{i=1}^k A_{ti}/P^{1/2}_{ti} ]
     = 2 E[ Σ_{t=1}^n Σ_{i=1}^k P^{1/2}_{ti} ]
     ≤ 2n√k ,

where the first inequality follows from Part (b) and the second from the Cauchy–Schwarz inequality. The result follows by substituting the above display into Eq. (28.5) and choosing η = √(2/n).

28.16

(a) Following the suggestion in the hint, let F be the negentropy potential and

    x_t = argmin_{x∈X} ( F(x) + η Σ_{s=1}^{t−1} f(x, y_s) ) ,
    y_t = argmin_{y∈Y} ( F(y) − η Σ_{s=1}^{t−1} f(x_s, y) ) .


Then let ε_d(n) = √(2 log(d)/n). By Proposition 28.7,

    min_{x∈X} max_{y∈Y} f(x, y) ≤ max_{y∈Y} f(x̄_n, y)
     = max_{y∈Y} (1/n) Σ_{t=1}^n f(x_t, y)
     ≤ (1/n) Σ_{t=1}^n f(x_t, y_t) + ε_k(n)
     ≤ min_{x∈X} (1/n) Σ_{t=1}^n f(x, y_t) + ε_j(n) + ε_k(n)
     = min_{x∈X} f(x, ȳ_n) + ε_j(n) + ε_k(n)
     ≤ max_{y∈Y} min_{x∈X} f(x, y) + ε_j(n) + ε_k(n) ,

where x̄_n = (1/n) Σ_{t=1}^n x_t and ȳ_n = (1/n) Σ_{t=1}^n y_t. Taking the limit as n tends to infinity shows that

    min_{x∈X} max_{y∈Y} f(x, y) ≤ max_{y∈Y} min_{x∈X} f(x, y) .

Since the other direction holds trivially, equality holds.

(b) We follow a similar plan. Let F(x) = (1/2)‖x‖₂², g_s(x) = f(x, y_s) and h_s(y) = f(x_s, y). Then define

    x_t = argmin_{x∈X} ( F(x) + η Σ_{s=1}^{t−1} ⟨x, ∇g_s(x_s)⟩ ) ,
    y_t = argmin_{y∈Y} ( F(y) − η Σ_{s=1}^{t−1} ⟨y, ∇h_s(y_s)⟩ ) .

Let G = sup_{x∈X,y∈Y} ‖∇f(x, y)‖₂ and B = sup_{z∈X∪Y} ‖z‖₂, and let ε(n) = GB√(1/n). A straightforward generalisation of the above argument and the analysis in Proposition 28.6 shows


that

    (1/n) Σ_{t=1}^n f(x_t, y_t) = (1/n) Σ_{t=1}^n g_t(x_t)
     = min_{x∈X} (1/n) ( Σ_{t=1}^n g_t(x) + Σ_{t=1}^n (g_t(x_t) − g_t(x)) )
     ≤ min_{x∈X} (1/n) ( Σ_{t=1}^n g_t(x) + Σ_{t=1}^n ⟨x_t − x, ∇g_t(x_t)⟩ )
     ≤ min_{x∈X} (1/n) Σ_{t=1}^n g_t(x) + ε(n)
     = min_{x∈X} (1/n) Σ_{t=1}^n f(x, y_t) + ε(n)
     ≤ min_{x∈X} f(x, ȳ_n) + ε(n) .

In the same manner,

    max_{y∈Y} f(x̄_n, y) ≤ (1/n) Σ_{t=1}^n f(x_t, y_t) + ε(n) .

Hence

    min_{x∈X} max_{y∈Y} f(x, y) ≤ max_{y∈Y} f(x̄_n, y) ≤ min_{x∈X} f(x, ȳ_n) + 2ε(n) ≤ max_{y∈Y} min_{x∈X} f(x, y) + 2ε(n) ,

and the result is again completed by taking the limit as n tends to infinity.

In both cases the pair of average iterates (x̄_n, ȳ_n) has a cluster point that is a saddle point of f(·, ·). In general the iterates (x_n, y_n) may not have a cluster point that is a saddle point.
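The self-play argument of Part (a) can be watched in action on a small bilinear game. This is our own sketch with an assumed 2×2 payoff matrix and step size, not the book's construction: both players run exponential weights, and the duality gap of the average iterates shrinks like the sum of the regrets divided by n.

```python
import math

G = [[1.0, -1.0], [-1.0, 2.0]]   # an assumed zero-sum game; its value is 0.2
n, k = 4000, 2
eta = math.sqrt(8 * math.log(k) / n)

Lx = [0.0] * k   # cumulative losses of pure actions for the x-player (minimiser)
Ly = [0.0] * k   # cumulative losses for the y-player (maximiser), hence the minus
xbar = [0.0] * k
ybar = [0.0] * k
for _ in range(n):
    sx = sum(math.exp(-eta * l) for l in Lx)
    sy = sum(math.exp(-eta * l) for l in Ly)
    x = [math.exp(-eta * l) / sx for l in Lx]
    y = [math.exp(-eta * l) / sy for l in Ly]
    for i in range(k):
        xbar[i] += x[i] / n
        ybar[i] += y[i] / n
    for i in range(k):
        Lx[i] += sum(G[i][j] * y[j] for j in range(k))
        Ly[i] -= sum(x[j] * G[j][i] for j in range(k))

# duality gap of the average iterates: max_y f(xbar, y) - min_x f(x, ybar)
gap = (max(sum(xbar[i] * G[i][j] for i in range(k)) for j in range(k))
       - min(sum(G[i][j] * ybar[j] for j in range(k)) for i in range(k)))
```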

28.17 Let X = Y = ℝ and f(x, y) = x + y. Clearly X and Y are convex topological vector spaces and f is linear in both arguments. Then inf_{x∈X} sup_{y∈Y} f(x, y) = ∞ and sup_{y∈Y} inf_{x∈X} f(x, y) = −∞. For a bounded example, consider X = Y = [1, ∞) and f(x, y) = y/(x + y).

Chapter 29 The Relation between Adversarial and Stochastic Linear Bandits

29.2 First we check that θ_t = dE_tA_tY_t/(1 − ‖A_t‖₂) is appropriately bounded. Indeed,

    η‖θ_t‖₂ = ηd E_t ‖A_t‖₂ |Y_t| / (1 − ‖A_t‖₂) ≤ ηd/(1 − ‖A_t‖₂) ≤ 1/2 ,

where the last step holds by choosing 1 − r = 2ηd. All of the steps in the proof of Theorem 28.11 are the same until the expectation of the dual norm of θ_t. Then

    E[ ‖θ_t‖²_{∇²F(Z_t)^{−1}} ] ≤ E[ (1 − ‖Z_t‖₂) ‖θ_t‖₂² ] = d² E[ (1 − ‖Z_t‖₂) E_t Y_t² / (1 − ‖A_t‖₂)² ] ≤ 2d² .

This last inequality is where things have changed, with the d becoming a d². From this we conclude that

    R_n ≤ (1/η) log( 1/(2ηd) ) + (1 − r)n + ηnd² ≤ (1/η) log( 1/(2ηd) ) + 2ηnd + ηnd² ,

and the result follows by tuning η.

29.4

(a) Let a* = argmin_{a∈A} ℓ(a) be the optimal action. Then

    R_n = E[ Σ_{t=1}^n ℓ(A_t) − ℓ(a*) ] ≤ E[ Σ_{t=1}^n ⟨A_t − a*, θ⟩ ] + 2nε .   (29.1)


The estimator is θ_t = dE_tA_tY_t/(1 − ‖A_t‖₂), which is no longer unbiased. Then

    E[θ_t | F_{t−1}] = E[ dE_tA_tY_t/(1 − ‖A_t‖₂) | F_{t−1} ]
     = d E[ E_tA_t( ⟨A_t, θ⟩ + η_t + ε(A_t) )/(1 − ‖A_t‖₂) | F_{t−1} ]
     = θ + Σ_{i=1}^d ε(e_i) e_i ,

which, given that A_t is F_{t−1}-measurable, implies that

    E[⟨A_t − a*, θ⟩] ≤ E[⟨A_t − a*, θ_t⟩] + ε E[ ‖A_t − a*‖₁ ‖1‖_∞ ] ≤ E[⟨A_t − a*, θ_t⟩] + 2ε√d .

Combining with Eq. (29.1) shows that

    R_n ≤ E[ Σ_{t=1}^n ⟨A_t − a*, θ_t⟩ ] + 2εn + 2εn√d .

Then we need to check that η‖θ_t‖₂ = η‖dE_tA_tY_t/(1 − ‖A_t‖₂)‖₂ ≤ ηd/(1 − r) ≤ 1/2. Now proceed as in Exercise 29.2.

(b) When ε(a) = 0 for all a, the lower bound is Ω(d√n). Now add a spike ε(a) = −ε in the vicinity of the optimal arm. Since A is continuous, the learner will almost surely never identify the 'needle' and hence its regret is Ω(d√n + εn). The √d factor cannot be improved greatly, but the argument is more complicated [Lattimore and Szepesvári, 2019].

Chapter 30 Combinatorial Bandits

30.4

(b) Using the independence of (Xj)_{j=1}^d shows that almost surely,

    E[Mj | Mj−1] = Mj−1 ∫₀^{Mj−1} exp(−x) dx + ∫_{Mj−1}^∞ x exp(−x) dx = Mj−1 + exp(−Mj−1) .

Taking the expectation of both sides yields the result.

(c) The base case when j = 1 is immediate. For j ≥ 2,

    E[exp(−aMj)] = E[exp(−aMj−1)] − (a/(a + 1)) E[exp(−(a + 1)Mj−1)] .


Therefore by induction it follows that

    E[exp(−aMj)] = a! / ∏_{b=1}^a (j + b) .

(d) Combining (b) and (c) shows that for j ≥ 2,

    E[Mj] = E[Mj−1] + E[exp(−Mj−1)] = E[Mj−1] + 1/j .

The result follows by induction.
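Parts (b)–(d) together say that the running maximum Mk of k independent Exp(1) variables has expectation ∑_{j=1}^k 1/j. A quick Monte Carlo sanity check (sample size and seed are arbitrary):

```python
import random

random.seed(0)
k, trials = 10, 200_000
# M_k = max of k iid Exp(1) variables; part (d) gives E[M_k] = H_k = sum_{j<=k} 1/j.
est = sum(max(random.expovariate(1.0) for _ in range(k)) for _ in range(trials)) / trials
harmonic = sum(1.0 / j for j in range(1, k + 1))
print(est, harmonic)
```

The two printed numbers agree to roughly Monte Carlo accuracy (a few thousandths here).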

30.5 Since A is compact, dom(φ) = ℝᵈ. Let D be the set of points x at which φ is differentiable. Then, as noted in the hint, λ(ℝᵈ \ D) = 0, where λ is the Lebesgue measure. Since Q ≪ λ, it follows that Q(ℝᵈ \ D) = 0 as well. Define a(x) = argmax_{a∈A} 〈a, x〉. Let v ∈ ℝᵈ be non-zero. Then, by the second part of the hint, the directional derivative of φ is

    ∇vφ(x) = max_{a∈A(x)} 〈a, v〉 .

By the last part of the hint, for x ∈ D this implies that A(x) is a singleton and thus ∇φ(x) = a(x). Let q = dQ/dλ be the density of Q with respect to the Lebesgue measure. Then, for any v ∈ ℝᵈ,

    ∇v ∫_{ℝᵈ} φ(x + z)q(z) dz = ∫_{ℝᵈ} ∇vφ(x + z)q(z) dz = ∫_{D−x} ∇vφ(x + z)q(z) dz = ∫_{D−x} 〈a(x + z), v〉 q(z) dz = 〈∫_{D−x} a(x + z)q(z) dz, v〉 ,

where the exchange of limit (hidden in the derivative) and integral is justified by the dominated convergence theorem. By the last part of the hint, since v ∈ ℝᵈ was arbitrary, it follows that ∇∫_{ℝᵈ} φ(x + z)q(z) dz exists and is equal to ∫_{D−x} a(x + z)q(z) dz = E[a(x + Z)].
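The identity ∇∫φ(x + z)q(z) dz = E[a(x + Z)] can be sanity-checked numerically. The sketch below uses a small hypothetical action set in ℝ², standard Gaussian smoothing and common random samples, and compares a finite-difference gradient of the smoothed support function with the sample average of the maximising actions:

```python
import random

random.seed(1)
A = [(1.0, 0.0), (0.0, 1.0), (-0.5, -0.5)]          # hypothetical finite action set in R^2
Z = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50_000)]

def phi(x):                                          # support function of A
    return max(a[0] * x[0] + a[1] * x[1] for a in A)

def argmax_action(x):                                # a(x) = argmax_{a in A} <a, x>
    return max(A, key=lambda a: a[0] * x[0] + a[1] * x[1])

x, h = (0.3, -0.2), 1e-5
# finite-difference gradient of F*(x) = E[phi(x + Z)], using common samples Z
grad = [sum(phi((x[0] + h * e[0] + z[0], x[1] + h * e[1] + z[1]))
            - phi((x[0] - h * e[0] + z[0], x[1] - h * e[1] + z[1])) for z in Z) / (2 * h * len(Z))
        for e in ((1, 0), (0, 1))]
# the claimed gradient E[a(x + Z)]
mean_a = [sum(argmax_action((x[0] + z[0], x[1] + z[1]))[i] for z in Z) / len(Z) for i in range(2)]
print(grad, mean_a)
```

Because the same samples are used on both sides, the two vectors differ only on the negligible fraction of samples near a tie between actions.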

30.6

(a) To show that F is well defined we need to show that F* is the Fenchel dual of a unique proper convex closed function. Let g = (F*)*. It is not hard to see that F* is a proper convex function, dom(F*) = ℝᵈ, and hence the epigraph of F* is closed. Then, by the hint, g* = (F*)** = F*, hence F* is the Fenchel dual of g. By the hint, the Fenchel dual of g is a proper convex closed function, so we can take F = g*. It remains to show that there is only a single proper convex closed function whose Fenchel dual is F*. To show this let g, h be proper convex closed functions such that g* = h* = F*. Then g = g** = (F*)* = h** = h, hence F is uniquely defined.


(b) By Part (c) of Theorem 26.6, it suffices to show that F* is Legendre. As noted earlier, the domain of F* is all of ℝᵈ. From Exercise 30.5, it follows that F* is everywhere differentiable. Part (c) of the definition of Legendre functions is automatically satisfied since ∂ℝᵈ = ∅, hence it remains to prove that F* is strictly convex.

To prove this, we need to prove that for all x ≠ y,

    F*(y) > F*(x) + 〈y − x, ∇F*(x)〉 .

Let a(x) = argmax_{a∈A} 〈a, x〉, with ties broken arbitrarily, and let q = dQ/dλ be the density of Q with respect to the Lebesgue measure. Recalling the definitions and the result of Exercise 30.5,

    F*(y) − F*(x) − 〈y − x, ∇F*(x)〉
      = ∫_{ℝᵈ} φ(y + z)q(z) dz − ∫_{ℝᵈ} φ(x + z)q(z) dz − 〈y − x, ∫_{ℝᵈ} a(x + z)q(z) dz〉
      = ∫_{ℝᵈ} 〈y + z, a(y + z) − a(x + z)〉 q(z) dz
      = ∫_{ℝᵈ} 〈u, a(u) − a(u + δ)〉 q(u − y) du ,

where δ = x − y. Clearly the term f(u) = 〈a(u) − a(u + δ), u〉 is nonnegative for any u ∈ ℝᵈ. Since by assumption q > 0, it suffices to show that f is strictly positive over a set of positive volume. The assumption that span(A) = ℝᵈ means that 〈a(−δ/2) − a(δ/2), −δ/2〉 = ε > 0. To see this, notice that

    co(A) ⊂ {x : 〈x, δ/2〉 ≤ 〈a(δ/2), δ/2〉}  and
    co(A) ⊂ {x : 〈x, −δ/2〉 ≤ 〈a(−δ/2), −δ/2〉} = {x : 〈x, δ/2〉 ≥ 〈a(−δ/2), δ/2〉} .

Were it the case that 〈a(−δ/2), δ/2〉 = 〈a(δ/2), δ/2〉, then co(A) would be a subset of a (d − 1)-dimensional hyperplane, contradicting the assumption that span(A) = ℝᵈ. Let u = −δ/2, write ‖·‖ = ‖·‖₂ and diam(A) = diam_{‖·‖}(A). Then

    〈a(−δ/2 + v), −δ/2〉 = 〈a(v − δ/2), v − δ/2〉 − 〈a(v − δ/2), v〉
      ≥ 〈a(−δ/2), v − δ/2〉 − ‖v‖ diam(A)
      ≥ 〈a(−δ/2), −δ/2〉 − 2‖v‖ diam(A) .


Similarly, 〈a(v + δ/2), δ/2〉 ≥ 〈a(δ/2), δ/2〉 − 2‖v‖ diam(A) and hence

    f(u + v) = 〈a(u + v) − a(u + δ + v), u + v〉
      ≥ 〈a(u + v) − a(u + δ + v), u〉 − 2‖v‖ diam(A)
      = 〈a(v − δ/2) − a(v + δ/2), −δ/2〉 − 2‖v‖ diam(A)
      = 〈a(v − δ/2), −δ/2〉 + 〈a(δ/2 + v), δ/2〉 − 2‖v‖ diam(A)
      ≥ 〈a(−δ/2), −δ/2〉 + 〈a(δ/2), δ/2〉 − 6‖v‖ diam(A)
      = 〈a(−δ/2) − a(δ/2), −δ/2〉 − 6‖v‖ diam(A)
      = ε − 6‖v‖ diam(A) .

Thus, for sufficiently small ‖v‖, it holds that f(u + v) ≥ ε/2 and the claim follows.

(c) We need to show that int(dom(F)) = int(co(A)). By the first two parts of the exercise, F is Legendre, and hence by Part (a) of Theorem 26.6 and by Exercise 30.5, we have

    int(dom(F)) = ∇F*(ℝᵈ) = {∫_{ℝᵈ} a(x + z)q(z) dz : x ∈ ℝᵈ} .

Clearly, this is a subset of int(co(A)). To establish the equality, by convexity of co(A) it suffices to show that for any extreme point a ∈ A and ε > 0 there exists an x such that ‖∇F*(x) − a‖ ≤ ε. To show this, choose a vector x₀ ∈ ℝᵈ so that a(x₀ + v) = a for any v in the unit ball centered at zero. Such a vector exists because of the conditions on A. Let Kε be a closed ball centered at zero such that Q(Kε) ≥ 1 − ε/(max_{a∈A} ‖a‖), which exists because Q(ℝᵈ) = 1. Let r be the radius of Kε and pick any c > r. Then, for any v ∈ Kε, a(cx₀ + v) = a(x₀ + v/c) = a and hence ∇F*(cx₀) = s + ∫_{Kε} a(cx₀ + z)q(z) dz = s + a, where ‖s‖ ≤ ε, finishing the proof.

30.8 Following the advice, assume that the learner plays m bandits in parallel, each having k = d/m actions. Let Rni be the regret of the learner in the ith bandit. Then Rni ≥ c√(nk) = c√(nd/m) for some universal constant c > 0. Further, if Rn is the regret of the learner, then Rn = ∑_{i=1}^m Rni. Hence,

    Rn ≥ c√(ndm) .

An alternative to this is to emulate a k = d/m-armed bandit with scaled rewards. For this imagine that the d items (components of the combinatorial action) are partitioned into k parts, each having m items in it. Unlike in multi-task bandits, the learner needs to choose a part and receives feedback for all the items in it. Hence the rewards received belong to the interval [0, m] and we also get Rn ≥ cm√(nk) = c√(ndm).

Chapter 31 Non-stationary Bandits

31.1 As suggested, Exp4 is used with each element of Γnm identified with one expert. Consider an arbitrary enumeration Γnm = {a^(1), . . . , a^(G)}, where G = |Γnm|. The prediction of expert g ∈ [G] for round t ∈ [n], encoded as a probability distribution over [k] (as required by the prediction-with-expert-advice framework), is E^(t)_{g,j} = I{a^(g)_t = j}, j ∈ [k]. The expected regret of Exp4 when


used with these experts is

    Rn^experts = E[∑_{t=1}^n y_{t,At} − min_{g∈[G]} ∑_{t=1}^n 〈E^(t)_g, yt〉] ,

where compared to Chapter 18 we switched to losses. By definition,

    ∑_{t=1}^n 〈E^(t)_g, yt〉 = ∑_{t=1}^n y_{t,a^(g)_t}

and hence Rn^experts = Rnm. Thus, Theorem 18.1 indeed proves (31.1). To prove (31.2) it remains to show that log(G) = log |Γnm| ≤ Cm log(kn/m) for a suitable constant C. For this note that G = ∑_{s=1}^m G*_{n,s}, where G*_{n,s} is the number of sequences from [k]ⁿ that switch exactly s − 1 times. When m − 1 ≤ n/2, a crude upper bound on G is mG*_{n,m}. For s = 1, G*_{n,s} = k. For s > 1, a sequence with s − 1 switches is determined by the location of the switches and the identity of the action taken in each segment where the action does not change. The possible switch locations are of the form (t, t + 1) with t = 1, . . . , n − 1. Thus the number of these locations is n − 1, of which we need to choose s − 1. There are (n−1 choose s−1) ways of doing this. Since there are s segments, and for the first segment we can choose any action while for the others we can choose any action other than the one chosen for the previous segment, there are k(k − 1)^{s−1} valid ways of assigning actions to segments. Thus, G*_{n,s} = k(k − 1)^{s−1} (n−1 choose s−1). Define Φm(n) = ∑_{i=0}^m (n choose i). Hence,

    G ≤ kᵐ ∑_{s=0}^{m−1} (n−1 choose s) = kᵐ Φ_{m−1}(n − 1) ≤ kᵐ Φm(n) .

Now note that for n ≥ m, 0 ≤ m/n ≤ 1, hence

    (m/n)ᵐ Φm(n) ≤ ∑_{i=0}^m (m/n)ⁱ (n choose i) ≤ ∑_{i=0}^n (m/n)ⁱ (n choose i) = (1 + m/n)ⁿ ≤ eᵐ .

Reordering gives Φm(n) ≤ (en/m)ᵐ. Hence log(G) ≤ m log(ekn/m). Plugging this into (31.1) gives (31.2).
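The two counting facts used above, G*_{n,s} = k(k − 1)^{s−1}(n−1 choose s−1) and Φm(n) ≤ (en/m)ᵐ, can be verified by brute force for small n and k:

```python
import itertools, math

k, n = 3, 6
counts = {}
for seq in itertools.product(range(k), repeat=n):
    s = 1 + sum(1 for t in range(n - 1) if seq[t] != seq[t + 1])   # number of segments
    counts[s] = counts.get(s, 0) + 1

for s in range(1, n + 1):
    # sequences with s - 1 switches: k * (k-1)^(s-1) * C(n-1, s-1)
    assert counts.get(s, 0) == k * (k - 1) ** (s - 1) * math.comb(n - 1, s - 1)

# Phi_m(n) = sum_{i<=m} C(n, i) <= (e*n/m)^m
for m in range(1, n + 1):
    phi = sum(math.comb(n, i) for i in range(m + 1))
    assert phi <= (math.e * n / m) ** m

print(sorted(counts.items()))
```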

31.3 Use the construction and analysis in Exercise 11.6 and note that when m = 2 the random version of the regret is nonnegative on the bandit constructed there.

Chapter 32 Ranking

32.2 The argument is half-convincing. The heart of the argument is that under the criterion that at least one item should attract the user, it may be suboptimal to present the list composed of the fittest items. The example with the query 'jaguar' is clear: assume half of the users mean 'jaguar' as the big cat, while the other half mean it as the car. Presenting items that are relevant for both meanings may have a better chance of satisfying a randomly picked user than going with the top-m list, which may happen to support only one of the meanings. This shows that there


is indeed an issue with 'linearizing' the problem by just considering individual item fitness values.

However, the argument is confusing in other ways. First, it treats conditions (for example, independence of attractiveness) that are sufficient but not necessary to validate the probabilistic ranking principle (PRP) as if they were also necessary. In fact, in the click model studied here, the mentioned independence assumption is not needed. To clarify, the strong assumption in the stochastic click model is that the optimal list is indeed optimal. Under this assumption, the independence assumption is not needed.

Next, that the same document can have different relevance to different users fits even the cascade model, where the vector of attraction values is different each time it is sampled from the model. So this alone would not undermine the PRP.

Finally, the last sentence confuses relevance and 'usefulness'. Again, in the cascade model, the relevance (attractiveness) of a document (item) does not depend on the relevance of any other document. Yet the reward in the cascade model is exactly one if and only if at least one document presented is relevant (attractive).

32.6 Following the first part of the proof of Theorem 32.2 until Eq. (32.5), we have

    Rn ≤ nmP(Fn) + ∑_{j=1}^ℓ ∑_{i=1}^{min{m,j−1}} E[I{Fnᶜ} ∑_{t=1}^n Utij] .

As before, the first term is bounded using Lemma 32.4. Then using the first part of the proof of Lemma 32.7 shows that

    I{Fnᶜ} ∑_{t=1}^n Utij ≤ 1 + √(2Nnij log(c√n/δ)) .

Substituting into the previous display and applying Cauchy-Schwarz shows that

    Rn ≤ nmP(Fn) + mℓ + √(2mℓ E[∑_{j=1}^ℓ ∑_{i=1}^{min{m,j−1}} Nnij] log(c√n/δ)) .

Writing out the definition of Nnij reveals that we need to bound

    E[∑_{j=1}^ℓ ∑_{i=1}^{min{m,j−1}} Nnij] = ∑_{t=1}^n E[E[∑_{d=1}^{Mt} ∑_{j∈Ptd} ∑_{i∈Ptd∩[m]} Utij | Ft−1]]
      ≤ ∑_{t=1}^n E[E[∑_{d=1}^{Mt} ∑_{j∈Ptd} ∑_{i∈Ptd∩[m]} (Cti + Ctj) | Ft−1]] = (A) .


Expanding the two terms in the inner sum and bounding each separately leads to

    E[∑_{d=1}^{Mt} ∑_{j∈Ptd} ∑_{i∈Ptd∩[m]} Cti | Ft−1] = E[∑_{d=1}^{Mt} |Ptd| ∑_{i∈Ptd∩[m]} Cti | Ft−1] ≤ ∑_{d=1}^{Mt} |Itd ∩ [m]||Ptd ∩ [m]| ≤ m² ,

where the inequality follows from the fact that for i ∈ Ptd,

    E[Cti | Ft−1] ≤ P(At⁻¹(i) ∈ [m] | Ft−1) = |Itd ∩ [m]|/|Itd| = |Itd ∩ [m]|/|Ptd| .

For the second term that makes up (A),

    E[∑_{d=1}^{Mt} ∑_{j∈Ptd} ∑_{i∈Ptd∩[m]} Ctj | Ft−1] = E[∑_{d=1}^{Mt} |Ptd ∩ [m]| ∑_{j∈Ptd} Ctj | Ft−1] ≤ ∑_{d=1}^{Mt} |Ptd ∩ [m]||Itd ∩ [m]| ≤ m² .

Hence (A) ≤ 2nm² and

    Rn ≤ nmP(Fn) + mℓ + √(4m³ℓn log(c√n/δ)) ,

and the result follows from Lemma 32.4.

Chapter 33 Pure Exploration

33.3 Abbreviate f(α) = inf_{d∈D} 〈α, d〉, which is clearly positively homogeneous: f(cα) = cf(α) for any c ≥ 0. Because D is nonempty, f(0) = 0. Hence we can ignore α = 0 in both optimisation problems and so

    (sup_{α∈Pk−1} f(α))⁻¹ L = inf_{α∈Pk−1} L/f(α) = inf_{α≥0:‖α‖₁>0} L‖α‖₁/f(α) = inf_{α≥0:‖α‖₁>0} ‖Lα/f(α)‖₁ = inf{‖α‖₁ : f(α) ≥ L} ,

where we used the positive homogeneity of f and of the ℓ₁ norm.
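The equality between the two optimisation problems can be checked numerically on a small example with a finite D (the particular vectors and grid resolutions below are arbitrary):

```python
# Finite D of vectors with positive entries, f(alpha) = min_d <alpha, d>.
D = [(1.0, 3.0), (2.0, 1.5), (4.0, 1.0)]
L = 5.0

def f(alpha):
    return min(alpha[0] * d[0] + alpha[1] * d[1] for d in D)

# sup of f over the simplex P_1, via a grid
steps = 2000
sup_f = max(f((t / steps, 1 - t / steps)) for t in range(steps + 1))
lhs = L / sup_f

# brute-force min{ ||alpha||_1 : f(alpha) >= L } over a grid of alpha >= 0
grid = [i * 0.01 for i in range(301)]          # [0, 3] in each coordinate
rhs = min(a1 + a2 for a1 in grid for a2 in grid if f((a1, a2)) >= L)
print(lhs, rhs)
```

The two values agree up to grid resolution, as predicted by positive homogeneity.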

33.4

84

Page 85: Tor Lattimore and Csaba Szepesv´ari · Solutions to Selected Exercises in Bandit Algorithms Tor Lattimore and Csaba Szepesv´ari Draft of Thursday 27th June, 2019 Contents 2Foundations

(a) For each i > 1 define

    Ei = {ν′ ∈ E : µ1(ν′) = µi(ν′) and µj(ν′) = µj(ν) for j ∉ {1, i}} .

You can easily show that

    inf_{ν′∈Ealt(ν)} ∑_{i=1}^k αi D(νi, ν′i) = min_{i>1} inf_{ν′∈Ei} (α1 D(ν1, ν′1) + αi D(νi, ν′i))
      = min_{i>1} inf_{µ∈ℝ} (α1(µ1(ν) − µ)²/(2σ1²) + αi(µi(ν) − µ)²/(2σi²))
      = (1/2) min_{i>1} α1αi∆i²/(α1σi² + αiσ1²) .

(b) Let α1 = α so that α2 = 1 − α. By the previous part,

    (c*(ν))⁻¹ = max_{α∈[0,1]} inf_{ν′∈Ealt(ν)} ∑_{i=1}^k αi D(νi, ν′i) = max_{α∈[0,1]} α(1 − α)∆2²/(2(ασ2² + (1 − α)σ1²)) = ∆2²/(2(σ1 + σ2)²) ,

where the maximum is attained at α = σ1/(σ1 + σ2).
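The closed form in Part (b), together with the maximiser α = σ1/(σ1 + σ2), can be confirmed by a grid search (the parameter values below are arbitrary):

```python
import math

sigma1, sigma2, Delta = 1.3, 0.7, 0.9

def objective(alpha):
    # (1/2) * alpha*(1-alpha)*Delta^2 / (alpha*sigma2^2 + (1-alpha)*sigma1^2)
    return 0.5 * alpha * (1 - alpha) * Delta ** 2 / (alpha * sigma2 ** 2 + (1 - alpha) * sigma1 ** 2)

best = max(objective(i / 100_000) for i in range(1, 100_000))
closed_form = Delta ** 2 / (2 * (sigma1 + sigma2) ** 2)
print(best, closed_form)
```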

(c) By the result in Exercise 33.3 and Part (a) of this exercise,

    c*(ν) = inf{‖α‖₁ : α ∈ [0,∞)ᵏ, inf_{ν′∈Ealt(ν)} ∑_{i=1}^k αi D(νi, ν′i) = 1}
      = inf{‖α‖₁ : α ∈ [0,∞)ᵏ, min_{i>1} α1αi∆i²/(2α1σi² + 2αiσ1²) = 1} .

Let α1 = 2aσ1²/∆min², which by the constraint that α ≥ 0 must satisfy a > 1. Solving the constraint for the remaining coordinates gives αi = 2α1σi²/(α1∆i² − 2σ1²), so that

    c*(ν) = inf_{α1>2σ1²/∆min²} (α1 + ∑_{i=2}^k 2α1σi²/(α1∆i² − 2σ1²))
      ≤ inf_{a>1} (2aσ1²/∆min² + (a/(a − 1)) ∑_{i=2}^k 2σi²/∆i²)
      = (√(2σ1²/∆min²) + √(∑_{i=2}^k 2σi²/∆i²))²
      = 2σ1²/∆min² + ∑_{i=2}^k 2σi²/∆i² + (4σ1/∆min)√(∑_{i=2}^k σi²/∆i²) .
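The optimisation over a used in the last step, inf_{a>1}(Ba + aC/(a − 1)) = (√B + √C)², is easy to confirm numerically (B and C below are arbitrary positive stand-ins for 2σ1²/∆min² and ∑_{i≥2} 2σi²/∆i²):

```python
import math

B, C = 0.8, 2.5   # arbitrary positive constants

def h(a):
    return B * a + a / (a - 1) * C

best = min(h(1 + i / 100_000) for i in range(1, 500_000))   # a in (1, 6)
closed_form = (math.sqrt(B) + math.sqrt(C)) ** 2
print(best, closed_form)
```

The minimiser is a = 1 + √(C/B), which lies well inside the scanned interval for these values.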


(d) From the previous part,

    c*(ν) = inf{‖α‖₁ : α ∈ [0,∞)ᵏ, min_{i>1} α1αi∆i²/(2α1σi² + 2αiσ1²) = 1} .

Let α1 = 2aσ1²/∆min² with a > 1. Then

    c*(ν) = inf_{α1>2σ1²/∆min²} (α1 + ∑_{i=2}^k 2α1σi²/(α1∆i² − 2σ1²))
      ≤ inf_{a>1} (2aσ1²/∆min² + (a/(a − 1)) ∑_{i=2}^k 2σi²/∆i²)
      = (√(2σ1²/∆min²) + √(∑_{i=2}^k 2σi²/∆i²))²
      = 2σ1²/∆min² + ∑_{i=2}^k 2σi²/∆i² + (4σ1/∆min)√(∑_{i=2}^k σi²/∆i²) .

(e) Notice that the inequality in the previous part is now an equality.

33.5

(a) Let ν ∈ E be an arbitrary Gaussian bandit with µ1(ν) > max_{i>1} µi(ν) and assume that

    lim inf_{n→∞} −log(Pνπ(∆_{An+1} > 0))/log(n) > 1 + ε . (33.1)

Notice that if Eq. (33.1) were not true then we would be done. Then let ν′ be a Gaussian bandit in Ealt(ν) with µ(ν′) = µ(ν) except that µi(ν′) = µi(ν) + ∆i(ν)(1 + δ), where i > 1 and δ = √(1 + ε) − 1. By Theorem 14.2 and Lemma 15.1,

    Pνπ(An+1 ≠ 1) + Pν′π(An+1 ≠ i) ≥ (1/2) exp(−D(Pνπ, Pν′π))
      ≥ (1/2) exp(−(1 + δ)²∆i(ν)²Eνπ[Ti(n)]/2)
      = (1/2) exp(−(1 + ε)∆i(ν)²Eνπ[Ti(n)]/2) .

Because π is asymptotically optimal, lim_{n→∞} Eνπ[Ti(n)]/log(n) = 2/∆i(ν)² and hence

    Pνπ(An+1 ≠ 1) + Pν′π(An+1 ≠ i) ≥ (1/2)(1/n)^(1+ε+εn) ,


where lim_{n→∞} εn = 0. Using Eq. (33.1) shows that

    lim inf_{n→∞} n^(1+ε+εn) Pν′π(An+1 ≠ i) > 0 ,

which implies that

    lim inf_{n→∞} −log(Pν′π(An+1 ≠ i))/log(n) ≤ lim sup_{n→∞} −log(Pν′π(An+1 ≠ i))/log(n) ≤ 1 + ε .

(b) No. Consider the algorithm that plays UCB on rounds t ∉ {2^(k²) : k ∈ ℕ} and otherwise plays round-robin.

(c) The same argument as Part (a) shows there exists a ν ∈ E with a unique optimal arm such that

    lim inf_{n→∞} −log(Pνπ(An+1 ∉ i*(ν)))/log(n) = O(1) ,

which means the probability of selecting a suboptimal arm decays only polynomially with n.

33.6

(a) Assume without loss of generality that arm 1 is the unique optimal arm in ν. By the work in Part (a) of Exercise 33.4, α*(ν) = argmax_{α∈Pk−1} Φ(ν, α) with

    Φ(ν, α) = (1/2) inf_{ν′∈Ealt(ν)} ∑_{i=1}^k αi(µi(ν) − µi(ν′))² = (1/2) min_{i>1} α1αi∆i²/(α1 + αi) = (1/2) min_{i>1} fi(α1, αi) ,

where Φ(ν, α) = 0 if αi = 0 for any i and the last equality serves as the definition of fi. The function Φ(ν, ·) is the minimum of a collection of concave functions and hence concave. Abbreviate α* = α*(ν) and notice that α* must equalize the functions (fi) so that fi(α1*, αi*) is constant for i > 1. Hence, for all i > 1,

    (1/2) α1*αi*∆i²/(α1* + αi*) = max_{α∈Pk−1} Φ(ν, α) = Φ(ν) .

Rearranging shows that

    αi* = 2α1*Φ(ν)/(∆i²α1* − 2Φ(ν)) .

Therefore

    α1* + ∑_{i=2}^k 2α1*Φ(ν)/(∆i²α1* − 2Φ(ν)) = 1 .


The solutions to this equation are the roots of a polynomial and by the fundamental theorem of algebra, either this polynomial is zero or there are finitely many roots. Since the former is clearly not true, we conclude there are at most finitely many maximisers. Yet concavity of the objective means that the number of maximisers is either one or infinite. Therefore there is a unique maximiser.
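The equalisation property of the optimal allocation can be illustrated numerically for k = 3 unit-variance Gaussian arms with hypothetical gaps ∆2 and ∆3: at (a grid approximation of) the maximiser of Φ(ν, ·), the two functions f2 and f3 are nearly equal.

```python
d2, d3 = 1.0, 2.0        # hypothetical gaps for arms 2 and 3

def f(a1, ai, d):
    # f_i(a1, ai) = a1 * ai * Delta_i^2 / (a1 + ai), with the 0-allocation convention
    return 0.0 if a1 <= 0 or ai <= 0 else a1 * ai * d * d / (a1 + ai)

steps = 400
best, best_alpha = -1.0, None
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        a1, a2 = i / steps, j / steps
        a3 = 1.0 - a1 - a2
        val = 0.5 * min(f(a1, a2, d2), f(a1, a3, d3))
        if val > best:
            best, best_alpha = val, (a1, a2, a3)

a1, a2, a3 = best_alpha
f2, f3 = f(a1, a2, d2), f(a1, a3, d3)
print(best_alpha, f2, f3)
```

The printed f2 and f3 agree up to grid resolution, matching the equalisation argument.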

(b) Notice that i*(ξ) = i*(ν) whenever d(ξ, ν) is sufficiently small. Hence, by the previous part, the function Φ(·, ·) is continuous at (ν, α) for any α. Suppose that α*(·) is not continuous at ν. Then there exists a sequence (νn)_{n=1}^∞ with lim_{n→∞} d(νn, ν) = 0 and for which lim inf_{n→∞} ‖α*(ν) − α*(νn)‖∞ > 0. By compactness of Pk−1, the sequence (α*(νn))n has a cluster point α*∞, which by assumption must satisfy α*(ν) ≠ α*∞. And yet, taking limits along an appropriate subsequence, Φ(ν, α*(ν)) = lim_{n→∞} Φ(νn, α*(ν)) ≤ lim_{n→∞} Φ(νn, α*(νn)) = Φ(ν, α*∞). Therefore by Part (a), α*(ν) = α*∞, which is a contradiction.

(c) We'll be a little lackadaisical about constants here. Define the random variable

    Λ = min{λ ≥ 1 : d(ν̂t, ν) ≤ √(2 log(2λkt(t + 1))/min_i Ti(t)) for all t} ,

which by the usual concentration analysis and union bounding satisfies P(Λ ≥ x) ≤ 1/x. Therefore

    E[log(Λ)²] = ∫₀^∞ P(Λ ≥ exp(x^(1/2))) dx ≤ ∫₀^∞ exp(−x^(1/2)) dx = 2 .

By the definition of Λ,

    τν(ε) ≤ 1 + max{t : √(2 log(Λkt(t + 1))/min_i Ti(t)) > ε} .

The forced exploration in the algorithm means that Ti(t) = Ω(√t) almost surely and hence

    E[τν(ε)] = O(E[log(Λ)²]) = O(1) .

(d) Let w(ε) = sup{x : d(ω, ν) ≤ x ⟹ ‖α*(ν) − α*(ω)‖∞ ≤ ε}, which by (b) satisfies w(ε) > 0 for all ε > 0. Hence E[τα(ε)] ≤ E[τν(w(ε))] < ∞.

(e) By the definition of the algorithm, At = i implies that either Ti(t − 1) ≤ √t or At = argmax_i α*_i(ν̂t−1) − Ti(t − 1)/(t − 1). Now suppose that

    t ≥ max{2kτα(ε/(2k))/ε, 16k²/ε²} .


Then the definition of the algorithm implies that

    Ti(t) ≤ max{Ti(τα(ε/(2k))), 1 + t(α*_i(ν) + ε/(2k)), 1 + √t} ≤ t(α*_i(ν) + ε/k) .

Furthermore, since ∑_{i=1}^k Ti(t) = t,

    Ti(t) ≥ t − ∑_{j≠i} Tj(t) ≥ t − ∑_{j≠i} t(α*_j(ν) + ε/k) ≥ t(α*_i(ν) − ε) .

And the result follows from the previous part, which ensures that

    E[max{2kτα(ε/(2k))/ε, 16k²/ε²}] < ∞ .

(f) Given ε > 0 let τβ(ε) = 1 + max{t : tΦ(ν, α*(ν)) < βt(δ) + εt} and

    u(ε) = sup{|Φ(ν, α*(ν)) − Φ(ω, α)| : d(ω, ν) ≤ ε, ‖α − α*(ν)‖∞ ≤ ε} .

Then for t ≥ max{τν(ε), τT(ε), τβ(u(ε))} it holds that

    tZt = tΦ(ν̂t, T(t)/t) ≥ t(Φ(ν, α*(ν)) − u(ε)) ≥ βt(δ) ,

which implies that

    τ ≤ max{τν(ε), τT(ε), τβ(u(ε))} ≤ τν(ε) + τT(ε) + τβ(u(ε)) .

Taking the expectation,

    E[τ] ≤ E[τν(ε)] + E[τT(ε)] + E[τβ(u(ε))] .

Taking the limit as δ → 0 and using the previous parts shows that for any sufficiently small ε > 0,

    lim sup_{δ→0} E[τ]/log(1/δ) ≤ lim sup_{δ→0} E[τβ(u(ε))]/log(1/δ) = 1/(Φ(ν, α*(ν)) − u(ε)) .

Continuity of Φ(·, ·) at (ν, α*(ν)) ensures that lim_{ε→0} u(ε) = 0 and the result follows since c*(ν) = 1/Φ(ν, α*(ν)). Note that taking the limit as δ → 0 only works because the policy does not depend on δ. Hence the expectations of τν(ε), τα(ε) and τT(ε) do not depend on δ.

33.7


(a) Recalling the definitions,

    H1(µ) = ∑_{i=1}^k min{1/∆min², 1/∆i²}  and  H2(µ) = max_{i:∆i>0} i/∆i² ,

where ∆1 ≤ ∆2 ≤ · · · ≤ ∆k. Now, for any i ∈ [k] with ∆i > 0,

    H1(µ) ≥ ∑_{j=1}^i min{1/∆min², 1/∆j²} ≥ ∑_{j=1}^i min{1/∆min², 1/∆i²} ≥ i/∆i² .

Therefore H1(µ) ≥ H2(µ). For the second inequality, let imin = min{i : ∆i > 0}. Then

    H1(µ) = imin/∆min² + ∑_{i=imin+1}^k (1/i)(i/∆i²) ≤ (1 + ∑_{i=imin+1}^k 1/i) H2(µ) ≤ (1 + log(k))H2(µ) .

The result follows because imin > 1 and because ∑_{i=3}^k 1/i ≤ log(k).

(b) When ∆2 = · · · = ∆k > 0 it holds that H1(µ) = H2(µ). For the other direction let ∆i = √i for i ≥ 2, so that i/∆i² = 1 = H2(µ) and

    H1(µ) = 1 + ∑_{i=3}^k 1/i = Ω(log(k)) = Ω(log(k)) H2(µ) ,

showing that the logarithmic factor cannot be removed.
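The two inequalities H2(µ) ≤ H1(µ) ≤ (1 + log(k))H2(µ) (with a single optimal arm, whose H1 term is 1/∆min²) can be stress-tested on random gap vectors:

```python
import math, random

random.seed(2)
for _ in range(100):
    k = random.randint(2, 20)
    gaps = sorted(random.uniform(0.1, 3.0) for _ in range(k - 1))   # Delta_2 <= ... <= Delta_k
    dmin = gaps[0]
    # H1 includes a 1/dmin^2 term for the (unique) optimal arm with gap 0
    H1 = 1 / dmin ** 2 + sum(min(1 / dmin ** 2, 1 / d ** 2) for d in gaps)
    H2 = max((i + 2) / gaps[i] ** 2 for i in range(k - 1))          # arm index i+2 in sorted order
    assert H2 <= H1 <= (1 + math.log(k)) * H2 + 1e-9
print("bounds hold")
```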

33.9 We have P(max_{i∈[n]} µ(Xi) < µ*α) = P(µ(X1) < µ*α)ⁿ ≤ (1 − α)ⁿ ≤ δ. Solving the last inequality for n gives the required inequality.
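Solving (1 − α)ⁿ ≤ δ for n gives n ≥ log(1/δ)/log(1/(1 − α)), and since log(1/(1 − α)) ≥ α the simpler n ≥ log(1/δ)/α also suffices. A small sketch with arbitrary α and δ:

```python
import math

alpha, delta = 0.05, 0.01
n_exact = math.ceil(math.log(1 / delta) / math.log(1 / (1 - alpha)))
n_simple = math.ceil(math.log(1 / delta) / alpha)
print(n_exact, n_simple)            # the simpler bound is slightly larger
assert (1 - alpha) ** n_exact <= delta
assert n_simple >= n_exact          # log(1/(1-alpha)) >= alpha
```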

Chapter 34 Foundations of Bayesian Learning

34.4

(a) By the 'sections' lemma (Lemma 1.26 in Kallenberg [2002]), d(x) = ∫_Θ pψ(x)q(ψ) dν(ψ) is H-measurable.


Therefore N = d⁻¹({0}) ∈ H is measurable. Then

    0 = ∫_N ∫_Θ pψ(x)q(ψ) dν(ψ) dµ(x) = ∫_Θ ∫_N pψ(x) dµ(x) q(ψ) dν(ψ) = ∫_Θ Pψ(N) dν(ψ) = P(X ∈ N) = PX(N) ,

where the first equality follows from the definition of N. The second is Fubini's theorem, the third by the definition of the Radon-Nikodym derivative, the fourth by the definition of P and the last by the definition of PX.

(b) Note that q(θ | x) = pθ(x)q(θ)/d(x), which is jointly measurable in θ and x. The fact that Q(A | x) is a probability measure for all x is straightforward from the definition of expectation and because for x ∉ N,

    Q(Θ | x) = ∫_Θ q(θ | x) dν(θ) = ∫_Θ pθ(x)q(θ) dν(θ) / ∫_Θ pψ(x)q(ψ) dν(ψ) = 1 .

That Q(A | ·) is H-measurable follows from the sections lemma and the fact that N ∈ H. Let A ∈ G and B ∈ σ(X) ⊆ F, which can be written as B = Θ × C for some C ∈ H. Then,

    ∫_B Q(A | X(ω)) dP(ω) = ∫_B ∫_A q(θ | X(ω)) dν(θ) dP(ω)
      = ∫_Θ ∫_C ∫_A q(θ′ | x) dν(θ′) pθ(x)q(θ) dµ(x) dν(θ)
      = ∫_C d(x) ∫_A q(θ | x) dν(θ) dµ(x)
      = ∫_C ∫_A pθ(x)q(θ) dν(θ) dµ(x)
      = ∫_A Pθ(C)q(θ) dν(θ)
      = P(θ ∈ A, X ∈ C)
      = ∫_B I_A(θ) dP ,

which is the required averaging property.

34.5 Abbreviate pθ(x) = (dPθ/dh)(x).

(a) Clearly pθ(x) ≥ 0. By definition, for B ∈ B(ℝ), Pθ(B) = ∫_B pθ(x) dh(x). Hence Pθ(B) ≥ 0.


Furthermore,

    Pθ(ℝ) = ∫_ℝ exp(θT(x) − A(θ)) dh(x) = exp(−A(θ)) ∫_ℝ exp(θT(x)) dh(x) = 1 .

Additivity is immediate since ∫_B f dh + ∫_C f dh = ∫_{B∪C} f dh for disjoint B, C.

(b) Using the chain rule and passing the derivative under the integral yields the result:

    A′(θ) = (d/dθ) ∫_ℝ exp(θT(x)) dh(x) / ∫_ℝ exp(θT(x)) dh(x)
      = ∫_ℝ T(x) exp(θT(x)) dh(x) / ∫_ℝ exp(θT(x)) dh(x)
      = ∫_ℝ T(x) exp(θT(x) − A(θ)) dh(x)
      = ∫_ℝ T(x)pθ(x) dh(x)
      = Eθ[T] .

In order to justify the exchange of integral and derivative use the identity that for all sufficiently small ε > 0 and all a > 0,

    a ≤ (exp(aε) + exp(−aε))/ε .

Hence for θ ∈ int(dom(A)) there exists a neighborhood N of θ such that for all ψ ∈ N,

    |T(x)| exp(ψT(x)) ≤ φ(x) = (exp((θ + ε)T(x)) + exp((θ − ε)T(x)))/ε .

Since φ is integrable for sufficiently small ε it follows by the dominated convergence theorem that the derivative and integral can be exchanged.

(c) When X ∼ Pθ we have

    E[exp(λT(X))] = ∫_ℝ exp(λT(x)) (dPθ/dh)(x) dh(x)
      = ∫_ℝ exp(λT(x)) exp(θT(x) − A(θ)) dh(x)
      = exp(−A(θ)) ∫_ℝ exp((θ + λ)T(x)) dh(x)
      = exp(−A(θ)) exp(A(θ + λ))
      = exp(A(θ + λ) − A(θ)) .


(d) This is another straightforward calculation:

    d(θ, θ′) = ∫_ℝ (θT(x) − A(θ) − θ′T(x) + A(θ′)) exp(θT(x) − A(θ)) dh(x)
      = A(θ′) − A(θ) + (θ − θ′) ∫_ℝ T(x) exp(θT(x) − A(θ)) dh(x)
      = A(θ′) − A(θ) − (θ′ − θ)A′(θ) .
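Parts (b)–(d) can be checked concretely for the Bernoulli family in its natural parametrisation, where h is the counting measure on {0, 1}, T(x) = x and A(θ) = log(1 + e^θ):

```python
import math

def A(theta):            # log-partition function for Bernoulli
    return math.log(1 + math.exp(theta))

def mean(theta):         # A'(theta) = E_theta[T], the sigmoid of theta
    return 1 / (1 + math.exp(-theta))

theta, theta2 = 0.4, -0.9
# (b): A'(theta) via central finite differences
h = 1e-6
assert abs((A(theta + h) - A(theta - h)) / (2 * h) - mean(theta)) < 1e-8
# (c): E[exp(lambda * T)] = exp(A(theta + lambda) - A(theta))
lam, p = 0.7, mean(theta)
mgf = (1 - p) + p * math.exp(lam)
assert abs(mgf - math.exp(A(theta + lam) - A(theta))) < 1e-9
# (d): d(theta, theta2) = A(theta2) - A(theta) - (theta2 - theta) * A'(theta) equals the KL divergence
p2 = mean(theta2)
kl = p * math.log(p / p2) + (1 - p) * math.log((1 - p) / (1 - p2))
assert abs(kl - (A(theta2) - A(theta) - (theta2 - theta) * mean(theta))) < 1e-9
print("identities verified")
```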

(e) The Cramér-Chernoff method is the solution. Write T̄ = (1/n)∑_{t=1}^n T(Xt) and let λ = n(θ′ − θ) with θ′ > θ. Then

    Pθ(T̄ ≥ Eθ′[T]) = Pθ(exp(λT̄) ≥ exp(λEθ′[T]))
      ≤ Eθ[exp(λT̄)] exp(−λEθ′[T])
      = ∏_{t=1}^n Eθ[exp(λT(Xt)/n)] exp(−λA′(θ′))
      = exp(n(A(θ + λ/n) − A(θ)) − λA′(θ′))
      = exp(n(A(θ′) − A(θ) − (θ′ − θ)A′(θ′)))
      = exp(−n d(θ′, θ)) .

A symmetric calculation shows that for θ′ < θ,

    Pθ(T̄ ≤ Eθ′[T]) ≤ exp(−n d(θ′, θ)) .
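Part (e) for the Bernoulli family: the sketch below compares the exact binomial tail Pθ(T̄ ≥ Eθ′[T]) with the bound exp(−n d(θ′, θ)) (the parameter values are arbitrary):

```python
import math

def A(t): return math.log(1 + math.exp(t))
def mu(t): return 1 / (1 + math.exp(-t))

def d(t, t2):            # relative entropy from part (d)
    return A(t2) - A(t) - (t2 - t) * mu(t)

theta, theta2, n = 0.0, 0.8, 50          # theta2 > theta
p, thresh = mu(theta), mu(theta2)
# exact P_theta(mean of n Bernoulli samples >= mu(theta2))
tail = sum(math.comb(n, s) * p ** s * (1 - p) ** (n - s)
           for s in range(math.ceil(n * thresh), n + 1))
bound = math.exp(-n * d(theta2, theta))
print(tail, bound)
assert tail <= bound
```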

34.13 Let π* be as in the problem definition. Let S̄ be the extension of S obtained by adding rays in the positive direction: S̄ = {x + u : x ∈ S, u ≥ 0}. Clearly S̄ remains convex and the Pareto frontier satisfies λ(S) ⊆ ∂S̄ (see figure). Let x = ℓ(π*) ∈ λ(S). By the supporting hyperplane theorem and the convexity of S̄ there exists a nonzero vector a ∈ ℝᴺ and b ∈ ℝ such that 〈a, ℓ(π*)〉 = b and 〈a, y〉 ≥ b for all y ∈ S̄. Furthermore, a ≥ 0 since x + ei ∈ S̄ and so 〈x + ei, a〉 = b + ai ≥ b. Define q(νi) = ai/‖a‖₁. Then, for any policy π,

    ∑_{ν∈E} q(ν)ℓ(π, ν) = (1/‖a‖₁) ∑_{i=1}^N ai ℓ(π, νi) = 〈a, ℓ(π)〉/‖a‖₁ ≥ b/‖a‖₁ ,

with equality for any policy π with ℓ(π) = ℓ(π*). Since a is nonnegative and a ≠ 0, q ∈ P(E), finishing the proof.


34.14

(a) Suppose that π is not admissible. Then there exists another policy π′ ≠ π with ℓ(π′, ν) ≤ ℓ(π, ν) for all ν ∈ E. Clearly π′ is also Bayesian optimal. But π was unique, which is a contradiction.

(b) Suppose that Π = {π1, π2} and E = {ν1, ν2} with ℓ(π, ν1) = 0 for all π and ℓ(πi, ν2) = I{i = 2}. Then any policy is Bayesian optimal for Q = δν1, but π2 is dominated by π1.
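The counterexample in (b) is small enough to write out explicitly (the policy and environment names are just labels):

```python
# Two policies, two environments: l(pi, nu1) = 0 for both, l(pi1, nu2) = 0, l(pi2, nu2) = 1.
loss = {("pi1", "nu1"): 0.0, ("pi1", "nu2"): 0.0,
        ("pi2", "nu1"): 0.0, ("pi2", "nu2"): 1.0}
prior = {"nu1": 1.0, "nu2": 0.0}                 # Q = delta_{nu1}

def bayes_risk(pi):
    return sum(prior[nu] * loss[(pi, nu)] for nu in prior)

# both policies are Bayesian optimal for this prior...
assert bayes_risk("pi1") == bayes_risk("pi2") == 0.0
# ...yet pi2 is dominated by pi1
assert all(loss[("pi1", nu)] <= loss[("pi2", nu)] for nu in prior)
assert any(loss[("pi1", nu)] < loss[("pi2", nu)] for nu in prior)
print("pi2 is Bayesian optimal but dominated")
```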

(c) Suppose that π is not admissible. Then there exists another policy π′ with ℓ(π′, ν) ≤ ℓ(π, ν) for all ν ∈ E and ℓ(π′, ν) < ℓ(π, ν) for at least one ν ∈ E. Then

    ∫_E ℓ(π, ν) dQ(ν) = ∑_{ν∈E} Q({ν})ℓ(π, ν) > ∑_{ν∈E} Q({ν})ℓ(π′, ν) = ∫_E ℓ(π′, ν) dQ(ν) ,

which is a contradiction.

(d) Repeat the previous solution with the restriction to the support.

34.15 Let Π be the set of all policies and ΠD = {e1, . . . , eN} the set of all deterministic policies, which is finite. A policy π ∈ Π can be viewed as a probability measure on (ΠD, B(ΠD)), which is the essence of Kuhn's theorem on the equivalence of behavioral and mixed strategies in extensive form games. Note that since ΠD is finite, probability measures on (ΠD, B(ΠD)) can be viewed as distributions in PN−1. In this way Π inherits a metric and topology from PN−1. Even more straightforwardly, E is identified with [0, 1]ᵏ and inherits a metric from that space. As metric spaces both Π and E are compact and the regret Rn(π, ν) is continuous in both arguments by Exercise 14.4. Let (νj)_{j=1}^∞ be a sequence of bandit environments that is dense in E and Ej = {ν1, . . . , νj}. Using the notation of the previous exercise, let Rn,j(π) = (Rn(π, ν1), . . . , Rn(π, νj)), Sj = Rn,j(Π) ⊂ ℝʲ and let λ(Sj) be the Pareto frontier of Sj. Note that Sj is non-empty, closed and convex. Thus, λ(Sj) ⊂ Sj. Now let πadm ∈ Π be an admissible policy. Then Rn,j(πadm) ∈ λ(Sj) and by the result of the previous exercise there exists a distribution Qj ∈ P(E) supported on Ej such that BRn(πadm, Qj) ≤ min_π BRn(π, Qj). Let Q be the space of probability measures on (E, B(E)), which is compact with the weak* topology by Theorem 2.14. Hence (Qj)_{j=1}^∞ contains a subsequence (Qjᵢ)i converging to some Q. Notice that ν ↦ Rn(π, ν) is a continuous function from E to [0, n]. Therefore by the definition of the weak* topology, for any policy π,

    lim_{i→∞} BRn(π, Qjᵢ) = lim_{i→∞} ∫_E Rn(π, ν) dQjᵢ(ν) = BRn(π, Q) .

Hence BRn(πadm, Q) ≤ min_π BRn(π, Q).

34.16 Clearly

    max_{Q∈Q} BR*n(Q) ≤ R*n(E) .

In the remainder we show the other direction. Since [0, 1]ᵏ is compact with the usual topology, Theorem 2.14 shows that Q is compact with the weak* topology. Let Π be the space of all policies


with the discrete topology and

    P = {∑_{π∈A} p(π)δπ : p ∈ P(A) and A ⊂ Π is finite} ,

which is a convex subspace of the topological vector space of all signed measures on (Π, 2^Π) with the weak* topology. Let L : P × Q → [0, n] be defined by

    L(S, Q) = −∫_Π ∫_E Rn(π, ν) dQ(ν) dS(π) = −∫_E Rn(πS, ν) dQ(ν) ,

where πS is a policy such that PνπS = ∫_Π Pνπ dS(π), which is defined in Exercise 4.4. The regret is bounded in [0, n] and the discrete topology on Π means that all functions from Π to ℝ are continuous, including π ↦ ∫_E Rn(π, ν)Q(dν). By the definition of the weak* topology on P it holds that L(·, Q) is continuous in its first argument for all Q. The integral over Π with respect to S ∈ P is a finite sum and ν ↦ Rn(π, ν) is continuous for all π by the result in Exercise 14.4. Therefore L is continuous and linear in both arguments. By Sion's theorem (Theorem 28.12),

    −max_{Q∈Q} BR*n(Q) = min_{Q∈Q} sup_{S∈P} L(S, Q) = sup_{S∈P} min_{Q∈Q} L(S, Q) = −inf_{S∈P} R*n(πS, E) .

Therefore

    max_{Q∈Q} BR*n(Q) = inf_{S∈P} R*n(πS, E) = inf_{π∈Π} R*n(π, E) = R*n(E) .

Hence R*n(E) = max_{Q∈Q} BR*n(Q).

Chapter 35 Bayesian Bandits

35.1 Let π be the policy of MOSS from Chapter 9, which for any 1-subgaussian bandit ν with rewards in [0, 1] satisfies

    Rn(π, ν) ≤ C min{√(kn), k log(n)/∆min(ν)} ,


where ∆min(ν) is the smallest positive suboptimality gap. Let En be the set of bandits in E for which there exists an arm i with ∆i ∈ (0, n^(−1/4)). Then, for C′ = Ck,

    BR*n(Q) ≤ BRn(π, Q) = ∫_E Rn(π, ν) dQ(ν) = ∫_{En} Rn(π, ν) dQ(ν) + ∫_{Enᶜ} Rn(π, ν) dQ(ν)
      ≤ C′√n Q(En) + C′ ∫_{Enᶜ} n^(1/4) log(n) dQ(ν) = C′√n Q(En) + o(√n) .

The first part follows since ∩n En = ∅ and thus lim_{n→∞} Q(En) = 0 for any measure Q. For the second part we describe roughly what needs to be done. The idea is to make use of the minimax lower bound technique in Exercise 15.2, which shows that for a uniform prior concentrated on a finite set of k bandits the regret is Ω(√(kn)). The only problems are that (a) the rewards were assumed to be Gaussian and (b) the prior depends on n. The first issue is corrected by replacing the Gaussian distributions with Bernoulli distributions with means close to 1/2. For the second issue you should compute this prior for n ∈ {1, 2, 4, 8, . . .} and denote the resulting priors by Q1, Q2, . . .. Then let Q = ∑_{j=1}^∞ pjQj, where pj ∝ (j log²(j))⁻¹. The result follows easily.

35.2 Recall that En = Un and for t < n,

    Et = max{Ut, E[Et+1 | Ft]} .

Integrability of (Ut)_{t=1}^n ensures that (Et)_{t=1}^n are integrable. By definition Et ≥ E[Et+1 | Ft]. Hence (Et)_{t=1}^n is a supermartingale adapted to F. Hence for any stopping time κ ∈ R1ⁿ the optional stopping theorem says that

    E[Uκ] ≤ E[Eκ] ≤ E1 .

On the other hand, for τ satisfying the requirements of the lemma the process Mt = Et∧τ is a martingale and hence E[Uτ] = E[Mτ] = M1 = E1.

35.3 Define vⁿ(x) = sup_{τ∈R1ⁿ} Ex[Uτ]. By assumption, Ex[|u(St)|] < ∞ for all x ∈ S and t ∈ [n]. Therefore by Theorem 1.7 of Peskir and Shiryaev [2006],

    vⁿ(x) = max{u(x), ∫_S v^(n−1)(y)Px(dy)} . (35.1)

Recall that v(x) = sup_τ Ex[Uτ]. Clearly vⁿ(x) ≤ v(x) for all x ∈ S. Let τ be an arbitrary stopping time. Then

    vⁿ(x) ≥ Ex[Uτ∧n] = Ex[Uτ] + Ex[(Un − Uτ)I{τ ≥ n}] .

Since by assumption sup_n |Un| is Px-integrable for all x, the dominated convergence theorem shows


that
\[
\lim_{n\to\infty} \mathbb{E}_x[(U_n - U_\tau)\mathbb{I}\{\tau \ge n\}]
= \mathbb{E}_x\Big[\lim_{n\to\infty} (U_n - U_\tau)\mathbb{I}\{\tau \ge n\}\Big] = 0\,,
\]
where the second equality follows because $U_\infty = \lim_{n\to\infty} U_n$ exists $P_x$-almost surely by assumption. Therefore $\lim_{n\to\infty} v_n(x) = v(x)$. Since convergence is monotone, it follows that $v$ is measurable. Taking limits in Eq. (35.1) shows that

\[
\begin{aligned}
v(x) &= \lim_{n\to\infty} \max\Big\{u(x), \int_{\mathcal{S}} v_{n-1}(y)\,P_x(dy)\Big\} \\
&= \max\Big\{u(x), \lim_{n\to\infty} \int_{\mathcal{S}} v_{n-1}(y)\,P_x(dy)\Big\} \\
&= \max\Big\{u(x), \int_{\mathcal{S}} v(y)\,P_x(dy)\Big\}\,,
\end{aligned}
\]
where the last equality follows from the monotone convergence theorem. Next, let $V_n = v(S_n)$. Note that $\lim_{n\to\infty} \mathbb{E}_x[U_n] = \mathbb{E}_x[\lim_{n\to\infty} U_n] = \mathbb{E}_x[U_\infty]$, where the exchange of the limit and expectation is justified by the dominated convergence theorem because $\sup_n |U_n|$ is $P_x$-integrable. By definition, $V_n \ge U_n$. Hence,

\[
\begin{aligned}
\lim_{n\to\infty} \mathbb{E}_x[|V_n - U_n|]
&= \lim_{n\to\infty} \mathbb{E}_x[V_n] - \lim_{n\to\infty} \mathbb{E}_x[U_n] \\
&= \lim_{n\to\infty} \mathbb{E}_x[V_n - U_\infty] \\
&\le \lim_{n\to\infty} \mathbb{E}_x\Big[\mathbb{E}_x\Big[\sup_{t \ge n} U_t \,\Big|\, S_n\Big] - U_\infty\Big] \\
&= \lim_{n\to\infty} \mathbb{E}_x\Big[\sup_{t \ge n} U_t - U_\infty\Big] \\
&= \mathbb{E}_x\Big[\lim_{n\to\infty} \sup_{t \ge n} U_t - U_\infty\Big] \\
&= 0\,,
\end{aligned}
\]
where the exchange of limit and expectation is again justified by the dominated convergence theorem and the assumption that $\sup_n |U_n|$ is $P_x$-integrable. Therefore $V_\infty = \lim_{n\to\infty} V_n = U_\infty$ $P_x$-a.s. Then
\[
\mathbb{E}_x[V_{n+1} \mid S_n] = \int_{\mathcal{S}} v(y)\,P_{S_n}(dy) \le V_n \quad \text{a.s.}
\]

Therefore $(V_n)_{n=1}^\infty$ is a supermartingale, which means that for any stopping time $\kappa$,
\[
\mathbb{E}_x[U_\kappa] = \mathbb{E}_x\Big[\lim_{n\to\infty} U_{\kappa \wedge n}\Big]
= \lim_{n\to\infty} \mathbb{E}_x[U_{\kappa \wedge n}]
\le \lim_{n\to\infty} \mathbb{E}_x[V_{\kappa \wedge n}] \le v(x)\,,
\]
where the exchange of limits and expectation is justified by the dominated convergence theorem and the fact that $U_{\kappa \wedge n} \le \sup_n U_n$, which is $P_x$-integrable by assumption. Consider a stopping time $\tau$ satisfying the conditions of Theorem 35.3. Then $(V_{n \wedge \tau})_{n=1}^\infty$ is a martingale and using the same


argument as before we have
\[
\mathbb{E}_x[U_\tau] = \mathbb{E}_x[V_\tau] = \mathbb{E}_x\Big[\lim_{n\to\infty} V_{\tau \wedge n}\Big]
= \lim_{n\to\infty} \mathbb{E}_x[V_{\tau \wedge n}] = v(x)\,,
\]
where the first equality follows from the assumption on $\tau$ that on the event $\{\tau < \infty\}$, $U_\tau = V_\tau$, and the fact that $V_\infty = U_\infty$ $P_x$-a.s.
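The optimal stopping facts used in 35.2 and 35.3 can be sanity-checked numerically. The sketch below is a minimal example under made-up assumptions (a three-round horizon, fair coin flips and the reward $U_t = \max(0, \text{partial sum of flips})$): it computes the Snell envelope value $E_1$ by the backward recursion and then brute-forces every stopping time, confirming that $\max_\tau \mathbb{E}[U_\tau] = E_1$.

```python
from itertools import product

# Made-up example: n = 3 rounds, fair coin flips in {-1, +1} generate the
# filtration, and stopping at time t pays U_t = max(0, sum of first t-1 flips).
n = 3
p = 0.5

def U(prefix):
    return max(0, sum(prefix))

# Snell envelope: E_n = U_n and E_t = max(U_t, E[E_{t+1} | F_t]) for t < n.
def snell(prefix):
    t = len(prefix) + 1
    if t == n:
        return U(prefix)
    cont = p * snell(prefix + (1,)) + (1 - p) * snell(prefix + (-1,))
    return max(U(prefix), cont)

E1 = snell(())

# A stopping time is a stop/continue decision for every observable prefix,
# with stopping forced at time n.  Enumerate all such rules by brute force.
prefixes = [pre for t in range(n - 1) for pre in product((-1, 1), repeat=t)]

def expected_reward(rule):
    paths = list(product((-1, 1), repeat=n - 1))
    total = 0.0
    for path in paths:
        for t in range(1, n + 1):
            pre = path[: t - 1]
            if t == n or rule[pre]:
                total += U(pre)
                break
    return total / len(paths)

best = max(expected_reward(dict(zip(prefixes, bits)))
           for bits in product((False, True), repeat=len(prefixes)))
```

In this toy instance both $E_1$ and the best achievable expected reward come out equal, which is exactly the content of $\mathbb{E}[U_\kappa] \le E_1$ with equality for an optimal $\tau$.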

35.6 Fix $x \in \mathcal{S}$ and let
\[
g = \sup_{\tau \ge 2} \frac{\mathbb{E}_x\big[\sum_{t=1}^{\tau-1} \alpha^{t-1} r(S_t)\big]}{\mathbb{E}_x\big[\sum_{t=1}^{\tau-1} \alpha^{t-1}\big]}\,.
\]

We will show that (a) $v_\gamma(x) > 0$ for all $\gamma < g$ and (b) $v_\gamma(x) = 0$ for all $\gamma \ge g$. For (a), assume $\gamma < g$. By the definition of $g$, there exists a stopping time $\tau \ge 2$ such that
\[
\mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1} r(S_t)\Big] > \gamma\, \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}\Big]\,,
\]
which implies that
\[
v_\gamma(x) \ge \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\Big] > 0\,.
\]

Moving now to (b), first note that $v_\gamma(x) \ge 0$ for any $\gamma \in \mathbb{R}$ because when $\tau = 1$, $\mathbb{E}_x\big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\big] = 0$. Hence, it suffices to show that $v_\gamma(x) \le 0$ for all $\gamma \ge g$. Pick $\gamma \ge g$. By the definition of $g$, for any stopping time $\tau \ge 2$,
\[
\mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\Big] \le 0\,,
\]
which implies that
\[
\sup_{\tau \ge 2}\, \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\Big] \le 0\,.
\]

If $\tau$ is an $\mathbb{F}$-stopping time then $P_x(\tau = 1)$ is either zero or one (the stopping rule underlying $\tau$ either stops given $S_1 = x$, or does not stop – the stopping rule cannot inject any further randomness). From this it follows that
\[
\begin{aligned}
v_\gamma(x) &= \sup_{\tau \ge 1}\, \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\Big] \\
&= \max\Big\{0,\, \sup_{\tau \ge 2}\, \mathbb{E}_x\Big[\sum_{t=1}^{\tau-1} \alpha^{t-1}(r(S_t) - \gamma)\Big]\Big\} \\
&\le 0\,,
\end{aligned}
\]


finishing the proof.

35.7 We want to apply Theorem 35.3. The difficulty is that Theorem 35.3 considers the case where the reward depends only on the current state, while here the reward accumulates. The solution is to augment the state space to include the history. We use the convention that if $x \in \mathcal{S}^n$ and $y \in \mathcal{S}^m$, then $xy \in \mathcal{S}^{n+m}$ is the concatenation of $x$ and $y$. In particular, the $i$th component of $xy$ is $x_i$ if $i \le n$ and $y_{i-n}$ if $i > n$. We will also denote by $x_{1:n}$ the sequence $(x_1, \ldots, x_n)$ formed from $x_1, \ldots, x_n$. Recall that $\mathcal{S}^*$ is the set of all finite sequences with elements in $\mathcal{S}$ and let $\mathcal{G}^*$ be the $\sigma$-algebra given by
\[
\mathcal{G}^* = \sigma\Big(\bigcup_{n=0}^\infty \mathcal{G}^n\Big)\,.
\]

For $n \ge 1$, let $\bar S_n = (S_1, \ldots, S_n)$. The sequence $(\bar S_n)_{n=1}^\infty$ is a Markov chain on the space of finite sequences $(\mathcal{S}^*, \mathcal{G}^*)$ with probability kernel characterized by
\[
Q_{x_{1:n}}(B_1 \times \cdots \times B_{n+1}) = P_{x_n}(B_{n+1}) \prod_{t=1}^n \mathbb{I}\{x_t \in B_t\}\,,
\]
where $B_1, \ldots, B_{n+1} \in \mathcal{G}$ are measurable. Note that for measurable $f : \mathcal{S}^* \to \mathbb{R}$ and $x_{1:n} \in \mathcal{S}^n$,
\[
\int_{\mathcal{S}^*} f(y)\, Q_{x_{1:n}}(dy) = \int_{\mathcal{S}} f(x_{1:n} x_{n+1})\, P_{x_n}(dx_{n+1})\,.
\]

Now define the $\mathcal{G}^*/\mathfrak{B}(\mathbb{R})$-measurable function $u : \mathcal{S}^* \to \mathbb{R}$ by
\[
u(x_{1:n}) = \sum_{t=1}^{n-1} \alpha^{t-1}(r(x_t) - \gamma)\,.
\]

Notice that the value of $u(x_{1:n})$ does not depend on $x_n$. Let $P_{x_{1:n}}$ be the probability measure carrying $(\bar S_n)_{n=1}^\infty$ for which $P_{x_{1:n}}(\bar S_n = x_{1:n}) = 1$. As usual, let $\mathbb{E}_{x_{1:n}}$ be the expectation with respect to $P_{x_{1:n}}$. Now let $U_t = u(\bar S_t)$ and define
\[
\bar v_\gamma(x_{1:n}) = \sup_{\tau \ge n} \mathbb{E}_{x_{1:n}}[U_\tau]\,. \tag{35.2}
\]
The definitions ensure that for any $x_{1:n} \in \mathcal{S}^*$, $x \in \mathcal{S}$ and $\gamma \in \mathbb{R}$,
\[
\bar v_\gamma(x) = v_\gamma(x) \quad \text{and} \quad \bar v_\gamma(x_{1:n}) = u(x_{1:n}) + \alpha^{n-1} v_\gamma(x_n)\,. \tag{35.3}
\]
In order to apply Theorem 35.3 we need to check the existence and integrability conditions of $(U_n)_{n=1}^\infty$. By Assumption 35.6, $U = \lim_{n\to\infty} U_n$ exists $P_{x_{1:n}}$-a.s. and $\sup_{n \ge 1} U_n$ is $P_{x_{1:n}}$-integrable for all $x_{1:n} \in \mathcal{S}^*$. Then by Theorem 35.3 it follows that
\[
\bar v_\gamma(x_{1:n}) = \max\Big\{u(x_{1:n}), \int_{\mathcal{S}} \bar v_\gamma(x_{1:n} x_{n+1})\, P_{x_n}(dx_{n+1})\Big\}\,.
\]


The proof of Part (a) is completed by noting that $u(xy) = r(x) - \gamma$, and so by (35.3),
\[
v_\gamma(x) = \bar v_\gamma(x) = \max\Big\{0, \int_{\mathcal{S}} \bar v_\gamma(xy)\, P_x(dy)\Big\}
= \max\Big\{0,\, r(x) - \gamma + \alpha \int_{\mathcal{S}} v_\gamma(y)\, P_x(dy)\Big\}\,.
\]
For Part (b), when $\gamma < g(x)$ we have $v_\gamma(x) > 0$ by definition and hence using the previous part it follows that
\[
v_\gamma(x) = r(x) - \gamma + \alpha \int_{\mathcal{S}} v_\gamma(y)\, P_x(dy)\,.
\]
Note that $\sup_{x \in \mathcal{S}} |v_{\gamma+\delta}(x) - v_\gamma(x)| \le |\delta|/(1-\alpha)$ and hence by continuity for $\gamma = g(x)$ we have
\[
r(x) - \gamma + \alpha \int_{\mathcal{S}} v_\gamma(y)\, P_x(dy) = 0 = v_\gamma(x)\,.
\]

For Part (c), applying Theorem 35.3 again shows that when $v_\gamma(x) = 0$, then $\tau = \min\{t \ge 2 : \bar v_\gamma(\bar S_t) = u(\bar S_t)\}$ attains the supremum in Eq. (35.2) with $x_{1:n} = x$. Notice finally that by (35.3), for any $x_{1:n} \in \mathcal{S}^*$,
\[
\bar v_\gamma(x_{1:n}) - u(x_{1:n}) = \alpha^{n-1} v_\gamma(x_n)\,,
\]
which means that $\tau = \min\{t \ge 2 : \alpha^{t-1} v_\gamma(S_t) = 0\} = \min\{t \ge 2 : g(S_t) \le \gamma\}$, where we used the fact that $v_\gamma(x) = 0 \Leftrightarrow g(x) \le \gamma$.

Chapter 36 Thompson Sampling

36.3 We need to show that
\[
\mathbb{P}(A^* = \cdot \mid \mathcal{F}_{t-1}) = \mathbb{P}(A_t = \cdot \mid \mathcal{F}_{t-1})
\]
holds almost surely. For specificity, let $r : \mathcal{E} \to [k]$ be the (tie-breaking) rule that chooses the arm with the highest mean given a bandit environment, so that $A_t = r(\nu_t)$ and $A^* = r(\nu)$. Recall that $\nu_t \sim Q_{t-1}(\cdot) = Q(\cdot \mid A_1, X_1, \ldots, A_{t-1}, X_{t-1})$. We have
\[
\begin{aligned}
\mathbb{P}(A^* = i \mid \mathcal{F}_{t-1}) &= \mathbb{P}(r(\nu) = i \mid \mathcal{F}_{t-1}) && \text{(definition of $A^*$)} \\
&= Q_{t-1}(\{x \in \mathcal{E} : r(x) = i\}) && \text{(definition of $Q_{t-1}$)} \\
&= \mathbb{P}(r(\nu_t) = i \mid \mathcal{F}_{t-1}) && \text{(definition of $\nu_t$)} \\
&= \mathbb{P}(A_t = i \mid \mathcal{F}_{t-1})\,. && \text{(definition of $A_t$)}
\end{aligned}
\]
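This identity — the Thompson sampling action distribution equals the posterior probability of each arm being optimal — can be illustrated with a Monte Carlo experiment. The sketch below uses a two-armed Beta–Bernoulli model with made-up posterior counts (an assumption for illustration, not part of the exercise) and compares the empirical frequency of the sampled action against the same probability computed by numerical integration.

```python
import math
import random

random.seed(0)

# Made-up posterior after some fixed history: arm i has a Beta(a_i, b_i) law.
posteriors = [(3, 2), (2, 3)]

def ts_action():
    # Thompson sampling: sample theta from the posterior, play the argmax.
    draws = [random.betavariate(a, b) for a, b in posteriors]
    return max(range(len(draws)), key=draws.__getitem__)

m = 200_000
freq0 = sum(ts_action() == 0 for _ in range(m)) / m

# P(A* = 0 | history) = P(theta_0 > theta_1), via midpoint-rule integration.
def beta_pdf(x, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def beta_cdf(x, a, b, steps=400):
    h = x / steps
    return h * sum(beta_pdf((j + 0.5) * h, a, b) for j in range(steps))

steps = 400
h = 1.0 / steps
p_star0 = h * sum(
    beta_pdf((j + 0.5) * h, *posteriors[0]) * beta_cdf((j + 0.5) * h, *posteriors[1])
    for j in range(steps)
)
```

Up to Monte Carlo and quadrature error, `freq0` and `p_star0` agree, matching the displayed chain of equalities.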


36.5 We have
\[
\begin{aligned}
\sum_{t \in \mathcal{T}} \mathbb{I}\{A_t = i\}
&\le \sum_{t=1}^n \sum_{s=1}^n \mathbb{I}\{T_i(t) = s,\, T_i(t-1) = s-1,\, G_i(T_i(t-1)) > 1/n\} \\
&= \sum_{s=1}^n \mathbb{I}\{G_i(s-1) > 1/n\} \sum_{t=1}^n \mathbb{I}\{T_i(t) = s,\, T_i(t-1) = s-1\} \\
&= \sum_{s=1}^n \mathbb{I}\{G_i(s-1) > 1/n\}\,,
\end{aligned}
\]
where the first step uses that when $A_t = i$, $T_i(t) = s$ and $T_i(t-1) = s-1$ for some $s \in [n]$, and that $t \in \mathcal{T}$ implies $G_i(T_i(t-1)) > 1/n$. The next equality is by algebra, and the last follows because for any $s \in [n]$, there is at most one time point $t \in [n]$ such that $T_i(t) = s$ and $T_i(t-1) = s-1$. For the next inequality, note that

\[
\begin{aligned}
\mathbb{E}\Big[\sum_{t \notin \mathcal{T}} \mathbb{I}\{E_i^c(t)\}\Big]
&= \sum_{t \notin \mathcal{T}} \mathbb{E}\Big[\mathbb{E}\big[\mathbb{I}\{E_i^c(t),\, G_i(T_i(t-1)) \le 1/n\} \,\big|\, \mathcal{F}_{t-1}\big]\Big] \\
&= \sum_{t \notin \mathcal{T}} \mathbb{E}\Big[\mathbb{I}\{G_i(T_i(t-1)) \le 1/n\}\, G_i(T_i(t-1))\Big] \\
&\le \sum_{t \notin \mathcal{T}} \mathbb{E}\Big[\mathbb{I}\{G_i(T_i(t-1)) \le 1/n\}\, \frac{1}{n}\Big] \\
&= \mathbb{E}\Big[\sum_{t \notin \mathcal{T}} \frac{1}{n}\Big]\,,
\end{aligned}
\]
where the second equality used that $\mathbb{I}\{G_i(T_i(t-1)) \le 1/n\}$ is $\mathcal{F}_{t-1}$-measurable and $\mathbb{E}[\mathbb{I}\{E_i^c(t)\} \mid \mathcal{F}_{t-1}] = 1 - \mathbb{P}(\theta_i(t) \le \mu_1 - \varepsilon \mid \mathcal{F}_{t-1}) = G_i(T_i(t-1))$.

36.6

(a) Let $f(y) = \sqrt{s/(2\pi)} \exp(-sy^2/2)$ be the probability density function of a centered Gaussian with variance $1/s$ and $F(y) = \int_{-\infty}^y f(x)\,dx$ be its cumulative distribution function. Then
\[
G_{1s} = \int_{\mathbb{R}} \frac{f(y+\varepsilon) F(y)}{1 - F(y)}\,dy
\le \int_0^\infty \frac{f(y+\varepsilon)}{1 - F(y)}\,dy + 2\int_{-\infty}^0 f(y+\varepsilon) F(y)\,dy\,. \tag{36.1}
\]
For the first term in Eq. (36.1), following the hint, we use the following bound on $1 - F(y)$ for $y \ge 0$:
\[
1 - F(y) \ge \frac{\exp(-sy^2/2)}{y\sqrt{s} + \sqrt{sy^2 + 4}}\,.
\]


Hence
\[
\begin{aligned}
\int_0^\infty \frac{f(y+\varepsilon)}{1 - F(y)}\,dy
&\le \int_0^\infty f(y+\varepsilon) \exp(sy^2/2)\big(y\sqrt{s} + \sqrt{sy^2+4}\big)\,dy \\
&\le 2\exp(-s\varepsilon^2/2) \int_0^\infty \exp(-sy\varepsilon)\big(y\sqrt{s} + 1\big) \sqrt{\frac{s}{2\pi}}\,dy \\
&= 2\,\frac{1 + \varepsilon\sqrt{s}}{\varepsilon^2 s \sqrt{2\pi}} \exp(-s\varepsilon^2/2)\,.
\end{aligned}
\]
For the second term in Eq. (36.1),
\[
2\int_{-\infty}^0 f(y+\varepsilon) F(y)\,dy \le 2\int_{\mathbb{R}} f(y+\varepsilon) F(y)\,dy \le 2\exp(-s\varepsilon^2)\,.
\]

Summing from $s = 1$ to $\infty$ shows that
\[
2\sum_{s=1}^\infty \Big(\exp(-s\varepsilon^2) + \frac{1 + \varepsilon\sqrt{s}}{\varepsilon^2 s \sqrt{2\pi}} \exp(-s\varepsilon^2/2)\Big)
\le \frac{c}{\varepsilon^2} \log\Big(\frac{1}{\varepsilon}\Big)\,,
\]
where the last line follows from a Mathematica slog.

(b) Let $\hat\mu_{is}$ be the empirical mean of arm $i$ after $s$ observations. Then $G_{is} \le 1/n$ if
\[
\hat\mu_{is} + \sqrt{\frac{2\log(n)}{s}} \le \mu_1 - \varepsilon\,.
\]
Hence for $s \ge u = \frac{2\log(n)}{(\Delta_i - \varepsilon)^2}$ we have
\[
\begin{aligned}
\mathbb{P}(G_{is} > 1/n)
&\le \mathbb{P}\Bigg(\hat\mu_{is} + \sqrt{\frac{2\log(n)}{s}} > \mu_1 - \varepsilon\Bigg) \\
&= \mathbb{P}\Bigg(\hat\mu_{is} - \mu_i > \Delta_i - \varepsilon - \sqrt{\frac{2\log(n)}{s}}\Bigg) \\
&\le \exp\Bigg(-\frac{s\Big(\Delta_i - \varepsilon - \sqrt{\frac{2\log(n)}{s}}\Big)^2}{2}\Bigg)\,.
\end{aligned}
\]


Summing,
\[
\sum_{s=1}^n \mathbb{P}(G_{is} > 1/n)
\le u + \sum_{s=\lceil u \rceil}^n \exp\Bigg(-\frac{s\Big(\Delta_i - \varepsilon - \sqrt{\frac{2\log(n)}{s}}\Big)^2}{2}\Bigg)
\le 1 + \frac{2}{(\Delta_i - \varepsilon)^2}\Big(\log(n) + \sqrt{\pi\log(n)} + 1\Big)\,,
\]
where the last inequality follows by bounding the sum by an integral as in the proof of Lemma 8.2.

36.13 Let $\pi$ be a minimax optimal policy for $\{0,1\}^{n \times k}$ and let $x \in [0,1]^{n \times k}$ be an arbitrary adversarial bandit. Choose $\tilde\pi$ to be the policy obtained by observing $x_{tA_t}$, sampling $\tilde X_t \sim \mathcal{B}(x_{tA_t})$ and passing $\tilde X_t$ to $\pi$. Then
\[
R_n(\tilde\pi, x) \le \sum_{\tilde x \in \{0,1\}^{n \times k}} \prod_{t=1}^n \prod_{i=1}^k x_{ti}^{\tilde x_{ti}} (1 - x_{ti})^{1 - \tilde x_{ti}}\, R_n(\pi, \tilde x) \le R_n^*(\{0,1\}^{n \times k})\,.
\]
Therefore $R_n^*([0,1]^{n \times k}) \le R_n^*(\{0,1\}^{n \times k})$. The other direction is obvious.

Chapter 37 Partial Monitoring

37.3 It suffices to show that $y$ has no component in any direction $z$ that is perpendicular to $x$. Let $z$ be such a direction. Without loss of generality either $z^\top \mathbf{1} = 0$ or $z^\top \mathbf{1} = 1$. Assume first that $z^\top \mathbf{1} = 0$. Take some $u \in \ker'(x)$. Then $u + z \in \ker'(x)$ also holds. Since $\ker'(x) \subset \ker'(y)$, $z^\top y = (u+z)^\top y - u^\top y = 0$. Assume now that $z^\top \mathbf{1} = 1$. Then $z \in \ker'(x) \subset \ker'(y)$ and hence $z^\top y = 0$.

37.10 An argument that almost works is to choose $\omega \in \operatorname{ri}(C_b)$ arbitrarily and find the first $\nu$ along the chord connecting $\omega$ and $\lambda$ for which $\nu \in C_b \cap C_c$ for some cell $c$. The minor problem is that $C_b \cap C_c \neq \emptyset$ does not imply that $b$ and $c$ are neighbours, because there could be a third non-duplicate Pareto optimal action $d$ with $\nu \in C_b \cap C_c \cap C_d$. This issue is resolved by making an ugly dimension argument. Let $D$ be the set of $\omega \in \mathcal{P}_{d-1}$ for which at least three non-duplicate Pareto optimal cells intersect, which has dimension at most $\dim(D) \le d - 3$. Since $b$ is Pareto optimal, $\dim(\operatorname{ri}(C_b)) = d - 1$. Meanwhile, the set of those $\omega \in \operatorname{ri}(C_b)$ such that $[\omega, \lambda] \cap D \neq \emptyset$ has dimension at most $d - 2$. Hence, there exists an $\omega \in \operatorname{ri}(C_b)$ such that $[\omega, \lambda] \cap D = \emptyset$, and for this choice the initial argument works.

37.12

(a) Assume without loss of generality that $\Sigma = [m]$. Given an action $c \in [k]$, let $S_c$ be as in the proof of Theorem 37.12 and $S \in \mathbb{R}^{km \times d}$ be the matrix obtained by stacking $(S_c)_{c=1}^k$:
\[
S = \begin{pmatrix} S_1 \\ \vdots \\ S_k \end{pmatrix}\,.
\]
As in the proof of Theorem 37.12, by the definition of global observability, for any pair of neighbouring actions $a, b$, it holds that $\ell_a - \ell_b \in \operatorname{im}(S^\top)$ and hence
\[
\ell_a - \ell_b = S^\top U(\ell_a - \ell_b)\,,
\]
where $U$ is the Moore–Penrose pseudo-inverse of $S^\top$. Then,
\[
\|U(\ell_a - \ell_b)\|_\infty \le \|U(\ell_a - \ell_b)\|_2 \le \|U\|_2 \|\ell_a - \ell_b\|_2 \le d^{1/2}\|U\|_2\,.
\]

The largest singular value of $U$ is the square root of the reciprocal of the smallest non-zero eigenvalue of $S^\top S \in \{0, \ldots, k\}^{d \times d}$. Let $(\lambda_i)_{i=1}^p$ be the non-zero eigenvalues of $S^\top S$, in decreasing order. Recall that for a square matrix $A$, the product of its non-zero eigenvalues is a coefficient of the characteristic polynomial. Since $S^\top S$ has entries in $\{0, \ldots, k\}$, the characteristic polynomial has integer coefficients. Since $S^\top S$ is positive semidefinite, its non-zero eigenvalues are all positive and it follows that $\prod_{i=1}^p \lambda_i \ge 1$. If $p = 1$, then we are done. Suppose that $p > 1$. By the arithmetic–geometric mean inequality,
\[
\frac{1}{\lambda_p} \le \prod_{i=1}^{p-1} \lambda_i \le \Big(\frac{\operatorname{trace}(S^\top S)}{p-1}\Big)^{p-1} \le \Big(\frac{dk}{p-1}\Big)^{p-1} \le k^d\,.
\]
Hence, $\|U\|_2 \le k^{d/2}$ and the result follows.

(b) Repeat the argument above, but restrict $S$ to stacking $(S_c)_{c \in N_{ab}}$.

(c) For non-degenerate locally observable games, $N_e = \{a, b\}$ for all $e = (a, b) \in E$. Given a Pareto optimal action $a$, let $V_a = \{(a, \Phi_{ai}) : i \in [d]\}$. Let $a, b$ be neighbouring actions and $f \in \mathcal{E}^{\mathrm{loc}}_{ab}$, which exists by the assumption that the game is locally observable. Define $C = \{((a, \Phi_{ai}), (b, \Phi_{bi})) : i \in [d]\} \subset V_a \times V_b$, which makes $(V_a \cup V_b, C)$ a bipartite graph. We define a new function $g : [k] \times \Sigma \to \mathbb{R}$. First, let $g(c, \sigma) = 0$ whenever $\sigma \notin \{\Phi_{ci} : i \in [d]\}$ or $c \notin \{a, b\}$. Next, let $(V'_a \cup V'_b, C')$ be a connected component. Then, by the conditions on $f$,
\[
\max_{n', n'' \in V'_a \cup V'_b} |f(n') - f(n'')| \le s := 2(m-1) + 1\,.
\]
Letting $c$ be the midpoint of the interval that the values of $f$ restricted to $V'_a \cup V'_b$ fall into, define
\[
f'(n) = \begin{cases} f(n) - c\,, & \text{if } n \in V'_a\,; \\ f(n) + c\,, & \text{if } n \in V'_b\,. \end{cases}
\]
Clearly we still have $f' \in \mathcal{E}^{\mathrm{loc}}_{ab}$ and $f'|_{V'_a \cup V'_b}$ takes values in $[-s/2, s/2] \subset [-m, m]$. Now repeat the procedure with $f'$ and the next connected component of the graph until all connected components are processed. Let the resulting function be $g$. Then, $g \in \mathcal{E}^{\mathrm{loc}}_{ab}$ and also $\|g\|_\infty \le m$.

37.13 Let $\Sigma = \{\clubsuit, \heartsuit\}$, $\ell_1 = (1, 0, 0, \ldots, 0)$ and $\ell_2 = (0, 1, 1, \ldots, 1)$. For all other actions $a > 2$ let $\ell_a = \mathbf{1}$. Hence $\ell_1 - \ell_2 = (1, -1, -1, \ldots, -1)$. Let $d = 2k - 1$ and let $\Phi \in \Sigma^{k \times d}$ be the matrix whose first row alternates $\clubsuit$ and $\heartsuit$, starting with $\clubsuit$, and for $a > 1$, let
\[
\Phi_{ai} = \begin{cases}
\clubsuit & \text{if } i \le 2(a-1) \\
\heartsuit & \text{if } i = 1 + 2(a-1) \\
\clubsuit & \text{otherwise and } i \text{ is odd} \\
\heartsuit & \text{otherwise and } i \text{ is even}\,.
\end{cases}
\]
For example, when $k = 4$, then
\[
\Phi = \begin{pmatrix}
\clubsuit & \heartsuit & \clubsuit & \heartsuit & \clubsuit & \heartsuit & \clubsuit \\
\clubsuit & \clubsuit & \heartsuit & \heartsuit & \clubsuit & \heartsuit & \clubsuit \\
\clubsuit & \clubsuit & \clubsuit & \clubsuit & \heartsuit & \heartsuit & \clubsuit \\
\clubsuit & \clubsuit & \clubsuit & \clubsuit & \clubsuit & \clubsuit & \heartsuit
\end{pmatrix}\,.
\]

Suppose that $f \in \mathcal{E}^{\mathrm{glo}}_{12}$. Then,
\[
2 = \ell_{11} - \ell_{21} + \ell_{22} - \ell_{12} = \sum_{a=1}^k f(a, \Phi_{a1}) - \sum_{a=1}^k f(a, \Phi_{a2}) = f(1, \clubsuit) - f(1, \heartsuit)\,.
\]
Furthermore, for $a > 1$ and $i = 2(a-1)$, since $\ell_{1i} - \ell_{2i} = \ell_{1,i+1} - \ell_{2,i+1} = -1$,
\[
0 = \sum_{b=1}^k f(b, \Phi_{bi}) - \sum_{b=1}^k f(b, \Phi_{b,i+1}) = f(a, \clubsuit) - f(a, \heartsuit) + \sum_{b < a} \big(f(b, \heartsuit) - f(b, \clubsuit)\big)\,.
\]
Hence,
\[
f(a, \clubsuit) - f(a, \heartsuit) = \sum_{b < a} \big(f(b, \clubsuit) - f(b, \heartsuit)\big)\,.
\]
By induction it follows that $f(a, \clubsuit) - f(a, \heartsuit) = 2^{a-1}$ for all $a > 1$. Finally, note that actions 1 and 2 are non-duplicate and Pareto optimal. Since all other actions are degenerate, these actions are neighbours. The game is globally observable because all columns have distinct patterns.
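The case analysis defining $\Phi$ is easy to get wrong, so here is a small script (the letters `c`/`h` stand in for ♣/♥; a hypothetical encoding, not part of the exercise) that builds the matrix from the stated rule and checks it against the $k = 4$ example, as well as the claim that all columns have distinct patterns.

```python
C, H = "c", "h"  # stand-ins for clubs and hearts

def build_phi(k):
    d = 2 * k - 1
    # first row alternates clubs/hearts, starting with clubs
    rows = [[C if i % 2 == 1 else H for i in range(1, d + 1)]]
    for a in range(2, k + 1):
        row = []
        for i in range(1, d + 1):
            if i <= 2 * (a - 1):
                row.append(C)
            elif i == 2 * (a - 1) + 1:
                row.append(H)
            else:
                row.append(C if i % 2 == 1 else H)
        rows.append(row)
    return ["".join(r) for r in rows]

phi4 = build_phi(4)
columns = {"".join(row[i] for row in phi4) for i in range(7)}
```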


37.14 For bandit games with $\Phi = L$, let $p = q$ and
\[
f(a, \sigma)_b = \sigma\, \mathbb{I}\{a = b\}\,.
\]
We have $\sum_{a=1}^k f(a, \Phi_{ai})_b = \sum_{a=1}^k f(a, L_{ai})_b = \sum_{a=1}^k L_{ai}\mathbb{I}\{a = b\} = L_{bi}$ and thus $f \in \mathcal{E}^{\mathrm{vec}}$. Then, using that $\exp(-x) + x - 1 \le x^2/2$ for $x \ge 0$,
\[
\begin{aligned}
\operatorname{opt}^*_q(\eta)
&\le \max_{i \in [d]} \Bigg(\frac{(p - q)^\top L e_i}{\eta} + \frac{1}{\eta^2} \sum_{a=1}^k p_a \Psi_q\Big(\frac{\eta f(a, \Phi_{ai})}{p_a}\Big)\Bigg) \\
&\le \max_{i \in [d]} \Bigg(\frac{1}{2} \sum_{a=1}^k p_a \sum_{b=1}^k \frac{q_b L_{ai}^2 \mathbb{I}\{a = b\}}{p_a^2}\Bigg) \\
&\le \frac{k}{2}\,.
\end{aligned}
\]
The full information game is similar. As before, let $p = q$, but choose $f(a, \sigma)_b = p_a L_{b\sigma}$. We have $\sum_{a=1}^k f(a, \Phi_{ai})_b = \sum_{a=1}^k f(a, i)_b = \sum_{a=1}^k p_a L_{bi} = L_{bi}$ and thus $f \in \mathcal{E}^{\mathrm{vec}}$. Then, again using that $\exp(-x) + x - 1 \le x^2/2$ for $x \ge 0$,
\[
\operatorname{opt}^*_q(\eta) \le \max_{i \in [d]} \frac{1}{2} \sum_{a=1}^k p_a \sum_{b=1}^k q_b L_{bi}^2 \le \frac{1}{2}\,.
\]

Chapter 38 Markov Decision Processes

38.2 The solution to Part (b) is immediate from Part (a), so we only show the solution to Part (a). Abbreviate $\mathbb{P}^\pi_\mu$ to $\mathbb{P}$ and $\mathbb{P}^{\pi'}_\mu$ to $\mathbb{P}'$. Let $\pi' = (\pi'_1, \pi'_2, \ldots)$ be the Markov policy to be constructed: for each $t \ge 1$ and $s \in \mathcal{S}$, $\pi'_t(\cdot \mid s)$ is a distribution over $\mathcal{A}$.

Fix $(s, a) \in \mathcal{S} \times \mathcal{A}$ and consider first $t = 1$. We want $\mathbb{P}'(S_1 = s, A_1 = a) = \mathbb{P}(S_1 = s, A_1 = a)$. By the definitions of $\mathbb{P}'$ and $\mathbb{P}$ and the definition of conditional probabilities, we have $\mathbb{P}'(S_1 = s, A_1 = a) = \mathbb{P}'(A_1 = a \mid S_1 = s)\mathbb{P}'(S_1 = s) = \pi'_1(a \mid s)\mu(s)$, while $\mathbb{P}(S_1 = s, A_1 = a) = \mathbb{P}(A_1 = a \mid S_1 = s)\mu(s)$. Thus, defining $\pi'_1(a \mid s) = \mathbb{P}(A_1 = a \mid S_1 = s)$ we see that the desired equality holds for $t = 1$. Now, for $t > 1$ first notice that from $\mathbb{P}'(A_{t-1} = a, S_{t-1} = s) = \mathbb{P}(A_{t-1} = a, S_{t-1} = s)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$ it follows by summing these equations over $a$ that $\mathbb{P}'(S_{t-1} = s) = \mathbb{P}(S_{t-1} = s)$. Hence, the same calculation as for $t = 1$ applies, showing that $\pi'_t(a \mid s) = \mathbb{P}(A_t = a \mid S_t = s)$ will work.

38.4 We show that $D(M) < \infty$ implies that $M$ is strongly connected and that $D(M) = \infty$ implies that $M$ is not strongly connected. Assume first that $D(M) < \infty$. Take any $s, s' \in \mathcal{S}$. Assume first that $s \neq s'$. By definition, there is a policy whose expected travel time from state $s$ to $s'$ is finite. Take this policy. It follows that this policy reaches state $s'$ from state $s$ with positive probability, because otherwise the expected travel time would be infinite. Formally, if $T$ is the random travel time of a policy whose expected travel time between $s$ and $s'$ is finite, then $\{T = \infty\}$ is the event that the policy does not reach state $s'$. Now, for any $n \in \mathbb{N}$, $T > n\mathbb{I}\{T = \infty\}$. Taking expectations and reordering gives $\mathbb{E}[T]/n > \mathbb{P}(T = \infty)$. Letting $n \to \infty$, we see that $\mathbb{P}(T = \infty) = 0$ (thus the policy in fact reaches state $s'$ with probability one). It remains to consider the case when $s = s'$. If the MDP has a single state, it is strongly connected by definition. Otherwise, there exists a state $s'' \in \mathcal{S}$ that is distinct from $s = s'$. Since $D(M)$ is finite, there is a policy that reaches $s''$ from $s$ with positive probability and another one that reaches $s$ again from $s''$ with positive probability. Compose these two policies in the obvious way to find a policy that travels from $s$ to $s$ with positive probability.

Assume now that $D(M) = \infty$, while $M$ is strongly connected (proof by contradiction). Since $M$ is strongly connected, for any $s, s'$ there is a policy that has a positive probability of reaching $s'$ from $s$. But this means that the uniformly random policy (the policy which chooses uniformly at random between the actions at any state) also has a positive probability of reaching any state from any other state. We claim that the expected travel time of this policy is finite between any pair of states. Indeed, this follows by noticing that the states under this policy form a time-homogeneous Markov chain whose transition probability matrix is irreducible, and the hitting times in such a Markov chain, which coincide with the expected travel times in the MDP for the said policy, are finite.

38.5 We follow the advice of the hint. For the second part, note that the minimum in the definition of $d^*(\mu_0, U)$ is attained when $n_k$ is maximised for small indices until $|U|$ is exhausted. In particular, if $(n_k)_{0 \le k \le m}$ denotes the optimal solution ($n_k = 0$ for $k > m$), then $n_0 = A^0, \ldots, n_{m-1} = A^{m-1}$ and
\[
0 \le n_m = |U| - \sum_{k=0}^{m-1} A^k \Big({}= |U| - \frac{A^m - 1}{A - 1}\Big) < A^m\,.
\]
Hence, $|U| < A^m + \frac{A^m - 1}{A - 1} \le 2A^m$, implying that $m \ge \log_A(|U|/2)$. Thus,
\[
\begin{aligned}
d^*(\mu_0, U) &= \sum_{k=0}^{m-1} k A^k + m n_m \\
&= m|U| + \sum_{k=0}^{m-1} (k - m) A^k \\
&= m|U| + \frac{m}{A-1} - \frac{A^{m+1} - A}{(A-1)^2} \\
&\overset{(a)}{\ge} |U|\Big(m - 1 - \frac{1}{A-1}\Big) \\
&\ge |U|\big(\log_A(|U|) - 3\big)\,,
\end{aligned}
\]
where step (a) follows since $|U| \ge \frac{A^m - 1}{A - 1}$. Choosing $U = \mathcal{S}$, we see that the expected minimum time to reach a uniformly random state in $\mathcal{S}$ is lower bounded by $\log_A(|\mathcal{S}|) - 3$. The expected minimum time to reach the worst-case state in $\mathcal{S}$ must also be at least this quantity, proving the desired result.
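The geometric-series computation in 38.5 can be verified exactly with rational arithmetic. The sketch below checks the identity $\sum_{k=0}^{m-1}(k-m)A^k = \frac{m}{A-1} - \frac{A^{m+1}-A}{(A-1)^2}$ for a handful of values of $A$ and $m$.

```python
from fractions import Fraction

def lhs(A, m):
    # direct evaluation of sum_{k=0}^{m-1} (k - m) * A^k
    return Fraction(sum((k - m) * A ** k for k in range(m)))

def rhs(A, m):
    A = Fraction(A)
    return Fraction(m) / (A - 1) - (A ** (m + 1) - A) / (A - 1) ** 2

checks = [(A, m) for A in (2, 3, 5, 10) for m in (1, 2, 3, 7)]
all_equal = all(lhs(A, m) == rhs(A, m) for A, m in checks)
```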

38.7

(a) $A_n \mathbf{1} = \frac{1}{n}\sum_{t=0}^{n-1} P^t \mathbf{1} = \mathbf{1}$, which means that $A_n$ is right stochastic.

(b) This follows immediately from the definitions.


(c) Let $(B_m)$ and $(C_m)$ be convergent subsequences of $(A_n)$ with $\lim_{m\to\infty} B_m = B$ and $\lim_{m\to\infty} C_m = C$. It suffices to show that $B = C$. From Part (b), $B_m + \frac{1}{n_m}(P^{n_m} - I) = B_m P = P B_m$. Taking the limit as $m$ tends to infinity and using the fact that $P^n$ is $[0,1]$-valued we see that $B = BP = PB$. Similarly, $C = CP = PC$ and it follows that $B = BP^i = P^i B$ and $C = CP^i = P^i C$ hold for any $i \ge 0$. Hence $B = BC_m = C_m B$ and $C = CB_m = B_m C$ for any $m \ge 1$. Taking the limit as $m$ tends to infinity shows that $B = BC = CB$ and $C = CB = BC$, which together imply that $B = C$.

(d) We have already seen in the proof of Part (c) that $P^* = P^* P = P P^*$. From this, it follows that $P^* = P^* P^i$ for any $i \ge 0$, which implies that $P^* = P^* A_n$ holds for any $n \ge 1$. Taking the limit shows that $P^* = P^* P^*$.

(e) Let $B = P - P^*$. By algebra, $I - B^i = (I - B)(I + B + \cdots + B^{i-1})$. Summing over $i = 1, \ldots, n$, dividing by $n$ and using the fact that $B^i = P^i - P^*$ for all $i \ge 1$,
\[
I - \frac{1}{n}\sum_{i=1}^n P^i + P^* = (I - B)H_n\,, \tag{38.1}
\]
where $H_n = \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} (P - P^*)^k$. The limit of the left-hand side of (38.1) exists and is equal to the identity matrix $I$. Hence the limit of the right-hand side also exists and in particular the limit of $H_n$ must exist. Denoting this limit by $H_\infty$ we find that $I = (I - B)H_\infty$ and thus $I - B$ is invertible and its inverse $H$ is equal to $H_\infty$.

(f) Let $U_n = \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} (P^k - P^*)$. Then
\[
B^k = \begin{cases} I\,, & \text{if } k = 0\,; \\ P^k - P^*\,, & \text{otherwise}\,. \end{cases}
\]
Using this we calculate
\[
H_n - U_n = \frac{1}{n}\sum_{i=1}^n (P - P^*)^0 - \frac{1}{n}\sum_{i=1}^n (P^0 - P^*) = I - I + P^* = P^*\,.
\]
Hence $H - \lim_{n\to\infty} U_n = P^*$. From the definition of $U$ we have $U = \lim_{n\to\infty} U_n$, so $U = H - P^*$.

(g) This follows immediately because
\[
\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} P^k(r - \rho)
= \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{i-1} (P^k - P^*)\, r = Ur\,.
\]

(h) One way to prove this is to note that by the previous part $v = Ur = (H - P^*)r$, hence $r = H^{-1}(v + \rho)$. Now, $H^{-1}(v + \rho) = (I - P + P^*)(v + \rho) = v - Pv + P^*v + \rho - P\rho + P^*\rho = v - Pv + \rho$, where we used that $P^*v = P^*Ur$ and $P^*U = P^*(H - P^*) = P^*H - P^* = (P^* - P^*H^{-1})H = 0$, and that $P\rho = PP^*r = P^*r = \rho = P^*\rho = P^*P^*r$.

Alternatively, the following direct argument also works. In this argument we only use that $v$ is well defined. Let $v_n = \sum_{t=0}^{n-1} P^t(r - \rho)$ and $\bar v_n = \frac{1}{n}\sum_{i=1}^n v_i$. Note that $\lim_{n\to\infty} \bar v_n = v$. Then,


$v_{k+1} = P v_k + (r - \rho)$. Taking the average of these over $k = 1, \ldots, n$ we get
\[
\frac{1}{n}\big((n+1)\bar v_{n+1} - v_1\big) = P\bar v_n + (r - \rho)\,.
\]
Taking the limit of both sides proves that $v = Pv + r - \rho$, which, after reordering, gives $v + \rho = r + Pv$.

38.8

(a) First note that $|\max_x f(x) - \max_y g(y)| \le \max_x |f(x) - g(x)|$. Then for $v, w \in \mathbb{R}^{\mathcal{S}}$,
\[
\|T_\gamma v - T_\gamma w\|_\infty
\le \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \gamma\,|\langle P_a(s), v - w\rangle|
\le \max_{s \in \mathcal{S}} \max_{a \in \mathcal{A}} \gamma\,\|P_a(s)\|_1 \|v - w\|_\infty
= \gamma \|v - w\|_\infty\,.
\]
Hence $T_\gamma$ is a contraction with respect to the supremum norm as required.

(b) This follows immediately from the Banach fixed point theorem, which also guarantees the uniqueness of a value function $v$ satisfying $v = T_\gamma v$.

(c) Recall that the greedy policy is $\pi(s) = \operatorname{argmax}_a r_a(s) + \gamma\langle P_a(s), v\rangle$. Then
\[
v(s) = \max_{a \in \mathcal{A}} r_a(s) + \gamma\langle P_a(s), v\rangle = r_\pi(s) + \gamma\langle P_\pi(s), v\rangle\,.
\]
(d) We have $v = r_\pi + \gamma P_\pi v$. Solving for $v$ completes the result.

(e) If $\pi$ is a memoryless policy, it is trivial to see that $v^\pi_\gamma = r_\pi + \gamma P_\pi v^\pi_\gamma$. Let $\pi^*$ be the greedy policy with respect to $v$, the unique solution of $v = T_\gamma v$. By the previous part of this exercise, it follows that $v^{\pi^*}_\gamma = v$. By Exercise 38.2, it suffices to show that for any Markov policy $\pi$, $v^\pi_\gamma \le v$. If $\pi_t$ is the memoryless policy used in time step $t$ when following $\pi$, then $v^\pi_\gamma = \sum_{t=1}^\infty \gamma^{t-1} P^{(t-1)} r_{\pi_t}$, where $P^{(0)} = I$ and for $t \ge 1$, $P^{(t)} = P_{\pi_1} \cdots P_{\pi_t}$. For $n \ge 1$, let $v^\pi_{\gamma,n} = \sum_{t=1}^n \gamma^{t-1} P^{(t-1)} r_{\pi_t}$. It is easy to see that $v^\pi_{\gamma,1} = r_{\pi_1} \le T_\gamma 0$. Assume that for some $n \ge 1$,
\[
\sup_{\pi\ \text{Markov}} v^\pi_{\gamma,n} \le T^n_\gamma 0\,. \tag{38.2}
\]
Notice that $f \le g$ implies $T_\gamma f \le T_\gamma g$. Hence, $T_\gamma v^\pi_{\gamma,n} \le T^{n+1}_\gamma 0$. Further, $v^\pi_{\gamma,n+1} = r_{\pi_1} + \gamma P_{\pi_1} v^{\pi'}_{\gamma,n} \le T_\gamma v^{\pi'}_{\gamma,n}$, where $\pi'$ is the Markov policy obtained from $\pi$ by discarding $\pi_1$ (that is, if $\pi = (\pi_1, \pi_2, \ldots)$, then $\pi' = (\pi_2, \pi_3, \ldots)$). This shows that (38.2) holds for all $n \ge 1$. Letting $n \to \infty$, the right-hand side converges to $v$, while the left-hand side converges to $v^\pi_\gamma$. Hence, $v^\pi_\gamma \le v$.

38.9

(a) Let $0 \le \gamma < 1$. Algebra gives
\[
P^*_\gamma P = (1 - \gamma)\sum_{t=0}^\infty \gamma^t P^t P = \frac{P^*_\gamma - (1-\gamma)I}{\gamma}\,.
\]
Hence $\gamma P^*_\gamma P = P^*_\gamma - (1-\gamma)I$. It is easy to check that $P^*_\gamma$ is right stochastic. By the compactness of the space of right stochastic matrices, $(P^*_\gamma)_\gamma$ has at least one cluster point $A$ as $\gamma \to 1^-$. It follows that $AP = A$, which implies that $AP^* = A$. Now,
\[
(P^*_\gamma)^{-1} P^* = \frac{(I - \gamma P)P^*}{1 - \gamma} = P^*\,,
\]
which implies that $P^* = AP^* = A$. Since this holds for any cluster point we conclude that $\lim_{\gamma \to 1^-} P^*_\gamma = P^*$.

(b) Since $I - P + P^*$ is invertible and $PP^* = P^* = P^*P^*$, the required claim is equivalent to
\[
\lim_{\gamma \to 1^-} (I - P + P^*)\Big(\frac{P^*_\gamma - P^*}{1 - \gamma}\Big) = I - P^*\,.
\]
Rewriting $P^*_\gamma = (1-\gamma)\sum_{t=0}^\infty \gamma^t P^t$ shows that
\[
(I - P + P^*)\Big(\frac{P^*_\gamma - P^*}{1 - \gamma}\Big)
= (I - P + P^*)\Big(\sum_{t=0}^\infty \gamma^t P^t - \frac{P^*}{1 - \gamma}\Big)
= \sum_{t=0}^\infty (I - P)\gamma^t P^t = \frac{1}{\gamma}(I - P^*_\gamma)\,.
\]
The result is completed by taking the limit as $\gamma$ tends to one from below and using Part (a).

38.10

(a) Let $(\gamma_n)$ be an arbitrary increasing sequence with $\gamma_n < 1$ for all $n$ and $\lim_{n\to\infty} \gamma_n = 1$. Let $v_n$ be the fixed point of $T_{\gamma_n}$ and $\pi_n$ be the greedy policy with respect to $v_n$. Since greedy policies are always deterministic and there are only finitely many deterministic policies, it follows that there exists a subsequence $n_1 < n_2 < \cdots$ and a policy $\pi$ such that $\pi_{n_k} = \pi$ for all $k$.

(b) For arbitrary $\pi$ and $0 \le \gamma < 1$, let $v^\pi_\gamma$ be the value function of policy $\pi$ in the $\gamma$-discounted MDP, $v^\pi$ the value function of $\pi$ and $\rho^\pi$ its gain. Let $U_\pi$ be the deviation matrix underlying $P_\pi$. Define $f^\pi_\gamma$ by
\[
v^\pi_\gamma = \frac{\rho^\pi}{1 - \gamma} + v^\pi + f^\pi_\gamma\,. \tag{38.3}
\]
By Part (b) of Exercise 38.9, and because $\rho^\pi = P^*_\pi r_\pi$ and $v^\pi = U_\pi r_\pi$, it holds that $\|f^\pi_\gamma\|_\infty \to 0$ as $\gamma \to 1$.

Fix now $\pi$ to be the policy whose existence is guaranteed by the previous part. By Part (e) of Exercise 38.8, $\pi$ is $\gamma_n$-discount optimal for all $n \ge 1$. Suppose that $\rho^\pi$ is not constant. In any case, $\rho^\pi$ is piecewise constant on the recurrent classes of the Markov chain with transition probabilities $P_\pi$. Let $\rho^\pi_* = \max_{s \in \mathcal{S}} \rho^\pi(s)$ and let $R \subset \mathcal{S}$ be the recurrent class in this Markov chain


where $\rho^\pi$ is the largest, and take a policy $\pi'$ that is identical to $\pi$ over $R$, while $\pi'$ is set up such that it gets to $R$ with probability one. Such a $\pi'$ exists because the MDP is strongly connected. Fix any $s \in \mathcal{S} \setminus R$. We claim that there exists some $\gamma^* \in (0, 1)$ such that for all $\gamma \ge \gamma^*$,
\[
v^{\pi'}_\gamma(s) > v^\pi_\gamma(s)\,. \tag{38.4}
\]
If this were true and $n$ were large enough so that $\gamma_n \ge \gamma^*$, then, since $\pi$ is $\gamma_n$-discount optimal, $v^\pi_{\gamma_n}(s) \ge v^{\pi'}_{\gamma_n}(s) > v^\pi_{\gamma_n}(s)$, which is a contradiction.

Hence, it remains to show (38.4). By the construction of $\pi'$, $\rho^{\pi'}(s) = \rho^\pi_* > \rho^\pi(s)$. From (38.3),
\[
v^{\pi'}_\gamma(s) = \frac{\rho^{\pi'}(s)}{1 - \gamma} + v^{\pi'}(s) + f^{\pi'}_\gamma(s)
> \frac{\rho^\pi(s)}{1 - \gamma} + v^\pi(s) + f^\pi_\gamma(s) = v^\pi_\gamma(s)\,,
\]
where the inequality follows by taking $\gamma \ge \gamma^*$ for some $\gamma^* \in (0, 1)$. The existence of an appropriate $\gamma^*$ follows because $f^{\pi'}_\gamma(s), f^\pi_\gamma(s) \to 0$, while $1/(1-\gamma) \to \infty$ as $\gamma \to 1$.

(c) Since $v$ is the value function of $\pi$ and $\rho$ is its gain, by Part (h) of Exercise 38.7 we have $\rho\mathbf{1} + v = r_\pi + P_\pi v$. Let $\pi'$ be an arbitrary stationary policy and $v_n$ be as before. Let $f_n = f_{\gamma_n}$ where $f_\gamma$ is defined by (38.3). Note that $\|f_n\|_\infty \to 0$ as $n \to \infty$. Then,
\[
\begin{aligned}
0 &\ge r_{\pi'} + (\gamma_n P_{\pi'} - I)v_n \\
&= r_{\pi'} + (\gamma_n P_{\pi'} - I)\Big(\frac{\rho\mathbf{1}}{1 - \gamma_n} + v + f_n\Big) \\
&= r_{\pi'} - \rho\mathbf{1} + \gamma_n P_{\pi'} v - v + (\gamma_n P_{\pi'} - I)f_n\,.
\end{aligned}
\]
Note that when $\pi' = \pi$, the first inequality becomes an equality. Taking the limit as $n$ tends to infinity and rearranging shows that
\[
\rho\mathbf{1} + v \ge r_{\pi'} + P_{\pi'} v
\quad \text{and} \quad
\rho\mathbf{1} + v = r_\pi + P_\pi v\,. \tag{38.5}
\]
Since $\pi'$ was arbitrary, $\rho + v(s) \ge \max_a r_a(s) + \langle P_a(s), v\rangle$ holds for all $s \in \mathcal{S}$. This combined with (38.5) shows that the pair $(\rho, v)$ satisfies the Bellman optimality equation as required.

38.11 Clearly the optimal policy is to take action stay in any state, and this policy has gain $\rho^* = 0$. Pick any solution $(\rho, v)$ to the Bellman optimality equations. Then $\rho = \rho^* = 0$ by Theorem 38.2. The Bellman optimality equation for state 1 is $v(1) = \max(v(1), -1 + v(2))$, which is equivalent to $v(1) \ge -1 + v(2)$. Similarly, the Bellman optimality equation for state 2 is equivalent to $v(2) \ge -1 + v(1)$. Thus the set of solutions is a subset of
\[
\{(\rho, v) \in \mathbb{R} \times \mathbb{R}^2 : \rho = 0,\ v(1) - 1 \le v(2) \le v(1) + 1\}\,.
\]
The same argument shows that any element of this set is a solution to the optimality equations.
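The solution set above can be confirmed by brute force, assuming (as in the exercise) two states with a stay action of reward 0 and a switch action of reward $-1$:

```python
def is_bellman_solution(v):
    # rho = 0; state 1: v(1) = max(v(1), -1 + v(2)); state 2 symmetric.
    return (v[0] == max(v[0], -1 + v[1])
            and v[1] == max(v[1], -1 + v[0]))

grid = range(-2, 3)
solutions = [(v1, v2) for v1 in grid for v2 in grid
             if is_bellman_solution((v1, v2))]
```

On the integer grid the solutions are exactly the pairs with $|v(1) - v(2)| \le 1$, matching the displayed set.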

38.12 Consider the deterministic MDP below with two states and two actions, $\mathcal{A} = \{\text{solid}, \text{dashed}\}$.

[Figure: a deterministic MDP on states 1 and 2 whose solid and dashed actions are drawn as edges labelled with rewards $r = 0$ and $r = 1$.]

Clearly the optimal policy is to choose $\pi(1) = \text{solid}$ and $\pi(2)$ arbitrarily, which leads to a gain of 1. On the other hand, choosing $\rho = 1$ and $v = (2, 1)$ satisfies the linear program in Eq. (38.6), and the greedy policy with respect to this value function chooses $\pi(1) = \text{dashed}$ and $\pi(2)$ arbitrary.

38.13 Let $T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ be defined by $(Tv)(s) = \max_a r_a(s) - \rho^* + \langle P_a(s), v\rangle$, so that the Bellman optimality equation Eq. (38.5) can be written in the compact form $v = Tv$. Let $v \in \mathbb{R}^{\mathcal{S}}$ be a solution to Eq. (38.5). The proof follows from the definition of the diameter and by showing that for any states $s_1, s_2 \in \mathcal{S}$ and memoryless policy $\pi$ it holds that
\[
v(s_2) \le v(s_1) + \Big(\rho^* - \min_{s,a} r_a(s)\Big)\mathbb{E}_\pi[\tau_{s_2} \mid S_1 = s_1]\,.
\]
The remainder of the proof is devoted to proving this result for fixed $s_1, s_2 \in \mathcal{S}$ and memoryless policy $\pi$. Abbreviate $\tau = \tau_{s_2}$ and let $\mathbb{E}[\cdot]$ denote the expectation with respect to the measure induced by the interaction of $\pi$ and the MDP conditioned on $S_1 = s_1$. Since the result is trivial when $\mathbb{E}[\tau] = \infty$, for the remainder we assume that $\mathbb{E}[\tau] < \infty$. Define the operator $\tilde T : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ by
\[
(\tilde T u)(s) = \begin{cases}
\min_{s',a} r_a(s') - \rho^* + \langle P_\pi(s), u\rangle\,, & \text{if } s \neq s_2\,; \\
v(s_2)\,, & \text{otherwise}\,.
\end{cases}
\]
Since $r_\pi(s) - \rho^* \ge \min_{s,a} r_a(s) - \rho^*$ and $Tv = v$, it follows that $(\tilde T v)(s) \le (Tv)(s) = v(s)$. Notice that for $u \le w$ it holds that $\tilde T u \le \tilde T w$. Then by induction we have $\tilde T^n v \le v$ for all $n \in \mathbb{N}^+$. By unrolling the recurrence we have
\[
v(s_1) \ge (\tilde T^n v)(s_1) = \mathbb{E}\Big[-\Big(\rho^* - \min_{s,a} r_a(s)\Big)(n \wedge \tau) + v(S_{\tau \wedge n})\Big]\,.
\]
Taking the limit as $n$ tends to infinity shows that $v(s_1) \ge v(s_2) - (\rho^* - \min_{s,a} r_a(s))\mathbb{E}[\tau]$, which completes the result.

38.14

(a) It is clear that Algorithm 27 returns true if and only if $(\rho, v)$ is feasible for Eq. (38.6). Note that feasibility can be written in the compact form $\rho\mathbf{1} + v \ge Tv$. It remains to show that when $(\rho, v)$ is not feasible, then $u = (1, e_s - P_{a^*_s}(s))$ is such that for any feasible $(\rho', v')$, $\langle(\rho', v'), u\rangle > \langle(\rho, v), u\rangle$. For this, we have $\langle(\rho', v'), u\rangle = \rho' + v'(s) - \langle P_{a^*_s}(s), v'\rangle \ge r_{a^*_s}(s)$, where the inequality used that $(\rho', v')$ is feasible. Further, $\langle(\rho, v), u\rangle = \rho + v(s) - \langle P_{a^*_s}(s), v\rangle < r_{a^*_s}(s)$, by the construction of $u$. Putting these together gives the result.

(b) Relax the constraint that $v(s) = 0$ to $-\varepsilon \le v(s) \le \varepsilon$. Then add the $\varepsilon$ of slack to the first constraint of Eq. (38.7) and add the additional constraints used in Eq. (38.9). Now the ellipsoid method can be applied as for Eq. (38.9).
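The oracle in Part (a) is mechanical enough to sketch. Below, a toy two-state, two-action MDP (made up for illustration) is scanned for a violated constraint $\rho + v(s) \ge r_a(s) + \langle P_a(s), v\rangle$; on failure, the cut $u = (1, e_s - P_a(s))$ is returned.

```python
# Toy MDP, made up for illustration: r[s][a] and P[s][a].
r = [[0.0, 1.0], [0.5, 0.0]]
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.5, 0.5], [1.0, 0.0]]]

def separation_oracle(rho, v):
    """Return True if (rho, v) is feasible, else the cut u = (1, e_s - P_a(s))."""
    for s in range(2):
        for a in range(2):
            if rho + v[s] < r[s][a] + sum(P[s][a][t] * v[t] for t in range(2)):
                e_s = [1.0 if t == s else 0.0 for t in range(2)]
                return (1.0, [e_s[t] - P[s][a][t] for t in range(2)])
    return True

feasible = separation_oracle(1.0, [0.0, 0.0])   # rho large enough: feasible
cut = separation_oracle(0.0, [0.0, 0.0])        # violated at s = 0, a = 1
```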

38.15 Let $\phi_k(x) = \text{true}$ if $\langle a_k, x\rangle \ge b_k$ and $\phi_k(x) = a_k$ otherwise. Then the new separation oracle returns true if $\phi(x)$ and $\phi_k(x)$ are true for all $k$. Otherwise it returns the separating hyperplane provided by some $\phi$ or $\phi_k$ that did not return true.

38.16

(a) Let π = (π1, π2, . . .) be an arbitrary Markov policy, where π_t is the policy followed in time step t. Using the notation and techniques from the proof of Theorem 38.2,

    P_π^{(t−1)} r_{π_t} = P_π^{(t−1)} (r_{π_t} + P_{π_t} v − P_{π_t} v)
                        ≤ P_π^{(t−1)} ((ρ + ε)1 + v − P_{π_t} v)
                        = (ρ + ε)1 + P_π^{(t−1)} v − P_π^{(t)} v .

Taking the average and then the limit shows that ρ_π(s) ≤ ρ + ε for all s ∈ S. By the claim in Exercise 38.2, ρ* ≤ ρ + ε.

(b) We have r_π(s) + ⟨P_π(s), v⟩ ≥ max_a r_a(s) + ⟨P_a(s), v⟩ − ε′ ≥ ρ + v(s) − (ε + ε′). Therefore,

    P_π^{t−1} r_π = P_π^{t−1} (r_π + P_π v − P_π v) ≥ P_π^{t−1} ((ρ − (ε + ε′))1 + v − P_π v) .

Taking the average and the limit again shows that ρ_π(s) ≥ ρ − (ε + ε′). The claim follows from combining this with the previous result.

(c) Let ε = v + ρ1 − Tv, which by the first constraint satisfies ε ≥ 0. Let π* be an optimal policy satisfying the requirements of the theorem statement and π be the greedy policy with respect to v. Then

    P_{π*}^t r_{π*} ≤ P_{π*}^t (r_π + P_π v − P_{π*} v) = P_{π*}^t (ρ1 + v − ε − P_{π*} v) .

Hence ρ*1 = ρ_{π*}1 ≤ ρ1 − P*_{π*} ε, which means that P*_{π*} ε ≤ δ1. By the definition of s̄ there exists a state s with P_{π*}(s, s̄) ≥ 1/|S|. Therefore

    δ ≥ (P_{π*} ε)(s) = Σ_{s′∈S} P_{π*}(s, s′) ε(s′) ≥ P_{π*}(s, s̄) ε(s̄) ,

and so ε(s̄) ≤ δ|S|. Notice that ṽ = v − ε + ε(s̄)1 also satisfies the constraints in Eq. (38.7) and hence

    ⟨v − ε + ε(s̄)1, 1⟩ ≥ ⟨v, 1⟩ ,



which implies that ⟨ε, 1⟩ ≤ |S| ε(s̄) ≤ |S|² δ. Hence (ρ, v) approximately satisfies the Bellman optimality equation.

38.17 Define the operator T : ℝ^S → ℝ^S by (Tu)(s) = max_{a∈A} r_a(s) + ⟨P_a(s), u⟩, which is chosen so that v*_n = T^n 0. Let v be a solution to the Bellman optimality equation with min_s v(s) = 0. Then Tv = ρ*1 + v and

    v*_n = T^n 0 ≤ T^n v = nρ*1 + v ≤ nρ*1 + D1 ,

where the last inequality follows from the previous exercise and the assumption that min_s v(s) = 0.
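On a concrete example the bound v*_n ≤ nρ*1 + D1 can be verified by running value iteration. The two-state chain below is a hypothetical example (not from the text) with ρ* = 1 and diameter D = 1:

```python
# Hypothetical two-state MDP: reward 1 in state 0, reward 0 in state 1;
# actions "stay" and "go" move deterministically. Optimal gain rho* = 1,
# diameter D = 1 (one step suffices to travel between the two states).
rho_star, D = 1.0, 1.0
r = [1.0, 0.0]

def bellman(u):
    # (Tu)(s) = max_a r(s) + u(next state); "stay" keeps s, "go" flips it
    return [r[s] + max(u[s], u[1 - s]) for s in (0, 1)]

v = [0.0, 0.0]                 # v_0 = 0
for n in range(1, 51):
    v = bellman(v)             # v_n = T^n 0
    assert all(v[s] <= n * rho_star + D for s in (0, 1))
```

On this instance v_n = (n, n − 1), so the bound holds with slack exactly D in the first coordinate.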

38.19

(a) There are four memoryless policies in this MDP. All are optimal except the policy π that always chooses the dashed action.

(b) The optimistic rewards are given by

    r_{k,stay}(s) = 1/2 + √( L / (2(1 ∨ T_{k−1}(s, stay))) ) ,
    r_{k,go}(s) = √( L / (2(1 ∨ T_{k−1}(s, go))) ) .

Whenever T_{k−1}(s, a) ≥ 1 the transition estimates satisfy P_{k−1,a}(s) = P_a(s). Let S′_t be the state with S_t ≠ S′_t. Suppose that T_{k−1}(S_t, stay) > T_{k−1}(S′_t, stay) and r_{k,stay}(S_t) < 1. Then r_{k,stay}(S′_t) > r_{k,stay}(S_t). Once this occurs the optimal policy in the optimistic MDP is to choose action go. It follows easily that once t is sufficiently large the algorithm will alternate between choosing actions go and stay and subsequently suffer linear regret. Note that the uncertainty in the transitions does not play a big role here. In the optimistic MDP they will always be chosen to maximise the probability of transitioning to the state with the largest optimistic reward.
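The mechanism behind the alternation is that the exploration bonus shrinks with the visit count, so the less-visited state eventually carries the larger optimistic reward. A small numerical illustration (the value L = 5 is an arbitrary choice):

```python
import math

L = 5.0

def r_stay(count):
    # optimistic reward of "stay": 1/2 plus the shrinking bonus
    return 0.5 + math.sqrt(L / (2 * max(1, count)))

def r_go(count):
    # optimistic reward of "go": bonus only
    return math.sqrt(L / (2 * max(1, count)))

# Once the current state has been visited often enough that its optimistic
# "stay" reward drops below 1, the rarely visited state is strictly preferred.
assert r_stay(100) < 1
assert r_stay(20) > r_stay(100)
```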

38.21 Here we abuse notation by letting P̂_{u,a}(s) be the empirical next-state transitions after u visits to the state-action pair (s, a). By a union bound and the result in Exercise 5.17,

    P(F) ≤ P( ∃u ∈ ℕ, (s, a) ∈ S × A : ‖P_a(s) − P̂_{u,a}(s)‖₁ ≥ √( 2S log(4SAu(u+1)/δ) / u ) )
         ≤ Σ_{(s,a)∈S×A} Σ_{u=1}^∞ P( ‖P_a(s) − P̂_{u,a}(s)‖₁ ≥ √( 2S log(4SAu(u+1)/δ) / u ) )
         ≤ Σ_{(s,a)∈S×A} Σ_{u=1}^∞ δ / (2SAu(u+1)) = δ/2 .

The statement in Exercise 5.17 makes an independence assumption that is not exactly satisfied here. We are saved by the Markov property, which provides the conditional independence required.
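The final equality uses that Σ_{u=1}^∞ 1/(u(u+1)) = 1, which follows from the telescoping identity 1/(u(u+1)) = 1/u − 1/(u+1). A quick numerical confirmation:

```python
# Partial sums of 1/(u(u+1)) telescope: sum_{u=1}^m = 1 - 1/(m+1) -> 1.
m = 10**6
partial = sum(1.0 / (u * (u + 1)) for u in range(1, m + 1))
assert abs(partial - (1.0 - 1.0 / (m + 1))) < 1e-9   # telescoped closed form
assert abs(partial - 1.0) < 1e-5                     # limit is 1
```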



38.22 If Σ_{k=1}^{m−1} a_k ≤ 1, then A_0 = A_1 = · · · = A_{m−1} = 1. Hence a_m ≤ 1 also holds and, using A_m ≥ 1,

    Σ_{k=1}^m a_k / √A_{k−1} = Σ_{k=1}^{m−1} a_k + a_m ≤ 1 + 1 ≤ (√2 + 1)√A_m .

Let us now use induction on m. As long as m is such that Σ_{k=1}^{m−1} a_k ≤ 1, the previous argument covers us. Thus, consider any m > 1 such that Σ_{k=1}^{m−1} a_k > 1 and assume that the statement holds for m − 1 (note that m = 1 implies Σ_{k=1}^{m−1} a_k ≤ 1). Let c = √2 + 1. Then,

    Σ_{k=1}^m a_k / √A_{k−1}
        ≤ c√A_{m−1} + a_m / √A_{m−1}          (split sum, induction hypothesis)
        = √( c²A_{m−1} + 2c a_m + a_m²/A_{m−1} )
        ≤ √( c²A_{m−1} + (2c + 1) a_m )        (a_m ≤ A_{m−1})
        = c √( A_{m−1} + a_m )                 (choice of c)
        = c √A_m .                             (A_{m−1} ≥ 1 and definition of A_m)
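The inequality can also be checked numerically. The sketch below draws random sequences satisfying a_k ≤ A_{k−1} (the condition used in the induction step), with A_k = 1 ∨ (a_1 + · · · + a_k), and verifies the bound:

```python
import math
import random

def lhs_rhs(seq):
    """Return (sum_k a_k / sqrt(A_{k-1}), (sqrt(2)+1) * sqrt(A_m))."""
    A, total, lhs = 1.0, 0.0, 0.0
    for a in seq:
        assert 0 <= a <= A          # the condition a_k <= A_{k-1}
        lhs += a / math.sqrt(A)
        total += a
        A = max(1.0, total)         # A_k = 1 or the running sum
    return lhs, (math.sqrt(2) + 1) * math.sqrt(A)

random.seed(0)
for _ in range(100):
    A, total, seq = 1.0, 0.0, []
    for _ in range(200):
        a = random.uniform(0.0, A)  # respects a_k <= A_{k-1}
        seq.append(a)
        total += a
        A = max(1.0, total)
    lhs, rhs = lhs_rhs(seq)
    assert lhs <= rhs
```

Note that c = √2 + 1 is exactly the root of c² = 2c + 1, which is what makes the "choice of c" step an equality.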

38.23 We only outline the necessary changes to the proof of Theorem 38.6. The first step is to augment the failure event to include the event that there exists a phase k and a state-action pair (s, a) such that

    |r_{k,a}(s) − r_a(s)| ≥ √( 2L / (1 ∨ T_{τ_k−1}(s, a)) ) .

The likelihood of this event is at most δ/2 by Hoeffding's bound combined with a union bound. As in the proof of Theorem 38.6 we now restrict our attention to the regret on the event that the failure does not occur. The first change is that r_{π_k}(s) in Eq. (38.18) must be replaced with r_{k,π_k}(s). Then the reward terms no longer cancel in Eq. (38.20), which means that now

    R_k = Σ_{t∈E_k} ( −v_k(S_t) + ⟨P_{k,A_t}(S_t), v_k⟩ + r_{k,A_t}(S_t) − r_{A_t}(S_t) )
        ≤ Σ_{t∈E_k} ( ⟨P_{A_t}(S_t), v_k⟩ − v_k(S_t) ) + (D/2) Σ_{t∈E_k} ‖P_{k,A_t}(S_t) − P_{A_t}(S_t)‖₁
          + Σ_{t∈E_k} √( 2L / (1 ∨ T_{τ_k−1}(S_t, A_t)) ) .

The first two terms are the same as in the proof of Theorem 38.6 and are bounded in the same way, resulting in the same contribution to the regret. Only the last term is new. Summing over all



phases and applying the result from Exercise 38.22 and Cauchy–Schwarz,

    Σ_{k=1}^K Σ_{t∈E_k} √( 2L / (1 ∨ T_{τ_k−1}(S_t, A_t)) )
        = √(2L) Σ_{s∈S} Σ_{a∈A} Σ_{k=1}^K T^{(k)}(s, a) / √( 1 ∨ T_{τ_k−1}(s, a) )
        ≤ (√2 + 1) √(2LSAn) .

This term is small relative to the contribution due to the uncertainty in the transitions. Hence there exists a universal constant C such that with probability at least 1 − 3δ/2 the regret of this modified algorithm is at most

    R_n ≤ C D(M) S √( nA log(nSA/δ) ) .
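The Cauchy–Schwarz step above bounds Σ_{s,a} √(T_n(s, a)) by √(SA · Σ_{s,a} T_n(s, a)) = √(SAn) whenever the visit counts sum to n; a randomized check:

```python
import math
import random

random.seed(1)
S, A, n = 5, 3, 10_000

# random visit counts T_n(s, a) over S*A pairs, summing to n
counts = [0] * (S * A)
for _ in range(n):
    counts[random.randrange(S * A)] += 1
assert sum(counts) == n

lhs = sum(math.sqrt(c) for c in counts)
assert lhs <= math.sqrt(S * A * n)   # Cauchy-Schwarz
```

Equality holds only when all counts are equal, so random counts land strictly below the bound.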

38.24

(a) An easy calculation shows that the depth d of the tree is bounded by 2 + log_A S, which by the conditions in the statement of the lower bound implies that d + 1 ≤ D/2. The diameter is the maximum over all distinct pairs of states of the expected travel time between those two states. It is not hard to see that this is maximised by the pair s_g and s_b, so we restrict our attention to bounding the expected travel time between these two states under some policy. Let τ = min{t : S_t = s_b} and let π be a policy that traverses the tree to a decision state with ε(s, a) = 0. We will show that for this policy

    E[τ | S_1 = s_g] ≤ D .

Let X_1, X_2, . . . be a sequence of random variables where X_i ∈ ℕ+ is the number of rounds until the policy leaves state s_g on the ith series of visits to s_g. Then let M be the number of visits to state s_g before s_b is reached. All of these random variables are independent and geometrically distributed. An easy calculation shows that

    E[X_i] = 1/δ   and   E[M] = 2 .

Then τ = Σ_{i=1}^M (X_i + d + 1), which has expectation E[τ] = 2(1/δ + d + 1) ≤ 2/δ + D/2 ≤ D.
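The expectation computation is an instance of Wald's identity, E[Σ_{i=1}^M (X_i + d + 1)] = E[M](E[X_i] + d + 1). The Monte Carlo sketch below uses illustrative values δ = 0.1 and d = 3, which are not taken from the construction:

```python
import random

random.seed(0)
delta, d = 0.1, 3    # hypothetical values for illustration only
N = 100_000

def geometric(p):
    # number of trials up to and including the first success; mean 1/p
    k = 1
    while random.random() >= p:
        k += 1
    return k

total = 0.0
for _ in range(N):
    M = geometric(0.5)                                # visits to s_g before s_b
    tau = sum(geometric(delta) + d + 1 for _ in range(M))
    total += tau
expected = 2 * (1 / delta + d + 1)                    # Wald: E[M] * (E[X_i] + d + 1)
assert abs(total / N - expected) / expected < 0.05
```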

(b) The definition of the stopping time τ ensures that T_σ ≤ n/D + 1 ≤ 2n/D almost surely and hence D E[T_σ]/n ≤ 2 is immediate. For the second part note that

    P( Σ_{t=1}^{n/(2D)} D_t ≥ n/D )

(c) We need to prove that R_{nj} ≥ c_3 ∆ D E_j[T_σ − T_j], where R_{nj} is the expected regret of π in MDP M_j over n rounds and c_3 > 0 is a universal constant. The idea is to write the total reward incurred using episodes punctuated by visits to s_0 and note that the expected lengths of these



episodes are the same regardless of the policy used.

For the formal argument, we start by rewriting the expected regret R_{nj} in a more suitable form. For this we introduce episodes indexed by k ∈ [n]. The kth episode starts at the time τ_k when s_0 is visited for the kth time (in particular, τ_1 = 0). In the kth episode, a leaf is reached at time step t = τ_k + d. Let E_k be the indicator that the state-action pair at this time step is not the jth state-action pair in L = ((s_1, a_1), . . . , (s_p, a_p)): E_k = I{(S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j)}. Note that this is an indicator of a "regret-inducing choice".

In time steps T_k = {τ_k + d + 1, . . . , τ_{k+1}}, the state is one of s_g and s_b. Let H_k = |T_k ∩ [n]| be the number of time steps in T_k that happen before round n is over. For clarity, we add the subindex π to E_j and P_j to make explicit that these depend on π. Further, in MDP M_j, let the values of ε(s, a) be denoted by ε_j(s, a).

By construction, the total expected reward incurred up to time step n by policy π is

    V_j^π := Σ_{k=1}^n E_{j,π}[ H_k I{S_{τ_k+d+1} = s_g} ]

        = Σ_{k=1}^n Σ_{(s,a)∈L} E_{j,π}[ H_k I{S_{τ_k+d+1} = s_g} | S_{τ_k+d} = s, A_{τ_k+d} = a ] P_{j,π}(S_{τ_k+d} = s, A_{τ_k+d} = a)
            (by the law of total probability)

        = Σ_{k=1}^n Σ_{(s,a)∈L} E_{j,π}[H_k] P_{j,π}(S_{τ_k+d+1} = s_g | S_{τ_k+d} = s, A_{τ_k+d} = a) P_{j,π}(S_{τ_k+d} = s, A_{τ_k+d} = a)
            (conditioning, Markov property)

        = Σ_{k=1}^n Σ_{(s,a)∈L} E_{j,π}[H_k] (1/2 + ε_j(s, a)) P_{j,π}(S_{τ_k+d} = s, A_{τ_k+d} = a)
            (definition of M_j)

        = Σ_{k=1}^n Σ_{p≠j} E_{j,π}[H_k] (1/2) P_{j,π}(S_{τ_k+d} = s_p, A_{τ_k+d} = a_p)
          + (1/2 + ∆) Σ_{k=1}^n E_{j,π}[H_k] P_{j,π}(S_{τ_k+d} = s_j, A_{τ_k+d} = a_j) .

Now, note that E_{j,π}[H_k] = λ_k, regardless of the policy π and the index j. If π*_j is the optimal policy for MDP M_j, then P_{j,π*_j}(S_{τ_k+d} = s_j, A_{τ_k+d} = a_j) = 1. Let ρ_k^π = P_{j,π}((S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j)). Hence,

    R_{nj} = V_j^{π*_j} − V_j^π
           = (1/2 + ∆) Σ_{k=1}^n λ_k − (1/2) Σ_{k=1}^n λ_k ρ_k^π − (1/2 + ∆) Σ_{k=1}^n λ_k (1 − ρ_k^π)
           = Σ_{k=1}^n λ_k ( (1/2 + ∆) − (1/2) ρ_k^π − (1/2 + ∆)(1 − ρ_k^π) )
           = ∆ Σ_{k=1}^n λ_k ρ_k^π ≥ ∆ Σ_{k=1}^{m−1} λ_k ρ_k^π ≥ c∆D Σ_{k=1}^{m−1} ρ_k^π ,



where m = ⌈n/D − 1⌉ and the last inequality uses that λ_k ≥ cD for k ∈ [m − 1] with some universal constant c > 0, the proof of which is left to the reader. Now, by definition,

    E_{j,π}[T_σ − T_j] = Σ_{k=1}^n P_{j,π}((S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j), τ_k + d < τ)
                       ≤ Σ_{k=1}^{m−1} P_{j,π}((S_{τ_k+d}, A_{τ_k+d}) ≠ (s_j, a_j))
                       = Σ_{k=1}^{m−1} ρ_k^π ,

where the inequality holds because τ = n ∧ τ_m and thus for k ≥ m, τ_k + d ≥ τ. Putting together the last two inequalities finishes the proof.
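The collapse of the bracket to ∆ρ_k^π in the regret computation uses the pointwise identity (1/2 + ∆) − ρ/2 − (1/2 + ∆)(1 − ρ) = ∆ρ, which can be confirmed numerically:

```python
import random

random.seed(0)
for _ in range(1000):
    rho = random.uniform(0.0, 1.0)     # probability of a regret-inducing choice
    Delta = random.uniform(0.0, 0.5)   # reward gap
    lhs = (0.5 + Delta) - 0.5 * rho - (0.5 + Delta) * (1.0 - rho)
    assert abs(lhs - Delta * rho) < 1e-12
```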
