
Page 1: Exploration-Exploitation in Reinforcement Learning
Part 2 – Regret Minimization in Tabular MDPs

Mohammad Ghavamzadeh, Alessandro Lazaric and Matteo Pirotta
Facebook AI Research

Page 2: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Website: https://rlgammazero.github.io

Page 3: Minimax Lower Bound

Theorem (adapted from Jaksch et al. [2010])
For any MDP M* = ⟨S, A, p_h, r_h, H⟩ with stationary transitions (p_1 = p_2 = ... = p_H), any algorithm A after K episodes suffers a regret of at least
Ω(√(HSAT)),   with T = HK.

If transitions are non-stationary:
- p_1, ..., p_H can be arbitrarily different
- the effective number of states is S' = HS
- the lower bound becomes Ω(H√(SAT))

Page 4: Tabular MDPs: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Pages 5–9: The Optimism Principle: Intuition

Exploration vs. Exploitation: Optimism in the Face of Uncertainty.

When you are uncertain, consider the best possible world (reward-wise):
- If the best possible world is correct ⟹ no regret (exploitation)
- If the best possible world is wrong ⟹ learn useful information (exploration)

Here, optimism is implemented in the value function.

Page 10: History of OFU for Regret Minimization in Tabular MDPs (FH: finite-horizon, AR: average reward)

Agrawal [1990] · Auer and Ortner [2006] (AR) · Bartlett and Tewari [2009] (AR) · Filippi et al. [2010] (AR) · Jaksch et al. [2010] (AR) · Talebi and Maillard [2018] (AR) · Fruit et al. [2018b] (AR) · Fruit et al. [2018a] (AR) · Fruit et al. [2019] (arXiv, not published) · Tossou et al. [2019] (arXiv, possibly incorrect) · Qian et al. [2019] (AR) · Zhang and Ji [2019] (AR) · Azar et al. [2017] (FH) · Zanette and Brunskill [2018] (FH) · Kakade et al. [2018] (arXiv, possibly incorrect) · Jin et al. [2018] (FH) · Zanette and Brunskill [2019] (FH) · Efroni et al. [2019] (FH)

Pages 11–12: Learning Problem

Input: S, A (r_h, p_h unknown)
Initialize Q_h1(s,a) = 0 for all (s,a) ∈ S×A and h = 1,...,H; set D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Compute (Q_hk)_{h=1..H} from D_k   // this step defines the type of algorithm
    Define π_k based on (Q_hk)_{h=1..H}
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end
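To make the protocol concrete, here is a minimal Python sketch of the loop above (not from the slides): the environment interface env.reset()/env.step(a) and the planner algo.plan(dataset) are hypothetical placeholders, and the policy is simply greedy in the computed Q.

```python
import numpy as np

def run_episodes(env, algo, K, H):
    """Generic episodic protocol: plan from data, act greedily, store the trajectory."""
    dataset = []                            # D_1 = empty set
    for k in range(K):                      # episodes k = 1, ..., K
        Q = algo.plan(dataset)              # compute (Q_{h,k})_{h=1..H} from D_k: defines the algorithm
        s = env.reset()                     # arbitrary initial state s_{1,k}
        for h in range(H):                  # stages h = 1, ..., H
            a = int(np.argmax(Q[h][s]))     # pi_{h,k}: greedy w.r.t. Q_{h,k}
            s_next, r = env.step(a)         # observe r_{h,k} and s_{h+1,k}
            dataset.append((h, s, a, r, s_next))
            s = s_next
    return dataset
```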

Page 13: Model-Based Learning

Input: S, A (r_h, p_h unknown)
Initialize Q_h1(s,a) = 0 for all (s,a) ∈ S×A and h = 1,...,H; set D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H) from D_k:
        p̂_hk(s'|s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} 1{(s_hi, a_hi, s_{h+1,i}) = (s, a, s')}
        r̂_hk(s,a)   = (1/N_hk(s,a)) Σ_{i=1}^{k−1} r_hi · 1{(s_hi, a_hi) = (s, a)}
    Planning (by backward induction) for π_hk
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end
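A minimal numpy sketch of the empirical-model step, under the assumption that trajectories are stored as (h, s, a, r, s') tuples with integer indices (the storage format is an illustration, not part of the slides).

```python
import numpy as np

def estimate_empirical_mdp(dataset, H, S, A):
    """Counts-based estimates N_hk, p_hat_hk, r_hat_hk from all past transitions."""
    N = np.zeros((H, S, A))
    p_hat = np.zeros((H, S, A, S))
    r_sum = np.zeros((H, S, A))
    for h, s, a, r, s_next in dataset:
        N[h, s, a] += 1
        p_hat[h, s, a, s_next] += 1
        r_sum[h, s, a] += r
    N_safe = np.maximum(N, 1)               # N_hk(s,a) ∨ 1 avoids division by zero
    return N, p_hat / N_safe[..., None], r_sum / N_safe
```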

Pages 14–15: Measuring Uncertainty

Bounded-parameter MDP [Strehl and Littman, 2008]:
M_k = { ⟨S, A, r_h, p_h, H⟩ : r_h(s,a) ∈ B^r_hk(s,a), p_h(·|s,a) ∈ B^p_hk(s,a), ∀h ∈ [H], ∀(s,a) ∈ S×A }

Compact confidence sets:
B^r_hk(s,a) := [ r̂_hk(s,a) − β^r_hk(s,a), r̂_hk(s,a) + β^r_hk(s,a) ]
B^p_hk(s,a) := { p(·|s,a) ∈ Δ(S) : ‖p(·|s,a) − p̂_hk(·|s,a)‖_1 ≤ β^p_hk(s,a) }

Confidence bounds based on [Hoeffding, 1963] and [Weissman et al., 2003]:
β^r_hk(s,a) ∝ √( log(N_hk(s,a)/δ) / N_hk(s,a) ),   β^p_hk(s,a) ∝ √( S · log(N_hk(s,a)/δ) / N_hk(s,a) )
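A sketch of the Hoeffding/Weissman widths with all constants dropped; the exact log term varies across the slides (here the L = log(SAN/δ) form of the RiverSwim slides is assumed), so treat this as the scaling only.

```python
import numpy as np

def confidence_widths(N, S, A, delta=0.1):
    """beta^r (Hoeffding, rewards) and beta^p (Weissman, L1 ball on transitions)."""
    N_safe = np.maximum(N, 1)
    L = np.log(S * A * N_safe / delta)      # log term, assumed L = log(SAN/delta)
    beta_r = np.sqrt(L / N_safe)            # proportional to sqrt(log(N/delta) / N)
    beta_p = np.sqrt(S * L / N_safe)        # proportional to sqrt(S log(N/delta) / N)
    return beta_r, beta_p
```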

Pages 16–19: Bounded Parameter MDP: Optimism

(Figure: the set of plausible MDPs M_t contains the true MDP M*; fixing a policy π, its value V^π_M varies over the MDPs M ∈ M_t, and UCB_t(V^π_1) is the largest such value.)

Optimism: UCB_t(V^π_1) = max_{M ∈ M_t} V^π_{1,M} ≥ V^π_{1,M*}

Pages 20–21: Extended Value Iteration [Jaksch et al., 2010]

Input: S, A, B^r_hk, B^p_hk
Set Q_{H+1}(s,a) = 0 for all (s,a) ∈ S×A
for h = H,...,1 do
    for (s,a) ∈ S×A do
        Q_hk(s,a) = max_{r_h ∈ B^r_hk(s,a)} r_h(s,a) + max_{p_h ∈ B^p_hk(s,a)} E_{s'∼p_h(·|s,a)}[ V_{h+1,k}(s') ]
                  = r̂_hk(s,a) + β^r_hk(s,a) + max_{p_h ∈ B^p_hk(s,a)} E_{s'∼p_h(·|s,a)}[ V_{h+1,k}(s') ]
        V_hk(s) = min{ H − (h−1), max_{a∈A} Q_hk(s,a) }
    end
end
return π_hk(s) = argmax_{a∈A} Q_hk(s,a)   // the policy executed at episode k
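The only non-trivial step of EVI is the inner maximization over the L1 ball B^p_hk(s,a), which has a simple closed form (push mass toward the best next state, then remove the excess from the worst ones), following Jaksch et al. [2010]. A sketch for a single (s,a), with p_hat and V as 1-D arrays over states:

```python
import numpy as np

def inner_max_l1(p_hat, beta_p, V):
    """max_{p : ||p - p_hat||_1 <= beta_p} p^T V  (extended value iteration inner step)."""
    q = p_hat.copy()
    best = int(np.argmax(V))
    q[best] = min(1.0, q[best] + beta_p / 2.0)      # add up to beta/2 mass to the best next state
    for j in np.argsort(V):                         # remove the excess from the worst states first
        excess = q.sum() - 1.0
        if excess <= 1e-12:
            break
        q[j] = max(0.0, q[j] - excess)
    return float(q @ V)

# One EVI backup for a single (s, a):
# Q = r_hat + beta_r + inner_max_l1(p_hat, beta_p, V_next)
```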

Pages 22–29: Optimism

(Figure: the Q-values computed by EVI versus plain VI; the EVI estimate Q_hk(s,a) dominates Q*_h(s,a) at every stage, with a gap driven by the uncertainty.)

∀h ∈ [H], ∀(s,a):   Q_hk(s,a) ≥ Q*_h(s,a)

Page 30: UCRL2-CH for Finite Horizon

Theorem (adapted from [Jaksch et al., 2010])
For any tabular MDP with stationary transitions, UCRL2 with Chernoff-Hoeffding confidence intervals (UCRL2-CH) suffers, with high probability, a regret
R(K, M*, UCRL2-CH) = O( HS√(AT) + H²SA )

- Order optimal in √(AT)
- √(HS) factor worse than the lower bound
- Lower bound: Ω(√(HSAT)) (stationary transitions)

Pages 31–32: Extended Value Iteration

Q_hk(s,a) = max_{(r,p) ∈ B^r_hk(s,a) × B^p_hk(s,a)}  r + pᵀ V_{h+1,k}
          = max_{r ∈ B^r_hk(s,a)} r + max_{p ∈ B^p_hk(s,a)} pᵀ V_{h+1,k}
          = r̂_hk(s,a) + β^r_hk(s,a) + max_{p ∈ B^p_hk(s,a)} pᵀ V_{h+1,k}
          ≤ r̂_hk(s,a) + β^r_hk(s,a) + ‖p − p̂_hk(·|s,a)‖_1 ‖V_{h+1,k}‖_∞ + p̂_hk(·|s,a)ᵀ V_{h+1,k}
          ≤ r̂_hk(s,a) + β^r_hk(s,a) + H β^p_hk(s,a) + p̂_hk(·|s,a)ᵀ V_{h+1,k}

- Exploration bonus (1 + H√S) β^r_hk(s,a) for the reward (since β^p_hk = √S β^r_hk and ‖V_{h+1,k}‖_∞ ≤ H)

Page 33: UCBVI [Azar et al., 2017]: Replace EVI with an Exploration Bonus

Input: S, A, B^r_hk, B^p_hk, r̂_hk, p̂_hk, b_hk
Set Q_{H+1,k}(s,a) = 0 for all (s,a) ∈ S×A
for h = H,...,1 do
    for (s,a) ∈ S×A do
        Q_hk(s,a) = r̂_hk(s,a) + b_hk(s,a) + E_{s'∼p̂_hk(·|s,a)}[ V_{h+1,k}(s') ]
        V_hk(s) = min{ H − (h−1), max_{a'∈A} Q_hk(s,a') }
    end
end
return π_hk(s) = argmax_{a∈A} Q_hk(s,a)

Equivalent to value iteration on M̃_k = (S, A, r̂_hk + b_hk, p̂_hk, H).
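A sketch of the UCBVI planning step as plain backward induction with the bonus folded into the reward; array shapes follow the earlier snippets (p_hat: (H,S,A,S); r_hat, b: (H,S,A)), and the bonus b is passed in, e.g. the (H+1)√(L/N) term of the next slide.

```python
import numpy as np

def ucbvi_plan(p_hat, r_hat, b, H, S, A):
    """Backward induction on the empirical MDP with optimistic reward r_hat + b."""
    V = np.zeros((H + 1, S))                            # V_{H+1,k} = 0
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):                      # h = H, ..., 1 (0-indexed)
        Q[h] = r_hat[h] + b[h] + p_hat[h] @ V[h + 1]    # optimistic Bellman backup
        V[h] = np.minimum(H - h, Q[h].max(axis=1))      # clip at H - (h-1), as in the pseudocode
    return Q, V
```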

Pages 34–35: UCBVI: Measuring Uncertainty

Combine the uncertainties in rewards and transitions in a smart way:
b_hk(s,a) = (H + 1) √( log(N_hk(s,a)/δ) / N_hk(s,a) )  <  β^r_hk + H β^p_hk

This saves a √S factor: since V*_h is a fixed function bounded by H,
| (p_h(·|s,a) − p̂_hk(·|s,a))ᵀ V*_h | ≤ H √( log(N_hk(s,a)/δ) / N_hk(s,a) ) = H β^p_hk(s,a)/√S

UCBVI-CH: Regret 18

Theorem (Thm. 1 of Azar et al. [2017])

For any tabular MDP with stationary transitions, UCBVI with Chernoff-Hoeffdingconfidence intervals (UCBVI-CH), with high-probability, suffers a regret

R(K,M?,UCBVI-CH) = O(H√SAT +H2S2A

)Order optimal

√SAT√

H factor worse than the lower-boundLong “warm up” phase

If non-stationary, then O(H3/2

√SAT

)

Lower-bound: Ω(√HSAT ) (stationary transitions)

Ghavamzadeh, Lazaric and Pirotta

Pages 37–38: Refined Confidence Bounds

UCRL2 with Bernstein-Freedman bounds instead of Hoeffding/Weissman (UCRL2B, see the tutorial website*):
R(K, M*, UCRL2B) = O( √(H Γ S A T) + H²S²A ),   with Γ = max_{h,s,a} ‖p_h(·|s,a)‖_0 ≤ S
- Still not matching the lower bound!

UCBVI with Bernstein-Freedman bounds*:
R(K, M*, UCBVI-BF) = O( √(HSAT) + H²S²A + H√T )
- Matching the lower bound!
- but still a long "warm up" phase

* stationary model (p_1 = ... = p_H)

Page 39: Refined Confidence Bounds

EULER [Zanette and Brunskill, 2019] keeps upper and lower bounds on V*_h:
R(K, M*, EULER) = O( √(Q* S A T) + √S · S A H² (√S + √H) )

Problem-dependent bound based on the environmental norm [Maillard et al., 2014]:
Q* = max_{s,a,h} ( V(r_h(s,a)) + V_{x∼p_h(·|s,a)}(V*_{h+1}(x)) ),   where V_{x∼p}(f(x)) = E_{x∼p}[ (f(x) − E_{y∼p}[f(y)])² ]

- Can remove the dependence on H
- Matching the lower bound in the worst case

Page 40: UCRL2 on RiverSwim

Hoeffding bonuses:
b^r_hk(s,a) = r_max √(L/N),   b^p_hk(s,a) = √(S L / N)

Bernstein bonuses:
b^r_hk(s,a) = √( L V(r̂_hk) / N ) + r_max L/N,   b^p_hk(s,a) = √( L V(p̂_hk) / N ) + L/N

where V(r̂_hk) = (1/N) Σ_i (r_hi − r̂_hk)² is the population variance, N = N_hk(s,a) ∨ 1, and L = log(SAN/δ).

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing UCRL2-H (Hoeffding) and UCRL2-B (Bernstein).)

Pages 41–42: UCBVI on RiverSwim

Hoeffding bonus:
b_hk(s,a) = (H − h) L / √N

Bernstein bonus:
b_hk(s,a) = √( L V_{p̂_hk}(V_{h+1,k}) / N ) + (H − h) L / N + (H − h)/√N

where V_p(V) = E_{x∼p}[(V(x) − μ)²] with μ = E_{x∼p}[V(x)], N = N_hk(s,a) ∨ 1, and L = log(SAN/δ).

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing UCBVI-H and UCBVI-B together with the UCRL2-H and UCRL2-B curves from the previous slide.)

Page 43: Model-Based Advantages

Learning efficiency:
- first-order optimal
- matching lower bound

Counterfactual reasoning:
- optimistic/pessimistic value estimates for any π
- useful for inference (e.g., safety)

Pages 44–45: Model-Based Issues

Complexity:
- Space: O(HS²A), since a non-stationary model requires H(S²A [transitions] + SA [rewards]) parameters
- Time: O(K · HS²A), due to planning by value iteration at every episode
- Possible fix: incremental updates

Page 46: Real-Time Dynamic Programming (RTDP) [Barto et al., 1995]

Input: S, A, r_h, p_h
Initialize V_h0(s) = H − (h−1) for all s ∈ S and h ∈ [H]
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    for h = 1,...,H do
        a_hk ∈ argmax_{a∈A} r_h(s_hk, a) + p_h(·|s_hk, a)ᵀ V_{h+1,k−1}   (1-step planning)
        V_hk(s_hk) = r_h(s_hk, a_hk) + p_h(·|s_hk, a_hk)ᵀ V_{h+1,k−1}
        Observe s_{h+1,k} ∼ p_h(·|s_hk, a_hk)
    end
end
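A sketch of RTDP with a known model, assuming a reward array r of shape (H,S,A), a transition array p of shape (H,S,A,S), and the same hypothetical env interface as before; only the visited state is updated at each stage, using the value of the next stage from the previous episode.

```python
import numpy as np

def rtdp(env, r, p, H, S, A, K):
    """Real-time dynamic programming: 1-step planning along the visited trajectory."""
    V = np.zeros((H + 1, S))
    for h in range(H):
        V[h, :] = H - h                         # optimistic initialization V_{h,0}(s) = H - (h-1)
    for k in range(K):
        s = env.reset()                         # arbitrary initial state s_{1,k}
        for h in range(H):
            q = r[h, s] + p[h, s] @ V[h + 1]    # uses V_{h+1} from the previous episode
            a = int(np.argmax(q))
            V[h, s] = q[a]                      # update only the visited state (h, s_{h,k})
            s, _ = env.step(a)                  # observe s_{h+1,k} ~ p_h(.|s, a)
    return V
```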

Pages 47–49: Opt-RTDP: Incremental Planning [Efroni et al., 2019]

Input: S, A (r_h, p_h unknown)
Initialize V_h0(s) = H − (h−1) for all s ∈ S and h ∈ [H]; set D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H) from D_k
    for h = 1,...,H do
        Build an optimistic estimate of Q(s_hk, a) for all a ∈ A, using p̂_hk, r̂_hk and V_{h+1,k−1}
        Set V_hk(s_hk) = min{ V_{h,k−1}(s_hk), max_{a'∈A} Q(s_hk, a') }
        Execute a_hk = π_hk(s_hk) = argmax_{a∈A} Q(s_hk, a)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end

Optimism + RTDP: the Q-estimate is built forward along the trajectory (not by full backward induction), bootstrapping from V_{h+1,k−1}, i.e., the next stage but the previous episode.

Page 50: Opt-RTDP: Properties

Non-increasing estimates: V_hk(s) ≤ V_{h,k−1}(s). How?
- Optimistic initialization: V_h0(s) = H − (h−1)
- Clipping: V_hk(s_hk) = min{ V_{h,k−1}(s_hk), max_{a'∈A} Q(s_hk, a') }

Pages 51–52: Opt-RTDP: Properties

Optimistic estimates: V_hk(s) ≥ V*_h(s). How?
- Optimistic initialization: V_h0(s) = H − (h−1)
- Optimistic update. Example (UCRL2-like step):
  Q(s_hk, a) = max_{r ∈ B^r_hk(s_hk,a)} r(s_hk, a) + max_{p ∈ B^p_hk(s_hk,a)} E_{s'∼p(·|s_hk,a)}[ V_{h+1,k−1}(s') ]
  V_{h+1,k−1} is one episode behind but still optimistic, so Q is optimistic as well.

Page 53: Opt-RTDP: Regret

Theorem (Thm. 8 of Efroni et al. [2019])
For any tabular MDP with stationary transitions, UCRL2-GP (Opt-RTDP based on UCRL2 with Hoeffding bounds) suffers, with high probability, a regret
R(K, M*, UCRL2-GP) = O( HS√(AT) + H²S^{3/2}A )

- Same regret as UCRL2-CH
- Computationally more efficient: time O(SA) per step, total runtime O(HSAK)
- The scheme can be adapted to any optimistic algorithm (e.g., UCBVI, EULER)

Pages 54–55: UCRL2-GP on RiverSwim

Hoeffding and Bernstein bonuses as for UCRL2 (see Page 40), with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ).

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing UCRL2-GP-H and UCRL2-GP-B with the UCRL2-H and UCRL2-B curves.)

Page 56: Tabular MDPs: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Page 57: Posterior Sampling (PS), a.k.a. Thompson Sampling [Thompson, 1933]

Keep a Bayesian posterior over the unknown MDP; a sample from the posterior is used as an estimate of the unknown MDP.
- Few samples ⟹ uncertainty in the estimate ⟹ exploration
- More samples ⟹ the posterior concentrates on the true MDP ⟹ exploitation

(Figure: posterior distribution μ_t over the set of MDPs.)

Page 58: History of PS for Regret Minimization in Tabular MDPs (FH: finite-horizon, AR: average reward)

Thompson [1933] · Strens [2000] · Osband et al. [2013] (FH) · Gopalan and Mannor [2015] (AR) · Abbasi-Yadkori and Szepesvári [2015] (AR, possibly incorrect) · Osband and Roy [2016] (arXiv, not published) · Osband and Roy [2017] (FH, possibly incorrect) · Ouyang et al. [2017] (AR, possibly incorrect assumptions) · Agrawal and Jia [2017] (AR, possibly incorrect) · Theocharous et al. [2018] (AR, possibly incorrect assumptions) · Russo [2019] (FH)

Page 59: Bayesian Regret

R_B(K, μ_1, A) = E_{M*∼μ_1}[ E[R(K, M*, A)] ] = E_{M*}[ Σ_{k=1}^K ( V*_{1,M*}(s_1k) − V^{π_k}_{1,M*}(s_1k) ) ]

i.e., the expected (frequentist) regret averaged over MDPs M* drawn from the prior μ_1.

Pages 60–61: Posterior Sampling [Osband and Roy, 2017]

Input: S, A, prior μ_1 (r_h, p_h unknown)
Initialize D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Sample M_k ∼ μ_k(·|D_k)
    Compute π_k ∈ argmax_π V^π_{1,M_k}
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end

Prior distribution: ∀Θ, P(M* ∈ Θ) = μ_1(Θ)
Posterior distribution: ∀Θ, P(M* ∈ Θ | D_k, μ_1) = μ_k(Θ)
Typical priors: Dirichlet (transitions); Beta, Normal-Gamma, etc. (rewards)
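A sketch of one PSRL round with independent Dirichlet posteriors on the transitions and, for brevity, known rewards (the same simplification as on the next slide); alpha has shape (H,S,A,S) and encodes one Dirichlet per (h,s,a).

```python
import numpy as np

rng = np.random.default_rng(0)

def psrl_step(alpha, r, H, S, A):
    """Sample M_k ~ posterior, then plan exactly on the sampled MDP by backward induction."""
    p_sample = np.zeros((H, S, A, S))
    for h in range(H):
        for s in range(S):
            for a in range(A):
                p_sample[h, s, a] = rng.dirichlet(alpha[h, s, a])   # one draw per (h, s, a)
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r[h] + p_sample[h] @ V[h + 1]
        pi[h] = Q.argmax(axis=1)                # pi_k: greedy on the sampled MDP
        V[h] = Q.max(axis=1)
    return pi

def update_posterior(alpha, h, s, a, s_next):
    alpha[h, s, a, s_next] += 1.0               # conjugate Dirichlet update after each transition
```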

Pages 62–63: Model Update with Dirichlet Priors

(Assume the reward r is known.)  The update maps μ_t and the observed transition (s_t, a_t, s_{t+1}) ∼ H_t to μ_{t+1}.

Let μ_t(s,a) = Dirichlet(α_1, ..., α_S) on p(·|s,a). Observing s_{t+1} ∼ p(·|s_t, a_t) (the outcome of a multivariate Bernoulli) with s_{t+1} = i, the Bayesian posterior is
μ_{t+1}(s,a) = Dirichlet( α_1, ..., α_i + 1, ..., α_S )

- Posterior mean: p̄_{t+1}(s_i|s,a) = α_i / n, with n = Σ_{i=1}^S α_i
- Variance bounded by 1/n
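A tiny numeric sketch of the update above for a single (s,a) with S = 5; the uniform Dirichlet(1,...,1) prior is an illustrative choice.

```python
import numpy as np

alpha = np.ones(5)                       # Dirichlet(alpha_1, ..., alpha_S) prior over S = 5 next states
i = 2                                    # observed next state s_{t+1} = i
alpha[i] += 1.0                          # posterior: Dirichlet(alpha_1, ..., alpha_i + 1, ..., alpha_S)
n = alpha.sum()                          # n = sum_i alpha_i
posterior_mean = alpha / n               # mean p_{t+1}(s_i | s, a) = alpha_i / n
print(posterior_mean, 1.0 / n)           # the variance of each coordinate is bounded by 1/n
```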

Page 64: Posterior Sampling is Usually Better

(Figures: empirical comparisons in bandits [Chapelle and Li, 2011] and in finite-horizon RL [Osband and Roy, 2017].)

Page 65: PSRL: Regret

Theorem (Osband and Roy [2017], revisited)
For any prior μ_1 with independent Dirichlet priors over stationary transitions, the Bayesian regret of PSRL is bounded as
R_B(K, μ_1, PSRL) = O( HS√(AT) )

- Order optimal in √(AT)
- √(HS) factor suboptimal
- Lower bound: Ω(√(HSAT)) (stationary transitions)

* The bound stated in [Osband and Roy, 2017] is O(H√(SAT)) for stationary MDPs, but there is a mistake in their Lem. 3 (see [Qian et al., 2020]).

Pages 66–67: PSRL on RiverSwim

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing PSRL with UCRL2-B and UCBVI-B.)

Page 68: Tabular Randomized Least-Squares Value Iteration (RLSVI) [Russo, 2019]

Input: S, A, H
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Run Tabular-RLSVI on D_k to obtain (Q_hk)_{h=1..H}
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk) = argmax_a Q_hk(s_hk, a)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end

* It is not necessary to store all the data: updates can be done incrementally.

Page 69: Tabular-RLSVI

Input: dataset D_k = {(s_hi, a_hi, r_hi)}_{h∈[H], i<k}
Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H):
    p̂_hk(s'|s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} 1{(s_hi, a_hi, s_{h+1,i}) = (s, a, s')}
    r̂_hk(s,a)   = (1/N_hk(s,a)) Σ_{i=1}^{k−1} r_hi · 1{(s_hi, a_hi) = (s, a)}
for h = H,...,1 do   // backward induction
    Sample ξ_hk ∼ N(0, σ²_hk I)
    Compute, for all (s,a) ∈ S×A,
        Q_hk(s,a) = r̂_hk(s,a) + ξ_hk(s,a) + Σ_{s'∈S} p̂_hk(s'|s,a) V_{h+1,k}(s')
end
return (Q_hk)_{h=1..H}

This is planning on the perturbed MDP M̃_k = (S, A, r̂_hk + ξ_hk, p̂_hk, H).
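A sketch of one Tabular-RLSVI pass, reusing the empirical-model arrays from the model-based snippets; the noise level follows the σ_hk scaling of the next theorem with constants dropped, so the exact magnitude is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def rlsvi_plan(p_hat, r_hat, N, H, S, A):
    """Backward induction on the empirical MDP with Gaussian reward perturbations xi_{h,k}."""
    sigma = np.sqrt(S * H**3 / (N + 1.0))               # sigma_{h,k}(s,a) ~ sqrt(S H^3 / (N + 1))
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        xi = sigma[h] * rng.standard_normal((S, A))     # xi_{h,k} ~ N(0, sigma_{h,k}^2 I)
        Q[h] = r_hat[h] + xi + p_hat[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return Q
```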

Page 70: Tabular-RLSVI: Frequentist Regret

Theorem (Russo [2019])
For any tabular MDP with non-stationary transitions, Tab-RLSVI with
σ_hk(s,a) = O( √( S H³ / (N_hk(s,a) + 1) ) )
suffers, with high probability, a frequentist regret
R(K, M*, Tab-RLSVI) = O( H^{5/2} S^{3/2} √(AT) )

- Order optimal in √(AT)
- H^{3/2} S worse than the lower bound Ω(H√(SAT))
- The analysis can be improved!

Pages 71–72: Tab-RLSVI(σ) on RiverSwim

Three choices of the noise scale:
- σ1 (theory): σ_h(s,a) = (1/4) √( (H − h)³ S L / N )
- σ2:          σ_h(s,a) = (1/4) √( (H − h)² L / N )
- σ3:          σ_h(s,a) = (1/4) √( L / N )

with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ).

(Plots: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, first comparing Tab-RLSVI-1/2/3 and then Tab-RLSVI-3 against UCRL2-H.)

Page 73: Tabular MDPs: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Pages 74–76: Model-Based Issues

Complexity:
- Space: O(HS²A), since a non-stationary model requires H(S²A [transitions] + SA [rewards]) parameters
- Time: O(K · HS²A), due to planning by value iteration at every episode

Solutions:
1. Time complexity: incremental planning (e.g., Opt-RTDP)
2. Space complexity: avoid estimating rewards and transitions ⟹ Optimistic Q-learning (Opt-QL), with space O(HSA) and time O(HAK)

Page 77: RiverSwim: Q-learning with ε-greedy Exploration

Several ε schedules are compared, e.g.:
- ε_t = 1.0
- ε_t = 0.5
- ε_t = ε_0 / N(s_t)^{2/3}
- ε_t = 1.0 for t < 6000, then ε_0 / N(s_t)^{1/2}
- ε_t = 1.0 for t < 7000, then ε_0 / N(s_t)^{1/2}

Tuning the ε schedule is difficult and problem dependent.
Regret: Ω( min{ T, A^{H/2} } )

Page 78: Optimistic Q-learning

Input: S, A
Initialize Q_h(s,a) = H − (h−1) and N_h(s,a) = 0 for all (s,a) ∈ S×A and h ∈ [H]
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk) = argmax_a Q_h(s_hk, a)
        Observe r_hk and s_{h+1,k}
        Set N_h(s_hk, a_hk) = N_h(s_hk, a_hk) + 1
        Update Q_h(s_hk, a_hk) = (1 − α_t) Q_h(s_hk, a_hk) + α_t ( r_hk + V_{h+1}(s_{h+1,k}) + b_t )   // b_t: upper-confidence bonus
        Set V_h(s_hk) = min{ H − (h−1), max_{a∈A} Q_h(s_hk, a) }
    end
end
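A sketch of a single Opt-QL update at (h, s_hk, a_hk); the step size is α_t = (H+1)/(H+t) and the bonus is Hoeffding-style, with constants and the exact log term treated as assumptions.

```python
import numpy as np

def optql_update(Q, V, N, h, s, a, r, s_next, H, S, A, T, delta=0.1):
    """One optimistic Q-learning step; V[H] is the all-zero terminal value."""
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha_t = (H + 1.0) / (H + t)                               # Opt-QL step size
    b_t = np.sqrt(H**3 * np.log(S * A * T / delta) / t)         # upper-confidence bonus ~ sqrt(H^3 L / t)
    target = r + V[h + 1, s_next] + b_t
    Q[h, s, a] = (1.0 - alpha_t) * Q[h, s, a] + alpha_t * target
    V[h, s] = min(H - h, Q[h, s].max())                         # clip at H - (h-1)
    return Q, V
```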

Page 79: Step Size α_t

Standard Q-learning uses α_t of order O(1/t) or O(1/√t), with t = N_hk(s,a).
Opt-QL instead uses α_t = (H + 1) / (H + t).

Pages 80–82: Step Size α_t

Unrolling the recursive Q-learning update (t = N_hk(s,a)):
Q_hk(s,a) = 1{t = 0}·H + Σ_{i=1}^t α^i_t ( r_{k_i} + V_{h+1,k_i}(s_{h+1,k_i}) + b_i ),   with α^i_t = α_i ∏_{j=i+1}^t (1 − α_j)

i.e., the optimistic initialization plus a weighted average of bootstrapped values, where k_i is the episode in which N_hk(s,a) = i.

Idea: favor later updates. The last 1/H fraction of the samples of (s,a) keeps non-negligible weight, while the earlier 1 − 1/H fraction is forgotten.

(Plot: the weights α^i_t for the example H = 10 and t = N_hk(s,a) = 1000, comparing α_i = (H+1)/(H+i), α_i = 1/i and α_i = 1/√i.)
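A small numeric check of the unrolled weights α^i_t = α_i ∏_{j>i}(1 − α_j) for the example H = 10, t = 1000; the two printed numbers are what the plot illustrates.

```python
import numpy as np

H, t = 10, 1000
alpha = (H + 1.0) / (H + np.arange(1, t + 1))        # alpha_i = (H+1)/(H+i), i = 1..t
weights = np.array([alpha[i] * np.prod(1.0 - alpha[i + 1:]) for i in range(t)])
print(weights.sum())                                  # = 1: the weights form a distribution
print(weights[-t // H:].sum())                        # ~ 0.68: the bulk sits on the last t/H samples
```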

Page 83: Exploration Bonus b_t

Let t = N_hk(s,a). Then
| Σ_{i=1}^t α^i_t ( V*_{h+1}(s_{h+1,k_i}) − E_{s'|s,a}[ V*_{h+1}(s') ] ) |  ≤  c √( H³ log(SAT/δ) / t )  =: b_t

Note that Σ_{i=1}^t α^i_t = 1.

Page 84: Optimistic Q-learning: Regret

Theorem ([Jin et al., 2018])
For any tabular MDP with non-stationary transitions, Opt-QL with Hoeffding-type bonuses (b_t = O(√(H³/t))) suffers, with high probability, a regret
R(K, M*, Opt-QL) = O( H²√(SAT) + H²SA )

- Order optimal in √(SAT)
- H factor worse than the lower bound Ω(H√(SAT))
- √H factor worse than model-based with Hoeffding bounds: UCBVI-CH for non-stationary p_h suffers O(H^{3/2}√(SAT))
- but better second-order terms
- The bound does not improve in stationary MDPs (i.e., p_1 = ... = p_H)

Page 85: Optimistic Q-learning: Example

(Plot: cumulative regret R(K) of OptQL on RiverSwim for K up to 4·10⁴ episodes.)

Page 86: Refined Confidence Intervals

Opt-QL with Bernstein-Freedman bounds (instead of Hoeffding/Weissman):
R(K) = O( H^{3/2}√(SAT) + √(H⁹S³A³) )

- Still not matching the lower bound!
- √H worse than model-based (e.g., UCBVI-BF)

Page 87: Open Questions

1. Prove a frequentist regret bound for PSRL.
2. Should the gap between the regret of model-based and model-free algorithms exist?
3. Which algorithm is better in practice?

Pages 88–91: References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11. AUAI Press, 2015.
Rajeev Agrawal. Adaptive control of Markov chains under the weak accessibility. In 29th IEEE Conference on Decision and Control, pages 1426–1431. IEEE, 1990.
Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.
Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, pages 49–56. MIT Press, 2006.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 2017.
Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.
Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artif. Intell., 72(1-2):81–138, 1995.
Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf.
Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In NeurIPS, pages 12203–12213, 2019.
Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, 2010.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In NeurIPS, pages 2998–3008, 2018a.
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In ICML, Proceedings of Machine Learning Research. PMLR, 2018b.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2B, 2019. URL https://rlgammazero.github.io/docs/ucrl2b_improved.pdf.
Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 861–898. JMLR.org, 2015.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL http://www.jstor.org/stable/2282952.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In NeurIPS, pages 4868–4878, 2018.
Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. CoRR, abs/1802.09184, 2018.
Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. How hard is my MDP? "The distribution-norm to the rescue". In NIPS, pages 1835–1843, 2014.
Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. CoRR, abs/1608.02731, 2016.
Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.
Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.
Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, pages 1333–1342, 2017.
Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In NeurIPS, pages 4891–4900, 2019.
Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. CoRR, abs/2001.11595, 2020.
Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In NeurIPS, pages 14410–14420, 2019.
Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 943–950, 2000.
Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.
Georgios Theocharous, Zheng Wen, Yasin Abbasi, and Nikos Vlassis. Scalar posterior sampling with applications. In NeurIPS, pages 7696–7704, 2018.
William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
Aristide C. Y. Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. CoRR, abs/1905.12425, 2019.
Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. 2003.
Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 5732–5740. JMLR.org, 2018.
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7304–7312. PMLR, 2019.
Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In NeurIPS, pages 2823–2832, 2019.