
Page 1: Exploration-Exploitation in Reinforcement Learning
Part 2 – Regret Minimization in Tabular MDPs

Mohammad Ghavamzadeh, Alessandro Lazaric and Matteo Pirotta
Facebook AI Research

Page 2: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Website: https://rlgammazero.github.io

Page 3: Minimax Lower Bound

Theorem (adapted from Jaksch et al. [2010])
For any MDP M* = ⟨S, A, p_h, r_h, H⟩ with stationary transitions (p_1 = p_2 = ... = p_H), any algorithm A after K episodes suffers a regret of at least
Ω(√(HSAT)),   with T = HK.

If transitions are non-stationary:
- p_1, ..., p_H can be arbitrarily different
- the effective number of states is S' = HS
- the lower bound becomes Ω(H√(SAT))

Page 4: Tabular MDPs: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Pages 5–9: The Optimism Principle: Intuition

Exploration vs. Exploitation: Optimism in the Face of Uncertainty.

When you are uncertain, consider the best possible world (reward-wise):
- If the best possible world is correct ⟹ no regret (exploitation)
- If the best possible world is wrong ⟹ learn useful information (exploration)

Here, optimism is implemented in the value function.

Page 10: History of OFU for Regret Minimization in Tabular MDPs (FH: finite-horizon, AR: average reward)

Agrawal [1990] · Auer and Ortner [2006] (AR) · Bartlett and Tewari [2009] (AR) · Filippi et al. [2010] (AR) · Jaksch et al. [2010] (AR) · Talebi and Maillard [2018] (AR) · Fruit et al. [2018b] (AR) · Fruit et al. [2018a] (AR) · Fruit et al. [2019] (arXiv, not published) · Tossou et al. [2019] (arXiv, possibly incorrect) · Qian et al. [2019] (AR) · Zhang and Ji [2019] (AR) · Azar et al. [2017] (FH) · Zanette and Brunskill [2018] (FH) · Kakade et al. [2018] (arXiv, possibly incorrect) · Jin et al. [2018] (FH) · Zanette and Brunskill [2019] (FH) · Efroni et al. [2019] (FH)

Pages 11–12: Learning Problem

Input: S, A (r_h, p_h unknown)
Initialize Q_h1(s,a) = 0 for all (s,a) ∈ S×A and h = 1,...,H; set D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Compute (Q_hk)_{h=1..H} from D_k   // this step defines the type of algorithm
    Define π_k based on (Q_hk)_{h=1..H}
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end
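To make the protocol concrete, here is a minimal Python sketch of the loop above (not from the slides): the environment interface env.reset()/env.step(a) and the planner algo.plan(dataset) are hypothetical placeholders, and the policy is simply greedy in the computed Q.

```python
import numpy as np

def run_episodes(env, algo, K, H):
    """Generic episodic protocol: plan from data, act greedily, store the trajectory."""
    dataset = []                            # D_1 = empty set
    for k in range(K):                      # episodes k = 1, ..., K
        Q = algo.plan(dataset)              # compute (Q_{h,k})_{h=1..H} from D_k: defines the algorithm
        s = env.reset()                     # arbitrary initial state s_{1,k}
        for h in range(H):                  # stages h = 1, ..., H
            a = int(np.argmax(Q[h][s]))     # pi_{h,k}: greedy w.r.t. Q_{h,k}
            s_next, r = env.step(a)         # observe r_{h,k} and s_{h+1,k}
            dataset.append((h, s, a, r, s_next))
            s = s_next
    return dataset
```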

Page 13: Model-Based Learning

Input: S, A (r_h, p_h unknown)
Initialize Q_h1(s,a) = 0 for all (s,a) ∈ S×A and h = 1,...,H; set D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H) from D_k:
        p̂_hk(s'|s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} 1{(s_hi, a_hi, s_{h+1,i}) = (s, a, s')}
        r̂_hk(s,a)   = (1/N_hk(s,a)) Σ_{i=1}^{k−1} r_hi · 1{(s_hi, a_hi) = (s, a)}
    Planning (by backward induction) for π_hk
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end
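A minimal numpy sketch of the empirical-model step, under the assumption that trajectories are stored as (h, s, a, r, s') tuples with integer indices (the storage format is an illustration, not part of the slides).

```python
import numpy as np

def estimate_empirical_mdp(dataset, H, S, A):
    """Counts-based estimates N_hk, p_hat_hk, r_hat_hk from all past transitions."""
    N = np.zeros((H, S, A))
    p_hat = np.zeros((H, S, A, S))
    r_sum = np.zeros((H, S, A))
    for h, s, a, r, s_next in dataset:
        N[h, s, a] += 1
        p_hat[h, s, a, s_next] += 1
        r_sum[h, s, a] += r
    N_safe = np.maximum(N, 1)               # N_hk(s,a) ∨ 1 avoids division by zero
    return N, p_hat / N_safe[..., None], r_sum / N_safe
```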

Pages 14–15: Measuring Uncertainty

Bounded-parameter MDP [Strehl and Littman, 2008]:
M_k = { ⟨S, A, r_h, p_h, H⟩ : r_h(s,a) ∈ B^r_hk(s,a), p_h(·|s,a) ∈ B^p_hk(s,a), ∀h ∈ [H], ∀(s,a) ∈ S×A }

Compact confidence sets:
B^r_hk(s,a) := [ r̂_hk(s,a) − β^r_hk(s,a), r̂_hk(s,a) + β^r_hk(s,a) ]
B^p_hk(s,a) := { p(·|s,a) ∈ Δ(S) : ‖p(·|s,a) − p̂_hk(·|s,a)‖_1 ≤ β^p_hk(s,a) }

Confidence bounds based on [Hoeffding, 1963] and [Weissman et al., 2003]:
β^r_hk(s,a) ∝ √( log(N_hk(s,a)/δ) / N_hk(s,a) ),   β^p_hk(s,a) ∝ √( S · log(N_hk(s,a)/δ) / N_hk(s,a) )
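A sketch of the Hoeffding/Weissman widths with all constants dropped; the exact log term varies across the slides (here the L = log(SAN/δ) form of the RiverSwim slides is assumed), so treat this as the scaling only.

```python
import numpy as np

def confidence_widths(N, S, A, delta=0.1):
    """beta^r (Hoeffding, rewards) and beta^p (Weissman, L1 ball on transitions)."""
    N_safe = np.maximum(N, 1)
    L = np.log(S * A * N_safe / delta)      # log term, assumed L = log(SAN/delta)
    beta_r = np.sqrt(L / N_safe)            # proportional to sqrt(log(N/delta) / N)
    beta_p = np.sqrt(S * L / N_safe)        # proportional to sqrt(S log(N/delta) / N)
    return beta_r, beta_p
```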

Pages 16–19: Bounded Parameter MDP: Optimism

(Figure: the set of plausible MDPs M_t contains the true MDP M*; fixing a policy π, its value V^π_M varies over the MDPs M ∈ M_t, and UCB_t(V^π_1) is the largest such value.)

Optimism: UCB_t(V^π_1) = max_{M ∈ M_t} V^π_{1,M} ≥ V^π_{1,M*}

Pages 20–21: Extended Value Iteration [Jaksch et al., 2010]

Input: S, A, B^r_hk, B^p_hk
Set Q_{H+1}(s,a) = 0 for all (s,a) ∈ S×A
for h = H,...,1 do
    for (s,a) ∈ S×A do
        Q_hk(s,a) = max_{r_h ∈ B^r_hk(s,a)} r_h(s,a) + max_{p_h ∈ B^p_hk(s,a)} E_{s'∼p_h(·|s,a)}[ V_{h+1,k}(s') ]
                  = r̂_hk(s,a) + β^r_hk(s,a) + max_{p_h ∈ B^p_hk(s,a)} E_{s'∼p_h(·|s,a)}[ V_{h+1,k}(s') ]
        V_hk(s) = min{ H − (h−1), max_{a∈A} Q_hk(s,a) }
    end
end
return π_hk(s) = argmax_{a∈A} Q_hk(s,a)   // the policy executed at episode k
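The only non-trivial step of EVI is the inner maximization over the L1 ball B^p_hk(s,a), which has a simple closed form (push mass toward the best next state, then remove the excess from the worst ones), following Jaksch et al. [2010]. A sketch for a single (s,a), with p_hat and V as 1-D arrays over states:

```python
import numpy as np

def inner_max_l1(p_hat, beta_p, V):
    """max_{p : ||p - p_hat||_1 <= beta_p} p^T V  (extended value iteration inner step)."""
    q = p_hat.copy()
    best = int(np.argmax(V))
    q[best] = min(1.0, q[best] + beta_p / 2.0)      # add up to beta/2 mass to the best next state
    for j in np.argsort(V):                         # remove the excess from the worst states first
        excess = q.sum() - 1.0
        if excess <= 1e-12:
            break
        q[j] = max(0.0, q[j] - excess)
    return float(q @ V)

# One EVI backup for a single (s, a):
# Q = r_hat + beta_r + inner_max_l1(p_hat, beta_p, V_next)
```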

Pages 22–29: Optimism

(Figure: the Q-values computed by EVI versus plain VI; the EVI estimate Q_hk(s,a) dominates Q*_h(s,a) at every stage, with a gap driven by the uncertainty.)

∀h ∈ [H], ∀(s,a):   Q_hk(s,a) ≥ Q*_h(s,a)

Page 30: UCRL2-CH for Finite Horizon

Theorem (adapted from [Jaksch et al., 2010])
For any tabular MDP with stationary transitions, UCRL2 with Chernoff-Hoeffding confidence intervals (UCRL2-CH) suffers, with high probability, a regret
R(K, M*, UCRL2-CH) = O( HS√(AT) + H²SA )

- Order optimal in √(AT)
- √(HS) factor worse than the lower bound
- Lower bound: Ω(√(HSAT)) (stationary transitions)

Pages 31–32: Extended Value Iteration

Q_hk(s,a) = max_{(r,p) ∈ B^r_hk(s,a) × B^p_hk(s,a)}  r + pᵀ V_{h+1,k}
          = max_{r ∈ B^r_hk(s,a)} r + max_{p ∈ B^p_hk(s,a)} pᵀ V_{h+1,k}
          = r̂_hk(s,a) + β^r_hk(s,a) + max_{p ∈ B^p_hk(s,a)} pᵀ V_{h+1,k}
          ≤ r̂_hk(s,a) + β^r_hk(s,a) + ‖p − p̂_hk(·|s,a)‖_1 ‖V_{h+1,k}‖_∞ + p̂_hk(·|s,a)ᵀ V_{h+1,k}
          ≤ r̂_hk(s,a) + β^r_hk(s,a) + H β^p_hk(s,a) + p̂_hk(·|s,a)ᵀ V_{h+1,k}

- Exploration bonus (1 + H√S) β^r_hk(s,a) for the reward (since β^p_hk = √S β^r_hk and ‖V_{h+1,k}‖_∞ ≤ H)

Page 33: UCBVI [Azar et al., 2017]: Replace EVI with an Exploration Bonus

Input: S, A, B^r_hk, B^p_hk, r̂_hk, p̂_hk, b_hk
Set Q_{H+1,k}(s,a) = 0 for all (s,a) ∈ S×A
for h = H,...,1 do
    for (s,a) ∈ S×A do
        Q_hk(s,a) = r̂_hk(s,a) + b_hk(s,a) + E_{s'∼p̂_hk(·|s,a)}[ V_{h+1,k}(s') ]
        V_hk(s) = min{ H − (h−1), max_{a'∈A} Q_hk(s,a') }
    end
end
return π_hk(s) = argmax_{a∈A} Q_hk(s,a)

Equivalent to value iteration on M̃_k = (S, A, r̂_hk + b_hk, p̂_hk, H).
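A sketch of the UCBVI planning step as plain backward induction with the bonus folded into the reward; array shapes follow the earlier snippets (p_hat: (H,S,A,S); r_hat, b: (H,S,A)), and the bonus b is passed in, e.g. the (H+1)√(L/N) term of the next slide.

```python
import numpy as np

def ucbvi_plan(p_hat, r_hat, b, H, S, A):
    """Backward induction on the empirical MDP with optimistic reward r_hat + b."""
    V = np.zeros((H + 1, S))                            # V_{H+1,k} = 0
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):                      # h = H, ..., 1 (0-indexed)
        Q[h] = r_hat[h] + b[h] + p_hat[h] @ V[h + 1]    # optimistic Bellman backup
        V[h] = np.minimum(H - h, Q[h].max(axis=1))      # clip at H - (h-1), as in the pseudocode
    return Q, V
```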

Pages 34–35: UCBVI: Measuring Uncertainty

Combine the uncertainties in rewards and transitions in a smart way:
b_hk(s,a) = (H + 1) √( log(N_hk(s,a)/δ) / N_hk(s,a) )  <  β^r_hk + H β^p_hk

This saves a √S factor: since V*_h is a fixed function bounded by H,
| (p_h(·|s,a) − p̂_hk(·|s,a))ᵀ V*_h | ≤ H √( log(N_hk(s,a)/δ) / N_hk(s,a) ) = H β^p_hk(s,a)/√S

UCBVI-CH: Regret 18

Theorem (Thm. 1 of Azar et al. [2017])

For any tabular MDP with stationary transitions, UCBVI with Chernoff-Hoeffdingconfidence intervals (UCBVI-CH), with high-probability, suffers a regret

R(K,M?,UCBVI-CH) = O(H√SAT +H2S2A

)Order optimal

√SAT√

H factor worse than the lower-boundLong “warm up” phase

If non-stationary, then O(H3/2

√SAT

)

Lower-bound: Ω(√HSAT ) (stationary transitions)

Ghavamzadeh, Lazaric and Pirotta

Pages 37–38: Refined Confidence Bounds

UCRL2 with Bernstein-Freedman bounds instead of Hoeffding/Weissman (UCRL2B, see the tutorial website*):
R(K, M*, UCRL2B) = O( √(H Γ S A T) + H²S²A ),   with Γ = max_{h,s,a} ‖p_h(·|s,a)‖_0 ≤ S
- Still not matching the lower bound!

UCBVI with Bernstein-Freedman bounds*:
R(K, M*, UCBVI-BF) = O( √(HSAT) + H²S²A + H√T )
- Matching the lower bound!
- but still a long "warm up" phase

* stationary model (p_1 = ... = p_H)

Page 39: Refined Confidence Bounds

EULER [Zanette and Brunskill, 2019] keeps upper and lower bounds on V*_h:
R(K, M*, EULER) = O( √(Q* S A T) + √S · S A H² (√S + √H) )

Problem-dependent bound based on the environmental norm [Maillard et al., 2014]:
Q* = max_{s,a,h} ( V(r_h(s,a)) + V_{x∼p_h(·|s,a)}(V*_{h+1}(x)) ),   where V_{x∼p}(f(x)) = E_{x∼p}[ (f(x) − E_{y∼p}[f(y)])² ]

- Can remove the dependence on H
- Matching the lower bound in the worst case

Page 40: UCRL2 on RiverSwim

Hoeffding bonuses:
b^r_hk(s,a) = r_max √(L/N),   b^p_hk(s,a) = √(S L / N)

Bernstein bonuses:
b^r_hk(s,a) = √( L V(r̂_hk) / N ) + r_max L/N,   b^p_hk(s,a) = √( L V(p̂_hk) / N ) + L/N

where V(r̂_hk) = (1/N) Σ_i (r_hi − r̂_hk)² is the population variance, N = N_hk(s,a) ∨ 1, and L = log(SAN/δ).

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing UCRL2-H (Hoeffding) and UCRL2-B (Bernstein).)

Pages 41–42: UCBVI on RiverSwim

Hoeffding bonus:
b_hk(s,a) = (H − h) L / √N

Bernstein bonus:
b_hk(s,a) = √( L V_{p̂_hk}(V_{h+1,k}) / N ) + (H − h) L / N + (H − h)/√N

where V_p(V) = E_{x∼p}[(V(x) − μ)²] with μ = E_{x∼p}[V(x)], N = N_hk(s,a) ∨ 1, and L = log(SAN/δ).

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing UCBVI-H and UCBVI-B together with the UCRL2-H and UCRL2-B curves from the previous slide.)

Page 43: Model-Based Advantages

Learning efficiency:
- first-order optimal
- matching lower bound

Counterfactual reasoning:
- optimistic/pessimistic value estimates for any π
- useful for inference (e.g., safety)

Pages 44–45: Model-Based Issues

Complexity:
- Space: O(HS²A), since a non-stationary model requires H(S²A [transitions] + SA [rewards]) parameters
- Time: O(K · HS²A), due to planning by value iteration at every episode
- Possible fix: incremental updates

Page 46: Real-Time Dynamic Programming (RTDP) [Barto et al., 1995]

Input: S, A, r_h, p_h
Initialize V_h0(s) = H − (h−1) for all s ∈ S and h ∈ [H]
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    for h = 1,...,H do
        a_hk ∈ argmax_{a∈A} r_h(s_hk, a) + p_h(·|s_hk, a)ᵀ V_{h+1,k−1}   (1-step planning)
        V_hk(s_hk) = r_h(s_hk, a_hk) + p_h(·|s_hk, a_hk)ᵀ V_{h+1,k−1}
        Observe s_{h+1,k} ∼ p_h(·|s_hk, a_hk)
    end
end
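A sketch of RTDP with a known model, assuming a reward array r of shape (H,S,A), a transition array p of shape (H,S,A,S), and the same hypothetical env interface as before; only the visited state is updated at each stage, using the value of the next stage from the previous episode.

```python
import numpy as np

def rtdp(env, r, p, H, S, A, K):
    """Real-time dynamic programming: 1-step planning along the visited trajectory."""
    V = np.zeros((H + 1, S))
    for h in range(H):
        V[h, :] = H - h                         # optimistic initialization V_{h,0}(s) = H - (h-1)
    for k in range(K):
        s = env.reset()                         # arbitrary initial state s_{1,k}
        for h in range(H):
            q = r[h, s] + p[h, s] @ V[h + 1]    # uses V_{h+1} from the previous episode
            a = int(np.argmax(q))
            V[h, s] = q[a]                      # update only the visited state (h, s_{h,k})
            s, _ = env.step(a)                  # observe s_{h+1,k} ~ p_h(.|s, a)
    return V
```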

Pages 47–49: Opt-RTDP: Incremental Planning [Efroni et al., 2019]

Input: S, A (r_h, p_h unknown)
Initialize V_h0(s) = H − (h−1) for all s ∈ S and h ∈ [H]; set D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H) from D_k
    for h = 1,...,H do
        Build an optimistic estimate of Q(s_hk, a) for all a ∈ A, using p̂_hk, r̂_hk and V_{h+1,k−1}
        Set V_hk(s_hk) = min{ V_{h,k−1}(s_hk), max_{a'∈A} Q(s_hk, a') }
        Execute a_hk = π_hk(s_hk) = argmax_{a∈A} Q(s_hk, a)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end

Optimism + RTDP: the Q-estimate is built forward along the trajectory (not by full backward induction), bootstrapping from V_{h+1,k−1}, i.e., the next stage but the previous episode.

Page 50: Opt-RTDP: Properties

Non-increasing estimates: V_hk(s) ≤ V_{h,k−1}(s). How?
- Optimistic initialization: V_h0(s) = H − (h−1)
- Clipping: V_hk(s_hk) = min{ V_{h,k−1}(s_hk), max_{a'∈A} Q(s_hk, a') }

Pages 51–52: Opt-RTDP: Properties

Optimistic estimates: V_hk(s) ≥ V*_h(s). How?
- Optimistic initialization: V_h0(s) = H − (h−1)
- Optimistic update. Example (UCRL2-like step):
  Q(s_hk, a) = max_{r ∈ B^r_hk(s_hk,a)} r(s_hk, a) + max_{p ∈ B^p_hk(s_hk,a)} E_{s'∼p(·|s_hk,a)}[ V_{h+1,k−1}(s') ]
  V_{h+1,k−1} is one episode behind but still optimistic, so Q is optimistic as well.

Page 53: Opt-RTDP: Regret

Theorem (Thm. 8 of Efroni et al. [2019])
For any tabular MDP with stationary transitions, UCRL2-GP (Opt-RTDP based on UCRL2 with Hoeffding bounds) suffers, with high probability, a regret
R(K, M*, UCRL2-GP) = O( HS√(AT) + H²S^{3/2}A )

- Same regret as UCRL2-CH
- Computationally more efficient: time O(SA) per step, total runtime O(HSAK)
- The scheme can be adapted to any optimistic algorithm (e.g., UCBVI, EULER)

Pages 54–55: UCRL2-GP on RiverSwim

Hoeffding and Bernstein bonuses as for UCRL2 (see Page 40), with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ).

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing UCRL2-GP-H and UCRL2-GP-B with the UCRL2-H and UCRL2-B curves.)

Page 56: Tabular MDPs: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Page 57: Posterior Sampling (PS), a.k.a. Thompson Sampling [Thompson, 1933]

Keep a Bayesian posterior over the unknown MDP; a sample from the posterior is used as an estimate of the unknown MDP.
- Few samples ⟹ uncertainty in the estimate ⟹ exploration
- More samples ⟹ the posterior concentrates on the true MDP ⟹ exploitation

(Figure: posterior distribution μ_t over the set of MDPs.)

Page 58: History of PS for Regret Minimization in Tabular MDPs (FH: finite-horizon, AR: average reward)

Thompson [1933] · Strens [2000] · Osband et al. [2013] (FH) · Gopalan and Mannor [2015] (AR) · Abbasi-Yadkori and Szepesvári [2015] (AR, possibly incorrect) · Osband and Roy [2016] (arXiv, not published) · Osband and Roy [2017] (FH, possibly incorrect) · Ouyang et al. [2017] (AR, possibly incorrect assumptions) · Agrawal and Jia [2017] (AR, possibly incorrect) · Theocharous et al. [2018] (AR, possibly incorrect assumptions) · Russo [2019] (FH)

Page 59: Bayesian Regret

R_B(K, μ_1, A) = E_{M*∼μ_1}[ E[R(K, M*, A)] ] = E_{M*}[ Σ_{k=1}^K ( V*_{1,M*}(s_1k) − V^{π_k}_{1,M*}(s_1k) ) ]

i.e., the expected (frequentist) regret averaged over MDPs M* drawn from the prior μ_1.

Pages 60–61: Posterior Sampling [Osband and Roy, 2017]

Input: S, A, prior μ_1 (r_h, p_h unknown)
Initialize D_1 = ∅
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Sample M_k ∼ μ_k(·|D_k)
    Compute π_k ∈ argmax_π V^π_{1,M_k}
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end

Prior distribution: ∀Θ, P(M* ∈ Θ) = μ_1(Θ)
Posterior distribution: ∀Θ, P(M* ∈ Θ | D_k, μ_1) = μ_k(Θ)
Typical priors: Dirichlet (transitions); Beta, Normal-Gamma, etc. (rewards)
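A sketch of one PSRL round with independent Dirichlet posteriors on the transitions and, for brevity, known rewards (the same simplification as on the next slide); alpha has shape (H,S,A,S) and encodes one Dirichlet per (h,s,a).

```python
import numpy as np

rng = np.random.default_rng(0)

def psrl_step(alpha, r, H, S, A):
    """Sample M_k ~ posterior, then plan exactly on the sampled MDP by backward induction."""
    p_sample = np.zeros((H, S, A, S))
    for h in range(H):
        for s in range(S):
            for a in range(A):
                p_sample[h, s, a] = rng.dirichlet(alpha[h, s, a])   # one draw per (h, s, a)
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r[h] + p_sample[h] @ V[h + 1]
        pi[h] = Q.argmax(axis=1)                # pi_k: greedy on the sampled MDP
        V[h] = Q.max(axis=1)
    return pi

def update_posterior(alpha, h, s, a, s_next):
    alpha[h, s, a, s_next] += 1.0               # conjugate Dirichlet update after each transition
```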

Pages 62–63: Model Update with Dirichlet Priors

(Assume the reward r is known.)  The update maps μ_t and the observed transition (s_t, a_t, s_{t+1}) ∼ H_t to μ_{t+1}.

Let μ_t(s,a) = Dirichlet(α_1, ..., α_S) on p(·|s,a). Observing s_{t+1} ∼ p(·|s_t, a_t) (the outcome of a multivariate Bernoulli) with s_{t+1} = i, the Bayesian posterior is
μ_{t+1}(s,a) = Dirichlet( α_1, ..., α_i + 1, ..., α_S )

- Posterior mean: p̄_{t+1}(s_i|s,a) = α_i / n, with n = Σ_{i=1}^S α_i
- Variance bounded by 1/n
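A tiny numeric sketch of the update above for a single (s,a) with S = 5; the uniform Dirichlet(1,...,1) prior is an illustrative choice.

```python
import numpy as np

alpha = np.ones(5)                       # Dirichlet(alpha_1, ..., alpha_S) prior over S = 5 next states
i = 2                                    # observed next state s_{t+1} = i
alpha[i] += 1.0                          # posterior: Dirichlet(alpha_1, ..., alpha_i + 1, ..., alpha_S)
n = alpha.sum()                          # n = sum_i alpha_i
posterior_mean = alpha / n               # mean p_{t+1}(s_i | s, a) = alpha_i / n
print(posterior_mean, 1.0 / n)           # the variance of each coordinate is bounded by 1/n
```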

Page 64: Posterior Sampling is Usually Better

(Figures: empirical comparisons in bandits [Chapelle and Li, 2011] and in finite-horizon RL [Osband and Roy, 2017].)

Page 65: PSRL: Regret

Theorem (Osband and Roy [2017], revisited)
For any prior μ_1 with independent Dirichlet priors over stationary transitions, the Bayesian regret of PSRL is bounded as
R_B(K, μ_1, PSRL) = O( HS√(AT) )

- Order optimal in √(AT)
- √(HS) factor suboptimal
- Lower bound: Ω(√(HSAT)) (stationary transitions)

* The bound stated in [Osband and Roy, 2017] is O(H√(SAT)) for stationary MDPs, but there is a mistake in their Lem. 3 (see [Qian et al., 2020]).

Pages 66–67: PSRL on RiverSwim

(Plot: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, comparing PSRL with UCRL2-B and UCBVI-B.)

Page 68: Tabular Randomized Least-Squares Value Iteration (RLSVI) [Russo, 2019]

Input: S, A, H
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    Run Tabular-RLSVI on D_k to obtain (Q_hk)_{h=1..H}
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk) = argmax_a Q_hk(s_hk, a)
        Observe r_hk and s_{h+1,k}
    end
    Add the trajectory (s_hk, a_hk, r_hk)_{h=1..H} to D_{k+1}
end

* It is not necessary to store all the data: updates can be done incrementally.

Page 69: Tabular-RLSVI

Input: dataset D_k = {(s_hi, a_hi, r_hi)}_{h∈[H], i<k}
Estimate the empirical MDP M̂_k = (S, A, p̂_hk, r̂_hk, H):
    p̂_hk(s'|s,a) = (1/N_hk(s,a)) Σ_{i=1}^{k−1} 1{(s_hi, a_hi, s_{h+1,i}) = (s, a, s')}
    r̂_hk(s,a)   = (1/N_hk(s,a)) Σ_{i=1}^{k−1} r_hi · 1{(s_hi, a_hi) = (s, a)}
for h = H,...,1 do   // backward induction
    Sample ξ_hk ∼ N(0, σ²_hk I)
    Compute, for all (s,a) ∈ S×A,
        Q_hk(s,a) = r̂_hk(s,a) + ξ_hk(s,a) + Σ_{s'∈S} p̂_hk(s'|s,a) V_{h+1,k}(s')
end
return (Q_hk)_{h=1..H}

This is planning on the perturbed MDP M̃_k = (S, A, r̂_hk + ξ_hk, p̂_hk, H).
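A sketch of one Tabular-RLSVI pass, reusing the empirical-model arrays from the model-based snippets; the noise level follows the σ_hk scaling of the next theorem with constants dropped, so the exact magnitude is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def rlsvi_plan(p_hat, r_hat, N, H, S, A):
    """Backward induction on the empirical MDP with Gaussian reward perturbations xi_{h,k}."""
    sigma = np.sqrt(S * H**3 / (N + 1.0))               # sigma_{h,k}(s,a) ~ sqrt(S H^3 / (N + 1))
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        xi = sigma[h] * rng.standard_normal((S, A))     # xi_{h,k} ~ N(0, sigma_{h,k}^2 I)
        Q[h] = r_hat[h] + xi + p_hat[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return Q
```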

Page 70: Tabular-RLSVI: Frequentist Regret

Theorem (Russo [2019])
For any tabular MDP with non-stationary transitions, Tab-RLSVI with
σ_hk(s,a) = O( √( S H³ / (N_hk(s,a) + 1) ) )
suffers, with high probability, a frequentist regret
R(K, M*, Tab-RLSVI) = O( H^{5/2} S^{3/2} √(AT) )

- Order optimal in √(AT)
- H^{3/2} S worse than the lower bound Ω(H√(SAT))
- The analysis can be improved!

Pages 71–72: Tab-RLSVI(σ) on RiverSwim

Three choices of the noise scale:
- σ1 (theory): σ_h(s,a) = (1/4) √( (H − h)³ S L / N )
- σ2:          σ_h(s,a) = (1/4) √( (H − h)² L / N )
- σ3:          σ_h(s,a) = (1/4) √( L / N )

with N = N_hk(s,a) ∨ 1 and L = log(SAN/δ).

(Plots: cumulative regret R(K) on RiverSwim for K up to 4·10⁴ episodes, first comparing Tab-RLSVI-1/2/3 and then Tab-RLSVI-3 against UCRL2-H.)

Page 73: Tabular MDPs: Outline

1. Tabular Model-Based
   - Optimistic
   - Randomized
2. Tabular Model-Free Algorithms

Pages 74–76: Model-Based Issues

Complexity:
- Space: O(HS²A), since a non-stationary model requires H(S²A [transitions] + SA [rewards]) parameters
- Time: O(K · HS²A), due to planning by value iteration at every episode

Solutions:
1. Time complexity: incremental planning (e.g., Opt-RTDP)
2. Space complexity: avoid estimating rewards and transitions ⟹ Optimistic Q-learning (Opt-QL), with space O(HSA) and time O(HAK)

Page 77: RiverSwim: Q-learning with ε-greedy Exploration

Several ε schedules are compared, e.g.:
- ε_t = 1.0
- ε_t = 0.5
- ε_t = ε_0 / N(s_t)^{2/3}
- ε_t = 1.0 for t < 6000, then ε_0 / N(s_t)^{1/2}
- ε_t = 1.0 for t < 7000, then ε_0 / N(s_t)^{1/2}

Tuning the ε schedule is difficult and problem dependent.
Regret: Ω( min{ T, A^{H/2} } )

Page 78: Optimistic Q-learning

Input: S, A
Initialize Q_h(s,a) = H − (h−1) and N_h(s,a) = 0 for all (s,a) ∈ S×A and h ∈ [H]
for k = 1,...,K do   // episodes
    Observe the initial state s_1k (arbitrary)
    for h = 1,...,H do
        Execute a_hk = π_hk(s_hk) = argmax_a Q_h(s_hk, a)
        Observe r_hk and s_{h+1,k}
        Set N_h(s_hk, a_hk) = N_h(s_hk, a_hk) + 1
        Update Q_h(s_hk, a_hk) = (1 − α_t) Q_h(s_hk, a_hk) + α_t ( r_hk + V_{h+1}(s_{h+1,k}) + b_t )   // b_t: upper-confidence bonus
        Set V_h(s_hk) = min{ H − (h−1), max_{a∈A} Q_h(s_hk, a) }
    end
end
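A sketch of a single Opt-QL update at (h, s_hk, a_hk); the step size is α_t = (H+1)/(H+t) and the bonus is Hoeffding-style, with constants and the exact log term treated as assumptions.

```python
import numpy as np

def optql_update(Q, V, N, h, s, a, r, s_next, H, S, A, T, delta=0.1):
    """One optimistic Q-learning step; V[H] is the all-zero terminal value."""
    N[h, s, a] += 1
    t = N[h, s, a]
    alpha_t = (H + 1.0) / (H + t)                               # Opt-QL step size
    b_t = np.sqrt(H**3 * np.log(S * A * T / delta) / t)         # upper-confidence bonus ~ sqrt(H^3 L / t)
    target = r + V[h + 1, s_next] + b_t
    Q[h, s, a] = (1.0 - alpha_t) * Q[h, s, a] + alpha_t * target
    V[h, s] = min(H - h, Q[h, s].max())                         # clip at H - (h-1)
    return Q, V
```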

Page 79: Step Size α_t

Standard Q-learning uses α_t of order O(1/t) or O(1/√t), with t = N_hk(s,a).
Opt-QL instead uses α_t = (H + 1) / (H + t).

Pages 80–82: Step Size α_t

Unrolling the recursive Q-learning update (t = N_hk(s,a)):
Q_hk(s,a) = 1{t = 0}·H + Σ_{i=1}^t α^i_t ( r_{k_i} + V_{h+1,k_i}(s_{h+1,k_i}) + b_i ),   with α^i_t = α_i ∏_{j=i+1}^t (1 − α_j)

i.e., the optimistic initialization plus a weighted average of bootstrapped values, where k_i is the episode in which N_hk(s,a) = i.

Idea: favor later updates. The last 1/H fraction of the samples of (s,a) keeps non-negligible weight, while the earlier 1 − 1/H fraction is forgotten.

(Plot: the weights α^i_t for the example H = 10 and t = N_hk(s,a) = 1000, comparing α_i = (H+1)/(H+i), α_i = 1/i and α_i = 1/√i.)
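A small numeric check of the unrolled weights α^i_t = α_i ∏_{j>i}(1 − α_j) for the example H = 10, t = 1000; the two printed numbers are what the plot illustrates.

```python
import numpy as np

H, t = 10, 1000
alpha = (H + 1.0) / (H + np.arange(1, t + 1))        # alpha_i = (H+1)/(H+i), i = 1..t
weights = np.array([alpha[i] * np.prod(1.0 - alpha[i + 1:]) for i in range(t)])
print(weights.sum())                                  # = 1: the weights form a distribution
print(weights[-t // H:].sum())                        # ~ 0.68: the bulk sits on the last t/H samples
```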

Page 83: Exploration Bonus b_t

Let t = N_hk(s,a). Then
| Σ_{i=1}^t α^i_t ( V*_{h+1}(s_{h+1,k_i}) − E_{s'|s,a}[ V*_{h+1}(s') ] ) |  ≤  c √( H³ log(SAT/δ) / t )  =: b_t

Note that Σ_{i=1}^t α^i_t = 1.

Page 84: Optimistic Q-learning: Regret

Theorem ([Jin et al., 2018])
For any tabular MDP with non-stationary transitions, Opt-QL with Hoeffding-type bonuses (b_t = O(√(H³/t))) suffers, with high probability, a regret
R(K, M*, Opt-QL) = O( H²√(SAT) + H²SA )

- Order optimal in √(SAT)
- H factor worse than the lower bound Ω(H√(SAT))
- √H factor worse than model-based with Hoeffding bounds: UCBVI-CH for non-stationary p_h suffers O(H^{3/2}√(SAT))
- but better second-order terms
- The bound does not improve in stationary MDPs (i.e., p_1 = ... = p_H)

Page 85: Optimistic Q-learning: Example

(Plot: cumulative regret R(K) of OptQL on RiverSwim for K up to 4·10⁴ episodes.)

Page 86: Refined Confidence Intervals

Opt-QL with Bernstein-Freedman bounds (instead of Hoeffding/Weissman):
R(K) = O( H^{3/2}√(SAT) + √(H⁹S³A³) )

- Still not matching the lower bound!
- √H worse than model-based (e.g., UCBVI-BF)

Page 87: Open Questions

1. Prove a frequentist regret bound for PSRL.
2. Should the gap between the regret of model-based and model-free algorithms exist?
3. Which algorithm is better in practice?

Pages 88–91: References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1–11. AUAI Press, 2015.
Rajeev Agrawal. Adaptive control of Markov chains under the weak accessibility. In 29th IEEE Conference on Decision and Control, pages 1426–1431. IEEE, 1990.
Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, pages 1184–1194, 2017.
Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, pages 49–56. MIT Press, 2006.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 2017.
Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35–42. AUAI Press, 2009.
Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artif. Intell., 72(1-2):81–138, 1995.
Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling.pdf.
Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In NeurIPS, pages 12203–12213, 2019.
Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122, 2010.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating Markov decision processes. In NeurIPS, pages 2998–3008, 2018a.
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In ICML, Proceedings of Machine Learning Research. PMLR, 2018b.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2B, 2019. URL https://rlgammazero.github.io/docs/ucrl2b_improved.pdf.
Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pages 861–898. JMLR.org, 2015.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. URL http://www.jstor.org/stable/2282952.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
Chi Jin, Zeyuan Allen-Zhu, Sébastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In NeurIPS, pages 4868–4878, 2018.
Sham Kakade, Mengdi Wang, and Lin F. Yang. Variance reduction methods for sublinear reinforcement learning. CoRR, abs/1802.09184, 2018.
Odalric-Ambrym Maillard, Timothy A. Mann, and Shie Mannor. How hard is my MDP? "The distribution-norm to the rescue". In NIPS, pages 1835–1843, 2014.
Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. CoRR, abs/1608.02731, 2016.
Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2701–2710. PMLR, 2017.
Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003–3011, 2013.
Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, pages 1333–1342, 2017.
Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In NeurIPS, pages 4891–4900, 2019.
Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Concentration inequalities for multinoulli random variables. CoRR, abs/2001.11595, 2020.
Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In NeurIPS, pages 14410–14420, 2019.
Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 943–950, 2000.
Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In ALT, volume 83 of Proceedings of Machine Learning Research, pages 770–805. PMLR, 2018.
Georgios Theocharous, Zheng Wen, Yasin Abbasi, and Nikos Vlassis. Scalar posterior sampling with applications. In NeurIPS, pages 7696–7704, 2018.
William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.
Aristide C. Y. Tossou, Debabrota Basu, and Christos Dimitrakakis. Near-optimal optimistic reinforcement learning using empirical Bernstein inequalities. CoRR, abs/1905.12425, 2019.
Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdú, and Marcelo J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. 2003.
Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which can identify bandit structure in MDPs. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pages 5732–5740. JMLR.org, 2018.
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 7304–7312. PMLR, 2019.
Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In NeurIPS, pages 2823–2832, 2019.