![Page 1: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/1.jpg)
Lecture 6: Model-Free Control
Hado van Hasselt
Outline
1 Introduction
2 Monte-Carlo Control
3 On-Policy Temporal-Difference Learning
4 Off-Policy Learning
5 Summary
Introduction
Model-Free Reinforcement Learning
Last lecture:
Model-free prediction: estimate the value function of an unknown MDP
This lecture:
Model-free control: optimise the value function of an unknown MDP
Uses of Model-Free Control
Some example problems that can be modelled as MDPs
Elevator
Parallel Parking
Ship Steering
Bioreactor
Helicopter
Aeroplane Logistics
Robocup Soccer
Portfolio management
Protein Folding
Robot walking
Atari video games
Game of Go
For most of these problems, either:
MDP model is unknown, but experience can be sampled
MDP model is known, but is too big to use, except by samples
Model-free control can solve these problems
Monte-Carlo Control
Generalized Policy Iteration
Generalized Policy Iteration (Refresher)
Policy evaluation: estimate $V^\pi$, e.g. iterative policy evaluation
Policy improvement: generate $\pi' \geq \pi$, e.g. greedy policy improvement
Monte Carlo
Recall, the Monte Carlo estimate from state $S_t$ is
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$
$$\mathbb{E}[G_t \mid S_t = s] = V^\pi(s)$$
So, we can average multiple estimates to get $V^\pi$
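As a minimal sketch of the return computation above, the following Python computes $G_t$ for every step of one episode and averages sampled returns. The function names and the example rewards are hypothetical, chosen only for illustration.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return [G_0, ..., G_{T-1}], where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def mc_value_estimate(sampled_returns: List[float]) -> float:
    """Average several sampled returns from the same state to estimate V^pi there."""
    return sum(sampled_returns) / len(sampled_returns)


# Hypothetical episode with three rewards R_1, R_2, R_3.
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_returns(episode_rewards, gamma=0.5))
```

Computing the returns backwards avoids re-summing the tail of the episode for every time-step.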
Policy Iteration With Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, $V = V^\pi$?
Policy improvement: greedy policy improvement?
Model-Free Policy Iteration Using Action-Value Function
Greedy policy improvement over $V(s)$ requires a model of the MDP
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \mathcal{P}^a_{ss'} V(s') \right)$$
Greedy policy improvement over $Q(s, a)$ is model-free
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$$
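The contrast between the two improvement steps can be sketched in Python. This is an illustration under assumptions: the dictionary-based tables, the function names, and the explicit `gamma` parameter are not part of the slides (`gamma = 1` matches the formula as written).

```python
from typing import Dict, List, Tuple

State = str
Action = str


def greedy_from_q(Q: Dict[Tuple[State, Action], float],
                  state: State, actions: List[Action]) -> Action:
    """Model-free improvement: argmax over a of Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])


def greedy_from_v(V: Dict[State, float],
                  R: Dict[Tuple[State, Action], float],
                  P: Dict[Tuple[State, Action], Dict[State, float]],
                  state: State, actions: List[Action],
                  gamma: float = 1.0) -> Action:
    """Model-based improvement: needs rewards R and transition probabilities P."""
    def lookahead(a: Action) -> float:
        # Expected value of the successor state under the model.
        expected_next = sum(p * V[s2] for s2, p in P[(state, a)].items())
        return R[(state, a)] + gamma * expected_next
    return max(actions, key=lookahead)
```

Only `greedy_from_v` takes `R` and `P` as arguments, which is exactly the information a model-free agent does not have.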
Generalised Policy Iteration with Action-Value Function
[Figure: generalised policy iteration diagram, starting from Q, π and alternating Q = Q^π with π = greedy(Q) until reaching Q*, π*]
Policy evaluation: Monte-Carlo policy evaluation, $Q = Q^\pi$
Policy improvement: greedy policy improvement?
ε-Greedy Policy Improvement
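Theorem
For any ε-greedy policy $\pi$, the ε-greedy policy $\pi'$ with respect to $Q^\pi$ is an improvement, $V^{\pi'}(s) \geq V^\pi(s)$

With $m = |\mathcal{A}|$ the number of actions,
$$
\begin{aligned}
Q^\pi(s, \pi'(s)) &= \sum_{a \in \mathcal{A}} \pi'(s, a)\, Q^\pi(s, a) \\
&= \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} Q^\pi(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}} Q^\pi(s, a) \\
&\geq \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} Q^\pi(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(s, a) - \epsilon/m}{1 - \epsilon}\, Q^\pi(s, a) \\
&= \sum_{a \in \mathcal{A}} \pi(s, a)\, Q^\pi(s, a) = V^\pi(s)
\end{aligned}
$$
Therefore, from the policy improvement theorem, $V^{\pi'}(s) \geq V^\pi(s)$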
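The inequality in the proof can be checked numerically for a single state. The sketch below is only an illustration: the action values and the "old" estimate that the previous policy was greedy with respect to are made-up numbers, and the helper names are hypothetical.

```python
from typing import List


def epsilon_greedy_probs(q_values: List[float], epsilon: float) -> List[float]:
    """Action probabilities: epsilon/m everywhere, plus 1 - epsilon on the greedy action."""
    m = len(q_values)
    best = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m
    probs[best] += 1.0 - epsilon
    return probs


def expected_q(probs: List[float], q_values: List[float]) -> float:
    """Sum over actions of pi(s, a) * Q(s, a)."""
    return sum(p * q for p, q in zip(probs, q_values))


epsilon = 0.3
q_pi = [1.0, 3.0, 2.0]  # hypothetical Q^pi(s, .) for one state
old_policy = epsilon_greedy_probs([5.0, 0.0, 0.0], epsilon)  # greedy w.r.t. an older estimate
new_policy = epsilon_greedy_probs(q_pi, epsilon)             # greedy w.r.t. Q^pi

# The new policy's expected action value is at least V^pi(s) under the old policy.
print(expected_q(new_policy, q_pi) >= expected_q(old_policy, q_pi))
```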
Monte-Carlo Policy Iteration
[Figure: policy iteration diagram, starting from Q, π and alternating Q = Q^π with π = ε-greedy(Q) until reaching Q*, π*]
Policy evaluation: Monte-Carlo policy evaluation, $Q = Q^\pi$
Policy improvement: ε-greedy policy improvement
Here $\pi^*$ is the best ε-greedy policy
Monte-Carlo Generalized Policy Iteration
[Figure: generalised policy iteration diagram, starting from Q and alternating Q = Q^π with π = ε-greedy(Q) towards Q*, π*]
Every episode:
Policy evaluation: Monte-Carlo policy evaluation, $Q \approx Q^\pi$
Policy improvement: ε-greedy policy improvement
GLIE
Definition
Greedy in the Limit with Infinite Exploration (GLIE)
All state-action pairs are explored infinitely many times,
$$\lim_{k \to \infty} N_k(s, a) = \infty$$
The policy converges on a greedy policy,
$$\lim_{k \to \infty} \pi_k(s, a) = \mathbf{1}\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$$
For example, ε-greedy is GLIE if ε reduces to zero at $\epsilon_k = \frac{1}{k}$
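A small sketch of the example schedule, assuming episodes are indexed from k = 1. The helper names are hypothetical. The intuition usually given is that the harmonic series diverges, so exploration never stops entirely, while ε itself still shrinks to zero.

```python
import random
from typing import List


def glie_epsilon(k: int) -> float:
    """Exploration rate epsilon_k = 1/k for episode index k = 1, 2, ..."""
    if k < 1:
        raise ValueError("episode index k must be at least 1")
    return 1.0 / k


def epsilon_greedy_action(q_values: List[float], epsilon: float,
                          rng: random.Random) -> int:
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
for k in (1, 10, 100):
    # The chosen action becomes greedy more and more often as k grows.
    print(k, glie_epsilon(k), epsilon_greedy_action([0.0, 1.0], glie_epsilon(k), rng))
```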
GLIE Every-Visit Monte-Carlo Control
Sample kth episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$
For each state $S_t$ and action $A_t$ in the episode,
$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left(G_t - Q(S_t, A_t)\right)$$
Improve policy based on new action-value function
$$\epsilon \leftarrow 1/k$$
$$\pi \leftarrow \epsilon\text{-greedy}(Q)$$
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to Q^*(s, a)$
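The steps above can be sketched on a toy problem. Everything about the environment below (a hypothetical two-step episodic task with deterministic rewards), the names, and the seed are assumptions made for illustration. The sketch makes no claim about how quickly the best actions are found.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

State = str
Action = int
Step = Tuple[State, Action, float]  # (S_t, A_t, R_{t+1})

# Hypothetical task: (state, action) -> (reward, next state or None if terminal).
TOY_MDP: Dict[Tuple[State, Action], Tuple[float, Optional[State]]] = {
    ("start", 0): (0.0, "left"),
    ("start", 1): (0.0, "right"),
    ("left", 0): (1.0, None),
    ("left", 1): (0.0, None),
    ("right", 0): (0.0, None),
    ("right", 1): (2.0, None),
}
ACTIONS = [0, 1]


def epsilon_greedy(Q, state: State, epsilon: float, rng: random.Random) -> Action:
    """Random action with probability epsilon, otherwise greedy with respect to Q."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def sample_episode(Q, epsilon: float, rng: random.Random) -> List[Step]:
    """Follow the epsilon-greedy policy from 'start' until termination."""
    episode, state = [], "start"
    while state is not None:
        action = epsilon_greedy(Q, state, epsilon, rng)
        reward, next_state = TOY_MDP[(state, action)]
        episode.append((state, action, reward))
        state = next_state
    return episode


def mc_update(Q, N, episode: List[Step], gamma: float) -> None:
    """Every-visit update: Q <- Q + (G_t - Q) / N for each visited pair."""
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        N[(state, action)] += 1
        Q[(state, action)] += (g - Q[(state, action)]) / N[(state, action)]


def glie_mc_control(num_episodes: int, gamma: float = 1.0, seed: int = 0):
    rng = random.Random(seed)
    Q, N = defaultdict(float), defaultdict(int)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k  # GLIE schedule; the policy is epsilon-greedy in the current Q
        mc_update(Q, N, sample_episode(Q, epsilon, rng), gamma)
    return Q, N
```

Policy improvement is implicit here: the ε-greedy policy is recomputed from the latest `Q` whenever an action is chosen, so no separate policy table is stored.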
GLIE Every-Visit Monte-Carlo Control
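$$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left(G_t - Q(S_t, A_t)\right)$$
$$\epsilon \leftarrow 1/k$$
$$\pi \leftarrow \epsilon\text{-greedy}(Q)$$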
Any practical issues with this algorithm?
Any ways to improve, in practice?
Is 1/N the best step size?
Is 1/k enough exploration?
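One way to explore the step-size question is to compare the 1/N update with a constant step size on a sequence of returns that changes over time, as can happen when the policy keeps improving. This is a sketch only: the constant step size α = 0.5 and the return sequence are hypothetical, and it is not presented as the lecture's answer.

```python
from typing import List, Tuple


def running_mean_update(q: float, n: int, g: float) -> Tuple[float, int]:
    """1/N step size: q stays equal to the plain average of all returns seen so far."""
    n += 1
    return q + (g - q) / n, n


def constant_alpha_update(q: float, g: float, alpha: float) -> float:
    """Constant step size: recent returns get exponentially more weight."""
    return q + alpha * (g - q)


# Hypothetical returns that jump from 0 to 1, e.g. after the policy improves.
returns: List[float] = [0.0] * 5 + [1.0] * 5

q_mean, n = 0.0, 0
q_const = 0.0
for g in returns:
    q_mean, n = running_mean_update(q_mean, n, g)
    q_const = constant_alpha_update(q_const, g, alpha=0.5)

# The running mean still averages in the old returns; the constant-alpha
# estimate has moved most of the way towards the recent ones.
print(q_mean, q_const)
```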
Blackjack Example
Back to the Blackjack Example
Monte-Carlo Control in Blackjack
On-Policy Temporal-Difference Learning
MC vs. TD Control
Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC)
Lower variance
Online
Can learn from incomplete sequences
Natural idea: use TD instead of MC for control
Apply TD to Q(s, a)
Use ε-greedy policy improvement
Update every time-step
Sarsa(λ)
Updating Action-Value Functions with Sarsa
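[Figure: Sarsa backup diagram: state-action pair (s, a), reward r, next state s′, next action a′]
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left(R + \gamma Q(s', a') - Q(s, a)\right)$$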
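The update can be written as a single function. This is a minimal sketch, assuming a dictionary-based table with unseen entries treated as zero. The `terminal` flag, which drops the bootstrap term at the end of an episode, is an added convenience and not part of the formula above.

```python
from typing import Dict, Hashable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def sarsa_update(Q: QTable, s, a, r: float, s_next, a_next,
                 alpha: float, gamma: float, terminal: bool = False) -> float:
    """Apply Q(s,a) <- Q(s,a) + alpha * (R + gamma * Q(s',a') - Q(s,a)).

    Returns the TD error so callers can inspect it.
    """
    bootstrap = 0.0 if terminal else gamma * Q.get((s_next, a_next), 0.0)
    td_error = r + bootstrap - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return td_error


# Hypothetical values: Q(s, 0) = 1 and Q(s2, 1) = 2.
Q = {("s", 0): 1.0, ("s2", 1): 2.0}
sarsa_update(Q, "s", 0, r=1.0, s_next="s2", a_next=1, alpha=0.5, gamma=0.5)
print(Q[("s", 0)])
```

The name Sarsa comes from the quintuple used in each update: S, A, R, S′, A′.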
Sarsa
[Figure: generalised policy iteration diagram, starting from Q and alternating Q = Q^π with π = ε-greedy(Q) towards Q*, π*]
Every time-step:
Policy evaluation: Sarsa, $Q \approx Q^\pi$
Policy improvement: ε-greedy policy improvement
Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function, $Q(s, a) \to Q^*(s, a)$, under the following conditions:
GLIE sequence of policies $\pi_t(s, a)$
Robbins-Monro sequence of step-sizes $\alpha_t$
$$\sum_{t=1}^{\infty} \alpha_t = \infty$$
$$\sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
E.g., $\alpha_t = 1/t$ or $\alpha_t = 1/t^\omega$ with $\omega \in (0.5, 1)$.
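The two step-size conditions can be illustrated with partial sums. A finite sum cannot prove divergence or convergence, so the sketch below only shows the trend. The function names are hypothetical.

```python
from typing import Tuple


def step_size(t: int, omega: float = 1.0) -> float:
    """alpha_t = 1 / t**omega for t = 1, 2, ..."""
    return 1.0 / (t ** omega)


def partial_sums(omega: float, n: int) -> Tuple[float, float]:
    """Return (sum of alpha_t, sum of alpha_t squared) over t = 1..n."""
    first = sum(step_size(t, omega) for t in range(1, n + 1))
    second = sum(step_size(t, omega) ** 2 for t in range(1, n + 1))
    return first, second


for n in (10, 1000, 100000):
    first, second = partial_sums(omega=1.0, n=n)
    # The first sum keeps growing with n, the second one levels off.
    print(n, round(first, 3), round(second, 4))
```

A constant step size would keep the first sum growing, but the sum of squares would grow without bound as well, so it does not satisfy the second condition.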
![Page 55: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/55.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
Windy Gridworld Example
Reward = -1 per time-step until reaching goal
Undiscounted
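The reward structure above can be made concrete with a small environment sketch. The slide's figure is not part of the extracted text, so everything about the layout here is an assumption: the 7×10 grid, the per-column wind strengths, and the start and goal cells follow the standard textbook version of this example (Sutton and Barto).

```python
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]    # upward push per column (assumed layout)
START, GOAL = (3, 0), (3, 7)              # assumed start and goal cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Move one cell, add the wind of the current column, clip to the grid."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row - WIND[col], 0), ROWS - 1)
    new_col = min(max(col + d_col, 0), COLS - 1)
    next_state = (new_row, new_col)
    # Reward is -1 on every time-step until the goal is reached.
    return next_state, -1, next_state == GOAL

def undiscounted_return(actions, state=START):
    """Sum of rewards along a fixed action sequence; stops at the goal."""
    total = 0
    for action in actions:
        state, reward, done = step(state, action)
        total += reward
        if done:
            break
    return total, state

path = ["right"] * 9 + ["down"] * 4 + ["left"] * 2
print(undiscounted_return(path))
```

Under these assumed dynamics, the action sequence in `path` reaches the goal in 15 steps, so its undiscounted return is -15. With reward -1 per step and no discounting, maximising return is the same as minimising the number of steps.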
![Page 57: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/57.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
Sarsa on the Windy Gridworld
![Page 58: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/58.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
n-Step Sarsa
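Consider the following n-step returns for n = 1, 2, ..., ∞:
$$
\begin{aligned}
n = 1 \ \text{(Sarsa(0))}: \quad & G_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \\
n = 2: \quad & G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2}) \\
& \vdots \\
n = \infty \ \text{(MC)}: \quad & G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T
\end{aligned}
$$
Define the n-step Q-return
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$
n-step Sarsa updates Q(s, a) towards the n-step Q-return
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t^{(n)} - Q(S_t, A_t) \right)$$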
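The n-step Q-return can be computed from a recorded episode as in the sketch below. The list-based episode format and the example numbers are assumptions made for illustration.

```python
def n_step_q_return(rewards, q_values, t, n, gamma=1.0):
    """n-step Q-return G_t^(n) for one recorded episode.

    rewards[k]  holds R_{k+1}, the reward received after step k.
    q_values[k] holds the current estimate Q(S_k, A_k) for k = 0..T-1.
    If t + n reaches the end of the episode, the bootstrap term is zero and
    the result is the Monte-Carlo return.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        g += gamma ** n * q_values[t + n]
    return g

rewards = [-1.0, -1.0, -1.0, 10.0]     # R_1..R_4, episode ends after 4 steps
q_values = [2.0, 3.0, 4.0, 5.0]        # Q(S_0,A_0)..Q(S_3,A_3), made-up numbers
gamma = 0.9
g1 = n_step_q_return(rewards, q_values, t=0, n=1, gamma=gamma)     # Sarsa(0) target
g2 = n_step_q_return(rewards, q_values, t=0, n=2, gamma=gamma)
g_mc = n_step_q_return(rewards, q_values, t=0, n=10, gamma=gamma)  # runs to episode end
print(g1, g2, g_mc)
```

Small n leans on the current estimates (more bias, less variance), while large n leans on sampled rewards (less bias, more variance).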
![Page 61: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/61.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
Forward View Sarsa(λ)
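The $G^\lambda$ return combines all n-step Q-returns $G_t^{(n)}$
Using weight $(1 - \lambda)\lambda^{n-1}$
$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
Forward-view Sarsa(λ)
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t^\lambda - Q(S_t, A_t) \right)$$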
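For an episode that terminates at time T, the infinite sum can be evaluated exactly, because every n-step return with t + n ≥ T equals the Monte-Carlo return. The sketch below assumes a list-based episode format and made-up numbers; it is an illustration rather than the lecture's implementation.

```python
def n_step_q_return(rewards, q_values, t, n, gamma):
    """G_t^(n); rewards[k] is R_{k+1}, q_values[k] is Q(S_k, A_k)."""
    T = len(rewards)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:
        g += gamma ** n * q_values[t + n]
    return g

def lambda_return(rewards, q_values, t, lam, gamma=1.0):
    """Forward-view return G_t^lambda for an episode of length T."""
    T = len(rewards)
    g_lam = 0.0
    # Bootstrapping returns n = 1 .. T-t-1 get weight (1 - lam) * lam^(n-1).
    for n in range(1, T - t):
        weight = (1.0 - lam) * lam ** (n - 1)
        g_lam += weight * n_step_q_return(rewards, q_values, t, n, gamma)
    # The remaining geometric weight, lam^(T-t-1), falls on the full return.
    g_mc = n_step_q_return(rewards, q_values, t, T - t, gamma)
    return g_lam + lam ** (T - t - 1) * g_mc

rewards = [-1.0, -1.0, -1.0, 10.0]     # R_1..R_4 (made-up)
q_values = [2.0, 3.0, 4.0, 5.0]        # Q(S_0,A_0)..Q(S_3,A_3) (made-up)
g_td = lambda_return(rewards, q_values, t=0, lam=0.0, gamma=0.9)
g_mix = lambda_return(rewards, q_values, t=0, lam=0.5, gamma=0.9)
g_mc = lambda_return(rewards, q_values, t=0, lam=1.0, gamma=0.9)
print(g_td, g_mix, g_mc)
```

Setting λ = 0 recovers the one-step Sarsa target and λ = 1 recovers the Monte-Carlo return.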
![Page 64: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/64.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
Forward View Sarsa(λ)
Written recursively:
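$$G_t^\lambda = R_{t+1} + \gamma (1 - \lambda) Q_t(S_{t+1}, A_{t+1}) + \gamma \lambda G_{t+1}^\lambda$$
Weight $1$ on $R_{t+1}$
Weight $\gamma(1 - \lambda)$ on $Q_t(S_{t+1}, A_{t+1})$
Weight $\gamma\lambda$ on $R_{t+2}$
Weight $\gamma^2\lambda(1 - \lambda)$ on $Q_t(S_{t+2}, A_{t+2})$
Weight $\gamma^2\lambda^2$ on $R_{t+3}$
Weight $\gamma^3\lambda^2(1 - \lambda)$ on $Q_t(S_{t+3}, A_{t+3})$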
...
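The recursive form follows from the weighted-sum definition of the λ-return. A sketch of the derivation, assuming the same estimate $Q_t$ is used in every n-step return and using $G_t^{(1)} = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1})$ and $G_t^{(n)} = R_{t+1} + \gamma G_{t+1}^{(n-1)}$ for $n \ge 2$:

$$
\begin{aligned}
G_t^\lambda &= (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)} \\
&= (1-\lambda)\left(R_{t+1} + \gamma Q_t(S_{t+1},A_{t+1})\right) + (1-\lambda)\sum_{n=2}^{\infty}\lambda^{n-1}\left(R_{t+1} + \gamma G_{t+1}^{(n-1)}\right) \\
&= R_{t+1} + \gamma(1-\lambda)Q_t(S_{t+1},A_{t+1}) + \gamma\lambda(1-\lambda)\sum_{m=1}^{\infty}\lambda^{m-1}G_{t+1}^{(m)} \\
&= R_{t+1} + \gamma(1-\lambda)Q_t(S_{t+1},A_{t+1}) + \gamma\lambda G_{t+1}^\lambda
\end{aligned}
$$

The third line uses $(1-\lambda)\sum_{n \ge 1}\lambda^{n-1} = 1$ for the coefficient of $R_{t+1}$ and the substitution $m = n - 1$ in the remaining sum. Unrolling the recursion once more gives the listed weights on $R_{t+2}$, $Q_t(S_{t+2}, A_{t+2})$, and so on.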
![Page 72: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/72.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
Backward View Sarsa(λ)
Just like TD(λ), we use eligibility traces in an online algorithm
But Sarsa(λ) has one eligibility trace for each state-action pair
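$$E_0(s, a) = 0$$
$$E_t(s, a) = (1 - \alpha_t)\gamma\lambda E_{t-1}(s, a) + \alpha_t \mathbf{1}(S_t = s, A_t = a)$$
Q(s, a) is updated for every state s and action a
In proportion to TD-error $\delta_t$ and eligibility trace $E_t(s, a)$
$$\delta_t = R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$$
$$\forall s, a: \quad Q(s, a) \leftarrow Q(s, a) + \delta_t E_t(s, a)$$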
NB: I rolled step size into the traces on this and next slide!
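One online step of these equations can be sketched with NumPy arrays indexed by (state, action). The array layout, the `done` flag and the two-step example are assumptions; the trace and Q updates follow the equations on this slide as written, with the step size folded into the trace.

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam, done=False):
    """One backward-view update; modifies Q and E in place, returns the TD error."""
    # TD error of the transition just observed (computed before Q changes).
    bootstrap = 0.0 if done else Q[s_next, a_next]
    delta = r + gamma * bootstrap - Q[s, a]
    # Decay every trace, then credit the pair just visited.
    E *= (1.0 - alpha) * gamma * lam
    E[s, a] += alpha
    # Every state-action pair moves in proportion to its trace.
    Q += delta * E
    return delta

Q = np.zeros((3, 2))
E = np.zeros((3, 2))
# Step 1: no reward yet, so nothing changes except the trace of (0, 1).
sarsa_lambda_step(Q, E, s=0, a=1, r=0.0, s_next=1, a_next=1,
                  alpha=0.5, gamma=1.0, lam=0.8)
# Step 2: reward arrives; both (1, 1) and the earlier pair (0, 1) get credit.
sarsa_lambda_step(Q, E, s=1, a=1, r=1.0, s_next=2, a_next=0,
                  alpha=0.5, gamma=1.0, lam=0.8, done=True)
print(Q)
```

The second step shows the point of the traces: a single reward updates the pair visited one step earlier as well, scaled down by the decayed trace.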
![Page 77: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/77.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
Sarsa(λ) Algorithm
Select $A_0$ according to $\pi_0(S_0)$
Initialize $E_0(s, a) = 0$, for all s, a
For t = 0, 1, 2, ...
  Take action $A_t$, observe $R_{t+1}$, $S_{t+1}$
  Select $A_{t+1}$ according to $\pi_{t+1}(S_{t+1})$
  $\delta_t = R_{t+1} + \gamma_t Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t)$
  $E_t(S_t, A_t) = E_{t-1}(S_t, A_t) + \alpha_t$ (accumulating trace)
  or $E_t(S_t, A_t) = \alpha_t$ (replacing trace)
  or $E_t(S_t, A_t) = E_{t-1}(S_t, A_t) + \alpha_t (1 - E_{t-1}(S_t, A_t))$ (dutch trace)
  For all s, a:
    $Q_{t+1}(s, a) = Q_t(s, a) + \delta_t E_t(s, a)$
    $E_t(s, a) = \gamma_t \lambda_t E_{t-1}(s, a)$
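The algorithm above can be sketched in Python with the three trace variants selectable by name. This is an illustration under stated assumptions: constant γ and λ, an ε-greedy policy with ε = 1/k, and a hypothetical 5-state corridor (`corridor_step`) as the test environment. None of these choices come from the slide.

```python
import numpy as np

def bump_trace(E, s, a, alpha, kind):
    """Raise the trace of the visited pair; the step size lives in the trace."""
    if kind == "accumulating":
        E[s, a] += alpha
    elif kind == "replacing":
        E[s, a] = alpha
    elif kind == "dutch":
        E[s, a] += alpha * (1.0 - E[s, a])
    else:
        raise ValueError(f"unknown trace kind: {kind}")

def sarsa_lambda(step, n_states, n_actions, start, episodes=200, alpha=0.1,
                 gamma=1.0, lam=0.8, kind="accumulating", seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(s, eps):
        if rng.random() < eps:
            return int(rng.integers(n_actions))   # explore
        return int(np.argmax(Q[s]))               # greedy w.r.t. current Q

    lengths = []
    for k in range(1, episodes + 1):
        eps = 1.0 / k                             # assumed exploration schedule
        E = np.zeros_like(Q)                      # E_0(s, a) = 0 for all s, a
        s, a, done, steps = start, policy(start, eps), False, 0
        while not done:
            s_next, r, done = step(s, a)
            a_next = policy(s_next, eps)
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            bump_trace(E, s, a, alpha, kind)
            Q += delta * E                        # update all pairs at once
            E *= gamma * lam                      # decay all traces
            s, a, steps = s_next, a_next, steps + 1
        lengths.append(steps)
    return Q, lengths

def corridor_step(s, a, goal=4):
    """Hypothetical environment: states 0..4, action 0 = left, 1 = right."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return s_next, -1.0, s_next == goal

results = {kind: sarsa_lambda(corridor_step, 5, 2, start=0, kind=kind)
           for kind in ("accumulating", "replacing", "dutch")}
for kind, (Q, lengths) in results.items():
    print(kind, lengths[-5:])
```

The three variants differ only in how the visited pair's trace is raised; the Q update and the decay are shared.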
![Page 78: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/78.jpg)
Lecture 6: Model-Free Control
On-Policy Temporal-Difference Learning
Sarsa(λ)
Sarsa(λ) Gridworld Example
![Page 79: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/79.jpg)
Lecture 6: Model-Free Control
Off-Policy Learning
![Page 80: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/80.jpg)
Lecture 6: Model-Free Control
Off-Policy Learning
On and Off-Policy Learning
On-policy learning
“Learn on the job”
Learn about policy π from experience sampled from π
Off-policy learning
“Look over someone’s shoulder”
Learn about policy π from experience sampled from µ
![Page 86: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/86.jpg)
Lecture 6: Model-Free Control
Off-Policy Learning
Off-Policy Learning
Evaluate target policy π(s, a) to compute $V^\pi(s)$ or $Q^\pi(s, a)$
While following behaviour policy µ(s, a)
$$\{S_1, A_1, R_2, \dots, S_T\} \sim \mu$$
Why is this important?
Learn from observing humans or other agents
Re-use experience generated from old policies $\pi_1, \pi_2, \dots, \pi_{t-1}$
Learn about optimal policy while following exploratory policy
Learn about multiple policies while following one policy
![Page 93: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/93.jpg)
Lecture 6: Model-Free Control
Off-Policy Learning
Importance Sampling
Importance Sampling
Estimate the expectation of a different distribution
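$$
\begin{aligned}
\mathbb{E}_{x \sim d}[f(x)] &= \sum_x d(x) f(x) \\
&= \sum_x d'(x) \frac{d(x)}{d'(x)} f(x) \\
&= \mathbb{E}_{x \sim d'}\left[ \frac{d(x)}{d'(x)} f(x) \right]
\end{aligned}
$$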
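The identity can be checked numerically: sample from d', reweight each sample by d(x)/d'(x), and compare with the exact expectation under d. The distributions, the function f and the sample size below are made-up values for illustration.

```python
import numpy as np

def f(x):
    return x ** 2

rng = np.random.default_rng(0)
values = np.array([0, 1, 2])
d = np.array([0.7, 0.2, 0.1])              # distribution we care about
d_prime = np.array([1 / 3, 1 / 3, 1 / 3])  # distribution we can sample from

exact = float(np.sum(d * f(values)))       # E_{x~d}[f(x)] computed directly

idx = rng.choice(len(values), size=100_000, p=d_prime)  # indices drawn from d'
weights = d[idx] / d_prime[idx]                          # ratios d(x) / d'(x)
estimate = float(np.mean(weights * f(values[idx])))
print(exact, estimate)
```

The estimate is unbiased as long as d'(x) > 0 wherever d(x) f(x) is non-zero; its variance grows as the two distributions become less alike.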
![Page 94: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/94.jpg)
Lecture 6: Model-Free Control
Off-Policy Learning
Importance Sampling
Importance Sampling for Off-Policy Monte-Carlo
Use returns generated from µ to evaluate π
Weight return $G_t$ according to similarity between policies
Multiply importance sampling corrections along whole episode
$$G_t^{\pi/\mu} = \frac{\pi(S_t, A_t)}{\mu(S_t, A_t)} \frac{\pi(S_{t+1}, A_{t+1})}{\mu(S_{t+1}, A_{t+1})} \cdots \frac{\pi(S_T, A_T)}{\mu(S_T, A_T)} G_t$$
Update value towards corrected return
$$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)$$
Importance sampling can dramatically increase variance
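A minimal sketch of the corrected return and the value update, assuming policies stored as dictionaries of action probabilities and an episode stored as (state, action, reward) triples; the policies and the episode are made-up. The product of ratios here runs over every recorded step.

```python
def corrected_return(episode, pi, mu, gamma=1.0):
    """Importance-sampling corrected return from the first step of the episode.

    episode: list of (state, action, reward), reward being R_{k+1}.
    pi, mu:  dicts mapping (state, action) to a probability.
    """
    ratio, g = 1.0, 0.0
    for k, (s, a, r) in enumerate(episode):
        ratio *= pi[(s, a)] / mu[(s, a)]   # one correction per step taken
        g += gamma ** k * r                # ordinary discounted return
    return ratio * g

# Target policy pi strongly prefers action "a"; behaviour policy mu is uniform.
pi = {("s0", "a"): 0.9, ("s0", "b"): 0.1, ("s1", "a"): 0.9, ("s1", "b"): 0.1}
mu = {key: 0.5 for key in pi}

episode = [("s0", "a", 0.0), ("s1", "a", 1.0)]   # generated by following mu
g_corr = corrected_return(episode, pi, mu)

V = {"s0": 0.0, "s1": 0.0}
alpha = 0.1
V["s0"] += alpha * (g_corr - V["s0"])            # update towards corrected return
print(g_corr, V["s0"])
```

Each step multiplies in another ratio, so over long episodes the weight can become very large or very close to zero, which is the source of the variance noted above.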
![Page 99: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/99.jpg)
Lecture 6: Model-Free Control
Off-Policy Learning
Importance Sampling
Importance Sampling for Off-Policy TD
Use TD targets generated from µ to evaluate π
Weight TD target $r + \gamma V(s')$ by importance sampling
Only need a single importance sampling correction
$$V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(S_t, A_t)}{\mu(S_t, A_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)$$
Much lower variance than Monte-Carlo importance sampling
Policies only need to be similar over a single step
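The single-step correction can be sketched as follows. The dictionary-based policies and value table, the `terminal` flag and the example numbers are assumptions made for illustration.

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu,
                         alpha=0.1, gamma=1.0, terminal=False):
    """One importance-weighted TD(0) update of V[s]; returns the ratio used."""
    rho = pi[(s, a)] / mu[(s, a)]                 # single correction for this step
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (rho * target - V[s])         # weight the TD target by rho
    return rho

pi = {("s0", "a"): 0.9, ("s0", "b"): 0.1}         # target policy (made-up)
mu = {("s0", "a"): 0.5, ("s0", "b"): 0.5}         # behaviour policy (made-up)
V = {"s0": 0.0, "s1": 2.0}

rho = off_policy_td_update(V, "s0", "a", r=1.0, s_next="s1",
                           pi=pi, mu=mu, alpha=0.1, gamma=0.9)
print(rho, V["s0"])
```

Only one ratio enters each update, so the weight stays bounded by the largest single-step ratio instead of growing with episode length.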
![Page 104: Lecture 6: Model-Free Control - Joseph Modayiljosephmodayil.com/UCLRL/Control.pdf · Lecture 6: Model-Free Control Outline 1 Introduction 2 Monte-Carlo Control 3 On-Policy Temporal-Di](https://reader034.vdocument.in/reader034/viewer/2022051800/5ac28af17f8b9a433f8e30ab/html5/thumbnails/104.jpg)
Lecture 6: Model-Free Control
Off-Policy Learning
Q-Learning
Q-Learning
We now consider off-policy learning of action-values Q(s, a)
No importance sampling is required
Next action may be chosen using behaviour policy $A_{t+1} \sim \mu(S_{t+1}, \cdot)$
But we consider probabilities under $\pi(S_{t+1}, \cdot)$
Update $Q(S_t, A_t)$ towards value of alternative action
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left(R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\right)$$
Called Expected Sarsa (when µ = π) or Generalized Q-learning
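The update above might be sketched as follows, assuming a tabular Q stored as a dict mapping each state to a list of action values. The names `expected_sarsa_update` and `pi_next` are hypothetical, chosen for this example.

```python
def expected_sarsa_update(q, s, a, r, s_next, pi_next, alpha, gamma):
    """One Expected Sarsa / generalized Q-learning update.

    q       : dict mapping state -> list of action values
    pi_next : target-policy probabilities pi(a | s_next), one per action
    """
    # Expectation of Q(s_next, .) under the target policy; the action
    # actually taken next by the behaviour policy is not needed.
    expected = sum(p * qv for p, qv in zip(pi_next, q[s_next]))
    q[s][a] += alpha * (r + gamma * expected - q[s][a])
    return q[s][a]
```

Because the target averages over all actions under π, no importance-sampling ratio is needed even when the data come from µ.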
Off-Policy Control with Q-Learning
We now allow both behaviour and target policies to improve
The target policy π is greedy w.r.t. Q(s, a)
$$\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$$
The behaviour policy µ is e.g. ε-greedy w.r.t. Q(s, a)
The Q-learning target then simplifies:
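$$\begin{aligned}
& R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) \\
={} & R_{t+1} + \gamma\, Q\big(S_{t+1}, \arg\max_a Q(S_{t+1}, a)\big) \\
={} & R_{t+1} + \gamma \max_a Q(S_{t+1}, a)
\end{aligned}$$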
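A small numerical check of the simplification above: when π is greedy it puts all its probability on the maximising action, so the expectation collapses to the max. The helper names are assumptions for this sketch, and ties are broken towards the lowest action index.

```python
def greedy_policy(q_values):
    """One-hot distribution on the (first) maximising action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


def expected_target(r, gamma, q_next, pi_next):
    """R + gamma * sum_a pi(a | s') Q(s', a)."""
    return r + gamma * sum(p * qv for p, qv in zip(pi_next, q_next))


def max_target(r, gamma, q_next):
    """R + gamma * max_a Q(s', a)."""
    return r + gamma * max(q_next)


q_next = [0.2, 1.5, -0.3]
same = abs(expected_target(1.0, 0.9, q_next, greedy_policy(q_next))
           - max_target(1.0, 0.9, q_next)) < 1e-12
print(same)
```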
Q-Learning Control Algorithm
Theorem
Q-learning control converges to the optimal action-value function, $Q(s, a) \to Q^*(s, a)$, as long as we take each action in each state infinitely often.
Note: no need for greedy behaviour!
Q-Learning Algorithm for Off-Policy Control
For t = 0, 1, 2, . . .
Take action $A_t$ according to $\pi_t(S_t)$, observe $R_{t+1}$, $S_{t+1}$
$$Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha_t\left(R_{t+1} + \gamma_t \max_a Q_t(S_{t+1}, a) - Q_t(S_t, A_t)\right)$$
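The loop above can be sketched on a toy problem. Everything about the environment here is an assumption made for illustration: a four-state chain where moving right from the last non-terminal state gives reward 1 and ends the episode. The behaviour policy is uniformly random, with constant step size and discount.

```python
import random

N_STATES = 4            # states 0..3; state 3 is terminal (hypothetical chain)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain dynamics: reward 1 only on reaching the terminal state."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return reward, next_state, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: uniformly random, never greedy.
            action = rng.choice((LEFT, RIGHT))
            reward, next_state, done = step(state, action)
            # Target bootstraps from the greedy value of the next state.
            bootstrap = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * bootstrap - q[state][action])
            state = next_state
    return q


q = q_learning()
greedy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)  # [1, 1, 1]: always move right
```

Although the agent never acts greedily, the learned values still favour moving right in every state, in line with the note that greedy behaviour is not needed.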
Cliff Walking Example
Summary
Relationship Between DP and TD
| Bellman equation | Full Backup (DP) | Sample Backup (TD) |
|---|---|---|
| Bellman Expectation Equation for $V^\pi(s)$ | Iterative Policy Evaluation | TD Learning |
| Bellman Expectation Equation for $Q^\pi(s, a)$ | Q-Policy Iteration | Sarsa |
| Bellman Optimality Equation for $Q^*(s, a)$ | Q-Value Iteration | Q-Learning |
Relationship Between DP and TD (2)
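| Full Backup (DP) | Sample Backup (TD) |
|---|---|
| Iterative Policy Evaluation: $V(s) \leftarrow \mathbb{E}\left[r + \gamma V(s') \mid s\right]$ | TD Learning: $V(s) \overset{\alpha}{\leftarrow} r + \gamma V(s')$ |
| Q-Policy Iteration: $Q(s, a) \leftarrow \mathbb{E}\left[r + \gamma Q(s', a') \mid s, a\right]$ | Sarsa: $Q(s, a) \overset{\alpha}{\leftarrow} r + \gamma Q(s', a')$ |
| Q-Value Iteration: $Q(s, a) \leftarrow \mathbb{E}\left[r + \gamma \max_{a' \in \mathcal{A}} Q(s', a') \mid s, a\right]$ | Q-Learning: $Q(s, a) \overset{\alpha}{\leftarrow} r + \gamma \max_{a' \in \mathcal{A}} Q(s', a')$ |

where $x \overset{\alpha}{\leftarrow} y \equiv x \leftarrow x + \alpha(y - x)$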
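In code, the three sample backups differ only in the target passed to the same soft update. A minimal sketch, with function names that are assumptions for this example:

```python
def soft_update(x, y, alpha):
    """The alpha-arrow operator: move x a fraction alpha of the way towards y."""
    return x + alpha * (y - x)


def td_target(r, gamma, v_next):
    """TD learning: bootstrap from V(s')."""
    return r + gamma * v_next


def sarsa_target(r, gamma, q_next_taken):
    """Sarsa: bootstrap from Q(s', a') for the action a' actually taken."""
    return r + gamma * q_next_taken


def q_learning_target(r, gamma, q_next_all):
    """Q-learning: bootstrap from the largest Q(s', a') over all actions."""
    return r + gamma * max(q_next_all)
```

For instance, `soft_update(q_sa, q_learning_target(r, gamma, q_next_all), alpha)` gives the new value of Q(s, a) under Q-learning.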
Q-learning overestimates
Q-learning has a problem
Recall
$$\max_a Q_t(S_{t+1}, a) = Q_t\big(S_{t+1}, \arg\max_a Q_t(S_{t+1}, a)\big)$$
Q-learning uses same values to select and to evaluate
... but values are approximate
Therefore:
more likely to select overestimated values
less likely to select underestimated values
This causes upward bias
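The bias can be seen in a small simulation. The setup is an assumption chosen for illustration: every action has true value 0, and each estimate is the true value plus independent zero-mean Gaussian noise. The true maximum is therefore 0, yet the average of the maximum estimate is clearly positive.

```python
import random


def average_max_estimate(n_actions=5, noise_std=1.0, trials=10000, seed=0):
    """Average of max_a Q(a) when every true value is 0 and estimates are noisy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Noisy estimates of action values whose true value is 0.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        # Select and evaluate with the same noisy values.
        total += max(estimates)
    return total / trials


bias = average_max_estimate()
print(bias > 0.0)  # True: the max of noisy estimates overshoots the true max of 0
```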
Double Q-learning
Q-learning uses same values to select and to evaluate
$$R_{t+1} + \gamma Q_t\big(S_{t+1}, \arg\max_a Q_t(S_{t+1}, a)\big)$$
Solution: decouple selection from evaluation
Double Q-learning:
$$R_{t+1} + \gamma Q'_t\big(S_{t+1}, \arg\max_a Q_t(S_{t+1}, a)\big)$$
$$R_{t+1} + \gamma Q_t\big(S_{t+1}, \arg\max_a Q'_t(S_{t+1}, a)\big)$$
Then update one at random for each experience
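A possible tabular sketch of this scheme is given below. It assumes that the table being updated selects the action while the other table evaluates it; the slide does not say which target pairs with which table, so that pairing is an assumption here. The names and the dict-of-lists layout are also illustrative.

```python
import random


def double_q_update(q_a, q_b, s, a, r, s_next, alpha, gamma, rng=random):
    """One Double Q-learning update on a randomly chosen table.

    q_a, q_b : dicts mapping state -> list of action values
    rng      : any object with a random() method returning a float in [0, 1)
    """
    # Pick which table learns from this experience.
    if rng.random() < 0.5:
        learner, evaluator = q_a, q_b
    else:
        learner, evaluator = q_b, q_a
    # Selection uses the learner's values ...
    n_actions = len(learner[s_next])
    best = max(range(n_actions), key=lambda act: learner[s_next][act])
    # ... evaluation uses the other table's value of that action.
    target = r + gamma * evaluator[s_next][best]
    learner[s][a] += alpha * (target - learner[s][a])
    return learner[s][a]
```

Because the two tables are trained on different experiences, an action that looks good in one table only through noise is less likely to be overvalued in the other as well.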
Double DQN on Atari
[Figure: normalized score per Atari game, on an axis running from 0 to 6400 with a "Human" reference mark, comparing DQN and Double DQN. Games listed range from Double Dunk and Wizard of Wor to Robotank and Breakout.]
Questions?