Deep Reinforcement Learning and Control: katef/DeepRLFall2018/lecture_planninglearningMCTS.pdf
TRANSCRIPT

[Page 1]
Learning and Planning with Tabular Methods
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Lecture 6, CMU 10703
[Page 2]

Definitions

[Page 3]
Definitions
Planning: any computational process that uses a model to create or improve a policy
Model → Planning → Policy
Learning: the acquisition of knowledge or skills through experience, study, or by being taught.
[Page 4]
Planning examples
• Value iteration
• Policy iteration
• TD-Gammon (look-ahead search)
• AlphaGo (MCTS)
• Chess (heuristic search)
(figure: model-free RL: the agent takes action At, receives reward Rt and state St, and learns v*, q* directly)

[Page 5]
Definitions
Planning: any computational process that uses a model to create or improve a policy, e.g., we compute value functions from simulated experience (action/state trajectories).

Model → Planning → Policy

Learning: the acquisition of knowledge or skills through experience, study, or by being taught, e.g., we learn value functions from real experience (action/state trajectories) using Monte Carlo methods, or we learn a model (transition function).
[Page 6]

What can I learn by interacting with the world?

We will combine both learning and planning:
1. If the model is unknown, we will learn the model.
(figures: Model-Free RL and Model-Based RL agent-environment loops: action At, reward Rt, state St)

This lecture

[Page 7]
What can I learn by interacting with the world?

We will combine both learning and planning:
1. If the model is unknown, we will learn the model.

(Disclaimer: this will be underwhelming: we are still in tabular worlds; essentially we are just going to count. More on this in later lectures.)
(figures: model-free and model-based RL loops)

This lecture

[Page 8]
This lecture

We will combine both learning and planning:
1. If the model is unknown, we will learn the model.
2. Learn/compute value functions using both real experience and the model.
(figures: model-free and model-based RL loops)

[Page 9]
What can I learn by interacting with the world?

We will combine both learning and planning:
1. If the model is unknown, we will learn the model.
2. Learn/compute value functions using both real experience and the model.
3. Compute value functions online (very successful, though so far mostly with ground-truth models).
(figures: model-free and model-based RL loops)

This lecture

[Page 10]
[Page 11]
Model

(diagram: (s, a) → s′, r)
Anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition T(s′|s,a) and reward R(s,a).

This includes transitions of the state of the environment and the state of the agent.

Ground-truth
[Page 12]

Model

(diagram: (s, a) → s′, r)

Anything the agent can use to predict how the environment will respond to its actions; concretely, the state transition T(s′|s,a) and reward R(s,a). This includes transitions of the state of the environment and the state of the agent.

Learnt
[Page 13]

Distribution vs. Sample Models

• Distribution model: lists all possible outcomes and their probabilities, T(s′|s,a) for all (s, a, s′). (We used these in DP.)
• Sample model, a.k.a. a simulator: produces a single outcome (transition), sampled according to its probability of occurring. (We used this in Monte Carlo methods in Blackjack.)

Q: Which one is more powerful? Which one is easier to obtain/learn?
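To make the distinction concrete, here is a minimal Python sketch (the two-state MDP and all names are invented for illustration): a distribution model returns the whole outcome distribution T(s′|s,a), while a sample model only draws one transition at a time.

```python
import random

# Hypothetical tabular MDP: T[s][a] = {s_next: probability}
T = {"A": {"go": {"B": 1.0}},
     "B": {"go": {"B": 0.75, "terminal": 0.25}}}

def distribution_model(s, a):
    """Distribution model: return every possible outcome with its probability."""
    return T[s][a]

def sample_model(s, a, rng=random):
    """Sample model (simulator): return one next state drawn ~ T(s'|s,a)."""
    states, probs = zip(*T[s][a].items())
    return rng.choices(states, weights=probs)[0]

print(distribution_model("B", "go"))  # {'B': 0.75, 'terminal': 0.25}
print(sample_model("B", "go"))        # one random draw, e.g. 'B'
```

A planner can do exact backups with the first interface, but only Monte-Carlo style updates with the second.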
[Page 14]

Paths to a policy

(diagram: Interaction with Environment → Experience; Experience, via Model learning → Model; Model, via Simulation / Planning → Value function; Experience, via Direct RL methods → Value function; Value function, via Greedification → Policy)

Model-based RL
[Page 15]

Paths to a policy

(same diagram)

Model-free RL
[Page 16]

Paths to a policy

(same diagram)

Planning (or model-based RL)
[Page 17]
Advantages of Planning (model-based RL)
Advantages:
• Model learning transfers across tasks and environment configurations (learning physics)
• Better exploits experience in case of sparse rewards
• Helps exploration: Can reason about model uncertainty
Disadvantages:
• First learn the model, then construct a value function: two sources of approximation error
[Page 18]

Paths to a policy

(same diagram)
[Page 19]

Examples of Models for T(s′|s,a)

• Table lookup model (tabular): bookkeeping a probability of occurrence for each transition (s, a, s′)
• Transition function approximated through some function approximator

(figure: Monte-Carlo evaluation in Go: simulations from the current position s, outcomes 1 1 0 0, V(s) = 2/4 = 0.5)

[Page 20]
Examples of Models for T(s′|s,a): the table lookup model (this lecture) vs. a transition function approximated through some function approximator (later)

[Page 21]
Table Lookup Model

• Model is an explicit MDP, T, R
• Count visits N(s, a) to each state-action pair:

  T(s′|s, a) = (1 / N(s, a)) Σ_{t=1}^{T} 1(S_t = s, A_t = a, S_{t+1} = s′)

  R(s, a) = (1 / N(s, a)) Σ_{t=1}^{T} 1(S_t = s, A_t = a) R_t

• Alternatively:
  • At each time-step t, record the experience tuple ⟨S_t, A_t, R_{t+1}, S_{t+1}⟩
  • To sample from the model, randomly pick a tuple matching ⟨s, a, ·, ·⟩

Essentially, model learning here means saving the experience: memorization == learning.
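The counting scheme above can be sketched as follows (a hypothetical minimal implementation; class and variable names are mine):

```python
from collections import defaultdict

class TableLookupModel:
    """Learn T(s'|s,a) and R(s,a) by counting, as in the equations above."""
    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': count}
        self.reward_sum = defaultdict(float)                      # (s,a) -> summed reward
        self.N = defaultdict(int)                                 # N(s,a): visit counts

    def update(self, s, a, r, s_next):
        self.N[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r

    def T(self, s_next, s, a):
        """Empirical transition probability: count(s,a,s') / N(s,a)."""
        return self.next_counts[(s, a)][s_next] / self.N[(s, a)]

    def R(self, s, a):
        """Empirical mean reward for (s,a)."""
        return self.reward_sum[(s, a)] / self.N[(s, a)]

# Feed in two toy transitions from state B:
m = TableLookupModel()
m.update("B", "go", 1, "B")
m.update("B", "go", 0, "end")
print(m.T("B", "B", "go"), m.R("B", "go"))  # 0.5 0.5
```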
[Page 22]

A Simple Example

Two states A, B; no discounting; 8 episodes of experience:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

We have constructed a table lookup model from the experience.
[Page 23]

Planning with a Model

Given a model M_η = ⟨T_η, R_η⟩, solve the MDP ⟨S, A, T_η, R_η⟩ using your favorite planning algorithm:
• Value iteration
• Policy iteration
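As a sketch of "solve the MDP with your favorite planning algorithm", here is value iteration run directly on a learned tabular model (the two-state model and its numbers are invented for illustration):

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_{s'} T(s'|s,a) V(s') ] to a fixed point."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Toy learned model (counts already normalized into probabilities):
states, actions = ["A", "B"], ["go"]
T = {"A": {"go": {"B": 1.0}}, "B": {"go": {"B": 1.0}}}
R = {"A": {"go": 0.0}, "B": {"go": 1.0}}
V = value_iteration(states, actions, T, R, gamma=0.5)
# Fixed point: V(B) = 1 + 0.5 V(B) = 2, V(A) = 0.5 V(B) = 1 (approximately)
```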
[Page 25]

Paths to a policy

(same diagram, with the planning step instantiated as value iteration)

Value iteration
[Page 26]

Planning with a Model: Curse of Dimensionality

Given a model M_η = ⟨T_η, R_η⟩, solve the MDP ⟨S, A, T_η, R_η⟩ using one of our favorite planning algorithms:
• Value iteration
• Policy iteration
• However: we visit every state in each sweep, which assigns equal effort to every state. But states are not created equal: some matter more than others, many never actually occur, and we should not spend energy estimating their value.
• (Often it is impossible to even complete one sweep within one's lifetime. DP does not necessarily need complete state visitation: we can distribute updates asynchronously, in a prioritized manner, etc.)
• Q: What is the right distribution from which to sample states and update their values, so that we maximize the effect of our effort on improving the policy?
[Page 27]

Sample-based Planning

• Use the model only to generate samples, not its transition probabilities and expected immediate rewards
• Sample experience from the model:

  S_{t+1} ∼ T_η(S_{t+1} | S_t, A_t)
  R_{t+1} = R_η(S_t, A_t)

• Apply model-free RL to the samples, e.g.:
  • Monte-Carlo control
  • Sarsa
  • Q-learning
• Sample-based planning methods are often more efficient: rather than exhaustive state sweeps, we focus on what is likely to happen
[Page 28]

Paths to a policy

(same diagram)

Sample-based planning
[Page 29]

A Simple Example

• Construct a table-lookup model from real experience
• Apply model-free RL to sampled experience

Real experience:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

Sampled experience:
B, 1
B, 0
B, 1
A, 0, B, 1
B, 1
A, 0, B, 1
B, 1
B, 0

e.g., Monte-Carlo learning: V(A) = 1, V(B) = 0.75
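We can reproduce the slide's numbers with every-visit Monte-Carlo evaluation on the eight sampled episodes (undiscounted; the episode encoding below is mine):

```python
# Every-visit Monte-Carlo evaluation on the sampled experience above.
# Each episode is a list of (state, reward received on leaving it) pairs.
episodes = [
    [("B", 1)], [("B", 0)], [("B", 1)],
    [("A", 0), ("B", 1)],
    [("B", 1)],
    [("A", 0), ("B", 1)],
    [("B", 1)], [("B", 0)],
]

returns = {}  # state -> list of returns observed from visits to that state
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        G = sum(rewards[i:])  # undiscounted return from this visit onwards
        returns.setdefault(s, []).append(G)

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)  # {'B': 0.75, 'A': 1.0}
```

B is visited 8 times with returns summing to 6 (hence 0.75), and A twice with return 1 each time (hence 1).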
[Page 30]

Combine real and simulated experience

(figures: model-free and model-based RL loops)

1. If the model is unknown, we will learn the model.
2. Learn/compute value functions using both real experience and the model.
3. Learn value functions online using model-based look-ahead search.
[Page 31]

Real and Simulated Experience

We consider two sources of experience:

• Real experience: sampled from the environment (the true MDP)
  S′ ∼ T(s′|s, a)
  R = R(s, a)

• Simulated experience: sampled from the model
  S′ ∼ T_η(S′|S, A)
  R = R_η(S, A)
[Page 32]

Dyna-Q Algorithm

[Page 33]

Dyna-Q Algorithm (algorithm box, with its components annotated: Direct RL, Model learning, Planning)
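A compact sketch of one Dyna-Q iteration, following the structure of the Sutton & Barto algorithm box (the deterministic table-lookup model, the hyperparameters, and the toy two-state chain are assumptions for illustration):

```python
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.95, n=50, rng=random):
    """One Dyna-Q iteration: direct RL update, model learning, then n planning updates."""
    # Direct RL: one-step Q-learning on the real transition
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    # Model learning: deterministic table-lookup model (last observed outcome)
    model[(s, a)] = (r, s_next)
    # Planning: replay n transitions drawn at random from the model
    for _ in range(n):
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions)
                                - Q[(ps, pa)])

# Toy usage on an invented two-state chain:
Q, model = defaultdict(float), {}
dyna_q_step(Q, model, s="A", a="go", r=0, s_next="B", actions=["go"])
dyna_q_step(Q, model, s="B", a="go", r=1, s_next="B", actions=["go"])
```

The planning loop is exactly the direct-RL update applied to simulated transitions, which is why a few real steps plus many replays can stand in for much more real experience.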
[Page 34]

Dyna-Q on a Simple Maze
[Page 35]

Dyna-Q Snapshots: Midway in 2nd Episode
(R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction)

(figure: maze with start S and goal G; learned greedy actions WITHOUT PLANNING (n=0) vs WITH PLANNING (n=50))
[Page 37]

Random sampling is suboptimal

[Page 38]
Prioritized sweeping
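Prioritized sweeping replaces the planning step's uniform random sampling with a priority queue keyed on Bellman error; a simplified sketch (predecessor bookkeeping is minimal, and the deterministic model and all names are assumptions):

```python
import heapq
from collections import defaultdict

def prioritized_sweeping(Q, model, predecessors, actions,
                         alpha=0.5, gamma=0.95, theta=1e-4, n=100):
    """Plan by always updating the (s, a) pair with the largest Bellman error first."""
    pq = []  # max-heap via negated priorities
    for (s, a), (r, s_next) in model.items():
        p = abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        if p > theta:
            heapq.heappush(pq, (-p, (s, a)))
    for _ in range(n):
        if not pq:
            break
        _, (s, a) = heapq.heappop(pq)
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        # Q at state s changed, so its predecessors' errors may now be large: requeue them
        for (ps, pa) in predecessors.get(s, []):
            pr, _ = model[(ps, pa)]
            p = abs(pr + gamma * max(Q[(s, b)] for b in actions) - Q[(ps, pa)])
            if p > theta:
                heapq.heappush(pq, (-p, (ps, pa)))

# Toy chain A -> B -> end, reward 1 on reaching the end (invented numbers):
Q, actions = defaultdict(float), ["go"]
model = {("A", "go"): (0, "B"), ("B", "go"): (1, "end")}
predecessors = {"B": [("A", "go")]}
prioritized_sweeping(Q, model, predecessors, actions)
# Value propagates backwards: first Q(B, go) is updated, then its predecessor Q(A, go)
```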
[Page 39]
Prioritized sweeping vs Random Sampling
[Page 40]

Sampling-based look-ahead search

(figures: model-free and model-based RL loops)

1. If the model is unknown, we will learn the model.
2. Learn value functions using both real and simulated experience.
3. Compute value functions online using model-based look-ahead search.

[Page 41]
Paths to a policy

(same diagram, now with a Forward Search edge: Model → Action (from a given state s))

[Page 42]
Online Planning with Search

1. Build a search tree with the current state of the agent at the root
2. Compute value functions using simulated episodes (reward usually only at the final state, e.g., win or lose)
3. Select the next move to execute
4. Execute it
5. GOTO 1

Forward search algorithms select the best action by look-ahead: they build a search tree with the current state s_t at the root, using a model of the MDP to look ahead.

(figure: search tree rooted at s_t, with terminal states T at the leaves)

No need to solve the whole MDP, just the sub-MDP starting from now.

[Page 43]
Why online planning?

Why don't we learn a value function directly for every state offline, so that we do not waste time online?

• Because the environment has many, many states (consider Go: 10^170, Chess: 10^48, the real world: …)
• It is very hard to compute a good value function for each one of them; most you will never visit
• Thus, conditioned on the current state you are in, try to estimate the value function of the relevant part of the state space online
• Focus your resources on the sub-MDP starting from now; this is often dramatically easier than solving the whole MDP

Any problems with online tree search?

[Page 44]
Curse of dimensionality

• The sub-MDP rooted at the agent's current state may still be very large (too many states are reachable), despite being much smaller than the original one.
• Too many possible actions: large tree branching factor
• Too many steps: large tree depth

I cannot exhaustively search the full tree.

[Page 45]
Curse of dimensionality
Goal of HEX: to make a connected line that links two antipodal points of the grid
[Page 46]
How to handle the curse of dimensionality?

Intelligent search instead of exhaustive search:
A. The depth of the search may be reduced by position evaluation: truncate the search tree at state s and replace the subtree below s with an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s.
B. The breadth of the search may be reduced by sampling actions from a policy p(a|s), a probability distribution over possible moves a in position s, instead of trying every action.
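Both reductions can be combined in one recursive evaluator; the sketch below assumes hypothetical `model`, `value_fn`, and `policy` interfaces (all invented for illustration):

```python
import random

def truncated_search(s, model, value_fn, policy, depth, n_actions=3, rng=random):
    """(A) cut depth by bottoming out into value_fn; (B) cut breadth by sampling moves."""
    if depth == 0 or model.is_terminal(s):
        return value_fn(s)  # replace the subtree below s with v(s) ~ v*(s)
    best = float("-inf")
    for a in policy.sample_moves(s, n_actions, rng):  # only a few sampled moves
        r, s_next = model.step(s, a)
        best = max(best, r + truncated_search(s_next, model, value_fn, policy,
                                              depth - 1, n_actions, rng))
    return best
```

With `depth` bounding the tree height and `n_actions` bounding the branching factor, the work is roughly n_actions^depth instead of |A|^horizon.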
[Page 47]
Position evaluation
We can estimate values for positions in two ways:
• Engineering them using human experts (Deep Blue)
• Learning them from self-play (TD-gammon)
Problems with human engineering:
• tedious and labor-intensive
• non-transferable to other domains
YET: that's how Kasparov was first beaten.
http://stanford.edu/~cpiech/cs221/apps/deepBlue.html
![Page 48: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/48.jpg)
Position evaluation using sampled rollouts a.k.a. Monte Carlo
What policy shall we use to draw our simulations? The cheapest choice is a uniformly random policy.
![Page 49: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/49.jpg)
Monte-Carlo Position Evaluation in Go
From the current position s, run simulated games to termination and record the outcomes:
Outcomes: 1 1 0 0
V(s) = 2/4 = 0.5
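As a concrete illustration (not from the slides), here is a minimal Python sketch of Monte-Carlo position evaluation: estimate V(s) as the average outcome of random playouts. The helpers `step`, `is_terminal`, and `outcome` are hypothetical, game-specific functions.

```python
import random

def mc_board_eval(state, step, is_terminal, outcome, n_rollouts=100):
    """Estimate V(s) as the mean outcome of random rollouts from `state`.

    Assumed (hypothetical) game interface:
      step(s)        -- play one move of the simulation policy, return next state
      is_terminal(s) -- True once the game is over
      outcome(s)     -- 1.0 for a win, 0.0 for a loss
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = state
        while not is_terminal(s):
            s = step(s)          # roll out to the end of the game
        total += outcome(s)      # record the final outcome
    return total / n_rollouts    # V(s) = wins / simulations
```

With the slide's example of four playouts ending 1, 1, 0, 0, this average is exactly 2/4 = 0.5.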
![Page 50: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/50.jpg)
Simplest Monte-Carlo Search
• Given a model M_ν, a root state s_t, and a (most of the time random) simulation policy π
• For each action a ∈ A:
  Q(s_t, a) = MC_boardEval(s'), where s' = T(s_t, a)
• Select the root action: a_t = argmax_{a ∈ A} Q(s_t, a)
![Page 51: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/51.jpg)
Simplest Monte-Carlo Search
• Given a model M_ν and a (most of the time random) simulation policy π
• For each action a ∈ A, simulate K episodes from the current (real) state s_t:
  {s_t, a, R^k_{t+1}, S^k_{t+1}, A^k_{t+1}, ..., S^k_T}_{k=1}^K ~ M_ν, π
• Evaluate the action value of the root by the mean return:
  Q(s_t, a) = (1/K) Σ_{k=1}^K G_t → q_π(s_t, a)  (convergence in probability)
• Select the current (real) action with maximum value: a_t = argmax_{a ∈ A} Q(s_t, a)
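The procedure above can be sketched in a few lines of Python. This is a hedged illustration: `model(s, a)` and `rollout_policy(s)` are assumed interfaces (the model samples a transition, the policy picks a simulation action), and returns are undiscounted sums of rewards as in episodic board games.

```python
def mc_search(model, state, actions, rollout_policy, n_rollouts=50):
    """Simplest Monte-Carlo search: score each root action by the mean
    return of rollouts that start with it, then act greedily at the root.

    Assumed (hypothetical) interface:
      model(s, a)       -- sample (next_state, reward, done)
      rollout_policy(s) -- pick an action for the simulation
    """
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(n_rollouts):
            s, r, done = model(state, a)   # first step fixes the root action
            ret = r
            while not done:                # then follow the rollout policy
                s, r, done = model(s, rollout_policy(s))
                ret += r
            total += ret
        q[a] = total / n_rollouts          # Q(s_t, a) = mean return
    return max(q, key=q.get)               # a_t = argmax_a Q(s_t, a)
```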
![Page 52: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/52.jpg)
Can we do better?
• Could we improve our simulation policy as we accumulate more simulations?
• Yes, we can: we keep track of action values Q not only for the root but also for the nodes internal to the tree we are expanding!
In MCTS the simulation policy improves.
• How should we select actions inside the tree?
• Greedy selection, a_t = argmax_{a ∈ A} Q(s_t, a), doesn't work. Why? It never revisits actions whose estimates happen to start low, so the estimates never get corrected.
![Page 53: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/53.jpg)
K-armed Bandit Problem You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.
Each action has an expected reward:
q*(a) = 𝔼[R_t | A_t = a]
If we knew q*, we would obviously always pick the action with the highest expected reward. These q-values are exactly what we want to estimate.
Note that the state does not change: that is a big difference from what we have seen so far.
![Page 54: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/54.jpg)
K-armed Bandit Problem
Let Q_t(a) denote our estimate of q*(a) at time t.
There are two things we can do each time step:
• Exploit: pick the action with the highest Q_t(a)
• Explore: pick a different action
ε-greedy(Q) is a simple policy that balances exploitation and exploration in some way. However, it does not differentiate between clearly suboptimal and marginally suboptimal actions, nor between actions that have been tried often and those that have not (and thus have unreliable Q estimates).
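A minimal ε-greedy bandit sketch (an illustration, not from the slides), using the standard incremental sample-average update for Q. The reward function `pull(a)` is an assumed interface.

```python
import random

def eps_greedy_bandit(pull, k, steps=1000, eps=0.1):
    """Run ε-greedy on a k-armed bandit with sample-average Q estimates.

    `pull(a)` is an assumed helper that samples a reward for arm a.
    Returns the final value estimates Q_t(a).
    """
    Q = [0.0] * k   # value estimates Q_t(a)
    N = [0] * k     # pull counts N_t(a)
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(k)                  # explore uniformly
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit current best
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental mean: Q += (R - Q)/N
    return Q
```

Note that exploration here is undirected: every non-greedy arm is equally likely, regardless of how promising or well-sampled it is, which is exactly the weakness the slide points out.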
![Page 55: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/55.jpg)
Upper-Confidence Bound
Sample actions according to the following score:
A_t = argmax_a [ Q_t(a) + c √(ln t / N_t(a)) ]
• the bonus is decreasing in the number of visits (explore)
• the score is increasing in a node's value (exploit)
• every option is always tried once (the bonus is infinite when N_t(a) = 0)
Finite-time Analysis of the Multiarmed Bandit Problem, Auer, Cesa-Bianchi, Fischer, 2002
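The UCB1 rule above translates directly into code. A small sketch:

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """UCB1 action selection: A_t = argmax_a [Q(a) + c * sqrt(ln t / N(a))].

    Q[a] is the current value estimate, N[a] the visit count, t the total
    number of pulls so far. Arms with N[a] == 0 get an infinite score, so
    every option is tried at least once.
    """
    def score(a):
        if N[a] == 0:
            return float('inf')   # untried arms are tried first
        return Q[a] + c * math.sqrt(math.log(t) / N[a])
    return max(range(len(Q)), key=score)
```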
![Page 56: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/56.jpg)
Monte-Carlo Tree Search
1. Selection: used for nodes we have seen before; pick actions according to UCB.
2. Expansion: used when we reach the frontier; add one node per playout.
3. Simulation: used beyond the search frontier; don't bother with UCB, just play randomly.
4. Backpropagation: after reaching a terminal node, update values and visit counts for the states touched during selection and expansion.
Bandit based Monte-Carlo Planning, Kocsis and Szepesvári, 2006
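The four phases can be sketched as one loop. This is a minimal UCT-style illustration under an assumed deterministic game interface (`legal`, `step`, `is_terminal`, `outcome` are hypothetical helpers), not the exact algorithm from the paper.

```python
import math
import random

class Node:
    def __init__(self, state):
        self.state, self.children = state, {}   # action -> child Node
        self.N, self.W = 0, 0.0                 # visit count, total reward

    def value(self):
        return self.W / self.N if self.N else 0.0

def mcts(root_state, legal, step, is_terminal, outcome, n_iter=200, c=1.4):
    """UCT sketch. Assumed interface: legal(s) lists moves, step(s, a) gives
    the next state, is_terminal(s) tests for game over, outcome(s) scores a
    finished game in [0, 1]. Returns the most-visited root action."""
    root = Node(root_state)
    for _ in range(n_iter):
        node, path = root, [root]
        # 1. Selection: descend fully expanded nodes by the UCB score.
        while not is_terminal(node.state) and len(node.children) == len(legal(node.state)):
            ucb = lambda a: node.children[a].value() + \
                c * math.sqrt(math.log(node.N) / node.children[a].N)
            node = node.children[max(node.children, key=ucb)]
            path.append(node)
        # 2. Expansion: add one untried child at the frontier.
        if not is_terminal(node.state):
            a = random.choice([a for a in legal(node.state) if a not in node.children])
            node.children[a] = Node(step(node.state, a))
            node = node.children[a]
            path.append(node)
        # 3. Simulation: random playout beyond the frontier.
        s = node.state
        while not is_terminal(s):
            s = step(s, random.choice(legal(s)))
        z = outcome(s)
        # 4. Backpropagation: update visits and rewards along the path.
        for n in path:
            n.N += 1
            n.W += z
    return max(root.children, key=lambda a: root.children[a].N)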
![Page 57: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/57.jpg)
Monte-Carlo Tree Search
(Diagram) While the state is inside the tree, actions are selected; when the state is in the frontier, expansion adds it to the tree.
![Page 58: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/58.jpg)
Monte-Carlo Tree Search
(Diagram) Inside the tree, sample actions based on the UCB score; beyond it, unroll with the simulation policy.
![Page 59: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/59.jpg)
Monte-Carlo Tree Search
![Page 60: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/60.jpg)
Monte-Carlo Tree Search
(Kocsis & Szepesvári, 2006) Gradually grow the search tree by iterating tree-walks. Building blocks:
• Select next action (bandit phase)
• Add a node (grow a leaf of the search tree)
• Select next actions (random phase, roll-out)
• Compute instant reward (evaluate)
• Update information in visited nodes (propagate)
Returned solution: the path visited most often in the explored tree.
![Page 75: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/75.jpg)
Can we do better?
Can we inject prior knowledge into value functions to be estimated and actions to be tried, instead of initializing uniformly?
![Page 76: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/76.jpg)
AlphaGo: Learning-guided MCTS
• Value neural net to evaluate board positions• Policy neural net to select moves• Combine those networks with MCTS
![Page 77: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/77.jpg)
AlphaGo: Action Policies
1. Train two action policies, a cheap rollout policy p_π and an expensive policy p_σ, by mimicking expert moves (standard supervised learning).
2. Then, train a new policy p_ρ with RL and self-play, initialized from the SL policy p_σ.
3. Train a value network that predicts the winner of games played by p_ρ against itself.
![Page 78: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/78.jpg)
Supervised learning of policy networks
• Objective: predicting expert moves
• Input: randomly sampled state-action pairs (s, a) from expert games
• Output: a probability distribution over all legal moves a
SL policy network p_σ: a 13-layer policy network trained from 30 million positions. The network predicted expert moves on a held-out test set with an accuracy of 57.0% using all input features, and 55.7% using only the raw board position and move history as inputs, compared to the state of the art from other research groups of 44.4%.
![Page 79: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/79.jpg)
Reinforcement learning of policy networks p_ρ
• Objective: improve over the SL policy
• Weight initialization from the SL network
• Input: states sampled during self-play
• Output: a probability distribution over all legal moves a
Rewards are provided only at the end of the game: +1 for winning, -1 for losing.
The RL policy network won more than 80% of games against the SL policy network.
![Page 80: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/80.jpg)
Reinforcement learning of value networks
• Objective: estimating a value function v_p(s) that predicts the outcome from position s of games played using policy p for both players (in contrast to min-max search)
• Input: Sampled states during self-play, 30 million distinct positions, each sampled from a separate game, played by the RL policy against itself.
• Output: a scalar value
Trained by regression on state-outcome pairs (s, z) to minimize the mean squared error between the predicted value v(s), and the corresponding outcome z.
![Page 81: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/81.jpg)
MCTS + Policy/Value networks
Selection: select actions within the expanded tree by maximizing Q(s, a) + u(s, a), where the prior over moves is provided by the SL policy and Q(s, a) is the average reward collected so far from MC simulations (the tree policy).
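The published AlphaGo bonus term is u(s, a) ∝ P(s, a) / (1 + N(s, a)), so the prior P from the SL policy dominates early and the empirical Q dominates as visits accumulate. A small sketch of this tree policy (the constant name `c_puct` follows the paper; the list-based interface is an assumption of this sketch):

```python
import math

def puct_select(Q, N, P, c_puct=5.0):
    """AlphaGo-style tree policy: a = argmax_a [Q(s,a) + u(s,a)], with
    u(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)),
    where P(s,a) is the policy network's prior over moves.
    """
    total = math.sqrt(sum(N))  # sqrt of total visits at this node
    scores = [q + c_puct * p * total / (1 + n) for q, n, p in zip(Q, N, P)]
    return max(range(len(Q)), key=lambda a: scores[a])
```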
![Page 82: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/82.jpg)
MCTS + Policy/Value networks
Expansion: when reaching a leaf, expand it by playing the action with the highest score under the SL policy network p_σ.
![Page 83: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/83.jpg)
MCTS + Policy/Value networks
Simulation/Evaluation: use the rollout policy to reach the end of the game.
• From the selected leaf node, run multiple simulations in parallel using the rollout policy p_π
• Evaluate the leaf node by mixing the value network's prediction with the rollout outcome:
  V(s_L) = (1 − λ) v_θ(s_L) + λ z_L
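The mixing step is a one-liner; this sketch just makes the convex combination explicit (the paper's default for the mixing constant was λ = 0.5):

```python
def leaf_value(v_theta, z, lam=0.5):
    """AlphaGo leaf evaluation: V(s_L) = (1 - lam) * v_theta(s_L) + lam * z_L.

    v_theta -- value network's prediction for the leaf position
    z       -- outcome of the rollout(s) from the leaf
    lam     -- mixing constant (0 = value net only, 1 = rollouts only)
    """
    return (1.0 - lam) * v_theta + lam * z
```

Setting lam = 0 recovers the pure value-net evaluation that AlphaGoZero later adopted; lam = 1 recovers plain Monte-Carlo evaluation.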
![Page 84: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/84.jpg)
MCTS + Policy/ Value networksBackup: update visitation counts and recorded rewards for the chosen path inside the tree:
![Page 85: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/85.jpg)
AlphaGoZero: Lookahead search during training!
• So far, lookahead search was used for online planning at test time!
• AlphaGoZero instead uses it during training, for improved exploration during self-play.
• AlphaGo trained the RL policy by playing the current policy network p_ρ against a randomly selected previous iteration of the policy network (for exploration).
• The intelligent exploration in AlphaGoZero removes the need for human supervision.
![Page 86: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/86.jpg)
AlphaGoZero: Lookahead search during training!
• Given any policy, an MCTS guided by this policy will produce an improved policy (MCTS acts as a policy improvement operator)
• Train the policy network to mimic this improved policy
![Page 87: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/87.jpg)
MCTS as policy improvement operator
• Train the policy network so that it mimics this improved policy
• Train the position-evaluation network so that its output matches the game outcome (same as in AlphaGo)
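In AlphaGoZero the improved policy is derived from the MCTS root visit counts, π(a) ∝ N(s, a)^{1/τ} with temperature τ; the policy network is trained toward this distribution. A minimal sketch of computing that training target:

```python
def mcts_policy_target(visit_counts, tau=1.0):
    """AlphaGoZero-style improved policy: pi(a) proportional to N(s,a)^(1/tau).

    visit_counts -- root visit counts N(s,a) from MCTS
    tau          -- temperature; tau -> 0 concentrates on the most-visited move
    The policy network is then trained to mimic this distribution.
    """
    powered = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]
```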
![Page 88: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/88.jpg)
MCTS: no MC rollouts to termination
AlphaGoZero's MCTS always uses value-net evaluations of leaf nodes; no rollouts!
![Page 89: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/89.jpg)
Architectures
• ResNets help
• Jointly training the policy and value function with the same main feature extractor helps
• Lookahead tremendously improves the basic policy
![Page 90: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/90.jpg)
Architectures
• ResNets help
• Jointly training the policy and value function with the same main feature extractor helps
(Plots compare separate policy/value nets vs. joint policy/value nets.)
![Page 91: Deep Reinforcement Learning and Controlkatef/DeepRLFall2018/lecture_planninglearningMCTS.pdfPlanning Learning: the acquisition of knowledge or skills through experience, study, or](https://reader033.vdocument.in/reader033/viewer/2022050609/5fb018d623688b7517582571/html5/thumbnails/91.jpg)
RL vs. SL