Reinforcement Learning: Review
TRANSCRIPT
[Page 1]
Reinforcement Learning: Review
CS 330
[Page 2]
Reminders
- Today: Project proposals due
- Monday next week: Homework 2 due, Homework 3 out
[Page 3]
Why Reinforcement Learning?
Isolated action that doesn't affect the future? Supervised learning?
Common applications (most deployed ML systems): robotics, autonomous driving, language & dialog, business operations, finance
+ a key aspect of intelligence
[Page 4]
The Plan
Reinforcement learning problem
Policy gradients
Q-learning
[Page 5]
The Plan
Reinforcement learning problem
Policy gradients
Q-learning
[Page 6]
| supervised learning | sequential decision making |
| --- | --- |
| object classification | object manipulation |
| i.i.d. data | action affects the next state |
| large labeled, curated dataset | how to collect data? what are the labels? |
| well-defined notions of success | what does success mean? |

[Page 7]
Terminology & notation
1. run away
2. ignore
3. pet
Slide adapted from Sergey Levine
[Page 8]
Imitation Learning
training data → supervised learning
Images: Bojarski et al. '16, NVIDIA
Slide adapted from Sergey Levine
Imitation Learning vs Reinforcement Learning?
[Page 9]
Reward functions
Slide adapted from Sergey Levine
[Page 10]
The goal of reinforcement learning
Slide adapted from Sergey Levine
[Pages 11–12: same slide, continued]
[Page 13]
What is a reinforcement learning task?

Supervised learning. A task: $\mathcal{T}_i \triangleq \{p_i(\mathbf{x}), p_i(\mathbf{y}|\mathbf{x}), \mathcal{L}_i\}$ (data-generating distributions and a loss).

Reinforcement learning. A task: $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(\mathbf{s}_1), p_i(\mathbf{s}'|\mathbf{s}, \mathbf{a}), r_i(\mathbf{s}, \mathbf{a})\}$, a Markov decision process:
- state space $\mathcal{S}_i$
- action space $\mathcal{A}_i$
- initial state distribution $p_i(\mathbf{s}_1)$
- dynamics $p_i(\mathbf{s}'|\mathbf{s}, \mathbf{a})$
- reward $r_i(\mathbf{s}, \mathbf{a})$

much more than the semantic meaning of "task"!

[Page 14]
Example Task Distributions

A task: $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(\mathbf{s}_1), p_i(\mathbf{s}'|\mathbf{s}, \mathbf{a}), r_i(\mathbf{s}, \mathbf{a})\}$

- Character animation, across maneuvers: $r_i(\mathbf{s}, \mathbf{a})$ varies
- Across garments & initial states: $p_i(\mathbf{s}_1)$ and $p_i(\mathbf{s}'|\mathbf{s}, \mathbf{a})$ vary
- Multi-robot RL: $\mathcal{S}_i$, $\mathcal{A}_i$, $p_i(\mathbf{s}_1)$, $p_i(\mathbf{s}'|\mathbf{s}, \mathbf{a})$ vary

[Page 15]
The Plan
Reinforcement learning problem
Policy gradients
Q-learning
[Page 16]
The anatomy of a reinforcement learning algorithm
[Page 17]
Evaluating the objective
Slide adapted from Sergey Levine
[Page 18]
Direct policy differentiation
a convenient identity
Slide adapted from Sergey Levine
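The "convenient identity" is the log-derivative (likelihood-ratio) trick; the slide's equations did not survive extraction, so the following is a standard reconstruction in trajectory notation:

```latex
\pi_\theta(\tau)\,\nabla_\theta \log \pi_\theta(\tau)
  = \pi_\theta(\tau)\,\frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}
  = \nabla_\theta \pi_\theta(\tau),
\quad\text{so}\quad
\nabla_\theta J(\theta)
  = \nabla_\theta \!\int \pi_\theta(\tau)\, r(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(\tau)\, r(\tau)\right].
```

The identity turns the gradient of an expectation into an expectation of a gradient, which can then be estimated from sampled trajectories.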
[Page 19]
Direct policy differentiation
Slide adapted from Sergey Levine
[Page 20]
Evaluating the policy gradient
generate samples (i.e. run the policy)
fit a model to estimate return
improve the policy
Slide adapted from Sergey Levine
[Page 21]
Comparison to maximum likelihood
training data
supervised learning
Slide adapted from Sergey Levine
[Page 22]
What did we just do?
good stuff is made more likely
bad stuff is made less likely
simply formalizes the notion of "trial and error"!
Slide adapted from Sergey Levine
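The trial-and-error update can be made concrete with a minimal REINFORCE sketch. The two-armed bandit below is a hypothetical toy setup of my own, not from the lecture; it shows the reward-weighted log-likelihood step making the rewarded action more likely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: a two-armed bandit treated as a one-step
# RL task. Action 0 pays 0, action 1 pays 1.
REWARDS = np.array([0.0, 1.0])

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

theta = np.zeros(2)  # policy parameters (action logits)
lr = 0.5

for _ in range(200):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)     # run the policy (generate a sample)
    r = REWARDS[a]                 # observe the reward
    grad_log_pi = -probs           # grad of log softmax = one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi  # REINFORCE: reward-weighted log-likelihood step

print(softmax(theta)[1])  # probability of the rewarding action, close to 1
```

Good stuff (action 1) is made more likely in proportion to its reward; action 0 earns zero reward and so triggers no update at all.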
[Page 23]
Policy Gradients

Pros:
+ Simple
+ Easy to combine with existing multi-task & meta-learning algorithms

Cons:
- Produces a high-variance gradient
  - Can be mitigated with baselines (used by all algorithms in practice) and trust regions
- Requires on-policy data: cannot reuse existing experience to estimate the gradient!
  - Importance weights can help, but are also high-variance

[Page 24]
On-policy vs Off-policy

On-policy:
- Data comes from the current policy
- Compatible with all RL algorithms
- Can't reuse data from previous policies

Off-policy:
- Data can come from any policy
- Works with specific RL algorithms
- Much more sample efficient; can reuse old data

[Page 25]
Small note
Reward "to go"
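The reward "to go" is the sum of rewards from the current timestep onward, rather than the whole trajectory's reward. A small sketch (the function name and toy rewards are my own):

```python
import numpy as np

# Reward "to go": at each timestep, sum only the rewards from that step
# onward (optionally discounted).
def reward_to_go(rewards, gamma=1.0):
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(reward_to_go([1.0, 2.0, 3.0]))  # [6. 5. 3.]
```

Weighting each log-probability term by the reward to go, instead of the full trajectory reward, removes terms that the action at time t cannot influence, which lowers the variance of the gradient estimate.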
[Page 26]
The Plan
Reinforcement learning problem
Policy gradients
Q-learning
[Page 27]
The anatomy of a reinforcement learning algorithm
[Page 28]
Improving the policy gradient
Reward "to go"
Slide adapted from Sergey Levine
[Page 29]
State & state-action value functions
Slide adapted from Sergey Levine
[Page 30]
Value-Based RL
Reward = 1 if I can play it in a month, 0 otherwise
[Figure: state $\mathbf{s}_t$ with candidate actions $\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$]
Current $\pi(\mathbf{a}_1|\mathbf{s}) = 1$
Value function: $V^\pi(\mathbf{s}_t) = ?$
Q function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
Advantage function: $A^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
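The standard definitions behind these three questions, written out (my reconstruction, in finite-horizon, undiscounted form consistent with the slide's notation):

```latex
V^\pi(\mathbf{s}_t) = \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t'=t}^{T} r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) \,\middle|\, \mathbf{s}_t\right],
\quad
Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t'=t}^{T} r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) \,\middle|\, \mathbf{s}_t, \mathbf{a}_t\right],
\quad
A^\pi(\mathbf{s}_t, \mathbf{a}_t) = Q^\pi(\mathbf{s}_t, \mathbf{a}_t) - V^\pi(\mathbf{s}_t).
```

The advantage measures how much better a particular action is than what the policy would do on average from the same state.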
[Page 31]
Multi-Step Prediction
- How do you update your predictions about winning the game?
- What happens if you don't finish the game?
- Do you always wait until the end?
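The answer to "do you always wait until the end?" is no: temporal-difference learning updates a prediction from a single transition by bootstrapping off the next state's value. A minimal TD(0) sketch (the function and toy numbers are my own):

```python
import numpy as np

# TD(0) sketch: update a value estimate from a single transition,
# without waiting for the episode (the game) to finish.
def td0_update(V, s, r, s2, alpha=0.1, gamma=0.9):
    # move V(s) toward the bootstrapped target r + gamma * V(s')
    V[s] += alpha * (r + gamma * V[s2] - V[s])
    return V

V = td0_update(np.zeros(3), s=0, r=1.0, s2=1)
print(V)  # V[0] becomes 0.1; the other entries stay 0
```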
[Page 32]
How can we use all of this to fit a better estimator?
Slide adapted from Sergey Levine
Goal:
[Page 33]
Policy evaluation examples
TD-Gammon, Gerald Tesauro 1992 AlphaGo, Silver et al. 2016
[Page 34]
Slide adapted from Sergey Levine
[Page 35]
This was just the prediction partโฆ
[Page 36]
Improving the Policy
how good is an action compared to the policy?
[Page 37]
Value-Based RL
Reward = 1 if I can play it in a month, 0 otherwise
[Figure: state $\mathbf{s}_t$ with candidate actions $\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$]
Current $\pi(\mathbf{a}_1|\mathbf{s}) = 1$
Value function: $V^\pi(\mathbf{s}_t) = ?$
Q function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
Advantage function: $A^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
How can we improve the policy?
[Page 38]
Improving the Policy
Slide adapted from Sergey Levine
[Page 39]
Slide adapted from Sergey Levine
Policy Iteration
[Page 40]
Slide adapted from Sergey Levine
Value Iteration
approximates the new value!
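Value iteration can be sketched in a few lines on a toy MDP. The 3-state chain below is a hypothetical example of my own, not from the lecture; it shows the backup V(s) ← max_a [r(s, a) + γ V(s')]:

```python
import numpy as np

# Hypothetical 3-state chain: states 0, 1, 2; actions 0 = left, 1 = right;
# deterministic dynamics; state 2 is absorbing.
gamma = 0.9
next_state = np.array([[0, 1],    # from state 0
                       [0, 2],    # from state 1
                       [2, 2]])   # state 2: absorbing
reward = np.zeros((3, 2))
reward[1, 1] = 1.0                # stepping into the goal pays 1

V = np.zeros(3)
for _ in range(100):
    # value iteration backup: V(s) <- max_a [ r(s, a) + gamma * V(s') ]
    V = (reward + gamma * V[next_state]).max(axis=1)

print(V)  # converges to [0.9, 1.0, 0.0]
```

Note that this backup needs the dynamics table `next_state`; the next slide's point is that Q-learning removes exactly that requirement.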
[Page 41]
Q learning
doesn't require simulation of actions!
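A tabular Q-learning sketch makes this point concrete: the learner sees only sampled transitions (s, a, r, s') and never queries or simulates the dynamics. The toy environment is my own hypothetical example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state chain, exposed only through sampled transitions:
# the learner never sees the dynamics model itself.
def step(s, a):
    nxt = [[0, 1], [0, 2], [2, 2]]
    r = 1.0 if (s == 1 and a == 1) else 0.0
    return r, nxt[s][a]

gamma, alpha = 0.9, 0.5
Q = np.zeros((3, 2))
for _ in range(2000):
    s = rng.integers(3)   # exploratory behavior: uniform states
    a = rng.integers(2)   # and uniform actions (off-policy data)
    r, s2 = step(s, a)
    # Q-learning: move Q(s, a) toward the target r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

print(Q.max(axis=1))  # approaches the optimal values [0.9, 1.0, 0.0]
```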
[Page 42]
Value-Based RL
Reward = 1 if I can play it in a month, 0 otherwise
[Figure: state $\mathbf{s}_t$ with candidate actions $\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$]
Current $\pi(\mathbf{a}_1|\mathbf{s}) = 1$
Value function: $V^\pi(\mathbf{s}_t) = ?$
Q function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
Q* function: $Q^*(\mathbf{s}_t, \mathbf{a}_t) = ?$
Value* function: $V^*(\mathbf{s}_t) = ?$
[Page 43]
Fitted Q-iteration Algorithm
Slide adapted from Sergey Levine
Algorithm hyperparameters
This is not a gradient descent algorithm!
Result: get a policy $\pi(\mathbf{a}|\mathbf{s})$ from $\arg\max_{\mathbf{a}} Q_\phi(\mathbf{s}, \mathbf{a})$
Important notes:
- We can reuse data from previous policies (an off-policy algorithm, using replay buffers)!
- Can be readily extended to multi-task/goal-conditioned RL
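The fitted Q-iteration loop (collect off-policy data, compute Bellman targets, fit Q to the targets, repeat) can be sketched as follows. The 5-state corridor environment is a hypothetical example of my own, and for clarity the "function approximator" is a table fit by exact averaging rather than a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-state corridor: move left/right; reward 1 for entering
# (or staying at) state 4.
def env_step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), 4)
    return (1.0 if s2 == 4 else 0.0), s2

# 1) fill a replay buffer with off-policy data from a random behavior policy
buffer = []
for _ in range(2000):
    s, a = int(rng.integers(5)), int(rng.integers(2))
    r, s2 = env_step(s, a)
    buffer.append((s, a, r, s2))

gamma = 0.9
Q = np.zeros((5, 2))   # tabular Q for clarity; a neural net in practice
for _ in range(50):
    # 2) Bellman targets: y = r + gamma * max_a' Q(s', a')
    # 3) "fit" Q to the targets (exact per-(s, a) regression = averaging)
    total = np.zeros_like(Q)
    count = np.zeros_like(Q)
    for s, a, r, s2 in buffer:
        total[s, a] += r + gamma * Q[s2].max()
        count[s, a] += 1
    Q = total / np.maximum(count, 1)

policy = Q.argmax(axis=1)
print(policy)  # the greedy policy moves right in every state
```

The buffer is collected once from a random policy and reused for every fitting iteration, which is exactly the off-policy property the slide highlights.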
[Page 44]
Example: Q-learning Applied to Robotics
Continuous action space?
Simple optimization algorithm -> Cross Entropy Method (CEM)
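With continuous actions, the argmax over actions is itself an optimization problem, and CEM solves it by iteratively refitting a sampling distribution to the best candidates. A minimal 1-D sketch, where the quadratic `q` is a hypothetical stand-in for a learned Q(s, a):

```python
import numpy as np

rng = np.random.default_rng(0)

# CEM sketch for maximizing a Q-function over a continuous 1-D action.
def q(a):
    return -(a - 0.7) ** 2   # hypothetical Q, maximized at a = 0.7

mu, sigma = 0.0, 1.0
for _ in range(20):
    samples = rng.normal(mu, sigma, size=64)        # sample candidate actions
    elite = samples[np.argsort(q(samples))[-6:]]    # keep the top ~10%
    mu, sigma = elite.mean(), elite.std() + 1e-6    # refit the sampling Gaussian

print(mu)  # close to 0.7
```

CEM is derivative-free, so it works even when Q is a non-differentiable or highly non-convex function of the action, which is why QT-Opt uses it.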
[Page 45]
QT-Opt: Q-learning at Scale
In-memory buffers
Bellman updaters
Stored data from all past experiments
Training jobs
CEM optimization
QT-Opt: Kalashnikov et al. '18, Google Brain
Slide adapted from D. Kalashnikov
[Page 46]
QT-Opt: Setup and Results
7 robots collected 580k grasps
96% success rate on unseen test objects!
[Page 47]
Bellman equation: $Q^*(\mathbf{s}, \mathbf{a}) = \mathbb{E}_{\mathbf{s}' \sim p(\cdot\,|\,\mathbf{s}, \mathbf{a})}\left[ r(\mathbf{s}, \mathbf{a}) + \gamma \max_{\mathbf{a}'} Q^*(\mathbf{s}', \mathbf{a}') \right]$
Q-learning
Pros:
+ More sample efficient than on-policy methods
+ Can incorporate off-policy data (including a fully offline setting)
+ Can update the policy even without seeing the reward
+ Relatively easy to parallelize

Cons:
- Lots of "tricks" needed to make it work
- Potentially harder to learn than just a policy
[Page 48]
The Plan
Reinforcement learning problem
Policy gradients
Q-learning
[Page 49]
Additional RL Resources
Stanford CS234: Reinforcement Learning
UCL Course from David Silver: Reinforcement Learning
Berkeley CS285: Deep Reinforcement Learning