Model-Based Reinforcement Learning
CS 294-112: Deep Reinforcement Learning
Sergey Levine
Class Notes
1. Project proposal due today!
2. Remember to start early on Homework 3!
Overview

1. Last lecture: choose good actions autonomously by backpropagating (or planning) through known system dynamics (e.g. known physics)
2. Today: what do we do if the dynamics are unknown?
   a. Fitting global dynamics models (“model-based RL”)
   b. Fitting local dynamics models
3. Friday: learning dynamics for high-dimensional observations, such as images
4. Following Wednesday: combining optimal control and policy search to train neural network policies with the aid of optimal control
Today’s Lecture

1. Overview of model-based RL
   • Learn only the model
   • Learn model & policy
2. What kind of models can we use?
3. Global models and local models
4. Learning with local models and trust regions

• Goals:
   • Understand the terminology and formalism of model-based RL
   • Understand the options for models we can use in model-based RL
   • Understand practical considerations of model learning
• Not much deep RL today; we’ll see more advanced model-based RL later!
Why learn the model?
Does it work? Yes!
• Essentially how system identification works in classical robotics
• Some care should be taken to design a good base policy
• Particularly effective if we can hand-engineer a dynamics representation using our knowledge of physics, and fit just a few parameters
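To make this concrete, here is a minimal sketch of this recipe with a hand-engineered dynamics representation that is linear in a few unknown parameters, so the fit reduces to least squares (the `env`, `base_policy`, and `features` interfaces are hypothetical stand-ins, not from the lecture):

```python
import numpy as np

def collect_transitions(env, base_policy, n_steps=1000):
    """Run the base policy and record (s, a, s') transitions."""
    s = env.reset()
    data = []
    for _ in range(n_steps):
        a = base_policy(s)
        s_next = env.step(a)          # hypothetical env: step returns the next state
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_parameters(data, features):
    """Least-squares fit of theta, assuming s' - s ≈ features(s, a) @ theta.

    features(s, a) returns a (dim_s, n_params) matrix, so stacking all
    transitions gives one big linear regression problem.
    """
    Phi = np.concatenate([features(s, a) for s, a, _ in data])  # (N * dim_s, n_params)
    y = np.concatenate([s_next - s for s, _, s_next in data])   # (N * dim_s,)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```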
Does it work? No!
• The distribution mismatch problem becomes exacerbated as we use more expressive model classes
(figure annotation: “go right to get higher!”; the model extrapolates incorrectly outside the training data)
Can we do better?
What if we make a mistake?
Can we do better?
(diagram label: every N steps)
This will be on HW4!
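Since HW4 covers this, here is only a minimal sketch of this iterative (version 1.0) loop, reusing `collect_transitions` from the earlier sketch; `fit_model` and `plan` are hypothetical. The idea: alternate between fitting the model on all data gathered so far and collecting new data by executing plans under the current model, which addresses the distribution mismatch.

```python
def model_based_rl_v1(env, base_policy, fit_model, plan, n_iters=10, n_steps=200):
    """Version 1.0: iterate model fitting and on-plan data collection."""
    data = collect_transitions(env, base_policy, n_steps)  # seed with base-policy data
    for _ in range(n_iters):
        model = fit_model(data)          # supervised regression on (s, a, s') tuples
        s = env.reset()
        for a in plan(model, s):         # open-loop action sequence from the planner
            s_next = env.step(a)
            data.append((s, a, s_next))  # aggregate data from the planner's distribution
            s = s_next
    return fit_model(data)
```

Executing the whole open-loop plan is the weakness noted in the summary below; version 1.5 replans at each step instead.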
How to replan?
• The more you replan, the less perfect each individual plan needs to be
• Can use shorter horizons
• Even random sampling can often work well here!
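A minimal sketch of this random-sampling variant of MPC (all interfaces hypothetical; `model` is the learned one-step predictor, `reward` the known reward function): sample short action sequences, score them under the model, execute only the first action of the best one, then replan at the next step.

```python
import numpy as np

def mpc_random_shooting(model, reward, s, horizon=15, n_candidates=1000, act_dim=2):
    """Pick the first action of the best randomly sampled sequence under the model."""
    best_a, best_ret = None, -np.inf
    for _ in range(n_candidates):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, act_dim))  # assumes bounded actions
        s_sim, ret = s, 0.0
        for a in seq:
            s_sim = model(s_sim, a)   # simulate forward under the learned dynamics
            ret += reward(s_sim, a)
        if ret > best_ret:
            best_a, best_ret = seq[0], ret
    return best_a                     # execute this action, then replan
```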
That seems like a lot of work…
Backpropagate directly into the policy?
(diagram: backprop through the policy, dynamics, and reward at each time step)
easy for deterministic policies, but also possible for stochastic policies (more on this later)
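A minimal PyTorch sketch of this idea, assuming a deterministic policy and differentiable learned dynamics (`policy`, `dynamics`, `reward`, and `opt` are hypothetical stand-ins): unroll the model, sum rewards, and backpropagate into the policy parameters.

```python
import torch

def policy_backprop_step(policy, dynamics, reward, opt, s0, horizon=20):
    """One gradient step on a deterministic policy through the unrolled model."""
    s, total_reward = s0, 0.0
    for _ in range(horizon):
        a = policy(s)                  # deterministic action a = pi_theta(s)
        s = dynamics(s, a)             # differentiable one-step prediction
        total_reward = total_reward + reward(s, a)
    loss = -total_reward               # maximize predicted return
    opt.zero_grad()
    loss.backward()                    # gradients flow back through the whole rollout
    opt.step()
    return float(loss)
```

As the summary below warns, these long chains of backprop can be numerically unstable, much like training a very deep network.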
Summary

• Version 0.5: collect random samples, train dynamics, plan
   • Pro: simple, no iterative procedure
   • Con: distribution mismatch problem
• Version 1.0: iteratively collect data, replan, collect data
   • Pro: simple, solves distribution mismatch
   • Con: open-loop plan might perform poorly, especially in stochastic domains
• Version 1.5: iteratively collect data using MPC (replan at each step)
   • Pro: robust to small model errors
   • Con: computationally expensive, and requires a planning algorithm to be available
• Version 2.0: backpropagate directly into the policy
   • Pro: computationally cheap at runtime
   • Con: can be numerically unstable, especially in stochastic domains (more on this later)
Case study: model-based policy search with GPs
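The detailed slides for this case study were not captured, but the core ingredient is a Gaussian process dynamics model p(s' | s, a). As an illustrative sketch (not the case study's actual code), one GP per next-state dimension can be fit with scikit-learn; the data arrays here are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_gp_dynamics(S, A, S_next):
    """Fit one GP per next-state dimension on inputs (s, a)."""
    X = np.hstack([S, A])             # S: (N, ds), A: (N, da), S_next: (N, ds)
    gps = []
    for i in range(S_next.shape[1]):
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(X, S_next[:, i])
        gps.append(gp)
    return gps

def predict_next_state(gps, s, a):
    """Posterior mean and std; the std is the model's uncertainty estimate."""
    x = np.concatenate([s, a])[None, :]
    out = [gp.predict(x, return_std=True) for gp in gps]
    mean = np.array([m[0] for m, _ in out])
    std = np.array([sd[0] for _, sd in out])
    return mean, std
```

The appeal of GPs here is data efficiency: the posterior uncertainty indicates where the learned model should not be trusted.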
What kind of models can we use?

• Gaussian process
• neural network (image: Punjani & Abbeel ’14)
• other
• video prediction? (more on this later in the course)
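For the neural network option, a minimal PyTorch sketch (the architecture and the predict-the-state-change parameterization are illustrative choices, not taken from the slides):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Small MLP predicting the state change: s' ≈ s + f_theta(s, a)."""
    def __init__(self, ds, da, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ds + da, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, ds),
        )

    def forward(self, s, a):
        return s + self.net(torch.cat([s, a], dim=-1))

# Training is plain supervised regression on observed transitions, e.g.
# loss = ((model(s, a) - s_next) ** 2).mean(), minimized with Adam.
```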
![Page 22: Model-Based Reinforcement Learning](https://reader030.vdocument.in/reader030/viewer/2022012722/61b40ab1ef1f1e7c4c350362/html5/thumbnails/22.jpg)
ever
y N
ste
ps
![Page 23: Model-Based Reinforcement Learning](https://reader030.vdocument.in/reader030/viewer/2022012722/61b40ab1ef1f1e7c4c350362/html5/thumbnails/23.jpg)
![Page 24: Model-Based Reinforcement Learning](https://reader030.vdocument.in/reader030/viewer/2022012722/61b40ab1ef1f1e7c4c350362/html5/thumbnails/24.jpg)
Break
The trouble with global models

• Planner will seek out regions where the model is erroneously optimistic
• Need to find a very good model in most of the state space to converge on a good solution
• In some tasks, the model is much more complex than the policy
Local models
What controller to execute?
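The extracted text lost the slide's answer; reconstructing it from the local-models approach this lecture follows (e.g. Levine & Abbeel ’14) rather than quoting the slide: execute the time-varying linear-Gaussian controller produced by iLQR, with exploration noise injected where the cost is least sensitive to the action. One standard choice is

$$
p(u_t \mid x_t) = \mathcal{N}\big(\hat{u}_t + k_t + K_t (x_t - \hat{x}_t),\; \Sigma_t\big),
\qquad \Sigma_t = Q_{u_t, u_t}^{-1},
$$

where $Q_{u_t,u_t}$ is the action-action block of the Q-function Hessian from the iLQR backward pass.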
How to fit the dynamics?
Can we do better?
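A minimal sketch of the straightforward answer: fit a time-varying linear-Gaussian model $p(x_{t+1} \mid x_t, u_t) = \mathcal{N}(A_t x_t + B_t u_t + c_t,\, N_t)$ by linear regression at each time step, pooling samples across rollouts (the array shapes are my assumption):

```python
import numpy as np

def fit_local_dynamics(X, U):
    """Per-step least squares: x_{t+1} ≈ A_t x_t + B_t u_t + c_t.

    X: states, shape (n_rollouts, T + 1, dx); U: actions, shape (n_rollouts, T, du).
    """
    n, T, du = U.shape
    dx = X.shape[2]
    params = []
    for t in range(T):
        Z = np.concatenate([X[:, t], U[:, t], np.ones((n, 1))], axis=1)  # (n, dx+du+1)
        Y = X[:, t + 1]                                                  # (n, dx)
        W, *_ = np.linalg.lstsq(Z, Y, rcond=None)                        # (dx+du+1, dx)
        A_t, B_t, c_t = W[:dx].T, W[dx:dx + du].T, W[-1]
        resid = Y - Z @ W
        N_t = resid.T @ resid / max(n - 1, 1)                            # residual covariance
        params.append((A_t, B_t, c_t, N_t))
    return params
```

The “can we do better?” points at the weakness of this fit when only a handful of rollouts are available per iteration; one common refinement is Bayesian linear regression with a global model (e.g. a GP) as the prior.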
What if we go too far?
How to stay close to old controller?
KL-divergences between trajectories
• Turns out to work very similarly to the trust region for policy gradient methods
dynamics & initial state are the same!
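Spelling out the step this annotation refers to: with $p(\tau) = p(x_1) \prod_t \pi(u_t \mid x_t)\, p(x_{t+1} \mid x_t, u_t)$ and $\bar{p}(\tau)$ differing only in the controller, the shared dynamics and initial-state terms cancel in the log-ratio:

$$
D_{\mathrm{KL}}\big(p(\tau) \,\|\, \bar{p}(\tau)\big)
= E_{p(\tau)}\Big[\sum_{t=1}^{T} \log \pi(u_t \mid x_t) - \log \bar{\pi}(u_t \mid x_t)\Big].
$$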
(in the expansion, the E[log π(u_t|x_t)] term is the negative entropy of the new controller)
Digression: dual gradient descent
how to maximize the dual g(λ)? Compute the gradient!
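For reference, the standard dual gradient descent recipe being sketched here: for $\min_x f(x)$ subject to $C(x) = 0$, with Lagrangian $\mathcal{L}(x, \lambda) = f(x) + \lambda\, C(x)$ and dual $g(\lambda) = \min_x \mathcal{L}(x, \lambda)$, repeat:

$$
x^{\star} \leftarrow \arg\min_x \mathcal{L}(x, \lambda);
\qquad
\frac{dg}{d\lambda} = C(x^{\star});
\qquad
\lambda \leftarrow \lambda + \alpha\, C(x^{\star}).
$$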
DGD with iterative LQR
this is the hard part, everything else is easy!
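Concretely, the hard part is the inner minimization of the Lagrangian over the controller. Dividing the Lagrangian by $\lambda$ turns it into a problem that maximum-entropy iLQR can solve with a surrogate cost (a reconstruction following Levine & Abbeel ’14, since the slide equations were not extracted):

$$
\tilde{c}(x_t, u_t) = \frac{1}{\lambda}\, c(x_t, u_t) - \log \bar{\pi}(u_t \mid x_t),
$$

after which $\lambda$ is updated by dual gradient ascent on the constraint violation $D_{\mathrm{KL}} - \epsilon$.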
Trust regions & trajectory distributions
• Bounding KL-divergences between two policies or controllers, whether linear-Gaussian or more complex (e.g. neural networks), is really useful
• Bounding the KL-divergence between policies is equivalent to bounding the KL-divergence between their trajectory distributions
Example: local models & iterative LQR
Example: local models with images