[Slide 1]
Intro to RL & Policy Gradient
Xuchan (Jenny) Bao
CSC421 Tutorial 10, Mar 26/28, 2019
[Slide 2]
Outline:
- Brief intro to RL
- Policy Gradient
  - The log-derivative trick
  - Practical fixes: baseline & temporal structure
- OpenAI Gym
- Example: policy gradient on Gym environments
- References
Slides on the intro & policy gradient are from / inspired by Deep RL Bootcamp Lecture 4A: Policy Gradients by Pieter Abbeel: https://www.youtube.com/watch?v=S_gwYj1Q-44
[Slide 3]
Brief Intro to RL
Represent the agent with a stochastic policy π_θ(a | s): the probability of taking action a in state s, parameterized by θ.
From Sutton & Barto “Reinforcement Learning: An Introduction”, 1998
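Not from the original slides: below is a minimal sketch of such a stochastic policy in PyTorch, mapping a state vector to a categorical distribution over discrete actions. The class name, layer sizes, and hidden width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Map a state vector to a categorical distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # outputs unnormalized logits
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

# Sampling an action and its log-probability (both used later):
policy = PolicyNet(state_dim=4, n_actions=2)
dist = policy(torch.zeros(4))
action = dist.sample()
log_prob = dist.log_prob(action)
```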
[Slides 4 and 5: figures from Deep RL Bootcamp Lecture 4A: Policy Gradients, Pieter Abbeel, https://www.youtube.com/watch?v=S_gwYj1Q-44]
[Slide 6]
Policy Gradient
Suppose we have a trajectory τ = (s_0, a_0, s_1, a_1, …, s_{T-1}, a_{T-1}, s_T),
and represent the reward for the whole trajectory as R(τ) = Σ_{t=0}^{T-1} r(s_t, a_t).
The expected reward under policy π_θ (the utility function):
U(θ) = E[R(τ); π_θ] = Σ_τ P(τ; θ) R(τ)
The goal is to find the parameters that maximize the utility function: θ* = argmax_θ U(θ).
[Slide 7]
Policy Gradient: the log-derivative trick
Take the gradient:
∇_θ U(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)
         = Σ_τ ∇_θ P(τ; θ) R(τ)
         = Σ_τ P(τ; θ) (∇_θ P(τ; θ) / P(τ; θ)) R(τ)
         = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) R(τ)
         = E_{τ∼π_θ}[∇_θ log P(τ; θ) R(τ)]
The key step is the identity ∇_θ log P(τ; θ) = ∇_θ P(τ; θ) / P(τ; θ): the log-derivative trick.
[Slide 8]
Policy Gradient: approximate gradient with samples
Now we can approximate the gradient using Monte Carlo samples:
∇_θ U(θ) ≈ (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) R(τ^(i))
where τ^(1), …, τ^(m) are sample rollout trajectories under policy π_θ.
[Slide 9]
Policy Gradient: approximate gradient with samples
Take a moment to appreciate this. The gradient approximation is valid even when:
- The reward is discontinuous / unknown
- The sample space is a discrete set
(A small numerical sketch of this follows.)
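To see this concretely, here is a small numerical sketch (my addition, not from the slides): the score-function estimator applied to a categorical distribution over a discrete set, with a black-box reward table that the gradient never differentiates through. The reward values, step size, and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Black-box reward over the discrete sample space {0, 1, 2}:
# no gradient ever flows through it, and it need not be continuous.
reward = np.array([1.0, -1.0, 3.0])

theta = np.zeros(3)  # logits of a categorical distribution

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(1000):
    p = softmax(theta)
    x = rng.choice(3, size=64, p=p)      # Monte Carlo samples
    # grad of log p(x) for a softmax-categorical: one_hot(x) - p
    grad_logp = np.eye(3)[x] - p
    grad = (grad_logp * reward[x, None]).mean(axis=0)
    theta += 0.1 * grad                  # gradient ascent on E[reward]

print(softmax(theta))  # puts most probability on outcome 2 (reward 3.0)
```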
[Slide 10]
Policy Gradient: intuition
The gradient tries to:
- Increase probability of paths with positive rewards
- Decrease probability of paths with negative rewards
Does NOT try to change the paths themselves.
See any problems here?
[Slide 11]
Decomposing the paths into states & actions
∇_θ log P(τ^(i); θ) = ∇_θ [Σ_t log P(s_{t+1}^(i) | s_t^(i), a_t^(i)) + Σ_t log π_θ(a_t^(i) | s_t^(i))]
                    = Σ_t ∇_θ log π_θ(a_t^(i) | s_t^(i))
The dynamics term log P(s_{t+1} | s_t, a_t) is not a function of θ, so its gradient vanishes: no model of the environment is needed.
[Slide 12]
Policy Gradient: problems and fixes
The vanilla policy gradient estimator is unbiased, but very noisy.
- Requires lots of samples to make it work
Fixes:
- Baseline
- Temporal structure
- Other (e.g. KL trust region)
[Slide 13]
Policy Gradient: baseline
Subtracting a baseline b from the reward does not change the optimization problem:
∇_θ U(θ) ≈ (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) (R(τ^(i)) − b)
- The gradient estimate is still unbiased, but has lower variance (a short code sketch follows the list below)
Intuition: we want to adjust path probabilities based on how the path reward compares to the average, not the path reward itself.
- Increase probability if the path reward is higher than average
- Decrease probability if the path reward is lower than average
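A minimal sketch of the baseline in code (my addition, not from the slides), using the batch-average return as the constant baseline b; the function name and tensor shapes are illustrative:

```python
import torch

def policy_gradient_loss(log_probs, returns):
    """log_probs: (m,) sum_t of log pi_theta(a_t|s_t) per trajectory.
    returns:   (m,) total reward R(tau) per trajectory."""
    baseline = returns.mean()        # constant baseline b
    advantages = returns - baseline  # compare each path to the average
    # Minimizing this loss performs gradient ascent on U(theta).
    return -(log_probs * advantages).mean()
```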
[Slide 14]
Policy Gradient: temporal structure
Put together what we have:
∇_θ U(θ) ≈ (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i)) (Σ_{k=t}^{T-1} r(s_k^(i), a_k^(i)) − b)
Past reward does not affect the current action, so each ∇_θ log π_θ(a_t | s_t) term is weighted only by the reward-to-go from step t onward.
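A short sketch (my addition) of the resulting reward-to-go computation for one episode; the discount factor gamma is an assumption beyond the undiscounted formula above:

```python
def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum over k >= t of gamma**(k - t) * r_k, for each step t."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

assert rewards_to_go([1.0, 1.0, 1.0]) == [3.0, 2.0, 1.0]
```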
[Slide 15]
OpenAI Gym
https://gym.openai.com/
Widely-used testing platform for RL algorithms.
● pip install gym
Different kinds of environments, including discrete / continuous control, pixel-input Atari games, etc.
You can also create your own environments, following the Gym interface.
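For illustration (my addition, assuming the classic pre-0.26 Gym API used on these slides), a toy custom environment only needs to implement the same interface:

```python
import gym
import numpy as np
from gym import spaces

class CoinFlipEnv(gym.Env):
    """Toy environment: guess a coin flip; episode ends after one step."""
    def __init__(self):
        self.observation_space = spaces.Discrete(1)  # a single dummy state
        self.action_space = spaces.Discrete(2)       # guess heads or tails
        self.rng = np.random.default_rng()

    def reset(self):
        return 0                                     # the only observation

    def step(self, action):
        coin = self.rng.integers(2)
        reward = 1.0 if action == coin else -1.0
        done = True                                  # one-step episodes
        return 0, reward, done, {}
```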
[Slide 16]
OpenAI Gym environments
Create an environment:
● env = gym.make("<environment_name>"), e.g. env = gym.make("CartPole-v1")
Env methods you will need the most:
● state = env.reset()
● next_state, reward, done, info = env.step(action)
● env.seed(seed=None)
● env.close()
More documentation at https://gym.openai.com/docs/
Useful attributes:
● env.observation_space
● env.action_space
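Putting these methods and attributes together, a minimal random-agent rollout looks like this (a sketch, assuming the classic Gym API shown above):

```python
import gym

env = gym.make("CartPole-v1")
env.seed(0)                              # make rollouts reproducible

state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # random action for illustration
    state, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)
```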
[Slide 17]
Example: Policy Gradient in PyTorch on a Gym Environment (CartPole-v1)
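The code from this part of the tutorial is not in the transcript; below is a minimal REINFORCE-style sketch in its spirit, combining the pieces above (log-derivative trick, reward-to-go, batch-mean baseline). It assumes the classic Gym API; the network architecture, learning rate, and episode count are illustrative.

```python
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
env.seed(0)
torch.manual_seed(0)

policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 128),
    nn.ReLU(),
    nn.Linear(128, env.action_space.n),  # logits over discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def run_episode():
    """Roll out one episode; collect per-step log-probs and rewards."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    return log_probs, rewards

for episode in range(500):
    log_probs, rewards = run_episode()
    # Discounted reward-to-go at each step (temporal structure).
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns = torch.tensor(list(reversed(returns)))
    advantages = returns - returns.mean()      # batch-mean baseline
    loss = -(torch.stack(log_probs) * advantages).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if episode % 50 == 0:
        print(f"episode {episode}: return {sum(rewards):.0f}")

env.close()
```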
[Slide 18]
References
- Sutton & Barto, "Reinforcement Learning: An Introduction", 1998
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning", 1992
- Sutton et al., "Policy Gradient Methods for Reinforcement Learning with Function Approximation", 1999
- Pieter Abbeel, Deep RL Bootcamp Lecture 4A: Policy Gradients, https://www.youtube.com/watch?v=S_gwYj1Q-44