[Slide 1]
Intro to RL & Policy Gradient
Xuchan (Jenny) Bao
CSC421 Tutorial 10, Mar 26/28, 2019
[Slide 2]
Outline:
- Brief intro to RL
- Policy Gradient
  - The log-derivative trick
  - Practical fixes: baseline & temporal structure
- OpenAI Gym
- Example: policy gradient on Gym environments
- References
Slides on the intro & policy gradient are from / inspired by Deep RL Bootcamp Lecture 4A: Policy Gradients by Pieter Abbeel: https://www.youtube.com/watch?v=S_gwYj1Q-44
[Slide 3]
Brief Intro to RL
Represent the agent with a stochastic policy π_θ(a | s): the probability of taking action a in state s, parameterized by θ.
From Sutton & Barto “Reinforcement Learning: An Introduction”, 1998
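Not from the original slides: below is a minimal sketch of such a stochastic policy in PyTorch, mapping a state vector to a categorical distribution over discrete actions. The class name, layer sizes, and hidden width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Map a state vector to a categorical distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # outputs unnormalized logits
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

# Sampling an action and its log-probability (both used later):
policy = PolicyNet(state_dim=4, n_actions=2)
dist = policy(torch.zeros(4))
action = dist.sample()
log_prob = dist.log_prob(action)
```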
[Slides 4 and 5: figures from Deep RL Bootcamp Lecture 4A: Policy Gradients, Pieter Abbeel, https://www.youtube.com/watch?v=S_gwYj1Q-44]
[Slide 6]
Policy Gradient
Suppose we have a trajectory τ = (s_0, a_0, s_1, a_1, …, s_{T-1}, a_{T-1}, s_T),
and represent the reward for the whole trajectory as R(τ) = Σ_{t=0}^{T-1} r(s_t, a_t).
The expected reward under policy π_θ (the utility function):
U(θ) = E[R(τ); π_θ] = Σ_τ P(τ; θ) R(τ)
The goal is to find the parameters that maximize the utility function: θ* = argmax_θ U(θ).
[Slide 7]
Policy Gradient: the log-derivative trick
Take the gradient:
∇_θ U(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)
         = Σ_τ ∇_θ P(τ; θ) R(τ)
         = Σ_τ P(τ; θ) (∇_θ P(τ; θ) / P(τ; θ)) R(τ)
         = Σ_τ P(τ; θ) ∇_θ log P(τ; θ) R(τ)
         = E_{τ∼π_θ}[∇_θ log P(τ; θ) R(τ)]
The key step is the identity ∇_θ log P(τ; θ) = ∇_θ P(τ; θ) / P(τ; θ): the log-derivative trick.
[Slide 8]
Policy Gradient: approximate gradient with samples
Now we can approximate the gradient using Monte Carlo samples:
∇_θ U(θ) ≈ (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) R(τ^(i))
where τ^(1), …, τ^(m) are sample rollout trajectories under policy π_θ.
[Slide 9]
Policy Gradient: approximate gradient with samples
Take a moment to appreciate this. The gradient approximation is valid even when:
- The reward is discontinuous / unknown
- The sample space is a discrete set
(A small numerical sketch of this follows.)
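To see this concretely, here is a small numerical sketch (my addition, not from the slides): the score-function estimator applied to a categorical distribution over a discrete set, with a black-box reward table that the gradient never differentiates through. The reward values, step size, and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Black-box reward over the discrete sample space {0, 1, 2}:
# no gradient ever flows through it, and it need not be continuous.
reward = np.array([1.0, -1.0, 3.0])

theta = np.zeros(3)  # logits of a categorical distribution

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(1000):
    p = softmax(theta)
    x = rng.choice(3, size=64, p=p)      # Monte Carlo samples
    # grad of log p(x) for a softmax-categorical: one_hot(x) - p
    grad_logp = np.eye(3)[x] - p
    grad = (grad_logp * reward[x, None]).mean(axis=0)
    theta += 0.1 * grad                  # gradient ascent on E[reward]

print(softmax(theta))  # puts most probability on outcome 2 (reward 3.0)
```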
[Slide 10]
Policy Gradient: intuition
The gradient tries to:
- Increase probability of paths with positive rewards
- Decrease probability of paths with negative rewards
Does NOT try to change the paths themselves.
See any problems here?
[Slide 11]
Decomposing the paths into states & actions
∇_θ log P(τ^(i); θ) = ∇_θ [Σ_t log P(s_{t+1}^(i) | s_t^(i), a_t^(i)) + Σ_t log π_θ(a_t^(i) | s_t^(i))]
                    = Σ_t ∇_θ log π_θ(a_t^(i) | s_t^(i))
The dynamics term log P(s_{t+1} | s_t, a_t) is not a function of θ, so its gradient vanishes: no model of the environment is needed.
[Slide 12]
Policy Gradient: problems and fixes
The vanilla policy gradient estimator is unbiased, but very noisy.
- Requires lots of samples to make it work
Fixes:
- Baseline
- Temporal structure
- Other (e.g. KL trust region)
[Slide 13]
Policy Gradient: baseline
Subtracting a baseline b from the reward does not change the optimization problem:
∇_θ U(θ) ≈ (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i); θ) (R(τ^(i)) − b)
- The gradient estimate is still unbiased, but has lower variance (a short code sketch follows the list below)
Intuition: we want to adjust path probabilities based on how the path reward compares to the average, not the path reward itself.
- Increase probability if the path reward is higher than average
- Decrease probability if the path reward is lower than average
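A minimal sketch of the baseline in code (my addition, not from the slides), using the batch-average return as the constant baseline b; the function name and tensor shapes are illustrative:

```python
import torch

def policy_gradient_loss(log_probs, returns):
    """log_probs: (m,) sum_t of log pi_theta(a_t|s_t) per trajectory.
    returns:   (m,) total reward R(tau) per trajectory."""
    baseline = returns.mean()        # constant baseline b
    advantages = returns - baseline  # compare each path to the average
    # Minimizing this loss performs gradient ascent on U(theta).
    return -(log_probs * advantages).mean()
```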
[Slide 14]
Policy Gradient: temporal structure
Put together what we have:
∇_θ U(θ) ≈ (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i)) (Σ_{k=t}^{T-1} r(s_k^(i), a_k^(i)) − b)
Past reward does not affect the current action, so each ∇_θ log π_θ(a_t | s_t) term is weighted only by the reward-to-go from step t onward.
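A short sketch (my addition) of the resulting reward-to-go computation for one episode; the discount factor gamma is an assumption beyond the undiscounted formula above:

```python
def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum over k >= t of gamma**(k - t) * r_k, for each step t."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

assert rewards_to_go([1.0, 1.0, 1.0]) == [3.0, 2.0, 1.0]
```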
[Slide 15]
OpenAI Gym
https://gym.openai.com/
Widely-used testing platform for RL algorithms.
● pip install gym
Different kinds of environments, including discrete / continuous control, pixel-input Atari games, etc.
You can also create your own environments, following the Gym interface.
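For illustration (my addition, assuming the classic pre-0.26 Gym API used on these slides), a toy custom environment only needs to implement the same interface:

```python
import gym
import numpy as np
from gym import spaces

class CoinFlipEnv(gym.Env):
    """Toy environment: guess a coin flip; episode ends after one step."""
    def __init__(self):
        self.observation_space = spaces.Discrete(1)  # a single dummy state
        self.action_space = spaces.Discrete(2)       # guess heads or tails
        self.rng = np.random.default_rng()

    def reset(self):
        return 0                                     # the only observation

    def step(self, action):
        coin = self.rng.integers(2)
        reward = 1.0 if action == coin else -1.0
        done = True                                  # one-step episodes
        return 0, reward, done, {}
```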
[Slide 16]
OpenAI Gym environments
Create an environment:
● env = gym.make("<environment_name>"), e.g. env = gym.make("CartPole-v1")
Env methods you will need the most:
● state = env.reset()
● next_state, reward, done, info = env.step(action)
● env.seed(seed=None)
● env.close()
More documentation at https://gym.openai.com/docs/
Useful attributes:
● env.observation_space
● env.action_space
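Putting these methods and attributes together, a minimal random-agent rollout looks like this (a sketch, assuming the classic Gym API shown above):

```python
import gym

env = gym.make("CartPole-v1")
env.seed(0)                              # make rollouts reproducible

state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # random action for illustration
    state, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print("episode return:", total_reward)
```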
[Slide 17]
Example: Policy Gradient in PyTorch on a Gym Environment (CartPole-v1)
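The code from this part of the tutorial is not in the transcript; below is a minimal REINFORCE-style sketch in its spirit, combining the pieces above (log-derivative trick, reward-to-go, batch-mean baseline). It assumes the classic Gym API; the network architecture, learning rate, and episode count are illustrative.

```python
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
env.seed(0)
torch.manual_seed(0)

policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 128),
    nn.ReLU(),
    nn.Linear(128, env.action_space.n),  # logits over discrete actions
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def run_episode():
    """Roll out one episode; collect per-step log-probs and rewards."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    return log_probs, rewards

for episode in range(500):
    log_probs, rewards = run_episode()
    # Discounted reward-to-go at each step (temporal structure).
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns = torch.tensor(list(reversed(returns)))
    advantages = returns - returns.mean()      # batch-mean baseline
    loss = -(torch.stack(log_probs) * advantages).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if episode % 50 == 0:
        print(f"episode {episode}: return {sum(rewards):.0f}")

env.close()
```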
[Slide 18]
References
- Sutton & Barto, "Reinforcement Learning: An Introduction", 1998
- Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning", 1992
- Sutton et al., "Policy Gradient Methods for Reinforcement Learning with Function Approximation", 1999
- Pieter Abbeel, Deep RL Bootcamp Lecture 4A: Policy Gradients, https://www.youtube.com/watch?v=S_gwYj1Q-44