
Page 1: Reinforcement Learning: Review

Reinforcement Learning: Review

CS 330

Page 2: Reinforcement Learning: Review

Reminders

Today:

Monday next week:

Project proposals due

Homework 2 due, Homework 3 out

Page 3: Reinforcement Learning: Review

Why Reinforcement Learning?

Isolated action that doesn't affect the future?

Common applications

(most deployed ML systems)

robotics, autonomous driving, language & dialog, business operations, finance

+ a key aspect of intelligence

Supervised learning?

Page 4: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 5: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 6: Reinforcement Learning: Review

supervised learning              | sequential decision making
---------------------------------|-------------------------------------------
object classification            | object manipulation
i.i.d. data                      | action affects the next state
large labeled, curated dataset   | how to collect data? what are the labels?
well-defined notions of success  | what does success mean?

Page 7: Reinforcement Learning: Review

Terminology & notation

1. run away
2. ignore
3. pet

Slide adapted from Sergey Levine

Page 8: Reinforcement Learning: Review

Imitation Learning

(figure: training data -> supervised learning; images: Bojarski et al. '16, NVIDIA)

Slide adapted from Sergey Levine

Imitation Learning vs Reinforcement Learning?

Page 9: Reinforcement Learning: Review

Reward functions

Slide adapted from Sergey Levine

Page 10: Reinforcement Learning: Review

The goal of reinforcement learning

Slide adapted from Sergey Levine

Page 11: Reinforcement Learning: Review

The goal of reinforcement learning

Slide adapted from Sergey Levine

Page 12: Reinforcement Learning: Review

The goal of reinforcement learning

Slide adapted from Sergey Levine
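For reference, the standard form of the objective built up over these slides is

$$
\theta^\star = \arg\max_\theta J(\theta), \qquad
J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\sum_{t} r(\mathbf{s}_t, \mathbf{a}_t)\right],
$$

where trajectories are sampled from

$$
p_\theta(\tau) = p(\mathbf{s}_1) \prod_{t=1}^{T} \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\, p(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t).
$$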

Page 13: Reinforcement Learning: Review

What is a reinforcement learning task?

Supervised learning:
A task: $\mathcal{T}_i \triangleq \{p_i(\mathbf{x}),\, p_i(\mathbf{y} \mid \mathbf{x}),\, \mathcal{L}_i\}$
(data-generating distributions and a loss)

Reinforcement learning:
A task: $\mathcal{T}_i \triangleq \{\mathcal{S}_i,\, \mathcal{A}_i,\, p_i(\mathbf{s}_1),\, p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}),\, r_i(\mathbf{s}, \mathbf{a})\}$
(a Markov decision process: state space $\mathcal{S}_i$, action space $\mathcal{A}_i$, initial state distribution $p_i(\mathbf{s}_1)$, dynamics $p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$, reward $r_i(\mathbf{s}, \mathbf{a})$)

much more than the semantic meaning of task!
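As a concrete illustration of this tuple, here is a tiny, made-up two-state MDP written out in Python (the states, actions, dynamics, and rewards below are invented purely for illustration):

```python
# A minimal, hypothetical RL task (MDP) written out as a Python structure.
task = {
    "states": ["s0", "s1"],                      # S_i
    "actions": ["left", "right"],                # A_i
    "p_s1": {"s0": 1.0, "s1": 0.0},              # initial state distribution p_i(s_1)
    # dynamics p_i(s' | s, a): (state, action) -> {next_state: probability}
    "dynamics": {
        ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
        ("s0", "right"): {"s0": 0.1, "s1": 0.9},
        ("s1", "left"):  {"s0": 0.9, "s1": 0.1},
        ("s1", "right"): {"s0": 0.1, "s1": 0.9},
    },
    # reward r_i(s, a)
    "reward": {
        ("s0", "left"): 0.0, ("s0", "right"): 0.0,
        ("s1", "left"): 0.0, ("s1", "right"): 1.0,
    },
}
```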

Page 14: Reinforcement Learning: Review

Example Task Distributions

A task: $\mathcal{T}_i \triangleq \{\mathcal{S}_i,\, \mathcal{A}_i,\, p_i(\mathbf{s}_1),\, p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}),\, r_i(\mathbf{s}, \mathbf{a})\}$

Character animation, across maneuvers: $r_i(\mathbf{s}, \mathbf{a})$ varies

Across garments & initial states: $p_i(\mathbf{s}_1)$ and $p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$ vary

Multi-robot RL: $\mathcal{S}_i$, $\mathcal{A}_i$, $p_i(\mathbf{s}_1)$, $p_i(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$ vary

Page 15: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 16: Reinforcement Learning: Review

The anatomy of a reinforcement learning algorithm

Page 17: Reinforcement Learning: Review

Evaluating the objective

Slide adapted from Sergey Levine

Page 18: Reinforcement Learning: Review

Direct policy differentiation

a convenient identity

Slide adapted from Sergey Levine
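The "convenient identity" is the standard log-derivative trick,

$$
p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau),
$$

which, together with the fact that the dynamics terms in $p_\theta(\tau)$ do not depend on $\theta$, turns the gradient of the objective into an expectation under the current policy:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[\left(\sum_t \nabla_\theta \log \pi_\theta(\mathbf{a}_t \mid \mathbf{s}_t)\right)\left(\sum_t r(\mathbf{s}_t, \mathbf{a}_t)\right)\right].
$$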

Page 19: Reinforcement Learning: Review

Direct policy differentiation

Slide adapted from Sergey Levine

Page 20: Reinforcement Learning: Review

Evaluating the policy gradient

generate samples (i.e. run the policy)

fit a model to estimate return

improve the policy

Slide adapted from Sergey Levine
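A minimal sketch of the sample-based gradient step, assuming a PyTorch policy; `log_probs` and `returns` are placeholder names for quantities computed from the sampled trajectories:

```python
import torch

def reinforce_surrogate(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is the Monte Carlo policy gradient estimate.

    log_probs: [N, T] values of log pi_theta(a_t | s_t) for N sampled trajectories
    returns:   [N] total reward r(tau_i) of each trajectory
    """
    # grad J(theta) ~= (1/N) sum_i (sum_t grad log pi(a_t|s_t)) * r(tau_i)
    # Minimizing the negated surrogate therefore ascends J(theta).
    return -(log_probs.sum(dim=1) * returns).mean()
```

One full iteration then matches the three boxes above: run the policy to collect N trajectories, compute this loss from their log-probabilities and returns, and take a gradient step on the policy parameters.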

Page 21: Reinforcement Learning: Review

Comparison to maximum likelihood

(figure: training data -> supervised learning)

Slide adapted from Sergey Levine

Page 22: Reinforcement Learning: Review

What did we just do?

good stuff is made more likely

bad stuff is made less likely

simply formalizes the notion of "trial and error"!

Slide adapted from Sergey Levine

Page 23: Reinforcement Learning: Review

Policy Gradients

Pros:
+ Simple
+ Easy to combine with existing multi-task & meta-learning algorithms

Cons:
- Produces a high-variance gradient
  - Can be mitigated with baselines (used by all algorithms in practice) and trust regions; see the baseline sketch below
- Requires on-policy data
  - Cannot reuse existing experience to estimate the gradient!
  - Importance weights can help, but are also high variance
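A common baseline simply subtracts the average return across the sampled trajectories; this reduces variance without biasing the gradient, since $\mathbb{E}_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)] = 0$:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,\big(r(\tau_i) - b\big),
\qquad b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i).
$$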

Page 24: Reinforcement Learning: Review

On-policy vs Off-policy

On-policy:
- Data comes from the current policy
- Compatible with all RL algorithms
- Can't reuse data from previous policies

Off-policy:
- Data comes from any policy
- Works with specific RL algorithms
- Much more sample efficient, can re-use old data

Page 25: Reinforcement Learning: Review

Small note

Reward "to go"
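A small sketch of what "reward to go" computes for each timestep of a trajectory (the function name and discount handling are illustrative):

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Entry t is the sum over t' >= t of gamma**(t'-t) * r_{t'} for one trajectory."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # accumulate from the end of the trajectory
        rtg[t] = running
    return rtg

# Example: rewards_to_go([1.0, 0.0, 2.0]) -> array([3., 2., 2.])
```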

Page 26: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 27: Reinforcement Learning: Review

The anatomy of a reinforcement learning algorithm

Page 28: Reinforcement Learning: Review

Improving the policy gradient

Reward "to go"

Slide adapted from Sergey Levine

Page 29: Reinforcement Learning: Review

State & state-action value functions

Slide adapted from Sergey Levine

Page 30: Reinforcement Learning: Review

Value-Based RL

Reward = 1 if I can play it in a month, 0 otherwise

(figure: state $\mathbf{s}_t$ with candidate actions $\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$)

Current policy: $\pi(\mathbf{a}_1 \mid \mathbf{s}) = 1$

Value function: $V^\pi(\mathbf{s}_t) = ?$
Q function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
Advantage function: $A^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
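For reference, the standard definitions behind these questions, with values taken as expected sums of future rewards under $\pi$:

$$
V^\pi(\mathbf{s}_t) = \mathbb{E}_\pi\!\left[\sum_{t' \ge t} r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) \,\middle|\, \mathbf{s}_t\right],
\quad
Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_\pi\!\left[\sum_{t' \ge t} r(\mathbf{s}_{t'}, \mathbf{a}_{t'}) \,\middle|\, \mathbf{s}_t, \mathbf{a}_t\right],
\quad
A^\pi(\mathbf{s}_t, \mathbf{a}_t) = Q^\pi(\mathbf{s}_t, \mathbf{a}_t) - V^\pi(\mathbf{s}_t).
$$

Under the current policy above, which always picks $\mathbf{a}_1$, we have $V^\pi(\mathbf{s}_t) = Q^\pi(\mathbf{s}_t, \mathbf{a}_1)$ and therefore $A^\pi(\mathbf{s}_t, \mathbf{a}_1) = 0$.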

Page 31: Reinforcement Learning: Review

Multi-Step Prediction

- How do you update your predictions about winning the game?

- What happens if you don't finish the game?

- Do you always wait till the end?
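These questions motivate temporal-difference updates, which bootstrap from the current value estimate instead of waiting for the end of the game. A minimal tabular TD(0) sketch (the dict-based table and hyperparameters are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One TD(0) step: move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    v_s = V.get(s, 0.0)
    v_next = 0.0 if done else V.get(s_next, 0.0)
    target = r + gamma * v_next           # no need to wait until the episode ends
    V[s] = v_s + alpha * (target - v_s)   # partial update toward the target
    return V
```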

Page 32: Reinforcement Learning: Review

How can we use all of this to fit a better estimator?

Slide adapted from Sergey Levine

Goal:

Page 33: Reinforcement Learning: Review

Policy evaluation examples

TD-Gammon, Gerald Tesauro 1992
AlphaGo, Silver et al. 2016

Page 34: Reinforcement Learning: Review

Slide adapted from Sergey Levine

Page 35: Reinforcement Learning: Review

This was just the prediction part...

Page 36: Reinforcement Learning: Review

Improving the Policy

how good is an action compared to the policy?

Page 37: Reinforcement Learning: Review

Value-Based RL

Reward = 1 if I can play it in a month, 0 otherwise

(figure: state $\mathbf{s}_t$ with candidate actions $\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$)

Current policy: $\pi(\mathbf{a}_1 \mid \mathbf{s}) = 1$

Value function: $V^\pi(\mathbf{s}_t) = ?$
Q function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
Advantage function: $A^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$

How can we improve the policy?

Page 38: Reinforcement Learning: Review

Improving the Policy

Slide adapted from Sergey Levine

Page 39: Reinforcement Learning: Review

Slide adapted from Sergey Levine

Policy Iteration

Page 40: Reinforcement Learning: Review

Slide adapted from Sergey Levine

Value Iteration

approximates the new value!
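A minimal tabular value-iteration sketch, assuming the dynamics are known and small enough to enumerate (array names and shapes are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    """P: [S, A, S] transition probabilities, R: [S, A] rewards.

    Repeats the backup V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ].
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q(s, a) under the current value estimate
        V = Q.max(axis=1)         # the max over actions approximates the new value
    return V, Q.argmax(axis=1)    # value estimate and the corresponding greedy policy
```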

Page 41: Reinforcement Learning: Review

Q learning

doesn't require simulation of actions!
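The corresponding tabular Q-learning update only needs sampled transitions (s, a, r, s'), with no simulation of alternative actions. A minimal sketch with a dict-based table (names and hyperparameters are illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """One Q-learning step on a table Q[(s, a)], given a single observed transition."""
    best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next              # bootstrapped Bellman target
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)  # move the estimate toward the target
    return Q
```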

Page 42: Reinforcement Learning: Review

Value-Based RL

Reward = 1 if I can play it in a month, 0 otherwise

(figure: state $\mathbf{s}_t$ with candidate actions $\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$)

Current policy: $\pi(\mathbf{a}_1 \mid \mathbf{s}) = 1$

Value function: $V^\pi(\mathbf{s}_t) = ?$
Q function: $Q^\pi(\mathbf{s}_t, \mathbf{a}_t) = ?$
Q* function: $Q^\star(\mathbf{s}_t, \mathbf{a}_t) = ?$
Value* function: $V^\star(\mathbf{s}_t) = ?$

Page 43: Reinforcement Learning: Review

Fitted Q-iteration Algorithm

Slide adapted from Sergey Levine

Algorithm hyperparameters

This is not a gradient descent algorithm!

Result: get a policy $\pi(\mathbf{a} \mid \mathbf{s})$ from $\arg\max_{\mathbf{a}} Q_\phi(\mathbf{s}, \mathbf{a})$

Important notes:
- We can reuse data from previous policies (using replay buffers): an off-policy algorithm
- Can be readily extended to multi-task/goal-conditioned RL
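A compressed sketch of one fitted Q-iteration step with a neural Q-function, assuming PyTorch; the network, target network, batch format, and optimizer are placeholders rather than the slide's exact pseudocode:

```python
import torch
import torch.nn.functional as F

def fitted_q_step(q_net, q_target, optimizer, batch, gamma=0.99):
    """Regress Q_phi(s, a) toward Bellman targets y = r + gamma * max_a' Q_target(s', a')."""
    s, a, r, s_next, done = batch          # tensors sampled from a replay buffer
    with torch.no_grad():                  # targets are held fixed (not gradient descent on a single objective)
        y = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_phi(s, a) for the actions taken
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Sampling `batch` from a replay buffer of old experience and periodically copying `q_net` into `q_target` is what makes this an off-policy procedure, as noted above.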

Page 44: Reinforcement Learning: Review

Example: Q-learning Applied to Robotics

Continuous action space?

Simple optimization algorithm -> Cross Entropy Method (CEM)
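A minimal sketch of how CEM can approximate $\arg\max_{\mathbf{a}} Q(\mathbf{s}, \mathbf{a})$ over a continuous action space (population size, elite count, and iteration count are arbitrary choices here):

```python
import numpy as np

def cem_argmax(q_fn, s, action_dim, iters=3, pop=64, n_elite=6):
    """Approximate argmax_a q_fn(s, a) by iteratively refitting a Gaussian over actions."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iters):
        actions = mean + std * np.random.randn(pop, action_dim)  # sample candidate actions
        scores = np.array([q_fn(s, a) for a in actions])         # evaluate Q on each candidate
        elites = actions[np.argsort(scores)[-n_elite:]]          # keep the best candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean                                                  # use the final mean as the action
```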

Page 45: Reinforcement Learning: Review

QT-Opt: Q-learning at Scale

(system diagram: stored data from all past experiments, in-memory buffers, Bellman updaters, training jobs, CEM optimization)

QT-Opt: Kalashnikov et al. '18, Google Brain. Slide adapted from D. Kalashnikov

Page 46: Reinforcement Learning: Review

QT-Opt: Setup and Results

7 robots collected 580k grasps

Unseen test objects: 96% test success rate!

Page 47: Reinforcement Learning: Review

Bellman equation: $Q^\star(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\mathbf{s}' \sim p(\cdot \mid \mathbf{s}, \mathbf{a})}\!\left[\, r(\mathbf{s}, \mathbf{a}) + \gamma \max_{\mathbf{a}'} Q^\star(\mathbf{s}', \mathbf{a}') \,\right]$

Q-learning

Pros:
+ More sample efficient than on-policy methods
+ Can incorporate off-policy data (including a fully offline setting)
+ Can update the policy even without seeing the reward
+ Relatively easy to parallelize

Cons:
- Lots of "tricks" to make it work
- Potentially could be harder to learn than just a policy

Page 48: Reinforcement Learning: Review

The Plan

Reinforcement learning problem

Policy gradients

Q-learning

Page 49: Reinforcement Learning: Review

Additional RL Resources

Stanford CS234: Reinforcement Learning

UCL Course from David Silver: Reinforcement Learning

Berkeley CS285: Deep Reinforcement Learning