Reinforcement Learning | Part I: Tabular Solution Methods (Mini-Bootcamp). Richard S. Sutton & Andrew G. Barto, 1st ed. (1998), 2nd ed. (2018). Presented by Nicholas Roy, Pillow Lab Meeting, June 27, 2019.


Page 1:

Reinforcement Learning | Part I: Tabular Solution Methods

Mini-Bootcamp

Richard S. Sutton & Andrew G. Barto, 1st ed. (1998), 2nd ed. (2018)

Presented by Nicholas Roy, Pillow Lab Meeting

June 27, 2019

Page 2:

RL of the tabular variety

• What is special about RL?
– “The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior. Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible.”

• What is the point of Part I?
– “We describe almost all the core ideas of reinforcement learning algorithms in their simplest forms: that in which the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables. In this case, the methods can often find exact solutions…”


Page 3:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 4:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 5:

Let’s get through the basics…

• Agent/Environment

• States

• Actions

• Rewards

• Markov

• MDP

• Dynamics p(s’,r|s,a)

• Returns

• Discount factors

• Episodic/Continuing tasks

• Policies

• State-/Action-value function

• Bellman equation

• Optimal policies


Page 6:

Agent vs. Environment

• Agent: the learner and decision maker, interacts with…

• Environment: everything else

• In a finite Markov Decision Process (MDP), the sets of states S, actions A, and rewards R each have a finite number of elements


Page 7:

MDP Dynamics

• Dynamics are defined completely by p(s', r | s, a)
• Dynamics have the Markov property: they depend only on the current (s, a)
• Can collapse this 4D table to get other functions of interest:
– state-transition probabilities p(s' | s, a)
– expected rewards r(s, a)
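In the book's notation, these marginals of the four-argument dynamics are

p(s' \mid s, a) = \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r} p(s', r \mid s, a)

r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)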


Page 8:

Rewards and Returns

• Reward hypothesis: “…goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”
– A way of communicating what you want to achieve, not how

• Return (defined below)
• Discounted Return (defined below)
• With discounting, the Return is finite even over an infinite number of time steps
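In the book's notation, the two returns referenced above are

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

with discount factor 0 \le \gamma \le 1; for \gamma < 1 and bounded rewards the infinite sum converges.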


Page 9:

Policies & Value Functions

• Policy: a mapping from states to probabilities of selecting each possible action, π(a|s)
• State-value function: for a given policy π and state s, defined as the expected return G when starting in s and following π thereafter
• Action-value function: for a given policy π, state s, and action a, the expected return G when starting in s, taking a, and then following π
• The existence and uniqueness of v_π and q_π are guaranteed as long as either γ < 1 or eventual termination is guaranteed from all states under the policy π
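In symbols, these are

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]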


Page 10:

Bellman Property

• Bellman property: a recursive relationship, satisfied by the unique functions v_π and q_π, between the value of a state and the values of its successor states
• An analogous equation holds for q_π(s, a), and one can easily convert between v_π and q_π
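For the state-value function, this recursion is the Bellman equation

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]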


Page 11:

Optimal Policies

• There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π*
– They share the same state- and action-value functions, v* and q*


Bellman optimality equations
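In full, these are

v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right]

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]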

Page 12:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 13:

Dynamic Programming

• DP algorithms can be used to compute optimal policies if given the complete dynamics, p(s', r | s, a), of an MDP
• A strong assumption and computationally expensive, but provides the theoretical best case that other algorithms attempt to achieve
• The chapter introduces Policy Evaluation and then Policy Iteration:
1. Initialize an arbitrary value function v and a random policy π
2. Use the Bellman update to move v toward v_π until convergence
3. Update π' to be greedy w.r.t. v_π
4. Repeat from (2) until v_π' = v_π, implying that v_π = v*
• π* is just the greedy policy w.r.t. v* (a sketch of these steps follows below)
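A minimal Python sketch of steps 1-4, assuming the dynamics are available as a table P[s][a] holding (prob, next_state, reward, done) tuples; this Gym-like layout and all names here are illustrative, not from the slides:

import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    # Iterative policy evaluation: sweep Bellman updates until the value
    # function stops changing (step 2 on the slide).
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(prob * (r + gamma * V[s2] * (not done))
                        for prob, s2, r, done in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def policy_iteration(P, n_actions, gamma=0.9):
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)        # step 1: arbitrary initial policy
    while True:
        V = policy_evaluation(P, policy, gamma)   # step 2: evaluate current policy
        stable = True
        for s in range(n_states):                 # step 3: greedy improvement
            q = [sum(prob * (r + gamma * V[s2] * (not done))
                     for prob, s2, r, done in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                # step 4: policy unchanged, so optimal
            return policy, V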


Page 14:

Generalized Policy Iteration (GPI)


We can actually skip the strict iteration and just update the policy to be greedy in real time…
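Doing so gives the value-iteration update, which folds one sweep of policy evaluation together with greedy improvement:

v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_k(s') \right]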

Page 15:

A Quick Example…


Page 16:

DP Summary

• DP suffers from the “curse of dimensionality” (a term coined by Bellman, as is “dynamic programming”!)
– But it is exponentially better than direct search
– Modern computers can handle millions of states, and updates can run asynchronously
• DP is essentially just the Bellman equations turned into update rules
• Generalized Policy Iteration is proven to converge for DP
• Bootstrapping: DP bootstraps, that is, it updates estimates of values using other estimated values
– Unlike the next set of methods…


Page 17:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 18:

Motivation for Monte Carlo

• What is v_π? The expected return G from each state under π
• So why not just learn v_π by averaging sampled returns G?
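That is first-visit Monte Carlo prediction. A rough Python sketch, assuming a run_episode(policy) helper (illustrative, not defined in the slides) that returns one episode as a list of (state, reward) pairs:

from collections import defaultdict

def mc_prediction(run_episode, policy, n_episodes=10000, gamma=1.0):
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_cnt = defaultdict(int)     # number of first-visit returns per state
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = run_episode(policy)  # [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):   # accumulate returns backwards
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:               # average only first-visit returns
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V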


Page 19:

The difference between Monte Carlo and DP

1. MC operates on sample experience, not on the full dynamics
– DP : computing v_π :: MC : learning v_π
2. MC does not bootstrap; it estimates v_π directly from returns G

Advantages of MC over DP:

1. Can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics
2. If there is a model, can learn from simulation (e.g., Blackjack)
3. Easy and efficient to focus Monte Carlo methods on a small subset of states
4. No bootstrapping means MC is less harmed by violations of the Markov property


Page 20:

Problems of Exploration

• The problem is now non-stationary: the return after taking an action in one state depends on the actions taken in later states in the same episode
• If π is a deterministic policy, then in following π one will observe returns for only one of the actions from each state. With no returns, the Monte Carlo estimates of the other actions will not improve with experience.
• Must ensure continual exploration for policy evaluation to work
• Solutions (an epsilon-greedy sketch follows below):
– Exploring starts (every state-action pair has some chance of being the episode's start)
– On-policy: epsilon-greedy (choose a random action with probability ε)
– Off-policy: importance sampling (use a distinct behavior policy b to explore while improving π)
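For reference, a minimal epsilon-greedy action selection over a tabular Q, assuming an illustrative Q[state] -> array-of-action-values layout:

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:            # explore: uniform random action
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))       # exploit: current greedy action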


Page 21:

On-policy vs. Off-policy for Exploration

• In on-policy methods, the agent commits to always exploring and tries to find the best policy that still explores.
• In off-policy methods, the agent also explores, but learns a deterministic optimal policy (π, usually greedy) that may be unrelated to the policy followed (b, the behavior policy).
• Off-policy prediction learning methods are based on some form of importance sampling, that is, on weighting returns by the ratio of the probabilities of taking the observed actions under the two policies, thereby transforming their expectations from the behavior policy to the target policy.
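Concretely, the weighting referred to above is the importance-sampling ratio over the remainder of the episode,

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}

so that \mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s).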


Page 22:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 23:

A Comparison of Updates


Page 24:

Temporal-Difference Learning

• “If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.”
• Like MC methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.
• Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

Advantages of TD methods:
– can be applied online
– with a minimal amount of computation
– using experience generated from interaction with an environment
– expressed nearly completely by single equations, implemented with small computer programs


Page 25:

TD Update & Error

• TD Update: V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
• TD Error: δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)
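A minimal TD(0) prediction sketch in Python, assuming a simplified environment interface with env.reset() -> state and env.step(action) -> (next_state, reward, done), and a policy(state) function; these names are illustrative, not from the slides:

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(S_t) moved by alpha times the TD error
            s = s_next
    return V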


Page 26:

Example: TD(0) vs. MC


Page 27:

Sarsa: on-policy TD control
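Sarsa updates action values from each (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) transition, with A_{t+1} chosen by the same (e.g., epsilon-greedy) policy that is being improved:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]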


Page 28:

Q-learning: off-policy TD control
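Q-learning instead bootstraps from the greedy action in the next state, regardless of which action the (e.g., epsilon-greedy) behavior policy actually takes:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]

A rough tabular sketch, reusing the same illustrative environment interface as the TD(0) example above:

import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily (exploration)...
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # ...but bootstrap off the greedy (target-policy) action value
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q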


Page 29:

Case Study: Cliff Walking


Page 30:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 31:

n-step Methods

• Specifically, n-step TD methods
– Bridge the gap between one-step TD(0) and ∞-step Monte Carlo
• With TD(0), the same time step determines both how often the action can be changed and the time interval for bootstrapping
– we want to update action values very quickly to take into account any changes
– but bootstrapping works best if it is over a length of time in which a significant and recognizable state change has occurred
• Will be superseded by the eligibility traces of Ch12, a continuous version of the same idea
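Concretely, these methods bootstrap toward the n-step return

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})

and move V(S_t) toward it, recovering TD(0) at n = 1 and Monte Carlo as n grows to the episode length.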


Page 32:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 33:

Model-free vs. Model-based

• Planning methods (model-based): DP
• Learning methods (model-free): MC, TD
• Both method types:
– look ahead to future events,
– compute a backed-up value,
– and use it as an update target for an approximate value function
• DP: expected one-step backup over the full known dynamics
• MC: sampled backup over a complete episode (target = the return G)
• TD: sampled one-step backup (target = R_{t+1} + γ V(S_{t+1}))
• Now seek to unify model-free and model-based methods


Page 34:

Dyna

From experience you can:
(1) improve your value function & policy (direct RL)
(2) improve your model (model-learning, or indirect RL)
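Dyna-Q interleaves the two: each real step performs one direct Q-learning update, records the transition in a table-lookup model, and then runs a few planning updates on transitions replayed from that model. A rough sketch, using the same illustrative environment interface as above:

import numpy as np
from collections import defaultdict

def dyna_q(env, n_actions, n_episodes=200, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    model = {}                                    # (s, a) -> (r, s_next, done)
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # (1) direct RL: Q-learning update from real experience
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            # (2) model learning: remember what this (s, a) led to
            model[(s, a)] = (r, s_next, done)
            # planning: replay simulated transitions drawn from the learned model
            keys = list(model.keys())
            for _ in range(n_planning):
                ps, pa = keys[rng.integers(len(keys))]
                pr, ps_next, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * np.max(Q[ps_next])
                Q[ps][pa] += alpha * (ptarget - Q[ps][pa])
            s = s_next
    return Q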


Page 35:

A Final Overview


3 key ideas in common

1. They all seek to estimate value functions

2. They all operate by backing up values along actual or possible state trajectories

3. They all follow the general strategy of generalized policy iteration (GPI), meaning that they maintain an approximate value function and approximate policy, and continually try to improve each on the basis of the other.

+ 3rd dimension: On-policy vs. Off-policy
Ch7: n-step Methods; Ch12: Eligibility Traces

Page 36:

Other method dimensions to consider…


Page 37:

The Rest of the Book


• Part I: Tabular Solution Methods
• Part II: Approximate Solution Methods
– Ch9: On-policy Prediction with Approximation
– Ch10: On-policy Control with Approximation
– Ch11: Off-policy Methods with Approximation
– Ch12: Eligibility Traces
– Ch13: Policy Gradient Methods

• Part III: Looking Deeper
– Neuroscience, Psychology, Applications and Case Studies, Frontiers