Reinforcement Learning | Part I: Tabular Solution Methods (Mini-Bootcamp). Richard S. Sutton & Andrew G. Barto, 1st ed. (1998), 2nd ed. (2018). Presented by Nicholas Roy, Pillow Lab Meeting, June 27, 2019.


Page 1:

Reinforcement Learning | Part I: Tabular Solution Methods

Mini-Bootcamp

Richard S. Sutton & Andrew G. Barto, 1st ed. (1998), 2nd ed. (2018)

Presented by Nicholas Roy, Pillow Lab Meeting

June 27, 2019

Page 2:

RL of the tabular variety

• What is special about RL?
– “The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior. Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible.”

• What is the point of Part I?
– “We describe almost all the core ideas of reinforcement learning algorithms in their simplest forms: that in which the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables. In this case, the methods can often find exact solutions…”


Page 3:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 4:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 5:

Let’s get through the basics…

• Agent/Environment

• States

• Actions

• Rewards

• Markov

• MDP

• Dynamics p(s’,r|s,a)

• Returns

• Discount factors

• Episodic/Continuing tasks

• Policies

• State-/Action-value function

• Bellman equation

• Optimal policies


Page 6:

Agent vs. Environment

• Agent: the learner and decision maker, interacts with…

• Environment: everything else

• In a finite Markov Decision Process (MDP), the sets of states S, actions A, and rewards R each have a finite number of elements


Page 7:

MDP Dynamics

• Dynamics are defined completely by p(s', r | s, a)
• Dynamics have the Markov property: they depend only on the current (s, a)
• Can collapse this 4D table to get other functions of interest:
– state-transition probabilities p(s' | s, a)
– expected rewards r(s, a)
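In the book's notation, these marginals of the four-argument dynamics are

p(s' \mid s, a) = \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r} p(s', r \mid s, a)

r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)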


Page 8:

Rewards and Returns

• Reward hypothesis: “…goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”
– A way of communicating what you want to achieve, not how

• Return (defined below)
• Discounted Return (defined below)
• With discounting, the Return is finite even over an infinite number of time steps
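In the book's notation, the two returns referenced above are

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

with discount factor 0 \le \gamma \le 1; for \gamma < 1 and bounded rewards the infinite sum converges.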


Page 9:

Policies & Value Functions

• Policy: a mapping from states to probabilities of selecting each possible action, π(a|s)
• State-value function: for a given policy π and state s, defined as the expected return G when starting in s and following π thereafter
• Action-value function: for a given policy π, state s, and action a, the expected return G when starting in s, taking a, and then following π
• The existence and uniqueness of v_π and q_π are guaranteed as long as either γ < 1 or eventual termination is guaranteed from all states under the policy π
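In symbols, these are

v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]

q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]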


Page 10:

Bellman Property

• Bellman property: a recursive relationship, satisfied by the unique functions v_π and q_π, between the value of a state and the values of its successor states
• An analogous equation holds for q_π(s, a), and one can easily convert between v_π and q_π
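For the state-value function, this recursion is the Bellman equation

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_\pi(s') \right]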


Page 11:

Optimal Policies

• There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π*
– They share the same state- and action-value functions, v* and q*


Bellman optimality equations
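In full, these are

v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_*(s') \right]

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]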

Page 12:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 13:

Dynamic Programming

• DP algorithms can be used to compute optimal policies if given the complete dynamics, p(s', r | s, a), of an MDP
• A strong assumption and computationally expensive, but provides the theoretical best case that other algorithms attempt to achieve
• The chapter introduces Policy Evaluation and then Policy Iteration:
1. Initialize an arbitrary value function v and a random policy π
2. Use the Bellman update to move v toward v_π until convergence
3. Update π' to be greedy w.r.t. v_π
4. Repeat from (2) until v_π' = v_π, implying that v_π = v*
• π* is just the greedy policy w.r.t. v* (a sketch of these steps follows below)
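A minimal Python sketch of steps 1-4, assuming the dynamics are available as a table P[s][a] holding (prob, next_state, reward, done) tuples; this Gym-like layout and all names here are illustrative, not from the slides:

import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    # Iterative policy evaluation: sweep Bellman updates until the value
    # function stops changing (step 2 on the slide).
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(prob * (r + gamma * V[s2] * (not done))
                        for prob, s2, r, done in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def policy_iteration(P, n_actions, gamma=0.9):
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)        # step 1: arbitrary initial policy
    while True:
        V = policy_evaluation(P, policy, gamma)   # step 2: evaluate current policy
        stable = True
        for s in range(n_states):                 # step 3: greedy improvement
            q = [sum(prob * (r + gamma * V[s2] * (not done))
                     for prob, s2, r, done in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                                # step 4: policy unchanged, so optimal
            return policy, V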


Page 14:

Generalized Policy Iteration (GPI)


We can actually skip the strict iteration and just update the policy to be greedy in real time…
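Doing so gives the value-iteration update, which folds one sweep of policy evaluation together with greedy improvement:

v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma\, v_k(s') \right]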

Page 15:

A Quick Example…


Page 16:

DP Summary

• DP suffers from the “curse of dimensionality” (a term coined by Bellman, as is “dynamic programming”!)
– But it is exponentially better than direct search
– Modern computers can handle millions of states, and updates can run asynchronously
• DP is essentially just the Bellman equations turned into update rules
• Generalized Policy Iteration is proven to converge for DP
• Bootstrapping: DP bootstraps, that is, it updates estimates of values using other estimated values
– Unlike the next set of methods…


Page 17:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 18:

Motivation for Monte Carlo

• What is v_π? The expected return G from each state under π
• So why not just learn v_π by averaging sampled returns G?
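That is first-visit Monte Carlo prediction. A rough Python sketch, assuming a run_episode(policy) helper (illustrative, not defined in the slides) that returns one episode as a list of (state, reward) pairs:

from collections import defaultdict

def mc_prediction(run_episode, policy, n_episodes=10000, gamma=1.0):
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_cnt = defaultdict(int)     # number of first-visit returns per state
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = run_episode(policy)  # [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):   # accumulate returns backwards
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:               # average only first-visit returns
                returns_sum[s] += G
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V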


Page 19:

The difference between Monte Carlo and DP

1. MC operates on sample experience, not on the full dynamics
– DP : computing v_π :: MC : learning v_π
2. MC does not bootstrap; it estimates v_π directly from returns G

Advantages of MC over DP:

1. Can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics
2. If there is a model, can learn from simulation (e.g., Blackjack)
3. Easy and efficient to focus Monte Carlo methods on a small subset of states
4. No bootstrapping means MC is less harmed by violations of the Markov property


Page 20:

Problems of Exploration

• The problem is now non-stationary: the return after taking an action in one state depends on the actions taken in later states in the same episode
• If π is a deterministic policy, then in following π one will observe returns for only one of the actions from each state. With no returns, the Monte Carlo estimates of the other actions will not improve with experience.
• Must ensure continual exploration for policy evaluation to work
• Solutions (an epsilon-greedy sketch follows below):
– Exploring starts (every state-action pair has some chance of being the episode's start)
– On-policy: epsilon-greedy (choose a random action with probability ε)
– Off-policy: importance sampling (use a distinct behavior policy b to explore while improving π)
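For reference, a minimal epsilon-greedy action selection over a tabular Q, assuming an illustrative Q[state] -> array-of-action-values layout:

import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:            # explore: uniform random action
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))       # exploit: current greedy action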


Page 21:

On-policy vs. Off-policy for Exploration

• In on-policy methods, the agent commits to always exploring and tries to find the best policy that still explores.
• In off-policy methods, the agent also explores, but learns a deterministic optimal policy (π, usually greedy) that may be unrelated to the policy followed (b, the behavior policy).
• Off-policy prediction learning methods are based on some form of importance sampling, that is, on weighting returns by the ratio of the probabilities of taking the observed actions under the two policies, thereby transforming their expectations from the behavior policy to the target policy.
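Concretely, the weighting referred to above is the importance-sampling ratio over the remainder of the episode,

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}

so that \mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s).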


Page 22:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 23:

A Comparison of Updates


Page 24:

Temporal-Difference Learning

• “If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.”
• Like MC methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.
• Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

Advantages of TD methods:
– can be applied online
– with a minimal amount of computation
– using experience generated from interaction with an environment
– expressed nearly completely by single equations, implemented with small computer programs


Page 25:

TD Update & Error

• TD Update: V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
• TD Error: δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)
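A minimal TD(0) prediction sketch in Python, assuming a simplified environment interface with env.reset() -> state and env.step(action) -> (next_state, reward, done), and a policy(state) function; these names are illustrative, not from the slides:

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(S_t) moved by alpha times the TD error
            s = s_next
    return V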


Page 26:

Example: TD(0) vs. MC


Page 27:

Sarsa: on-policy TD control
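Sarsa updates action values from each (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) transition, with A_{t+1} chosen by the same (e.g., epsilon-greedy) policy that is being improved:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]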


Page 28:

Q-learning: off-policy TD control
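Q-learning instead bootstraps from the greedy action in the next state, regardless of which action the (e.g., epsilon-greedy) behavior policy actually takes:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]

A rough tabular sketch, reusing the same illustrative environment interface as the TD(0) example above:

import numpy as np
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily (exploration)...
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # ...but bootstrap off the greedy (target-policy) action value
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q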


Page 29:

Case Study: Cliff Walking


Page 30:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 31:

n-step Methods

• Specifically, n-step TD methods
– Bridge the gap between one-step TD(0) and ∞-step Monte Carlo
• With TD(0), the same time step determines both how often the action can be changed and the time interval for bootstrapping
– we want to update action values very quickly to take into account any changes
– but bootstrapping works best if it is over a length of time in which a significant and recognizable state change has occurred
• Will be superseded by the eligibility traces of Ch12, a continuous version of the same idea
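Concretely, these methods bootstrap toward the n-step return

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})

and move V(S_t) toward it, recovering TD(0) at n = 1 and Monte Carlo as n grows to the episode length.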


Page 32:

Part I: Tabular Solution Methods

• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods


Page 33:

Model-free vs. Model-based

• Planning methods (model-based): DP
• Learning methods (model-free): MC, TD
• Both method types:
– look ahead to future events,
– compute a backed-up value,
– and use it as an update target for an approximate value function
• DP: expected one-step backup over the full known dynamics
• MC: sampled backup over a complete episode (target = the return G)
• TD: sampled one-step backup (target = R_{t+1} + γ V(S_{t+1}))
• Now seek to unify model-free and model-based methods


Page 34:

Dyna

From experience you can:
(1) improve your value function & policy (direct RL)
(2) improve your model (model-learning, or indirect RL)
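Dyna-Q interleaves the two: each real step performs one direct Q-learning update, records the transition in a table-lookup model, and then runs a few planning updates on transitions replayed from that model. A rough sketch, using the same illustrative environment interface as above:

import numpy as np
from collections import defaultdict

def dyna_q(env, n_actions, n_episodes=200, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    model = {}                                    # (s, a) -> (r, s_next, done)
    rng = np.random.default_rng()
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # (1) direct RL: Q-learning update from real experience
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            # (2) model learning: remember what this (s, a) led to
            model[(s, a)] = (r, s_next, done)
            # planning: replay simulated transitions drawn from the learned model
            keys = list(model.keys())
            for _ in range(n_planning):
                ps, pa = keys[rng.integers(len(keys))]
                pr, ps_next, pdone = model[(ps, pa)]
                ptarget = pr if pdone else pr + gamma * np.max(Q[ps_next])
                Q[ps][pa] += alpha * (ptarget - Q[ps][pa])
            s = s_next
    return Q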


Page 35:

A Final Overview


3 key ideas in common

1. They all seek to estimate value functions

2. They all operate by backing up values along actual or possible state trajectories

3. They all follow the general strategy of generalized policy iteration (GPI), meaning that they maintain an approximate value function and approximate policy, and continually try to improve each on the basis of the other.

+ 3rd dimension: On-policy vs. Off-policy
Ch7: n-step Methods; Ch12: Eligibility Traces

Page 36:

Other method dimensions to consider…


Page 37:

The Rest of the Book


• Part I: Tabular Solution Methods
• Part II: Approximate Solution Methods
– Ch9: On-policy Prediction with Approximation
– Ch10: On-policy Control with Approximation
– Ch11: Off-policy Methods with Approximation
– Ch12: Eligibility Traces
– Ch13: Policy Gradient Methods

• Part III: Looking Deeper
– Neuroscience, Psychology, Applications and Case Studies, Frontiers