Reinforcement Learning | Part I: Tabular Solution Methods
Mini-Bootcamp
Richard S. Sutton & Andrew G. Barto, 1st ed. (1998), 2nd ed. (2018)
Presented by Nicholas Roy, Pillow Lab Meeting
June 27, 2019
RL of the tabular variety
• What is special about RL?
– “The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit search for good behavior. Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible.”
• What is the point of Part I?
– “We describe almost all the core ideas of reinforcement learning algorithms in their simplest forms: that in which the state and action spaces are small enough for the approximate value functions to be represented as arrays, or tables. In this case, the methods can often find exact solutions…”
Reinforcement Learning Mini-Bootcamp | Pillow Lab Meeting, 06/27/19 | Nicholas Roy
Part I: Tabular Solution Methods
• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods
Let’s get through the basics…
• Agent/Environment
• States
• Actions
• Rewards
• Markov
• MDP
• Dynamics p(s’,r|s,a)
• Returns
• Discount factors
• Episodic/Continuing tasks
• Policies
• State-/Action-value function
• Bellman equation
• Optimal policies
Agent vs. Environment
• Agent: the learner and decision maker, interacts with…
• Environment: everything else
• In a finite Markov Decision Process (MDP), the sets of states S, actions A, and rewards R each have a finite number of elements
MDP Dynamics
• Dynamics defined completely by p
• Dynamics have the Markov property: they depend only on the current (s,a)
• Can collapse this 4D table to get other functions of interest:
– state-transitions: p(s'|s,a) = Σ_r p(s',r|s,a)
– expected reward: r(s,a) = Σ_{s',r} r · p(s',r|s,a)
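To make the collapse of the 4D table concrete, here is a minimal sketch on a hypothetical 2-state MDP; the dictionary layout and function names are illustrative, not from the book.

```python
# Sketch: collapsing the 4D dynamics p(s', r | s, a) into state-transition
# probabilities and expected rewards. The tiny 2-state MDP is hypothetical.
states, actions, rewards = [0, 1], [0], [0.0, 1.0]

# p[(s_next, r, s, a)] = probability; entries sum to 1 over (s_next, r)
p = {
    (0, 0.0, 0, 0): 0.5, (1, 1.0, 0, 0): 0.5,
    (1, 0.0, 1, 0): 1.0,
}

def transition_prob(s_next, s, a):
    """p(s'|s,a) = sum_r p(s',r|s,a)"""
    return sum(p.get((s_next, r, s, a), 0.0) for r in rewards)

def expected_reward(s, a):
    """r(s,a) = sum_{s',r} r * p(s',r|s,a)"""
    return sum(r * p.get((s_next, r, s, a), 0.0)
               for s_next in states for r in rewards)
```

Both quantities are pure marginalizations of p, which is why p is said to define the dynamics completely.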
Rewards and Returns
• Reward hypothesis: “…goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”
– a way of communicating what you want to achieve, not how
• Return: G_t = R_{t+1} + R_{t+2} + … + R_T
• Discounted Return: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
• With discount γ < 1 (and bounded rewards), the return is finite even over infinitely many time steps
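The discounted return can be computed with one backward pass, using the recursion G_t = R_{t+1} + γG_{t+1}; a minimal sketch with a hypothetical reward sequence:

```python
# Sketch: discounted return G_t = sum_k gamma^k R_{t+k+1} for a
# hypothetical episode's reward list.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):   # backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]
g0 = discounted_return(rewards, gamma=0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
```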
Policies & Value Functions
• Policy: a mapping from states to probabilities of selecting each possible action, π(a|s)
• State-value function: for a given policy π and state s, the expected return when starting in s and following π thereafter: v_π(s) = E_π[G_t | S_t = s]
• Action-value function: for a given policy π, state s, and action a, the expected return when starting in s, taking a, then following π: q_π(s,a) = E_π[G_t | S_t = s, A_t = a]
• The existence and uniqueness of v_π and q_π are guaranteed as long as either γ < 1 or eventual termination is guaranteed from all states under the policy π
Bellman Equation
• Bellman equation: a recursive relationship, satisfied by the unique functions v_π and q_π, between the value of a state and the values of its successor states:
v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ v_π(s')]
• An analogous equation holds for q_π(s,a); one can also easily convert between v_π and q_π
• There is always at least one policy that is better than or equal to all other policies. This is an optimal policy. Although there may be more than one, we denote all the optimal policies by π*
– They share the same state- and action-value functions, v* and q*
Optimal Policies
Bellman optimality equations:
v*(s) = max_a Σ_{s',r} p(s',r|s,a) [r + γ v*(s')]
q*(s,a) = Σ_{s',r} p(s',r|s,a) [r + γ max_{a'} q*(s',a')]
Part I: Tabular Solution Methods
• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods
• DP algorithms can be used to compute optimal policies if given the complete dynamics, p(s',r|s,a), of an MDP
• A strong assumption and computationally expensive, but provides the theoretical best case that other algorithms attempt to achieve
• Chapter introduces Policy Evaluation, then Policy Iteration:
1. Initialize an arbitrary value function v and a random policy π
2. Use the Bellman update to move v toward v_π until convergence
3. Update π' to be greedy w.r.t. v_π
4. Repeat from (2) until v_π' = v_π, implying that v_π = v*
• π* is just the greedy policy w.r.t. v*
Dynamic Programming
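The evaluate-then-improve loop can be sketched in a few lines; the 2-state MDP and the dictionary layout below are hypothetical, chosen only so the loop terminates quickly.

```python
# Sketch of policy iteration on a hypothetical 2-state MDP.
# P[s][a] = list of (prob, s_next, reward, done) tuples; gamma < 1.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
}
gamma, theta = 0.9, 1e-8
states, actions = [0, 1], [0, 1]

def q_value(V, s, a):
    return sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])

def policy_iteration():
    V = {s: 0.0 for s in states}
    pi = {s: 0 for s in states}            # arbitrary initial policy
    while True:
        while True:                        # 2. iterative policy evaluation
            delta = 0.0
            for s in states:
                v_new = q_value(V, s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        stable = True                      # 3. greedy policy improvement
        for s in states:
            best = max(actions, key=lambda a: q_value(V, s, a))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                         # 4. v_pi' == v_pi  =>  optimal
            return pi, V

pi_star, v_star = policy_iteration()
```

Here the optimal policy always takes action 1 (reward 1 per step), so v* = 1/(1 − γ) = 10 for both states.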
Generalized Policy Iteration (GPI)
We can actually skip the strict iteration and just update the policy to be greedy in real-time…
A Quick Example…
• DP suffers from the “curse of dimensionality” (a term coined by Bellman, as is “dynamic programming”!)
– But exponentially better than direct search
– Modern computers can handle millions of states; updates can run asynchronously
• DP is essentially just the Bellman equations turned into updates
• Generalized policy iteration is proven to converge for DP
• Bootstrapping: DP bootstraps, that is, it updates estimates of values using other estimated values
– Unlike the next set of methods…
DP Summary
Part I: Tabular Solution Methods
• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods
• What is v_π? The expected return G from each state under π
• So, why not just learn v_π by averaging returns G?
Motivation for Monte Carlo
1. MC operates on sample experience, not with full dynamics
– DP : computing v_π :: MC : learning v_π
2. MC does not bootstrap; it estimates v_π directly from returns G

Advantages of MC over DP:
1. Can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment’s dynamics
2. If there is a model, can learn from simulation (e.g., Blackjack)
3. Easy and efficient to focus Monte Carlo methods on a small subset of states
4. No bootstrapping means MC is less harmed by violations of the Markov property
The difference between Monte Carlo and DP
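Learning v_π by averaging returns is first-visit Monte Carlo prediction; a minimal sketch, where episodes are hypothetical (state, reward) sequences assumed to have been generated under π:

```python
# Sketch: first-visit Monte Carlo prediction -- estimate v_pi by averaging
# sampled returns G, with no model of the dynamics.
from collections import defaultdict

def mc_prediction(episodes, gamma):
    returns = defaultdict(list)
    for episode in episodes:              # episode: [(s0, r1), (s1, r2), ...]
        g = 0.0
        # walk backwards accumulating G_t = R_{t+1} + gamma * G_{t+1}
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            g = r + gamma * g
            if s not in (s2 for s2, _ in episode[:t]):   # first visit only
                returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

V = mc_prediction([[(0, 1.0), (1, 1.0)], [(0, 0.0), (1, 1.0)]], gamma=1.0)
```

Each state's estimate is just the sample mean of the returns observed after its first visit in each episode; no value estimate ever feeds into another (no bootstrapping).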
• The problem is now non-stationary: the return after taking an action in one state depends on the actions taken in later states of the same episode
• If π is a deterministic policy, then in following π one will observe returns for only one of the actions from each state. With no returns, the Monte Carlo estimates of the other actions will not improve with experience.
• Must ensure continual exploration for policy evaluation to work
• Solutions:
– Exploring starts (every state–action pair has a nonzero chance of starting an episode)
– On-policy: ε-greedy (choose a random action with probability ε)
– Off-policy: importance sampling (use a distinct behavior policy b to explore while improving π)
Problems of Exploration
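The on-policy fix is a one-liner in practice; a minimal sketch of ε-greedy action selection, with a hypothetical Q table:

```python
# Sketch: epsilon-greedy action selection -- explore with probability
# epsilon, otherwise act greedily w.r.t. the current Q estimates.
import random

def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit

Q = {(0, 'left'): 0.1, (0, 'right'): 0.9}
```

Because every action keeps probability at least ε/|A(s)|, returns continue to be observed for all actions, which is exactly the continual exploration the evaluation step needs.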
• In on-policy methods, the agent commits to always exploring and tries to find the best policy that still explores.
• In off-policy methods, the agent also explores, but learns a deterministic optimal policy (π, usually greedy) that may be unrelated to the policy followed (b, the behavior policy).
• Off-policy prediction methods are based on some form of importance sampling: weighting returns by the ratio of the probabilities of taking the observed actions under the two policies, thereby transforming their expectations from the behavior policy to the target policy.
On-policy vs. Off-policy for Exploration
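The reweighting described above is the importance-sampling ratio ρ = Π_t π(A_t|S_t) / b(A_t|S_t); a minimal sketch, where the two policies and the trajectory are hypothetical:

```python
# Sketch: the importance-sampling ratio that reweights a return observed
# under behavior policy b so its expectation matches target policy pi.
def importance_ratio(trajectory, pi, b):
    """rho = prod_t pi(A_t|S_t) / b(A_t|S_t); requires b(a|s) > 0
    wherever pi(a|s) > 0 (coverage)."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi[(s, a)] / b[(s, a)]
    return rho

pi = {(0, 'up'): 1.0, (0, 'down'): 0.0}   # deterministic target policy
b  = {(0, 'up'): 0.5, (0, 'down'): 0.5}   # exploratory behavior policy
```

Trajectories the target policy would never generate get weight zero, while trajectories it favors get weight greater than one, so the weighted average of b's returns estimates v_π.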
Part I: Tabular Solution Methods
• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods
A Comparison of Updates
• “If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.”
• Like MC methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.
• Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
Advantages of TD methods:
– can be applied online
– with a minimal amount of computation
– using experience generated from interaction with an environment
– expressed nearly completely by single equations, implemented with small computer programs
Temporal-Difference Learning
• TD Update: V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
• TD Error: δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)
TD Update & Error
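The update and error above fit in one small function; a minimal sketch with hypothetical transition values:

```python
# Sketch: tabular TD(0) -- one update from a single observed transition
# (s, r, s_next); alpha is the step size.
def td0_update(V, s, r, s_next, alpha, gamma):
    delta = r + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha * delta                  # TD update
    return delta

V = {0: 0.0, 1: 1.0}
delta = td0_update(V, s=0, r=0.5, s_next=1, alpha=0.1, gamma=0.9)
```

Note the bootstrapping: the target R_{t+1} + γV(S_{t+1}) uses the current estimate V(S_{t+1}) rather than waiting for the episode's actual return.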
Example: TD(0) vs. MC
Sarsa: on-policy TD control
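A minimal sketch of the Sarsa update, with hypothetical Q values; the name reflects the quintuple (S, A, R, S', A') it consumes:

```python
# Sketch: the Sarsa update -- on-policy, so it bootstraps from the action
# a_next actually selected by the current policy in s_next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {(0, 0): 0.0, (1, 0): 2.0, (1, 1): 5.0}
sarsa_update(Q, 0, 0, 1.0, 1, 0, alpha=0.5, gamma=1.0)  # uses Q[(1,0)], not the max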
Q-learning: off-policy TD control
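A minimal sketch of the Q-learning update, with the same hypothetical Q values, to highlight the single difference from Sarsa:

```python
# Sketch: the Q-learning update -- off-policy, so it bootstraps from
# max_a' Q(s_next, a') regardless of which action is taken next.
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {(0, 0): 0.0, (1, 0): 2.0, (1, 1): 5.0}
q_learning_update(Q, 0, 0, 1.0, 1, actions=[0, 1], alpha=0.5, gamma=1.0)
```

The max in the target means Q-learning directly approximates q* while following an exploratory behavior policy, which is what makes it off-policy.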
Case Study: Cliff Walking
Part I: Tabular Solution Methods
• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods
• Specifically, n-step TD methods
– Bridge the gap between one-step TD(0) and ∞-step Monte Carlo
• With TD(0), the same time step determines both how often the action can be changed and the time interval for bootstrapping
– want to update action values quickly to take into account any changes
– but bootstrapping works best over a length of time in which a significant and recognizable state change has occurred
• Will be superseded by Ch12’s eligibility traces, a continuous version of the same idea
n-step Methods
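The n-step target is the truncated return G_{t:t+n} = R_{t+1} + γR_{t+2} + … + γ^{n−1}R_{t+n} + γ^n V(S_{t+n}); a minimal sketch with hypothetical inputs:

```python
# Sketch: the n-step return -- n sampled rewards, then bootstrap from the
# value estimate at step t+n; n = len(rewards).
def n_step_return(rewards, v_boot, gamma):
    """G = sum_k gamma^k R_{t+k+1} + gamma^n V(S_{t+n})"""
    g = v_boot
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g2 = n_step_return([1.0, 1.0], v_boot=10.0, gamma=0.9)   # 2-step target
```

n = 1 recovers the TD(0) target, and letting n run to the end of the episode (with no bootstrap term) recovers the Monte Carlo return.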
Part I: Tabular Solution Methods
• Ch2: Multi-armed Bandits
• Ch3: Finite Markov Decision Processes
• Ch4: Dynamic Programming
• Ch5: Monte Carlo Methods
• Ch6: Temporal-Difference Learning
• Ch7: n-step Bootstrapping
• Ch8: Planning and Learning with Tabular Methods
• Planning methods (model-based): DP
• Learning methods (model-free): MC, TD
• Both method types:
– look ahead to future events,
– compute a backed-up value,
– and use it as an update target for an approximate value function
• DP target: Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')] (full expectation under the model)
• MC target: G_t (the full sampled return)
• TD target: R_{t+1} + γ V(S_{t+1}) (one sampled step, then bootstrap)
• Now seek to unify model-free and model-based methods
Model-free vs. Model-based
From experience you can:
(1) improve your value function & policy (direct RL)
(2) improve your model (model-learning, or indirect RL)
Dyna
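One step of Dyna-Q combines both uses of experience; a minimal sketch under the assumption of a deterministic environment (so the model is a simple lookup table), with hypothetical values:

```python
# Sketch of one Dyna-Q step: (1) direct RL update from real experience,
# (2) model learning, (3) n planning updates from simulated transitions.
import random
from collections import defaultdict

def dyna_q_step(Q, model, s, a, r, s_next, actions, alpha, gamma,
                n_planning, rng=random):
    def update(s, a, r, s2):
        target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    update(s, a, r, s_next)           # (1) direct RL (Q-learning update)
    model[(s, a)] = (r, s_next)       # (2) model learning
    for _ in range(n_planning):       # (3) planning: replay stored model
        (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
        update(ps, pa, pr, ps2)

Q = defaultdict(float)
model = {}
dyna_q_step(Q, model, 0, 0, 1.0, 1, actions=[0], alpha=0.5, gamma=0.9,
            n_planning=5)
```

The planning loop squeezes extra value updates out of each real transition, which is why Dyna-Q typically needs far less real experience than Q-learning alone.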
A Final Overview
3 key ideas in common
1. They all seek to estimate value functions
2. They all operate by backing up values along actual or possible state trajectories
3. They all follow the general strategy of generalized policy iteration (GPI), meaning that they maintain an approximate value function and approximate policy, and continually try to improve each on the basis of the other.
+ 3rd dimension: On- vs. Off-policy
Ch7: n-step Methods / Ch12: Eligibility Traces
Other method dimensions to consider…
The Rest of the Book
• Part I: Tabular Solution Methods
• Part II: Approximate Solution Methods
– Ch 9: On-policy Prediction with Approximation
– Ch10: On-policy Control with Approximation
– Ch11: Off-policy Methods with Approximation
– Ch12: Eligibility Traces
– Ch13: Policy Gradient Methods
• Part III: Looking Deeper– Neuroscience, Psychology, Applications and Case Studies, Frontiers