

Reinforcement Learning and Dynamic Programming for Control

Lecture notes for "Control prin învățare" (Control through Learning), Master ICAF, 2012, Semester 2

Lucian Bușoniu

Contents

List of algorithms

I Reinforcement learning framework

1 Introduction
  1.1 The RL problem from the control perspective
  1.2 The artificial intelligence perspective
  1.3 The machine learning perspective
  1.4 RL and dynamic programming
  1.5 Importance of RL

2 The reinforcement learning problem
  2.1 Markov decision processes
  2.2 Optimality
  2.3 Value functions and Bellman equations
  2.4 Stochastic case

II Classical reinforcement learning and dynamic programming

3 Dynamic programming
  3.1 Value iteration
  3.2 Policy iteration
  3.3 Policy search

4 Monte Carlo methods
  4.1 Monte Carlo policy iteration
  4.2 The need for exploration
  4.3 Optimistic Monte Carlo learning

5 Temporal difference methods
  5.1 Q-learning
  5.2 SARSA

6 Accelerating temporal-difference learning
  6.1 Eligibility traces
  6.2 Experience replay
  6.3 Model learning and Dyna

III Approximate reinforcement learning

7 Function approximation
  7.1 Approximation architectures
  7.2 Approximation in the context of DP and RL
  7.3 Comparison of approximators

8 Offline approximate reinforcement learning
  8.1 Approximate value iteration
  8.2 Approximate policy iteration
  8.3 Theoretical properties. Comparison

9 Online approximate reinforcement learning
  9.1 Approximate Q-learning
  9.2 Approximate SARSA
  9.3 Actor-critic
  9.4 Discussion

10 Accelerating online approximate RL
  10.1 Eligibility traces
  10.2 Experience replay

Glossary

List of Algorithms

3.1 Q-iteration
3.2 Q-iteration for the stochastic case
3.3 Policy iteration
3.4 Iterative policy evaluation
3.5 Iterative policy evaluation for the stochastic case
3.6 Optimistic planning for online control
4.1 Monte Carlo policy iteration
4.2 Optimistic Monte Carlo
5.1 Q-learning
5.2 SARSA
6.1 Q(λ)
6.2 SARSA(λ)
6.3 ER Q-learning
6.4 Dyna-Q
8.1 Fuzzy Q-iteration
8.2 Fitted Q-iteration
8.3 Approximate policy iteration
8.4 Least-squares policy iteration
9.1 Approximate Q-learning
9.2 Approximate SARSA
9.3 Actor-critic
10.1 Approximate Q(λ)
10.2 Approximate ER Q-learning

Part I

Reinforcement learning framework

Chapter 1

    Introduction

Reinforcement learning (RL) is a class of algorithms for solving problems in which actions (decisions) are applied to a system over an extended period of time, in order to achieve a desired goal. The time variable is usually discrete and actions are taken at every discrete time step, leading to a sequential decision-making problem. The actions are taken in closed loop, which means that the outcome of earlier actions is monitored and taken into account when choosing new actions. Rewards are provided that evaluate the one-step decision-making performance, and the goal is to optimize the long-term performance, measured by the total reward accumulated over the course of interaction.

Such decision-making problems appear in a wide variety of fields, including automatic control, computer science, artificial intelligence, operations research, economics, and medicine. In these notes, we primarily adopt a control-theoretic point of view, and therefore employ control-theoretical notation and terminology, and choose control systems as examples. Nevertheless, to provide a higher-level picture of the field, in this first chapter we also introduce the artificial-intelligence and machine-learning perspectives on RL.

    1.1 The RL problem from the control perspective

The main elements of the RL problem, together with their flow of interaction, are represented in Figure 1.1: a controller interacts with a process by measuring states and applying actions, and receives rewards according to a reward function.

To clarify the meaning of these elements, consider the conceptual robotic navigation example from Figure 1.2, in which the robot shown in the bottom region must navigate to the goal on the top-right, while avoiding the obstacle represented by a gray block. (For instance, in the field of rescue robotics, the goal might represent the location of a victim to be rescued.) The controller is the robot's software, and the process consists of the robot's environment (the surface on which it moves, the obstacle, and the goal) together with the body of the robot itself.


Figure 1.1: The elements of RL and their flow of interaction (controller, process, and reward function, exchanging the state, action, and reward signals). The elements related to the reward are depicted in gray.

It should be emphasized that in RL, the physical body of the decision-making entity (if it has one), its sensors and actuators, as well as any fixed lower-level controllers, are all considered to be a part of the process, whereas the controller is taken to be only the decision-making algorithm.

Figure 1.2: A robotic navigation example, with the state (position) x_k, the action (step) u_k, the next state x_{k+1}, and the reward r_{k+1}. An example transition is also shown, in which the current and next states are indicated by black dots, the action by a black arrow, and the reward by a gray arrow. The dotted silhouette represents the robot in the next state.

In the navigation example, the state x is the position of the robot on the surface, given, e.g., in Cartesian coordinates, and the action u is a step taken by the robot, similarly given in Cartesian coordinates. The discrete time step is denoted by k. As a result of taking a step from the current position, the next position is obtained, according to a transition function. Because both the positions and steps are represented in Cartesian coordinates, the transitions are typically additive: the next position is the sum of the current position and the step taken. More complicated transitions are obtained if the robot collides with the obstacle. Note that for simplicity, most of the dynamics of the robot, such as the motion of the wheels, have not been taken into account here. For instance, if the wheels can slip on the surface, the transitions become stochastic, in which case the next state is a random variable.

The quality of every transition is measured by a reward r, generated according to the reward function. For instance, the reward could have a positive value such as 10 if the robot reaches the goal, a negative value such as −1, representing a penalty, if the robot collides with the obstacle, and a neutral value of 0 for any other transition. Alternatively, more informative rewards could be constructed, using, e.g., the distances to the goal and to the obstacle. We follow the convention that the reward obtained as a result of the transition from x_k to x_{k+1} has time index k+1.

The goal is to maximize the return, consisting of the cumulative reward over the course of interaction. We mainly consider discounted infinite-horizon returns, which accumulate rewards obtained along (possibly infinitely long) trajectories starting at the initial time step k = 0, and weigh the rewards by a factor that decreases exponentially as the time step increases:

$$\gamma^0 r_1 + \gamma^1 r_2 + \gamma^2 r_3 + \dots \qquad (1.1)$$

The discount factor γ ∈ [0,1) can be seen as a measure of how far-sighted the controller is in considering its rewards. Figure 1.3 illustrates the computation of the discounted return for the navigation problem of Figure 1.2.

Figure 1.3: The discounted return along a trajectory of the robot, accumulating the terms γ^0 r_1, γ^1 r_2, γ^2 r_3, γ^3 r_4, ... The decreasing heights of the gray vertical bars indicate the exponentially diminishing nature of the discounting applied to the rewards.

The core challenge is therefore to arrive at a solution that optimizes the long-term performance given by the return, using only reward information that describes the immediate performance.


    1.2 The artificial intelligence perspective

In artificial intelligence, RL is useful for learning optimal behavior for intelligent agents, which monitor their environment through perceptions and influence it by applying actions. This view of the RL problem is shown in Figure 1.4.

Figure 1.4: RL from the AI perspective: an agent applies actions to its environment, and receives perceptions (states) and rewards.

As an example, we can consider again the robotic navigation problem above; autonomous mobile robotics is an application domain where automatic control and AI meet in a natural way. We can simply view the robot as the artificial agent that must accomplish a task in its environment.

Note that in AI, the rewards are viewed as produced by the environment, and the reward function is considered as a (possibly unknown) part of this environment. In contrast, in control the reward function is simply a performance evaluation component, under the complete control of the experimenter.

    1.3 The machine learning perspective

Machine learning is the subfield of computer science and AI concerned with algorithms that analyze data in order to achieve various types of results. From the perspective of machine learning, RL sits in-between two other paradigms: supervised and unsupervised learning, see Figure 1.5.

Figure 1.5: RL on the machine learning spectrum, between supervised learning (more informative feedback) and unsupervised learning (less informative feedback).


Supervised learning is about generalizing input-output relationships from examples. In each learning example, an input is associated with the correct corresponding output, so in a sense there is a teacher guiding the learning process step-by-step. Once learning is completed, the algorithm receives new inputs that it has not necessarily seen before, and must (approximately) determine outputs corresponding to these inputs. In RL, the teacher (reward function) never provides the correct, optimal actions, but only the less informative reward signal, which the algorithm must then use to find the correct actions by itself, a significantly harder problem.

As an example of supervised learning, consider a robot arm equipped with a gripper and a camera sensor, which disassembles used electrical motors. The robot has to sort motor parts transported on a conveyor belt into several classes, after recognizing them by their features, such as shape, color, size or texture. In this case, a supervised learning algorithm would first be provided with a training set of known objects with their features (inputs) and classes (output), and would then have to find the class of each new object arriving on the belt.

Unsupervised learning is concerned with finding patterns and relationships in data that does not have any well-defined outputs. As there are no outputs, there can be no information at all about which output is correct, and in this sense we say that in unsupervised learning there is less informative feedback than in RL. As an example, a retailer could look at the purchasing habits of their customers (prices of items purchased, frequency of purchases, etc.) and try to organize them into several market segments, without knowing in advance what these segments should be, and perhaps not even knowing how many segments there should be.

This view is useful to understand reinforcement learning, but should be considered in light of the following non-obvious fact. Both supervised and unsupervised learning are concerned only with making certain statements about (often static) data. RL uses the data to learn how to control a dynamical system, which adds an extra dimension and new challenges that make RL different from the rest of the machine learning field.

    1.4 RL and dynamic programming

If a mathematical model of the process-to-be-controlled is available, a class of model-based methods called dynamic programming (DP) can be applied. A key benefit of DP methods is that they make few assumptions on the system, which can have very general, nonlinear and stochastic dynamics. This is in contrast to more classical techniques from automatic control, many of which require restrictive assumptions on the system, such as linearity or determinism. Moreover, some DP methods do not require an analytical expression of the model, but are able to work with a simulation model instead: a model that does not expose the mathematical expressions for the functions f and ρ, but can be used to generate next states and rewards for any state-action pair.


Constructing a simulation model is often easier than deriving an analytical model, especially when the system behavior is stochastic.

DP is not usually seen as being a part of RL, but most RL algorithms have their theoretical foundation in DP, so we will devote significant space to explaining DP in these notes.

Sometimes, a model of the system cannot be obtained, e.g., because the system is not fully known beforehand, is insufficiently understood, or obtaining a model is too costly. RL methods are helpful in this case, since they work using only data obtained from the system, without requiring a model of its behavior. Offline RL methods are applicable if sufficient data can be obtained in advance. Online RL algorithms learn a solution by interacting with the system, and can therefore be applied when data is not available in advance. Note that RL methods can, of course, also be applied when a model is available, simply by using the model in place of the real system. It is very common to benchmark RL methods on simulation models, possibly as a preliminary step to applying them in real-life problems.

    1.5 Importance of RL

We close this chapter by summarizing the benefits of RL in control and AI, and briefly mentioning some applications. In automatic control, RL algorithms learn how to optimally control very general systems that may be unknown beforehand, which makes them extremely useful. In AI, RL provides a way to build learning agents that optimize their behavior in initially unknown environments. RL has obtained impressive successes in applications such as backgammon playing (Tesauro, 1995), elevator scheduling (Crites and Barto, 1998), simulated treatment of HIV infections (Ernst et al., 2006), autonomous helicopter flight (Abbeel et al., 2007), robot control (Peters and Schaal, 2008), interfacing an animal brain with a robot arm (DiGiovanna et al., 2009), etc.¹

¹ The bibliography can be found at the end of Part I. Separate bibliographies are provided per part.

Chapter 2

The reinforcement learning problem

We will now consider in more detail the reinforcement learning problem and its solution, delving more into the mathematical foundations of the field.

    2.1 Markov decision processes

RL problems can be formalized with the help of Markov decision processes (MDPs). We focus on the case of MDPs with deterministic state transitions.

A deterministic MDP is defined by the state space X of the process, the action space U of the controller, the transition function f of the process (which describes how the state changes as a result of control actions), and the reward function ρ (which evaluates the immediate control performance). As a result of the action u_k applied in the state x_k at the discrete time step k, the state changes to x_{k+1}, according to the transition function $f : X \times U \to X$:

$$x_{k+1} = f(x_k, u_k)$$

At the same time, the controller receives the scalar reward signal r_{k+1}, according to the reward function $\rho : X \times U \to \mathbb{R}$:

$$r_{k+1} = \rho(x_k, u_k)$$

where we assume that $\|\rho\|_\infty = \max_{x,u} |\rho(x,u)|$ is finite.¹ The reward evaluates the immediate effect of action u_k, namely the transition from x_k to x_{k+1}, but in general does not say anything about its long-term effects.

¹ If a maximum does not exist, the supremum (sup) should be used instead. Also note that since the domains of the variables over which we are maximizing are obvious from the context (X and U), they were omitted from the formula. To avoid clutter, these types of simplifications will also be performed in the sequel.


The controller chooses actions according to its policy $h : X \to U$, using:

$$u_k = h(x_k)$$

In control theory, such a control law is called a state feedback. Given f and ρ, the current state x_k and the current action u_k are sufficient to determine both the next state x_{k+1} and the reward r_{k+1}. This is called the Markov property, which is necessary to provide theoretical guarantees about DP/RL algorithms.

Some MDPs have terminal states that, once reached, can no longer be left; all the rewards received in terminal states are 0. In this case, a trial or episode is a trajectory starting from some initial state and ending in a terminal state. MDPs with terminal states are called episodic, while MDPs without terminal states are called continuing.

Example 2.1 The cleaning-robot MDP. Consider the problem depicted in Figure 2.1: a cleaning robot has to collect a used can and also has to recharge its batteries.

Figure 2.1: The cleaning-robot problem. The states x = 0, 1, ..., 5 lie on a line; the robot moves left (u = −1) or right (u = 1), and receives a reward of 1 for reaching state 0 and of 5 for reaching state 5.

In this problem, the state x describes the position of the robot, and the action u describes the direction of its motion. The state space is discrete and contains six distinct states, denoted by integers 0 to 5: X = {0, 1, 2, 3, 4, 5}. The robot can move to the left (u = −1) or to the right (u = 1); the discrete action space is therefore U = {−1, 1}. States 0 and 5 are terminal, meaning that once the robot reaches either of them, it can no longer leave, regardless of the action. The corresponding transition function is:

$$f(x,u) = \begin{cases} x + u & \text{if } 1 \le x \le 4 \\ x & \text{if } x = 0 \text{ or } x = 5 \text{ (regardless of } u \text{)} \end{cases}$$

In state 5, the robot finds a can and the transition into this state is rewarded with 5. In state 0, the robot can recharge its batteries and the transition into this state is rewarded with 1. All other rewards are 0. In particular, taking any action while in a terminal state results in a reward of 0, which means that the robot will not accumulate (undeserved) rewards in the terminal states. The corresponding reward function is:

$$\rho(x,u) = \begin{cases} 5 & \text{if } x = 4 \text{ and } u = 1 \\ 1 & \text{if } x = 1 \text{ and } u = -1 \\ 0 & \text{otherwise} \end{cases}$$
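To make the example concrete, here is a minimal Python sketch of this deterministic MDP (the names X, U, f, and rho are illustrative choices, not part of the original notes):

```python
# Cleaning-robot MDP of Example 2.1 (deterministic case).
X = [0, 1, 2, 3, 4, 5]      # state space: positions on the line
U = [-1, 1]                 # action space: move left or right
TERMINAL = {0, 5}           # terminal (absorbing) states

def f(x, u):
    """Transition function: move by u, except in terminal states."""
    return x if x in TERMINAL else x + u

def rho(x, u):
    """Reward function: 5 for reaching state 5, 1 for reaching state 0."""
    if x == 4 and u == 1:
        return 5.0
    if x == 1 and u == -1:
        return 1.0
    return 0.0
```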


2.2 Optimality

In RL and DP, the goal is to find an optimal policy that maximizes the return from any initial state x_0. The return is a cumulative aggregation of rewards along a trajectory starting at x_0. It concisely represents the reward obtained by the controller in the long run. Several types of return exist, depending on the way in which the rewards are accumulated. We will mostly be concerned with the infinite-horizon discounted return, given by:

$$R^h(x_0) = \sum_{k=0}^{\infty} \gamma^k r_{k+1} = \sum_{k=0}^{\infty} \gamma^k \rho(x_k, h(x_k)) \qquad (2.1)$$

where γ ∈ [0,1) is the discount factor and x_{k+1} = f(x_k, h(x_k)) for k ≥ 0. The discount factor can be interpreted intuitively as a measure of how far-sighted the controller is in considering its rewards, or as a way of taking into account increasing uncertainty about future rewards. From a mathematical point of view, discounting ensures that the return will always be finite if the rewards are finite. The goal is therefore to maximize the long-term performance (return), while only using feedback about the immediate, one-step performance (reward). This leads to the so-called challenge of delayed rewards: actions taken in the present affect the potential to achieve good rewards far in the future, but the immediate reward provides no information about these long-term effects.
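As a quick illustration, the return of a policy from a given initial state can be approximated by simulating the system for a finite number of steps and accumulating discounted rewards. This is a minimal sketch, assuming the f and rho functions from the Example 2.1 sketch above:

```python
def discounted_return(f, rho, h, x0, gamma=0.5, K=100):
    """Approximate the infinite-horizon discounted return (2.1)
    by truncating the sum after K steps."""
    x, ret = x0, 0.0
    for k in range(K):
        u = h(x)                      # action chosen by the policy
        ret += gamma**k * rho(x, u)   # discounted reward r_{k+1}
        x = f(x, u)                   # next state
    return ret

# Always moving right in the cleaning-robot MDP, starting from x0 = 2:
# discounted_return(f, rho, lambda x: 1, x0=2)  ->  1.25
```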

Other types of return can also be defined. For example, the infinite-horizon average return is:

$$\lim_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K} \rho(x_k, h(x_k))$$

Finite-horizon returns can be obtained by accumulating rewards along trajectories of a fixed, finite length K (the horizon), instead of along infinitely long trajectories. For instance, the finite-horizon discounted return can be defined as:

$$\sum_{k=0}^{K} \gamma^k \rho(x_k, h(x_k))$$

We will mainly use the infinite-horizon discounted return (2.1), because it has useful theoretical properties. In particular, for this type of return, under certain technical assumptions, there always exists at least one stationary optimal policy $h^* : X \to U$. (In contrast, in the finite-horizon case, optimal policies depend in general on the time step k, i.e., they are nonstationary.)

In practice, a good value of γ has to be chosen. Choosing γ often involves a trade-off between the quality of the solution and the convergence rate of the DP/RL algorithm. Some important DP/RL algorithms converge faster when γ is smaller, but if γ is too small, the solution may be unsatisfactory because it does not sufficiently take into account rewards obtained after a large number of steps. There is no generally valid procedure for choosing γ, but a rough guideline would be that good discount factors are often in the range [0.9, 1), and the value should increase with the length of transients in typical trajectories of the system.

Example 2.2 Choosing the discount factor for the stabilization of a simple system. Consider a linear first-order system that must be stabilized, for which we know from prior knowledge that the typical (stable) trajectory of the state looks like in Figure 2.2. The sampling time is T_s = 0.25 s. Presumably, the rewards in the time interval t ∈ [15, 20] indicate that stabilization has been achieved.

Figure 2.2: Typical trajectory of a simple system (state x versus time t = k T_s).

For the learning performance, it is important that information from these rewards is sufficiently visible in the discounted return of the initial state at t = 0; that is, that the exponential discounting corresponding to k = 15/T_s = 60 is not too small. Requiring this discounting to be at least 0.05, we get:

$$\gamma^{60} \ge 0.05 \;\Leftrightarrow\; 60 \log\gamma \ge \log 0.05 \;\Leftrightarrow\; \gamma \ge \exp\!\left[\frac{\log 0.05}{60}\right] \approx 0.9513$$

Indeed, choosing γ = 0.96, we get the shape in Figure 2.3 for the discounting curve γ^k, in which the discounting is still larger than 0.05 at t = 15 s (i.e., at k = 60).
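A quick numerical check of this calculation (a small sketch; the numbers follow the example above):

```python
import math

Ts = 0.25                  # sampling time [s]
k = int(15 / Ts)           # discrete step corresponding to t = 15 s  ->  60
gamma_min = math.exp(math.log(0.05) / k)
print(round(gamma_min, 4))   # 0.9513: smallest gamma with gamma**60 >= 0.05
print(0.96 ** k)             # about 0.086, so gamma = 0.96 satisfies the requirement
```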

2.3 Value functions and Bellman equations

A convenient way to characterize policies and optimality is by using value functions. Two types of value functions exist: state-action value functions (Q-functions) and state value functions (V-functions). We will first define and characterize Q-functions, and then turn our attention to V-functions.

The Q-function $Q^h : X \times U \to \mathbb{R}$ of a policy h gives the return obtained when starting from a given state, applying a given action, and following h thereafter:

$$Q^h(x,u) = \rho(x,u) + \gamma R^h(f(x,u)) \qquad (2.2)$$

Figure 2.3: The evolution of the discounting γ^k for γ = 0.96, plotted against t = k T_s.

Here, R^h(f(x,u)) is the return from the next state f(x,u). This concise formula can be obtained by first writing Q^h(x,u) explicitly as the discounted sum of rewards obtained by taking u in x and then following h:

$$Q^h(x,u) = \sum_{k=0}^{\infty} \gamma^k \rho(x_k, u_k)$$

where $(x_0, u_0) = (x,u)$, $x_{k+1} = f(x_k, u_k)$ for $k \ge 0$, and $u_k = h(x_k)$ for $k \ge 1$. Then, the first term is separated from the sum:

$$\begin{aligned}
Q^h(x,u) &= \rho(x,u) + \sum_{k=1}^{\infty} \gamma^k \rho(x_k, h(x_k)) \\
&= \rho(x,u) + \gamma \sum_{k=1}^{\infty} \gamma^{k-1} \rho(x_k, h(x_k)) \\
&= \rho(x,u) + \gamma R^h(f(x,u))
\end{aligned} \qquad (2.3)$$

where the definition (2.1) of the return was used in the last step. So, (2.2) has been obtained.

The optimal Q-function is defined as the best Q-function that can be obtained by any policy:

$$Q^*(x,u) = \max_h Q^h(x,u) \qquad (2.4)$$

Any policy h* that selects at each state an action with the largest optimal Q-value, i.e., that satisfies:

$$h^*(x) \in \arg\max_u Q^*(x,u) \qquad (2.5)$$

is optimal (it maximizes the return). In general, for a given Q-function Q, a policy h that satisfies:

$$h(x) \in \arg\max_u Q(x,u) \qquad (2.6)$$

is said to be greedy in Q. So, finding an optimal policy can be done by first finding Q*, and then using (2.5) to compute a greedy policy in Q*.

When there are multiple greedy actions for a given state, we say there is a tie, which must be resolved by selecting one of these actions (e.g., by randomly picking one of them, or by picking the first action in some natural order over the action space). When Q = Q*, no matter which greedy action is chosen, the policy is still optimal.
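For a tabular Q-function, computing a greedy policy as in (2.5)-(2.6) is a one-line maximization per state. A minimal sketch (the dictionary-based representation is an illustrative choice), breaking ties by taking the first maximizing action in the given action order:

```python
def greedy_policy(Q, X, U):
    """Return a greedy policy h(x) in argmax_u Q(x,u), as a dict state -> action.
    Q is a dict keyed by (x, u); ties are broken by the order of actions in U."""
    return {x: max(U, key=lambda u: Q[(x, u)]) for x in X}
```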

The Q-functions Q^h and Q* are recursively characterized by the Bellman equations, which are of central importance for RL algorithms. The Bellman equation for Q^h states that the value of taking action u in state x under the policy h equals the sum of the immediate reward and the discounted value achieved by h in the next state:

$$Q^h(x,u) = \rho(x,u) + \gamma Q^h(f(x,u), h(f(x,u))) \qquad (2.7)$$

This Bellman equation can be derived from the second step in (2.3), as follows:

$$\begin{aligned}
Q^h(x,u) &= \rho(x,u) + \gamma \sum_{k=1}^{\infty} \gamma^{k-1} \rho(x_k, h(x_k)) \\
&= \rho(x,u) + \gamma \Big[ \rho(f(x,u), h(f(x,u))) + \gamma \sum_{k=2}^{\infty} \gamma^{k-2} \rho(x_k, h(x_k)) \Big] \\
&= \rho(x,u) + \gamma Q^h(f(x,u), h(f(x,u)))
\end{aligned}$$

where $(x_0, u_0) = (x,u)$, $x_{k+1} = f(x_k, u_k)$ for $k \ge 0$, and $u_k = h(x_k)$ for $k \ge 1$.

The Bellman optimality equation characterizes Q*, and states that the optimal value of action u taken in state x equals the sum of the immediate reward and the discounted optimal value obtained by the best action in the next state:

$$Q^*(x,u) = \rho(x,u) + \gamma \max_{u'} Q^*(f(x,u), u') \qquad (2.8)$$

The V-function $V^h : X \to \mathbb{R}$ of a policy h is the return obtained by starting from a particular state and following h. This V-function can be computed from the Q-function of policy h:

$$V^h(x) = R^h(x) = Q^h(x, h(x)) \qquad (2.9)$$

The optimal V-function is the best V-function that can be obtained by any policy, and can be computed from the optimal Q-function:

$$V^*(x) = \max_h V^h(x) = \max_u Q^*(x,u) \qquad (2.10)$$

An optimal policy h* can be computed from V*, by using the fact that it satisfies:

$$h^*(x) \in \arg\max_u \left[ \rho(x,u) + \gamma V^*(f(x,u)) \right] \qquad (2.11)$$

Using this formula is more difficult than using (2.5); in particular, a model of the MDP is required, in the form of the dynamics f and the reward function ρ. Because the Q-function also depends on the action, it already includes information about the quality of transitions. In contrast, the V-function only describes the quality of the states; in order to infer the quality of transitions, they must be explicitly taken into account. This is what happens in (2.11), and this also explains why it is more difficult to compute policies from V-functions. Because of this difference, Q-functions will be preferred to V-functions throughout these notes, even though they are more costly to represent than V-functions, as they depend both on x and u.

The V-functions V^h and V* satisfy the following Bellman equations, which can be interpreted similarly to (2.7) and (2.8):

$$V^h(x) = \rho(x, h(x)) + \gamma V^h(f(x, h(x))) \qquad (2.12)$$

$$V^*(x) = \max_u \left[ \rho(x,u) + \gamma V^*(f(x,u)) \right] \qquad (2.13)$$

2.4 Stochastic case

In closing the first part of these notes, we briefly outline the stochastic MDP case. In a stochastic MDP, the next state is not deterministically given by the current state and action. Instead, the next state is a random variable, with a probability distribution depending on the current state and action.

More formally, the deterministic transition function f is replaced by a transition probability function $f : X \times U \times X \to [0,1]$. After action u_k is taken in state x_k, the probability that the next state x_{k+1} is x' is:

$$P(x_{k+1} = x' \mid x_k, u_k) = f(x_k, u_k, x')$$

For any x and u, $f(x,u,\cdot)$ must define a valid probability distribution, where the dot stands for the random variable x'; that is, $\sum_{x'} f(x,u,x') = 1$. We assume here that the next state x' can only take a finite number of possible values (see, e.g., the cleaning-robot example below).

Because rewards are associated with transitions, and the transitions are no longer fully determined by the current state and action, the reward function also has to depend on the next state, $\rho : X \times U \times X \to \mathbb{R}$. After each transition to a state x_{k+1}, a reward r_{k+1} is received according to:

$$r_{k+1} = \rho(x_k, u_k, x_{k+1})$$

where we assume that $\|\rho\|_\infty = \max_{x,u,x'} |\rho(x,u,x')|$ is finite. Note that we consider ρ to be a deterministic function of the transition (x_k, u_k, x_{k+1}). This means that, once x_{k+1} has been generated, the reward r_{k+1} is fully determined.

Example 2.3 The stochastic cleaning-robot MDP. Consider again the cleaning-robot problem of Example 2.1. Assume that, due to uncertainties in the environment, such as a slippery floor, state transitions are no longer deterministic. When trying to move in a certain direction, the robot succeeds with a probability of only 0.8. With a probability of 0.15 it remains in the same state, and it may even move in the opposite direction, with a probability of 0.05 (see also Figure 2.4).

Figure 2.4: The stochastic cleaning-robot problem. The robot intends to move right, but it may instead end up standing still or moving left, with different probabilities (P = 0.8 for the intended move, P = 0.15 for standing still, and P = 0.05 for the opposite move).

The transition function f that models the probabilistic transitions described above is shown in Table 2.1. In this table, the rows correspond to combinations of current states and actions taken, while the columns correspond to future states. Note that the transitions from any terminal state still lead deterministically to the same terminal state, regardless of the action.

Table 2.1: Dynamics of the stochastic cleaning-robot MDP; each entry is f(x,u,x').

(x,u)     x'=0   x'=1   x'=2   x'=3   x'=4   x'=5
(0,−1)    1      0      0      0      0      0
(1,−1)    0.8    0.15   0.05   0      0      0
(2,−1)    0      0.8    0.15   0.05   0      0
(3,−1)    0      0      0.8    0.15   0.05   0
(4,−1)    0      0      0      0.8    0.15   0.05
(5,−1)    0      0      0      0      0      1
(0,1)     1      0      0      0      0      0
(1,1)     0.05   0.15   0.8    0      0      0
(2,1)     0      0.05   0.15   0.8    0      0
(3,1)     0      0      0.05   0.15   0.8    0
(4,1)     0      0      0      0.05   0.15   0.8
(5,1)     0      0      0      0      0      1

The robot receives rewards as in the deterministic case: upon reaching state 5, it is rewarded with 5, and upon reaching state 0, it is rewarded with 1. The corresponding reward function, in the form $\rho : X \times U \times X \to \mathbb{R}$, is:

$$\rho(x,u,x') = \begin{cases} 5 & \text{if } x \neq 5 \text{ and } x' = 5 \\ 1 & \text{if } x \neq 0 \text{ and } x' = 0 \\ 0 & \text{otherwise} \end{cases}$$
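In code, the stochastic model can be stored as a mapping from each state-action pair to a probability distribution over next states. A minimal sketch following Table 2.1 (the function names are illustrative):

```python
# Stochastic cleaning-robot model of Example 2.3.
TERMINAL = {0, 5}

def f_stoch(x, u):
    """Return a dict of next-state probabilities f(x, u, .)."""
    if x in TERMINAL:
        return {x: 1.0}                          # terminal states are absorbing
    return {x + u: 0.8, x: 0.15, x - u: 0.05}    # intended / stay / opposite

def rho_stoch(x, u, x_next):
    """Reward depends on the realized transition (x, u, x_next)."""
    if x != 5 and x_next == 5:
        return 5.0
    if x != 0 and x_next == 0:
        return 1.0
    return 0.0
```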


The deterministic return from (2.1) is no longer well defined in the stochastic case. Instead, the expected infinite-horizon discounted return of an initial state x_0 under a (deterministic) policy h is defined as:

$$R^h(x_0) = \mathbb{E}_{x_{k+1} \sim f(x_k, h(x_k), \cdot)} \left\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \right\} = \mathbb{E}_{x_{k+1} \sim f(x_k, h(x_k), \cdot)} \left\{ \sum_{k=0}^{\infty} \gamma^k \rho(x_k, h(x_k), x_{k+1}) \right\} \qquad (2.14)$$

where E denotes expectation and the notation $x_{k+1} \sim f(x_k, h(x_k), \cdot)$ means that the next state x_{k+1} is drawn from the probability distribution $f(x_k, h(x_k), \cdot)$ at each step k. Thus, the expectation is taken over all the stochastic transitions, or equivalently over the entire stochastic trajectory. Intuitively, the expected return may be seen as the sum of the returns of all possible trajectories, where each return is weighted by the probability of its respective trajectory; the sum of these weights is 1. An important property of stochastic MDPs is that there always exists a deterministic optimal policy that maximizes (2.14); so even though the dynamics are stochastic, we only need to consider deterministic policies when characterizing the solution.²
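A straightforward way to approximate the expected return (2.14) is Monte Carlo simulation: average truncated returns over many sampled trajectories. A minimal sketch, assuming the f_stoch and rho_stoch functions from the example above:

```python
import random

def expected_return_mc(f_stoch, rho_stoch, h, x0, gamma=0.5, K=50, n_traj=1000):
    """Monte Carlo estimate of the expected discounted return (2.14),
    truncated after K steps and averaged over n_traj simulated trajectories."""
    total = 0.0
    for _ in range(n_traj):
        x, ret = x0, 0.0
        for k in range(K):
            u = h(x)
            probs = f_stoch(x, u)
            x_next = random.choices(list(probs), weights=list(probs.values()))[0]
            ret += gamma**k * rho_stoch(x, u, x_next)
            x = x_next
        total += ret
    return total / n_traj
```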

In the stochastic case, the Q-function is the expected return under the one-step stochastic transitions, when starting in a particular state, applying a particular action, and following the policy h thereafter:

$$Q^h(x,u) = \mathbb{E}_{x' \sim f(x,u,\cdot)} \left\{ \rho(x,u,x') + \gamma R^h(x') \right\} \qquad (2.15)$$

The definition of the optimal Q-function Q* remains unchanged from the deterministic case: $Q^*(x,u) = \max_h Q^h(x,u)$, and optimal policies can still be computed as greedy policies in Q*: $h^*(x) \in \arg\max_u Q^*(x,u)$.

Like the return, the Bellman equations for Q^h and Q* must now be given using expectations over the stochastic transitions. Because the next state x' can only take a finite number of values, these expectations can be written as sums:

$$Q^h(x,u) = \mathbb{E}_{x' \sim f(x,u,\cdot)} \left\{ \rho(x,u,x') + \gamma Q^h(x', h(x')) \right\} = \sum_{x'} f(x,u,x') \left[ \rho(x,u,x') + \gamma Q^h(x', h(x')) \right] \qquad (2.16)$$

$$Q^*(x,u) = \mathbb{E}_{x' \sim f(x,u,\cdot)} \left\{ \rho(x,u,x') + \gamma \max_{u'} Q^*(x', u') \right\} = \sum_{x'} f(x,u,x') \left[ \rho(x,u,x') + \gamma \max_{u'} Q^*(x', u') \right] \qquad (2.17)$$

² We use probability theory rather informally in this section; a mathematically complete formalism would require significant additional development. We point the interested reader to (Bertsekas and Shreve, 1978) for a complete development, and do not consider these difficulties further here.

Thus, the weighted-sum interpretation of the expectation becomes explicit in these equations.

In the remainder of the lecture notes, we focus mainly on deterministic systems. Nevertheless, when the extensions of certain algorithms and results to the stochastic case are important, we separately present these extensions. Whenever the stochastic case is considered, this is explicitly mentioned in the text.


    Bibliographical notes for Part I

The following textbooks provide detailed descriptions of RL and DP: (Bertsekas and Tsitsiklis, 1996; Bertsekas, 2007) are optimal-control oriented, (Sutton and Barto, 1998; Szepesvári, 2010; Sigaud and Buffet, 2010) have an artificial-intelligence perspective, (Powell, 2007) is operations-research oriented, while our recent book (Busoniu et al., 2010) focuses explicitly on approximate RL and DP for control problems. Markov decision processes are described by (Puterman, 1994), and the mathematical conditions under which MDPs have solutions for the various types of return can be found in (Bertsekas and Shreve, 1978). The books (Busoniu et al., 2010; Sutton and Barto, 1998; Bertsekas, 2007) are most useful to supplement the information in these lecture notes.

    Bibliography

Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. (2007). An application of reinforcement learning to aerobatic helicopter flight. In Schölkopf, B., Platt, J. C., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 1-8. MIT Press.

Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 3rd edition.

Bertsekas, D. P. and Shreve, S. E. (1978). Stochastic Optimal Control: The Discrete Time Case. Academic Press.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.

Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. Automation and Control Engineering. Taylor & Francis CRC Press.

Crites, R. H. and Barto, A. G. (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning, 33(2-3):235-262.

DiGiovanna, J., Mahmoudi, B., Fortes, J., Principe, J. C., and Sanchez, J. C. (2009). Coadaptive brain-machine interface via reinforcement learning. IEEE Transactions on Biomedical Engineering, 56(1):54-64.

Ernst, D., Stan, G.-B., Goncalves, J., and Wehenkel, L. (2006). Clinical data based optimal STI strategies for HIV: A reinforcement learning approach. In Proceedings 45th IEEE Conference on Decision & Control, pages 667-672, San Diego, US.

Peters, J. and Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21:682-697.

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.

Sigaud, O. and Buffet, O., editors (2010). Markov Decision Processes in Artificial Intelligence. Wiley.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Szepesvári, Cs. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68.

Part II

Classical reinforcement learning and dynamic programming

Introduction and outline

In this part, we present classical reinforcement learning (RL) algorithms, starting from their dynamic programming (DP) roots.

Classical algorithms require exact representations of value functions (Q-functions Q(x,u) or V-functions V(x)) and/or policies h(x). In general, this can only be achieved by storing separate values for each state or state-action pair, a representation called tabular because it can be seen as a table (say, for the Q-function, with x on the rows, u on the columns, and Q-values in the cells). This means that, in practice, the classical algorithms only work when the state space X and the action space U are discrete and contain a relatively small number of elements. This class of problems does include interesting examples, such as some high-level decision-making problems and certain board games; however, most automatic control problems are ruled out, as they have continuous state and action spaces. We will discuss methods to deal with such problems in Part III of these notes; those methods are all extensions of the classical DP and RL algorithms discussed here.

Figure 2.5: Classification of classical DP and RL algorithms and enhancements discussed. Dynamic programming comprises value iteration, policy iteration, and policy search & planning. Reinforcement learning comprises Monte Carlo methods (Monte Carlo policy iteration, optimistic Monte Carlo) and temporal difference methods (Q-learning, SARSA), together with techniques for accelerating temporal difference learning: eligibility traces (Q(λ), SARSA(λ)), experience replay (ER Q-learning), and model learning (Dyna-Q).


Figure 2.5 classifies the methods discussed in the present part. We start by introducing two classical DP methods for finding the optimal solution of a Markov decision process (MDP): value iteration and policy iteration. Afterwards, we discuss techniques that search directly in the space of policies or control actions. All these techniques are model-based, that is, they use a model of the MDP in the form of the transition dynamics f and reward function ρ. (Note that other textbooks may exclude direct policy search from the DP class, restricting DP to value- and policy-iteration algorithms. We use the term DP to generically mean any model-based method.)

We then move on to RL, that is, model-free methods. First, Monte Carlo methods are discussed; these methods learn on a trajectory-by-trajectory basis, only making changes to their value function and policy at the end of trajectories. After Monte Carlo methods, temporal-difference learning is introduced: a fully incremental class of methods that learn on a sample-by-sample basis. Two major temporal difference methods are presented, Q-learning and SARSA, which can respectively be viewed as online, incremental variants of value and policy iteration. Both Monte Carlo and temporal difference methods learn online, by interacting with the system.

Additionally, we describe several ways to increase the learning speed of temporal difference methods: using so-called eligibility traces, reusing raw data, and learning a model of the MDP which is then used to generate new data. Modified variants of the standard algorithms are introduced exploiting all of these enhancements.

Throughout this part, we introduce algorithms from the deterministic perspective, going into the stochastic setting only when necessary. Despite this, all RL algorithms that we introduce work in general problems, deterministic as well as stochastic. The DP algorithms typically do not; so for value and policy iteration, we present extensions to the stochastic case.

We describe algorithms that employ Q-functions rather than V-functions. The cleaning-robot task of Example 2.1 is employed to illustrate the behavior of several representative algorithms.

Chapter 3

    Dynamic programming

DP algorithms can be broken down into three subclasses, according to the path taken to find an optimal policy:

• Value iteration algorithms search for the optimal value function (recall this consists of the maximal returns from every state or from every state-action pair). The optimal value function is used to compute an optimal policy.

• Policy iteration algorithms evaluate policies by constructing their value functions, and use these value functions to find new, improved policies.

• Policy search algorithms use optimization and related techniques to directly search for an optimal policy. From this class, we also discuss a planning algorithm that ingeniously combines policy search with features of value iteration.

3.1 Value iteration

Value iteration techniques use the Bellman optimality equation to iteratively compute an optimal value function, from which an optimal policy is derived. To solve the Bellman optimality equation, knowledge of the transition and reward functions is employed.

In particular, we introduce the Q-iteration algorithm, which computes Q-functions. Let the set of all the Q-functions be denoted by $\mathcal{Q}$. Then, the Q-iteration mapping $T : \mathcal{Q} \to \mathcal{Q}$ computes the right-hand side of the Bellman optimality equation (2.8) for any Q-function:¹

$$[T(Q)](x,u) = \rho(x,u) + \gamma \max_{u'} Q(f(x,u), u') \qquad (3.1)$$

¹ The term "mapping" is used to refer to functions that work with other functions as inputs and/or outputs. The term is used to differentiate mappings from ordinary functions, which only have numerical scalars, vectors, or matrices as inputs and/or outputs.


The Q-iteration algorithm starts from an arbitrary Q-function Q_0 and at each iteration ℓ updates the Q-function using:

$$Q_{\ell+1} = T(Q_\ell) \qquad (3.2)$$

When rewritten using the Q-iteration mapping, the Bellman optimality equation (2.8) states that Q* is a fixed point of T, i.e.:

$$Q^* = T(Q^*) \qquad (3.3)$$

It can be shown that T is a contraction with factor γ < 1 in the infinity norm, i.e., for any pair of functions Q and Q', it is true that:

$$\|T(Q) - T(Q')\|_\infty \le \gamma \|Q - Q'\|_\infty$$

Because T is a contraction, it has a unique fixed point, and by (3.3) this point is actually Q*. Due to its contraction nature, Q-iteration asymptotically converges to Q* as ℓ → ∞, at a rate of γ, in the sense that $\|Q_{\ell+1} - Q^*\|_\infty \le \gamma \|Q_\ell - Q^*\|_\infty$. An optimal policy can be computed from Q* with (2.5).

Algorithm 3.1 presents Q-iteration in an explicit, procedural form, wherein T is computed using (3.1).

Algorithm 3.1 Q-iteration.
Input: dynamics f, reward function ρ, discount factor γ
1: initialize Q-function, e.g., Q_0(x,u) ← 0 for all x, u
2: repeat at every iteration ℓ = 0, 1, 2, ...
3:   for every (x,u) do
4:     Q_{ℓ+1}(x,u) ← ρ(x,u) + γ max_{u'} Q_ℓ(f(x,u), u')
5:   end for
6: until stopping criterion is satisfied
Output: Q_{ℓ+1}

As a stopping criterion, we could require that Q_{ℓ+1} = Q_ℓ, i.e., that we have reached the convergence point Q*. However, this is only guaranteed to happen as ℓ → ∞, so a more practical solution is to stop when the difference between two consecutive Q-functions decreases below a given threshold ε_QI > 0, i.e., when $\|Q_{\ell+1} - Q_\ell\|_\infty \le \varepsilon_{\mathrm{QI}}$. (The subscript QI stands for Q-iteration.) This can be guaranteed to happen after a finite number of iterations, due to the convergence rate of γ. It is also possible to derive a stopping condition that guarantees the solution returned is within a given distance of the optimum.
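A direct translation of Algorithm 3.1 into Python might look as follows. This is a sketch, not the notes' reference implementation; it assumes a finite deterministic MDP given as Python functions, such as the cleaning-robot model sketched in Example 2.1:

```python
def q_iteration(X, U, f, rho, gamma, eps=1e-8, max_iter=1000):
    """Tabular Q-iteration (Algorithm 3.1) for a deterministic MDP.
    Stops when the infinity-norm difference between consecutive
    Q-functions drops below eps."""
    Q = {(x, u): 0.0 for x in X for u in U}          # Q_0 = 0
    for _ in range(max_iter):
        Q_new = {(x, u): rho(x, u) + gamma * max(Q[(f(x, u), up)] for up in U)
                 for x in X for u in U}
        if max(abs(Q_new[s] - Q[s]) for s in Q) <= eps:
            return Q_new
        Q = Q_new
    return Q

# On the cleaning robot of Example 2.1 with gamma = 0.5, this reproduces the
# optimal Q-values of Example 3.1 below, e.g. Q*(3, 1) = 2.5 and Q*(4, 1) = 5.
```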

Computational cost of Q-iteration

Next, we investigate the computational cost of Q-iteration when applied to an MDP with a finite number of states and actions. Denote by |·| the cardinality of the argument set, so that |X| denotes the finite number of states and |U| denotes the finite number of actions.

Assume that, when updating the Q-value for a given state-action pair (x,u), the maximization over the action space U is solved by enumeration over its |U| elements, and f(x,u) is computed once and then stored and reused. Updating the Q-value then requires 2 + |U| function evaluations, where the functions being evaluated are f, ρ, and the current Q-function Q_ℓ. Since at every iteration, the Q-values of |X| · |U| state-action pairs have to be updated, the cost per iteration is |X| |U| (2 + |U|). So, the total cost of L Q-iterations is:

$$L \, |X| \, |U| \, (2 + |U|) \qquad (3.4)$$

Example 3.1 Q-iteration for the cleaning robot. In this example, we apply Q-iteration to the cleaning-robot problem of Example 2.1. The discount factor is set to γ = 0.5. The discount factor is small compared to the recommended values in Chapter 2 because the state space is very small and thus interesting trajectories are short.

Starting from an identically zero initial Q-function, Q_0 = 0, Q-iteration produces the sequence of Q-functions given in the first part of Table 3.1 (above the dashed line), where each cell shows the Q-values of the two actions in a certain state, separated by a semicolon. For instance:

$$Q_3(2,1) = \rho(2,1) + \gamma \max_{u'} Q_2(f(2,1), u') = 0 + 0.5 \max_{u'} Q_2(3, u') = 0 + 0.5 \cdot 2.5 = 1.25$$

Table 3.1: Q-iteration results for the cleaning robot (each cell lists Q(x,−1); Q(x,1)).

       x = 0   x = 1      x = 2       x = 3        x = 4     x = 5
Q_0    0; 0    0; 0       0; 0        0; 0         0; 0      0; 0
Q_1    0; 0    1; 0       0; 0        0; 0         0; 5      0; 0
Q_2    0; 0    1; 0       0.5; 0      0; 2.5       0; 5      0; 0
Q_3    0; 0    1; 0.25    0.5; 1.25   0.25; 2.5    1.25; 5   0; 0
Q_4    0; 0    1; 0.625   0.5; 1.25   0.625; 2.5   1.25; 5   0; 0
Q_5    0; 0    1; 0.625   0.5; 1.25   0.625; 2.5   1.25; 5   0; 0
------------------------------------------------------------------
h*     *       −1         1           1            1         *
V*     0       1          1.25        2.5          5         0

The algorithm fully converges after 5 iterations; Q_5 = Q_4 = Q*. The last two rows of the table (below the dashed line) also give the optimal policies, computed from Q* with (2.5), and the optimal V-function V*, computed from Q* with (2.10). In the policy representation, the symbol * means that any action can be taken in that state without changing the quality of the policy. The total number of function evaluations required by the algorithm is:

$$5 \, |X| \, |U| \, (2 + |U|) = 5 \cdot 6 \cdot 2 \cdot 4 = 240$$

Q-iteration in the stochastic case

In the stochastic case, one must simply use a stochastic variant of the Q-iteration mapping, derived from the Bellman equation (2.17):

$$[T(Q)](x,u) = \sum_{x'} f(x,u,x') \left[ \rho(x,u,x') + \gamma \max_{u'} Q(x', u') \right] \qquad (3.5)$$

Algorithm 3.2 is obtained. The remarks above regarding the contraction property, convergence rate, and stopping criterion directly apply to this algorithm.

Algorithm 3.2 Q-iteration for the stochastic case.
Input: dynamics f, reward function ρ, discount factor γ
1: initialize Q-function, e.g., Q_0 ← 0
2: repeat at every iteration ℓ = 0, 1, 2, ...
3:   for every (x,u) do
4:     Q_{ℓ+1}(x,u) ← Σ_{x'} f(x,u,x') [ρ(x,u,x') + γ max_{u'} Q_ℓ(x', u')]
5:   end for
6: until stopping criterion is satisfied
Output: Q_{ℓ+1}
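Compared with the deterministic sketch given earlier, the only change is that the update averages over next states; a minimal sketch, assuming the f_stoch and rho_stoch functions of Example 2.3:

```python
def q_iteration_stochastic(X, U, f_stoch, rho_stoch, gamma, eps=1e-8, max_iter=1000):
    """Tabular Q-iteration for a stochastic MDP (Algorithm 3.2)."""
    Q = {(x, u): 0.0 for x in X for u in U}
    for _ in range(max_iter):
        Q_new = {}
        for x in X:
            for u in U:
                # Expectation over next states, written as the weighted sum (3.5).
                Q_new[(x, u)] = sum(
                    p * (rho_stoch(x, u, xn) + gamma * max(Q[(xn, up)] for up in U))
                    for xn, p in f_stoch(x, u).items())
        if max(abs(Q_new[s] - Q[s]) for s in Q) <= eps:
            return Q_new
        Q = Q_new
    return Q
```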

3.2 Policy iteration

We now consider policy iteration algorithms, which evaluate policies by constructing their value functions, and use these value functions to find new, improved policies. Consider that policies are evaluated using their Q-functions. Policy iteration starts with an arbitrary policy h_0. At every iteration ℓ, the Q-function Q^{h_ℓ} of the current policy h_ℓ is determined; this step is called policy evaluation. Policy evaluation is performed by solving the Bellman equation (2.7). When policy evaluation is complete, a new policy h_{ℓ+1} that is greedy in Q^{h_ℓ} is found:

$$h_{\ell+1}(x) \in \arg\max_u Q^{h_\ell}(x,u) \qquad (3.6)$$

(breaking ties among greedy actions where necessary). This step is called policy improvement. The entire procedure is summarized in Algorithm 3.3. The sequence of Q-functions produced by policy iteration converges to Q*, and at the same time, an optimal policy h* is obtained. When the number of state-action pairs is finite, convergence will happen in a finite number of iterations. This is because in that case each policy improvement is guaranteed to find a strictly better policy unless it is already optimal, and there is a finite number of possible policies. These two facts imply that the algorithm will reach the optimal policy after a finite number of improvements.

Algorithm 3.3 Policy iteration.
1: initialize policy h_0
2: repeat at every iteration ℓ = 0, 1, 2, ...
3:   find Q^{h_ℓ}, the Q-function of h_ℓ                    (policy evaluation)
4:   h_{ℓ+1}(x) ← argmax_u Q^{h_ℓ}(x,u), for all x          (policy improvement)
5: until h_{ℓ+1} = h_ℓ
Output: h* = h_ℓ, Q* = Q^{h_ℓ}

The crucial component of policy iteration is policy evaluation, while policy improvement is comparatively easy to perform. Thus, we pay special attention to policy evaluation.

An iterative policy evaluation algorithm can be given that is similar to Q-iteration. Analogously to the Q-iteration mapping T (3.1), a policy evaluation mapping $T^h : \mathcal{Q} \to \mathcal{Q}$ is defined, which computes the right-hand side of the Bellman equation for an arbitrary Q-function:

$$[T^h(Q)](x,u) = \rho(x,u) + \gamma Q(f(x,u), h(f(x,u))) \qquad (3.7)$$

The algorithm starts from some initial Q-function Q^h_0 and at each iteration τ updates the Q-function using:²

$$Q^h_{\tau+1} = T^h(Q^h_\tau) \qquad (3.8)$$

Like the Q-iteration mapping T, the policy evaluation mapping T^h is a contraction with a factor γ < 1 in the infinity norm, i.e., for any pair of functions Q and Q':

$$\|T^h(Q) - T^h(Q')\|_\infty \le \gamma \|Q - Q'\|_\infty$$

So, T^h has a unique fixed point. Written in terms of the mapping T^h, the Bellman equation (2.7) states that this unique fixed point is actually Q^h:

$$Q^h = T^h(Q^h) \qquad (3.9)$$

Therefore, iterative policy evaluation (3.8) converges to Q^h as τ → ∞. Like Q-iteration, it converges at a rate of γ: $\|Q^h_{\tau+1} - Q^h\|_\infty \le \gamma \|Q^h_\tau - Q^h\|_\infty$.

Algorithm 3.4 presents this iterative policy evaluation procedure. In practice, the algorithm can be stopped when the difference between consecutive Q-functions decreases below a given threshold: $\|Q^h_{\tau+1} - Q^h_\tau\|_\infty \le \varepsilon_{\mathrm{PE}}$, where ε_PE > 0. Here, the subscript PE stands for policy evaluation.

² A different iteration index, τ, is used for policy evaluation, because it runs in the inner loop of every policy iteration ℓ.

Algorithm 3.4 Iterative policy evaluation.
Input: policy h to be evaluated, dynamics f, reward function ρ, discount factor γ
1: initialize Q-function, e.g., Q_0(x,u) ← 0 for all x, u
2: repeat at every iteration τ = 0, 1, 2, ...
3:   for every (x,u) do
4:     Q^h_{τ+1}(x,u) ← ρ(x,u) + γ Q^h_τ(f(x,u), h(f(x,u)))
5:   end for
6: until stopping criterion satisfied
Output: Q^h_{τ+1}

There are also other ways to compute Q^h, for example, by using the fact that the Bellman equation (3.9) can be written as a linear system of equations in the Q-values Q^h(x,u), and solving this system directly.
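To illustrate this, the sketch below performs policy evaluation by solving the linear Bellman system directly (one equation per state-action pair) and wraps it in the improvement loop of Algorithm 3.3. It is a sketch under the same assumptions as before (a finite deterministic MDP given as functions f and rho); NumPy is used for the linear solve, and all names are illustrative:

```python
import numpy as np

def evaluate_policy_exact(X, U, f, rho, h, gamma):
    """Solve the linear Bellman system (3.9) for Q^h directly:
    Q^h(x,u) - gamma * Q^h(f(x,u), h(f(x,u))) = rho(x,u)."""
    pairs = [(x, u) for x in X for u in U]
    idx = {p: i for i, p in enumerate(pairs)}
    A, b = np.eye(len(pairs)), np.zeros(len(pairs))
    for (x, u), i in idx.items():
        xn = f(x, u)
        A[i, idx[(xn, h[xn])]] -= gamma    # move the next-state term to the left
        b[i] = rho(x, u)
    q = np.linalg.solve(A, b)
    return {p: q[i] for p, i in idx.items()}

def policy_iteration(X, U, f, rho, gamma):
    """Policy iteration (Algorithm 3.3) with exact policy evaluation."""
    h = {x: U[0] for x in X}               # arbitrary initial policy
    while True:
        Q = evaluate_policy_exact(X, U, f, rho, h, gamma)
        h_new = {x: max(U, key=lambda u: Q[(x, u)]) for x in X}   # improvement
        if h_new == h:
            return h, Q
        h = h_new
```

With the cleaning-robot functions of Example 2.1 and γ = 0.5, the initial policy is "always left" (since U[0] = −1), and the sketch passes through the same sequence of improved policies as Example 3.2 below, up to the arbitrary action chosen in the terminal states.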

More generally, the linearity of the Bellman equation for Q^h makes policy evaluation easier than Q-iteration (the Bellman optimality equation (2.8) is highly nonlinear, due to the maximization at the right-hand side). Moreover, in practice, policy iteration often converges in a smaller number of iterations than value iteration. However, this does not mean that policy iteration is computationally less costly than value iteration, because every single policy iteration requires a complete policy evaluation.

    Computational cost of iterative policy evaluation

We next investigate the computational cost of iterative policy evaluation for an MDP with a finite number of states and actions. The computational cost of one iteration of Algorithm 3.4, measured by the number of function evaluations, is:

$$4 \, |X| \, |U|$$

where the functions being evaluated are ρ, f, h, and the current Q-function Q^h_τ. The total cost of an entire policy evaluation consisting of T iterations is T · 4 |X| |U|.

Recall that a single Q-iteration requires |X| |U| (2 + |U|) function evaluations, and is therefore more computationally expensive than an iteration of policy evaluation whenever |U| > 2.

Example 3.2 Policy iteration for the cleaning robot. In this example, we apply policy iteration to the cleaning-robot problem. Recall that every single policy iteration requires a complete execution of policy evaluation for the current policy, together with a policy improvement. The iterative policy evaluation Algorithm 3.4 is employed, starting from identically zero Q-functions. Each policy evaluation is run until the Q-function fully converges. The same discount factor is used as for Q-iteration in Example 3.1, namely γ = 0.5.

Starting from a policy that always moves left (h_0(x) = −1 for all x), policy iteration produces the sequence of Q-functions and policies given in Table 3.2. In this table, the sequence of Q-functions produced by a given execution of policy evaluation is separated by dashed lines from the policy being evaluated (shown above the sequence of Q-functions) and from the improved policy (shown below the sequence). The policy iteration algorithm converges after 4 iterations (h_3 is already optimal).

Table 3.2: Policy iteration results for the cleaning robot. Q-values are rounded to 3 decimal places; each cell lists Q(x,−1); Q(x,1).

        x = 0   x = 1      x = 2        x = 3          x = 4      x = 5
Q_0     0; 0    0; 0       0; 0         0; 0           0; 0       0; 0
Q_1     0; 0    1; 0       0; 0         0; 0           0; 5       0; 0
Q_2     0; 0    1; 0       0.5; 0       0; 0           0; 5       0; 0
Q_3     0; 0    1; 0.25    0.5; 0       0.25; 0        0; 5       0; 0
Q_4     0; 0    1; 0.25    0.5; 0.125   0.25; 0        0.125; 5   0; 0
Q_5     0; 0    1; 0.25    0.5; 0.125   0.25; 0.0625   0.125; 5   0; 0
Q_6     0; 0    1; 0.25    0.5; 0.125   0.25; 0.0625   0.125; 5   0; 0
---------------------------------------------------------------------
h_1     *       −1         −1           −1             1          *
---------------------------------------------------------------------
Q_0     0; 0    0; 0       0; 0         0; 0           0; 0       0; 0
Q_1     0; 0    1; 0       0; 0         0; 0           0; 5       0; 0
Q_2     0; 0    1; 0       0.5; 0       0; 2.5         0; 5       0; 0
Q_3     0; 0    1; 0.25    0.5; 0       0.25; 2.5      0; 5       0; 0
Q_4     0; 0    1; 0.25    0.5; 0.125   0.25; 2.5      0.125; 5   0; 0
Q_5     0; 0    1; 0.25    0.5; 0.125   0.25; 2.5      0.125; 5   0; 0
---------------------------------------------------------------------
h_2     *       −1         −1           1              1          *
---------------------------------------------------------------------
Q_0     0; 0    0; 0       0; 0         0; 0           0; 0       0; 0
Q_1     0; 0    1; 0       0; 0         0; 0           0; 5       0; 0
Q_2     0; 0    1; 0       0.5; 0       0; 2.5         0; 5       0; 0
Q_3     0; 0    1; 0.25    0.5; 1.25    0.25; 2.5      1.25; 5    0; 0
Q_4     0; 0    1; 0.25    0.5; 1.25    0.25; 2.5      1.25; 5    0; 0
---------------------------------------------------------------------
h_3     *       −1         1            1              1          *
---------------------------------------------------------------------
Q_0     0; 0    0; 0       0; 0         0; 0           0; 0       0; 0
Q_1     0; 0    1; 0       0; 0         0; 0           0; 5       0; 0
Q_2     0; 0    1; 0       0.5; 0       0; 2.5         0; 5       0; 0
Q_3     0; 0    1; 0       0.5; 1.25    0; 2.5         1.25; 5    0; 0
Q_4     0; 0    1; 0.625   0.5; 1.25    0.625; 2.5     1.25; 5    0; 0
Q_5     0; 0    1; 0.625   0.5; 1.25    0.625; 2.5     1.25; 5    0; 0
---------------------------------------------------------------------
h_4     *       −1         1            1              1          *

    Policy evaluation takes, respectively, a number T of 6, 5, 4, and 5 iterations for the four evaluated policies. Recall that the computational cost of an entire policy evaluation is T · 4 |X| |U|.


    Assuming that the maximization over U in the policy improvement is solved by enumeration, the computational cost of every policy improvement is |X| |U|. Thus, the entire policy iteration algorithm has a cost of:

    (6 + 5 + 4 + 5) · 4 |X| |U| + 4 · |X| |U| = 84 |X| |U| = 84 · 6 · 2 = 1008

    In the first expression, the first term corresponds to policy evaluations, and the second to policy improvements. Compared to the cost 240 of Q-iteration in Example 3.1, policy iteration is in this case more computationally expensive.

    Policy iteration in the stochastic case

    Only the policy evaluation component must be changed in the stochastic case. Policy improvement remains the same, because the expression of the greedy policy is also unchanged, see Section 2.4.

    Similarly to the case of Q-iteration, a stochastic variant of the policy evaluation must be used, derived from the Bellman equation (2.16):

    [T^h(Q)](x,u) = Σ_{x'} f(x,u,x') [ρ(x,u,x') + γ Q(x', h(x'))]   (3.10)

    Algorithm 3.5 shows the resulting policy evaluation method.

    Algorithm 3.5 Iterative policy evaluation for the stochastic case.
    Input: policy h to be evaluated, dynamics f, reward function ρ, discount factor γ
    1: initialize Q-function, e.g., Q^h_0 ← 0
    2: repeat at every iteration ℓ = 0, 1, 2, ...
    3:   for every (x,u) do
    4:     Q^h_{ℓ+1}(x,u) ← Σ_{x'} f(x,u,x') [ρ(x,u,x') + γ Q^h_ℓ(x', h(x'))]
    5:   end for
    6: until stopping criterion satisfied
    Output: Q^h_{ℓ+1}
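    As a sketch only, the update of Algorithm 3.5 can be vectorized when the stochastic model is given as arrays; the array names f (transition probabilities indexed by x, u, x') and rho (rewards) below are our own conventions, not fixed by the text.

```python
import numpy as np

def iterative_policy_evaluation(f, rho, h, gamma, n_states, n_actions,
                                tol=1e-8, max_iters=1000):
    """Iterative policy evaluation for a finite stochastic MDP (sketch of Algorithm 3.5).

    f[x, u, x'] are transition probabilities, rho[x, u, x'] the rewards, and h[x]
    is an integer array with the action taken by the evaluated policy.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        # Q_{l+1}(x,u) = sum_x' f(x,u,x') [rho(x,u,x') + gamma * Q_l(x', h(x'))]
        Q_next = Q[np.arange(n_states), h]                    # Q_l(x', h(x')) for every x'
        Q_new = np.einsum('xuy,xuy->xu', f, rho + gamma * Q_next)
        if np.max(np.abs(Q_new - Q)) < tol:                   # simple stopping criterion
            return Q_new
        Q = Q_new
    return Q
```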

    3.3 Policy search

    Policy search algorithms use optimization techniques to directly search for a (near-)optimal policy. Since the optimal policy maximizes the return from every initial state, the optimization criterion should be a combination (e.g., average) of the returns from every initial state. Note that gradient-based optimization methods will typically not be applicable because the optimization criterion is not differentiable, and there may exist local optima. Instead, more general (global, gradient-free) methods are required, such as genetic algorithms.


    Consider the return estimation procedure. While the returns are infinite sums of discounted rewards (2.1), in practice they have to be estimated in a finite number of K steps. This means that the infinite sum in the return is approximated with a finite sum over the first K steps. To guarantee that the approximation obtained in this way makes an error of at most ε_MC > 0, K can be chosen with:

    K = ⌈ log_γ ( ε_MC (1 − γ) / ‖ρ‖_∞ ) ⌉   (3.11)
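    As a quick numerical check of (3.11), the horizon can be computed directly; this small sketch assumes the bound comes from truncating the discounted sum, with the maximum absolute reward ‖ρ‖_∞ passed in explicitly.

```python
import math

def estimation_horizon(eps_mc, gamma, max_abs_reward):
    # Smallest K with gamma^K * max_abs_reward / (1 - gamma) <= eps_mc, cf. (3.11)
    return math.ceil(math.log(eps_mc * (1 - gamma) / max_abs_reward, gamma))

print(estimation_horizon(0.01, 0.5, 5))  # -> 10, matching Example 3.3 below
```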

    The return estimation has to be done for every initial state, for every policy that must be evaluated, which means that policy search algorithms will be computationally expensive, usually more so than value iteration and policy iteration.

    Computational cost of exhaustive policy search

    We next investigate the computational cost of a policy search algorithm for an MDP with a finite number of states and actions. For simplicity, we consider an algorithm that exhaustively searches the entire policy space.

    The number of possible policies is |U|^|X| and the return has to be evaluated for all the |X| initial states, by using a K-step trajectory. It follows that the total number of simulation steps that have to be performed to find an optimal policy is at most K |U|^|X| |X|. Since f, ρ, and h are each evaluated once at every simulation step, the computational cost, measured by the number of function evaluations, is:

    3 K |U|^|X| |X|

    Compared e.g. to the cost L |X| |U| (2 + |U|) of Q-iteration (3.4), this implementation of policy search is, in most cases, clearly more costly.

    Of course, more efficient optimization techniques than exhaustive search are available, and the estimation of the expected returns can also be accelerated. For instance, after the return of a state has been estimated, this estimate can be reused at every occurrence of that state along subsequent trajectories, thereby reducing the computational cost. Nevertheless, the costs derived above can be seen as worst-case values that illustrate the inherently large complexity of policy search.
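    For illustration only, an exhaustive search over the |U|^|X| policies of a small deterministic MDP could look like the sketch below; f(x,u) and rho(x,u) are assumed model functions, and the truncated K-step return plays the role of the horizon chosen with (3.11).

```python
import itertools

def exhaustive_policy_search(f, rho, gamma, states, actions, K):
    """Exhaustive policy search for a small deterministic MDP (sketch).

    Enumerates all |U|^|X| policies, estimates each policy's return from every
    initial state over K steps, and keeps the policy with the best summed return.
    """
    best_policy, best_score = None, -float('inf')
    for assignment in itertools.product(actions, repeat=len(states)):
        policy = dict(zip(states, assignment))
        total = 0.0
        for x0 in states:                      # return estimated from every initial state
            x, ret = x0, 0.0
            for k in range(K):                 # truncated, K-step return
                u = policy[x]
                ret += gamma ** k * rho(x, u)
                x = f(x, u)
            total += ret
        if total > best_score:                 # criterion: combination of returns
            best_score, best_policy = total, policy
    return best_policy
```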

    Example 3.3 Exhaustive policy search for the cleaning robot. Consider again the cleaning-robot problem and assume that the exhaustive policy search described above is applied. Take the approximation tolerance in the evaluation of the return to be ε_MC = 0.01. Using ε_MC, maximum absolute reward ‖ρ‖_∞ = 5, and discount factor γ = 0.5 in (3.11), a time horizon of K = 10 steps is obtained. Therefore, the computational cost of the algorithm, measured by the number of function evaluations, is:

    3 K |U|^|X| |X| = 3 · 10 · 2^6 · 6 = 11520


    (Note that, in fact, the cost will be smaller as many trajectories will reach a terminal state in under 10 steps, and additionally it is useless to search for optimal actions or estimate returns in the terminal states.)

    For the cleaning-robot problem, the exhaustive implementation of direct policy search is very likely to be more expensive than both Q-iteration (which had a cost of 240) and policy iteration (which had a cost of 1008).

    Optimistic planning

    In what follows, we introduce an algorithm for the local search of control actions, at a specific state of the system. By applying this search algorithm at every state encountered while interacting with the system, an online control method is obtained. Although the algorithm is classified under policy search, it exploits the ideas of dynamic programming to efficiently search the space of possible actions. An ingenious combination of dynamic programming and policy search is obtained. We only consider the deterministic case here, although stochastic variants of the algorithm exist.

    The algorithm is called optimistic planning (OP). It builds a planning (lookahead) tree starting from a root node that contains the state where an action must be chosen. At each iteration, the algorithm selects a leaf node (a state) and expands it, by generating the next states for all the possible actions. Denote by M = |U| the number of discrete actions; thus M new nodes are created at each expansion. The algorithm stops growing the tree after n expansions and returns an action chosen on the basis of the final tree.

    Figure 3.1: An example of OP tree. In this example there are only two actions, denoted a and b, and 3 expansions have been performed so far. Each inner node has two children, one for a and one for b, and the arcs to these children also store the actions and rewards associated with the transitions; such a relationship has been exemplified for node x. Values V are indicated for two nodes.

    The following notation and conventions are necessary, see also Figure 3.1:

    The entire tree is denoted by T, and the set of leaf (unexpanded) nodes by L. The inner nodes are T \ L.


    A node of the tree is labeled by its associated state x. Note that several nodes may have the same state label. These nodes will in fact be distinct; while keeping that in mind, we denote for simplicity a node directly by its associated state x. A child node, generically denoted x', has the meaning of next state; the child of x corresponding to action u is f(x,u). The actions and rewards associated with the transitions from parent to child nodes are stored on the tree.

    Because everything happens at the current time step, the time index k is dropped and the subscript of x is reused to indicate the depth d of a node in the tree, whenever this depth is relevant. So, x_0 is the root node, where an action must eventually be chosen, and x_d is a node at depth d.

    As a first step to develop an understanding of the algorithm, some fixed tree will be considered, such as the one in Figure 3.1, and the procedure to choose a final action at x_0 will be explained. For each x ∈ T, define the values V(x) recursively, starting from the leaf nodes, as follows:

    V(x) = 0,   for x ∈ L
    V(x) = max_u [ρ(x,u) + γ V(f(x,u))],   for x ∈ T \ L   (3.12)

    Note that children nodes x' = f(x,u) exist for all inner nodes x and actions u, by the way in which the tree is built, so the values V can indeed be computed using the information on the tree. Then, an action is chosen at the root with:

    u_0 ∈ argmax_u [ρ(x_0,u) + γ V(f(x_0,u))]   (3.13)

    It is no mistake that the notation V has been used for the values; they have the same meaning as the V-values of the states in the tree. To understand this, say that the optimal V-function V* is available. If the leaf node values are initialized using V* instead of 0, then a single node expansion suffices to find optimal actions, because then (3.13) exactly implements the greedy policy in V*, see (2.11):

    u_0 = h*(x_0) ∈ argmax_u [ρ(x_0,u) + γ V*(f(x_0,u))]

    This situation is illustrated in Figure 3.2, top.

    The optimal V-function V* is of course not available, so let us look at 0 initialization, but in a uniform tree up to some very large depth D, as illustrated in Figure 3.2, bottom. Then (3.12) actually implements a local form of value iteration, in which the Bellman equation (2.13):

    V*(x) = max_u [ρ(x,u) + γ V*(f(x,u))]

    is turned into an update starting from the 0 leaf values, and continuing backwards along the tree down to the root. By similar convergence arguments as for Q-iteration, as D → ∞, the values of states at depth 1 (and also, in fact, at any finite depth) converge to the optimal V-values, and (3.13) yet again recovers the optimal action.

    Figure 3.2: Two extreme cases of the OP tree. Top: tree of depth 1, but where leaf values are initialized using V*. Bottom: uniform tree of depth D, for very large D. Like in the previous figure, there are only two actions. Unlike there, however, node depth has been indicated using subscripts.

    When the depth D is finite, the updates (3.12) actually compute an approximation of V*; in fact, the tree does not need to be uniform, and a valid approximation will be computed with a tree of any shape, such as the one in Figure 3.1. The idea of OP is to judiciously choose which nodes to expand so that V(x) is as close as possible to V*(x) after a limited number of expansions.

    To this end, an optimistic procedure is applied to select which leaf node to expand at each iteration of the algorithm. First, translate and rescale the reward function so that all possible rewards are now in [0,1] (this can always be done without changing the solution when there are no terminal states; if there are terminal states care must be taken). For each x ∈ T, define b-values b(x) in a similar way to the V-values, but starting with 1/(1−γ) at the leaves:

    b(x) = 1/(1−γ),   for x ∈ L
    b(x) = max_u [ρ(x,u) + γ b(f(x,u))],   for x ∈ T \ L   (3.14)

    Each b-value b(x) is an upper bound on the optimal value V*(x), i.e., b(x) ≥ V*(x) for all x ∈ T.


    This is immediately clear at the leaves, where the value 1/(1−γ) is an upper bound for any V-value, because the rewards are in [0,1] and a discount factor γ < 1 is used. By backward induction, it is true at any inner node: since the b-values of the node's children are upper bounds on their optimal values, the resulting b-value of the node is an upper bound on its own optimal value.

    To select a node to expand, the tree is traversed starting from the root and always following optimistic actions, i.e., those greedy in the b-values:

    x_{d+1} = f(x_d, u†(x_d)),   where u†(x) ∈ argmax_u [ρ(x,u) + γ b(f(x,u))]   (3.15)

    The leaf reached using this procedure is the one chosen for expansion. Ties in the maximization can be broken in any way, but to make the algorithm predictable, they should preferably be broken deterministically, e.g., always in favor of the first child. This procedure is optimistic because it uses b-values (upper bounds) as if they were optimal Q-values; said another way, it assumes the best possible optimal values compatible with the planning tree available so far.

    Analysis of the algorithm shows that choosing nodes in this way pays off: the quality of the action chosen at the root improves quickly with the number of node expansions, at a rate that takes the complexity of the problem into account in a specific way.

    Algorithm 3.6 summarizes the OP algorithm, placing it in an online control loop.

    Algorithm 3.6 Optimistic planning for online control
    Input: dynamics f, reward function ρ, discount factor γ, number of allowed expansions n
    1: for every time step k = 0, 1, 2, ... do
    2:   measure x_k, relabel time so that the current step is 0
    3:   initialize tree: T ← {x_0}
    4:   for ℓ = 1, ..., n do
    5:     find the optimistic leaf x, navigating the tree with (3.15)
    6:     for j = 1, ..., M do
    7:       simulate transition: compute f(x, u_j), ρ(x, u_j)
    8:       add the corresponding node to the tree as a child of x
    9:     end for
    10:    update b-values upwards in the tree with (3.14)
    11:  end for
    12:  compute V-values with (3.12)
    13:  apply u_0, chosen with (3.13)
    14: end for
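    The following Python sketch mirrors the structure of Algorithm 3.6 for a single planning call at a root state x0. It recomputes b- and V-values recursively instead of updating them incrementally, which is simpler but less efficient, and it assumes rewards have already been rescaled to [0,1]; the class and function names are our own.

```python
class Node:
    """Planning-tree node for optimistic planning (deterministic case, sketch)."""
    def __init__(self, x):
        self.x = x
        self.children = {}                            # action u -> (reward, child Node)

def op_action(x0, f, rho, gamma, actions, n_expansions):
    """Choose an action at state x0 with optimistic planning (sketch of Algorithm 3.6)."""
    b_leaf = 1.0 / (1.0 - gamma)                      # leaf b-value, an upper bound

    def b_value(node):                                # b-values, cf. (3.14)
        if not node.children:
            return b_leaf
        return max(r + gamma * b_value(c) for r, c in node.children.values())

    def v_value(node):                                # V-values, cf. (3.12)
        if not node.children:
            return 0.0
        return max(r + gamma * v_value(c) for r, c in node.children.values())

    root = Node(x0)
    for _ in range(n_expansions):
        node = root
        while node.children:                          # optimistic traversal, cf. (3.15)
            node = max(node.children.values(),
                       key=lambda rc: rc[0] + gamma * b_value(rc[1]))[1]
        for u in actions:                             # expand the selected leaf
            node.children[u] = (rho(node.x, u), Node(f(node.x, u)))

    # greedy action at the root, cf. (3.13)
    return max(root.children,
               key=lambda u: root.children[u][0] + gamma * v_value(root.children[u][1]))
```

    Embedding this in the online loop of Algorithm 3.6 simply means calling op_action at every measured state and applying the returned action.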


    Chapter 4

    Monte Carlo methods

    When a mathematical model of the system is not available, DP methods are not applicable, and we must resort to RL algorithms instead. We start by presenting Monte Carlo algorithms.

    4.1 Monte Carlo policy iteration

    MC methods are easy to understand initially as a trajectory-based variant of policy iteration. For simplicity, we assume that the MDP is episodic, and that every trajectory (trial) eventually ends in a terminal state, so the i-th trial has the form:

    x_{i,0}, u_{i,0}, r_{i,1}, x_{i,1}, u_{i,1}, r_{i,2}, x_{i,2}, ..., r_{i,K_i}, x_{i,K_i}

    with x_{i,K_i} terminal. If all the actions u_{i,k} for k > 0 are chosen by the policy h, i.e., if u_{i,k} = h(x_{i,k}) for k > 0, then by definition the returns obtained along this trial, for all the state-action pairs that occur along it, are in fact the Q-values of these pairs under h:

    Q^h(x_{i,k}, u_{i,k}) = Σ_{l=k}^{K_i − 1} γ^{l−k} r_{i,l+1}

    for k ≥ 0. Note that the first action does not have to be taken with the policy h, but can be chosen freely. Once we have enough trajectories to find the Q-values of all the state-action pairs, we can perform a policy improvement with (3.6), and continue to evaluate the new policy using the MC technique. Overall, we obtain a model-free, RL policy iteration algorithm.

    To give the general form of MC, which also works for stochastic MDPs, we must briefly consider the stochastic case. In fact, for stochastic MDPs the assumption that every trajectory ends up in a terminal state with some positive probability is easier to justify than in the deterministic case, where it is often easy to construct policies that loop over a never-ending cycle of states (just consider the cleaning robot and a policy that assigns h(2) = 1, h(3) = -1).


    In the stochastic case, a single trial does not suffice to find the Q-value of the state-action pairs occurring along it, but does provide a sample of this Q-value. The Q-value is estimated as an average of many such samples:

    Q^h(x,u) ≈ A(x,u) / C(x,u)

    where:

    A(x,u) = Σ_{(i,k) s.t. x_{i,k}=x, u_{i,k}=u}  Σ_{l=k}^{K_i − 1} γ^{l−k} r_{i,l+1}

    and C(x,u) counts how many times the pair (x,u) occurs:

    C(x,u) = Σ_{(i,k) s.t. x_{i,k}=x, u_{i,k}=u} 1

    The name Monte Carlo is more generally applied to this type of sample-based estimation for any random quantity, not necessarily returns. If a state-action pair occurs twice along the same trial, the formulas above take it into account both times; such an MC method is called every-visit (since we take the pair into account every time we visit it). A first-visit alternative, which takes a pair into account only the first time it occurs in each trial, can also be given. First-visit and every-visit MC have slightly different theoretical properties, but they both obtain the correct Q-values in the limit, as the number of visits approaches infinity.

    In practice, we can of course only use a finite number of trials to estimate the Q-function of a policy; denote this number by N_MC. Algorithm 4.1 summarizes this practical variant of MC learning. The trajectories could be obtained by interacting with the real system, or with a simulation model of it.

    Note that an idealized, infinite-time setting is considered for this algorithm, in which no termination condition is specified and no explicit output is produced. Instead, the result of the algorithm is the improvement of the control performance achieved while interacting with the process. A similar setting will be considered for other online learning algorithms described in this book, with the implicit understanding that, in practice, the algorithms will of course be stopped after a finite number of steps. When MC policy iteration is stopped, the resulting policy and Q-function can be interpreted as outputs and reused.

    4.2 The need for exploration

    Until now, we have tacitly assumed that all the state-action pairs are visited a sufficient number of times to estimate their Q-values. This requires two things: (i) that each state is sufficiently visited and (ii) that for any given state, each action is sufficiently visited.


    Algorithm 4.1 Monte Carlo policy iteration.
    Input: discount factor γ, algorithm type (first-visit or every-visit)
    1: initialize policy h_0
    2: for every iteration ℓ = 0, 1, 2, ... do
    3:   A(x,u) ← 0, C(x,u) ← 0, ∀x,u   ⊳ start policy evaluation
    4:   for i = 1, ..., N_MC do
    5:     initialize x_{i,0}, u_{i,0}
    6:     execute trial x_{i,0}, u_{i,0}, r_{i,1}, x_{i,1}, u_{i,1}, r_{i,2}, x_{i,2}, ..., r_{i,K_i}, x_{i,K_i}, using policy h_ℓ for every k > 0
    7:     for k = 0, ..., K_i − 1 do
    8:       if first-visit and (x_{i,k}, u_{i,k}) already encountered in trial i then
    9:         ignore this pair and continue to next k
    10:      else
    11:        A(x_{i,k}, u_{i,k}) ← A(x_{i,k}, u_{i,k}) + Σ_{l=k}^{K_i − 1} γ^{l−k} r_{i,l+1}
    12:        C(x_{i,k}, u_{i,k}) ← C(x_{i,k}, u_{i,k}) + 1
    13:      end if
    14:    end for
    15:    Q(x,u) ← A(x,u)/C(x,u), ∀x,u   ⊳ complete policy evaluation
    16:  end for
    17:  h_{ℓ+1}(x) ← argmax_u Q(x,u), ∀x   ⊳ policy improvement
    18: end for
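    The core accumulation step of Algorithm 4.1 (lines 7-14) can be sketched as follows; the trial is assumed to be stored as a list of (x_k, u_k, r_{k+1}) triples ending just before the terminal state, and A, C are dictionaries, both of which are representation choices of ours.

```python
from collections import defaultdict

def mc_update_from_trial(trial, gamma, A, C, first_visit=True):
    """Accumulate Monte Carlo return samples from one trial."""
    seen = set()
    for k, (x, u, _) in enumerate(trial):
        if first_visit and (x, u) in seen:
            continue                                   # first-visit: count each pair once per trial
        seen.add((x, u))
        # return from step k: sum_{l=k}^{K-1} gamma^(l-k) * r_{l+1}
        G = sum(gamma ** (l - k) * trial[l][2] for l in range(k, len(trial)))
        A[(x, u)] += G
        C[(x, u)] += 1

A, C = defaultdict(float), defaultdict(int)
mc_update_from_trial([(2, 1, 0.0), (3, 1, 0.0), (4, 1, 5.0)], 0.5, A, C)
# A[(2, 1)] is now 1.25: the discounted return 0 + 0.5*0 + 0.25*5 observed from (x=2, u=1).
# After N_MC trials, Q[(x, u)] = A[(x, u)] / C[(x, u)] and the policy is improved greedily.
```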

    Assuming that the selection procedure for the initial states of the trajectories ensures (i) (e.g., random states are selected from the entire state space), we focus on (ii) in this section.

    If actions were always chosen according to the policy h that is being evaluated, only pairs of the form (x, h(x)) would be visited, and no information about pairs (x,u) with u ≠ h(x) would be available. As a result, the Q-values of such pairs could not be estimated and relied upon for policy improvement. To alleviate this problem, exploration is necessary: sometimes, actions different from h(x) have to be selected.

    In Algorithm 4.1, the only opportunity to do this is for the first action of each trial. This action could, e.g., be chosen uniformly randomly from U. Together with (i), this ensures that the initial state-action pairs sufficiently cover the state-action space; this is called exploring starts in the literature. With exploring starts and an ideal version of MC policy evaluation that employs an infinite number N_MC of trajectories, the Q-function accurately evaluates the policy and thus MC policy iteration converges to h*.

    An alternative to exploring starts is to use an exploratory policy throughout the trajectories, that is, a policy that sometimes takes exploratory actions instead of greedy ones, at the initial as well as subsequent steps. This is also done using random action selections, so that the algorithm has a nonzero probability of selecting any action in every encountered state. A classical exploratory policy is the ε-greedy policy,


    which selects actions according to:

    u = { u ∈ argmax_ū Q(x,ū)              with probability 1 − ε
        { a uniformly random action in U   with probability ε          (4.1)
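    A minimal sketch of ε-greedy action selection (4.1), assuming the Q-function is stored as a dictionary keyed by (x, u):

```python
import random

def epsilon_greedy(Q, x, actions, epsilon):
    """Select an action according to (4.1)."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda u: Q.get((x, u), 0.0))    # exploit: greedy in Q
```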

    where ε ∈ (0,1) is the exploration probability. Another option is to use Boltzmann exploration, which selects an action u with probabilities dependent on the Q-values, as follows:

    P(u | x) = e^{Q(x,u)/τ} / Σ_{ū} e^{Q(x,ū)/τ}   (4.2)

    where the temperature τ ≥ 0 controls the randomness of the exploration. When τ → 0, (4.2) is equivalent to greedy action selection, while for τ → ∞, action selection is uniformly random. For nonzero, finite values of τ, higher-valued actions have a greater chance of being selected than lower-valued ones.
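    Boltzmann exploration (4.2) can be sketched in the same style; subtracting the maximum preference before exponentiating is a standard numerical-stability trick, not something required by the formula.

```python
import math
import random

def boltzmann(Q, x, actions, tau):
    """Select an action according to (4.2) with temperature tau > 0."""
    prefs = [Q.get((x, u), 0.0) / tau for u in actions]
    m = max(prefs)                                   # shift for numerical stability
    weights = [math.exp(p - m) for p in prefs]       # proportional to e^(Q(x,u)/tau)
    return random.choices(actions, weights=weights, k=1)[0]
```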

    In the extreme case, fully random actions could be applied at every step. However, this is not desirable when interacting with a real system, as it can be damaging and will lead to poor control performance. Instead, the algorithm also has to exploit its current knowledge in order to obtain good performance, by selecting greedy actions in the current Q-function. This is a typical illustration of the exploration-exploitation trade-off in online RL.

    Classically, this trade-off is resolved by diminishing the exploration over time, so that the policy gets closer and closer to the greedy policy, and thus (as the algorithm hopefully also converges to the optimal solution) to the optimal policy. This can be achieved by reducing ε or τ close to 0 over time, e.g., as the iteration number grows.

    4.3 Optimistic Monte Carlo learning

    Algorithm 4.1 only improves the policy once every N_MC trajectories. In-between, the policy remains unchanged, and possibly performs badly, for long periods of time (we say that the algorithm learns slowly). To avoid this, policy improvements could be performed more often, before an accurate evaluation of the current policy can be completed. In particular, we consider the case when they are performed after each trial. In the context of policy iteration, such methods are called optimistic.

    Algorithm 4.2 shows this variant of MC learning, where ε-greedy exploration is used. A core difference from Algorithm 4.1 is that we can no longer afford to reset A and C, as there is not enough information in a single trial to rebuild them. Instead, we just keep updating them, disregarding the fact that the policy is changing. Furthermore, several notation simplifications have been made: a policy is no longer explicitly considered, but the policy improvements are implicit in the greedy action choices; the trial index is now identical to the iteration index ℓ, since the two have the same meaning; and the iteration index has been dropped from the Q-function, to reflect its nature of a continually evolving learned object.


    Algorithm 4.2 Optimistic Monte Carlo.
    Input: discount factor γ, algorithm type (first-visit or every-visit)
    1: A(x,u) ← 0, C(x,u) ← 0, ∀x,u
    2: initialize Q-function, e.g., Q(x,u) ← 0, ∀x,u
    3: for every iteration ℓ = 0, 1, 2, ... do   ⊳ start policy evaluation
    4:   execute trial x_{ℓ,0}, u_{ℓ,0}, r_{ℓ,1}, x_{ℓ,1}, u_{ℓ,1}, r_{ℓ,2}, x_{ℓ,2}, ..., r_{ℓ,K_ℓ}, x_{ℓ,K_ℓ},
         where u_{ℓ,k} = { u ∈ argmax_ū Q(x_{ℓ,k}, ū)      with probability 1 − ε
                         { a uniformly random action in U  with probability ε
    5:   for k = 0, ..., K_ℓ − 1 do
    6:     if first-visit and (x_{ℓ,k}, u_{ℓ,k}) already encountered in trial ℓ then
    7:       ignore this pair and continue to next k
    8:     else
    9:       A(x_{ℓ,k}, u_{ℓ,k}) ← A(x_{ℓ,k}, u_{ℓ,k}) + Σ_{l=k}^{K_ℓ − 1} γ^{l−k} r_{ℓ,l+1}
    10:      C(x_{ℓ,k}, u_{ℓ,k}) ← C(x_{ℓ,k}, u_{ℓ,k}) + 1
    11:    end if
    12:  end for
    13:  Q(x,u) ← A(x,u)/C(x,u), ∀x,u
    14: end for

    In closing, we note that the assumption of terminal states can be removed by estimating infinite-horizon returns from finitely long trajectories, as in policy search, see Section 3.3 and Equation (3.11). Conversely, MC methods can be used to estimate the quality of policies in policy search methods.

    Example 4.1 Monte Carlo policy iteration for the cleaning robot. As an example of Monte Carlo learning, we apply Algorithm 4.1 to the cleaning-robot problem. The discount factor is the same as before, γ = 0.5. We use exploring starts, where both the initial state and the initial action are chosen fully randomly among the possible states and actions. A number N_MC = 10 of trials is used to evaluate each policy, and the algorithm is allowed to run for a total number of 100 trials, corresponding to 10 policy improvements.

    Starting from a policy that always moves left (h0(x) = -1 for all x), a representative run of MC policy iteration produces the sequence of Q-functions and policies given in Table 4.1. (A representative run must be chosen because the results of the algorithm depend on the random state and action choices, so they will change on every run.) Note that each policy h_ℓ in the table, for ℓ > 0, is greedy in the Q-function Q_{ℓ−1}. The algorithm finds the optimal policy for the first time at iteration 7. Then, however, the action for x = 1 changes to the suboptimal value of 1 (going right), before reverting back to the optimal. This behavior is less predictable than that of classical policy iteration (Algorithm 3.3), because of the approximate nature of Monte Carlo policy evaluation.


    Table 4.1: Monte Carlo policy iteration results for the cleaning robot. Each cell lists Q(x,-1); Q(x,1).

           x = 0   x = 1       x = 2         x = 3          x = 4      x = 5
    h0     *       -1          -1            -1             -1         *
    Q0     0; 0    1; 0.25     0.5; 0        0.25; 0        0.125; 0   0; 0
    h1     *       -1          -1            -1             -1         *
    Q1     0; 0    1; 0        0.5; 0.125    0.25; 0.0625   0.125; 0   0; 0
    h2     *       -1          -1            -1             -1         *
    Q2     0; 0    1; 0        0.5; 0.125    0.25; 0.0625   0.125; 0   0; 0
    h3     *       -1          -1            -1             -1         *
    Q3     0; 0    1; 0.25     0.5; 0        0.25; 0.0625   0.125; 5   0; 0
    h4     *       -1          -1            -1             1          *
    Q4     0; 0    1; 0.25     0.5; 0.125    0.25; 2.01     1.02; 5    0; 0
    h5     *       -1          -1            1              1          *
    Q5     0; 0    1; 0        0.5; 0.875    0.5; 2.5       0; 5       0; 0
    h6     *       -1          1             1              1          *
    Q6     0; 0    1; 0.4      0.425; 1.25   0.625; 2.5     1.25; 5    0; 0
    h7     *       -1          1             1              1          *
    Q7     0; 0    0; 0.625    0.313; 1.25   0.625; 2.5     1.25; 5    0; 0
    h8     *       1           1             1              1          *
    Q8     0; 0    1; 0.625    0.313; 1.25   0.625; 2.5     0; 5       0; 0
    h9     *       -1          1             1              1          *
    Q9     0; 0    1; 0.438    0.406; 1.25   0.625; 2.5     1.25; 5    0; 0
    h10    *       -1          1             1              1          *

    Chapter 5

    Temporal difference methods

    Temporal difference methods can be intuitively understood as a combination of DP, in particular value and policy iteration, and MC methods. Like MC, they learn from trajectories instead of using a model. However, unlike MC, they update the solution in a sample-by-sample, incremental and fully online way, without waiting until the end of the trajectory (or indeed, without requiring that trajectories end at all). This corresponds to the iterative way in which Q-iteration (Algorithm 3.1) and iterative policy evaluation (Algorithm 3.4) work, estimating new Q-functions on the basis of old Q-functions, which is called bootstrapping in the literature.

    By far the most popular methods from the temporal difference class are Q-learning and SARSA, which we present next.

    5.1 Q-learning

    Q-learning starts from an arbitrary initial Q-function Q_0 and updates it using observed state transitions and rewards, i.e., data tuples of the form (x_k, u_k, x_{k+1}, r_{k+1}). After each transition, the Q-function is updated using such a data tuple, as follows:

    Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α_k [ r_{k+1} + γ max_{u'} Q_k(x_{k+1}, u') − Q_k(x_k, u_k) ]   (5.1)

    where α_k ∈ (0,1] is the learning rate. The term between square brackets is the temporal difference, i.e., the difference between the updated estimate r_{k+1} + γ max_{u'} Q_k(x_{k+1}, u') of the optimal Q-value of (x_k, u_k), and the current estimate Q_k(x_k, u_k). This new estimate is actually the Q-iteration mapping (3.1) applied to Q_k in the state-action pair (x_k, u_k), where ρ(x_k, u_k) has been replaced by the observed reward r_{k+1}, and f(x_k, u_k) by the observed next state x_{k+1}. Thus Q-learning can be seen as a sample-based, incremental variant of Q-iteration.

    As the number of transitions k approaches infinity, Q-learning asymptotically converges to Q* if the state and action spaces are discrete and finite, and under the following conditions:


    The sum Σ_{k=0}^{∞} α_k^2 produces a finite value, whereas the sum Σ_{k=0}^{∞} α_k produces an infinite value.

    All the state-action pairs are (asymptotically) visited infinitely often.

    The first condition is not difficult to satisfy. For instance, a satisfactory standard choice is:

    α_k = 1/k   (5.2)

    In practice, the learning rate schedule may require tuning, because it influences the number of transitions required by Q-learning to obtain a good solution. A good choice for the learning rate schedule depends on the problem at hand.

    The second condition can be satisfied using an exploratory policy, e.g., ε-greedy (4.1) or Boltzmann (4.2). As with MC, there is a need to balance exploration and exploitation. Usually, the exploration parameters are time-dependent (i.e., using a k-dependent ε_k for ε-greedy, τ_k for Boltzmann), and decrease over time. For instance, an ε-greedy exploration schedule of the form ε_k = 1/k diminishes to 0 as k → ∞, while still satisfying the second convergence condition of Q-learning, i.e., allowing infinitely many visits to all the state-action pairs. Notice the similarity of this exploration schedule with the learning rate schedule (5.2). Like the learning rate schedule, the exploration schedule has a significant effect on the performance of Q-learning.

    Algorithm 5.1 presents Q-learning with ε-greedy exploration.

    Algorithm 5.1 Q-learning.
    Input: discount factor γ, exploration schedule {ε_k}_{k=0}^∞, learning rate schedule {α_k}_{k=0}^∞
    1: initialize Q-function, e.g., Q_0(x,u) ← 0, ∀x,u
    2: measure initial state x_0
    3: for every time step k = 0, 1, 2, ... do
    4:   u_k ← { u ∈ argmax_ū Q_k(x_k, ū)        with probability 1 − ε_k (exploit)
               { a uniformly random action in U  with probability ε_k (explore)
    5:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
    6:   Q_{k+1}(x_k, u_k) ← Q_k(x_k, u_k) + α_k [ r_{k+1} + γ max_{u'} Q_k(x_{k+1}, u') − Q_k(x_k, u_k) ]
    7: end for
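    A compact tabular sketch of Algorithm 5.1 follows; env_step(x, u) stands in for interacting with the real system, episode resets are omitted, and the 1/(k+1) schedules for α_k and ε_k are illustrative choices, not prescriptions from the text.

```python
import random

def q_learning(env_step, x0, actions, gamma, n_steps):
    """Tabular Q-learning with epsilon-greedy exploration (sketch of Algorithm 5.1)."""
    Q = {}                                              # Q(x,u), missing entries default to 0
    q = lambda x, u: Q.get((x, u), 0.0)
    x = x0
    for k in range(n_steps):
        epsilon, alpha = 1.0 / (k + 1), 1.0 / (k + 1)   # illustrative schedules
        if random.random() < epsilon:
            u = random.choice(actions)                  # explore
        else:
            u = max(actions, key=lambda a: q(x, a))     # exploit
        x_next, r = env_step(x, u)
        # temporal-difference update (5.1)
        td = r + gamma * max(q(x_next, a) for a in actions) - q(x, u)
        Q[(x, u)] = q(x, u) + alpha * td
        x = x_next
    return Q
```

    Whether such quickly decaying schedules work well in practice is exactly the tuning issue discussed above.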

    Note that no difference has been made between episodic and continuing tasks. In episodic tasks, whenever a terminal state is reached, the process must be somehow reset to a new, nonterminal initial state; the change from the terminal state to this initial state is not a valid transition and should therefore not be used for learning. The algorithm remains unchanged. This also holds for the algorithms presented in the sequel.


    5.2 SARSA

    SARSA was proposed as an alternative to Q-learning, and is very similar to it. The name SARSA is obtained by joining together the initials of every element in the data tuples employed by the algorithm, namely: State, Action, Reward, next State, next Action. Formally, such a tuple is denoted by (x_k, u_k, r_{k+1}, x_{k+1}, u_{k+1}). SARSA starts with an arbitrary initial Q-function Q_0 and updates it at each step using tuples of this form, as follows:

    Q_{k+1}(x_k, u_k) = Q_k(x_k, u_k) + α_k [ r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) − Q_k(x_k, u_k) ]   (5.3)

    where α_k ∈ (0,1] is the learning rate. The term between square brackets is the temporal difference, obtained as the difference between the updated estimate r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) of the Q-value for (x_k, u_k), and the current estimate Q_k(x_k, u_k).

    This is not the same as the temporal difference used in Q-learning (5.1). While the Q-learning temporal difference includes the maximal Q-value in the next state, the SARSA temporal difference includes the Q-value of the action actually taken in this next state. This means that SARSA performs online, model-free policy evaluation steps for the policy that is currently being followed. In particular, the new estimate r_{k+1} + γ Q_k(x_{k+1}, u_{k+1}) of the Q-value for (x_k, u_k) is actually the policy evaluation mapping (3.7) applied to Q_k in the state-action pair (x_k, u_k). Here, ρ(x_k, u_k) has been replaced by the observed reward r_{k+1}, and f(x_k, u_k) by the observed next state x_{k+1}.
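    To highlight the contrast just described, here are the two updates side by side as small helper functions; the dictionary-based Q storage is our own assumption.

```python
def sarsa_update(Q, x, u, r, x_next, u_next, alpha, gamma):
    """One SARSA update (5.3): bootstrap on the action actually taken in x_next."""
    q = lambda s, a: Q.get((s, a), 0.0)
    Q[(x, u)] = q(x, u) + alpha * (r + gamma * q(x_next, u_next) - q(x, u))

def q_learning_update(Q, x, u, r, x_next, actions, alpha, gamma):
    """One Q-learning update (5.1): bootstrap on the maximal Q-value in x_next."""
    q = lambda s, a: Q.get((s, a), 0.0)
    Q[(x, u)] = q(x, u) + alpha * (r + gamma * max(q(x_next, a) for a in actions) - q(x, u))
```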

    Because it works online, SARSA cannot afford to wait until the Q-function has converged before it improves the policy. Instead, to select actions, SARSA combines a greedy policy in the current Q-function with exploration. Because of the greedy component, SARSA implicitly performs a policy improvement at every time step, and is therefore a type of optimistic, online policy iteration. Policy improvements are optimistic, like in the MC variant of Algorithm 4.2, because they are done on the basis of Q-functions that may not be accurate evalua