
Machine Learning, 1, 1-24 (1999). © 1999 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Variable Resolution Discretization in Optimal Control

RÉMI MUNOS AND ANDREW MOORE

Robotics Institute, Carnegie Mellon University,

5000 Forbes Ave, Pittsburgh, PA 15213, USA

Received March 1, 1999

Editor: Satinder Singh

Abstract. The problem of state abstraction is of central importance in optimal control, reinforcement learning and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous time and space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently-calculable measure of the extent to which changes in some state affect the values of some other set of states. Variance is an efficiently-calculable measure of how risky some state is in a Markov chain: a low variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function.

The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, non-linear, non-minimum phase, but two dimensional problem of the "Car on the Hill". It then evaluates the performance of a variety of splitting criteria on many benchmark problems (which we have published on the web), paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.

Keywords: Reinforcement learning, optimal control, variable resolution discretization

1. Introduction

This paper is about non-uniform discretization of state spaces when finding optimal controllers for continuous time and space Markov Processes.

Uniform discretizations (generally based on finite element or finite difference techniques (Kushner & Dupuis, 1992)) provide us with important convergence results (see the analytical approach of (Barles & Souganidis, 1991; Crandall, Ishii, & Lions, 1992; Crandall & Lions, 1983; Munos, 1999) and the probabilistic results of (Kushner & Dupuis, 1992; Dupuis & James, 1998)), but suffer from impractical computational requirements when the size of the discretization step is small, especially when the state space is high dimensional. On the other hand, approximation methods (Bertsekas & Tsitsiklis, 1996; Baird, 1995; Sutton, 1996) can handle high dimensionality but, in general, have no guarantee of convergence to the optimal solution (Boyan & Moore, 1995; Baird, 1995; Munos, 1999). Some local convergence results are in (Gordon, 1995; Baird, 1998).

In this paper we try to keep the convergence properties of the discretized methods while introducing an approximation factor by the iterative design of a variable resolution. We only consider the "general towards specific" approach: an initial coarse grid is successively refined in some areas of the state space by a splitting process, until some desired approximation (of the value function or the optimal control) is reached.

First, we implement two splitting criteria based on the value function (see section 6), then we define a criterion of inconsistency between the value function and the policy (see section 7). In order to define the effect of splitting a state on other states, we define in section 8 the notion of influence. And we estimate the expected gain in the approximation of the value function when splitting states by defining in section 9 the variance of a Markov chain. By combining these two notions, we deduce, for a given discretization, the states whose splitting will most influence the parts of the state space where there is a change in the optimal control, leading to an increase in resolution at those important areas.

We illustrate the different splitting criteria on the "Car on the Hill" problem, and in section 11 we show results for other control problems, including the well-known 4-dimensional "Cart-pole" and "Acrobot" problems.

In this paper we make the assumption that we have a model of the dynamics and of the reinforcement function. Moreover, we assume that the dynamics is deterministic.

2. Description of the optimal control problem

We consider discounted deterministic control problems, which include the well-known reinforcement learning benchmarks of Car on the Hill (Moore, 1991), Cart-Pole (Barto, Sutton, & Anderson, 1983) and Acrobot (Sutton, 1996). Let x(t) ∈ X be the state of the system, with the state space X a compact subset of ℝ^d. The evolution of the state depends on the control u(t) ∈ U (with the control space U a finite set of discrete actions) through the differential equation, called the state dynamics:

dx/dt = f(x(t), u(t))    (1)

For an initial state x and a control function u(t), this equation leads to a unique trajectory x(t). Let τ be the exit time from the state space (with the convention that if x(t) always stays in X, then τ = ∞). Then, we define the gain J as the discounted cumulative reinforcement:

J(x; u(·)) = ∫_0^τ γ^t r(x(t), u(t)) dt + γ^τ R(x(τ))    (2)

where r(x, u) is the current reinforcement and R(x) the boundary reinforcement. γ is the discount factor (0 ≤ γ < 1). For convenience reasons, in what follows, we assume that γ < 1. However, most of the results apply to the undiscounted case γ = 1 as well, assuming that for any control u(t), the trajectories do not loop (i.e. x(t1) ≠ x(t2) for t1 ≠ t2).
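As a concrete reading of (1) and (2), the following minimal sketch (a toy one-dimensional state space, dynamics and reinforcements chosen only for illustration, not one of the paper's benchmarks) integrates the state dynamics with a small Euler step and accumulates the discounted reinforcement until the trajectory exits X:

```python
import math

# Toy problem on X = [0, 1]; the functions below are invented for the example.
GAMMA, DT = 0.9, 0.001

def f(x, u):           # state dynamics dx/dt = f(x, u)
    return u

def r(x, u):           # current reinforcement
    return -0.5

def R(x):              # boundary reinforcement
    return 1.0 if x >= 1.0 else -1.0

def gain(x0, u):       # J(x0; u) for a constant control u, as in equation (2)
    x, t, J = x0, 0.0, 0.0
    while 0.0 <= x <= 1.0:
        J += GAMMA ** t * r(x, u) * DT     # integral of gamma^t r(x(t), u) dt
        x += DT * f(x, u)                  # Euler step of equation (1)
        t += DT
        if t > 100.0:                      # trajectory never exits: tau = infinity
            return J
    return J + GAMMA ** t * R(x)           # plus gamma^tau R(x(tau))

print(gain(0.2, +1.0), gain(0.2, -1.0))
```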


The objective of the control problem is to find, for any initial condition x, the control u*(t) that optimizes the functional J. Here we use the method of Dynamic Programming (DP), which introduces the value function (VF), maximum of J as a function of the initial state x:

V(x) = sup_{u(t)} J(x; u(t)).

Following the DP principle, we can prove (Fleming & Soner, 1993) that V satisfies a first-order non-linear differential equation, called the Hamilton-Jacobi-Bellman (HJB) equation:

Theorem 1. If V is differentiable at x ∈ X, let DV(x) be the gradient of V at x; then the following HJB equation holds at x:

V(x) ln γ + max_{u∈U} [ DV(x)·f(x, u) + r(x, u) ] = 0    (3)

DP computes the VF in order to define the optimal control with a feed-back control policy π(x) : X → U such that the optimal control u*(t) at time t only depends on the current state x(t): u*(t) = π(x(t)). Indeed, from the value function, we deduce the following optimal feed-back control policy:

π(x) ∈ arg max_{u∈U} [ DV(x)·f(x, u) + r(x, u) ]    (4)
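A tiny self-contained illustration of this feed-back rule (the value function, dynamics and controls below are toy stand-ins, not part of the paper):

```python
# Rule (4): pick the control maximizing DV(x).f(x,u) + r(x,u) at the current state.
def feedback_control(x, DV, f, r, controls):
    return max(controls,
               key=lambda u: sum(g * fi for g, fi in zip(DV(x), f(x, u))) + r(x, u))

# Toy problem: 1-d state, f(x, u) = (u,), r = 0, V(x) = -x^2 so DV(x) = (-2x,).
f = lambda x, u: (u,)
r = lambda x, u: 0.0
DV = lambda x: (-2.0 * x[0],)

for x0 in (-0.5, 0.3):
    print(x0, feedback_control((x0,), DV, f, r, controls=(-1.0, +1.0)))
# The greedy control pushes the state toward x = 0, where this toy V is maximal.
```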

3. The discretization process

In order to discretize the continuous control problem described in the previous section, we use a process based on the finite element methods of (Kushner & Dupuis, 1992). We use a class of functions known as barycentric interpolators (Munos & Moore, 1998), built from a triangulation of the state space. These functions are piecewise linear inside each simplex, but might be discontinuous at the boundary between two simplexes. This representation has been chosen for its very fast computational properties.

Here is a description of this class of functions. The state space is discretized into a variable resolution grid using a tree structure. The root of the tree covers the whole state space, supposed to be a (hyper) rectangle. It has two branches which divide the state space into two smaller rectangles by means of a hyperplane perpendicular to the chosen splitting dimension. In the same way, each node (except for the leaf ones) splits the rectangle it covers at its middle, in some direction i = 1..d, into two nodes of half area (see Figure 1). This kind of structure is known as a kd-trie (Knuth, 1973), and is a special kind of kd-tree (Friedman, Bentley, & Finkel, 1977) in which splits occur at the center of every cell.

For every leaf, we consider the Coxeter-Freudenthal-Kuhn triangulation (or simply the Kuhn triangulation (Moore, 1992)). In dimension 2 (Figure 1(b)) each rectangle is composed of 2 triangles. In dimension 3 (see Figure 2) they are composed of 6 pyramids, and in dimension d, of d! simplexes.

The interpolated functions considered here are defined by their values at the corners of the rectangles. We use the Kuhn triangulation to linearly interpolate inside the rectangles. Thus these functions are piecewise linear, continuous inside each rectangle, but may be discontinuous at the boundary between two rectangles.


Figure 1. (a) An example of discretization of the state space. There are 12 rectangles and 24 corners (the dots). (b) The corresponding tree structure. The area covered by each node is indicated in gray level. We implement a Kuhn triangulation on every leaf.

Remark. As we are going to approximate the value function V with such piecewise linear functions, it is very easy to compute the gradient DV at (almost) any point of the state space, thus making it possible to use the feed-back rule (4) to deduce the corresponding optimal control.

Figure 2. The Kuhn triangulation of a (3d) rectangle. The point x satisfying 1 ≥ x2 ≥ x0 ≥ x1 ≥ 0 is in the simplex (ξ0, ξ4, ξ5, ξ7).

3.1. Computational issues

Although the number of simplexes inside a rectangle is factorial in the dimension d, the computation time for interpolating the value at any point inside a rectangle is only of order O(d ln d), which corresponds to a sort of the d relative coordinates (x0, ..., x_{d-1}) of the point inside the rectangle.

Assume we want to compute the indexes i0, ..., id of the (d + 1) vertices of the simplex containing a point defined by its relative coordinates (x0, ..., x_{d-1}) with respect to the rectangle whose corners are {ξ0, ..., ξ_{2^d - 1}}. The indexes of the corners use the binary decomposition in dimension d, as illustrated in Figure 2. Computing these indexes is achieved by sorting the coordinates from the highest to the smallest: there exist indices j0, ..., j_{d-1}, a permutation of {0, ..., d-1}, such that 1 ≥ x_{j0} ≥ x_{j1} ≥ ... ≥ x_{j_{d-1}} ≥ 0. Then the indices i0, ..., id of the (d + 1) vertices of the simplex containing the point are: i0 = 0, i1 = i0 + 2^{j0}, ..., ik = i_{k-1} + 2^{j_{k-1}}, ..., id = i_{d-1} + 2^{j_{d-1}} = 2^d - 1. For example, if the coordinates satisfy 1 ≥ x2 ≥ x0 ≥ x1 ≥ 0 (illustrated by the point x in Figure 2), then the vertices are: ξ0 (every simplex has this vertex, as well as ξ_{2^d-1} = ξ7, in common), ξ4 (we added 2^2), ξ5 (we added 2^0) and ξ7 (we added 2^1).

The corresponding barycentric coordinates λ0, ..., λd of the point inside the simplex ξ_{i0}, ..., ξ_{id} are: λ0 = 1 - x_{j0}, λ1 = x_{j0} - x_{j1}, ..., λk = x_{j_{k-1}} - x_{j_k}, ..., λd = x_{j_{d-1}} - 0 = x_{j_{d-1}}. In the previous example, the barycentric coordinates are: λ0 = 1 - x2, λ1 = x2 - x0, λ2 = x0 - x1, λ3 = x1.

The approach of using Kuhn triangulations to interpolate the value function has been introduced to the reinforcement learning literature by (Davies, 1997).
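The procedure above translates almost directly into code. The sketch below is illustrative (the function names are ours, not the paper's); it returns the binary-decomposition indices of the d+1 simplex vertices and the barycentric coordinates of a point given by its relative coordinates inside the rectangle:

```python
def kuhn_simplex(x):
    """x: relative coordinates (x_0, ..., x_{d-1}) of the point, each in [0, 1]."""
    d = len(x)
    # Sort the coordinate axes by decreasing coordinate value: j_0, ..., j_{d-1}.
    order = sorted(range(d), key=lambda j: x[j], reverse=True)
    # Vertex indices: i_0 = 0, i_k = i_{k-1} + 2^{j_{k-1}}, ending at 2^d - 1.
    vertices = [0]
    for j in order:
        vertices.append(vertices[-1] + (1 << j))
    # Barycentric coordinates: lambda_0 = 1 - x_{j_0}, lambda_k = x_{j_{k-1}} - x_{j_k},
    # lambda_d = x_{j_{d-1}}.  They are non-negative and sum to one.
    xs = [x[j] for j in order]
    lambdas = [1.0 - xs[0]] + [xs[k - 1] - xs[k] for k in range(1, d)] + [xs[-1]]
    return vertices, lambdas

def interpolate(corner_values, x):
    """Piecewise-linear interpolation of the corner values inside the rectangle."""
    vertices, lambdas = kuhn_simplex(x)
    return sum(l * corner_values[i] for i, l in zip(vertices, lambdas))

# The example of Figure 2: 1 >= x2 >= x0 >= x1 >= 0 gives vertices (0, 4, 5, 7).
print(kuhn_simplex([0.5, 0.2, 0.8]))
```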

3.2. Building the discretized MDP

For a given discretization, we build a corresponding Markov Decision Process (MDP) in the following way. The state space of the MDP is the set Ξ of corners of the tree. The control space is the finite set U. For every corner ξ ∈ Ξ and control u ∈ U, we approximate a part of the corresponding trajectory x(t) (with the Euler or Runge-Kutta method) by integrating the state dynamics (1) from initial state ξ for a constant control u, during some time τ(ξ, u), until it enters a new rectangle at some point η(ξ, u) (see Figure 3). At the same time, we also compute the integral of the current reinforcement:

R_MDP(ξ, u) = ∫_{t=0}^{τ(ξ,u)} γ^t r(x(t), u) dt

which defines the reinforcement function of the MDP. Then we compute the vertices (ξ0, ..., ξd) of the simplex containing η(ξ, u) and the corresponding barycentric coordinates λ_{ξ0}(η(ξ, u)), ..., λ_{ξd}(η(ξ, u)). The probabilities of transition p(ξi | ξ, u) of the MDP from state ξ and control u to states ξi are defined by these barycentric coordinates (see Figure 3): p(ξi | ξ, u) = λ_{ξi}(η(ξ, u)). Thus, the DP equation corresponding to this MDP is:

V(ξ) = max_u [ γ^{τ(ξ,u)} Σ_{i=0}^{d} λ_{ξi}(η(ξ, u)) V(ξi) + R_MDP(ξ, u) ]    (5)

If, while integrating (1) from initial state ξ with the control u, the trajectory exits from the state space at time τ(ξ, u), then (ξ, u) leads to a terminal state ξt (i.e. satisfying p(ξt | ξt, v) = 1, p(ξ ≠ ξt | ξt, v) = 0 for all v) with probability 1 and with the reinforcement:

R_MDP(ξ, u) = ∫_{t=0}^{τ(ξ,u)} γ^t r(x(t), u) dt + γ^{τ(ξ,u)} R(x(τ(ξ, u))).
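A self-contained one-dimensional toy sketch of this construction (the grid, dynamics and reinforcements below are invented for the example; on an interval the barycentric coordinates reduce to linear-interpolation weights, and only a crude terminal-state case is handled):

```python
GAMMA, DT = 0.9, 0.01
corners = [0.0, 0.25, 0.5, 0.75, 1.0]          # a uniform 1-D "grid" on X = [0, 1]

def f(x, u):          # toy dynamics
    return u * (1.0 + x)

def r(x, u):          # toy current reinforcement
    return -0.1

def cell_index(x):    # index i such that corners[i] <= x < corners[i+1] (clamped)
    return max(0, min(len(corners) - 2, int(x / 0.25)))

def build_transition(xi, u):
    """One MDP transition for (corner xi, control u): probabilities, R_MDP, tau."""
    x, t, reward = xi, 0.0, 0.0
    start_cell = cell_index(xi + 1e-9 * u)     # cell the trajectory immediately enters
    while cell_index(x) == start_cell and 0.0 <= x <= 1.0:
        reward += GAMMA ** t * r(x, u) * DT    # integral of gamma^t r(x(t), u) dt
        x += DT * f(x, u)                      # Euler step of the dynamics (1)
        t += DT
    if x < 0.0 or x > 1.0:                     # exits X: terminal state of section 3.2
        return {"terminal": 1.0}, reward + GAMMA ** t * (-1.0), t   # toy boundary R = -1
    i = cell_index(x)                          # entry point eta lies in cell i
    lam = (x - corners[i]) / (corners[i + 1] - corners[i])
    probs = {corners[i]: 1.0 - lam, corners[i + 1]: lam}   # barycentric coordinates
    return probs, reward, t                    # p(.|xi,u), R_MDP(xi,u), tau(xi,u)

print(build_transition(0.25, +1))
```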

3.3. Resolution of the discretized MDP

We can use any of the classical methods to solve the discretized MDP, i.e. value iteration, policy iteration, modified policy iteration (Puterman, 1994), (Bertsekas, 1987), (Barto, Bradtke, & Singh, 1995), or prioritized sweeping (Moore & Atkeson, 1993).

Figure 3. According to the current (variable resolution) grid, we build a discrete MDP. For every corner ξ (state of the MDP) and every control u, we integrate the corresponding trajectory until it enters a new rectangle at η(ξ, u). The interpolated value at η(ξ, u) is a linear combination of the values of the vertices of the simplex it is in (here (ξ0, ξ1, ξ2)). Furthermore, it is a linear combination with positive coefficients that sum to one. Thus, doing this interpolation is mathematically equivalent to probabilistically jumping to a vertex. The probabilities of transition of the MDP for (state ξ, control u) to (states {ξi}_{i=0..2}) are the barycentric coordinates λ_{ξi}(η(ξ, u)) of η(ξ, u) inside (ξ0, ξ1, ξ2).
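The simplest of these solvers is value iteration applied directly to equation (5). The following is a minimal, self-contained sketch under toy assumptions: a small randomly generated discrete MDP stands in for the one built in section 3.2, and each transition stores the discount γ^τ, the barycentric transition probabilities and the reinforcement R_MDP; none of the numbers come from the paper.

```python
import random

random.seed(0)
N_STATES, CONTROLS, GAMMA = 20, (0, 1), 0.9

def random_transition():
    tau = random.uniform(0.05, 0.2)
    succ = random.sample(range(N_STATES), 3)          # "simplex vertices"
    w = [random.random() for _ in succ]
    probs = [x / sum(w) for x in w]                   # "barycentric coordinates"
    return GAMMA ** tau, list(zip(succ, probs)), random.uniform(-0.1, 0.1)

mdp = {(s, u): random_transition() for s in range(N_STATES) for u in CONTROLS}

# Value iteration on equation (5).
V = [0.0] * N_STATES
for _ in range(200):
    V = [max(disc * sum(p * V[sp] for sp, p in probs) + reward
             for u in CONTROLS
             for disc, probs, reward in [mdp[(s, u)]])
         for s in range(N_STATES)]

# Greedy policy of the discretized MDP (used later by the criterion of section 7).
policy = [max(CONTROLS, key=lambda u: mdp[(s, u)][0]
              * sum(p * V[sp] for sp, p in mdp[(s, u)][1]) + mdp[(s, u)][2])
          for s in range(N_STATES)]
print(V[:3], policy[:3])
```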

4. An example: the "Car on the Hill" control problem

For a description of the dynamics of this problem, see (Moore & Atkeson, 1995). This problem is of dimension 2. In our experiments, we chose the reinforcement functions as follows: the current reinforcement r(x, u) is zero everywhere. The terminal reinforcement R(x) is -1 if the car exits from the left side of the state space, and varies linearly between +1 and -1 depending on the velocity of the car when it exits from the right side of the state space. The best reinforcement +1 occurs when the car reaches the right boundary with a null velocity (Figure 4). The control u has only 2 possible values: maximal positive or negative thrust.

Figure 6 represents the interpolated value function of the MDP obtained by a regular discretization of 257 by 257 states. We observe the following distinctive features of the value function:

• There is a discontinuity in the VF along "Frontier 1" (see Figure 6), which results from the fact that, for an initial point situated above this frontier, the optimal trajectory stays inside the state space (and eventually leads to a positive reward), so the value function at this point is positive; whereas for an initial point below this frontier, any control leads the car to exit from the left boundary (because the initial velocity is too negative), thus the corresponding value function is negative (see the optimal trajectories in Figure 5). We observe that there is no change in the optimal control around this frontier.

• There is a discontinuity in the gradient of the VF along the upper part of "Frontier 2", which results from a change in the optimal control. For example, a point above frontier 2 can reach the top of the hill directly, whereas a point below this frontier has to go down and do one loop to gain enough velocity to reach the top (see Figure 5). Moreover, we observe that around the lower part of frontier 2 (see Figure 6), there is no visible discontinuity of the VF despite the fact that there is a change in the optimal control.

• There is a discontinuity in the gradient of the VF along "Frontier 3" because of a change in the optimal control (below the frontier, the car accelerates in order to reach the reward as fast as possible, whereas above, it decelerates to reach the top of the hill with the lowest velocity and receive the highest reward).

Figure 4. The "Car on the Hill" control problem. The car is subject to thrust, gravitation and resistance; the reinforcement is R = -1 on the left boundary, and on the right boundary R = +1 for null velocity down to R = -1 for maximal velocity.

Figure 5. The optimal policy (axes: position and velocity) is indicated by different gray levels, with frontier 3 and the upper and lower parts of frontier 2 marked. Several optimal trajectories are drawn for different initial starting points.

Figure 6. The value function of the Car-on-the-Hill problem obtained by a regular grid of 257 by 257 = 66049 states. Frontier 1 illustrates the discontinuity of the VF; frontiers 2 and 3 (the dashed lines) stand where there is a change in the optimal control.


We deduce from these observations that a good approximation of the value function does not necessarily mean a good approximation of the optimal control, since:

• The approximation of the value function is not sufficient to predict the change in the optimal control around the lower part of frontier 2.

• A good approximation of the value function is not necessary around frontier 1, since there is no change in the optimal control.

5. The variable resolution approach

The idea is to start with an initial coarse discretization, build the corresponding MDP, solve it in order to obtain a (coarse) approximation of the value function, then locally refine the discretization by splitting some cells according to the following process:

1. Score each cell and each direction i according to how promising it is to split, using some measure called split-criterion(i).

2. Pick the top f% (where f is a parameter) of the highest scoring cells.

3. Split them along the direction given by argmax_i split-criterion(i). Use the dynamics and reward model to create a new (larger) discretized MDP (see the splitting process in Figure 7). Note that only the cells that were split, and those whose successor states involve a split cell, need to have their state transitions recomputed.

4. Go to step 1 until we estimate that the approximation of the value function or the optimal control is precise enough.

Thus, the central purpose of this paper is the study of several splitting criteria.
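As a concrete illustration of steps 1-4, here is a self-contained one-dimensional toy sketch (cells are intervals, a fixed test function stands in for the solved value function, and the split criterion is the corner-value difference of section 6.1; in a real run, the MDP of section 3 would be rebuilt and re-solved at every iteration):

```python
import math

def v(x):                       # stand-in for the solved value function
    return math.tanh(20 * (x - 0.3))

def refine(cells, f=0.5, n_iterations=5):
    for _ in range(n_iterations):
        # Step 1: score each cell by the absolute corner-value difference.
        scored = sorted(cells, key=lambda c: abs(v(c[1]) - v(c[0])), reverse=True)
        # Step 2: pick the top f% of cells; step 3: split each at its middle.
        to_split = set(scored[:max(1, int(f * len(scored)))])
        new_cells = []
        for (a, b) in cells:
            if (a, b) in to_split:
                new_cells += [(a, (a + b) / 2), ((a + b) / 2, b)]
            else:
                new_cells.append((a, b))
        cells = new_cells           # step 4: iterate
    return cells

cells = [(i / 8, (i + 1) / 8) for i in range(8)]       # initial coarse grid
print(sorted(refine(cells)))        # resolution concentrates around x = 0.3
```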

Figure 7. Several discretizations resulting from successive splitting operations.

Remark. In this paper, we only consider a "general towards specialized" process, in the sense that the discretization is always refined. We could also consider some "generalization" process where, for example, the tree coding for the discretization could be pruned, in order to avoid irrelevant partitioning into too small subsets.

In what follows, we present several local splitting criteria and illustrate the resulting discretizations on the previous "Car on the Hill" control problem.


6. Criteria based on the value function

6.1. First criterion: average corner-value difference

For every rectangle, we compute the average of the absolute difference of the values at the corners of the edges, for all directions i = 0..d-1. Let us denote by Ave(i) this criterion for direction i. For example, consider the (3d) rectangle described in Figure 2. Then this split-criterion is:

Ave(0) = 1/4 [ |V(ξ1) - V(ξ0)| + |V(ξ3) - V(ξ2)| + |V(ξ5) - V(ξ4)| + |V(ξ7) - V(ξ6)| ]
Ave(1) = 1/4 [ |V(ξ2) - V(ξ0)| + |V(ξ3) - V(ξ1)| + |V(ξ6) - V(ξ4)| + |V(ξ7) - V(ξ5)| ]
Ave(2) = 1/4 [ |V(ξ4) - V(ξ0)| + |V(ξ5) - V(ξ1)| + |V(ξ6) - V(ξ2)| + |V(ξ7) - V(ξ3)| ]
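A minimal sketch of these two local criteria (it assumes corner values stored in a list indexed by the binary decomposition of section 3.1; the variance version corresponds to one reading of the value non-linearity criterion of section 6.2 below):

```python
def edge_differences(values, direction, d):
    """Absolute value differences across the 2^(d-1) edges parallel to `direction`."""
    diffs = []
    for corner in range(2 ** d):
        if not corner & (1 << direction):            # lower end of an edge
            other = corner | (1 << direction)        # upper end of the same edge
            diffs.append(abs(values[other] - values[corner]))
    return diffs

def ave(values, direction, d):
    """Section 6.1: average absolute corner-value difference along `direction`."""
    diffs = edge_differences(values, direction, d)
    return sum(diffs) / len(diffs)

def value_nonlinearity(values, direction, d):
    """Section 6.2: variance of the same differences along `direction`."""
    diffs = edge_differences(values, direction, d)
    m = sum(diffs) / len(diffs)
    return sum((x - m) ** 2 for x in diffs) / len(diffs)

# The 3-d example of Figure 2, with made-up corner values V(xi_0), ..., V(xi_7).
V, d = [0.0, 0.1, 0.4, 0.5, 0.2, 0.3, 0.9, 1.0], 3
print([ave(V, i, d) for i in range(d)])
print([value_nonlinearity(V, i, d) for i in range(d)])
```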

Figure 8 represents the discretization obtained after 15 iterations of this procedure, starting with a 9 by 9 initial grid and using the corner-value difference criterion with a splitting rate of 50% of the rectangles at each iteration.

Figure 8. The discretization of the state space for the "Car on the Hill" problem using the corner-value difference criterion.

Figure 9. The discretization of the state space for the "Car on the Hill" problem using the value non-linearity criterion.

6.2. Second criterion: value non-linearity

For every rectangle, we compute the variance of the absolute increase of the values at the corners of the edges, for all directions i = 0..d-1. This criterion is similar to the previous one except that it computes the variance instead of the average.

Figure 9 shows the corresponding discretization using the value non-linearity criterion with a splitting rate of 50% after 15 iterations.

Comments on these results:

• We observe that in both cases, the splitting occurs around frontiers 1 and 3 and the upper part of frontier 2, previously defined. In fact, the first criterion tends to reduce the variation of the values, and splits wherever the value function is not constant. Figure 10(a)&(b) shows a (1-dimensional) cut of a discontinuity and the corresponding discretization and approximation obtained using the corner-value difference split criterion.

• The value non-linearity criterion tends to reduce the change of variation of the values, and thus splits wherever the value function is not linear. So this criterion will also concentrate on the same irregularities, but with two important differences compared to the corner-value difference criterion:

  - The value non-linearity criterion splits more parsimoniously than the corner-value difference. See, for example, the difference of splitting in the area above frontier 3.

  - The discretizations resulting from the split of a discontinuity by the corner-value difference and the value non-linearity criteria are different (see Figure 10). The value non-linearity criterion splits where the approximated function (here some kind of sigmoid function whose slope depends on the density of the resolution) is the least linear (Figure 10(c)&(d)). This explains the 2 parallel tails observed around the frontiers (mainly the right part of frontier 1) in Figure 9.

• The refinement process does not split around the bottom part of frontier 2, although there is a change in the optimal control (because the VF is almost constant in this area). Moreover, a huge amount of memory is spent on the approximation of the discontinuity (frontier 1), although the optimal control is constant in this area.

Figure 10. The discretization around a discontinuity resulting from the corner-value difference (a)&(b) and the value non-linearity (c)&(d) split criteria, for a coarse (a)&(c) and a dense (b)&(d) resolution.

Thus we can wonder whether it is really useful to split so much around frontier 1, knowing that it will not result in an improved policy. The next section introduces a new criterion which takes the policy into account.


7. A criterion based on the policy

Figure 5 shows the optimal policy and several optimal trajectories for different starting points. We would like to define a refinement process that refines around the areas of change in the optimal control, that is, around frontier 2 (upper and lower parts) and frontier 3, but not around frontier 1. In what follows, we propose such a criterion, based on the inconsistency between the control derived from the value function and the control derived from the policy.

7.1. The policy disagreement criterion

When we solve the MDP and compute the value function of the DP equation (5), we deduce the following policy for any state ξ ∈ Ξ:

π(ξ) ∈ arg max_{u∈U} [ γ^{τ(ξ,u)} Σ_{i=0}^{d} λ_{ξi}(η(ξ, u)) V(ξi) + R_MDP(ξ, u) ]    (6)

and we can compare it with the optimal control law (4) derived from the gradient of V. The policy disagreement criterion compares the control derived from the local gradient of V (4) with the control derived from the policy of the MDP (6).

Remark. Instead of computing the gradient DV for all the (d!) simplexes in the rectangles, we compute an approximated gradient ~DV for all the (2^d) corners, based on a finite difference quotient. For the example of Figure 2, the approximated gradient at corner ξ0 is:

( [V(ξ1) - V(ξ0)] / ||ξ0ξ1||, [V(ξ2) - V(ξ0)] / ||ξ0ξ2||, [V(ξ4) - V(ξ0)] / ||ξ0ξ4|| ).

Thus, for every corner, we compute this approximated gradient and the corresponding optimal control from (4) and compare it to the optimal policy given by (6). Figure 11 shows the discretization obtained by splitting the rectangles where these two measures of the optimal control diverge.

This criterion is interesting since it splits at the places where there is a change in the optimal control, thus refining the resolution at the most important parts of the state space for the approximation of the optimal control. However, as we can expect, if we only use this criterion, the value function will not be well approximated, and thus this process may converge to a sub-optimal performance. Indeed, we can observe in Figure 11 that the bottom part of frontier 2 is (slightly) situated higher than its optimal position, illustrated in Figure 5. This results in an underestimation of the value function in this area, because of the lack of precision around the discontinuity (frontier 1). In section 7.3, we will observe that the performance of the discretization resulting from this splitting criterion is relatively weak.

However, this splitting criterion can be beneficially combined with the previous ones based on the approximation of the VF.
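A small self-contained sketch of the disagreement test (the 2-d dynamics, reinforcement, corner values and MDP policy below are toy stand-ins chosen only for illustration, not the paper's benchmarks):

```python
def greedy_from_gradient(grad, x, controls, f, r):
    """Rule (4): argmax_u [ DV(x) . f(x, u) + r(x, u) ]."""
    return max(controls,
               key=lambda u: sum(g * fi for g, fi in zip(grad, f(x, u))) + r(x, u))

def approx_gradient(corner_values, sizes):
    """Finite-difference gradient ~DV at corner xi_0 of a d-dim rectangle.
    corner_values are indexed by binary decomposition; sizes are edge lengths."""
    d = len(sizes)
    return [(corner_values[1 << i] - corner_values[0]) / sizes[i] for i in range(d)]

# Toy example: 2-d state, dynamics f(x, u) = (x_2, u), zero current reinforcement.
f = lambda x, u: (x[1], u)
r = lambda x, u: 0.0
controls = (-1.0, +1.0)

corner_values = [0.0, 0.1, 0.3, 0.5]          # V at xi_0, xi_1, xi_2, xi_3
grad = approx_gradient(corner_values, sizes=(0.25, 0.25))
u_from_gradient = greedy_from_gradient(grad, (0.5, 0.2), controls, f, r)
u_from_mdp_policy = -1.0                      # stand-in for the policy of rule (6)
print("split candidate" if u_from_gradient != u_from_mdp_policy else "consistent")
```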

7.2. Combination of several criteria

We can combine the policy disagreement criterion with the corner-value difference or value non-linearity criterion in order to take advantage of both methods: a good approximation of the value function on the whole state space, and an increase of the resolution around the areas of change in the optimal control. We can combine the previous criteria in several ways, for example by a weighted sum of the respective criteria, by a logical operation (split if an and/or combination of these criteria is satisfied), by an ordering of the criteria (first split with one criterion, then use another one), etc.

Figure 12 shows the discretization obtained by alternately, between iterations, using the value non-linearity criterion (to obtain a good approximation of the value function) and the policy disagreement criterion (to increase the accuracy around the areas of change in the optimal control).

Figure 11. The discretization of the state space using the policy disagreement criterion. Here we used an initial grid of 33×33 and a splitting rate of 20%.

Figure 12. The discretization of the state space for the "Car on the Hill" problem using the combination of the value non-linearity and the policy disagreement criteria.

7.3. Comparison of the performance

In order to compare the respective performances of the discretizations, we ran a set (here 256) of optimal trajectories (using the feed-back control law (4)) starting from initial states regularly situated in the state space. The performance of a discretization is the sum of the cumulative reinforcement (the gain defined by equation (2)) obtained by these trajectories, over the set of start positions.

Figure 13 shows the respective performances of several splitting criteria as a function of the number of states.

In this 2-dimensional control problem, the variable resolution approach performs much better (except for the policy disagreement criterion alone) than the uniform grids. However, as we will see later, for higher dimensional problems, the resources allocated to approximating the discontinuities of the VF in areas not useful for improving the optimal control might be prohibitively high.

Can we do better?

So far, we have only considered local splitting criteria, in the sense that we decide whether or not to split a rectangle according to information (value function and policy) relative to the rectangle itself. However, the effect of the splitting is not local: it has an influence on the whole state space.

We will thus try to see if it is possible to find a better refinement process, one that splits an area if and only if it is useful for improving the performance. The sections that follow present two notions which will be useful for defining such a global splitting criterion: the influence, which measures the extent to which local changes in some state affect the global VF, and the variance, which measures how accurate the current approximated VF is.

Figure 13. The performance of the uniform versus variable resolution grids for several splitting criteria. Both the corner-value difference and value non-linearity splitting processes perform better than the uniform grids. The policy disagreement splitting is very good for a small number of states but does not improve afterwards, and thus leads to sub-optimal performance. The policy disagreement combined with the value non-linearity gives the best performance.

8. Notion of influence

Let us consider the Markov chain resulting from the choice of the control given by the optimal policy u* = π*(ξ) of the MDP. For convenience, let us denote R(ξ) = R_MDP(ξ, π*(ξ)), p(ξi|ξ) = p(ξi|ξ, π*(ξ)), and τ(ξ) = τ(ξ, π*(ξ)).

8.1. Intuitive idea

The intuitive idea of the influence I(ξi|ξ) of a state ξi on another state ξ is to give a measure of the extent to which the VF of state ξi "contributes" to the VF of state ξ, i.e. the change in the VF at ξ resulting from a modification of the VF at ξi.

But the VF is the solution of a Bellman equation, which depends on the structure of the Markov chain and the reinforcement values, thus we cannot modify the value at some state ξi without violating the Bellman equation.

However, we notice that the value function at state ξ is affected linearly by the reinforcement obtained at state ξi, thus we can compute the "contribution" of state ξi to state ξ by estimating the change in the VF at ξ resulting from a modification of the reinforcement R(ξi).


8.2. Definition of the influence

Let us define the discounted cumulative k-chained probabilities pk(ξi|ξ), which represent the sum of the discounted transition probabilities of all sequences of k states from ξ to ξi:

p0(ξi|ξ) = 1 if ξ = ξi, and 0 if ξ ≠ ξi
p1(ξi|ξ) = γ^{τ(ξ)} p(ξi|ξ)
p2(ξi|ξ) = Σ_{ξj∈Ξ} p1(ξi|ξj) p1(ξj|ξ)
...
pk(ξi|ξ) = Σ_{ξj∈Ξ} p1(ξi|ξj) p_{k-1}(ξj|ξ)    (7)

Definition 1. Let ξ ∈ Ξ. We define the influence of a state ξi on the state ξ as the quantity I(ξi|ξ) = Σ_{k=0}^{∞} pk(ξi|ξ). Let Ω be a subset of Ξ. We define the influence of a state ξi on the subset Ω as the quantity I(ξi|Ω) = Σ_{ξ∈Ω} I(ξi|ξ).

We call influencers of a state ξ (respectively of a subset Ω) the set of states ξi that have a non-zero influence on ξ (resp. on Ω) (note that, by definition, all influences are non-negative).

8.3. Some properties of the influence

First, we notice that if the times τ(ξ) are > 0, then the influence is well defined and is bounded by I(ξi|ξ) ≤ 1/(1 - γ^{τ_min}), with τ_min = min_ξ τ(ξ). Indeed, from the definition of the discounted chained probabilities, we have pk(ξi|ξ) ≤ γ^{k τ_min}, thus: I(ξi|ξ) ≤ Σ_{k=0}^{∞} γ^{k τ_min} = 1/(1 - γ^{τ_min}).

Moreover, we can relate the definition to the intuitive idea previously stated, and the following properties hold:

• The influence I(ξi|ξ) is the partial derivative of V(ξ) with respect to R(ξi): I(ξi|ξ) = ∂V(ξ)/∂R(ξi).

• For any states ξ and ξi, we have:

I(ξi|ξ) = Σ_{ξj} p1(ξi|ξj) I(ξj|ξ) + { 1 if ξi = ξ ; 0 if ξi ≠ ξ }    (8)

Proof: The Bellman equation is V(ξ) = R(ξ) + Σ_{ξi} p1(ξi|ξ) V(ξi). By applying the Bellman equation to V(ξi), we have:

V(ξ) = R(ξ) + Σ_{ξi} p1(ξi|ξ) [ R(ξi) + Σ_{ξj} p1(ξj|ξi) V(ξj) ]

From the definition of p2, we can rewrite this as:

V(ξ) = R(ξ) + Σ_{ξi} p1(ξi|ξ) R(ξi) + Σ_{ξi} p2(ξi|ξ) V(ξi)

Again we can apply the Bellman equation to V(ξi) and finally deduce that:

V(ξ) = Σ_{k=0}^{∞} Σ_{ξi} pk(ξi|ξ) R(ξi)

Thus the contribution of R(ξi) to V(ξ) is: ∂V(ξ)/∂R(ξi) = Σ_{k=0}^{∞} pk(ξi|ξ) = I(ξi|ξ).


Property (8) is easily deduced from the very definition of the influence and the chained probability property (7), since for all ξ:

I(ξi|ξ) = Σ_{k=0}^{∞} pk(ξi|ξ) = Σ_{k=0}^{∞} p_{k+1}(ξi|ξ) + p0(ξi|ξ)
        = Σ_{k=0}^{∞} Σ_{ξj} p1(ξi|ξj) pk(ξj|ξ) + p0(ξi|ξ)
        = Σ_{ξj} p1(ξi|ξj) I(ξj|ξ) + { 1 if ξi = ξ ; 0 if ξi ≠ ξ }

8.4. Computation of the influence

Equation (8) is not a Bellman equation, since the sum of the probabilities Σ_{ξj} p1(ξi|ξj) may be greater than 1, so we cannot deduce that the successive iterations

I_{n+1}(ξi|ξ) = Σ_{ξj} p1(ξi|ξj) I_n(ξj|ξ) + { 1 if ξi = ξ ; 0 if ξi ≠ ξ }    (9)

converge to the influence by using the classical contraction property in max-norm (Puterman, 1994). However, we have the following property:

Σ_{ξi} I(ξi|ξ) = Σ_{ξi} Σ_{ξj} p1(ξi|ξj) I(ξj|ξ) + 1 = Σ_{ξj} γ^{τ(ξj)} I(ξj|ξ) + 1

Thus, by denoting I(·|ξ) the vector whose components are the I(ξi|ξ), and by introducing the 1-norm ||I(·|ξ)||_1 = Σ_i |I(ξi|ξ)|, we deduce that:

||I_{n+1}(·|ξ) - I(·|ξ)||_1 ≤ γ^{τ_min} ||I_n(·|ξ) - I(·|ξ)||_1

and we have the contraction property in the 1-norm, which ensures convergence of the iterates I_n(ξi|ξ) to the unique solution (the fixed point) I(ξi|ξ) of (8).
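A minimal sketch of this computation (a small randomly generated Markov chain stands in for the chain of section 8; the numbers are made up): iterate (9) until the 1-norm contraction has converged.

```python
import random

random.seed(1)
N, GAMMA = 10, 0.9
tau = [random.uniform(0.5, 1.5) for _ in range(N)]      # holding times tau(xi_j)

def random_row():
    w = [random.random() for _ in range(N)]
    return [x / sum(w) for x in w]

p = [random_row() for _ in range(N)]                     # p[j][i] = p(xi_i | xi_j)
# Discounted one-step probabilities p1(xi_i | xi_j) = gamma^tau(xi_j) p(xi_i | xi_j).
p1 = [[GAMMA ** tau[j] * p[j][i] for i in range(N)] for j in range(N)]

def influence(target, n_iter=300):
    """I(xi_i | xi) for the fixed state xi = target, for every xi_i (iteration (9))."""
    I = [0.0] * N
    for _ in range(n_iter):
        I = [sum(p1[j][i] * I[j] for j in range(N)) + (1.0 if i == target else 0.0)
             for i in range(N)]
    return I

# I(xi_i|xi) approximates dV(xi)/dR(xi_i): the sensitivity of V at `target` to
# the reinforcement received at each state.
print([round(x, 3) for x in influence(target=0)])
```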

8.5. Illustration on the "Car on the Hill" problem

For any subset of states, we can compute its influencers. As an example, Figure 14 shows the influencers of 3 points.

Figure 14. Influencers of 3 points (the crosses). The darker the gray level, the more important the influence. We notice that the influence of a state "follows" the direction of the optimal trajectory starting from that state (see Figure 5), through some kind of "diffusion process".


Let us define the subset Σ_d to be the states of policy disagreement (in the sense of section 7.1). Figure 15(a) shows Σ_d for a regular grid of 129×129. The influencers of Σ_d are computed and plotted in Figure 15(b). The darkest zones in Figure 15(b) are the places whose splitting will most affect the value function at the places (illustrated in Figure 15(a)) of change in the optimal control.

Figure 15. The set of states of policy disagreement (a) and its influencers (b).

Now we would like to define the areas whose refinement could lead to the highest quantitative change in the value function. This is closely related to the quality of the approximation of the value function for a given discretization. Indeed, the better the approximation, the smaller the change in the value function that may result from a splitting. In the following section, we introduce the variance of the Markov chain in order to estimate the quality of approximation of the VF for the current discretization, thus defining the areas whose splitting may lead to the highest change in the VF.

9. Variance of a Markov chain

Using the notation of the previous section, the Bellman equation states that V(ξ) - R(ξ) = γ^{τ(ξ)} Σ_{ξi∈Ξ} p(ξi|ξ) V(ξi); thus V - R at ξ is a discounted average of the next values V(ξi), weighted by the probabilities of transition p(ξi|ξ). We are interested in computing the variance of the next values in order to get an estimate of the range of the values averaged by V(ξ). The first idea is to compute the one-step-ahead variance e(ξ) of V - R:

e(ξ) = Σ_{ξi∈Ξ} p(ξi|ξ) [ γ^{τ(ξ)} V(ξi) - V(ξ) + R(ξ) ]²    (10)

However, the values V(ξi) themselves average some successive values V(ξj), so we would like the variance to also take into account this second-step-ahead average, as well as all the following ones. The next sections define the value function as an averager of the reinforcements and present the definition of the variance of a Markov chain.


9.1. The value function as an averager of the reinforcements

Let us denote by sk(ξ) a sequence of (k + 1) states (ξ0 = ξ, ξ1, ξ2, ..., ξk) whose first one is ξ. Let Sk(ξ) be the set of all possible sequences sk(ξ). Let p(sk(ξ)) = Π_{i=1}^{k} p(ξi|ξ_{i-1}) be the product of the probabilities of transition of the successive states in a sequence, and for i ≤ k, let τi(sk(ξ)) = Σ_{j=0}^{i-1} τ(ξj) be the cumulated time of the first i states of the sequence (with, by definition, τ0(sk(ξ)) = 0).

We have the following property: Σ_{sk(ξ)∈Sk(ξ)} p(sk(ξ)) = 1. We can prove that the value function satisfies the following equation (similar to the Bellman equation but for k steps ahead): for any k,

V(ξ) = Σ_{sk(ξ)∈Sk(ξ)} p(sk(ξ)) [ γ^{τk(sk(ξ))} V(ξk) + Σ_{i=0}^{k-1} γ^{τi(sk(ξ))} R(ξi) ]

Let us denote by s∞(ξ) an infinite sequence of states starting with ξ, and by S∞(ξ) the set of all possible such sequences. Define p(s∞(ξ)) = lim_{k→∞} p(sk(ξ)), and τi(s∞(ξ)) as previously for any i ≥ 0. Then we still have the property Σ_{s∞(ξ)∈S∞(ξ)} p(s∞(ξ)) = 1, and the value function satisfies:

V(ξ) = Σ_{s∞(ξ)∈S∞(ξ)} p(s∞(ξ)) [ Σ_{i=0}^{∞} γ^{τi(s∞(ξ))} R(ξi) ]    (11)

9.2. Definition of the Variance of a Markov chain

Intuitively, the variance of a state ξ is a measure of how dissimilar the cumulative future rewards obtained along all possible trajectories starting from ξ are. More precisely, we define it as the variance of the quantities averaged in equation (11):

Definition 2. Let ξ ∈ Ξ. We define the variance σ² of the Markov chain at ξ:

σ²(ξ) = Σ_{s∞(ξ)∈S∞(ξ)} p(s∞(ξ)) [ Σ_{i=0}^{∞} γ^{τi(s∞(ξ))} R(ξi) - V(ξ) ]²

Let us prove that the variance satisfies a Bellman equation. By selecting out the i = 0 case in the summation, we have:

σ²(ξ) = Σ_{s∞(ξ)∈S∞(ξ)} p(s∞(ξ)) [ Σ_{i=1}^{∞} γ^{τi(s∞(ξ))} R(ξi) - (V(ξ) - R(ξ)) ]²

and from (11), we deduce that:

σ²(ξ) = Σ_{s∞(ξ)} p(s∞(ξ)) { [ Σ_{i=1}^{∞} γ^{τi(s∞(ξ))} R(ξi) ]² - [ γ^{τ(ξ)} V(ξ1) ]² + [ γ^{τ(ξ)} V(ξ1) ]² - [ V(ξ) - R(ξ) ]² }

By successively applying (11) to ξ1 and by selecting out the state ξ1 in the sequence s∞, we obtain:

σ²(ξ) = Σ_{s∞(ξ)} p(s∞(ξ)) { [ Σ_{i=1}^{∞} γ^{τi(s∞(ξ))} R(ξi) - γ^{τ(ξ)} V(ξ1) ]² + [ γ^{τ(ξ)} V(ξ1) - V(ξ) + R(ξ) ]² }

σ²(ξ) = Σ_{ξ1} p(ξ1|ξ) { γ^{2τ(ξ)} Σ_{s∞(ξ1)} p(s∞(ξ1)) [ Σ_{i=0}^{∞} γ^{τi(s∞(ξ1))} R(ξi) - V(ξ1) ]² + [ γ^{τ(ξ)} V(ξ1) - V(ξ) + R(ξ) ]² }


Thus:

σ²(ξ) = e(ξ) + γ^{2τ(ξ)} Σ_{ξi} p(ξi|ξ) σ²(ξi)

with e(ξ) satisfying (10). Thus the variance σ²(ξ) is the sum of the immediate contribution e(ξ), which takes into account the variation in the values of the immediate successors ξi, and the discounted average of the variances σ²(ξi) of these successors. This is a Bellman equation and it can be solved by value iteration.
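A minimal sketch of this computation on a toy chain (the random transition probabilities, holding times and reinforcements are stand-ins, not the paper's benchmarks): first solve for V, then evaluate e(ξ) from (10), then iterate the Bellman equation for σ².

```python
import random

random.seed(2)
N, GAMMA = 10, 0.9
tau = [random.uniform(0.5, 1.5) for _ in range(N)]
R = [random.uniform(-1.0, 1.0) for _ in range(N)]

def random_row():
    w = [random.random() for _ in range(N)]
    return [x / sum(w) for x in w]

p = [random_row() for _ in range(N)]          # p[s][i] = p(xi_i | xi_s)
disc = [GAMMA ** tau[s] for s in range(N)]    # gamma^tau(xi_s)

# Value function of the chain: V(xi) = R(xi) + gamma^tau(xi) sum_i p(xi_i|xi) V(xi_i).
V = [0.0] * N
for _ in range(300):
    V = [R[s] + disc[s] * sum(p[s][i] * V[i] for i in range(N)) for s in range(N)]

# One-step-ahead term e(xi) of equation (10).
e = [sum(p[s][i] * (disc[s] * V[i] - V[s] + R[s]) ** 2 for i in range(N))
     for s in range(N)]

# Variance: sigma2(xi) = e(xi) + gamma^(2 tau(xi)) sum_i p(xi_i|xi) sigma2(xi_i).
sigma2 = [0.0] * N
for _ in range(300):
    sigma2 = [e[s] + disc[s] ** 2 * sum(p[s][i] * sigma2[i] for i in range(N))
              for s in range(N)]

print([round(x, 3) for x in sigma2])
```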

Remark. We can give a geometric interpretation of the term e(ξ), related to the gradient of the value function at the iterated point η = η(ξ, u*) (see Figure 3) and to the barycentric coordinates λ_{ξi}(η). Indeed, from the definition of the discretized MDP (section 3.2), we have V(ξ) = R(ξ) + γ^{τ(ξ)} V(η), and from the piecewise linearity of the approximated functions we have V(ξi) = V(η) + DV(η)·(ξi - η); thus:

e(ξ) = Σ_{ξi} λ_{ξi}(η) γ^{2τ(ξ)} [ DV(η)·(ξi - η) ]²

which can be expressed as:

e(ξ) = γ^{2τ(ξ)} DV(η)^T Q(η) DV(η)

with the matrix Q(η) defined by its elements q_{jk}(η) = Σ_{ξi} λ_{ξi}(η) (ξi - η)_j (ξi - η)_k.

Thus, e(ξ) is close to 0 in two specific cases: either the gradient at the iterated point η is very low (i.e. the values are almost constant), or η is very close to one vertex ξi (then the barycentric coordinate λ_{ξi} is close to 1 and the λ_{ξj} (for j ≠ i) are close to 0, thus Q(η) is low). In both of these cases, e(ξ) is low, which implies that the iteration of ξ does not lead to a degradation of the quality of approximation of the value function (the variance does not increase).

9.3. Example: variance of the "Car on the Hill"

Figure 16 shows the standard deviation σ(ξ) for the Car-on-the-Hill problem, for a uniform grid (of 257 by 257).

Figure 16. The standard deviation σ for the "Car on the Hill". The standard deviation is very high around frontier 1: indeed, a discontinuity is impossible to approximate perfectly by discretization techniques, whatever the resolution. We can observe this fact in Figure 10, where the maximal error of approximation is equal to half the step of the discontinuity. However, the higher the resolution, the lower the integral of the error (compare Figure 10(a)&(c) versus (b)&(d)). There is a noticeable positive standard deviation around frontier 3 and the upper part of frontier 2, because the value function is an average of different values of the discounted terminal reinforcement.

A refinement of the resolution in the areas where the standard deviation is low has no chance of producing an important change in the value function. Thus it appears that the areas where a splitting might most affect the approximation of the value function are the rectangles of largest area whose corners have the highest standard deviations.


10. A global splitting criterion

Now we are going to combine the notions of influence and variance in order to define a new, non-local splitting criterion. We have seen that:

• The states ξ of highest standard deviation σ(ξ) are the states of lowest quality of approximation of the VF, thus the states that could improve their approximation accuracy the most when split (Figure 17(a)).

• The states ξ of highest influence on the set Σ_d of states of policy disagreement (Figure 15(b)) are the states whose value function affects the area where there is a change in the optimal control.

Thus, in order to improve the precision of approximation at the most relevant areas of the state space, we split the states ξ of highest standard deviation that have an influence on the areas of change in the optimal control, according to the Stdev_Inf criterion (see Figure 17):

Stdev_Inf(ξ) = σ(ξ) · I(ξ|Σ_d).

Figure 18 shows the discretization obtained by using this splitting criterion.
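Schematically (with random stand-in values for σ and for the influence on the disagreement set, just to show the plumbing), the criterion and the selection of the splitting candidates look like this:

```python
import random

random.seed(3)
N = 20
sigma = [abs(random.gauss(0.0, 1.0)) for _ in range(N)]          # sqrt of the variance (section 9)
influence_on_disagreement = [random.random() for _ in range(N)]  # I(xi | Sigma_d) (section 8)

# Stdev_Inf(xi) = sigma(xi) * I(xi | Sigma_d), scored per state.
stdev_inf = [s * i for s, i in zip(sigma, influence_on_disagreement)]

f = 0.2                                                          # splitting rate
k = max(1, int(f * N))
to_split = sorted(range(N), key=lambda s: stdev_inf[s], reverse=True)[:k]
print(to_split)                                                  # cells selected for splitting
```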

Figure 17. (a) The standard deviation σ(ξ) for the "Car on the Hill" (equivalent to Figure 16) and (b) the Stdev_Inf criterion, the product of σ(ξ) and the influence I(ξ|Σ_d) (Figure 15(b)).

Figure 18. The discretization resulting from the Stdev_Inf split criterion. We observe that the upper part of frontier 1 is well refined. This refinement occurred not because we split according to the value function (as with the corner-value difference or value non-linearity criteria), but because the splitting there is necessary to have a good approximation of the value function around the bottom part of frontier 2 (and even the upper part of frontier 2), where there is a change in the optimal control. The fact that the Stdev_Inf criterion does not split the areas where the VF is discontinuous, unless some refinement is necessary to get a better approximation of the optimal control, is very important since, as we will see in the simulations that follow, in higher dimensions the cost of an accurate approximation of the VF is too high.


Remark. The performances of this criterion for the "Car on the Hill" problem are similar to those obtained by combining the value non-linearity and the policy disagreement criteria. We did not plot those performances in Figure 13 for clarity and because they do not represent a major improvement. However, the differences in performance between the local criteria and the Stdev_Inf criterion are much more significant for more difficult problems (the Acrobot, the Cart-pole), as illustrated in what follows.

11. Illustration on other control problems

11.1. The Cart-Pole problem

The dynamics of this 4-dimensional physical system (illustrated in Figure 19(a)) are described in (Barto et al., 1983). In our experiments, we chose the parameters as follows: the state space is defined by the position y ∈ [-10, +10], the angle θ ∈ [-π/2, π/2], and the velocities restricted to ẏ ∈ [-4, 4], θ̇ ∈ [-2, 2]. The control consists in applying a force of ±10 Newton. The goal is defined by the area y = 4.3 ± 0.2, θ = 0 ± π/45 (and no limits on ẏ and θ̇). This is a notably narrow goal to try to hit (see the projection of the state space and the goal onto the 2d plane (y, θ) in Figure 19). Notice that our task of "minimum time maneuver to a small goal region" from an arbitrary start state is much harder than merely balancing the pole without falling (Barto et al., 1983). The current reinforcement r is zero everywhere, and the terminal reinforcement R is -1 if the system exits from the state space (|y| > 10 or |θ| > π/2) and +1 if the system reaches the goal.

Figure 19. (a) Description of the Cart-pole (state variables y, ẏ, θ, θ̇; goal: y = 4.3 ± 0.2, θ = 0 ± π/45). (b) The projection of the discretization (onto the plane (θ, y)) obtained by the Stdev_Inf criterion, and some trajectories for several initial points.

Figure 20 shows the performance obtained for the splitting criteria previously defined, on this 4-dimensional control problem. We observe the following points:

• The local criteria do not perform better than the uniform grids. The problem is that the VF is discontinuous in several parts of the state space (areas of high |θ| for which it is too late to rebalance the pole, which is similar to frontier 1 of the "Car on the Hill" problem), and the value-based criteria spend too many resources on approximating these useless areas.

• The Stdev_Inf criterion performs very well. We observe that the trajectories (see Figure 19(b)) are nearly optimal (the angle |θ| is maximized in order to reach the goal as fast as possible, and stays very close to the limit value beyond which it is no longer possible to recover the balance).

Figure 20. Performance on the "Cart-pole".

Figure 21. Performance on the Acrobot.

11.2. The Acrobot

The Acrobot is a 4-dimensional control problem which consists of a two-link arm with one single actuator at the elbow. This actuator exerts a torque between the links (see Figure 22(a)). It has dynamics similar to a gymnast on a high bar, where Link 1 is analogous to the gymnast's hands, arms and torso, Link 2 represents the legs, and the joint between the links is the gymnast's waist (Sutton, 1996). Here, the goal of the controller is to balance the Acrobot at its unstable, inverted vertical position, in minimum time (Boone, 1997). The goal is defined by a very narrow range of ±π/16 on both angles around the vertical position θ1 = π/2, θ2 = 0 (Figure 22(b)), for which the system receives a reinforcement of R = +1. Anywhere else, the reinforcement is zero. The first two dimensions (θ1, θ2) of the state space have the structure of a torus (because of the 2π modulo on the angles), which is implemented in our structure by having the vertices of the first 2 dimensions at angles 0 and 2π point to the same entry for the value function in the interpolated kd-trie.

Figure 21 shows the performance obtained for the splitting criteria previously defined. The respective performances of the different criteria are similar to those for the "Cart-pole" problem above: the local criteria are no better than the uniform grids; the Stdev_Inf criterion performs much better.

Figure 22(b) shows the projection of the discretization obtained by the Stdev_Inf criterion, and one trajectory, onto the 2d plane (θ1, θ2).


Figure 22. (a) Description of the Acrobot physical system (angles θ1 and θ2; goal: θ1 = π/2 ± π/16, θ2 = 0 ± π/16). (b) Projection of the discretization (onto the plane (θ1, θ2)) obtained by the Stdev_Inf criterion, and one trajectory.

Interpretation of the results: As we noticed for the two previous 4d problems, the local splitting criteria fail to improve on the performance of the uniform grids, because they spend too many resources on local considerations (either approximating the value function or the optimal policy). For example, on the "Cart-pole" problem, the value non-linearity criterion will concentrate on approximating the VF mostly in parts of the state space where there is already no chance to rebalance the pole. And the areas around the vertical position (low θ), which are the most important areas, will not be refined in time (however, if we continue the simulations beyond about 90000 states, the local criteria start to perform better than the uniform grids, because these areas eventually get refined).

The Stdev_Inf criterion, which takes global considerations into account for the splitting, performs very well for all the problems described above.

12. Conclusion and Future work

In this paper we proposed a variable resolution discretization approach to solve continuous time and space control problems. We described several local splitting criteria, based on the VF or the policy approximation. We observed that this approach works well for 2d problems like the "Car on the Hill". However, for more complex problems, these local methods fail to perform better than uniform grids.

Local value-based splitting is an efficient, model-based relative of the Q-learning-based tree splitting criteria used, for example, by (Chapman & Kaelbling, 1991; Simons, Van Brussel, De Schutter, & Verhaert, 1982; McCallum, 1995). But it is only when combined with new non-local measures that we are able to get truly effective, near-optimal performance on our control problems. The tree-based state-space partitions in (Moore, 1991; Moore & Atkeson, 1995) were produced by different criteria (of empirical performance), and produced far more parsimonious trees, but no attempt was made to minimize cost: merely to find a valid path.


In order to design a global criterion, we introduced the notion of influence, which estimates the impact of states on others, and of variance of a Markov chain, which measures the quality of the current approximation. By combining these notions, we defined an interesting splitting criterion that gives very good performance (in comparison to the uniform grids) on all the problems studied.

Another extension of these measures could be to learn them through interactions with the environment in order to design efficient exploration policies in reinforcement learning. Our notion of variance could be used with the "Interval Estimation" heuristic (Kaelbling, 1993), to permit "optimism-in-the-face-of-uncertainty" exploration, or with the "back-propagation of exploration bonuses" of (Meuleau & Bourgine, 1999) for exploration in continuous state spaces. Indeed, if we observe that the learned variance of a state ξ is high, then a good exploration strategy could be to inspect the states that have a high expected influence on ξ.

In the future, it seems important to develop the following points:

• A generalization process, in order to also have a "specific towards general" grouping of areas (for example by pruning the tree) that have been over-refined.

• Suppose that we only want to solve the problem for a specific area Ω of initial start states. Then we can restrict our refinement process to the areas used by the trajectories. The notion of influence introduced in this paper can be used for that purpose, by computing the Stdev_Inf criterion with respect to {ξ ∈ Σ_d : I(ξ|Ω) > 0} (the set of states of policy disagreement that have an influence on the area Ω of initial states) instead of Σ_d; this can drastically improve the performance observed when starting from this specific area.

• We would like to deal with the stochastic case. If we assume that we have a model of the noise, then the only change will be in the process of building the MDP (Kushner & Dupuis, 1992; Munos & Bourgine, 1997).

• Release the assumption that we have a model: build an approximation of the dynamics and the reinforcement, and deal with the exploration problem.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Machine Learning: Proceedings of the Twelfth International Conference.

Baird, L. C. (1998). Gradient descent for general reinforcement learning. Neural Information Processing Systems, 11.

Barles, G., & Souganidis, P. (1991). Convergence of approximation schemes for fully nonlinear second order equations. Asymptotic Analysis, 4, 271-283.

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can learn difficult control problems. IEEE Trans. on Systems, Man and Cybernetics, 13(5), 835-846.

Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, pp. 81-138.

Bertsekas, D. P. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall.

Bertsekas, D. P., & Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Boone, G. (1997). Minimum-time control of the acrobot. International Conference on Robotics and Automation.

Boyan, J., & Moore, A. (1995). Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems, 7.

Chapman, D., & Kaelbling, L. P. (1991). Learning from delayed reinforcement in a complex domain. In IJCAI-91.

Crandall, M., Ishii, H., & Lions, P. (1992). User's guide to viscosity solutions of second order partial differential equations. Bulletin of the American Mathematical Society, 27(1).

Crandall, M., & Lions, P. (1983). Viscosity solutions of Hamilton-Jacobi equations. Trans. of the American Mathematical Society, 277.

Davies, S. (1997). Multidimensional triangulation and interpolation for reinforcement learning. In Neural Information Processing Systems 9, 1996. Morgan Kaufmann.

Dupuis, P., & James, M. R. (1998). Rates of convergence for approximation schemes in optimal control. SIAM Journal on Control and Optimization, 36(2).

Fleming, W. H., & Soner, H. M. (1993). Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics. Springer-Verlag.

Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Trans. on Mathematical Software, 3(3), 209-226.

Gordon, G. (1995). Stable function approximation in dynamic programming. International Conference on Machine Learning.

Kaelbling, L. P. (1993). Learning in Embedded Systems. MIT Press, Cambridge, MA.

Knuth, D. E. (1973). Sorting and Searching. Addison Wesley.

Kushner, H. J., & Dupuis, P. (1992). Numerical Methods for Stochastic Control Problems in Continuous Time. Applications of Mathematics. Springer-Verlag.

McCallum, A. (1995). Instance-based utile distinctions for reinforcement learning with hidden state. In Machine Learning (Proceedings of the Twelfth International Conference), San Francisco, CA. Morgan Kaufmann.

Meuleau, N., & Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Journal of Machine Learning.

Moore, A. W. (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. In Birnbaum, L., & Collins, G. (Eds.), Machine Learning: Proceedings of the Eighth International Workshop. Morgan Kaufmann.

Moore, A. W., & Atkeson, C. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13.

Moore, A. W., & Atkeson, C. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state spaces. Machine Learning Journal, 21.

Moore, D. W. (1992). Simplicial Mesh Generation with Applications. Ph.D. thesis, Cornell University.

Munos, R. (1999). A study of reinforcement learning in the continuous case by the means of viscosity solutions. To appear in Machine Learning Journal.

Munos, R., & Bourgine, P. (1997). Reinforcement learning for continuous stochastic control problems. Neural Information Processing Systems.

Munos, R., & Moore, A. (1998). Barycentric interpolators for continuous space and time reinforcement learning. Neural Information Processing Systems.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. A Wiley-Interscience Publication.

Simons, J., Van Brussel, H., De Schutter, J., & Verhaert, J. (1982). A self-learning automaton with variable resolution for high precision assembly by industrial robots. IEEE Trans. on Automatic Control, 27(5), 1109-1113.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, 8.