
Approximate Kalman Filter Q-Learning for Continuous State-Space MDPs

Charles Tripp
Department of Electrical Engineering
Stanford University
Stanford, California, USA
[email protected]

Ross Shachter
Management Science and Engineering Dept.
Stanford University
Stanford, California, USA
[email protected]

Abstract

We seek to learn an effective policy for a Markov Decision Process (MDP) with continuous states via Q-Learning. Given a set of basis functions over state action pairs we search for a corresponding set of linear weights that minimizes the mean Bellman residual. Our algorithm uses a Kalman filter model to estimate those weights, and we have developed a simpler approximate Kalman filter model that outperforms the current state of the art projected TD-Learning methods on several standard benchmark problems.

    1 INTRODUCTION

We seek an effective policy via Q-Learning for a Markov Decision Process (MDP) with a continuous state space. Given a set of basis functions that project the MDP state action pairs onto a basis space, a linear function maps them into Q-values using a vector of basis function weights. The challenge in implementing such a projection-based value model is twofold: to find a set of basis functions that allow a good fit to the problem when using a linear model, and given a set of basis functions, to efficiently estimate a set of basis function weights that yields a high-performance policy. The techniques that we present here address the second part of this challenge. For more background on MDPs, see (Bellman, 1957, Bertsekas, 1982, Howard, 1960) and for Q-Learning see (Watkins, 1989).

In Kalman Filter Q-Learning (KFQL), we use a Kalman filter (Kalman, 1960) to model the weights on the basis functions. The entire distribution over the value for any state action pair is captured in this model, where more credible assessments will yield distributions with smaller variances. Because we have probability distributions over the weights on the basis functions, and over the observations conditioned on those weights, we can compute Bayesian updates to the weight distributions. Unlike the current state of the art method, Projected TD Learning (PTD), Kalman Filter Q-Learning adjusts the weights on basis functions more when they are less well known and less when they are more well known. The more observations are made for a particular basis function, the more confident the model becomes in its assessment of its weight.

We simplify KFQL to obtain Approximate Kalman Filter Q-Learning (AKFQL) by ignoring dependence among basis functions. As a result, each iteration is linear in the number of basis functions rather than quadratic, but it also appears to be significantly more efficient and robust for policy generation than either PTD or KFQL.

In the next section we present KFQL and AKFQL, followed by sections on experimental results, related work, and some conclusions.

    2 KALMAN FILTER Q-LEARNING

In each time period an MDP has a continuous state s and an action a is chosen from a corresponding set of actions A_s. The transition probabilities to a successor state s′ are determined by the state action pair (s, a), and the reward R(s, a, s′) for that period is determined by the two states and the action. We seek to learn an optimal policy for choosing actions and the corresponding optimal net present value of the future rewards given that we start with state action pair (s, a), the Q-value Q(s, a). To make that analysis tractable it is standard to introduce basis functions φ(s, a).

Given n basis functions φ(s, a) for MDP state action pairs (s, a), we learn an n-vector of weights r for those basis functions that minimizes the mean Bellman residual. Our prior belief is that r is multivariate normal with n-vector mean μ and n × n covariance matrix Σ. We model the Q-value by Q(s, a) = rᵀφ(s, a)

with predicted value μᵀφ(s, a) and variance σ²(s, a). We then treat the Q-Learning sample update ŷ(s, a) as an observation with variance ρ(s, a), which we assume is conditionally independent of the prediction given r. Under these assumptions we can update our beliefs about r using a Kalman filter. This notation is summarized in Table 1.

Table 1: Symbol Definitions

Symbol        Definition
s             current MDP state
s′            successor MDP state
a ∈ A_s       action available in state s
R(s, a, s′)   MDP reward
γ             MDP discount rate
φ(s, a)       basis function values for state action pair (s, a)
r             weights on the basis functions that minimize the mean Bellman residual
μ, Σ          mean vector and covariance matrix for r ∼ N(μ, Σ)
Q(s, a)       Q-value for (s, a)
σ²(s, a)      variance of Q(s, a)
ŷ(s, a)       Q-Learning sample update value for (s, a)
ρ(s, a)       variance of observation ŷ(s, a), also called the sensor noise

Kalman Filter Q-Learning, as discussed here, is implemented using the covariance form of the Kalman filter. In our experiments the covariance form was less prone to numerical stability issues than the precision matrix form. We also believe that dealing with covariances makes the model easier to understand and to assess priors, sensor noise, and other model parameters. While the update equations for the KFQL are standard, the notation is unique to this application. We describe the exact calculations involved in updating the KFQL below.

2.1 THE KALMAN OBSERVATION UPDATE

The generic Kalman filter observation update equations can be computed for any r ∼ N(μ, Σ) with predicted measurement mean μᵀφ and variance φᵀΣφ. The observed measurement ŷ has variance ρ. First we compute the Kalman gain n-vector,

G = \Sigma \phi \left( \phi^T \Sigma \phi + \rho \right)^{-1}.   (1)

The Kalman gain determines the relative impact the update makes on the elements of μ and Σ.

\mu \leftarrow \mu + G (\hat{y} - \mu^T \phi)   (2)

\Sigma \leftarrow (I - G \phi^T) \Sigma   (3)

The posterior values for μ and Σ can then be used as prior values in the next update. For in-depth descriptions of the Kalman Filter see these excellent references: (Koller and Friedman, 2009, Russell and Norvig, 2003, Welch and Bishop, 2001).
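To make Equations 1-3 concrete, here is a minimal NumPy sketch of a single observation update; the function and variable names are ours, not the paper's.

```python
import numpy as np

def kalman_observation_update(mu, Sigma, phi, y_hat, rho):
    """One KFQL observation update (Equations 1-3).

    mu    : (n,) prior mean of the basis-function weights r
    Sigma : (n, n) prior covariance of r
    phi   : (n,) basis vector phi(s, a)
    y_hat : scalar sample update treated as the observation
    rho   : scalar sensor noise (observation variance)
    """
    pred_mean = mu @ phi                      # predicted measurement mean
    pred_var = phi @ Sigma @ phi              # predicted measurement variance
    gain = (Sigma @ phi) / (pred_var + rho)   # Equation 1
    mu_post = mu + gain * (y_hat - pred_mean)           # Equation 2
    Sigma_post = Sigma - np.outer(gain, phi) @ Sigma    # Equation 3
    return mu_post, Sigma_post
```

Each such update costs O(n²) because of the matrix-vector products with Σ.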

    2.1.1 MDP Updates

Figure 1: The Sample Update. The state transition (s, a) → s′ is sampled, and the sample backup ŷ(s, a) = R(s, a, s′) + γ max(μᵀφ(s′, b), ..., μᵀφ(s′, c)) over the actions b, ..., c available in s′ is treated as an observation of Q(s, a) = rᵀφ(s, a).

To update our distribution for r, we estimate the value of a sampled successor state s′ and use that value in our observation update. Because this method is similar to the Bellman residual minimizing approximation used in Q-Learning, we will refer to it as the sample update. It is shown in Figure 1. To arrive at the update formulation, we start with the Bellman equation:

Q(s, a) \approx R(s, a, s') + \gamma \max_{a' \in A_{s'}} Q(s', a')   (4)

and then we apply the Q-value approximation Q(s, a) = rᵀφ(s, a), to get:

Q(s, a) = r^T \phi(s, a)   (5)

\approx R(s, a, s') + \gamma \, E\left[ \max_{a' \in A_{s'}} r^T \phi(s', a') \right]   (6)

\approx R(s, a, s') + \gamma \max_{a' \in A_{s'}} \mu^T \phi(s', a')   (7)

= \hat{y}(s, a)   (8)

There are several approximations implicit in this update in addition to the Bellman equation. First, we assume that the Q-value Q(s, a) can be represented by rᵀφ(s, a). Second, we interchange the expectation and maximization operations to simplify the computation.

Finally, we do not account for any correlation between the measurement prediction and the sample update, both of which are noisy functions of the weights on the basis functions r. Instead, we estimate the prediction error σ²(s, a) = φ(s, a)ᵀΣφ(s, a) separately from the sensor noise ρ(s, a) as explained below.
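Read operationally, Equations 4-8 say that the sample update is a one-step lookahead using the current mean weights. A small sketch follows, with hypothetical helper names (phi_fn, actions) that the paper does not define.

```python
def sample_update(mu, phi_fn, s_next, actions, reward, gamma):
    """Sample update y_hat(s, a) of Equations 4-8 (a sketch).

    mu      : (n,) current mean of the basis-function weights (NumPy array)
    phi_fn  : callable (state, action) -> (n,) basis vector phi
    s_next  : sampled successor state s'
    actions : actions available in s'
    reward  : observed reward R(s, a, s')
    gamma   : MDP discount rate
    """
    # Greedy one-step lookahead with the mean weights (Equation 7)
    best_next_value = max(mu @ phi_fn(s_next, a) for a in actions)
    return reward + gamma * best_next_value
```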

    2.1.2 Computing the Sensor Noise

The sensor noise, ρ(s, a), plays a role similar to that of the learning rate in TD Learning. The larger the sensor noise, the smaller the effect of each update on μ. The sensor noise has three sources of uncertainty. First, the sensor suffers from inaccuracies inherent in the model because it is unlikely that the Q-values for all state action pairs can be fit exactly for any value of r. Because this source of noise is due to the model used, we will refer to it as model bias. Second, there is sensor noise due to the random nature of the state transition because the value of the sample update depends on which new state s′ is sampled. We will refer to this source of noise as sampling noise. Both model bias and sampling noise are typically modeled as constant throughout the process, and ρ₀ should be chosen to capture that combination. Finally, there is sensor noise from the assessments used to compute ŷ because our assessment of Q(s′, a′) has variance σ²(s′, a′) = φ(s′, a′)ᵀΣφ(s′, a′). We will refer to this variance as assessment noise. Assessment noise will generally decrease as the algorithm makes more observations. All together, the sample update is treated as a noisy observation of Q(s, a) with sensor noise ρ(s, a).

We have tried several heuristic methods for setting the sensor noise and found them effective. These include using the variance of the highest valued alternative (Equation 9), using the average of the variances of each alternative (Equation 10), using the largest variance (Equation 11), and using a Boltzmann-weighted average of the variances (Equation 12). Although each of these techniques appears to have similar performance, as is shown in the next section, the average and policy methods performed somewhat better in our experiments.

The policy method:

\rho(s, a) = \rho_0 + \gamma^2 \, \sigma^2\!\left(s', \arg\max_{a' \in A_{s'}} \mu^T \phi(s', a')\right)   (9)

The average method:

\rho(s, a) = \rho_0 + \gamma^2 \sum_{a' \in A_{s'}} \frac{\sigma^2(s', a')}{|A_{s'}|}   (10)

The max method:

\rho(s, a) = \rho_0 + \gamma^2 \max_{a' \in A_{s'}} \sigma^2(s', a')   (11)

The Boltzmann method:

\rho(s, a) = \rho_0 + \gamma^2 \, \frac{\sum_{a' \in A_{s'}} \sigma^2(s', a') \, e^{\beta \mu^T \phi(s', a')}}{\sum_{a' \in A_{s'}} e^{\beta \mu^T \phi(s', a')}}   (12)

Regardless of which heuristic method is used, the action selection can be made independently. For example, in our experiments in the next section we use Boltzmann action selection.
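The four heuristics differ only in how the assessment noise is aggregated over the successor state's actions. A sketch, with names of our own choosing (the Boltzmann temperature beta is an assumption):

```python
import numpy as np

def sensor_noise(mu, Sigma, phi_fn, s_next, actions, gamma, rho0,
                 method="average", beta=1.0):
    """Sensor noise heuristics of Equations 9-12 (a sketch).

    rho0 captures the constant model bias and sampling noise; the second
    term is the assessment noise gamma^2 * sigma^2(s', a') aggregated over
    the actions a' available in the successor state s'.
    """
    phis = [phi_fn(s_next, a) for a in actions]
    means = np.array([mu @ p for p in phis])             # mu^T phi(s', a')
    variances = np.array([p @ Sigma @ p for p in phis])  # sigma^2(s', a')

    if method == "policy":        # Equation 9
        assessment = variances[np.argmax(means)]
    elif method == "average":     # Equation 10
        assessment = variances.mean()
    elif method == "max":         # Equation 11
        assessment = variances.max()
    elif method == "boltzmann":   # Equation 12
        w = np.exp(beta * (means - means.max()))  # numerically stable weights
        assessment = (variances * w).sum() / w.sum()
    else:
        raise ValueError(f"unknown method: {method}")
    return rho0 + gamma ** 2 * assessment
```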

Kalman Filter Q-Learning belongs to the family of least squares value models, and therefore each update has complexity O(n²). This has traditionally given projected TD Learning methods an advantage over least squares methods. Although projected TD Learning is generally less efficient per iteration than least squares methods, each update only has complexity O(n). This means that in practice, multiple updates to the projected TD Learning model can be made in the same time as a single update to the Kalman filter model.

2.2 APPROXIMATE KALMAN FILTER Q-LEARNING

We propose Approximate Kalman Filter Q-Learning (AKFQL) which updates the value model with only O(n) complexity, the same complexity update as projected TD Learning. The only change in AKFQL is to simply ignore the off-diagonal elements in the covariance matrix. Only the variances on the diagonal of the covariance matrix are computed and stored. This approximation has linear update complexity, rather than quadratic, in the number of basis functions. Although KFQL outperforms AKFQL early on (on a per iteration basis), somewhat surprisingly, AKFQL overtakes it in the long run. We do not fully understand the mechanism of this improvement, but suspect that KFQL is overfitting.

Approximate Kalman Filter Q-Learning's calculations are simplified versions of KFQL's calculations which involve only the diagonal elements of the covariance matrix Σ. The Kalman gain can be computed by

d = \sum_i \phi_i^2 \Sigma_{ii} + \rho   (13)

G_i = \frac{\Sigma_{ii} \phi_i}{d}.   (14)

The Kalman gain determines the relative impact the update makes on the elements of μ and Σ.

\mu_i \leftarrow \mu_i + G_i (\hat{y} - \mu^T \phi)   (15)

\Sigma_{ii} \leftarrow (1 - G_i \phi_i) \Sigma_{ii}   (16)

The posterior values for μ and the diagonal of Σ can then be used as prior values in the next update. Finally, to compute the sensor noise ρ(s, a) we use σ²(s, a) = \sum_i \phi_i^2(s, a)\, \Sigma_{ii}.
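A sketch of the diagonal update, again with our own naming; it touches each basis function once, so the cost per observation is O(n).

```python
import numpy as np

def akfql_update(mu, diag_Sigma, phi, y_hat, rho):
    """Diagonal (AKFQL) observation update, Equations 13-16 (a sketch).

    mu         : (n,) mean of the basis-function weights
    diag_Sigma : (n,) diagonal of the covariance matrix; off-diagonals ignored
    phi        : (n,) basis vector phi(s, a)
    y_hat      : scalar sample update
    rho        : scalar sensor noise
    """
    d = np.sum(phi ** 2 * diag_Sigma) + rho       # Equation 13
    gain = diag_Sigma * phi / d                   # Equation 14
    mu = mu + gain * (y_hat - mu @ phi)           # Equation 15
    diag_Sigma = (1.0 - gain * phi) * diag_Sigma  # Equation 16
    return mu, diag_Sigma
```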

    2.3 THE KFQL/AKFQL ALGORITHM

1  μ, Σ ← prior distribution over r
2  s ← initial state
3  while policy generation in progress do
4      choose action a ∈ A_s
5      observe transition s to s′
6      update μ and Σ using ŷ(s, a) and ρ(s, a)
7      s ← s′
8      if isTerminal(s) then
9          s ← initial state

Algorithm 1: [Approximate] Kalman Filter Q-Learning

In Kalman Filter Q-Learning (KFQL) and Approximate Kalman Filter Q-Learning (AKFQL) we begin with our prior beliefs about the basis function weights r and we update them on each state transition, as shown in Algorithm 1.
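Putting the pieces together, the policy-generation loop of Algorithm 1 might look like the sketch below, reusing the helper functions sketched earlier in this section. The env interface and the greedy action choice are our own assumptions; the experiments in the next section use Boltzmann action selection instead.

```python
import numpy as np

def run_kfql(env, phi_fn, actions_fn, n, gamma, rho0, prior_var=1.0,
             n_steps=10_000, approximate=True):
    """Sketch of Algorithm 1: [Approximate] Kalman Filter Q-Learning.

    env        : assumed interface with reset() -> state and
                 step(state, action) -> (next_state, reward, terminal)
    phi_fn     : (state, action) -> (n,) basis vector
    actions_fn : state -> list of available actions
    approximate: True runs AKFQL (diagonal covariance), False runs full KFQL
    """
    mu = np.zeros(n)                  # prior mean over r
    Sigma = np.eye(n) * prior_var     # prior covariance over r
    s = env.reset()
    for _ in range(n_steps):
        actions = actions_fn(s)
        a = max(actions, key=lambda b: mu @ phi_fn(s, b))  # greedy, for illustration
        s_next, reward, terminal = env.step(s, a)

        next_actions = actions_fn(s_next)
        y_hat = sample_update(mu, phi_fn, s_next, next_actions, reward, gamma)
        rho = sensor_noise(mu, Sigma, phi_fn, s_next, next_actions, gamma,
                           rho0, method="average")
        phi = phi_fn(s, a)
        if approximate:
            mu, diag = akfql_update(mu, np.diag(Sigma).copy(), phi, y_hat, rho)
            Sigma = np.diag(diag)
        else:
            mu, Sigma = kalman_observation_update(mu, Sigma, phi, y_hat, rho)

        s = env.reset() if terminal else s_next
    return mu, Sigma
```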

    3 EXPERIMENTAL RESULTS

Because our primary goal is policy generation rather than on-line performance, our techniques are not meant to balance exploration with exploitation. Instead of testing them on-line, we ran the algorithms off-line, periodically copied the current policy, and tested that policy. This resulted in a plot of the average control performance of the current policy as a function of the number of states visited by the algorithm. In each of our experiments, we ran each test 32 to 50 times, and plotted the average of the results. At each measurement point of each test run, we averaged over several trials of the current policy.

In our results, we define a state as being visited when the MDP entered that state while being controlled by the algorithm. In the algorithms presented within this paper, each state visited also corresponds to one iteration of the algorithm. The purpose of these measurements is to show how the quality of the generated policy varies as the algorithm is given more time to execute. Thus, the efficiency of various algorithms can be compared across a range of running times.

    3.1 CART-POLE

The Cart-Pole process is a well known problem in the MDP and control system literature (Anderson, 1986, Barto et al., 1983, Michie and Chambers, 1968, Schmidhuber, 1990, Si and Wang, 2001). The Cart-Pole process is an instance of the inverted pendulum problem. In the Cart-Pole problem, the controller attempts to keep a long pole balanced atop a cart. At each time step, the controller applies a lateral force, F, to the cart. The pole responds according to a system of differential equations accounting for the mass of the cart and pole, the acceleration of gravity, the friction at the joint between the cart and pole, the friction between the cart and the track it moves on, and more.

    Figure 2: The Cart-Pole System

In our experiments, we used the corrected dynamics derived and described by Florian in (Florian, 2007).

In our experiments, we used a control interval of .1 seconds, and the values listed in Table 2. At each control interval, the controller chooses a force to apply to the cart in {−5N, 0, 5N}. Then this force, plus an additional disturbance force uniformly distributed in [−2N, 2N], is applied to the cart. The controller receives a reward of 1 at each time step except when a terminal state is reached. The problem is undiscounted with γ = 1. Thus, the number of simulation steps before the pole falls over is equal to the total reward received.

    3.1.1 Basis Functions

For the Cart-Pole process, we used a simple bilinear interpolating kernel. For each action, a grid of points is defined in the state space (θ, ω). Each point corresponds to a basis function. Thus, every state-action pair corresponds to a point within a grid box with a basis function point at each of the four corners.

Table 2: Variable definitions for the Cart-Pole process

VARIABLE         DEFINITION
m_c = 8.0 kg     Mass of the cart
m_p = 2.0 kg     Mass of the pole
l = .5 m         Length of the pole
g = 9.81 m/s²    Acceleration of gravity
μ_c = .001       Coefficient of friction between the cart and the track
μ_p = .002       Coefficient of friction between the pole and the cart
θ                Angle of the pole from vertical (radians)
ω = θ̇            Angular velocity of the pole (rad/s)
θ̈                Angular acceleration of the pole (rad/s²)
F                Force applied to the cart by the controller (Newtons)

All basis function values for a particular state are zero except for these four functions. The value of each of the non-zero basis functions is defined by:

\begin{pmatrix} \bar{\theta}_i \\ \bar{\omega}_i \end{pmatrix} = \begin{pmatrix} 1 - |\theta - \theta_i| / \theta_s \\ 1 - |\omega - \omega_i| / \omega_s \end{pmatrix}   (17)

\phi_i = \bar{\theta}_i \, \bar{\omega}_i   (18)

where θ_s = π/2 and ω_s = 1/4, and the basis function points are defined for each combination of θ_i ∈ {−π, −π/2, 0, π/2, π} and ω_i ∈ {−1/2, −1/4, 0, 1/4, 1/2}. This set of basis functions yields 5 × 5 = 25 basis functions for each of the three possible actions, for a total of 75 basis functions.
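A sketch of this feature map follows. The clamp to zero for non-neighboring grid points is our reading of the bilinear kernel (only the four surrounding corners should be active); the grid values follow the definitions above.

```python
import numpy as np

def cartpole_basis(theta, omega, action_index, n_actions=3):
    """Bilinear interpolating basis for Cart-Pole (a sketch of Section 3.1.1)."""
    theta_grid = np.array([-np.pi, -np.pi / 2, 0.0, np.pi / 2, np.pi])
    omega_grid = np.array([-0.5, -0.25, 0.0, 0.25, 0.5])
    theta_s, omega_s = np.pi / 2, 0.25               # grid spacings

    per_action = theta_grid.size * omega_grid.size   # 25 features per action
    features = np.zeros(n_actions * per_action)      # 75 features in total
    k = action_index * per_action
    for theta_i in theta_grid:
        for omega_i in omega_grid:
            w_theta = max(0.0, 1.0 - abs(theta - theta_i) / theta_s)  # Eq. 17
            w_omega = max(0.0, 1.0 - abs(omega - omega_i) / omega_s)
            features[k] = w_theta * w_omega                           # Eq. 18
            k += 1
    return features
```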

Each test on Cart-Pole consisted of averaging over 50 separate runs of the generation algorithm, evaluating each run at each test point once. Evaluation runs were terminated after 72000 timesteps, corresponding to two hours of simulated balancing time. Therefore, the best possible performance is 72000.

For projected TD-Learning, a learning rate of α_t = 0.5 · 10⁶/(10⁶ + t) was used. This learning rate was determined by testing every combination of s ∈ {.001, .005, .01, .05, .1, .5, 1} and c ∈ {1, 10, 100, 1000, 10000, 100000, 1000000, 10000000} for the learning rate α_n = s · c/(c + n). For KFQL and AKFQL, a sensor variance of ρ₀ = .1 and a prior variance of Σ_{ii} = 10000 were used.

3.2 CASHIER'S NIGHTMARE

The Cashier's Nightmare, also called Klimov's problem, is an extension of the multi-armed bandit problem into a queueing problem with a particular structure. In the Cashier's Nightmare, there are d queues and k jobs. At each time step, each queue, i, incurs a cost proportional to its length, g_i x_i. Thus, the total immediate reward at each time step is −gᵀx_t. Further, at each time step the controller chooses a queue to service. When a queue is serviced, its length is reduced by one, and the job then joins a new queue, j, with probabilities, p_{ij}, based on the queue serviced. Consequently, the state space, x, is defined by the lengths of each queue, with x_i equaling the length of queue i in state x.

The Cashier's Nightmare has a state space of \binom{d+k-1}{k} states: one for each possible distribution of the k jobs in the d queues. The instance of the Cashier's Nightmare that we used has d = 100 queues and k = 200 jobs, yielding a state space of \binom{299}{200} ≈ 1.39 × 10⁸¹ states. This also means that each timestep presents 100 possible actions and, given an action, each state has 100 possible state transitions. In this instance, we set g_i = i/d with i ∈ {1, 2, ..., d} and randomly and uniformly selected probability distributions for p_{ij}.

It is worth noting that although we use this problem as a benchmark, the optimal controller can be derived in closed form (Buyukkoc et al., 1983). However, this is a challenging problem with a large state space, and therefore presents a good benchmark for approximate dynamic programming techniques.

    3.2.1 Basis Functions

The basis functions that we use on Cashier's Nightmare are identical to those used by Choi and Van Roy in (Choi and Van Roy, 2006). One basis function is defined for each queue, and its value is based on the length of the queue and the state transition probabilities:

\phi_i(x, a) = x_i + \gamma \, (1 - \delta_{x_i,0}) (p_{a,i} - \delta_{a,i})   (19)

This way, at least the short term effect of each action is reflected in the basis functions.

For projected TD-Learning, a learning rate of α_t = 0.1 · 10³/(10³ + t) was used. This learning rate was determined by testing every combination of s ∈ {.001, .01, .1, 1} and c ∈ {1, 10, 100, 1000, 10000} for the learning rate α_n = s · c/(c + n). For KFQL and AKFQL, a sensor variance of ρ₀ = 1 and a prior variance of Σ_{ii} = 20 were used.

  • 3.3 CAR-HILL

The Car-Hill control problem, also called the Mountain-Car problem, is a well-known continuous state space control problem in which the controller attempts to move an underpowered car to the top of a mountain by increasing the car's energy over time. There are several variations of the dynamics of the Car-Hill problem; the variation we use here is the same as that found in (Ernst et al., 2005).

A positive reward was given when the car reached the summit of the hill with a low speed, and a negative reward was given when it ventured too far from the summit, or reached too high of a speed:

r(p, v) = \begin{cases} -1 & \text{if } p < -1 \text{ or } |v| > 3 \\ 1 & \text{if } p > 1 \text{ and } |v| \le 3 \\ 0 & \text{otherwise} \end{cases}   (20)
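Written out as code, Equation 20 is simply (p is the car's position and v its velocity):

```python
def car_hill_reward(p, v):
    """Car-Hill reward of Equation 20."""
    if p < -1.0 or abs(v) > 3.0:
        return -1.0
    if p > 1.0 and abs(v) <= 3.0:
        return 1.0
    return 0.0
```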

The implementation we used approximates these differential equations using Euler's method with a timestep of .001s. The control interval used was .1s, and the controller's possible actions were to apply forces of either +4N or −4N. A decay rate of γ = .999 was used.

Each test on Car-Hill consisted of averaging over 32 separate runs of the generation algorithm, evaluating each run at each test point once. Evaluation runs were terminated after 1000 timesteps, ignoring the small remaining discounted reward.

    3.3.1 Basis Functions

We used bilinearly interpolated basis functions with an 8 × 8 grid spread evenly over the p ∈ [−1, 1] and v ∈ [−3, 3] state space for each action. This configuration yields 8 × 8 × 2 = 128 basis functions. The prior we used was set to provide higher prior estimates for states with positions closer to the summit:

\mu_0(p, v) = \max\left( 0, \; 1 - \frac{2(1 - p)}{3} \right)   (21)

For projected TD-Learning, a learning rate of α_t = 0.1 · 10³/(10³ + t) was used. This learning rate was determined by testing every combination of s ∈ {.001, .01, .1, 1} and c ∈ {1, 10, 100, 1000, 10000} for the learning rate α_n = s · c/(c + n). For KFQL and AKFQL, a sensor variance of ρ₀ = .5 and a prior variance of Σ_{ii} = .1 were used.

    3.4 RESULTS

Kalman Filter Q-Learning has mixed performance relative to a well-tuned projected TD Learning implementation. For the Cart-Pole process (Figure 3), KFQL provides between 100 and 5000 times the efficiency of PTD. However, on the Cashier's Nightmare process (Figure 4) and the Car-Hill process (Figure 5), KFQL was plagued by numerical instability, apparently because KFQL underestimates the variance, leading to slow policy changes. Increasing the constant sensor noise, ρ₀, can reduce this problem somewhat, but may not eliminate the issue.

Figure 3: PTD, KFQL, and AKFQL on Cart-Pole (average control performance versus number of states visited, for PTD, KFQL w/ Max, and AKFQL w/ Max)

Figure 4: PTD, KFQL, and AKFQL on Cashier's Nightmare (average control performance versus number of states visited, for PTD, KFQL w/ Max, and AKFQL w/ Max)

Figure 5: PTD, KFQL, and AKFQL on Car-Hill (average control performance versus number of states visited, for PTD, KFQL w/ Max, and AKFQL w/ Max)

Figure 6: Comparison of sensor noise models when using KFQL on Cart-Pole (average control performance versus number of states visited, for KFQL w/ Max, Average, Policy, and Boltzmann)

Approximate Kalman Filter Q-Learning, on the other hand, provides a very large benefit over PTD on all of the tested applications (Figures 3, 4, and 5). Interestingly, in two of the experiments, AKFQL also outperforms KFQL by a large margin, even though the effort per iteration is much less for AKFQL. This is likely due to the fact that AKFQL ignores any dependencies among basis functions. Instead of solving the full least-squares problem (that a Kalman filter solves), AKFQL is performing an update somewhere between the least-squares problem and the gradient-descent update of TD-Learning. This middle ground provides the benefits of tracking the variances on each basis function's weight, and the freedom from the dependencies and numerical stability issues inherent in KFQL. Additionally, AKFQL's update complexity is linear, just as PTD's is, so it truly outperforms projected TD-Learning in our experiments.

Figure 7: Comparison of sensor noise models when using AKFQL on Cart-Pole (average control performance versus number of states visited, for AKFQL w/ Max, Average, Policy, and Boltzmann)

The choice of signal noise heuristic has a small effect for the Cart-Pole process under either KFQL (Figure 6) or AKFQL (Figure 7). The average and policy methods appear to work better than the max and Boltzmann methods. The other figures in this section were generated using the max method.

    4 RELATED WORK

Kalman Filter Q-Learning uses a projected value model, where a set of basis functions φ(s, a) project the MDP state action space onto the basis space. In these methods, basis function weights r are used to estimate Q-values, Q(s, a) ≈ rᵀφ(s, a), in order to develop high performance policies.

    4.1 PROJECTED TD LEARNING

Projected TD Learning (Roy, 1998, Sutton, 1988, Tsitsiklis and Roy, 1996) is a popular projection-based method for approximate dynamic programming. The version of TD Learning that we consider here is the off-policy Q-learning variant with λ = 0. Thus, our objective is to find an r such that Q(s, a) ≈ rᵀφ(s, a) because we are primarily concerned with finding efficient off-line methods for approximate dynamic programming.

    The projected TD Learning update equation is:

r_{t+1} = r_t + \alpha_t \phi(s_t, a_t) \left[ (F r_t)(s_t, a_t) + w_{t+1} - (\Phi r_t)(s_t, a_t) \right]   (22)

where F is a weighting matrix and Φ is a matrix of basis function values for multiple states. We use stochastic samples to approximate (F r_t)(s_t, a_t). The standard Q-Learning variant that we consider here sets:

(F r_t)(s_t, a_t) + w_{t+1} = R(s_t, a_t, s_{t+1}) + \gamma \max_{a \in A_{s_{t+1}}} r_t^T \phi(s_{t+1}, a)   (23)

where the state transition (s_t, a_t) → s_{t+1} is sampled according to the state transition probabilities, P(s_t → s_{t+1} | s_t, a_t), of the MDP. Substituting Equation 23 into Equation 22, we arrive at the update equation for this variant of projected TD Learning:

r_{t+1} = r_t + \alpha_t \phi(s_t, a_t) \left[ R(s_t, a_t, s_{t+1}) + \gamma \max_{a \in A_{s_{t+1}}} r_t^T \phi(s_{t+1}, a) - (\Phi r_t)(s_t, a_t) \right]   (24)

The convergence properties of this and other versions of TD learning have been studied in detail (Dayan, 1992, Forster and Warmuth, 2003, Pineda, 1997, Roy, 1998, Schapire and Warmuth, 1996, Sutton, 1988, Tsitsiklis and Roy, 1996). Unlike our Kalman filter methods, PTD has no sense of how well it knows a particular basis function's weight, and adjusts the weights on all basis functions at the global learning rate. As with the tabular value model version of TD learning, the choice of learning rates may greatly impact the performance of the algorithm.
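For comparison with the Kalman filter updates above, Equation 24 can be sketched as a single weight update; the helper names are ours and alpha is the learning rate α_t.

```python
def ptd_q_update(r, phi_fn, s, a, reward, s_next, next_actions, gamma, alpha):
    """One projected TD(0) Q-Learning update (Equation 24); a sketch."""
    phi_sa = phi_fn(s, a)
    # Sampled Bellman target (Equation 23)
    target = reward + gamma * max(r @ phi_fn(s_next, b) for b in next_actions)
    # Move the weights along phi by the temporal-difference error
    return r + alpha * phi_sa * (target - r @ phi_sa)
```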

4.2 LEAST SQUARES POLICY ITERATION

Least squares policy iteration (LSPI) (Lagoudakis et al., 2003) is an approximate policy iteration method which at each iteration samples from the process using the current policy, and then solves the least-squares problem:

minimize ‖A_t w − b_t‖   (25)

where A and b are:

A = \frac{1}{L} \sum_{i=1..L} \phi(s_i, a_i) \left( \phi(s_i, a_i) - \gamma\, \phi(s'_i, \pi(s'_i)) \right)^T \approx \Phi^T (\Phi - \gamma P \Pi \Phi)   (26)

b_{t+1} = b_t + \phi(s_t, a_t)\, r_t \approx \Phi^T R   (27)

where P, R, Π, and Φ are matrices of probabilities, rewards, policies, and basis function values, respectively.

Solving this least squares problem is equivalent to using KFQL with a fixed-point update method, an infinite prior variance, and infinite dynamic noise. LSPI is effective for solving many MDPs, but it can be prone to policy oscillations. This is due to two factors. First, the policy biases A and b, which can result in a substantially different policy at the next iteration and could lead to a ping-ponging effect between two or more policies. Second, information about A and b is discarded between iterations, providing no dampening to the oscillations and jitter caused by simulation noise and the sampling bias induced by the policy.
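As a point of reference, one LSPI policy-evaluation step (LSTDQ) in its standard form looks roughly like the following; this is our sketch of the Lagoudakis et al. (2003) procedure, not the notation used above.

```python
import numpy as np

def lstdq(samples, phi_fn, policy, gamma, n):
    """One LSPI policy-evaluation step (a sketch of standard LSTDQ).

    samples : list of (s, a, reward, s_next) transitions collected under
              the current policy
    policy  : state -> action, the policy being evaluated
    """
    A = np.zeros((n, n))
    b = np.zeros(n)
    for s, a, reward, s_next in samples:
        phi = phi_fn(s, a)
        phi_next = phi_fn(s_next, policy(s_next))
        A += np.outer(phi, phi - gamma * phi_next)   # accumulates A (cf. Eq. 26)
        b += phi * reward                            # accumulates b (cf. Eq. 27)
    # Weights minimizing ||A w - b|| (cf. Equation 25)
    return np.linalg.lstsq(A, b, rcond=None)[0]
```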

    4.3 THE FIXED POINT KALMAN FILTER

The Kalman filter has also been adapted for use as a value model in approximate dynamic programming by Choi and Van Roy (Choi and Van Roy, 2006). The fixed point Kalman filter models the same multivariate normal weights r as KFQL but parametrizes them with means r_t and precision matrix H_t^{-1} such that:

r_{t+1} = r_t + \frac{1}{t} H_t \phi(s_t, a_t) \left[ (F r_t)(s_t, a_t) + w_{t+1} - (\Phi r_t)(s_t, a_t) \right]   (28)

where (F r_t)(s_t, a_t) + w_{t+1} is the same as in Equation 23, yielding the update rule:

r_{t+1} = r_t + \frac{1}{t} H_t \phi(s_t, a_t) \left[ R(s_t, a_t, s_{t+1}) + \gamma \max_{a \in A_{s_{t+1}}} r_t^T \phi(s_{t+1}, a) - (\Phi r_t)(s_t, a_t) \right]   (29)

where H_t is defined as:

H_t = \left[ \frac{1}{t} \sum_{i=1}^{t} \phi(s_i, a_i) \phi^T(s_i, a_i) \right]^{-1}   (30)

Non-singularity of H_t can be dealt with using any known method, although using the pseudo-inverse is standard. The choice of an initial r_0, H_0 and the inversion method for H are similar to choosing the prior means and variance over r in the KFQL. The fixed point Kalman filter represents the covariance of r by a precision matrix rather than the covariance matrix used by KFQL. This difference is significant because KFQL performs no matrix inversion and avoids many non-singularity and numerical stability issues.
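A sketch of Equations 29-30, maintaining a running second-moment matrix and inverting it with a pseudo-inverse; the class and member names are our own.

```python
import numpy as np

class FixedPointKalmanFilter:
    """Sketch of the fixed point Kalman filter of Choi and Van Roy (2006)."""

    def __init__(self, n):
        self.r = np.zeros(n)        # weight estimates r_t
        self.M = np.zeros((n, n))   # running sum of phi phi^T
        self.t = 0

    def update(self, phi, phi_next_best, reward, gamma):
        """phi = phi(s_t, a_t); phi_next_best = phi(s_{t+1}, a*) for the
        action a* maximizing r^T phi(s_{t+1}, a)."""
        self.t += 1
        self.M += np.outer(phi, phi)
        H = np.linalg.pinv(self.M / self.t)                 # Equation 30
        target = reward + gamma * self.r @ phi_next_best
        td_error = target - self.r @ phi
        self.r = self.r + (H @ phi) * (td_error / self.t)   # Equation 29
        return self.r
```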

    4.4 APPROXIMATE KALMAN FILTERS

Many methods for approximating and adapting the Kalman filter have been developed for other applications (Chen, 1993). Many of these methods can be used in this application as well. AKFQL is just one simple approximation technique that works well and achieves O(n) instead of O(n²) complexity updates. Other possible approximation techniques include fixed-rank approximation of Σ and particle filtering (Doucet et al., 2001, Gordon et al., 1993, Kitagawa and Gersch, 1996, Rubin, 1987).

    5 CONCLUSIONS

In this paper we have presented two new methods for policy generation in MDPs with continuous state spaces. The Approximate Kalman Filter Q-Learning algorithm provides significant improvement on several benchmark problems over existing methods, such as projected TD-Learning, as well as our other new method, Kalman Filter Q-Learning. Continuous state MDPs are challenging problems that arise in multiple areas of artificial intelligence, and we believe that AKFQL provides a significant improvement over the state of the art. We believe these same benefits apply to other problems where basis functions are useful, such as MDPs with large discrete sample spaces and mixed continuous and discrete state spaces.

There are a variety of ways this work could be extended. The Kalman filter allows for linear dynamics in the distribution of basis function weights r during transitions from one MDP state action pair to another or from one policy to another. Another possibility is to incorporate more of a fixed point approach, recognizing the dependence between the prediction variance σ²(s, a) and the signal noise ρ(s, a) conditioned on r. Still another possibility is to use an approximation, ignoring off-diagonal elements like AKFQL, in the traditional fixed point approach. Finally, we could apply a similar Kalman filter model to perform policy iteration directly rather than by Q-Learning.

    Acknowledgements

We thank the anonymous referees for their helpful suggestions.

    References

Charles W. Anderson. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, University of Massachusetts, Department of Computer and Information Science, Amherst, MA, USA, 1986.

A. G. Barto, R. S. Sutton, and C. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834-846, 1983.

Richard Bellman. A Markovian decision process. Indiana Univ. Math. J., 6:679-684, 1957. ISSN 0022-2518.

Dimitri P. Bertsekas. Distributed dynamic programming. IEEE Transactions on Automatic Control, AC-27:610-616, 1982.

C. Buyukkoc, Pravin Varaiya, and Jean Walrand. Extensions of the multi-armed bandit problem. Technical Report UCB/ERL M83/14, EECS Department, University of California, Berkeley, 1983.

Q. Chen. Approximate Kalman Filtering. Series in Approximations and Decompositions. World Scientific, 1993. ISBN 9789810213596.

David Choi and Benjamin Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems, 16(2):207-239, 2006.

Peter Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8:341-362, 1992.

A. Doucet, N. De Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer, 2001. ISBN 9780387951461.

Damien Ernst, Pierre Geurts, Louis Wehenkel, and L. Littman. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, 2005.

Razvan V. Florian. Correct equations for the dynamics of the cart-pole, 2007.

Jurgen Forster and Manfred K. Warmuth. Relative loss bounds for temporal-difference learning. Machine Learning, 51(1):23-50, 2003.

N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar and Signal Processing, IEE Proceedings F, 140(2):107-113, April 1993. ISSN 0956-375X.

Ronald A. Howard. Dynamic Programming and Markov Processes. The M.I.T. Press, Cambridge, MA, USA, 1960. ISBN 0-262-08009-5.

Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 82(Series D):35-45, 1960.

G. Kitagawa and W. Gersch. Smoothness Priors Analysis of Time Series. Lecture Notes in Statistics. Springer, 1996. ISBN 9780387948195.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models - Principles and Techniques. MIT Press, 2009. ISBN 978-0-262-01319-2.

Michail G. Lagoudakis, Ronald Parr, and L. Bartlett. Least-squares policy iteration. Journal of Machine Learning Research, 4:2003, 2003.

D. Michie and R. A. Chambers. BOXES: An experiment in adaptive control. In E. Dale and D. Michie, editors, Machine Intelligence 2, pages 137-152. Oliver and Boyd, Edinburgh, 1968.

Fernando J. Pineda. Mean-field theory for batched TD(λ). Neural Comput., 9(7):1403-1419, October 1997. ISSN 0899-7667.

Ben Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, MIT, 1998.

Donald B. Rubin. The calculation of posterior distributions by data augmentation: Comment: A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: The SIR algorithm. Journal of the American Statistical Association, 82(398):543-546, 1987. ISSN 01621459.

Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, second edition, 2003. ISBN 978-81-203-2382-7.

Robert E. Schapire and Manfred K. Warmuth. On the worst-case analysis of temporal-difference learning algorithms. Machine Learning, 22(1-3):95-121, 1996.

Jürgen Schmidhuber. Networks adjusting networks. In Proceedings of Distributed Adaptive Neural Information Processing, St. Augustin, pages 24-25. Oldenbourg, 1990.

J. Si and Yu-Tsung Wang. On-line learning control by association and reinforcement. Neural Networks, IEEE Transactions on, 12(2):264-276, March 2001. ISSN 1045-9227. doi: 10.1109/72.914523.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988. ISSN 0885-6125. doi: 10.1007/BF00115009.

John N. Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In NIPS, pages 1075-1081, 1996.

C. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.

Greg Welch and Gary Bishop. An introduction to the Kalman filter. Technical report, SIGGRAPH 2001, 2001. Course Notes.