
Chapter 8: Generalization and Function Approximation

Objectives of this chapter:

- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
- Overview of function approximation (FA) methods and how they can be adapted to RL.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 2

Value Prediction with FA

As usual: policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

In earlier chapters, value functions were stored in lookup tables.

Here, the value function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.

e.g., θ_t could be the vector of connection weights of a neural network.


Adapt Supervised Learning Algorithms

Supervised learning system: inputs → outputs

Training info = desired (target) outputs

Error = (target output − actual output)

Training example = {input, target output}


Backups as Training Examples

e.g., the TD(0) backup:

V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]

As a training example:

input: description of s_t    target output: r_{t+1} + γV(s_{t+1})
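As a concrete illustration (not from the slides), the training pair above can be formed in a few lines of Python. The state names, reward, step size, and value table here are invented for the example; only the target formula r_{t+1} + γV(s_{t+1}) comes from the text:

```python
gamma = 0.9   # discount factor (illustrative value)
alpha = 0.1   # step size (illustrative value)
V = {"s0": 0.0, "s1": 0.5}   # toy value table, for illustration only

def td0_training_example(s_t, r_tp1, s_tp1):
    """Return an (input, target) pair: the input is the state description,
    the target output is r_{t+1} + gamma * V(s_{t+1})."""
    return s_t, r_tp1 + gamma * V[s_tp1]

x, target = td0_training_example("s0", 1.0, "s1")   # target = 1.0 + 0.9*0.5
# A TD(0) update then moves V(s_t) toward that target:
V[x] += alpha * (target - V[x])
```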


Any Function Approximation Method?

In principle, yes:
- artificial neural networks
- decision trees
- multivariate regression methods
- etc.

But RL has some special requirements:
- usually want to learn while interacting
- ability to handle nonstationarity
- other?


Gradient Descent Methods

θ_t = (θ_t(1), θ_t(2), …, θ_t(n))^T    (T denotes transpose)

Assume V_t is a (sufficiently smooth) differentiable function of θ_t, for all s ∈ S.

Assume, for now, training examples of this form:

input: description of s_t    target output: V^π(s_t)


Performance Measures

Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P that weights the various errors:

MSE(θ_t) = Σ_{s∈S} P(s) [V^π(s) − V_t(s)]²

Why P? Why minimize MSE? Let us assume that P is always the distribution of states at which backups are done. The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.


Gradient Descent

Let f be any function of the parameter space. Its gradient at any point θ_t in this space is:

∇_θ f(θ_t) = (∂f(θ_t)/∂θ(1), ∂f(θ_t)/∂θ(2), …, ∂f(θ_t)/∂θ(n))^T

Iteratively move down the gradient:

θ_{t+1} = θ_t − α ∇_θ f(θ_t)


Gradient Descent Cont.

For the MSE given above and using the chain rule:

θ_{t+1} = θ_t − (1/2) α ∇_θ MSE(θ_t)
        = θ_t + α Σ_{s∈S} P(s) [V^π(s) − V_t(s)] ∇_θ V_t(s)


Gradient Descent Cont.

Assume that states appear with distribution P; use just the sample gradient instead:

θ_{t+1} = θ_t − (1/2) α ∇_θ [V^π(s_t) − V_t(s_t)]²
        = θ_t + α [V^π(s_t) − V_t(s_t)] ∇_θ V_t(s_t)

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t:

E{[V^π(s_t) − V_t(s_t)] ∇_θ V_t(s_t)} = Σ_{s∈S} P(s) [V^π(s) − V_t(s)] ∇_θ V_t(s)


But We Don’t Have These Targets

Suppose we just have targets v_t instead:

θ_{t+1} = θ_t + α [v_t − V_t(s_t)] ∇_θ V_t(s_t)

If each v_t is an unbiased estimate of V^π(s_t), i.e., E{v_t} = V^π(s_t), then gradient descent converges to a local minimum (provided α decreases appropriately).

e.g., the Monte Carlo target v_t = R_t:

θ_{t+1} = θ_t + α [R_t − V_t(s_t)] ∇_θ V_t(s_t)
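A minimal sketch of this update with a linear approximator, where the Monte Carlo return R_t serves as the unbiased target v_t. The feature vector, return, and step size below are made-up illustrative values:

```python
import numpy as np

def mc_gradient_update(theta, phi_s, R_t, alpha=0.1):
    """One step of theta <- theta + alpha * (R_t - V_t(s_t)) * grad V_t(s_t),
    with V_t(s) = theta^T phi_s, so the gradient is just phi_s."""
    v_hat = theta @ phi_s
    return theta + alpha * (R_t - v_hat) * phi_s

theta = np.zeros(3)                                   # start from zero weights
theta = mc_gradient_update(theta, np.array([1.0, 0.0, 1.0]), R_t=2.0)
```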


What about TD(λ) Targets?

θ_{t+1} = θ_t + α [R_t^λ − V_t(s_t)] ∇_θ V_t(s_t)

Not a true gradient method for λ < 1. But we do it anyway, using the backwards view:

θ_{t+1} = θ_t + α δ_t e_t,

where:

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t), as usual, and

e_t = γ λ e_{t−1} + ∇_θ V_t(s_t)


On-Line Gradient-Descent TD(λ)
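The algorithm box from this slide did not survive extraction. The following is a hedged sketch of on-line gradient-descent TD(λ) for the linear case (so ∇V(s) = φ(s)); the environment interface (`reset`/`step`) and the feature function `phi` are assumptions of this sketch, not part of the original slides:

```python
import numpy as np

def td_lambda(env, phi, n_features, alpha=0.01, gamma=1.0, lam=0.9,
              n_episodes=100):
    """Policy evaluation with linear TD(lambda), backward view.
    env.reset() returns a state; env.step() returns (next_state, reward, done)
    for a fixed policy baked into the environment (prediction setting)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)              # eligibility trace vector
        done = False
        while not done:
            s_next, r, done = env.step()
            v = theta @ phi(s)
            v_next = 0.0 if done else theta @ phi(s_next)
            delta = r + gamma * v_next - v    # TD error delta_t
            e = gamma * lam * e + phi(s)      # accumulating trace update
            theta += alpha * delta * e        # gradient-descent step
            s = s_next
    return theta
```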


Linear Methods

Represent states as feature vectors :

for each s S :s s (1),s(2),,s (n) T

Vt (s)

t

Ts t (i)s(i)

i1

n

Vt (s) ?


Nice Properties of Linear FA Methods

The gradient is very simple: ∇_θ V_t(s) = φ_s

For MSE, the error surface is simple: a quadratic surface with a single minimum.

Linear gradient-descent TD(λ) converges:
- step size α decreases appropriately
- on-line sampling (states sampled from the on-policy distribution)
- converges to a parameter vector θ_∞ with the property:

MSE(θ_∞) ≤ [(1 − γλ) / (1 − γ)] MSE(θ*)

where θ* is the best parameter vector (Tsitsiklis & Van Roy, 1997)
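A tiny sketch of the two facts above (linear value and its gradient); the particular weights and features are arbitrary illustrative numbers:

```python
import numpy as np

# With linear FA, V_t(s) = theta^T phi_s and grad_theta V_t(s) = phi_s.
theta = np.array([0.5, -1.0, 2.0])   # illustrative weight vector
phi_s = np.array([1.0, 0.0, 0.5])    # illustrative feature vector for state s

value = theta @ phi_s                # sum_i theta(i) * phi_s(i)
gradient = phi_s                     # the gradient is just the feature vector
```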


Coarse Coding


Learning and Coarse Coding


Tile Coding

- Binary feature for each tile
- Number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- Easy to compute the indices of the features present
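A possible 1-D realization of these points, not taken from the slides: several offset tilings, each contributing exactly one active binary feature, identified by its index. The tile width, number of tilings, and index scheme are illustrative choices:

```python
def active_tiles(x, n_tilings=4, tiles_per_tiling=10, lo=0.0, hi=1.0):
    """Return the index of the single active tile in each tiling for input x.
    The number of active features is always n_tilings (constant)."""
    width = (hi - lo) / tiles_per_tiling
    indices = []
    for k in range(n_tilings):
        offset = k * width / n_tilings           # each tiling shifted slightly
        i = int((x - lo + offset) / width)
        i = min(i, tiles_per_tiling)             # clamp at the upper edge
        indices.append(k * (tiles_per_tiling + 1) + i)
    return indices
```

The approximate value of `x` is then just the sum of the weights at the returned indices, which is what makes the weighted sum cheap to compute.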


Tile Coding Cont.

- Irregular tilings
- Hashing
- CMAC: “Cerebellar Model Arithmetic Computer” (Albus, 1971)


Radial Basis Functions (RBFs)

e.g., Gaussians:

φ_s(i) = exp( −‖s − c_i‖² / (2 σ_i²) )
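The Gaussian RBF feature above can be computed directly; the centers and width below are arbitrary illustrative choices:

```python
import numpy as np

def rbf_features(s, centers, sigma=0.5):
    """phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma^2)), one feature per center."""
    s = np.asarray(s, dtype=float)
    return np.array([np.exp(-np.sum((s - c) ** 2) / (2 * sigma ** 2))
                     for c in centers])

centers = [np.array([0.0]), np.array([0.5]), np.array([1.0])]
phi = rbf_features([0.5], centers)   # largest response at the nearest center
```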


Can you beat the “curse of dimensionality”?

Can you keep the number of features from going up exponentially with the dimension?

Function complexity, not dimensionality, is the problem.

Kanerva coding:
- select a bunch of binary prototypes
- use Hamming distance as the distance measure
- dimensionality is no longer a problem, only complexity

“Lazy learning” schemes:
- remember all the data
- to get a new value, find nearest neighbors and interpolate
- e.g., locally weighted regression


Control with Function Approximation

Learning state-action values. Training examples of the form:

input: description of (s_t, a_t)    target output: v_t

The general gradient-descent rule:

θ_{t+1} = θ_t + α [v_t − Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)

Gradient-descent Sarsa(λ) (backward view):

θ_{t+1} = θ_t + α δ_t e_t

where

δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)

e_t = γ λ e_{t−1} + ∇_θ Q_t(s_t, a_t)
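One step of the backward-view Sarsa(λ) update above can be sketched as follows for the linear case, where ∇Q(s,a) is just the state-action feature vector. The feature vectors and hyperparameter values are assumptions of the sketch:

```python
import numpy as np

def sarsa_lambda_step(theta, e, phi_sa, phi_next_sa, r, alpha=0.1,
                      gamma=0.99, lam=0.9, terminal=False):
    """One backward-view Sarsa(lambda) step with linear FA:
       delta = r + gamma*Q(s',a') - Q(s,a)
       e     = gamma*lambda*e + phi(s,a)
       theta = theta + alpha*delta*e"""
    q = theta @ phi_sa
    q_next = 0.0 if terminal else theta @ phi_next_sa
    delta = r + gamma * q_next - q
    e = gamma * lam * e + phi_sa
    theta = theta + alpha * delta * e
    return theta, e
```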


GPI with Linear Gradient-Descent Sarsa(λ)


GPI with Linear Gradient-Descent Watkins’ Q(λ)


Mountain-Car Task


Mountain-Car Results


Baird’s Counterexample

Reward is always zero, so the true value is zero for all s.

The approximate state-value function is shown in each state.

Updating with full (DP) backups,

θ_{t+1} = θ_t + α [ E{r_{t+1} + γ V_t(s_{t+1}) | s_t = s} − V_t(s) ] ∇_θ V_t(s),

is unstable from some initial conditions.


Baird’s Counterexample Cont.


Should We Bootstrap?


Summary

- Generalization
- Adapting supervised-learning function approximation methods
- Gradient-descent methods
- Linear gradient-descent methods:
  - radial basis functions
  - tile coding
  - Kanerva coding
- Nonlinear gradient-descent methods? Backpropagation?
- Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction
