

Post on 15-Dec-2015


A Semiparametric Statistics Approach to Model-Free Policy Evaluation

Tsuyoshi UENO (1), Motoaki KAWANABE (2), Takeshi MORI (1), Shin-ichi MAEDA (1), Shin ISHII (1),(3)

(1) Kyoto University

(2) Fraunhofer FIRST


Summary of This Talk

• We discuss LSTD-based policy evaluation from the viewpoint of semiparametric statistics and estimating functions.

1. How good is LSTD?

We show that LSTD is a type of estimating function method and evaluate its asymptotic estimation variance.

2. Can we improve LSTD?

We derive an optimal estimating function with the minimum asymptotic estimation variance.

We propose a new policy evaluation algorithm (gLSTD).


Model-Free Reinforcement Learning

Goal: Obtain an optimal policy $\pi^*$ which maximizes the sum of future rewards.

[Figure: agent-environment loop. The policy $\pi$ selects an action $a$; the environment returns a state $s$ and a reward $r$.]


Policy Iteration [Sutton & Barto, 1998]

Policy Evaluation (estimate the value function)

Policy Improvement (update the policy)

Value function estimation is the key to policy iteration!

If the value function can be correctly estimated, policy iteration converges to the optimal policy $\pi^*$.


Policy Evaluation Method: LSTD [Bradtke & Barto, 1996]

• Least Squares Temporal Difference (LSTD)

– LSTD-based policy iteration algorithms have shown good practical performance.

• Least Squares Policy Iteration (LSPI) [Lagoudakis & Parr, 2003]

• Natural Actor-Critic (NAC) [Peters et al., 2003, 2005]

• Representation Policy Iteration (RPI) [Mahadevan & Maggioni, 2007]

LSTD is one of the important algorithms in the RL field.


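Least Squares Temporal Difference (LSTD)

• Bellman equation [Bellman, 1966]

$$V^\pi(s) := E^\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s\right]$$

• Assumption

We assume that a linear function 'completely' represents the value function (there is no bias):

$$V^\pi(s_t) = \phi(s_t)^T \theta =: \phi_t^T \theta$$

where $\phi_t$ is the feature vector and $\theta$ is the parameter. The Bellman equation then reads

$$\phi_t^T \theta = E^\pi[r_{t+1} \mid s_t] + \gamma E^\pi[\phi_{t+1} \mid s_t]^T \theta$$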


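Least Squares Temporal Difference (LSTD)

• Linearly approximated Bellman equation

$$\{(\phi_t - \gamma \phi_{t+1}) + \gamma(\phi_{t+1} - E^\pi[\phi_{t+1} \mid s_t])\}^T \theta = r_{t+1} - (r_{t+1} - E^\pi[r_{t+1} \mid s_t])$$

Here $\theta$ is the parameter, and $\gamma(\phi_{t+1} - E^\pi[\phi_{t+1} \mid s_t])$ and $r_{t+1} - E^\pi[r_{t+1} \mid s_t]$ are noise terms.

This is just a linear regression problem (an error-in-variables problem [Young, 1984]):

$$y_t = x_t^T \theta + \epsilon_t$$

Input: $x_t = \phi_t - \gamma \phi_{t+1}$. Output: $y_t = r_{t+1}$.

The input and observation noise are mutually dependent!!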
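As a brief worked step (a sketch, not an additional result from the talk), moving both noise terms of the linearly approximated Bellman equation to one side gives the regression form explicitly. The symbol $\epsilon_t$ for the combined noise is taken from the regression model $y_t = x_t^T \theta + \epsilon_t$; the labels "output noise" and "input noise" are added here for illustration.

```latex
\begin{aligned}
r_{t+1} &= (\phi_t - \gamma \phi_{t+1})^T \theta + \epsilon_t, \\
\epsilon_t &= \underbrace{\left(r_{t+1} - E^\pi[r_{t+1} \mid s_t]\right)}_{\text{output noise}}
            + \underbrace{\gamma \left(\phi_{t+1} - E^\pi[\phi_{t+1} \mid s_t]\right)^T \theta}_{\text{input noise}}.
\end{aligned}
```

Because $\phi_{t+1}$ appears both in the input $x_t = \phi_t - \gamma \phi_{t+1}$ and in $\epsilon_t$, the regressor and the noise are correlated in general, whereas $E^\pi[\epsilon_t \mid s_t] = 0$ still holds.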


Linear Regression with Error in Variables

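• Ordinary least squares method (OLS):

$$\hat{\theta}_{\mathrm{OLS}} = \left[\sum_{t=1}^{N} x_t x_t^T\right]^{-1} \left[\sum_{t=1}^{N} x_t y_t\right]$$

[Figure: scatter plot of input $x$ against output $y$, comparing the line $y = x$ with the OLS fit.]

Because the observation noise depends on the input variable, the OLS estimator $\hat{\theta}_{\mathrm{OLS}}$ is biased.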
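The bias of OLS under noisy inputs can be seen in a small simulation. The sketch below is illustrative only: the data-generating process (a scalar input observed with Gaussian noise of standard deviation 0.7, true parameter 1) and the helper name `ols_estimate` are assumptions made here, not part of the talk.

```python
import numpy as np

def ols_estimate(x, y):
    """Solve [sum_t x_t x_t^T] theta = [sum_t x_t y_t] for theta."""
    return np.linalg.solve(x.T @ x, x.T @ y)

rng = np.random.default_rng(0)
n = 20_000
theta_true = np.array([1.0])                      # assumed true parameter

x_clean = rng.normal(size=(n, 1))                 # noise-free input (not observed)
x_obs = x_clean + 0.7 * rng.normal(size=(n, 1))   # observed input = clean input + input noise
y = x_clean @ theta_true + 0.1 * rng.normal(size=n)

# Regressing y on the noisy input: the effective noise contains the input noise,
# so it is correlated with x_obs and the estimate is pulled toward zero.
theta_ols = ols_estimate(x_obs, y)
print("OLS estimate:", theta_ols)
```

Under these assumptions the OLS estimate concentrates near $1/(1 + 0.7^2) \approx 0.67$ rather than the true value 1, and collecting more samples does not remove the gap.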


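Instrumental Variable Method [Soderstrom and Stoica, 2002]

• Introduce the instrumental variable $z_t$. The OLS estimator

$$\hat{\theta}_{\mathrm{OLS}} = \left[\sum_{t=1}^{N} x_t x_t^T\right]^{-1} \left[\sum_{t=1}^{N} x_t y_t\right]$$

is replaced by

$$\hat{\theta}_{\mathrm{IV}} = \left[\sum_{t=1}^{N} z_t x_t^T\right]^{-1} \left[\sum_{t=1}^{N} z_t y_t\right]$$

The instrumental variable is correlated with the input but uncorrelated with the noise.

$\hat{\theta}_{\mathrm{IV}}$ is an (asymptotically) unbiased estimator.

[Figure: scatter plot of input $x$ against output $y$ with the line $y = x$.]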
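A minimal sketch of the instrumental variable estimator, under assumed synthetic data: the instrument here is a second, independent noisy measurement of the same clean input, which is one textbook way to obtain a variable correlated with the input but not with the noise. The function name `iv_estimate`, the noise levels, and the true parameter are all hypothetical choices for illustration.

```python
import numpy as np

def iv_estimate(z, x, y):
    """Solve [sum_t z_t x_t^T] theta = [sum_t z_t y_t] for theta."""
    return np.linalg.solve(z.T @ x, z.T @ y)

rng = np.random.default_rng(0)
n = 20_000
theta_true = np.array([1.0])                      # assumed true parameter

x_clean = rng.normal(size=(n, 1))                 # noise-free input (not observed)
x_obs = x_clean + 0.7 * rng.normal(size=(n, 1))   # observed noisy input
z = x_clean + 0.7 * rng.normal(size=(n, 1))       # instrument: independent second measurement
y = x_clean @ theta_true + 0.1 * rng.normal(size=n)

theta_ols = iv_estimate(x_obs, x_obs, y)          # OLS is the special case z_t = x_t
theta_iv = iv_estimate(z, x_obs, y)
print("OLS:", theta_ols, "IV:", theta_iv)
```

With this setup the OLS estimate stays well below 1 while the IV estimate lands close to 1, since the instrument's noise is independent of both the input noise and the output noise.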


Least Squares Temporal Difference (LSTD)

• LSTD = Instrumental variable method.

– Instrumental variable: $z_t = \phi_t$

$$\hat{\theta}_{\mathrm{LSTD}} = \left[\sum_{t=0}^{N-1} \phi_t (\phi_t - \gamma \phi_{t+1})^T\right]^{-1} \left[\sum_{t=0}^{N-1} \phi_t r_{t+1}\right]$$

For example, $z_t = \alpha \phi_t$, $z_t = \phi_{t-k}$, $z_t = \phi_t + c$, ... are also instrumental variables.

It is important to choose an appropriate instrumental variable.
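The LSTD estimator with $z_t = \phi_t$ can be sketched in a few lines of NumPy. The function name `lstd` and the toy data (a deterministic two-state cycle with one-hot features) are assumptions chosen so that the result can be checked by hand; they are not from the talk.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """LSTD: solve [sum_t phi_t (phi_t - gamma phi_{t+1})^T] theta = [sum_t phi_t r_{t+1}].

    phi, phi_next: arrays of shape (N, d) holding phi_t and phi_{t+1}.
    rewards: array of shape (N,) holding r_{t+1}.
    """
    x = phi - gamma * phi_next           # regression input x_t
    return np.linalg.solve(phi.T @ x, phi.T @ rewards)

# Toy check: states alternate 0 -> 1 -> 0 -> ..., reward 1 when leaving state 0.
gamma = 0.9
states = np.array([0, 1, 0, 1, 0])
features = np.eye(2)                     # one-hot (tabular) features
phi, phi_next = features[states[:-1]], features[states[1:]]
rewards = np.array([1.0, 0.0, 1.0, 0.0])

theta = lstd(phi, phi_next, rewards, gamma)
print(theta)
```

For this cycle the Bellman equations are $V(0) = 1 + \gamma V(1)$ and $V(1) = \gamma V(0)$, so the estimate should equal $[1, \gamma] / (1 - \gamma^2)$.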


Our Approach

• How good is LSTD?

We analyze the asymptotic estimation variance of the instrumental variable method.

• Can we improve LSTD?

We optimize the instrumental variable so as to minimize the asymptotic estimation variance.

We introduce the viewpoint of semiparametric statistical inference.


Semiparametric Statistics Approach

• Linearly approximated Bellman equation

$$y_t = x_t^T \theta + \epsilon_t, \qquad x_t = \phi_t - \gamma \phi_{t+1}, \qquad y_t = r_{t+1}$$

We don't know the noise distribution.

• Semiparametric model: $p(x, y; \theta, \kappa)$

– $\theta$ is the target parameter

– $\kappa$ is the nuisance parameter (infinite degrees of freedom)

We need to estimate only the target parameter, regardless of the nuisance parameter.


Inference of Semiparametric Model

• Estimating function [Godambe, 1985]: $f(x, y; \theta)$

[Conditions] For any nuisance parameter,

$$E^\pi[f(x, y; \theta)] = 0, \qquad E^\pi\left[\frac{\partial f(x, y; \theta)}{\partial \theta}\right] \neq 0, \qquad E^\pi\left[\| f(x, y; \theta) \|^2\right] < \infty$$

• Estimating equation

$$\sum_{t=0}^{N-1} f(x_t, y_t; \hat{\theta}) = 0$$

The solution $\hat{\theta}$ converges to the true parameter $\theta^*$ regardless of the nuisance parameter.


Estimating Functions

• Estimating function = LSTD

$$f_{\mathrm{LSTD}} = \phi_t \{(\phi_t - \gamma \phi_{t+1})^T \theta - r_{t+1}\}$$

• Estimating function = Instrumental variable method

$$f_{\mathrm{IV}} = z(s_t, \ldots, s_{t-k}) \{(\phi_t - \gamma \phi_{t+1})^T \theta - r_{t+1}\}$$

where $z$ is the instrumental variable.

Are there any other estimating functions?
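As a quick check (a sketch that assumes the no-bias linear model and the Markov property, with $\theta^*$ denoting the true parameter), an instrumental variable that depends only on the current and past states gives an estimating function with zero mean at $\theta^*$:

```latex
\begin{aligned}
E^\pi[f_{\mathrm{IV}}]
 &= E^\pi\Big[ z(s_t, \ldots, s_{t-k}) \,
      E^\pi\big[ (\phi_t - \gamma \phi_{t+1})^T \theta^* - r_{t+1} \,\big|\, s_t, \ldots, s_{t-k} \big] \Big] \\
 &= E^\pi\Big[ z(s_t, \ldots, s_{t-k}) \,
      \big\{ \phi_t^T \theta^* - \gamma E^\pi[\phi_{t+1} \mid s_t]^T \theta^* - E^\pi[r_{t+1} \mid s_t] \big\} \Big]
  = 0 .
\end{aligned}
```

The term in braces vanishes because the linear model satisfies $\phi_t^T \theta^* = E^\pi[r_{t+1} \mid s_t] + \gamma E^\pi[\phi_{t+1} \mid s_t]^T \theta^*$. Choosing $z = \phi_t$ gives the LSTD case.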


Are There Any Other Estimating Functions ?

No!!

Proposition 1

Every admissible estimating function must have the form

$$f_{\mathrm{IV}} = z(s_t, s_{t-1}, \ldots, s_{t-T}) \{(\phi_t - \gamma \phi_{t+1})^T \theta - r_{t+1}\}.$$

An "inadmissible" estimating function is one for which a superior estimating function exists.


Asymptotic Variance of LSTD-Based Estimators

Lemma 2. The asymptotic variance of the estimator obtained from an estimating function for value functions is given by

$$\mathrm{AV}[\hat{\theta}] = \frac{1}{N} A^{-1} M (A^T)^{-1}$$

where

$$A = E^\pi\left[z_t (\phi_t - \gamma \phi_{t+1})^T\right], \qquad M = E^\pi\left[(\epsilon_t^*)^2 z_t z_t^T\right]$$

and

$$\epsilon_t^* = (\phi_t - \gamma \phi_{t+1})^T \theta^* - r_{t+1}.$$

Which instrumental variable achieves the minimum asymptotic variance?
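In practice the expectations in $A$ and $M$ are unknown. One hedged way to use the lemma is a plug-in estimate in which sample averages replace the expectations and a supplied parameter (for example an LSTD estimate) replaces $\theta^*$. This substitution and the function name `plug_in_asymptotic_variance` are assumptions made for illustration, not a procedure stated in the talk.

```python
import numpy as np

def plug_in_asymptotic_variance(z, phi, phi_next, rewards, theta, gamma):
    """Sample version of (1/N) A^{-1} M A^{-T}.

    A is estimated by mean_t z_t (phi_t - gamma phi_{t+1})^T,
    M by mean_t eps_t^2 z_t z_t^T, with eps_t computed from `theta`.
    """
    n = len(rewards)
    x = phi - gamma * phi_next
    eps = x @ theta - rewards                      # residual (phi_t - gamma phi_{t+1})^T theta - r_{t+1}
    a_hat = z.T @ x / n
    m_hat = (z * (eps ** 2)[:, None]).T @ z / n
    a_inv = np.linalg.inv(a_hat)
    return a_inv @ m_hat @ a_inv.T / n

# Tiny hand-checkable example with a scalar feature.
z = np.array([[1.0], [1.0]])
phi = np.array([[1.0], [1.0]])
phi_next = np.zeros((2, 1))
rewards = np.array([0.0, 2.0])
av = plug_in_asymptotic_variance(z, phi, phi_next, rewards, np.array([1.0]), gamma=0.9)
print(av)
```

In the example, the residuals are $+1$ and $-1$, so the sample $A$ and $M$ both equal 1 and the plug-in variance is $1/N = 0.5$.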


The Optimal Estimating Function

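Theorem 1.

The optimal instrumental variable with the minimum asymptotic variance is given by

$$z_t^* = E^\pi\left[(\epsilon_t^*)^2 \mid s_t\right]^{-1} \left(\phi_t - \gamma E^\pi[\phi_{t+1} \mid s_t]\right)$$

where

$$\epsilon_t^* = (\phi_t - \gamma \phi_{t+1})^T \theta^* - r_{t+1}.$$

The residual $\epsilon_t^*$ depends on the true parameter $\theta^*$, which is unknown, and both conditional expectations are also unknown.

An approximation is necessary: gLSTD.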
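A small worked special case may help in reading the optimal instrumental variable. Suppose, purely as an illustrative assumption, that the conditional residual variance does not depend on the state, $E^\pi[(\epsilon_t^*)^2 \mid s_t] = \sigma^2$. Then:

```latex
\begin{aligned}
z_t^* &= \sigma^{-2} \left( \phi_t - \gamma E^\pi[\phi_{t+1} \mid s_t] \right), \\
\hat{\theta}_{\mathrm{IV}}
  &= \Big[ \sum_t \sigma^{-2} \bar{x}_t x_t^T \Big]^{-1} \Big[ \sum_t \sigma^{-2} \bar{x}_t y_t \Big]
   = \Big[ \sum_t \bar{x}_t x_t^T \Big]^{-1} \Big[ \sum_t \bar{x}_t y_t \Big],
\qquad \bar{x}_t := \phi_t - \gamma E^\pi[\phi_{t+1} \mid s_t].
\end{aligned}
```

The constant $\sigma^{-2}$ cancels, so in this case the optimal instrument is simply the conditional mean of the input $x_t = \phi_t - \gamma \phi_{t+1}$ given the current state. The state-dependent weight matters only when the residual variance differs across states.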


gLSTD

The optimal instrumental variable

$$z_t^* = E^\pi\left[(\epsilon_t^*)^2 \mid s_t\right]^{-1} \left(\phi_t - \gamma E^\pi[\phi_{t+1} \mid s_t]\right)$$

• The residual of the true parameter, $\epsilon_t^*$ (unknown)

Replace the regression residual of the true parameter with that of the LSTD estimator:

$$\epsilon_t^* \leftarrow \hat{\epsilon}_t^{\mathrm{LSTD}}$$

• Unknown conditional expectations $E^\pi[(\epsilon_t^*)^2 \mid s_t]$ and $E^\pi[\phi_{t+1} \mid s_t]$

Approximate these conditional expectations by using a sample-based function approximation technique.


Summary of gLSTD

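1) Calculate the initial estimator and replace the true residual

$$\hat{\theta}_{\mathrm{LSTD}} \leftarrow \left[\sum_{t=0}^{N-1} \phi_t (\phi_t - \gamma \phi_{t+1})^T\right]^{-1} \left[\sum_{t=0}^{N-1} \phi_t r_{t+1}\right], \qquad \epsilon_t^* \leftarrow \hat{\epsilon}_t^{\mathrm{LSTD}}$$

2) Approximate the conditional expectations

$$E^\pi\left[(\epsilon_t^*)^2 \mid s_t\right], \qquad E^\pi[\phi_{t+1} \mid s_t]$$

3) Construct the instrumental variable

$$\hat{z}_t \leftarrow E^\pi\left[(\epsilon_t^*)^2 \mid s_t\right]^{-1} \left(\phi_t - \gamma E^\pi[\phi_{t+1} \mid s_t]\right)$$

4) Calculate the gLSTD estimator

$$\hat{\theta}_{\mathrm{gLSTD}} \leftarrow \left[\sum_{t=0}^{N-1} \hat{z}_t (\phi_t - \gamma \phi_{t+1})^T\right]^{-1} \left[\sum_{t=0}^{N-1} \hat{z}_t r_{t+1}\right]$$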
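The four steps can be sketched for a finite state space. The talk only says that the conditional expectations are approximated with a sample-based function approximation technique; the per-state sample averages used below are one simple choice assumed here. The function names, the variance floor, and the toy random walk (including which states carry which reward) are likewise hypothetical.

```python
import numpy as np

def solve_iv(z, x, y):
    """Instrumental variable estimate: [sum z_t x_t^T]^{-1} [sum z_t y_t]."""
    return np.linalg.solve(z.T @ x, z.T @ y)

def glstd_tabular(states, phi, phi_next, rewards, gamma, n_states, floor=1e-8):
    x = phi - gamma * phi_next
    # Step 1: initial LSTD estimate and its squared residuals.
    theta_lstd = solve_iv(phi, x, rewards)
    eps2 = (x @ theta_lstd - rewards) ** 2
    # Step 2: per-state sample averages as stand-ins for the conditional expectations.
    counts = np.maximum(np.bincount(states, minlength=n_states), 1).astype(float)
    var_s = np.bincount(states, weights=eps2, minlength=n_states) / counts
    mean_next = np.zeros((n_states, phi.shape[1]))
    np.add.at(mean_next, states, phi_next)
    mean_next /= counts[:, None]
    # Step 3: instrumental variable (variance floored to avoid division by zero).
    z = (phi - gamma * mean_next[states]) / np.maximum(var_s[states], floor)[:, None]
    # Step 4: gLSTD estimate.
    return theta_lstd, solve_iv(z, x, rewards)

# Toy data: reflecting random walk on 5 states with noisy rewards (all assumed).
rng = np.random.default_rng(0)
n_states, gamma, n = 5, 0.9, 2000
s = np.empty(n + 1, dtype=int)
s[0] = 2
for t in range(n):
    s[t + 1] = np.clip(s[t] + rng.choice([-1, 1]), 0, n_states - 1)
reward_table = np.array([0.0, 0.0, 0.0, 0.5, 1.0])
rewards = reward_table[s[1:]] + 0.1 * rng.normal(size=n)
eye = np.eye(n_states)
theta_lstd, theta_glstd = glstd_tabular(s[:-1], eye[s[:-1]], eye[s[1:]], rewards, gamma, n_states)
print(theta_lstd, theta_glstd)
```

With one-hot features, as in this toy run, any instrument that depends only on the current state yields the same estimate, so the two outputs coincide; that makes the tabular case a convenient correctness check. Differences between LSTD and gLSTD can only appear with lower-dimensional features.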


Simulation (Markov Random Walk)

• Conditions of the simulation experiment

– Policy: Random

– The number of steps: 100

– The number of episodes: 100

– Discount factor: 0.9

• Basis functions:

– We generated three basis functions by the diffusion model [Mahadevan & Maggioni, 2007].

[Figure: Markov chain with five states numbered 1 to 5; three states are labeled R=0, one R=0.5, and one R=1.0.]
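To measure the MSE of an estimator on such a chain, a ground-truth value function is needed. The sketch below computes it by solving the Bellman equation exactly. The transition structure (a symmetric random walk that stays in place at the two ends) and the assignment of the rewards 0.5 and 1.0 to the last two states are assumptions; the transcript does not specify either.

```python
import numpy as np

def random_walk_transition_matrix(n_states):
    """Move left or right with probability 1/2; at the ends, the blocked move stays put."""
    p = np.zeros((n_states, n_states))
    for s in range(n_states):
        p[s, max(s - 1, 0)] += 0.5
        p[s, min(s + 1, n_states - 1)] += 0.5
    return p

def true_values(p, expected_reward, gamma):
    """Solve V = r + gamma P V, i.e. (I - gamma P) V = r."""
    return np.linalg.solve(np.eye(p.shape[0]) - gamma * p, expected_reward)

gamma = 0.9
p = random_walk_transition_matrix(5)
r = np.array([0.0, 0.0, 0.0, 0.5, 1.0])   # assumed expected reward per state
v_true = true_values(p, r, gamma)
print(v_true)
```

An estimate $\hat{V} = \Phi \hat{\theta}$ can then be compared with `v_true`, for example through the mean squared difference over states.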


Simulation Result

The gLSTD estimator achieved a 20% smaller MSE than the LSTD estimator.

[Figure: MSE of LSTD and gLSTD, showing the median and the upper and lower quartiles; the 20% gap is marked.]


Conclusion

• We discussed LSTD-based policy evaluation in the framework of semiparametric statistics.

– We evaluated the asymptotic variance of LSTD-based estimators.

– We derived the optimal estimating function with the minimum asymptotic variance and proposed a practical implementation of it: gLSTD.

– On a simple Markov chain problem, we demonstrated that gLSTD reduces the estimation variance of LSTD.


Future Work

A Semiparametric Approach to Model-Free Policy Evaluation

→ A Semiparametric Approach to Model-Free Reinforcement Learning

Application to policy improvement

- Least Squares Policy Iteration (LSPI)

- Natural Actor-Critic (NAC), etc.


End

Thank you for your attention!!


Cost Function

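$$\theta^{\mathrm{gLS}} = \arg\min_\theta \frac{1}{2} \left\| \hat{V} - V^* \right\|^2_{D_\rho^{\mathrm{gLS}}}$$

$$\theta^{\mathrm{LS}} = \arg\min_\theta \frac{1}{2} \left\| \hat{V} - V^* \right\|^2_{D_\rho^{\mathrm{LS}}}, \qquad D_\rho^{\mathrm{LS}} = (I - \gamma P)^T D_\rho \Phi \Phi^T D_\rho (I - \gamma P)$$

[Equation: $D_\rho^{\mathrm{gLS}}$ is defined from $I - \gamma P$, $D_\rho$ and $\Sigma^{-1}$; its exact form is not recoverable from the transcript.]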


Simulation Result

[Figure: Markov chain with five states numbered 1 to 5.]

$$r = [0, 0, 0, 0, 1.0]$$


Questions

1. How good is LSTD?

We show that LSTD is a type of estimating function method and evaluate its asymptotic estimation variance.

2. Can we improve LSTD?

We derive the optimal estimating function with the minimum asymptotic estimation variance.


The Suboptimal Estimating Function (LSTDc)

• gLSTD requires estimating functions that depend on the current state.

• To avoid estimating these functions, we simply replace them with a constant value:

$$z_t = \phi_t + c$$

The shift $c$ is optimized to minimize the asymptotic variance.


The Suboptimal Estimating Function (LSTDc)

Theorem 2. The optimal shift is given by

$$c^* = -\frac{E^\pi\left[(\epsilon_t^*)^2 \phi_t\right] - (1-\gamma)\, E^\pi\left[(\epsilon_t^*)^2 \phi_t \phi_t^T\right] E^\pi\left[(\phi_t - \gamma \phi_{t+1}) \phi_t^T\right]^{-1} E^\pi[\phi_t]}{E^\pi\left[(\epsilon_t^*)^2\right] - (1-\gamma)\, E^\pi\left[(\epsilon_t^*)^2 \phi_t^T\right] E^\pi\left[(\phi_t - \gamma \phi_{t+1}) \phi_t^T\right]^{-1} E^\pi[\phi_t]}$$

where

$$\epsilon_t^* = (\phi_t - \gamma \phi_{t+1})^T \theta^* - r_{t+1}.$$
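A hedged sketch of how a shifted instrument $z_t = \phi_t + c$ might be computed from samples. It minimizes the plug-in variance $\frac{1}{N} A^{-1} M A^{-T}$ over $c$ in closed form, using the equivalent weighted-average expression $c = -\sum_t \epsilon_t^2 w_t \phi_t / \sum_t \epsilon_t^2 w_t$ with $w_t = 1 - \bar{x}^T A_0^{-1} \phi_t$. That expression was derived for this example under the plug-in assumption (LSTD residuals in place of $\epsilon_t^*$, the sample mean $\bar{x}$ of $\phi_t - \gamma\phi_{t+1}$ in place of $(1-\gamma) E^\pi[\phi_t]$); it is not quoted from the talk, and all names and the toy data are hypothetical.

```python
import numpy as np

def plug_in_variance(z, x, eps2):
    """Sample version of (1/N) A^{-1} M A^{-T} for instrument z."""
    n = len(eps2)
    a_inv = np.linalg.inv(z.T @ x / n)
    m_hat = (z * eps2[:, None]).T @ z / n
    return a_inv @ m_hat @ a_inv.T / n

def lstdc(phi, phi_next, rewards, gamma):
    n = len(rewards)
    x = phi - gamma * phi_next
    a0 = phi.T @ x / n
    theta_lstd = np.linalg.solve(a0, phi.T @ rewards / n)
    eps2 = (x @ theta_lstd - rewards) ** 2        # plug-in for (eps_t^*)^2
    x_bar = x.mean(axis=0)                        # sample mean of phi_t - gamma phi_{t+1}
    g = np.linalg.solve(a0, phi.T).T              # rows: A0^{-1} phi_t
    w = 1.0 - g @ x_bar
    c = -((eps2 * w) @ phi) / np.sum(eps2 * w)    # shift minimizing the plug-in variance
    z = phi + c
    theta_c = np.linalg.solve(z.T @ x, z.T @ rewards)
    return theta_lstd, theta_c, c, x, eps2

# Toy data: reflecting random walk on 5 states, two polynomial features (assumed).
rng = np.random.default_rng(0)
n_states, gamma, n = 5, 0.9, 5000
s = np.empty(n + 1, dtype=int)
s[0] = 2
for t in range(n):
    s[t + 1] = np.clip(s[t] + rng.choice([-1, 1]), 0, n_states - 1)
u = np.arange(n_states) / (n_states - 1)
features = np.stack([u, u ** 2], axis=1)          # no constant feature, so the shift matters
rewards = np.array([0.0, 0.0, 0.0, 0.5, 1.0])[s[1:]] + 0.1 * rng.normal(size=n)
phi, phi_next = features[s[:-1]], features[s[1:]]

theta_lstd, theta_c, c, x, eps2 = lstdc(phi, phi_next, rewards, gamma)
av_lstd = plug_in_variance(phi, x, eps2)
av_shift = plug_in_variance(phi + c, x, eps2)
print(np.trace(av_lstd), np.trace(av_shift))
```

The features deliberately exclude a constant: if the feature space already contains a constant (or is one-hot), adding $c$ is only a linear transformation of the features and leaves the estimator unchanged.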


Summary of This Talk

• We introduce a semiparametric statistical viewpoint for estimating value functions with a linear model.

• Our aim

– Evaluate the estimation variance of value functions

– Develop more efficient estimation methods


Summary of Our Main Results

1. Formulate the estimation problem of linearly-represented value functions as a semiparametric inference problem

2. Evaluate the asymptotic variance of value function estimators

3. Derive the optimal estimation method with the minimum asymptotic variance


Estimating Functions

• Question: Which function is appropriate when more than one estimating function exists?

• Answer: Choose the estimating function with the minimum asymptotic variance

$$\mathrm{AV}[\hat{\theta}] := E\left[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^T\right]$$


Instrumental Variable (IV) Method

• Instrumental variable $z_t$:

– Correlated with the input variable, but uncorrelated with the noise.

• Instrumental variable method

$$\{x_t + e_{xt}\}^T \theta + e_{yt} = y_t$$

$$\theta = E[z_t x_t^T]^{-1} E[z_t y_t]$$


Statistics approach


What is the Semiparametric Approach ?

• Semiparametric model: $p(x; \theta, \kappa)$

– Parameter: $\theta$

– Nuisance parameter: $\kappa$

We need to estimate the parameter $\theta$ regardless of the nuisance parameter $\kappa$.

• Estimating function [Godambe, 1985]: $f(x; \theta)$

[Conditions]

$$E[f(x; \theta)] = 0$$

The solution $\hat{\theta}$ of the estimating equation

$$\sum_{t=0}^{N-1} f(x_t; \hat{\theta}) = 0$$

converges to the true parameter $\theta^*$.

See [Godambe, 1985] for details.