DESCRIPTION
Knowledge Diffusion Network, Visiting Lecturer Program (114)
Speaker: Alborz Geramifard, Ph.D. Candidate, Department of Computing Science, University of Alberta, Edmonton, Canada
Title: Incremental Least-Squares Temporal Difference Learning
Time: Tuesday, Sep 11, 2007, 12:00-1:00 pm
Location: Department of Computer Engineering, Sharif University of Technology, Tehran
TRANSCRIPT
Incremental Least-Squares Temporal Difference Learning
Alborz Geramifard, September 2007
Acknowledgement
Richard Sutton, Michael Bowling, Martin Zinkevich
Agencia
The king of Agencia: generous, wise, kind, but forgetful.
Agencia's affairs: agriculture, politics, research, ...
Its neighbor: Envirocia.
The Ambassador of Envirocia, to the Vazir of Agencia:
"Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations."
Having consultants for various subjects might be the answer ...
Consultants
The Ambassador of Envirocia, to the Vazir of Agencia:
"I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from his majesty ..."
Agencia and Envirocia
Reinforcement Learning Framework
[Diagram: the Agent (Agencia) sends an Action to the Environment (Envirocia), which returns an Observation and a Reward; the agent pursues a Goal.]
More Complicated State Space
More Complicated Tasks
Summary
[Diagram: speed vs. data efficiency. Part I, the King alone, corresponds to TD: fast but data inefficient. Part II, the King plus Consultants, corresponds to LSTD: data efficient but slow. The question mark between them is the subject of this talk.]
Contributions
iLSTD: A new policy evaluation algorithm
Extension with eligibility traces
Running time analysis
Dimension selection methods
Proof of convergence
Empirical results
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
Markov Decision Process
⟨S, A, P^a_{ss'}, R^a_{ss'}, γ⟩
[Diagram: a student-life MDP with states B.S., M.S., Ph.D. and actions Studying and Working; edges carry transition probabilities and rewards (e.g. 100%, +60; 80%, −50; 10%, −50; 70%, −200; 30%, −200; 10%, −300; 100%, +85; 100%, +120).]
Sample trajectory: B.S., Working, +60, B.S., Studying, −50, M.S., ...
Policy Evaluation
Policy Improvement
Notation
Scalar: regular face, e.g. V^π(s), r_{t+1}
Vector: bold lower case, e.g. μ(θ), φ(s)
Matrix: bold upper case, e.g. A_t, A
Policy Evaluation
V^π(s) = E[ Σ_{t=1}^∞ γ^{t−1} r_t | s_0 = s, π ]
Linear Function Approximation
V(s) = θ · φ(s) = Σ_{i=1}^n θ_i φ_i(s)
Sparsity of Features
Sparsity: only k of the n features are active at any given moment, with k ≪ n.
Acrobot [Sutton 96]: 48 ≪ 18,648
Card game [Bowling et al. 02]: 3 ≪ 10^6
Keepaway soccer [Stone et al. 05]: 416 ≪ 10^4
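Sparsity is what makes the per-step costs quoted later possible: evaluating V(s) = θ · φ(s) only needs to touch the k active entries of θ. A minimal sketch, assuming binary features represented by their active indices (the `evaluate` helper and the toy numbers are illustrative, not from the talk):

```python
import numpy as np

def evaluate(theta, active):
    """V(s) = theta . phi(s) for binary sparse features.

    `active` holds the indices of the k nonzero (unit-valued) features,
    so the dot product costs O(k) instead of O(n)."""
    return sum(theta[j] for j in active)

# Hypothetical example: n = 8 features, only k = 2 active in this state.
theta = np.zeros(8)
theta[1], theta[5] = 0.5, 0.25
print(evaluate(theta, [1, 5]))  # prints 0.75
```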
Temporal Difference Learning, TD(0) [Sutton 88]
[Diagram: following policy π, the agent observes a, r on each transition from s_t to s_{t+1}.]
δ_t = r_t + γ V(s_{t+1}) − V(s_t)
Tabular representation:  V_{t+1}(s_t) = V_t(s_t) + α_t δ_t
Linear function approximation:  θ_{t+1} = θ_t + α_t φ(s_t) δ_t
TD(0) Properties
Data inefficient: uses only the last transition.
Computational complexity: O(k) per time step (constant in n).
Least-Squares TD (LSTD) [Bradtke, Barto 96]
Sum of TD updates:
μ_t(θ) = Σ_{i=1}^t φ_i δ_i(V_θ)
       = Σ_{i=1}^t φ_i r_{i+1}  −  ( Σ_{i=1}^t φ_i (φ_i − γ φ_{i+1})^T ) θ
       =          b_t           −                  A_t θ
Setting μ_t(θ) = 0 gives θ = A^{−1} b.
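The batch computation can be sketched directly from the equation above: accumulate b_t and A_t over all transitions, then solve the linear system. A minimal illustration (data format and function name are assumptions of this sketch):

```python
import numpy as np

def lstd(transitions, n, gamma):
    """Batch LSTD: accumulate b_t and A_t over all transitions, then
    solve A theta = b.  `transitions` is a list of (phi, r, phi_next),
    with phi_next a zero vector at episode end."""
    A = np.zeros((n, n))
    b = np.zeros(n)
    for phi, r, phi_next in transitions:
        b += phi * r
        A += np.outer(phi, phi - gamma * phi_next)
    return np.linalg.solve(A, b)
```

Maintaining A incrementally and solving (or keeping A^{−1} via Sherman-Morrison) is what gives LSTD its O(n²) per-step cost.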
LSTD Properties
Data efficient: looks through all the data.
Computational complexity: O(n²) per time step (quadratic in n).
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
The New Approach
The A and b matrices change on each iteration:
μ_t(θ) = Σ_{i=1}^t φ_i r_{i+1} − ( Σ_{i=1}^t φ_i (φ_i − γ φ_{i+1})^T ) θ = b_t − A_t θ
But the per-step changes are cheap to compute:
b_t = b_{t−1} + r_t φ_t,  with  Δb_t = r_t φ_t
A_t = A_{t−1} + φ_t (φ_t − γ φ_{t+1})^T,  with  ΔA_t = φ_t (φ_t − γ φ_{t+1})^T
Incremental LSTD [Geramifard, Bowling, Sutton 06]
μ_t(θ) = b_t − A_t θ
With θ fixed (new data arrives):  μ_t(θ) = μ_{t−1}(θ) + Δb_t − (ΔA_t) θ
With A and b fixed (θ changes):  μ_t(θ_{t+1}) = μ_t(θ_t) − A_t (Δθ_t)
iLSTD
μ_t(θ_{t+1}) = μ_t(θ_t) − A_t (Δθ_t)
How should θ change? Descending in the direction of μ_t(θ) across all n dimensions would cost O(n²) per time step; iLSTD instead descends along a few selected dimensions.
iLSTD Algorithm
possible feature selection mechanisms in further detail and compare the performance of various methods. For simplicity, throughout the rest of the thesis, unless specified, the feature selection method will be non-zero random: we discard dimensions with zero sum TD update and select one randomly out of the remaining dimensions.
3.3 Algorithm
Algorithm 5 shows the iLSTD algorithm together with the computational complexity of its important lines. After setting the initial values, the agent begins interacting with the environment. A and μ are computed incrementally in Lines 5-8 according to Equations 3.1 and 3.2 (b is not computed explicitly because it is implicitly included in μ). This is followed by updating m selected features in Lines 9-13. For each of the m updates, one feature is selected through a component selection mechanism. After the update, μ is recomputed accordingly. This recomputation can affect the next component chosen.
Algorithm 5: iLSTD
0   s ← s_0, A ← 0, μ ← 0, t ← 0
1   Initialize θ arbitrarily
2   repeat
3     Take action according to π and observe r, s'
4     t ← t + 1
5     Δb ← φ(s) r                                                O(k)
6     ΔA ← φ(s) (φ(s) − γ φ(s'))^T                               O(k²)
7     A ← A + ΔA                                                 O(k²)
8     μ ← μ + Δb − (ΔA) θ                                        O(k²)
9     for i from 1 to m do
10      j ← choose an index of μ using a dimension selection mechanism
11      θ_j ← θ_j + α μ_j                                        O(1)
12      μ ← μ − α μ_j A e_j                                      O(n)
13    end for
14    s ← s'
15  end repeat
3.3.1 Time Analysis of iLSTD
We now examine the time complexity of the iLSTD algorithm.
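Algorithm 5 can be sketched directly in code. This is an illustrative dense-numpy version with non-zero random selection (a sparse representation would be needed to realize the stated O(k²) costs; function name and argument layout are my own):

```python
import numpy as np

def ilstd_step(A, mu, theta, phi, phi_next, r, gamma, alpha, m, rng):
    """One time step of iLSTD (sketch of Algorithm 5).

    Updates A and mu incrementally (Lines 5-8), then performs m
    single-dimension descent updates (Lines 9-13) using non-zero
    random dimension selection. Arrays are modified in place."""
    dA = np.outer(phi, phi - gamma * phi_next)   # Line 6
    A += dA                                      # Line 7
    mu += phi * r - dA @ theta                   # Line 8: mu += db - (dA) theta
    for _ in range(m):                           # Lines 9-13
        nz = np.flatnonzero(mu)                  # non-zero random selection
        if nz.size == 0:
            break
        j = rng.choice(nz)
        step = alpha * mu[j]
        theta[j] += step                         # Line 11
        mu -= step * A[:, j]                     # Line 12: mu -= alpha mu_j A e_j
    return A, mu, theta
```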
iLSTD Properties
More data efficient than TD.
Per-time-step computational complexity: O(mn + k²), linear in n, where m is the number of descent iterations per time step, n is the number of features, and k is the maximum number of active features.
Theorem: iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which μ_t is non-zero are selected in the limit an infinite number of times.
Empirical Results
Settings
Averaged over 30 runs
Same random seed for all methods
Sparse matrix representation
iLSTD: non-zero random selection, one descent per iteration
Boyan Chain [Boyan 99]
[Diagram: a chain of states N, ..., 5, 4, 3, 2, 1, 0; each transition yields reward −3, except the transition from state 1 into the absorbing state 0, which yields −2. Features interpolate linearly between anchor states, with feature vectors such as (1, 0, ..., 0), (0.75, 0.25, ..., 0), (0.5, 0.5, ..., 0), (0.25, 0.75, ..., 0), (0, 1, ..., 0).]
Three sizes: n = 4 (Small), n = 25 (Medium), n = 100 (Large)
k = 2
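For concreteness, an episode of the chain can be sketched as follows, assuming the jump structure of [Boyan 99] (from any state above 1, move down 1 or 2 states with equal probability); the function name is illustrative:

```python
import numpy as np

def boyan_episode(N, rng):
    """Generate one episode of the Boyan chain as (s, r, s') triples.

    From state s > 1: jump to s-1 or s-2 with equal probability, reward -3.
    From state 1: move deterministically to the absorbing state 0, reward -2."""
    s, traj = N, []
    while s > 0:
        if s > 1:
            s_next, r = s - rng.integers(1, 3), -3.0
        else:
            s_next, r = 0, -2.0
        traj.append((s, r, s_next))
        s = s_next
    return traj
```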
Small Boyan Chain
[Figure: RMS error of V(s) over all states, log scale, vs. episodes (0-1000) and vs. time (0-3 s), for TD, iLSTD, and LSTD.]
Medium Boyan Chain
[Figure: RMS error of V(s) over all states, log scale, vs. episodes (0-1000) and vs. time (0-20 s), for TD, iLSTD, and LSTD.]
Large Boyan Chain
[Figure: RMS error of V(s) over all states, log scale, vs. episodes (0-1000) and vs. time (0-100 s), for TD, iLSTD, and LSTD.]
Mountain Car
Tile coding: n = 10,000, k = 10
Two starting positions: Position = −1 (Easy), Position = −0.5 (Hard)
[For details see RL-Library]
Easy Mountain Car
[Figure: loss, log scale, vs. episodes (0-1000) and vs. time (0-350 s), for TD, iLSTD, and LSTD.]
Loss = ‖b_t − A_t θ_t‖²
Hard Mountain Car
[Figure: loss, log scale, vs. episodes (0-1000) and vs. time (0-1200 s), for TD, iLSTD, and LSTD.]
Running Time
[Bar chart: time (ms) per episode for TD, iLSTD, and LSTD on the Boyan chain and mountain car tasks (Small, Medium, Easy, Hard); TD stays roughly constant, iLSTD grows linearly, and LSTD grows exponentially with problem size, with on-chart bars such as 34, 154, and 200 ms and LSTD reaching values around 97,000 and 55,000 ms off-chart.]
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
Dimension Selection
Random ✓
Greedy
ε-Greedy
Boltzmann
Greedy Dimension Selection
Pick the dimension with the highest value of |μ_t(i)|.
Not proven to converge.
ε-Greedy Dimension Selection
With probability ε: non-zero random; with probability (1−ε): greedy.
Convergence proof applies.
Boltzmann Component Selection
Boltzmann distribution (with temperature τ and threshold ψ) + non-zero random.
Convergence proof applies.
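The four mechanisms above can be sketched in one function. This is an illustrative version (the exact role of the threshold ψ and the form of the Boltzmann weights are assumptions of this sketch, not taken from the talk):

```python
import numpy as np

def select_dimension(mu, method, rng, eps=0.1, tau=1.0, psi=1e-9):
    """Sketch of the dimension selection mechanisms discussed above.

    `psi` discards near-zero entries of mu (non-zero filtering);
    returns the chosen index, or None if every entry is (near) zero."""
    nz = np.flatnonzero(np.abs(mu) > psi)
    if nz.size == 0:
        return None
    if method == "random":            # non-zero random
        return rng.choice(nz)
    if method == "greedy":            # largest |mu_i|
        return nz[np.argmax(np.abs(mu[nz]))]
    if method == "eps-greedy":        # mix of the two above
        if rng.random() < eps:
            return rng.choice(nz)
        return nz[np.argmax(np.abs(mu[nz]))]
    if method == "boltzmann":         # probability proportional to exp(|mu_i|/tau)
        p = np.exp(np.abs(mu[nz]) / tau)
        return rng.choice(nz, p=p / p.sum())
    raise ValueError(method)
```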
Empirical Results
ε-Greedy: ε = .1; Boltzmann: ψ = 10^-9, τ = 1
[Bar chart: final error measure, log scale (10^-2 to 10^8), on the Boyan chain (Small, Medium, Large) and mountain car (Easy, Hard) for Random, Greedy, ε-Greedy, and Boltzmann dimension selection.]
Running Time
[Bar chart: clock time per step (ms), 0 to 11, on the Boyan chain (Small, Medium, Large) and mountain car (Easy, Hard) for Random, Greedy, ε-Greedy, and Boltzmann dimension selection.]
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
Conclusion
[Diagram: speed vs. data efficiency. TD is fast but data inefficient; LSTD is data efficient but slow (and needs no learning rate); iLSTD fills the gap between them, combining data efficiency with low per-step cost.]
Questions ...
What if someone uses batch LSTD?
Why does iLSTD take simple descent steps?
Hmm ... what about control?
Thanks ...