alborz

156
Alborz Geramifard September, 2007 [email protected] Incremental Least -Squares Temporal Dierence Learning

Upload: knowdiff

Post on 19-Jun-2015

865 views

Category:

Technology


3 download

DESCRIPTION

Knowledge Diffusion NetworkVisiting Lecturer Program (114)Speaker: Alborz GeramifardPh.D. CandidateDepartment of Computing Science, Edmonton, University of Alberta, CanadaTitle: Incremental Least- Squares Temporal Difference LearningTime: Tuesday, Sep 11, 2007, 12:00-1:00 pmLocation: Department of Computer Engineering, Sharif University of Technology, Tehran

TRANSCRIPT

Page 1: Alborz

Alborz GeramifardSeptember, 2007

[email protected]

Incremental Least-Squares Temporal Difference Learning

Page 2: Alborz

Alborz GeramifardSeptember, 2007

[email protected]

Incremental Least-Squares Temporal Difference Learning

Page 3: Alborz
Page 4: Alborz

Richard Sutton

Acknowledgement

Michael Bowling

Martin Zinkevich

Page 5: Alborz
Page 6: Alborz

Agencia

Page 7: Alborz

Generous

Wise

Kind

Forgetful

Agencia

Page 8: Alborz

Envirocia

Agencia

Page 9: Alborz

Envirocia

AgenciaAgriculture

Politics

Research

...

Page 10: Alborz

The Ambassador of Envirocia

Page 11: Alborz

The Ambassador of Envirocia

The Vazir of Agencia

Page 12: Alborz

The Ambassador of Envirocia“Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”

The Vazir of Agencia

Page 13: Alborz

The Ambassador of Envirocia“Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations.”

The Vazir of Agencia

Having consultants for various subjects might be the answer ...

Page 14: Alborz
Page 15: Alborz

Consultants

Page 16: Alborz

Consultants

Page 17: Alborz

Ambassador

Page 18: Alborz

Ambassador

Vazir

Page 19: Alborz

Ambassador

“I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from the majesty ...”

Vazir

Page 20: Alborz

Agencia

Envirocia

Page 21: Alborz

Reward

Observation

ActionAgent

Environment

Reinforcement Learning Framework

Page 22: Alborz

Reward

Observation

ActionAgent

Environment

Page 23: Alborz

Reward

Observation

ActionAgent

Environment

Page 24: Alborz

Goal

Reward

Observation

ActionAgent

Environment

Page 25: Alborz

Goal

Reward

Observation

ActionAgent

Environment

Page 26: Alborz

Goal

Reward

Observation

ActionAgent

Environment

More Complicated State Space

More Complicated Tasks

Page 27: Alborz

Summary

Page 28: Alborz

Summary

Speed

Dat

a Effi

cien

cy

Page 29: Alborz

Summary

Speed

Dat

a Effi

cien

cy

Part I: King

Page 30: Alborz

Summary

Speed

Dat

a Effi

cien

cy

Part I: King

Part II: King + Consultants

Page 31: Alborz

Summary

Speed

Dat

a Effi

cien

cy

Part I: King

Part II: King + Consultants

?

Page 32: Alborz

Summary

Speed

Dat

a Effi

cien

cyLSTD

TD

Page 33: Alborz

Summary

Speed

Dat

a Effi

cien

cyLSTD

TD

iLSTD

Page 34: Alborz

Contributions

Page 35: Alborz

ContributionsiLSTD: A new policy evaluation algorithm

Extension with eligibility traces

Running time analysis

Dimension selection methods

Proof of convergence

Empirical results

Page 36: Alborz

ContributionsiLSTD: A new policy evaluation algorithm

Extension with eligibility traces

Running time analysis

Dimension selection methods

Proof of convergence

Empirical results

Page 37: Alborz

ContributionsiLSTD: A new policy evaluation algorithm

Extension with eligibility traces

Running time analysis

Dimension selection methods

Proof of convergence

Empirical results

Page 38: Alborz

Outline

Page 39: Alborz

Motivation

Introduction

The New Approach

Dimension Selection

Conclusion

Outline

Page 40: Alborz

Motivation

Introduction

The New Approach

Dimension Selection

Conclusion

Outline

Page 41: Alborz

Markov Decision Process

!S,A,Pass! ,Ra

ss! , !"

Page 42: Alborz

Markov Decision Process

!S,A,Pass! ,Ra

ss! , !"

B.S.

100%,+85

M.S.

100%,+120

Studying

10%,-50

80%,-50 Studying

30%,-200

70%,-200Ph.D.

10%,-300

Working

100%,+60

Working

Working

Page 43: Alborz

Markov Decision Process

!S,A,Pass! ,Ra

ss! , !"

B.S., Working, +60, B.S. Studying, -50, M.S. , ...

B.S.

100%,+85

M.S.

100%,+120

Studying

10%,-50

80%,-50 Studying

30%,-200

70%,-200Ph.D.

10%,-300

Working

100%,+60

Working

Working

Page 44: Alborz

Policy Evaluation

Page 45: Alborz

Policy Evaluation

Policy Improvement

Page 46: Alborz

Policy Evaluation

Policy Improvement

Page 47: Alborz

Policy Evaluation

Policy Improvement

Page 48: Alborz

Notation

Scalar Regular

Vector Bold Lower Case

Matrix Bold Upper Case

µ(!)

V !(s) rt+1

At A

!(s)

Page 49: Alborz

Policy Evaluation

V !(s) = E

! !"

t=1

!t"1rt

####s0 = s,"

$

Page 50: Alborz

Linear Function Approximation

V (s) = ! · "(s) =n!

i=1

!i"i(s)

Page 51: Alborz

Sparsity of featuresSparsity: Only features are active at any given moment.

k

k ! n

Page 52: Alborz

Sparsity of featuresSparsity: Only features are active at any given moment.

Acrobot [Sutton 96]: 48 << 18,648

k

k ! n

Page 53: Alborz

Sparsity of featuresSparsity: Only features are active at any given moment.

Acrobot [Sutton 96]: 48 << 18,648

Card game [Bowling et al. 02]: 3 << 10^6

k

k ! n

Page 54: Alborz

Sparsity of featuresSparsity: Only features are active at any given moment.

Acrobot [Sutton 96]: 48 << 18,648

Card game [Bowling et al. 02]: 3 << 10^6

Keep away soccer [Stone et al. 05]: 416 << 10^4

k

k ! n

Page 55: Alborz

Temporal Difference Learning TD(0)

Page 56: Alborz

Temporal Difference Learning TD(0)

(π) a,r

[Sutton 88]

st+1st

Page 57: Alborz

Tabular Representation

Temporal Difference Learning TD(0)

(π) a,r

[Sutton 88]

st+1st

!t = rt + "V (st+1)! V (st).Vt+1(st) = Vt(st) + !t"t

Page 58: Alborz

Tabular Representation

Temporal Difference Learning TD(0)

(π) a,r

[Sutton 88]

st+1st

!t+1 = !t + !t"(st)"t(V )

!t = rt + "V (st+1)! V (st).

Linear Function Approximation

Vt+1(st) = Vt(st) + !t"t

Page 59: Alborz

TD(0) Properties

Computational complexity

Data inefficient

Only last transition

O(k) per time step

Page 60: Alborz

TD(0) Properties

Computational complexity

Data inefficient

Only last transition

O(k) per time stepConstant

Page 61: Alborz

Least-Squares TD (LSTD)Sum of TD updates

Page 62: Alborz

Least-Squares TD (LSTD)[Bradtke, Barto 96]Sum of TD updates

Page 63: Alborz

Least-Squares TD (LSTD)[Bradtke, Barto 96]

µt(!) =t!

i=1

"i!i(V!)

Sum of TD updates

Page 64: Alborz

Least-Squares TD (LSTD)[Bradtke, Barto 96]

=t!

i=1

!iri+1

" #$ %bt

!t!

i=1

!i(!i ! !!i+1)T

" #$ %At

"

µt(!) =t!

i=1

"i!i(V!)

Sum of TD updates

Page 65: Alborz

Least-Squares TD (LSTD)[Bradtke, Barto 96]

=t!

i=1

!iri+1

" #$ %bt

!t!

i=1

!i(!i ! !!i+1)T

" #$ %At

"

µt(!) =t!

i=1

"i!i(V!)

= bt !At!

Sum of TD updates

Page 66: Alborz

Least-Squares TD (LSTD)[Bradtke, Barto 96]

=t!

i=1

!iri+1

" #$ %bt

!t!

i=1

!i(!i ! !!i+1)T

" #$ %At

"

µt(!) =t!

i=1

"i!i(V!)

= bt !At!

µt(!) = 0 ! = A!1b

Sum of TD updates

Page 67: Alborz

LSTD PropertiesComputational complexity

Data efficient

Look through all data

O(n2) per time step

Page 68: Alborz

LSTD PropertiesComputational complexity

Data efficient

Look through all data

O(n2) per time stepQuadratic

Page 69: Alborz

Outline

Page 70: Alborz

OutlineMotivation

Introduction

The New Approach

Dimension Selection

Conclusion

Page 71: Alborz

The New Approach

Page 72: Alborz

The New Approach

A and b matrices change on each iteration.

Page 73: Alborz

The New Approach

A and b matrices change on each iteration.

µt(!) =t!

i=1

"iri+1

" #$ %bt

!t!

i=1

"i("i ! !"i+1)T

" #$ %At

!

Page 74: Alborz

The New Approach

A and b matrices change on each iteration.

µt(!) =t!

i=1

"iri+1

" #$ %bt

!t!

i=1

"i("i ! !"i+1)T

" #$ %At

!

bt = bt!1 + rt!t!"#$!bt

At = At!1 + !t(!t ! !!t+1)T

! "# $!At

Page 75: Alborz

Incremental LSTD

µt(!) = bt !At!

Page 76: Alborz

Incremental LSTD

[Geramifard, Bowling, Sutton 06]

µt(!) = bt !At!

Page 77: Alborz

Fixed θ

Fixed A and b

Incremental LSTD

[Geramifard, Bowling, Sutton 06]

µt(!) = bt !At!

Page 78: Alborz

Fixed θ

Fixed A and b

Incremental LSTD

[Geramifard, Bowling, Sutton 06]

µt(!) = µt!1(!) + !bt ! (!At)!.

µt(!) = bt !At!

Page 79: Alborz

Fixed θ

Fixed A and b

Incremental LSTD

[Geramifard, Bowling, Sutton 06]

µt(!t+1) = µt(!t)!At(!!t).

µt(!) = µt!1(!) + !bt ! (!At)!.

µt(!) = bt !At!

Page 80: Alborz

iLSTDµt(!t+1) = µt(!t)!At(!!t).

Page 81: Alborz

How to change θ ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

Page 82: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!)

Page 83: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!)

Page 84: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!)

Page 85: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!)

n

Page 86: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!) O(n2)

n

Page 87: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!) O(n2)

Page 88: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!) O(n2)

Page 89: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!) O(n2)

Page 90: Alborz

How to change θ ?

Descent in the direction of ?

iLSTDµt(!t+1) = µt(!t)!At(!!t).

µt(!) O(n2)

Page 91: Alborz

possible feature selection mechanisms in further detail and compare the performance of

various methods. For simplicity, throughout the rest of the thesis, unless specified the fea-

ture selection method will be non-zero random: we discard dimensions with zero sum TD

update and select one randomly out of the remaining dimensions.

3.3 Algorithm

Algorithm 5 shows the iLSTD algorithm together with the computational complexity of

important lines. After setting the initial values, the agent begins interacting with the envi-

ronment. A and µ are computed incrementally in Lines 5–8 according to Equations 3.1

and 3.21. It is followed by updating m selected features in Lines 9–13. For each of the

m updates, one feature is selected through a component selection mechanism. After the

update, µ is recomputed accordingly. This recomputation can affect the next component

chosen.

Algorithm 5: iLSTD Complexity0 s! s0, A! 0, µ! 0, t! 01 Initialize ! arbitrarily2 repeat3 Take action according to ! and observe r, s!

4 t! t + 15 !b! "(s)r O(k)6 !A! "(s)("(s)" ""(s!))T O(k2)7 A! A + !A O(k2)8 µ! µ + !b" (!A)! O(k2)9 for i from 1 to m do10 j ! choose an index of µ using a dimension selection mechanism11 #j ! #j + $µj O(1)12 µ! µ" $µjAej O(n)13 end for14 s! s!

15 end repeat

3.3.1 Time Analysis of iLSTD

We now examine the time complexity of the iLSTD algorithm.1b is not computed because it is implicitly included in µ.

22

iLSTD Algorithm

Page 92: Alborz

possible feature selection mechanisms in further detail and compare the performance of

various methods. For simplicity, throughout the rest of the thesis, unless specified the fea-

ture selection method will be non-zero random: we discard dimensions with zero sum TD

update and select one randomly out of the remaining dimensions.

3.3 Algorithm

Algorithm 5 shows the iLSTD algorithm together with the computational complexity of

important lines. After setting the initial values, the agent begins interacting with the envi-

ronment. A and µ are computed incrementally in Lines 5–8 according to Equations 3.1

and 3.21. It is followed by updating m selected features in Lines 9–13. For each of the

m updates, one feature is selected through a component selection mechanism. After the

update, µ is recomputed accordingly. This recomputation can affect the next component

chosen.

Algorithm 5: iLSTD Complexity0 s! s0, A! 0, µ! 0, t! 01 Initialize ! arbitrarily2 repeat3 Take action according to ! and observe r, s!

4 t! t + 15 !b! "(s)r O(k)6 !A! "(s)("(s)" ""(s!))T O(k2)7 A! A + !A O(k2)8 µ! µ + !b" (!A)! O(k2)9 for i from 1 to m do10 j ! choose an index of µ using a dimension selection mechanism11 #j ! #j + $µj O(1)12 µ! µ" $µjAej O(n)13 end for14 s! s!

15 end repeat

3.3.1 Time Analysis of iLSTD

We now examine the time complexity of the iLSTD algorithm.1b is not computed because it is implicitly included in µ.

22

iLSTD Algorithm

Page 93: Alborz

possible feature selection mechanisms in further detail and compare the performance of

various methods. For simplicity, throughout the rest of the thesis, unless specified the fea-

ture selection method will be non-zero random: we discard dimensions with zero sum TD

update and select one randomly out of the remaining dimensions.

3.3 Algorithm

Algorithm 5 shows the iLSTD algorithm together with the computational complexity of

important lines. After setting the initial values, the agent begins interacting with the envi-

ronment. A and µ are computed incrementally in Lines 5–8 according to Equations 3.1

and 3.21. It is followed by updating m selected features in Lines 9–13. For each of the

m updates, one feature is selected through a component selection mechanism. After the

update, µ is recomputed accordingly. This recomputation can affect the next component

chosen.

Algorithm 5: iLSTD Complexity0 s! s0, A! 0, µ! 0, t! 01 Initialize ! arbitrarily2 repeat3 Take action according to ! and observe r, s!

4 t! t + 15 !b! "(s)r O(k)6 !A! "(s)("(s)" ""(s!))T O(k2)7 A! A + !A O(k2)8 µ! µ + !b" (!A)! O(k2)9 for i from 1 to m do10 j ! choose an index of µ using a dimension selection mechanism11 #j ! #j + $µj O(1)12 µ! µ" $µjAej O(n)13 end for14 s! s!

15 end repeat

3.3.1 Time Analysis of iLSTD

We now examine the time complexity of the iLSTD algorithm.1b is not computed because it is implicitly included in µ.

22

iLSTD Algorithm

Page 94: Alborz

iLSTD

Per-time-step computational complexity

More data efficient than TD

O(mn + k2)

Page 95: Alborz

iLSTD

Per-time-step computational complexity

More data efficient than TD

Number of iterations per time step

O(mn + k2)

Page 96: Alborz

iLSTD

Per-time-step computational complexity

More data efficient than TD

Number of iterations per time step

Number of features

O(mn + k2)

Page 97: Alborz

iLSTD

Per-time-step computational complexity

More data efficient than TD

Number of iterations per time step

Number of features

Maximum number of active features

O(mn + k2)

Page 98: Alborz

iLSTD

Per-time-step computational complexity

More data efficient than TD

Number of iterations per time step

Number of features

Maximum number of active features

O(mn + k2) Linear

Page 99: Alborz

iLSTD Theorem : iLSTD converges with

probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which is non-zero are selected in the limit an infinite number of times.

µt

Page 100: Alborz

Empirical Results

Page 101: Alborz

Settings

Page 102: Alborz

SettingsAveraged over 30 runs

Same random seed for all methods

Sparse matrix representation

iLSTD

Non-zero random selection

One descent per iteration

Page 103: Alborz

Boyan Chain

Page 104: Alborz

Boyan Chain

1 0234N 5

-3

-3

-3 -3 -3-3

-3 -3 -3 -3 -2 0

0.75

.

.

.

0

0.25

1

.

.

.

0

0

0

.

.

.

0

0

0.5

.

.

.

0

0.5

0.25

.

.

.

0

0.75

0

.

.

.

0

1

0

.

.

.

1

0

[Boyan 99]

Page 105: Alborz

Boyan Chain

1 0234N 5

-3

-3

-3 -3 -3-3

-3 -3 -3 -3 -2 0

0.75

.

.

.

0

0.25

1

.

.

.

0

0

0

.

.

.

0

0

0.5

.

.

.

0

0.5

0.25

.

.

.

0

0.75

0

.

.

.

0

1

0

.

.

.

1

0

[Boyan 99]

n = 4 (Small)n = 25 (Medium)n = 100 (Large)

Page 106: Alborz

Boyan Chain

1 0234N 5

-3

-3

-3 -3 -3-3

-3 -3 -3 -3 -2 0

0.75

.

.

.

0

0.25

1

.

.

.

0

0

0

.

.

.

0

0

0.5

.

.

.

0

0.5

0.25

.

.

.

0

0.75

0

.

.

.

0

1

0

.

.

.

1

0

[Boyan 99]

n = 4 (Small)n = 25 (Medium)n = 100 (Large)

k = 2

Page 107: Alborz

0 200 400 600 800 100010

−2

10−1

100

Episodes

RM

S o

f V

(s)

ove

r al

l sta

tes

TD

iLSTD LSTD

0 0.5 1 1.5 2 2.5 310

−2

10−1

100

101

102

Time(s)

RM

S o

f V

(s)

ove

r al

l sta

tes

TD

iLSTD

LSTD

Small Boyan Chain

Page 108: Alborz

0 200 400 600 800 100010

−2

10−1

100

Episodes

RM

S o

f V

(s)

ove

r al

l sta

tes

TD

iLSTD LSTD

0 0.5 1 1.5 2 2.5 310

−2

10−1

100

101

102

Time(s)

RM

S o

f V

(s)

ove

r al

l sta

tes

TD

iLSTD

LSTD

Small Boyan Chain

Page 109: Alborz

0 200 400 600 800 100010

−1

100

101

102

Episodes

RM

S o

f V

(s)

ove

r al

l sta

tes

TDiLSTD

LSTD

TD

iLSTD

LSTD

0 5 10 15 2010

−1

100

101

102

103

Time(s)

RM

S o

f V

(s)

ove

r al

l sta

tes

Medium Boyan Chain

Page 110: Alborz

0 200 400 600 800 100010

−1

100

101

102

Episodes

RM

S o

f V

(s)

ove

r al

l sta

tes

TDiLSTD

LSTD

TD

iLSTD

LSTD

0 5 10 15 2010

−1

100

101

102

103

Time(s)

RM

S o

f V

(s)

ove

r al

l sta

tes

Medium Boyan Chain

Page 111: Alborz

Large Boyan Chain

0 200 400 600 800 100010

−1

100

101

102

103

Episodes

RM

S o

f V

(s)

ove

r al

l sta

tes

TD

iLSTD

LSTD

TD

iLSTD

LSTD

0 20 40 60 80 10010

−1

100

101

102

103

Time(s)

RM

S o

f V

(s)

ove

r al

l sta

tes

Page 112: Alborz

Large Boyan Chain

0 200 400 600 800 100010

−1

100

101

102

103

Episodes

RM

S o

f V

(s)

ove

r al

l sta

tes

TD

iLSTD

LSTD

TD

iLSTD

LSTD

0 20 40 60 80 10010

−1

100

101

102

103

Time(s)

RM

S o

f V

(s)

ove

r al

l sta

tes

Page 113: Alborz

Large Boyan Chain

0 200 400 600 800 100010

−1

100

101

102

103

Episodes

RM

S o

f V

(s)

ove

r al

l sta

tes

TD

iLSTD

LSTD

TD

iLSTD

LSTD

0 20 40 60 80 10010

−1

100

101

102

103

Time(s)

RM

S o

f V

(s)

ove

r al

l sta

tes

Page 114: Alborz

Mountain Car

Page 115: Alborz

Tile codingn = 10,000k = 10

Mountain Car

Mountain CarPosition = -1 (Easy)Position = -.5 (Hard)

Goal

[For details see RL-Library]

Page 116: Alborz

Easy Mountain Car

TD

iLSTD

LSTD

0 200 400 600 800 100010

2

103

104

Episodes

Lo

ss F

un

ctio

n

TD

iLSTD

LSTD0 200 400 600 800 100010

4

105

106

Episode

Lo

ss

0 50 100 150 200 250 300 35010

4

105

106

107

Time (s)

Lo

ss TD

iLSTD

LSTD

Loss = ||b! !A!!||2

Page 117: Alborz

Hard Mountain Car

0 200 400 600 800 100010

5

106

107

Episode

Lo

ss

TD

iLSTD

LSTD

0 200 400 600 800 1000 120010

5

106

107

108

Time (s)

Lo

ss

TD

iLSTD

LSTD

Page 118: Alborz

Hard Mountain Car

0 200 400 600 800 100010

5

106

107

Episode

Lo

ss

TD

iLSTD

LSTD

0 200 400 600 800 1000 120010

5

106

107

108

Time (s)

Lo

ss

TD

iLSTD

LSTD

Page 119: Alborz

Running Time

0

31

62

93

124

155

Small

Medium Hard Ea

syHard

154

34

20097

000 55000

TDiLSTDLSTD

Mountain CarBoyan Chain

Tim

e (m

s)ConstantLinearExponential

Page 120: Alborz

Running Time

0

31

62

93

124

155

Small

Medium Hard Ea

syHard

154

34

20097

000 55000

TDiLSTDLSTD

Mountain CarBoyan Chain

Tim

e (m

s)ConstantLinearExponential

Page 121: Alborz

Outline

Page 122: Alborz

Motivation

Introduction

The New Approach

Dimension Selection

Conclusion

Outline

Page 123: Alborz

Dimension Selection

Random ✓

Page 124: Alborz

Dimension Selection

Random ✓

Greedy

Page 125: Alborz

Dimension Selection

Random ✓

Greedy

ε-Greedy

Page 126: Alborz

Dimension Selection

Random ✓

Greedy

ε-Greedy

Boltzmann

Page 127: Alborz

Greedy Dimension SelectionPick the one with highest value of

|µt(i)|

Page 128: Alborz

Greedy Dimension SelectionPick the one with highest value of

Not proven to converge.

|µt(i)|

Page 129: Alborz

ε-Greedy Dimension Selection

ε : Non-Zero Random

(1-ε) : Greedy

Page 130: Alborz

ε-Greedy Dimension Selection

ε : Non-Zero Random

(1-ε) : Greedy

Convergence proof applies.

Page 131: Alborz

Boltzmann Component Selection

Boltzmann Distribution + Non-Zero Random

Page 132: Alborz

Boltzmann Component Selection

Boltzmann Distribution + Non-Zero Random

Page 133: Alborz

Boltzmann Component Selection

Boltzmann Distribution + Non-Zero Random

! !m

Page 134: Alborz

Boltzmann Component Selection

Boltzmann Distribution + Non-Zero Random

! !m

Boltzmann Distribution

Page 135: Alborz

Boltzmann Component Selection

Boltzmann Distribution + Non-Zero Random

Convergence proof applies.

! !m

Boltzmann Distribution

Page 136: Alborz

ε-Greedy: ε =.1

Boltzmann: ψ = 10^-9, τ = 1

Empirical Results

Page 137: Alborz

ε-Greedy: ε =.1

Boltzmann: ψ = 10^-9, τ = 1

Small Medium Large Easy Hard10

−2

100

102

104

106

108

Boyan Chain Mountain Car

Err

or

Mea

sure

RandomGreedy!−GreedyBoltzmann

Empirical Results

Page 138: Alborz

ε-Greedy: ε =.1

Boltzmann: ψ = 10^-9, τ = 1

Small Medium Large Easy Hard10

−2

100

102

104

106

108

Boyan Chain Mountain Car

Err

or

Mea

sure

RandomGreedy!−GreedyBoltzmann

Empirical Results

Page 139: Alborz

ε-Greedy: ε =.1

Boltzmann: ψ = 10^-9, τ = 1

Small Medium Large Easy Hard10

−2

100

102

104

106

108

Boyan Chain Mountain Car

Err

or

Mea

sure

RandomGreedy!−GreedyBoltzmann

Empirical Results

Page 140: Alborz

0

2.2

4.4

6.6

8.8

11.0

Smal

l

Med

ium

Large

Easy

Har

d

Random Greedy !-greedy Boltzmann

Boyan chain Mountain car

Clo

ck

tim

e/st

ep

(m

s)

Running Time

Page 141: Alborz

Outline

Page 142: Alborz

Motivation

Introduction

The New Approach

Dimension Selection

Conclusion

Outline

Page 143: Alborz

Conclusion

Page 144: Alborz

Conclusion

Speed

Dat

a Effi

cien

cyLSTD

TD

Page 145: Alborz

Conclusion

Speed

Dat

a Effi

cien

cyLSTD

TD

iLSTD

Page 146: Alborz

Conclusion

Speed

Dat

a Effi

cien

cyLSTD

TD

iLSTD

Page 147: Alborz

Conclusion

Speed

Dat

a Effi

cien

cyLSTD

TD

iLSTD

Page 148: Alborz

Conclusion

Speed

Dat

a Effi

cien

cyLSTD

TD

iLSTD

Page 149: Alborz

Conclusion

Speed

Dat

a Effi

cien

cyLSTD

TD

iLSTD

No learning rate!

Page 150: Alborz
Page 151: Alborz

Questions ?

Page 152: Alborz

Questions ...

Page 153: Alborz

Questions ...What if someone uses batch-LSTD?

Page 154: Alborz

Questions ...What if someone uses batch-LSTD?

Why iLSTD takes simple descent?

Page 155: Alborz

Questions ...What if someone uses batch-LSTD?

Why iLSTD takes simple descent?

Hmm ... What about control?

Page 156: Alborz

Thanks ...