DESCRIPTION
Knowledge Diffusion Network, Visiting Lecturer Program (114)
Speaker: Alborz Geramifard, Ph.D. Candidate, Department of Computing Science, University of Alberta, Edmonton, Canada
Title: Incremental Least-Squares Temporal Difference Learning
Time: Tuesday, Sep 11, 2007, 12:00-1:00 pm
Location: Department of Computer Engineering, Sharif University of Technology, Tehran
TRANSCRIPT
Incremental Least-Squares Temporal Difference Learning
Alborz Geramifard, September 2007
Acknowledgement
Richard Sutton, Michael Bowling, Martin Zinkevich
Agencia
The king of Agencia: generous, wise, kind, but forgetful.
Agencia's affairs: agriculture, politics, research, ...
Its neighbor: Envirocia.
The Ambassador of Envirocia, to the Vazir of Agencia:
"Your king is generous, wise, and kind, but he is forgetful. Sometimes he makes decisions which are not wise in light of our earlier conversations."
Having consultants for various subjects might be the answer ...
Consultants
The Ambassador of Envirocia, to the Vazir of Agencia:
"I should admit that recently the king makes decent judgments, but as the number of our meetings increases, it takes a while for me to hear back from his majesty ..."
Agencia and Envirocia
Reinforcement Learning Framework
[Diagram: the Agent (Agencia) sends an Action to the Environment (Envirocia), which returns an Observation and a Reward; the agent pursues a Goal.]
More Complicated State Space
More Complicated Tasks
Summary
[Diagram: speed vs. data efficiency. Part I, the King alone, corresponds to TD: fast but data inefficient. Part II, the King plus Consultants, corresponds to LSTD: data efficient but slow. The question mark between them is the subject of this talk.]
Contributions
iLSTD: A new policy evaluation algorithm
Extension with eligibility traces
Running time analysis
Dimension selection methods
Proof of convergence
Empirical results
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
Markov Decision Process
⟨S, A, P^a_{ss'}, R^a_{ss'}, γ⟩
[Diagram: a student-life MDP with states B.S., M.S., Ph.D. and actions Studying and Working; edges carry transition probabilities and rewards (e.g. 100%, +60; 80%, −50; 10%, −50; 70%, −200; 30%, −200; 10%, −300; 100%, +85; 100%, +120).]
Sample trajectory: B.S., Working, +60, B.S., Studying, −50, M.S., ...
Policy Evaluation
Policy Improvement
Notation
Scalar: regular face, e.g. V^π(s), r_{t+1}
Vector: bold lower case, e.g. μ(θ), φ(s)
Matrix: bold upper case, e.g. A_t, A
Policy Evaluation
V^π(s) = E[ Σ_{t=1}^∞ γ^{t−1} r_t | s_0 = s, π ]
Linear Function Approximation
V(s) = θ · φ(s) = Σ_{i=1}^n θ_i φ_i(s)
Sparsity of Features
Sparsity: only k of the n features are active at any given moment, with k ≪ n.
Acrobot [Sutton 96]: 48 ≪ 18,648
Card game [Bowling et al. 02]: 3 ≪ 10^6
Keepaway soccer [Stone et al. 05]: 416 ≪ 10^4
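Sparsity is what makes the per-step costs quoted later possible: evaluating V(s) = θ · φ(s) only needs to touch the k active entries of θ. A minimal sketch, assuming binary features represented by their active indices (the `evaluate` helper and the toy numbers are illustrative, not from the talk):

```python
import numpy as np

def evaluate(theta, active):
    """V(s) = theta . phi(s) for binary sparse features.

    `active` holds the indices of the k nonzero (unit-valued) features,
    so the dot product costs O(k) instead of O(n)."""
    return sum(theta[j] for j in active)

# Hypothetical example: n = 8 features, only k = 2 active in this state.
theta = np.zeros(8)
theta[1], theta[5] = 0.5, 0.25
print(evaluate(theta, [1, 5]))  # prints 0.75
```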
Temporal Difference Learning, TD(0) [Sutton 88]
[Diagram: following policy π, the agent observes a, r on each transition from s_t to s_{t+1}.]
δ_t = r_t + γ V(s_{t+1}) − V(s_t)
Tabular representation:  V_{t+1}(s_t) = V_t(s_t) + α_t δ_t
Linear function approximation:  θ_{t+1} = θ_t + α_t φ(s_t) δ_t
TD(0) Properties
Data inefficient: uses only the last transition.
Computational complexity: O(k) per time step (constant in n).
Least-Squares TD (LSTD) [Bradtke, Barto 96]
Sum of TD updates:
μ_t(θ) = Σ_{i=1}^t φ_i δ_i(V_θ)
       = Σ_{i=1}^t φ_i r_{i+1}  −  ( Σ_{i=1}^t φ_i (φ_i − γ φ_{i+1})^T ) θ
       =          b_t           −                  A_t θ
Setting μ_t(θ) = 0 gives θ = A^{−1} b.
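The batch computation can be sketched directly from the equation above: accumulate b_t and A_t over all transitions, then solve the linear system. A minimal illustration (data format and function name are assumptions of this sketch):

```python
import numpy as np

def lstd(transitions, n, gamma):
    """Batch LSTD: accumulate b_t and A_t over all transitions, then
    solve A theta = b.  `transitions` is a list of (phi, r, phi_next),
    with phi_next a zero vector at episode end."""
    A = np.zeros((n, n))
    b = np.zeros(n)
    for phi, r, phi_next in transitions:
        b += phi * r
        A += np.outer(phi, phi - gamma * phi_next)
    return np.linalg.solve(A, b)
```

Maintaining A incrementally and solving (or keeping A^{−1} via Sherman-Morrison) is what gives LSTD its O(n²) per-step cost.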
LSTD Properties
Data efficient: looks through all the data.
Computational complexity: O(n²) per time step (quadratic in n).
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
The New Approach
The A and b matrices change on each iteration:
μ_t(θ) = Σ_{i=1}^t φ_i r_{i+1} − ( Σ_{i=1}^t φ_i (φ_i − γ φ_{i+1})^T ) θ = b_t − A_t θ
But the per-step changes are cheap to compute:
b_t = b_{t−1} + r_t φ_t,  with  Δb_t = r_t φ_t
A_t = A_{t−1} + φ_t (φ_t − γ φ_{t+1})^T,  with  ΔA_t = φ_t (φ_t − γ φ_{t+1})^T
Incremental LSTD [Geramifard, Bowling, Sutton 06]
μ_t(θ) = b_t − A_t θ
With θ fixed (new data arrives):  μ_t(θ) = μ_{t−1}(θ) + Δb_t − (ΔA_t) θ
With A and b fixed (θ changes):  μ_t(θ_{t+1}) = μ_t(θ_t) − A_t (Δθ_t)
iLSTD
μ_t(θ_{t+1}) = μ_t(θ_t) − A_t (Δθ_t)
How should θ change? Descending in the direction of μ_t(θ) across all n dimensions would cost O(n²) per time step; iLSTD instead descends along a few selected dimensions.
iLSTD Algorithm
possible feature selection mechanisms in further detail and compare the performance of various methods. For simplicity, throughout the rest of the thesis, unless specified, the feature selection method will be non-zero random: we discard dimensions with zero sum TD update and select one randomly out of the remaining dimensions.
3.3 Algorithm
Algorithm 5 shows the iLSTD algorithm together with the computational complexity of its important lines. After setting the initial values, the agent begins interacting with the environment. A and μ are computed incrementally in Lines 5-8 according to Equations 3.1 and 3.2 (b is not computed explicitly because it is implicitly included in μ). This is followed by updating m selected features in Lines 9-13. For each of the m updates, one feature is selected through a component selection mechanism. After the update, μ is recomputed accordingly. This recomputation can affect the next component chosen.
Algorithm 5: iLSTD
0   s ← s_0, A ← 0, μ ← 0, t ← 0
1   Initialize θ arbitrarily
2   repeat
3     Take action according to π and observe r, s'
4     t ← t + 1
5     Δb ← φ(s) r                                                O(k)
6     ΔA ← φ(s) (φ(s) − γ φ(s'))^T                               O(k²)
7     A ← A + ΔA                                                 O(k²)
8     μ ← μ + Δb − (ΔA) θ                                        O(k²)
9     for i from 1 to m do
10      j ← choose an index of μ using a dimension selection mechanism
11      θ_j ← θ_j + α μ_j                                        O(1)
12      μ ← μ − α μ_j A e_j                                      O(n)
13    end for
14    s ← s'
15  end repeat
3.3.1 Time Analysis of iLSTD
We now examine the time complexity of the iLSTD algorithm.
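Algorithm 5 can be sketched directly in code. This is an illustrative dense-numpy version with non-zero random selection (a sparse representation would be needed to realize the stated O(k²) costs; function name and argument layout are my own):

```python
import numpy as np

def ilstd_step(A, mu, theta, phi, phi_next, r, gamma, alpha, m, rng):
    """One time step of iLSTD (sketch of Algorithm 5).

    Updates A and mu incrementally (Lines 5-8), then performs m
    single-dimension descent updates (Lines 9-13) using non-zero
    random dimension selection. Arrays are modified in place."""
    dA = np.outer(phi, phi - gamma * phi_next)   # Line 6
    A += dA                                      # Line 7
    mu += phi * r - dA @ theta                   # Line 8: mu += db - (dA) theta
    for _ in range(m):                           # Lines 9-13
        nz = np.flatnonzero(mu)                  # non-zero random selection
        if nz.size == 0:
            break
        j = rng.choice(nz)
        step = alpha * mu[j]
        theta[j] += step                         # Line 11
        mu -= step * A[:, j]                     # Line 12: mu -= alpha mu_j A e_j
    return A, mu, theta
```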
iLSTD Properties
More data efficient than TD.
Per-time-step computational complexity: O(mn + k²), linear in n, where m is the number of descent iterations per time step, n is the number of features, and k is the maximum number of active features.
Theorem: iLSTD converges with probability one to the same solution as TD, under the usual step-size conditions, for any dimension selection method such that all dimensions for which μ_t is non-zero are selected in the limit an infinite number of times.
Empirical Results
Settings
Averaged over 30 runs
Same random seed for all methods
Sparse matrix representation
iLSTD: non-zero random selection, one descent per iteration
Boyan Chain [Boyan 99]
[Diagram: a chain of states N, ..., 5, 4, 3, 2, 1, 0; each transition yields reward −3, except the transition from state 1 into the absorbing state 0, which yields −2. Features interpolate linearly between anchor states, with feature vectors such as (1, 0, ..., 0), (0.75, 0.25, ..., 0), (0.5, 0.5, ..., 0), (0.25, 0.75, ..., 0), (0, 1, ..., 0).]
Three sizes: n = 4 (Small), n = 25 (Medium), n = 100 (Large)
k = 2
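For concreteness, an episode of the chain can be sketched as follows, assuming the jump structure of [Boyan 99] (from any state above 1, move down 1 or 2 states with equal probability); the function name is illustrative:

```python
import numpy as np

def boyan_episode(N, rng):
    """Generate one episode of the Boyan chain as (s, r, s') triples.

    From state s > 1: jump to s-1 or s-2 with equal probability, reward -3.
    From state 1: move deterministically to the absorbing state 0, reward -2."""
    s, traj = N, []
    while s > 0:
        if s > 1:
            s_next, r = s - rng.integers(1, 3), -3.0
        else:
            s_next, r = 0, -2.0
        traj.append((s, r, s_next))
        s = s_next
    return traj
```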
Small Boyan Chain
[Figure: RMS error of V(s) over all states, log scale, vs. episodes (0-1000) and vs. time (0-3 s), for TD, iLSTD, and LSTD.]
Medium Boyan Chain
[Figure: RMS error of V(s) over all states, log scale, vs. episodes (0-1000) and vs. time (0-20 s), for TD, iLSTD, and LSTD.]
Large Boyan Chain
[Figure: RMS error of V(s) over all states, log scale, vs. episodes (0-1000) and vs. time (0-100 s), for TD, iLSTD, and LSTD.]
Mountain Car
Tile coding: n = 10,000, k = 10
Two starting positions: Position = −1 (Easy), Position = −0.5 (Hard)
[For details see RL-Library]
Easy Mountain Car
[Figure: loss, log scale, vs. episodes (0-1000) and vs. time (0-350 s), for TD, iLSTD, and LSTD.]
Loss = ‖b_t − A_t θ_t‖²
Hard Mountain Car
[Figure: loss, log scale, vs. episodes (0-1000) and vs. time (0-1200 s), for TD, iLSTD, and LSTD.]
Running Time
[Bar chart: time (ms) per episode for TD, iLSTD, and LSTD on the Boyan chain and mountain car tasks (Small, Medium, Easy, Hard); TD stays roughly constant, iLSTD grows linearly, and LSTD grows exponentially with problem size, with on-chart bars such as 34, 154, and 200 ms and LSTD reaching values around 97,000 and 55,000 ms off-chart.]
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
Dimension Selection
Random ✓
Greedy
ε-Greedy
Boltzmann
Greedy Dimension Selection
Pick the dimension with the highest value of |μ_t(i)|.
Not proven to converge.
ε-Greedy Dimension Selection
With probability ε: non-zero random; with probability (1−ε): greedy.
Convergence proof applies.
Boltzmann Component Selection
Boltzmann distribution (with temperature τ and threshold ψ) + non-zero random.
Convergence proof applies.
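The four mechanisms above can be sketched in one function. This is an illustrative version (the exact role of the threshold ψ and the form of the Boltzmann weights are assumptions of this sketch, not taken from the talk):

```python
import numpy as np

def select_dimension(mu, method, rng, eps=0.1, tau=1.0, psi=1e-9):
    """Sketch of the dimension selection mechanisms discussed above.

    `psi` discards near-zero entries of mu (non-zero filtering);
    returns the chosen index, or None if every entry is (near) zero."""
    nz = np.flatnonzero(np.abs(mu) > psi)
    if nz.size == 0:
        return None
    if method == "random":            # non-zero random
        return rng.choice(nz)
    if method == "greedy":            # largest |mu_i|
        return nz[np.argmax(np.abs(mu[nz]))]
    if method == "eps-greedy":        # mix of the two above
        if rng.random() < eps:
            return rng.choice(nz)
        return nz[np.argmax(np.abs(mu[nz]))]
    if method == "boltzmann":         # probability proportional to exp(|mu_i|/tau)
        p = np.exp(np.abs(mu[nz]) / tau)
        return rng.choice(nz, p=p / p.sum())
    raise ValueError(method)
```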
Empirical Results
ε-Greedy: ε = .1; Boltzmann: ψ = 10^-9, τ = 1
[Bar chart: final error measure, log scale (10^-2 to 10^8), on the Boyan chain (Small, Medium, Large) and mountain car (Easy, Hard) for Random, Greedy, ε-Greedy, and Boltzmann dimension selection.]
Running Time
[Bar chart: clock time per step (ms), 0 to 11, on the Boyan chain (Small, Medium, Large) and mountain car (Easy, Hard) for Random, Greedy, ε-Greedy, and Boltzmann dimension selection.]
Outline
Motivation
Introduction
The New Approach
Dimension Selection
Conclusion
Conclusion
[Diagram: speed vs. data efficiency. TD is fast but data inefficient; LSTD is data efficient but slow (and needs no learning rate); iLSTD fills the gap between them, combining data efficiency with low per-step cost.]
Questions ...
What if someone uses batch LSTD?
Why does iLSTD take simple descent steps?
Hmm ... what about control?
Thanks ...