TRANSCRIPT
CSE 190: Reinforcement Learning: An Introduction
Chapter 7: Eligibility Traces
Acknowledgment: A good number of these slides are cribbed from Rich Sutton.
CSE 190: Reinforcement Learning, Lecture on Chapter 7
The Book: Where we are and where we're going
• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem
• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning
• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies
Chapter 7: Eligibility Traces
Simple Monte Carlo

[Backup diagram: a complete episode followed from s_t all the way to the terminal state T, yielding the return R_t]

V(s_t) ← V(s_t) + α[R_t − V(s_t)]

where R_t is the actual return following state s_t.
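The Monte Carlo update above can be sketched in a few lines of Python. This is a minimal tabular sketch, not from the slides: the dict-based value table, state names, and reward list are hypothetical stand-ins.

```python
def returns_from_rewards(rewards, gamma):
    """Compute the actual return R_t = r_{t+1} + gamma*r_{t+2} + ... for every t,
    by sweeping backward through the episode's rewards."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def mc_update(V, states, returns, alpha):
    """Monte Carlo update: V(s_t) <- V(s_t) + alpha * (R_t - V(s_t))."""
    for s, R in zip(states, returns):
        v = V.get(s, 0.0)
        V[s] = v + alpha * (R - v)
    return V
```

Note that the update can only run after the episode ends, since R_t needs every subsequent reward.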
Simplest TD Method

[Backup diagram: a single transition from s_t to s_{t+1} with reward r_{t+1}; the estimate V(s_{t+1}) stands in for the rest of the episode]

V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]
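The one-step TD update can likewise be sketched in Python. Again a hypothetical tabular setup: a dict value table with `None` marking termination.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0): V(s_t) <- V(s_t) + alpha*[r_{t+1} + gamma*V(s_{t+1}) - V(s_t)].
    Pass s_next=None at the end of the episode (the terminal value is 0)."""
    v = V.get(s, 0.0)
    target = r + (gamma * V.get(s_next, 0.0) if s_next is not None else 0.0)
    V[s] = v + alpha * (target - v)
    return V
```

Unlike the Monte Carlo update, this can run immediately after each transition.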
Is there something in between?
[Backup diagrams for the intermediate cases: 2-step, 3-step, ..., n-step backups lying between TD(0) and Monte Carlo]
N-step TD Prediction
• Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)
Mathematics of N-step TD Prediction

• The Monte Carlo target is based on the actual return, R_t, from time t to the end of the episode at time T (where T denotes the final step):

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^(T−t−1) r_T

• The TD target is based on one actual reward, plus the estimate of future returns from the next state; i.e., use V to estimate the remaining return:

  R_t^(1) = r_{t+1} + γ V_t(s_{t+1})
Mathematics of N-step TD Prediction

• It is a little more obvious if we line them up:

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^(T−t−1) r_T
  R_t^(1) = r_{t+1} + γ V_t(s_{t+1})

• Now it is clear that V "stands in" for the future rewards; indeed, that's its job: prediction!
Mathematics of N-step TD Prediction

• It is a little more obvious if we line them up:

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^(T−t−1) r_T
  R_t^(1) = r_{t+1} + γ V_t(s_{t+1})
  R_t^(2) = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})
  R_t^(n) = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^(n−1) r_{t+n} + γ^n V_t(s_{t+n})

  (In each line, the reward terms are data; the final V term is an estimate.)
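The n-step return lined up above can be computed directly. A sketch under the assumption that `rewards[t]` holds r_{t+1} and `states[t]` holds s_t; the episode layout is hypothetical.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """R_t^(n): up to n rewards of actual data, plus gamma^n * V_t(s_{t+n})
    as the estimate of everything beyond. Episode length T = len(rewards)."""
    T = len(rewards)
    G = sum(gamma**k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:                       # bootstrap only if s_{t+n} is non-terminal
        G += gamma**n * V[states[t + n]]
    return G
```

When n reaches past the end of the episode, the estimate term drops out and R_t^(n) becomes the Monte Carlo return R_t.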
Note the Notation

• n-step return:

  R_t^(n) = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^(n−1) r_{t+n} + γ^n V_t(s_{t+n})

• An "(n)" at the top of the "R" means n-step return (easy to miss!). R_t with no superscript is the full return, which is basically using the idea of an absorbing final state.
• R_t^(λ), with λ in the superscript, will denote the λ-return introduced below.
Learning with N-step Backups

• Backup (on-line or off-line):

  ΔV_t(s_t) = α[R_t^(n) − V_t(s_t)]

• Error reduction property of n-step returns:

  max_s |E_π{R_t^(n) | s_t = s} − V^π(s)| ≤ γ^n max_s |V(s) − V^π(s)|

  (maximum error using the n-step return is at most γ^n times the maximum error using V)

• Using this, you can show that n-step methods converge
Random Walk Examples
• How does 2-step TD work here?
• How about 3-step TD?
A Larger Example
• Task: 19-state random walk
• Each curve corresponds to the error after 10 episodes.
• Each curve uses a different n.
• The x-axis is the learning rate.
• Do you think there is an optimal n? For everything?
Averaging N-step Returns
• n-step methods were introduced to help with understanding TD(λ) (next!)
• Idea: back up an average of several returns
  • e.g., back up half of the 2-step return and half of the 4-step return:

  R_t^avg = ½ R_t^(2) + ½ R_t^(4)

• This is called a complex backup, but it counts as one backup
• To draw the backup diagram:
  • Draw each component
  • Label each with its weight
Forward View of TD(λ)
• TD(λ) is a method for averaging all n-step backups
  • weighted by λ^(n−1) (time since visitation)
• λ-return:

  R_t^λ = (1−λ) Σ_{n=1}^{∞} λ^(n−1) R_t^(n)

• Backup using the λ-return:

  ΔV_t(s_t) = α[R_t^λ − V_t(s_t)]
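The λ-return can be computed explicitly for a finished episode, which makes the averaging concrete. A sketch with the same hypothetical episode layout as before (`rewards[t]` = r_{t+1}, `states[t]` = s_t); the final return gets all the leftover weight, as on the weighting-function slide below.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """R_t^(n): n rewards of data plus gamma^n * V(s_{t+n}) if non-terminal."""
    T = len(rewards)
    G = sum(gamma**k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:
        G += gamma**n * V[states[t + n]]
    return G

def lambda_return(rewards, states, V, t, gamma, lam):
    """R_t^lambda = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) R_t^(n)
                    + lam^(T-t-1) * R_t   (full return takes the leftover weight)."""
    T = len(rewards)
    G = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, states, V, t, n, gamma)
            for n in range(1, T - t))
    return G + lam**(T - t - 1) * n_step_return(rewards, states, V, t, T - t, gamma)
```

Setting lam=0 recovers the one-step TD target and lam=1 recovers the Monte Carlo return, matching the special cases worked out on the "Relation to TD(0) and MC" slide.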
Forward View of TD(λ)

• It is easy to see that the weights sum to 1:
• Suppose we stopped after 2 steps (the last step is NOT weighted by (1−λ)): the sum is (1−λ) + λ = 1.
• 3 steps: (1−λ) + (1−λ)λ + λ² = (1−λ) + (λ − λ²) + λ² = 1. Etc.
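The pattern above can be checked numerically for any horizon. A small sketch; `horizon` stands for T − t, a name introduced here for illustration.

```python
def lambda_weight_sum(lam, horizon):
    """Sum of the lambda-return weights over an episode with horizon = T - t:
    (1-lam)*lam**(n-1) for n = 1..horizon-1, plus lam**(horizon-1) on the
    full return. The telescoping argument on the slide says this is always 1."""
    return (sum((1 - lam) * lam**(n - 1) for n in range(1, horizon))
            + lam**(horizon - 1))
```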
λ-return Weighting Function

  R_t^λ = (1−λ) Σ_{n=1}^{T−t−1} λ^(n−1) R_t^(n)  +  λ^(T−t−1) R_t

  (first term: until termination; second term: after termination)
Relation to TD(0) and MC
• The λ-return can be rewritten as:

  R_t^λ = (1−λ) Σ_{n=1}^{T−t−1} λ^(n−1) R_t^(n) + λ^(T−t−1) R_t   (until termination / after termination)

• If λ = 1, you get MC:

  R_t^λ = (1−1) Σ_{n=1}^{T−t−1} 1^(n−1) R_t^(n) + 1^(T−t−1) R_t = R_t

• If λ = 0, you get TD(0):

  R_t^λ = (1−0) Σ_{n=1}^{T−t−1} 0^(n−1) R_t^(n) + 0^(T−t−1) R_t = R_t^(1)   (because only 0⁰ = 1)
Forward View of TD(λ) II
• Look forward from each state to determine its update from future states and rewards:
λ-return on the Random Walk
• Same 19-state random walk as before
• Why do you think intermediate values of λ are best?
Backward View
• Shout δ_t backwards over time
• The strength of your voice decreases with temporal distance by γλ

  δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
Backward View of TD(λ)
• The forward view was for theory
• The backward view is for mechanism
• New variable called the eligibility trace, e_t(s) ∈ ℝ⁺
• On each step, decay all traces by γλ and increment the trace for the current state by 1:

  e_t(s) = γλ e_{t−1}(s)       if s ≠ s_t
  e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t

• This gives the equivalent weighting as the forward algorithm when used incrementally
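One step of this accumulating-trace update is tiny in code. A sketch using a hypothetical dict of traces keyed by state:

```python
def accumulate_trace(e, s_t, gamma, lam):
    """One step of the accumulating-trace update: decay every trace by
    gamma*lam, then add 1 to the trace of the just-visited state s_t."""
    for s in e:
        e[s] *= gamma * lam
    e[s_t] = e.get(s_t, 0.0) + 1.0
    return e
```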
Policy Evaluation Using TD(λ)

Initialize V(s) arbitrarily
Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)      /* Compute TD(λ) error */
        e(s) ← e(s) + 1            /* Increment trace for current state */
        For all s:                 /* Update ALL s's - really only the ones we have visited (e(s) = 0 for the others) */
            V(s) ← V(s) + αδe(s)   /* Incrementally update V(s) according to TD(λ) */
            e(s) ← γλe(s)          /* Decay trace by γλ */
        s ← s′                     /* Move to next state */
    Until s is terminal
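A direct transcription of this algorithm in Python. The environment callable `env_step`, the `policy` function, and the state list are hypothetical stand-ins for whatever problem is being evaluated; `None` marks termination.

```python
def td_lambda(env_step, policy, states, start, alpha, gamma, lam, episodes):
    """Tabular TD(lambda) policy evaluation with accumulating traces.
    env_step(s, a) -> (reward, next_state or None when the episode ends)."""
    V = {s: 0.0 for s in states}
    for _ in range(episodes):
        e = {s: 0.0 for s in states}        # e(s) = 0 for all s
        s = start
        while s is not None:
            a = policy(s)
            r, s2 = env_step(s, a)
            # TD error: delta <- r + gamma*V(s') - V(s), with V(terminal) = 0
            delta = r + (gamma * V[s2] if s2 is not None else 0.0) - V[s]
            e[s] += 1.0                     # increment trace for current state
            for x in states:                # update all states (e(x)=0 if unvisited)
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam         # decay trace
            s = s2
    return V
```

On a deterministic two-state chain where A leads to B with reward 0 and B terminates with reward 1 (γ = 1), both values converge to 1.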
Policy Evaluation Using TD(λ)

        For all s:                 /* Update ALL s's - really only the ones we have visited (e(s) = 0 for the others) */
            V(s) ← V(s) + αδe(s)   /* Incrementally update V(s) according to TD(λ) */
            e(s) ← γλe(s)          /* Decay trace by γλ */
        s ← s′                     /* Move to next state */
    Until s is terminal

What do each of these elements on the right-hand side mean?
Relation of Backward View to MC & TD(0)
• Using update rule:

  ΔV_t(s) = αδ_t e_t(s)

• As before, if you set λ to 0, you get TD(0)
• If you set λ to 1, you get MC, but in a better way
  • Can apply TD(1) to continuing tasks
  • Works incrementally and on-line (instead of waiting until the end of the episode)
• Again, by the time you get to the terminal state, these increments add up to the one you would have made via the forward (theoretical) view.
Forward View = Backward View

• The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
• The book shows:

  Backward updates:  Σ_{t=0}^{T−1} ΔV_t^TD(s) = Σ_{t=0}^{T−1} α I_{ss_t} Σ_{k=t}^{T−1} (γλ)^(k−t) δ_k

  Forward updates:   Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{ss_t} = Σ_{t=0}^{T−1} α I_{ss_t} Σ_{k=t}^{T−1} (γλ)^(k−t) δ_k   (algebra shown in book)

  Therefore:         Σ_{t=0}^{T−1} ΔV_t^TD(s) = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{ss_t}

  (I_{ss_t} is an indicator: 1 if s = s_t, and 0 otherwise)

• On-line updating with small α is similar
On-line versus Off-line on Random Walk
• Same 19-state random walk
• On-line performs better over a broader range of parameters
Control: Sarsa(λ)
• Standard control idea: learn Q values, so save eligibility for state-action pairs instead of just states:

  e_t(s,a) = γλ e_{t−1}(s,a) + 1   if s = s_t and a = a_t
  e_t(s,a) = γλ e_{t−1}(s,a)       otherwise

  Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
  δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        δ ← r + γQ(s′,a′) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s′; a ← a′
    Until s is terminal
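The Sarsa(λ) algorithm transcribed into Python. The environment callable `step`, integer state/action encoding, and the seeded RNG are hypothetical choices made for the sketch, not part of the slides.

```python
import random

def sarsa_lambda(n_states, n_actions, step, alpha, gamma, lam, eps,
                 episodes, start, seed=0):
    """Tabular Sarsa(lambda) with accumulating traces and eps-greedy actions.
    step(s, a) -> (reward, next_state or None when the episode ends)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    def eps_greedy(s):
        if rng.random() < eps:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda b: Q[s][b])
    for _ in range(episodes):
        e = [[0.0] * n_actions for _ in range(n_states)]
        s, a = start, eps_greedy(start)
        while s is not None:
            r, s2 = step(s, a)
            a2 = eps_greedy(s2) if s2 is not None else None
            delta = r + (gamma * Q[s2][a2] if s2 is not None else 0.0) - Q[s][a]
            e[s][a] += 1.0                  # increment trace for (s, a)
            for x in range(n_states):       # update all pairs, then decay traces
                for b in range(n_actions):
                    Q[x][b] += alpha * delta * e[x][b]
                    e[x][b] *= gamma * lam
            s, a = s2, a2
    return Q
```

On a one-state, two-action problem where action 0 pays 1 and action 1 pays 0, the learned Q values separate quickly.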
Sarsa(λ) Gridworld Example
• With one trial, the agent has much more information about how to get to the goal
  • not necessarily the best way
• Can considerably accelerate learning
Three Approaches to Q(λ)
• How can we extend this to Q-learning?
• Recall Q-learning is an off-policy method to learn Q*, and it uses the max of the Q values for a state in its backup
• What happens if we make an exploratory move? That is NOT the right thing to back up over then…
• What to do?
• Three answers: Watkins, Peng, and the "naïve" answer.
Three Approaches to Q(λ)
• What happens if we make an exploratory move? This is NOT the right thing to back up over then!
• Why not?
  • Suppose we are backing up the state-action pair (s_t, a_t) at time t.
  • Suppose that on the next two time steps the agent selects the greedy action, but on the third, at time t+3, the agent selects an exploratory, non-greedy action.
• In learning about the value of the greedy policy at (s_t, a_t), we can use subsequent experience only as long as the greedy policy is being followed.
• Thus, we can use the one-step and two-step returns, but not, in this case, the three-step return. The n-step returns for all n > 2 no longer have any necessary relationship to the greedy policy.
Three Approaches to Q(λ)
• If you mark every state-action pair as eligible, you back up over a non-greedy policy
• Watkins: zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice:

  e_t(s,a) = 1 + γλ e_{t−1}(s,a)   if s = s_t, a = a_t, and Q_{t−1}(s_t,a_t) = max_a Q_{t−1}(s_t,a)
  e_t(s,a) = 0                     if Q_{t−1}(s_t,a_t) ≠ max_a Q_{t−1}(s_t,a)
  e_t(s,a) = γλ e_{t−1}(s,a)       otherwise

  Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
  δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t)
Watkins's Q(λ)

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        a* ← argmax_b Q(s′,b)   (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′,a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a′ = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
        s ← s′; a ← a′
    Until s is terminal
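Watkins's Q(λ) in Python, following the pseudocode above: the TD error bootstraps from the greedy action a*, and traces are cut to zero as soon as the behavior policy deviates from greedy. The environment callable `step` and the seeded RNG are hypothetical choices for the sketch.

```python
import random

def watkins_q_lambda(n_states, n_actions, step, alpha, gamma, lam, eps,
                     episodes, start, seed=0):
    """Watkins's Q(lambda): cut (zero) all traces after any non-greedy action.
    step(s, a) -> (reward, next_state or None when the episode ends)."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    def eps_greedy(s):
        if rng.random() < eps:
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda b: Q[s][b])
    for _ in range(episodes):
        e = [[0.0] * n_actions for _ in range(n_states)]
        s = start
        a = eps_greedy(s)
        while s is not None:
            r, s2 = step(s, a)
            if s2 is None:
                a2, a_star, delta = None, None, r - Q[s][a]
            else:
                a2 = eps_greedy(s2)
                a_star = max(range(n_actions), key=lambda b: Q[s2][b])
                if Q[s2][a2] == Q[s2][a_star]:
                    a_star = a2             # if a' ties for the max, a* <- a'
                delta = r + gamma * Q[s2][a_star] - Q[s][a]
            e[s][a] += 1.0
            greedy = (a2 == a_star)         # did the behavior policy stay greedy?
            for x in range(n_states):
                for b in range(n_actions):
                    Q[x][b] += alpha * delta * e[x][b]
                    e[x][b] = gamma * lam * e[x][b] if greedy else 0.0
            s, a = s2, a2
    return Q
```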
Watkins's Q(λ)

Initialize s, a
Repeat (for each step of episode):
    Take action a, observe r, s′
    Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
    a* ← argmax_b Q(s′,b)   (if a′ ties for the max, then a* ← a′)   /* What is the point of this?? */
    δ ← r + γQ(s′,a*) − Q(s,a)   /* Compute TD error: using best action! */
    e(s,a) ← e(s,a) + 1
    For all s, a:
        Q(s,a) ← Q(s,a) + αδe(s,a)
        If a′ = a*, then e(s,a) ← γλe(s,a)   /* Note check! What is the point of this? */
        else e(s,a) ← 0
    s ← s′; a ← a′
Until s is terminal
Initialize s, a
Repeat (for each step of episode):
    Take action a, observe r, s′
    Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
    a* ← argmax_b Q(s′,b)   (if a′ ties for the max, then a* ← a′)   /* What is the point of this?? STAY ON GREEDY POLICY */
    δ ← r + γQ(s′,a*) − Q(s,a)   /* Compute TD error: using best action! */
    e(s,a) ← e(s,a) + 1
    For all s, a:
        Q(s,a) ← Q(s,a) + αδe(s,a)
        If a′ = a*, then e(s,a) ← γλe(s,a)   /* Note check! What is the point of this? */
        else e(s,a) ← 0   /* See above: if we choose the greedy action, then don't zero the trace! */
    s ← s′; a ← a′
Until s is terminal
Peng's Q(λ)
• Disadvantage of Watkins's method:
  • Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage to traces
• Peng:
  • Back up the max action except at the end
  • Never cut traces
• Disadvantage:
  • Complicated to implement
Naïve Q(λ)
• Idea: is it really a problem to back up exploratory actions?
  • Never zero traces
  • Always back up the max at the current action (unlike Peng's or Watkins's)
• Is this truly naïve?
  • Well, it works well in some empirical studies

What is the backup diagram?
Comparison Task

From McGovern and Sutton (1997), "Towards a Better Q(λ)"

• Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks.
  • See McGovern and Sutton (1997), "Towards a Better Q(λ)", for other tasks and results (stochastic tasks, continuing tasks, etc.)
• Deterministic gridworld with obstacles
  • 10x10 gridworld
  • 25 randomly generated obstacles
  • 30 runs
  • α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces
Comparison Results

From McGovern and Sutton (1997), "Towards a Better Q(λ)"
Convergence of the Q(λ)'s
• None of the methods is proven to converge.
  • Much extra credit if you can prove any of them.
• Watkins's is thought to converge to Q*
• Peng's is thought to converge to a mixture of Q^π and Q*
• Naïve: Q*?
Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• We change the update equation from:

  p_{t+1}(s,a) = p_t(s,a) + αδ_t   if a = a_t and s = s_t
  p_{t+1}(s,a) = p_t(s,a)          otherwise

to:

  p_{t+1}(s,a) = p_t(s,a) + αδ_t e_t(s,a)
Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• We change the update equation from:

  p_{t+1}(s,a) = p_t(s,a) + αδ_t   if a = a_t and s = s_t
  p_{t+1}(s,a) = p_t(s,a)          otherwise

  (this updates the preference for the s,a that we actually did, according to the error)

to:

  p_{t+1}(s,a) = p_t(s,a) + αδ_t e_t(s,a)

  (this updates the preference for all s,a according to their eligibility)
Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• Can change the other actor-critic update from:

  p_{t+1}(s,a) = p_t(s,a) + αδ_t [1 − π(s,a)]   if a = a_t and s = s_t
  p_{t+1}(s,a) = p_t(s,a)                       otherwise

to:

  p_{t+1}(s,a) = p_t(s,a) + αδ_t e_t(s,a)

where

  e_t(s,a) = γλ e_{t−1}(s,a) + 1 − π_t(s_t,a_t)   if s = s_t and a = a_t
  e_t(s,a) = γλ e_{t−1}(s,a)                      otherwise
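One step of this actor update can be sketched as a pair of dict operations. The keying of preferences and traces by (state, action) tuples, and the function name, are hypothetical choices for the sketch; `pi_sa` stands for π_t(s_t, a_t).

```python
def actor_trace_update(p, e, s_t, a_t, pi_sa, delta, alpha, gamma, lam):
    """One actor step: decay all traces by gamma*lam, bump the trace of the
    taken pair by 1 - pi_t(s_t, a_t), then move every preference p(s,a)
    along its trace: p <- p + alpha * delta * e."""
    for key in e:
        e[key] *= gamma * lam
    e[(s_t, a_t)] = e.get((s_t, a_t), 0.0) + (1.0 - pi_sa)
    for key, trace in e.items():
        p[key] = p.get(key, 0.0) + alpha * delta * trace
    return p, e
```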
Replacing Traces

• Using accumulating traces, frequently visited states can have eligibilities greater than 1
  • This can be a problem for convergence
• Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:

  e_t(s) = γλ e_{t−1}(s)   if s ≠ s_t
  e_t(s) = 1               if s = s_t
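The replacing-trace update differs from the accumulating one by a single line, sketched here with the same hypothetical dict-of-traces layout as before:

```python
def replacing_trace(e, s_t, gamma, lam):
    """Replacing trace: decay all traces by gamma*lam, then SET (not
    increment) the visited state's trace to 1, so it can never exceed 1."""
    for s in e:
        e[s] *= gamma * lam
    e[s_t] = 1.0
    return e
```

Revisiting the same state with γ = λ = 1 leaves its trace at 1, where an accumulating trace would have grown to 2.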
Replacing Traces Example
• Same 19-state random walk task as before
• Replacing traces perform better than accumulating traces over more values of λ
Why Replacing Traces?
• Replacing traces can significantly speed learning
• They can make the system perform well over a broader range of parameters
• Accumulating traces can do poorly on certain types of tasks

Why is this task particularly onerous for accumulating traces?
More Replacing Traces

• Off-line replacing-trace TD(1) is identical to first-visit MC
• Extension to action values (Q values):
  • When you revisit a state, what should you do with the traces for the actions you didn't take?
  • Singh and Sutton say to set them to zero:

  e_t(s,a) = 1                 if s = s_t and a = a_t
  e_t(s,a) = 0                 if s = s_t and a ≠ a_t
  e_t(s,a) = γλ e_{t−1}(s,a)   if s ≠ s_t

  • What effect would this have on the previous problem? (right/wrong)
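The action-value variant above can be sketched by extending the replacing-trace update to (state, action) keys. The tuple keying and function name are hypothetical choices for the sketch:

```python
def replacing_trace_q(e, s_t, a_t, n_actions, gamma, lam):
    """Singh & Sutton's action-value variant: decay all traces by gamma*lam,
    then on visiting (s_t, a_t) set its trace to 1 and zero the traces of
    all OTHER actions in s_t."""
    for key in e:
        e[key] *= gamma * lam
    for b in range(n_actions):
        e[(s_t, b)] = 1.0 if b == a_t else 0.0
    return e
```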
Implementation Issues with Traces

• Could require much more computation
  • But most eligibility traces are VERY close to zero
• If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices)
Variable λ
• Can generalize to a variable λ
• Here λ is a function of time:

  e_t(s) = γλ_t e_{t−1}(s)       if s ≠ s_t
  e_t(s) = γλ_t e_{t−1}(s) + 1   if s = s_t

• Could define, e.g., λ_t = λ(s_t), so that λ depends on the current state
Conclusions

• Eligibility traces provide an efficient, incremental way to combine MC and TD
  • Includes advantages of MC (can deal with lack of the Markov property)
  • Includes advantages of TD (uses the TD error, bootstrapping)
• Can significantly speed learning
• Does have a cost in computation
The two views
END
Unified View