
CSE 190: Reinforcement Learning: An Introduction

    Chapter 7: Eligibility Traces

Acknowledgment: A good number of these slides are cribbed from Rich Sutton.


The Book: Where we are and where we're going

• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem
• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning
• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies


    Chapter 7: Eligibility Traces


    Simple Monte Carlo

[Backup diagram: Monte Carlo backs up state s_t along the entire sampled trajectory, using the actual return R_t observed until the terminal state T.]

$$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$$

where $R_t$ is the actual return following state $s_t$.


    Simplest TD Method

[Backup diagram: the one-step TD backup from s_t through r_{t+1} to s_{t+1}.]

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$
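As a concrete (non-slide) illustration, here is a minimal Python sketch of the two tabular updates side by side; the toy value table V, the step size alpha, and the episode format are assumed placeholders rather than anything defined in the lecture.

```python
# Minimal sketch contrasting the constant-alpha Monte Carlo update with the
# one-step TD(0) update on a tabular value function V.
alpha, gamma = 0.1, 1.0
V = {s: 0.0 for s in range(5)}           # hypothetical 5-state value table


def mc_update(episode):
    """episode: list of (state, reward) pairs, where the reward follows the state."""
    G = 0.0
    for s, r in reversed(episode):       # accumulate the actual return R_t backwards
        G = r + gamma * G
        V[s] += alpha * (G - V[s])       # V(s_t) <- V(s_t) + alpha [R_t - V(s_t)]


def td0_update(s, r, s_next, terminal):
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])      # V(s_t) <- V(s_t) + alpha [r_{t+1} + gamma V(s_{t+1}) - V(s_t)]
```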


    Is there something in between?

[Backup diagrams: backups of intermediate length from s_t, somewhere between the one-step TD backup and the full Monte Carlo backup.]


    N-step TD Prediction

• Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)


Mathematics of N-step TD Prediction

• The Monte Carlo target is based on the actual return, $R_t$, from time t to the end of the episode at time T (where T denotes the final step):
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
• The TD target is based on one actual return, plus the estimate of future returns from the next state (i.e., use V to estimate the remaining return):
  $$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$


Mathematics of N-step TD Prediction

• It is a little more obvious if we line them up:
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
  $$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
• Now it is clear that V "stands in" for the future rewards; indeed, that's its job: prediction!


Mathematics of N-step TD Prediction

• It is a little more obvious if we line them up:
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
  $$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
  $$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$
  $$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
• In each line, the sampled rewards are the data and the value of the state where bootstrapping begins is the estimate.


Note the Notation

• n-step TD uses the n-step return as its target:
  $$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
• An "(n)" at the top of the "R" means n-step return (easy to miss!). The plain $R_t$ is basically using the idea of an absorbing final state.


Learning with N-step Backups

• Backup (on-line or off-line):
  $$\Delta V_t(s_t) = \alpha\left[R_t^{(n)} - V_t(s_t)\right]$$
• Error reduction property of n-step returns:
  $$\max_s \left| E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$$
  (the left side is the maximum error using the n-step return; the right side bounds it by the maximum error using V)
• Using this, you can show that n-step methods converge.
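A small code sketch of what an n-step backup looks like may be useful; the function names and the (rewards, V) representation here are assumptions for illustration, not part of the slides.

```python
# Sketch of the n-step TD prediction target and the corresponding backup.
def n_step_return(rewards, V, s_t_plus_n, n, gamma, terminal):
    """R_t^(n) = r_{t+1} + gamma r_{t+2} + ... + gamma^(n-1) r_{t+n} + gamma^n V(s_{t+n}).

    rewards: the rewards r_{t+1}, ..., r_{t+n} actually observed (fewer if the
    episode ended early); s_t_plus_n: the state reached after those steps.
    """
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    if not terminal:                      # bootstrap only if s_{t+n} is not terminal
        G += gamma ** n * V[s_t_plus_n]
    return G


def n_step_backup(V, s_t, G_n, alpha):
    # Delta V_t(s_t) = alpha [R_t^(n) - V_t(s_t)]
    V[s_t] += alpha * (G_n - V[s_t])
```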


    Random Walk Examples

• How does 2-step TD work here?
• How about 3-step TD?


A Larger Example

• Task: 19-state random walk
• Each curve corresponds to the error after 10 episodes.
• Each curve uses a different n.
• The x-axis is the learning rate.
• Do you think there is an optimal n? For everything?


Averaging N-step Returns

• n-step methods were introduced to help with understanding TD(λ) (next!)
• Idea: backup an average of several returns
  • e.g., backup half of the 2-step return and half of the 4-step return:
    $$R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$
• Called a complex backup
• To draw the backup diagram:
  • Draw each component
  • Label each with the weight for that component
• This is still one backup.


Forward View of TD(λ)

• TD(λ) is a method for averaging all n-step backups
  • weighted by λ^(n-1) (time since visitation)
• λ-return:
  $$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
• Backup using the λ-return:
  $$\Delta V_t(s_t) = \alpha\left[R_t^\lambda - V_t(s_t)\right]$$


Forward View of TD(λ)

• It is easy to see that the weights sum to 1:
  • Suppose we stopped after 2 steps (the last step is NOT weighted by (1-λ)): the sum is (1-λ) + λ = 1.
  • 3 steps: (1-λ) + (1-λ)λ + λ² = (1-λ) + (λ - λ²) + λ² = 1. Etc.
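The same check can be written once for a general horizon; this is just the geometric-series identity behind the 2-step and 3-step cases above, not anything beyond the slide's claim.

```latex
% Infinite-horizon case (0 <= \lambda < 1):
(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}
  = (1-\lambda)\cdot\frac{1}{1-\lambda} = 1 .
% Episodic case, with k = T - t - 1 n-step weights plus the terminal weight \lambda^{k}:
(1-\lambda)\sum_{n=1}^{k}\lambda^{n-1} + \lambda^{k}
  = (1-\lambda)\cdot\frac{1-\lambda^{k}}{1-\lambda} + \lambda^{k}
  = (1-\lambda^{k}) + \lambda^{k} = 1 .
```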


λ-return Weighting Function

$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$$

The first term weights the n-step returns until termination; the second term is the weight given to the full return after termination.


Relation to TD(0) and MC

• The λ-return can be rewritten as:
  $$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$$
  (until termination / after termination)
• If λ = 1, you get MC:
  $$R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$$
• If λ = 0, you get TD(0):
  $$R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)} \quad \text{(because only } 0^0 = 1\text{)}$$
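To make the two limiting cases concrete, here is a small Python sketch (the function name and the rewards/values representation are assumptions for illustration) that builds the λ-return from the n-step returns of a finite episode; lam = 0 reproduces the one-step TD target and lam = 1 reproduces the full Monte Carlo return.

```python
# Sketch of the lambda-return for a finite episode, built from its n-step returns.
def lambda_return(rewards, values, gamma, lam):
    """rewards: [r_{t+1}, ..., r_T]; values: [V(s_{t+1}), ..., V(s_{T-1})]."""
    horizon = len(rewards)                     # T - t
    # Bootstrapped n-step returns R_t^(n) for n = 1 .. T-t-1
    n_step = []
    for n in range(1, horizon):
        G = sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * values[n - 1]
        n_step.append(G)
    # Full (Monte Carlo) return R_t
    full_return = sum(gamma ** k * rewards[k] for k in range(horizon))
    weighted = (1 - lam) * sum(lam ** (n - 1) * G for n, G in enumerate(n_step, start=1))
    return weighted + lam ** (horizon - 1) * full_return
```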


Forward View of TD(λ) II

• Look forward from each state to determine its update from future states and rewards.


λ-return on the Random Walk

• Same 19-state random walk as before
• Why do you think intermediate values of λ are best?


Backward View

• Shout δ_t backwards over time:
  $$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
• The strength of your voice decreases with temporal distance by γλ.


Backward View of TD(λ)

• The forward view was for theory
• The backward view is for mechanism
• New variable called the eligibility trace, $e_t(s) \in \mathbb{R}^+$
  • On each step, decay all traces by γλ and increment the trace for the current state by 1:
    $$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
• This gives the equivalent weighting to the forward algorithm when used incrementally.


Policy Evaluation Using TD(λ)

Initialize V(s) arbitrarily
Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)          /* Compute the one-step TD error */
        e(s) ← e(s) + 1                /* Increment trace for current state */
        For all s:                     /* Update ALL s's - really only the ones we have visited (e(s) = 0 for the others) */
            V(s) ← V(s) + αδe(s)       /* Incrementally update V(s) according to TD(λ) */
            e(s) ← γλe(s)              /* Decay trace by γλ */
        s ← s′                         /* Move to next state */
    Until s is terminal


Policy Evaluation Using TD(λ)

    V(s) ← V(s) + αδe(s)       /* Incrementally update V(s) according to TD(λ) */
    e(s) ← γλe(s)              /* Decay trace by γλ */

• What does each of the elements on the right-hand side mean?
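For concreteness, here is a minimal Python sketch of the tabular algorithm above with accumulating traces; env and policy are assumed stand-ins with a reset()/step(a) interface and an action-selection function, not anything specified in the slides.

```python
from collections import defaultdict

# Tabular TD(lambda) policy evaluation with accumulating eligibility traces.
def td_lambda(env, policy, n_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)                    # Initialize V(s) arbitrarily (here: 0)
    for _ in range(n_episodes):
        e = defaultdict(float)                # e(s) = 0 for all s
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # action given by pi for s
            s_next, r, done = env.step(a)     # take action, observe r and s'
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # one-step TD error
            e[s] += 1.0                       # increment trace for the current state
            for state in list(e):             # update every state with a nonzero trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam       # decay trace by gamma * lambda
            s = s_next
    return V
```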


Relation of the Backward View to MC & TD(0)

• Using the update rule:
  $$\Delta V_t(s) = \alpha\,\delta_t\, e_t(s)$$
• As before, if you set λ to 0, you get TD(0).
• If you set λ to 1, you get MC, but in a better way:
  • Can apply TD(1) to continuing tasks
  • Works incrementally and on-line (instead of waiting until the end of the episode)
• Again, by the time you get to the terminal state, these increments add up to the one you would have made via the forward (theoretical) view.


Forward View = Backward View

• The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating.
• The book shows that the backward updates
  $$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$
  and the forward updates
  $$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t} = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$
  are equal (algebra shown in the book), where $I_{s s_t}$ is 1 if $s = s_t$ and 0 otherwise:
  $$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t}$$
• On-line updating with small α is similar.


On-line versus Off-line on the Random Walk

• Same 19-state random walk
• On-line performs better over a broader range of parameters


Control: Sarsa(λ)

• Standard control idea: learn Q values, so save eligibility for state-action pairs instead of just states:
  $$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
  $$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
  $$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$


Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′,a′) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s′; a ← a′
    Until s is terminal
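A hedged Python sketch of Sarsa(λ) with accumulating traces and an ε-greedy policy, mirroring the pseudocode above; env, actions, and the reset()/step() interface are assumed stand-ins.

```python
import random
from collections import defaultdict

# Tabular Sarsa(lambda) with accumulating traces and an epsilon-greedy policy.
def sarsa_lambda(env, actions, n_episodes, alpha=0.05, gamma=1.0, lam=0.9, eps=0.05):
    Q = defaultdict(float)                               # Q(s, a), initialized to 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)                           # e(s, a) = 0 for all s, a
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            delta = target - Q[(s, a)]
            e[(s, a)] += 1.0                             # accumulate trace for (s, a)
            for sa in list(e):                           # back up every eligible pair
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam                     # decay all traces
            s, a = s_next, a_next
    return Q
```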


Sarsa(λ) Gridworld Example

• With one trial, the agent has much more information about how to get to the goal
  • not necessarily the best way
• Can considerably accelerate learning

Three Approaches to Q(λ)

• How can we extend this to Q-learning?
• Recall that Q-learning is an off-policy method for learning Q*, and it uses the max of the Q values for a state in its backup.
• What happens if we make an exploratory move? That is NOT the right thing to back up over then…
• What to do?
• Three answers: Watkins's, Peng's, and the "naïve" answer.


Three Approaches to Q(λ)

• What happens if we make an exploratory move? That is NOT the right thing to back up over then!
• Why not?
  • Suppose we are backing up the state-action pair (s_t, a_t) at time t.
  • Suppose that on the next two time steps the agent selects the greedy action, but on the third, at time t+3, the agent selects an exploratory, non-greedy action.
• In learning about the value of the greedy policy at (s_t, a_t), we can use subsequent experience only as long as the greedy policy is being followed.
• Thus, we can use the one-step and two-step returns, but not, in this case, the three-step return. The n-step returns for all n > 2 no longer have any necessary relationship to the greedy policy.


Three Approaches to Q(λ)

• If you mark every state-action pair as eligible, you back up over a non-greedy policy.
• Watkins: zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice:
  $$e_t(s,a) = \begin{cases} 1 + \gamma\lambda e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
  $$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
  $$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$$


Watkins's Q(λ)

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s′, b)  (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′, a*) − Q(s, a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a′ = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
        s ← s′; a ← a′
    Until s is terminal


Watkins's Q(λ)

Initialize s, a
Repeat (for each step of episode):
    Take action a, observe r, s′
    Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
    a* ← argmax_b Q(s′, b)  (if a′ ties for the max, then a* ← a′)
        /* What is the point of this? STAY ON THE GREEDY POLICY */
    δ ← r + γQ(s′, a*) − Q(s, a)
        /* Compute the TD error: using the best action! */
    e(s,a) ← e(s,a) + 1
    For all s, a:
        Q(s,a) ← Q(s,a) + αδe(s,a)
        If a′ = a*, then e(s,a) ← γλe(s,a)
            /* Note the check! If we chose the greedy action, then don't zero the trace! */
        else e(s,a) ← 0
    s ← s′; a ← a′
Until s is terminal
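A Python sketch along the lines of the pseudocode above (env and actions are assumed stand-ins): traces decay by γλ only while the greedy action is taken, are cut to zero after an exploratory action, and the TD error bootstraps from the greedy action a*.

```python
import random
from collections import defaultdict

# Watkins's Q(lambda): cut traces after exploratory actions, bootstrap from the max.
def watkins_q_lambda(env, actions, n_episodes, alpha=0.05, gamma=0.9, lam=0.9, eps=0.05):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            a_star = max(actions, key=lambda b: Q[(s_next, b)])     # greedy action in s'
            if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
                a_star = a_next                                     # ties count as greedy
            target = r + (0.0 if done else gamma * Q[(s_next, a_star)])
            delta = target - Q[(s, a)]                              # bootstrap from the best action
            e[(s, a)] += 1.0
            for sa in list(e):
                Q[sa] += alpha * delta * e[sa]
            if a_next == a_star:
                for sa in e:
                    e[sa] *= gamma * lam                            # stayed greedy: decay traces
            else:
                e = defaultdict(float)                              # exploratory move: cut traces
            s, a = s_next, a_next
    return Q
```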


Peng's Q(λ)

• Disadvantage of Watkins's method:
  • Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces.
• Peng:
  • Back up the max action except at the end
  • Never cut traces
• Disadvantage:
  • Complicated to implement


Naïve Q(λ)

• Idea: is it really a problem to back up exploratory actions?
• Never zero traces
• Always back up the max at the current action (unlike Peng's or Watkins's)
• Is this truly naïve?
• Well, it works well in some empirical studies.
• What is the backup diagram?


Comparison Task

From McGovern and Sutton (1997), "Towards a Better Q(λ)"

• Compared Watkins's, Peng's, and naïve (called McGovern's here) Q(λ) on several tasks.
  • See McGovern and Sutton (1997), "Towards a Better Q(λ)", for other tasks and results (stochastic tasks, continuing tasks, etc.)
• Deterministic gridworld with obstacles
  • 10x10 gridworld
  • 25 randomly generated obstacles
  • 30 runs
  • α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces


Comparison Results

From McGovern and Sutton (1997), "Towards a Better Q(λ)"


Convergence of the Q(λ)'s

• None of the methods is proven to converge.
  • Much extra credit if you can prove any of them.
• Watkins's is thought to converge to Q*.
• Peng's is thought to converge to a mixture of Q^π and Q*.
• Naïve: Q*?


Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• We change the update equation from:
  $$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\,\delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$
  to:
  $$p_{t+1}(s,a) = p_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$


Eligibility Traces for Actor-Critic Methods

• The first form updates the preference only for the s,a that we actually did, according to the error.
• The trace form updates the preferences for all s,a according to their eligibility.


Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• Can change the other actor-critic update from:
  $$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\,\delta_t\left[1 - \pi(s,a)\right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$
  to:
  $$p_{t+1}(s,a) = p_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
  where
  $$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
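A small sketch of the trace-based actor update (the tabular preference table p, the defaultdict keying, and the step-size name beta are assumptions for illustration; beta plays the role of the slide's actor step size α): every pair moves by beta·δ_t·e_t(s,a), and the trace of the pair actually taken is incremented by 1 − π_t(s_t, a_t), matching the second variant above.

```python
from collections import defaultdict

# Trace-based actor update for a tabular actor-critic agent.
def actor_trace_update(p, e, s_t, a_t, pi_sa, delta, beta=0.1, gamma=1.0, lam=0.9):
    """p, e: defaultdict(float) keyed by (state, action); pi_sa = pi_t(s_t, a_t)."""
    for sa in list(e):
        e[sa] *= gamma * lam                  # decay all traces by gamma * lambda
    e[(s_t, a_t)] += 1.0 - pi_sa              # extra credit for the pair actually taken
    for sa in list(e):
        p[sa] += beta * delta * e[sa]         # p_{t+1}(s,a) = p_t(s,a) + beta * delta_t * e_t(s,a)
```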


Replacing Traces

• Using accumulating traces, frequently visited states can have eligibilities greater than 1.
  • This can be a problem for convergence.
• Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:
  $$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$$


Replacing Traces Example

• Same 19-state random walk task as before
• Replacing traces perform better than accumulating traces over more values of λ


Why Replacing Traces?

• Replacing traces can significantly speed learning.
• They can make the system perform well for a broader set of parameters.
• Accumulating traces can do poorly on certain types of tasks.
• Why is this task particularly onerous for accumulating traces?


More Replacing Traces

• Off-line replacing-trace TD(1) is identical to first-visit MC.
• Extension to action values (Q values):
  • When you revisit a state, what should you do with the traces for the actions you didn't take?
  • Singh and Sutton say to set them to zero:
    $$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$$
• What effect would this have on the previous problem? (right/wrong)
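A short sketch of that replacing-trace rule for action values (the defaultdict keying and the explicit arguments are assumptions for illustration): the visited pair is set to 1, the other actions in the visited state are zeroed, and everything else simply decays.

```python
from collections import defaultdict

# Replacing traces for action values, with the other actions of the visited
# state zeroed as suggested by Singh and Sutton.
def replace_traces(e, s_t, a_t, gamma, lam):
    """e: defaultdict(float) keyed by (state, action)."""
    for (s, a) in list(e):
        if s == s_t:
            e[(s, a)] = 0.0                  # zero traces for the other actions in s_t
        else:
            e[(s, a)] *= gamma * lam         # ordinary decay elsewhere
    e[(s_t, a_t)] = 1.0                      # replacing trace for the pair just taken
```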


Implementation Issues with Traces

• Could require much more computation
  • But most eligibility traces are VERY close to zero
• If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).


Variable λ

• Can generalize to a variable λ
• Here λ is a function of time
• Could define:
  $$e_t(s) = \begin{cases} \gamma\lambda_t e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
  with, for example, $\lambda_t = \lambda(s_t)$.
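A one-function Python sketch of the variable-λ trace update (lam_fn is an assumed callable giving λ_t for the state just reached):

```python
# Variable-lambda eligibility trace update for a tabular value method.
def decay_traces_variable_lambda(e, s_t, gamma, lam_fn):
    """e: dict mapping state -> trace; lam_fn(s) returns lambda_t for state s."""
    lam_t = lam_fn(s_t)                      # lambda_t = lambda(s_t)
    for s in e:
        e[s] *= gamma * lam_t                # decay every trace by gamma * lambda_t
    e[s_t] = e.get(s_t, 0.0) + 1.0           # then increment the current state's trace
```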


Conclusions

• Eligibility traces provide an efficient, incremental way to combine MC and TD.
  • Includes the advantages of MC (can deal with a lack of the Markov property)
  • Includes the advantages of TD (using the TD error, bootstrapping)
• Can significantly speed learning
• Does have a cost in computation


    The two views

END

    Unified View