
CSE 190: Reinforcement Learning: An Introduction

    Chapter 7: Eligibility Traces

Acknowledgment: A good number of these slides are cribbed from Rich Sutton.


The Book: Where we are and where we're going

• Part I: The Problem
  • Introduction
  • Evaluative Feedback
  • The Reinforcement Learning Problem
• Part II: Elementary Solution Methods
  • Dynamic Programming
  • Monte Carlo Methods
  • Temporal Difference Learning
• Part III: A Unified View
  • Eligibility Traces
  • Generalization and Function Approximation
  • Planning and Learning
  • Dimensions of Reinforcement Learning
  • Case Studies


    Chapter 7: Eligibility Traces


    Simple Monte Carlo

[Backup diagram: Monte Carlo backs up state s_t along the entire sampled trajectory, using the actual return R_t observed until the terminal state T.]

$$V(s_t) \leftarrow V(s_t) + \alpha\left[R_t - V(s_t)\right]$$

where $R_t$ is the actual return following state $s_t$.


    Simplest TD Method

[Backup diagram: the one-step TD backup from s_t through r_{t+1} to s_{t+1}.]

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$
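As a concrete (non-slide) illustration, here is a minimal Python sketch of the two tabular updates side by side; the toy value table V, the step size alpha, and the episode format are assumed placeholders rather than anything defined in the lecture.

```python
# Minimal sketch contrasting the constant-alpha Monte Carlo update with the
# one-step TD(0) update on a tabular value function V.
alpha, gamma = 0.1, 1.0
V = {s: 0.0 for s in range(5)}           # hypothetical 5-state value table


def mc_update(episode):
    """episode: list of (state, reward) pairs, where the reward follows the state."""
    G = 0.0
    for s, r in reversed(episode):       # accumulate the actual return R_t backwards
        G = r + gamma * G
        V[s] += alpha * (G - V[s])       # V(s_t) <- V(s_t) + alpha [R_t - V(s_t)]


def td0_update(s, r, s_next, terminal):
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])      # V(s_t) <- V(s_t) + alpha [r_{t+1} + gamma V(s_{t+1}) - V(s_t)]
```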


    Is there something in between?

[Backup diagrams: backups of intermediate length from s_t, somewhere between the one-step TD backup and the full Monte Carlo backup.]


    N-step TD Prediction

• Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps)


Mathematics of N-step TD Prediction

• The Monte Carlo target is based on the actual return, $R_t$, from time t to the end of the episode at time T (where T denotes the final step):
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
• The TD target is based on one actual return, plus the estimate of future returns from the next state (i.e., use V to estimate the remaining return):
  $$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$


Mathematics of N-step TD Prediction

• It is a little more obvious if we line them up:
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
  $$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
• Now it is clear that V "stands in" for the future rewards; indeed, that's its job: prediction!


Mathematics of N-step TD Prediction

• It is a little more obvious if we line them up:
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$$
  $$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
  $$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$
  $$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
• In each line, the sampled rewards are the data and the value of the state where bootstrapping begins is the estimate.


Note the Notation

• n-step TD uses the n-step return as its target:
  $$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
• An "(n)" at the top of the "R" means n-step return (easy to miss!). The plain $R_t$ is basically using the idea of an absorbing final state.


Learning with N-step Backups

• Backup (on-line or off-line):
  $$\Delta V_t(s_t) = \alpha\left[R_t^{(n)} - V_t(s_t)\right]$$
• Error reduction property of n-step returns:
  $$\max_s \left| E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$$
  (the left side is the maximum error using the n-step return; the right side bounds it by the maximum error using V)
• Using this, you can show that n-step methods converge.
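A small code sketch of what an n-step backup looks like may be useful; the function names and the (rewards, V) representation here are assumptions for illustration, not part of the slides.

```python
# Sketch of the n-step TD prediction target and the corresponding backup.
def n_step_return(rewards, V, s_t_plus_n, n, gamma, terminal):
    """R_t^(n) = r_{t+1} + gamma r_{t+2} + ... + gamma^(n-1) r_{t+n} + gamma^n V(s_{t+n}).

    rewards: the rewards r_{t+1}, ..., r_{t+n} actually observed (fewer if the
    episode ended early); s_t_plus_n: the state reached after those steps.
    """
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    if not terminal:                      # bootstrap only if s_{t+n} is not terminal
        G += gamma ** n * V[s_t_plus_n]
    return G


def n_step_backup(V, s_t, G_n, alpha):
    # Delta V_t(s_t) = alpha [R_t^(n) - V_t(s_t)]
    V[s_t] += alpha * (G_n - V[s_t])
```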


    Random Walk Examples

• How does 2-step TD work here?
• How about 3-step TD?


A Larger Example

• Task: 19-state random walk
• Each curve corresponds to the error after 10 episodes.
• Each curve uses a different n.
• The x-axis is the learning rate.
• Do you think there is an optimal n? For everything?


Averaging N-step Returns

• n-step methods were introduced to help with understanding TD(λ) (next!)
• Idea: backup an average of several returns
  • e.g., backup half of the 2-step return and half of the 4-step return:
    $$R_t^{\text{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$
• Called a complex backup
• To draw the backup diagram:
  • Draw each component
  • Label each with the weight for that component
• This is still one backup.


Forward View of TD(λ)

• TD(λ) is a method for averaging all n-step backups
  • weighted by λ^(n-1) (time since visitation)
• λ-return:
  $$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
• Backup using the λ-return:
  $$\Delta V_t(s_t) = \alpha\left[R_t^\lambda - V_t(s_t)\right]$$


Forward View of TD(λ)

• It is easy to see that the weights sum to 1:
  • Suppose we stopped after 2 steps (the last step is NOT weighted by (1-λ)): the sum is (1-λ) + λ = 1.
  • 3 steps: (1-λ) + (1-λ)λ + λ² = (1-λ) + (λ - λ²) + λ² = 1. Etc.
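The same check can be written once for a general horizon; this is just the geometric-series identity behind the 2-step and 3-step cases above, not anything beyond the slide's claim.

```latex
% Infinite-horizon case (0 <= \lambda < 1):
(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}
  = (1-\lambda)\cdot\frac{1}{1-\lambda} = 1 .
% Episodic case, with k = T - t - 1 n-step weights plus the terminal weight \lambda^{k}:
(1-\lambda)\sum_{n=1}^{k}\lambda^{n-1} + \lambda^{k}
  = (1-\lambda)\cdot\frac{1-\lambda^{k}}{1-\lambda} + \lambda^{k}
  = (1-\lambda^{k}) + \lambda^{k} = 1 .
```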


λ-return Weighting Function

$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$$

The first term weights the n-step returns until termination; the second term is the weight given to the full return after termination.


Relation to TD(0) and MC

• The λ-return can be rewritten as:
  $$R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$$
  (until termination / after termination)
• If λ = 1, you get MC:
  $$R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$$
• If λ = 0, you get TD(0):
  $$R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)} \quad \text{(because only } 0^0 = 1\text{)}$$
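To make the two limiting cases concrete, here is a small Python sketch (the function name and the rewards/values representation are assumptions for illustration) that builds the λ-return from the n-step returns of a finite episode; lam = 0 reproduces the one-step TD target and lam = 1 reproduces the full Monte Carlo return.

```python
# Sketch of the lambda-return for a finite episode, built from its n-step returns.
def lambda_return(rewards, values, gamma, lam):
    """rewards: [r_{t+1}, ..., r_T]; values: [V(s_{t+1}), ..., V(s_{T-1})]."""
    horizon = len(rewards)                     # T - t
    # Bootstrapped n-step returns R_t^(n) for n = 1 .. T-t-1
    n_step = []
    for n in range(1, horizon):
        G = sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * values[n - 1]
        n_step.append(G)
    # Full (Monte Carlo) return R_t
    full_return = sum(gamma ** k * rewards[k] for k in range(horizon))
    weighted = (1 - lam) * sum(lam ** (n - 1) * G for n, G in enumerate(n_step, start=1))
    return weighted + lam ** (horizon - 1) * full_return
```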


Forward View of TD(λ) II

• Look forward from each state to determine its update from future states and rewards.


λ-return on the Random Walk

• Same 19-state random walk as before
• Why do you think intermediate values of λ are best?


Backward View

• Shout δ_t backwards over time:
  $$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
• The strength of your voice decreases with temporal distance by γλ.


Backward View of TD(λ)

• The forward view was for theory
• The backward view is for mechanism
• New variable called the eligibility trace, $e_t(s) \in \mathbb{R}^+$
  • On each step, decay all traces by γλ and increment the trace for the current state by 1:
    $$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
• This gives the equivalent weighting to the forward algorithm when used incrementally.


Policy Evaluation Using TD(λ)

Initialize V(s) arbitrarily
Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)          /* Compute the one-step TD error */
        e(s) ← e(s) + 1                /* Increment trace for current state */
        For all s:                     /* Update ALL s's - really only the ones we have visited (e(s) = 0 for the others) */
            V(s) ← V(s) + αδe(s)       /* Incrementally update V(s) according to TD(λ) */
            e(s) ← γλe(s)              /* Decay trace by γλ */
        s ← s′                         /* Move to next state */
    Until s is terminal


Policy Evaluation Using TD(λ)

    V(s) ← V(s) + αδe(s)       /* Incrementally update V(s) according to TD(λ) */
    e(s) ← γλe(s)              /* Decay trace by γλ */

• What does each of the elements on the right-hand side mean?
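For concreteness, here is a minimal Python sketch of the tabular algorithm above with accumulating traces; env and policy are assumed stand-ins with a reset()/step(a) interface and an action-selection function, not anything specified in the slides.

```python
from collections import defaultdict

# Tabular TD(lambda) policy evaluation with accumulating eligibility traces.
def td_lambda(env, policy, n_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)                    # Initialize V(s) arbitrarily (here: 0)
    for _ in range(n_episodes):
        e = defaultdict(float)                # e(s) = 0 for all s
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # action given by pi for s
            s_next, r, done = env.step(a)     # take action, observe r and s'
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # one-step TD error
            e[s] += 1.0                       # increment trace for the current state
            for state in list(e):             # update every state with a nonzero trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam       # decay trace by gamma * lambda
            s = s_next
    return V
```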


Relation of the Backward View to MC & TD(0)

• Using the update rule:
  $$\Delta V_t(s) = \alpha\,\delta_t\, e_t(s)$$
• As before, if you set λ to 0, you get TD(0).
• If you set λ to 1, you get MC, but in a better way:
  • Can apply TD(1) to continuing tasks
  • Works incrementally and on-line (instead of waiting until the end of the episode)
• Again, by the time you get to the terminal state, these increments add up to the one you would have made via the forward (theoretical) view.


Forward View = Backward View

• The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating.
• The book shows that the backward updates
  $$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$
  and the forward updates
  $$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t} = \alpha \sum_{t=0}^{T-1} I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\,\delta_k$$
  are equal (algebra shown in the book), where $I_{s s_t}$ is 1 if $s = s_t$ and 0 otherwise:
  $$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t}$$
• On-line updating with small α is similar.


On-line versus Off-line on the Random Walk

• Same 19-state random walk
• On-line performs better over a broader range of parameters


Control: Sarsa(λ)

• Standard control idea: learn Q values, so save eligibility for state-action pairs instead of just states:
  $$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
  $$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
  $$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$


Sarsa(λ) Algorithm

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s′,a′) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s′; a ← a′
    Until s is terminal
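A hedged Python sketch of Sarsa(λ) with accumulating traces and an ε-greedy policy, mirroring the pseudocode above; env, actions, and the reset()/step() interface are assumed stand-ins.

```python
import random
from collections import defaultdict

# Tabular Sarsa(lambda) with accumulating traces and an epsilon-greedy policy.
def sarsa_lambda(env, actions, n_episodes, alpha=0.05, gamma=1.0, lam=0.9, eps=0.05):
    Q = defaultdict(float)                               # Q(s, a), initialized to 0

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)                           # e(s, a) = 0 for all s, a
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            delta = target - Q[(s, a)]
            e[(s, a)] += 1.0                             # accumulate trace for (s, a)
            for sa in list(e):                           # back up every eligible pair
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam                     # decay all traces
            s, a = s_next, a_next
    return Q
```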


Sarsa(λ) Gridworld Example

• With one trial, the agent has much more information about how to get to the goal
  • not necessarily the best way
• Can considerably accelerate learning

Three Approaches to Q(λ)

• How can we extend this to Q-learning?
• Recall that Q-learning is an off-policy method for learning Q*, and it uses the max of the Q values for a state in its backup.
• What happens if we make an exploratory move? That is NOT the right thing to back up over then…
• What to do?
• Three answers: Watkins's, Peng's, and the "naïve" answer.


Three Approaches to Q(λ)

• What happens if we make an exploratory move? That is NOT the right thing to back up over then!
• Why not?
  • Suppose we are backing up the state-action pair (s_t, a_t) at time t.
  • Suppose that on the next two time steps the agent selects the greedy action, but on the third, at time t+3, the agent selects an exploratory, non-greedy action.
• In learning about the value of the greedy policy at (s_t, a_t), we can use subsequent experience only as long as the greedy policy is being followed.
• Thus, we can use the one-step and two-step returns, but not, in this case, the three-step return. The n-step returns for all n > 2 no longer have any necessary relationship to the greedy policy.


Three Approaches to Q(λ)

• If you mark every state-action pair as eligible, you back up over a non-greedy policy.
• Watkins: zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice:
  $$e_t(s,a) = \begin{cases} 1 + \gamma\lambda e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
  $$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
  $$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$$


Watkins's Q(λ)

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s′, b)  (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′, a*) − Q(s, a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a′ = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
        s ← s′; a ← a′
    Until s is terminal


Watkins's Q(λ)

Initialize s, a
Repeat (for each step of episode):
    Take action a, observe r, s′
    Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
    a* ← argmax_b Q(s′, b)  (if a′ ties for the max, then a* ← a′)
        /* What is the point of this? STAY ON THE GREEDY POLICY */
    δ ← r + γQ(s′, a*) − Q(s, a)
        /* Compute the TD error: using the best action! */
    e(s,a) ← e(s,a) + 1
    For all s, a:
        Q(s,a) ← Q(s,a) + αδe(s,a)
        If a′ = a*, then e(s,a) ← γλe(s,a)
            /* Note the check! If we chose the greedy action, then don't zero the trace! */
        else e(s,a) ← 0
    s ← s′; a ← a′
Until s is terminal
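A Python sketch along the lines of the pseudocode above (env and actions are assumed stand-ins): traces decay by γλ only while the greedy action is taken, are cut to zero after an exploratory action, and the TD error bootstraps from the greedy action a*.

```python
import random
from collections import defaultdict

# Watkins's Q(lambda): cut traces after exploratory actions, bootstrap from the max.
def watkins_q_lambda(env, actions, n_episodes, alpha=0.05, gamma=0.9, lam=0.9, eps=0.05):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            a_star = max(actions, key=lambda b: Q[(s_next, b)])     # greedy action in s'
            if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
                a_star = a_next                                     # ties count as greedy
            target = r + (0.0 if done else gamma * Q[(s_next, a_star)])
            delta = target - Q[(s, a)]                              # bootstrap from the best action
            e[(s, a)] += 1.0
            for sa in list(e):
                Q[sa] += alpha * delta * e[sa]
            if a_next == a_star:
                for sa in e:
                    e[sa] *= gamma * lam                            # stayed greedy: decay traces
            else:
                e = defaultdict(float)                              # exploratory move: cut traces
            s, a = s_next, a_next
    return Q
```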


Peng's Q(λ)

• Disadvantage of Watkins's method:
  • Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces.
• Peng:
  • Back up the max action except at the end
  • Never cut traces
• Disadvantage:
  • Complicated to implement


Naïve Q(λ)

• Idea: is it really a problem to back up exploratory actions?
• Never zero traces
• Always back up the max at the current action (unlike Peng's or Watkins's)
• Is this truly naïve?
• Well, it works well in some empirical studies.
• What is the backup diagram?


Comparison Task

From McGovern and Sutton (1997), "Towards a Better Q(λ)"

• Compared Watkins's, Peng's, and naïve (called McGovern's here) Q(λ) on several tasks.
  • See McGovern and Sutton (1997), "Towards a Better Q(λ)", for other tasks and results (stochastic tasks, continuing tasks, etc.)
• Deterministic gridworld with obstacles
  • 10x10 gridworld
  • 25 randomly generated obstacles
  • 30 runs
  • α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces


Comparison Results

From McGovern and Sutton (1997), "Towards a Better Q(λ)"


Convergence of the Q(λ)'s

• None of the methods is proven to converge.
  • Much extra credit if you can prove any of them.
• Watkins's is thought to converge to Q*.
• Peng's is thought to converge to a mixture of Q^π and Q*.
• Naïve: Q*?


Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• We change the update equation from:
  $$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\,\delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$
  to:
  $$p_{t+1}(s,a) = p_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$


Eligibility Traces for Actor-Critic Methods

• The first form updates the preference only for the s,a that we actually did, according to the error.
• The trace form updates the preferences for all s,a according to their eligibility.


Eligibility Traces for Actor-Critic Methods

• Critic: on-policy learning of V^π. Use TD(λ) as described before.
• Actor: needs eligibility traces for each state-action pair.
• Can change the other actor-critic update from:
  $$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\,\delta_t\left[1 - \pi(s,a)\right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$$
  to:
  $$p_{t+1}(s,a) = p_t(s,a) + \alpha\,\delta_t\, e_t(s,a)$$
  where
  $$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$$
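A small sketch of the trace-based actor update (the tabular preference table p, the defaultdict keying, and the step-size name beta are assumptions for illustration; beta plays the role of the slide's actor step size α): every pair moves by beta·δ_t·e_t(s,a), and the trace of the pair actually taken is incremented by 1 − π_t(s_t, a_t), matching the second variant above.

```python
from collections import defaultdict

# Trace-based actor update for a tabular actor-critic agent.
def actor_trace_update(p, e, s_t, a_t, pi_sa, delta, beta=0.1, gamma=1.0, lam=0.9):
    """p, e: defaultdict(float) keyed by (state, action); pi_sa = pi_t(s_t, a_t)."""
    for sa in list(e):
        e[sa] *= gamma * lam                  # decay all traces by gamma * lambda
    e[(s_t, a_t)] += 1.0 - pi_sa              # extra credit for the pair actually taken
    for sa in list(e):
        p[sa] += beta * delta * e[sa]         # p_{t+1}(s,a) = p_t(s,a) + beta * delta_t * e_t(s,a)
```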


Replacing Traces

• Using accumulating traces, frequently visited states can have eligibilities greater than 1.
  • This can be a problem for convergence.
• Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:
  $$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$$


Replacing Traces Example

• Same 19-state random walk task as before
• Replacing traces perform better than accumulating traces over more values of λ


Why Replacing Traces?

• Replacing traces can significantly speed learning.
• They can make the system perform well for a broader set of parameters.
• Accumulating traces can do poorly on certain types of tasks.
• Why is this task particularly onerous for accumulating traces?


More Replacing Traces

• Off-line replacing-trace TD(1) is identical to first-visit MC.
• Extension to action values (Q values):
  • When you revisit a state, what should you do with the traces for the actions you didn't take?
  • Singh and Sutton say to set them to zero:
    $$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$$
• What effect would this have on the previous problem? (right/wrong)
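A short sketch of that replacing-trace rule for action values (the defaultdict keying and the explicit arguments are assumptions for illustration): the visited pair is set to 1, the other actions in the visited state are zeroed, and everything else simply decays.

```python
from collections import defaultdict

# Replacing traces for action values, with the other actions of the visited
# state zeroed as suggested by Singh and Sutton.
def replace_traces(e, s_t, a_t, gamma, lam):
    """e: defaultdict(float) keyed by (state, action)."""
    for (s, a) in list(e):
        if s == s_t:
            e[(s, a)] = 0.0                  # zero traces for the other actions in s_t
        else:
            e[(s, a)] *= gamma * lam         # ordinary decay elsewhere
    e[(s_t, a_t)] = 1.0                      # replacing trace for the pair just taken
```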


Implementation Issues with Traces

• Could require much more computation
  • But most eligibility traces are VERY close to zero
• If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).


Variable λ

• Can generalize to a variable λ
• Here λ is a function of time
• Could define:
  $$e_t(s) = \begin{cases} \gamma\lambda_t e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$
  with, for example, $\lambda_t = \lambda(s_t)$.
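A one-function Python sketch of the variable-λ trace update (lam_fn is an assumed callable giving λ_t for the state just reached):

```python
# Variable-lambda eligibility trace update for a tabular value method.
def decay_traces_variable_lambda(e, s_t, gamma, lam_fn):
    """e: dict mapping state -> trace; lam_fn(s) returns lambda_t for state s."""
    lam_t = lam_fn(s_t)                      # lambda_t = lambda(s_t)
    for s in e:
        e[s] *= gamma * lam_t                # decay every trace by gamma * lambda_t
    e[s_t] = e.get(s_t, 0.0) + 1.0           # then increment the current state's trace
```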


Conclusions

• Eligibility traces provide an efficient, incremental way to combine MC and TD.
  • Includes the advantages of MC (can deal with a lack of the Markov property)
  • Includes the advantages of TD (using the TD error, bootstrapping)
• Can significantly speed learning
• Does have a cost in computation


    The two views

END

    Unified View