Chapter 6: Temporal Difference Learning
Objectives of this chapter:
Introduce Temporal Difference (TD) learning
Focus first on policy evaluation, or prediction, methods
Then extend to control methods
TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function V^π
Recall the simple every-visit Monte Carlo method:
V(s_t) ← V(s_t) + α[R_t − V(s_t)]
target: the actual return after time t; must wait until the end of the episode
The simplest TD method, TD(0):
V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]
target: an estimate of the return
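A minimal tabular TD(0) prediction sketch in MATLAB (illustrative only: env_step is a hypothetical helper, not from these slides, that samples one transition under the policy being evaluated):

% Tabular TD(0) policy evaluation (sketch).
% Assumes a hypothetical environment function
%   [s_next, r, done] = env_step(s)
% that samples one transition under the policy being evaluated.
num_states   = 10;                     % illustrative size
V            = zeros(1, num_states);   % value estimates
alpha        = 0.1;                    % step size
gamma        = 1.0;                    % discount factor
num_episodes = 100;
start_state  = 1;
for episode = 1:num_episodes
    s = start_state;
    done = false;
    while ~done
        [s_next, r, done] = env_step(s);            % sample one transition
        if done
            td_target = r;                          % V(terminal) = 0
        else
            td_target = r + gamma*V(s_next);
        end
        V(s) = V(s) + alpha*(td_target - V(s));     % TD(0) update
        s = s_next;
    end
end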
Simple Monte Carlo
V(s_t) ← V(s_t) + α[R_t − V(s_t)], where R_t is the actual return following state s_t.
[Backup diagram: the Monte Carlo update backs up from s_t along the entire sampled trajectory, all the way to the terminal state T.]
Simplest TD Method
V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]
[Backup diagram: the TD(0) update backs up from s_t over a single sampled transition (r_{t+1}, s_{t+1}) and then bootstraps from V(s_{t+1}).]
cf. Dynamic Programming
V(s_t) ← E_π[r_{t+1} + γV(s_{t+1})]
[Backup diagram: the DP update backs up from s_t over all possible one-step transitions (a full expectation) and bootstraps from the successor estimates V(s_{t+1}).]
TD Bootstraps and Samples
Bootstrapping: update involves an estimate
MC does not bootstrap
DP bootstraps
TD bootstraps
Sampling: update does not involve an expected value
MC samples
DP does not sample
TD samples
Example: Driving Home
State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office        0                        30                     30
reach car, raining    5                        35                     40
exit highway          20                       15                     35
behind truck          30                       10                     40
home street           40                       3                      43
arrive home           43                       0                      43
Driving Home
Changes recommended by Monte Carlo methods (α=1)
Changes recommended by TD methods (α=1)
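A small sketch (variable names illustrative) recomputing the two sets of changes from the table above with α = 1: Monte Carlo moves each prediction toward the final outcome of 43 minutes, whereas TD moves each prediction toward the very next prediction.

% Changes recommended for the driving-home example with alpha = 1 (sketch).
predicted_total = [30 40 35 40 43 43];   % predicted total travel time at each state
actual_outcome  = 43;                    % total time actually taken
% Monte Carlo: move every earlier prediction toward the final outcome.
mc_change = actual_outcome - predicted_total(1:end-1)
%   -> [13  3  8  3  0]
% TD(0): move each prediction toward the next prediction (elapsed time
% cancels when the predictions are expressed as total travel time).
td_change = predicted_total(2:end) - predicted_total(1:end-1)
%   -> [10 -5  5  3  0]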
Advantages of TD Learning
TD methods do not require a model of the environment, only experience
TD, but not MC, methods can be fully incremental
You can learn before knowing the final outcome
– Less memory
– Less peak computation
You can learn without the final outcome
– From incomplete sequences
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Random Walk Example
Values learned by TD(0) after various numbers of episodes
Episodes terminate on either end, with reward 0 (left) and 1 (right)
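A minimal sketch of this example (assumed setup: states A–E numbered 1–5, start in the centre, equiprobable left/right moves, all values initialised to 0.5):

% Five-state random walk evaluated with TD(0) (sketch).
% States 1..5 correspond to A..E; 0 and 6 are the terminal ends.
% True values under the random policy are 1/6, 2/6, 3/6, 4/6, 5/6.
V = 0.5*ones(1,5);      % initial value estimates
alpha = 0.1;
gamma = 1.0;
num_episodes = 100;
for episode = 1:num_episodes
    s = 3;                                   % start in the centre state (C)
    while true
        if rand < 0.5
            s_next = s - 1;
        else
            s_next = s + 1;
        end
        r = (s_next == 6);                   % reward 1 only on the right end
        if s_next == 0 || s_next == 6
            V(s) = V(s) + alpha*(r - V(s));  % terminal successor: value 0
            break;
        end
        V(s) = V(s) + alpha*(r + gamma*V(s_next) - V(s));
        s = s_next;
    end
end
disp(V)   % approaches [1 2 3 4 5]/6 (approximately, for small alpha)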
TD and MC on the Random Walk
Data averaged over 100 sequences of episodes
Optimality of TD(0)
Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer! It converges to the sample averages of the actual returns at each visited state.
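A sketch of batch TD(0) under these assumptions (illustrative names; transitions is a stored list of (s, r, s') steps): increments are accumulated over a full pass through all stored experience, applied only at the end of the pass, and passes repeat until the estimates stop changing.

% Batch TD(0) (sketch). `transitions` is an N x 3 matrix of stored
% experience, one row per step: [s, r, s_next], with s_next = 0
% marking a transition into the terminal state.
function V = batch_td0(transitions, num_states, alpha)
    V = zeros(1, num_states);
    gamma = 1.0;
    while true
        dV = zeros(1, num_states);              % increments for this pass
        for i = 1:size(transitions, 1)
            s      = transitions(i, 1);
            r      = transitions(i, 2);
            s_next = transitions(i, 3);
            if s_next == 0
                target = r;                     % terminal successor
            else
                target = r + gamma*V(s_next);
            end
            dV(s) = dV(s) + alpha*(target - V(s));
        end
        V = V + dV;                             % apply after the complete pass
        if max(abs(dV)) < 1e-8                  % converged for this batch
            break;
        end
    end
end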
Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All of this was repeated 100 times. TD seems to be better than MC, but MC gives the estimate with minimum mean-squared error over the actual returns. So why does TD seem better than the optimum? In what sense?
You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
V(A)?
V(B)?
You are the Predictor
V(A)?
You are the Predictor
The prediction that best matches the training data is V(A) = 0
This minimizes the mean-squared error on the training set
This is what a batch Monte Carlo method gets
If we consider the sequentiality of the problem, then we would set V(A) = 0.75
This is correct for the maximum-likelihood estimate of a Markov model generating the data
i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
This is called the certainty-equivalence estimate
This is what TD(0) gets
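A tiny sketch contrasting the two batch answers on the eight episodes above: batch MC averages the returns actually observed from each state, while the certainty-equivalence estimate first fits the maximum-likelihood Markov model and then computes what that model predicts.

% Batch MC vs. certainty-equivalence estimates for the 8 episodes (sketch).
returns_from_A = [0];                  % the only return ever observed from A
returns_from_B = [0 1 1 1 1 1 1 0];    % returns observed from B
% Batch Monte Carlo: sample averages of actual returns.
V_A_mc = mean(returns_from_A)          % = 0
V_B_mc = mean(returns_from_B)          % = 0.75
% Certainty equivalence: best-fit Markov model (A always goes to B with
% reward 0; from B, reward 1 with probability 6/8), then compute its values.
V_B_ce = (6/8)*1                       % = 0.75
V_A_ce = 0 + V_B_ce                    % = 0.75, what batch TD(0) finds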
Learning An Action-Value Function
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do this:
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
The name Sarsa comes from the sequence s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
Results of Sarsa on the Windy Gridworld
Results of Sarsa on the Windy Gridworld
% Example 6.5
% states are numbered from 1 to 70 according to row and column index
% matrix is scanned columnwise
clear all;
close all;
% rewards
r=0*(1:70)-1;
r(53)=0;
% next state for action - going north
S_prime(1,1)=1;
S_prime(1,2:70)=(1:69);
S_prime(1,1:7:64)=(1:7:64);
Spr=reshape(S_prime,7,10);
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(1,:)=reshape(Spr,1,70);
% next state for action - going east
Spr=(8:70);
Spr=reshape(Spr,7,9);
Spr=[Spr Spr(:,9)];
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(2,:)=reshape(Spr,1,70);
% next state for action - going south
Spr=(1:70);
Spr=reshape(Spr,7,10);
Spr=[Spr(2:7,:); Spr(7,:)];
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(1,4:9)=(22:7:57);   % from the top row, moving south against wind of 1 returns to the top state
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(3,:)=reshape(Spr,1,70);
% next state for action - going west
Spr=(1:70);
Spr=reshape(Spr,7,10);
Spr=[Spr(:,1) Spr(:,1:9)];
% wind effect
Spr(:,4:9)=[Spr(1,4:9); Spr(1:6,4:9)];
Spr(:,7:8)=[Spr(1,7:8); Spr(1:6,7:8)];
S_prime(4,:)=reshape(Spr,1,70);
% initialize variables
Qsa=zeros(4,70);
alpha=0.9;
epsi=0.1;
gamma=0.9;
num_episodes=170;
t=1;
Results of Sarsa on the Windy Gridworld
% repeat for each episode
for episode=1:num_episodes
epsi=1/t;
% initialize state (start state 4: row 4, column 1)
s=4;
% choose action a from s using the epsilon-greedy policy
chose_policy=rand>epsi;
if chose_policy
[val,a]=max(Qsa(:,s));
else
a=ceil(rand*(1-eps)*4);   % uniformly random action in {1,2,3,4}
end;
% repeat for each step of episode
episode_size(episode)=0;
path=[ ];
random=[ ];
while s~=53
epsi=1/t;
% take action a, observe r, s_pr
s_pr=S_prime(a,s);
% choose action a_pr from s_pr using the epsilon-greedy policy
chose_policy_pr=rand>epsi;
if chose_policy_pr
[val,a_pr]=max(Qsa(:,s_pr));
random=[random 0];
else
a_pr=ceil(rand*(1-eps)*4);
random=[random 1];
end;
% Sarsa update (one-step TD backup of the action value)
Qsa(a,s)=Qsa(a,s)+alpha*(-1+gamma*Qsa(a_pr,s_pr)-Qsa(a,s));   % reward is a constant -1 per step
% the following selection works better
% Qsa(a,s)=Qsa(a_pr,s_pr)-1;
s=s_pr;
a=a_pr;
episode_size(episode)=episode_size(episode)+1;
path=[path s];
t=t+1;
end; %while s
end;
Results of Sarsa on the Windy Gridworld
plot([0 cumsum(episode_size)],(0:num_episodes));
xlabel('Time step');
ylabel('Episode index');
title('Windy gridworld learning rate with eps=1/t, alpha=0.9');
figure(2)
plot( episode_size(1:num_episodes))
xlabel('Episode index');
ylabel('Episode length in steps');
title('Windy gridworld learning rate with eps=1/t, alpha=0.9');
% determine the optimum policy
for i=1:70
Optimum_policy(:,i)=(Qsa(:,i)==max(Qsa(:,i)));
end;
norm=sum(Optimum_policy);
for i=1:70
Optimum_policy(:,i)=Optimum_policy(:,i)./norm(i);
end
% Optimum path (sequence of states visited):
%   11 18 25 31 37 43 50 57 64 65 66 67 68 61 53
Q-Learning: Off-Policy TD Control
One-step Q-learning:
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
The target uses the maximum over actions, different from on-policy TD (Sarsa)
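As a sketch, the windy-gridworld loop above becomes Q-learning by swapping the Sarsa update for a backup from the greedy action at the next state; the behaviour action can still be chosen epsilon-greedily.

% Q-learning variant of the inner-loop update in the windy-gridworld code
% above (sketch): bootstrap from the greedy value at s_pr, regardless of
% which action is actually taken next.
Qsa(a,s)=Qsa(a,s)+alpha*(-1+gamma*max(Qsa(:,s_pr))-Qsa(a,s));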
Cliffwalking
ε-greedy, ε = 0.1
Reward for all moves: –1
Falling off the cliff: –100
[Figure: comparison of Sarsa and Q-learning on the cliff-walking task.]
Actor-Critic Methods
Explicit representation of policy as well as value function
Minimal computation to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models
Actor-Critic Details
TD error is used to evaluate actions:
δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t)
If actions are determined by preferences, p(s,a), as follows:
π_t(s,a) = Pr{a_t = a | s_t = s} = e^{p(s,a)} / Σ_b e^{p(s,b)},
then you can update the preferences like this:
p(s_t, a_t) ← p(s_t, a_t) + β δ_t
So with positive δ_t the preference for this action is increased, and with negative δ_t it is decreased
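A single actor-critic step, sketched under these definitions (tabular value function V, preference table p, and a hypothetical take_action helper; all names are illustrative):

% One actor-critic step (sketch). Assumes tabular V (1 x num_states),
% preferences p (num_states x num_actions), and a hypothetical
% [s_next, r] = take_action(s, a) environment function.
alpha = 0.1;    % critic step size (value updates)
beta  = 0.1;    % actor step size (preference updates)
gamma = 0.9;
% Actor's policy in state s: softmax (Gibbs) over the preferences.
prefs = p(s,:);
pi_s  = exp(prefs - max(prefs)) / sum(exp(prefs - max(prefs)));
a = find(rand < cumsum(pi_s), 1);        % sample an action from pi_s
[s_next, r] = take_action(s, a);         % observe reward and next state
% Critic: TD error and value update.
delta = r + gamma*V(s_next) - V(s);
V(s)  = V(s) + alpha*delta;
% Actor: strengthen (or weaken) the preference for the action just taken.
p(s,a) = p(s,a) + beta*delta;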
Dopamine Neurons and TD Error
W. Schultz et al., Université de Fribourg
[Diagram of the dopamine system, labelling the amygdala, nucleus accumbens, hippocampus, striatum (caudate nucleus), and the VTA & substantia nigra (not shown).]
The CAR task
In the CAR (conditioned avoidance response) task, a rat is trained so that a conditioned stimulus (CS), such as a light or sound, precedes an unconditioned stimulus (US), such as an electric shock. Under normal conditions, the rat will learn to do two things:
First, it will learn to escape the shock by moving to the other side of the cage (for example).
Secondly, it will eventually learn that the CS predicts the shock, and the rat will then avoid the shock entirely by taking evasive action in response to the CS.
A computational model
Following formal models of reinforcement learning (RL) (Sutton and Barto, 1998), we translate the CAR task into a Markov Decision Process (MDP). Each circle represents a different state the simulated rat can be in, and the arrows denote the outcome of each of the two possible actions. The shock of the US elicits a punishment of −1. The rat must learn to take the appropriate action in each state so as to minimise the overall punishment.
R-Learning
R-learning is an off-policy control method for the advanced setting in which one neither discounts nor divides experience into distinct episodes.
Instead, one tries to maximize the average reward per time step.
Average Reward Per Time Step in R-Learning
Average expected reward per time step under policy π:
ρ^π = lim_{n→∞} (1/n) Σ_{t=1}^{n} E{r_t}
This is the same for every state if the process is ergodic, i.e. there is a nonzero probability of reaching any state.
Value of a state relative to ρ^π:
Ṽ^π(s) = Σ_{k=1}^{∞} E{ r_{t+k} − ρ^π | s_t = s }
Value of a state-action pair relative to ρ^π:
Q̃^π(s,a) = Σ_{k=1}^{∞} E{ r_{t+k} − ρ^π | s_t = s, a_t = a }
This is a transient value, declining to zero as all states reach their final average reward.
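A sketch of one tabular R-learning step under these definitions (one common formulation; Q, rho, s, epsilon and num_actions are assumed to be set up already, and take_action is a hypothetical environment helper):

% One R-learning step (sketch). Q is num_states x num_actions and rho is
% the running estimate of the average reward per step.
alpha = 0.1;     % step size for the action values
beta  = 0.01;    % step size for the average-reward estimate
% Behaviour: epsilon-greedy with respect to the current action values.
if rand < epsilon
    a = randi(num_actions);
else
    [~, a] = max(Q(s,:));
end
[s_next, r] = take_action(s, a);         % hypothetical environment helper
% Relative (differential) action-value update.
Q(s,a) = Q(s,a) + alpha*( r - rho + max(Q(s_next,:)) - Q(s,a) );
% Average-reward estimate is adjusted only when a greedy action was taken.
if Q(s,a) == max(Q(s,:))
    rho = rho + beta*( r - rho + max(Q(s_next,:)) - max(Q(s,:)) );
end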
Access-Control Queuing Task
n servers
Customers have four different priorities, which pay a reward of 1, 2, 3, or 4 if served
At each time step, the customer at the head of the queue is either accepted (assigned to a server) or removed from the queue
The proportion of (randomly distributed) high-priority customers in the queue is h
A busy server becomes free with probability p on each time step
Statistics of arrivals and departures are unknown
n = 10, h = 0.5, p = 0.06
Apply R-learning
Afterstates
Usually, a state-value function evaluates states in which the agent can take an action.
But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
Why is this useful?
We do not know the opponent's response, so determining the state value is difficult
Action values also do not work well, since different states and actions may result in the same board position
What is this in general?
Summary
TD prediction
Introduced one-step, tabular, model-free TD methods
Extend prediction to control by employing some form of GPI (generalized policy iteration)
On-policy control: Sarsa
Off-policy control: Q-learning and R-learning
These methods bootstrap and sample, combining aspects of DP and MC methods