Jaroslav Sklenář
2017
University of Malta, Department of Statistics & Operations Research
e-mail: [email protected]
Web: http://staff.um.edu.mt/jskl1/
Infinite Horizon Dynamic Programming Models
Dynamic Programming
History:
Richard Bellman (1920-1984)
Principle of Optimality
"For every stage and every decision that starts an optimal plan from this stage, the optimal plan consists of the given decision followed by the plan which is optimal with respect to the succeeding stage".
Bellman Equations
Used in various algorithms to evaluate a given policy and to find an optimal policy in acceptable time.
Infinite horizon DP models - presentation
• Markov Decision Process – assumptions, symbols
– Direct full enumeration algorithm
• Bellman Equations for discounted rewards case
– Evaluation of a given policy
– Finding an optimal policy by
• Policy iteration
• Value iteration
• Bellman Equations for average reward case and the Semi-Markov Decision Process
• Reinforcement Learning
• Response Surface and other memory saving methods
Four classes of DP models
Horizon \ Nature   Deterministic   Stochastic
Finite
Infinite

The finite-horizon deterministic class (a special type) was dealt with earlier.
(Figure: Multistage Decision Model – stage t graphically, from the finite-horizon deterministic DP models.)
Consequences of t → ∞

- Decision stages (though still present) lose their identity. They are equal, having the same states and decisions.
- We deal with homogeneous (stationary, time-independent) stochastic discrete-time systems.
- If time information is important for decision making, time is incorporated in the state. Ex.: time to flight departure.
- We cannot use the multivariate optimization point of view; only the optimal control view is available. Objective = optimal policy (a mapping of states to actions).
Infinite horizon DP models: general assumptions 1/2
- Entities involved: system, decision maker (controller).
- Actions are taken at discrete time points; we assume unit (constant) time intervals – will be relaxed.
- An action (typically, not necessarily) changes the system state and generates some reward/cost.

(Figure: a discrete-time dynamic stochastic system controlled by a policy; the policy issues an action (decision), the system reports its state back to the controller.)
Infinite horizon DP models: general assumptions 2/2
- Only homogeneous (time-independent) entities are considered.
- Transition rewards are deterministic (will be relaxed), bounded, and depend on the current state, action and next state.
- Markovian (in the general sense) property of the system: the probability of reaching a certain next state is given only by the current state and current action.
- Policy is deterministic; the action is given by the current state.
- The objective function to be maximized takes one of the forms:
  - average reward per unit time
  - total discounted reward from the infinite sequence of actions.
Together we have Markov Decision Process (MDP).
Infinite horizon DP models: symbols & notation 1/2
S = finite nonempty set of states: i, j ∈ S. (Identification of stages is not used.)
A(i) = finite nonempty set of actions available in state i. Often only decision-making states with |A(i)| > 1 are considered.
u(i) ∈ A(i) = action taken in state i under policy u.
p(i,a,j) = P[moving from state i to state j if action a is taken]. Probabilities are arranged in the Transition Probability Matrix (TPM): p(i,a,j) = p(i,u(i),j) = P_u(i,j).
r(i,a,j) = immediate reward of the transition from i to j under action a. Rewards are arranged in the Transition Reward Matrix (TRM): r(i,a,j) = r(i,u(i),j) = R_u(i,j).
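To make the notation concrete, the three-index quantities p(i,a,j) and r(i,a,j) and the policy-specific matrices P_u and R_u can be stored as plain arrays. A minimal Python sketch (ours, not from the slides; all numbers are invented):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number below is made up.
# P[i, j, a] = p(i, a, j),  R[i, j, a] = r(i, a, j)  (0-based indices)
P = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
P[0, :, 0] = [0.9, 0.1]; R[0, :, 0] = [1.0, 5.0]   # state 0, action 0
P[0, :, 1] = [0.2, 0.8]; R[0, :, 1] = [0.0, 8.0]   # state 0, action 1
P[1, :, 0] = [0.5, 0.5]; R[1, :, 0] = [2.0, 2.0]   # state 1, action 0
P[1, :, 1] = [0.7, 0.3]; R[1, :, 1] = [3.0, 1.0]   # state 1, action 1

u = [1, 0]                                          # a policy: u(i) = action in state i
Pu = np.array([P[i, :, u[i]] for i in range(2)])    # TPM P_u(i, j)
Ru = np.array([R[i, :, u[i]] for i in range(2)])    # TRM R_u(i, j)
print(Pu)                                           # each row sums to 1
```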
Infinite horizon DP models: symbols & notation 2/2
r(i,a) = expected immediate reward for state i and action a:

$$ r(i,a) = \sum_{j \in S} p(i,a,j)\, r(i,a,j), \quad a = u(i) $$

π_u(i) = limiting probability of state i under policy u. Under the ergodicity assumption we get the vector π_u by solving the equations

$$ \pi_u^T = \pi_u^T P_u, \qquad \pi_u^T e = 1 $$

ρ_u = average reward per transition (unit time) under policy u:

$$ \rho_u = \sum_{i \in S} \pi_u(i)\, r(i,u(i)) $$
Optimization by full enumeration #1
For each policy u:
  Compute expected immediate rewards for all states:

$$ r(i,u(i)) = \sum_{j \in S} p(i,u(i),j)\, r(i,u(i),j), \quad i \in S $$

  Compute limiting probabilities π_u:

$$ \pi_u^T = \pi_u^T P_u, \qquad \pi_u^T e = 1 $$

  Compute the average reward ρ_u per transition:

$$ \rho_u = \sum_{i \in S} \pi_u(i)\, r(i,u(i)) $$

Select the best policy.

Ex.: |S| = 100, |A(i)| = 2 for all states.
Number of different policies is 2^100 ≈ 1.2677 × 10^30.
At 1 μs per policy this takes about 4 × 10^16 years.
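The enumeration above can be sketched in a few lines (the course code is in Matlab; this is a Python sketch on an invented 2-state, 2-action instance):

```python
import numpy as np
from itertools import product

# Toy MDP, all numbers invented: P[i, j, a] = p(i, a, j), R[i, j, a] = r(i, a, j)
P = np.array([[[0.9, 0.2], [0.1, 0.8]],
              [[0.5, 0.7], [0.5, 0.3]]])
R = np.array([[[1.0, 0.0], [5.0, 8.0]],
              [[2.0, 3.0], [2.0, 1.0]]])
S, A = 2, 2

best_rho, best_u = -np.inf, None
for u in product(range(A), repeat=S):              # all |A|^|S| policies
    Pu = np.array([P[i, :, u[i]] for i in range(S)])
    ru = np.array([P[i, :, u[i]] @ R[i, :, u[i]] for i in range(S)])
    # limiting probabilities: pi^T = pi^T Pu together with pi^T e = 1
    M = np.vstack([Pu.T - np.eye(S), np.ones(S)])
    pi = np.linalg.lstsq(M, np.r_[np.zeros(S), 1.0], rcond=None)[0]
    rho = pi @ ru                                  # average reward per transition
    if rho > best_rho:
        best_rho, best_u = rho, u
print(best_u, best_rho)
```

Even at a microsecond per policy, the |A|^|S| growth makes this hopeless beyond toy sizes, which is what motivates the Bellman-equation algorithms.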
Bellman Equations (BE) for Discounted Rewards

Def: Value function vector for policy u and discounted rewards:

$$ J_u(i) = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{k} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big], \quad i \in S $$

where 0 < λ < 1 is the discounting factor and the expectation is over all trajectories.

Next we find Bellman equations that can be used to:
1) Compute the value function vector for a given policy
2) Find an optimal policy
Bellman Equations for Discounted Rewards
BE for evaluation of policy u:

(Figure: one transition, from state i to state j with probability p(i,u(i),j), collecting reward r(i,u(i),j); value functions J_u(i) and J_u(j).)

$$ J_u(i) = \mathbb{E}\big[ r(i,u(i),j) + \lambda J_u(j) \big] = \sum_{j=1}^{|S|} p(i,u(i),j)\big[ r(i,u(i),j) + \lambda J_u(j) \big] $$

$$ = \sum_{j=1}^{|S|} p(i,u(i),j)\, r(i,u(i),j) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J_u(j) $$

$$ J_u(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J_u(j), \quad i \in S $$

$$ J_u = r_u + \lambda P_u J_u \quad\Rightarrow\quad J_u = (I - \lambda P_u)^{-1} r_u $$

Optimization by full enumeration #2:
Compute the value function vector for each policy, select the best policy.
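The closed form J_u = (I − λP_u)^(-1) r_u can be checked numerically; this Python sketch (invented numbers) does what the Matlab line `(eye(S) - lambda*Pmu)\ri'` in the later listing does:

```python
import numpy as np

lam = 0.9                          # discounting factor λ
Pu = np.array([[0.2, 0.8],         # TPM of a fixed policy (invented numbers)
               [0.7, 0.3]])
ru = np.array([6.4, 2.4])          # expected immediate rewards r(i, u(i))

# Solve (I - λ Pu) Ju = ru; solving is cheaper and more stable than inverting
Ju = np.linalg.solve(np.eye(2) - lam * Pu, ru)

# Ju must be the fixed point of the policy's Bellman equation
assert np.allclose(Ju, ru + lam * Pu @ Ju)
print(Ju)
```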
Bellman Equations for Discounted Rewards

Problem: Fast algorithm to find an optimal policy

Definition of value function repeated:

$$ J_u(i) = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{k} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big], \quad i \in S $$

Lemma 1: The value function vector defined as

$$ J^*(i) = \max_{u}\{ J_u(i) \}, \quad i \in S $$

is the optimal value function vector associated with the optimal policy.

Justification: The above is clearly an upper bound on J*; for the existence of a common maximizer u_max, see later.
Bellman Equations for Discounted Rewards
We introduce the transformation T_u:

$$ T_u(J(i)) = T_u J(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J(j), \quad i \in S $$

$$ T_u^k J(i) = T_u\big( T_u^{k-1} J(i) \big), \quad k \ge 2 $$

Transformation T_u is based on the above BE for a given policy u:

$$ J_u(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J_u(j), \quad i \in S $$

$$ J_u(i) = T_u J_u(i), \quad i \in S $$
Bellman Equations for Discounted Rewards
Similarly we introduce the transformation T:

$$ T(J(i)) = TJ(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J(j) \Big\}, \quad i \in S $$

$$ T^k J(i) = T\big( T^{k-1} J(i) \big), \quad k \ge 2 $$

Transformation T is based on the Bellman optimality equation (see later). We also note that by selecting the policy u_max made of the maximizers in the above terms we get:

$$ u_{\max}(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J(j) \Big\}, \quad i \in S $$

$$ T_{u_{\max}} J(i) = TJ(i) $$
Bellman Equations for Discounted Rewards
Proposition 1: Both transformations are monotone:

$$ J(i) \le J'(i) \;\Rightarrow\; T^k J(i) \le T^k J'(i), \;\; T_u^k J(i) \le T_u^k J'(i), \quad i \in S, \; k > 0 $$

Proof by induction for T_u (similarly for T). For k = 1:

$$ T_u J(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J(j) \le r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J'(j) = T_u J'(i) $$

If it holds for k = m:

$$ T_u^{m+1} J(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, T_u^m J(j) \le r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, T_u^m J'(j) = T_u^{m+1} J'(i) $$
Bellman Equations for Discounted Rewards
Proposition 2: Both transformations are contractive with respect to the max norm

$$ \|x\| = \max_i |x_i| $$

This means that for any two vectors J and J':

$$ \| TJ - TJ' \| \le \lambda \| J - J' \|, \qquad \| T_u J - T_u J' \| \le \lambda \| J - J' \| $$

where 0 < λ < 1 is the discounting factor.

Proof: We assume that TJ(i) ≥ TJ'(i) for all i. Also let a(i) be a maximizer:

$$ a(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J(j) \Big\}, \quad i \in S $$

Since a(i) is a maximizer, by the definition of the mapping T we have:

$$ TJ(i) = r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J(j), \quad i \in S $$
Bellman Equations for Discounted Rewards
Proof cont. Similarly, by replacing J by J', let b(i) be the maximizer:

$$ b(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J'(j) \Big\}, \quad i \in S $$

$$ TJ'(i) = r(i,b(i)) + \lambda \sum_{j=1}^{|S|} p(i,b(i),j)\, J'(j), \quad i \in S $$

Since b(i) maximizes the term in square brackets we have:

$$ TJ'(i) \ge r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J'(j), \quad i \in S $$

$$ -TJ'(i) \le -\Big[ r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J'(j) \Big], \quad i \in S $$
Bellman Equations for Discounted Rewards
Proof cont. Combining all the above we obtain (for all i):

$$ 0 \le TJ(i) - TJ'(i) \le \Big[ r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J(j) \Big] - \Big[ r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J'(j) \Big] $$

$$ = \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\big[ J(j) - J'(j) \big] \le \lambda \sum_{j=1}^{|S|} p(i,a(i),j) \max_j |J(j) - J'(j)| = \lambda \max_j |J(j) - J'(j)| \sum_{j=1}^{|S|} p(i,a(i),j) $$

$$ = \lambda \max_j |J(j) - J'(j)| = \lambda \| J - J' \| $$

Thus we can write:

$$ TJ(i) - TJ'(i) \le \lambda \| J - J' \|, \quad i \in S $$
Bellman Equations for Discounted Rewards
Proof cont. Similarly, by assuming that TJ(i) ≤ TJ'(i) for all i, we obtain:

$$ TJ'(i) - TJ(i) \le \lambda \| J' - J \| = \lambda \| J - J' \|, \quad i \in S $$

The two inequalities together (both LHS are nonnegative)

$$ TJ'(i) - TJ(i) \le \lambda \| J - J' \|, \quad i \in S $$
$$ TJ(i) - TJ'(i) \le \lambda \| J - J' \|, \quad i \in S $$

give the result:

$$ \max_i | TJ(i) - TJ'(i) | \le \lambda \| J - J' \| $$
$$ \| TJ - TJ' \| \le \lambda \| J - J' \| $$

For the mapping T_u the proof is similar.
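The contraction can also be observed numerically: applying T to two different vectors shrinks their max-norm distance by a factor of at most λ at every step. A Python sketch (toy MDP, all numbers invented):

```python
import numpy as np

lam = 0.9                                      # discounting factor λ
# P[i, j, a] = p(i, a, j); rr[i, a] = expected immediate reward r(i, a)
P = np.array([[[0.9, 0.2], [0.1, 0.8]],
              [[0.5, 0.7], [0.5, 0.3]]])
rr = np.array([[1.4, 6.4],
               [2.0, 2.4]])

def T(J):
    """Bellman optimality operator: (TJ)(i) = max_a { r(i,a) + λ Σ_j p(i,a,j) J(j) }."""
    return np.max(rr + lam * np.einsum('ija,j->ia', P, J), axis=1)

J, Jp = np.zeros(2), np.array([10.0, -3.0])    # two arbitrary starting vectors
for _ in range(5):
    before = np.max(np.abs(J - Jp))            # ||J - J'|| before the step
    J, Jp = T(J), T(Jp)
    assert np.max(np.abs(J - Jp)) <= lam * before + 1e-12
```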
Bellman Equations for Discounted Rewards
Lemma 2: For a given vector h of dimension |S|:

$$ T_u^k h(i) = \mathbb{E}\Big[ \lambda^k h(x_{k+1}) + \sum_{s=1}^{k} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big], \quad i \in S $$

Verified for k = 1 (recall T_u J(i) = r(i,u(i)) + λ Σ_j p(i,u(i),j) J(j)):

$$ T_u h(i) = \sum_{j=1}^{|S|} p(i,u(i),j)\big[ r(i,u(i),j) + \lambda h(j) \big] = \mathbb{E}\Big[ \lambda h(x_2) + \sum_{s=1}^{1} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big] $$
Bellman Equations for Discounted Rewards
Lemma 2 verified for k = 2:

$$ T_u^2 h(i) = \sum_{j=1}^{|S|} p(i,u(i),j)\big[ r(i,u(i),j) + \lambda\, T_u h(j) \big] $$

$$ = \sum_{j=1}^{|S|} p(i,u(i),j)\Big[ r(i,u(i),j) + \lambda \sum_{l=1}^{|S|} p(j,u(j),l)\big[ r(j,u(j),l) + \lambda h(l) \big] \Big] $$

$$ = \sum_{j=1}^{|S|} p(i,u(i),j)\, r(i,u(i),j) + \lambda \sum_{j=1}^{|S|} \sum_{l=1}^{|S|} p(i,u(i),j)\, p(j,u(j),l)\, r(j,u(j),l) + \lambda^2 \sum_{j=1}^{|S|} \sum_{l=1}^{|S|} p(i,u(i),j)\, p(j,u(j),l)\, h(l) $$

$$ = \mathbb{E}\Big[ \lambda^2 h(x_3) + \sum_{s=1}^{2} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big] $$

For the complete proof see Bertsekas, D.P., Dynamic Programming and Optimal Control.
Bellman Equations for Discounted Rewards
Proposition 3: For any bounded function h : S → ℝ the optimal value function vector satisfies:

$$ J^*(i) = \lim_{k \to \infty} T^k h(i), \quad i \in S $$

Proof: Split the sum after the first P terms:

$$ J_u(i) = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$

$$ = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] + \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$

$$ = \mathbb{E}\Big[ \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] + \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$

Immediate rewards are finite: |r(x_s, u(x_s), x_{s+1})| ≤ M for all s.
Bellman Equations for Discounted Rewards
Applied to the 2nd term:

$$ \lim_{k \to \infty} \Big| \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \Big| \le \lim_{k \to \infty} \sum_{s=P+1}^{k} \lambda^{s-1} M = \frac{\lambda^P M}{1-\lambda} $$

We denote the limit by A:

$$ \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] = A, \qquad |A| \le \frac{\lambda^P M}{1-\lambda}, \qquad -\frac{\lambda^P M}{1-\lambda} \le A \le \frac{\lambda^P M}{1-\lambda} $$

Adding J_u(i) to each side we get:

$$ J_u(i) - \frac{\lambda^P M}{1-\lambda} \;\le\; J_u(i) + A \;\le\; J_u(i) + \frac{\lambda^P M}{1-\lambda} $$
Bellman Equations for Discounted Rewards
Using A in the first equation, the above inequality becomes:

$$ J_u(i) - \frac{\lambda^P M}{1-\lambda} \le \mathbb{E}\Big[ \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \le J_u(i) + \frac{\lambda^P M}{1-\lambda} \qquad (1) $$

Since λ > 0 we have:

$$ -\max_i |h(i)| \le h(i) \le \max_i |h(i)| \;\Rightarrow\; -\lambda^P \max_i |h(i)| \le \lambda^P\, \mathbb{E}[h(x_{P+1})] \le \lambda^P \max_i |h(i)| \qquad (2) $$

Adding (1) and (2) gives:

$$ J_u(i) - \lambda^P \max_i |h(i)| - \frac{\lambda^P M}{1-\lambda} \le \mathbb{E}\Big[ \lambda^P h(x_{P+1}) + \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \le J_u(i) + \lambda^P \max_i |h(i)| + \frac{\lambda^P M}{1-\lambda} $$
Bellman Equations for Discounted Rewards
Now we use Lemma 2 for the middle term (with k = P):

$$ J_u(i) - \lambda^P \max_i |h(i)| - \frac{\lambda^P M}{1-\lambda} \le T_u^P h(i) \le J_u(i) + \lambda^P \max_i |h(i)| + \frac{\lambda^P M}{1-\lambda} $$

The above holds for any policy. Selecting the one that maximizes the above terms and using Lemma 1 we get:

$$ J^*(i) - \lambda^P \max_i |h(i)| - \frac{\lambda^P M}{1-\lambda} \le T^P h(i) \le J^*(i) + \lambda^P \max_i |h(i)| + \frac{\lambda^P M}{1-\lambda} $$

By taking the limit for P → ∞ (so that λ^P → 0) we get the result:

$$ J^*(i) \le \lim_{P \to \infty} T^P h(i) \le J^*(i) \;\Rightarrow\; \lim_{P \to \infty} T^P h(i) = J^*(i) $$
Bellman Equations for Discounted Rewards
The proposition

$$ J^*(i) = \lim_{k \to \infty} T^k h(i), \quad i \in S $$

defines the so-called Bellman Optimality Equation (BOE) used to find an optimal policy:

$$ J^*(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^*(j) \Big\}, \quad i \in S $$

$$ J^*(i) = TJ^*(i), \quad i \in S $$

As a direct solution of the BOE by linear algebra is not possible (because of the max operator), we use algorithms based on the two transformations. There are two algorithms (with several modifications):

Policy Iteration
Value Iteration
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Policy Iteration

Algorithm:
1. k = 1, select any policy u_k.
2. Evaluate policy u_k by solving the (basic) BE:

$$ J^k(i) = r(i,u_k(i)) + \lambda \sum_{j=1}^{|S|} p(i,u_k(i),j)\, J^k(j), \quad i \in S $$

3. Improvement step. Find a new policy u_{k+1} such that

$$ u_{k+1}(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^k(j) \Big\}, \quad i \in S $$

If possible leave u_{k+1}(i) = u_k(i).

4. If the policy did not change then u_k is optimal. Otherwise k = k+1, go to step 2.
function [mu,J,ri,Pmu,rr] = SDPpoliterDR(S,A,P,r,lambda)
mu = ones(S,1); % initial mu(i) = 1
for i=1:S % for all states i
for a=1:A % for all actions a
rr(i,a) = P(i,:,a)*r(i,:,a)'; % expected immediate rewards
end
end
improved = 1; % flag whether improvement happened
while improved == 1
improved = 0;
for i=1:S % for all states i
ri(i) = rr(i,mu(i)); % expected rewards for given policy
for j=1:S % for all states j
Pmu(i,j) = P(i,j,mu(i)); % TPM for given policy
end
end
J = (eye(S) - lambda*Pmu)\ri'; % computing J for given policy
for i=1:S % for all states i
for a=1:A % for all actions a
y(a) = rr(i,a) + lambda*(P(i,:,a)*J); % this is maximized
end
[J(i) newmui] = max(y); % maximization
if J(i)>y(mu(i)) % improvement ?
improved = 1;
mu(i) = newmui; % updating policy if improved
end
end
end
Bellman Equations for Discounted Rewards
Solving Bellman Optimality Equation by Policy Iteration
Notes on the algorithm
- Provides an optimal policy in a finite number of iterations. A formal convergence proof exists, but intuitively:
  - each iteration improves the policy
  - there is a finite number of policies
- Each iteration solves a system of |S| linear equations.
  Matlab: J = (eye(S) - lambda*Pmu)\ri';
- Summary: "relatively small number of complicated iterations"
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Value Iteration

Algorithm:
1. k = 1, select any vector J^1, specify ε > 0.
2. Compute:

$$ J^{k+1}(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^k(j) \Big\}, \quad i \in S $$

3. If

$$ \| J^{k+1} - J^k \| < \varepsilon\,(1-\lambda)/(2\lambda) $$

then go to step 4, otherwise k = k+1, go to step 2.

4. The optimal policy is given by:

$$ u_{\max}(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^{k+1}(j) \Big\}, \quad i \in S $$
Bellman Equations for Discounted Rewards
Proposition 4: The value iteration algorithm generates an ε-optimal policy; that is, if J_u is the value function vector of the policy u provided by the algorithm and J* is the optimal value function vector, then:

$$ \| J_u - J^* \| \le \varepsilon $$

Proof: From the triangular inequality of a norm:

$$ \| J_u - J^* \| \le \| J_u - J^{k+1} \| + \| J^{k+1} - J^* \| $$

As u is made of the maximizing actions in each state, we have:

$$ T_u J^{k+1} = T J^{k+1}, \qquad J_u = T_u J_u $$

Using the above and the contraction property we obtain:
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Value Iteration

Proof – cont.

$$ \| J_u - J^{k+1} \| \le \| J_u - T_u J^{k+1} \| + \| T_u J^{k+1} - J^{k+1} \| = \| T_u J_u - T_u J^{k+1} \| + \| T J^{k+1} - T J^k \| $$

$$ \le \lambda \| J_u - J^{k+1} \| + \lambda \| J^{k+1} - J^k \| $$

(using T_u J^{k+1} = T J^{k+1} and J^{k+1} = T J^k). Rearranging the first and the last terms provides:

$$ \| J_u - J^{k+1} \| \le \frac{\lambda}{1-\lambda}\, \| J^{k+1} - J^k \| $$

Similarly we can obtain:

$$ \| J^{k+1} - J^* \| \le \frac{\lambda}{1-\lambda}\, \| J^{k+1} - J^k \| $$
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Value Iteration

Proof – cont. Inserting the above terms into the triangular inequality:

$$ \| J_u - J^* \| \le \| J_u - J^{k+1} \| + \| J^{k+1} - J^* \| \le \frac{2\lambda}{1-\lambda}\, \| J^{k+1} - J^k \| $$

Since the inequality

$$ \| J^{k+1} - J^k \| < \varepsilon\,(1-\lambda)/(2\lambda) $$

is tested in step 3 of the algorithm, we finally get:

$$ \| J_u - J^* \| \le \varepsilon $$
function [mu,J,rr] = SDPvaliterDR(S,A,P,r,lambda,epsilon)
J = zeros(S,1); % initial J(i) = 0
for i=1:S % for all states i
for a=1:A % for all actions a
rr(i,a) = P(i,:,a)*r(i,:,a)'; % expected immediate rewards
end
end
d = 0.5*epsilon*(1-lambda)/lambda; % norm limit
nrm = d+1; % initial norm > d
while nrm > d
K = J; % save old value functions
for i=1:S % for all states i
for a=1:A % for all actions a
y(a) = rr(i,a) + lambda*(P(i,:,a)*J); % this is maximized
end
[J(i) mu(i)] = max(y); % maximization
end
nrm = norm(J - K,inf); % ||.||inf norm
end
for i=1:S % the final result
for a=1:A % for all actions a
y(a) = rr(i,a) + lambda*(P(i,:,a)*J); % this is maximized
end
[J(i) mu(i)] = max(y); % maximization
end
Bellman Equations for Discounted Rewards
Solving Bellman Optimality Equation by Value Iteration
Notes on the algorithm
- Provides only an approximate solution, the so-called ε-optimal policy.
- The max-norm decreases with every iteration.
- Each iteration updates a vector and computes the norm,no complicated matrix operations.
- Summary: “relatively big number of simple iterations”
(modifications for faster convergence exist)
Comparison of the two methods – duration in CPU seconds
(Matlab R2007b; Intel Core 2 T5600 1.83 GHz, 1 GB RAM, ε = 0.0001)

|S|     |A|   Policy iteration   Value iteration
100     50    0.05               0.17
500     10    1.42               2.17
1000    5     9.86               3.53
1000    10    10.8               7.48
1500    5     32.4               7.72
2000    5     75.5               16.7
2500*   5     144.5              25.7
3000    2     245.7              11.4

* 10 matrices with 6.25 million entries each, about 500 MB of data
Bellman Equations for Average Reward

Def: ρ_u = average reward per transition (unit time) under policy u:

$$ \rho_u = \sum_{i \in S} \pi_u(i)\, r(i,u(i)) $$

Proposition 5: If a scalar ρ and a vector h of dimension |S| satisfy

$$ \rho + h(i) = r(i,u(i)) + \sum_{j=1}^{|S|} p(i,u(i),j)\, h(j), \quad i \in S $$

then ρ is the average reward associated with the policy u.

If a scalar ρ* and a vector J* of dimension |S| satisfy

$$ \rho^* + J^*(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \sum_{j=1}^{|S|} p(i,a,j)\, J^*(j) \Big\}, \quad i \in S $$

then ρ* is the optimal average reward, J* is the optimal value function vector, and the policy u* made of the maximizers in the RHS of the equation is the optimal policy.
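For a fixed policy the first equation of Proposition 5 is a system of |S| linear equations in |S| + 1 unknowns (ρ and h); h is determined only up to an additive constant, so fixing one component, e.g. h(1) = 0, makes the system solvable by linear algebra. A Python sketch (ours; all numbers invented):

```python
import numpy as np

Pu = np.array([[0.2, 0.8],       # TPM of a fixed policy (invented numbers)
               [0.7, 0.3]])
ru = np.array([6.4, 2.4])        # expected immediate rewards r(i, u(i))
S = len(ru)

# rho + h(i) = r(i,u(i)) + sum_j p(i,u(i),j) h(j),  with h(0) fixed to 0.
# Unknown vector x = [rho, h(1), ..., h(S-1)].
M = np.zeros((S, S))
M[:, 0] = 1.0                              # coefficient of rho
M[:, 1:] = np.eye(S)[:, 1:] - Pu[:, 1:]    # coefficients of h(1..S-1)
x = np.linalg.solve(M, ru)
rho, h = x[0], np.r_[0.0, x[1:]]
print(rho)                                 # average reward of the policy
```

For this toy chain the stationary distribution is π_u = (7/15, 8/15), so the solved ρ coincides with π_u · r_u = 64/15.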
Bellman Equations for Average Reward

Proof (1st part, outline): We define the transformation L_u:

$$ L_u(J(i)) = L_u J(i) = r(i,u(i)) - \rho + \sum_{j=1}^{|S|} p(i,u(i),j)\, J(j), \quad i \in S $$

Then the 1st BE can be written as:

$$ h(i) = L_u h(i), \quad i \in S $$

By induction it is easy to prove the following:

$$ h(i) = L_u^k h(i), \quad i \in S, \; k \ge 1 $$

Similarly to the above Lemma 2 we have that:

$$ L_u^k h(i) = \mathbb{E}\Big[ h(x_{k+1}) + \sum_{s=1}^{k} \big( r(x_s,u(x_s),x_{s+1}) - \rho \big) \,\Big|\, x_1 = i \Big], \quad i \in S $$

Inserting this into the above equation, dividing by k and letting k → ∞ gives the result:

$$ \rho = \frac{\mathbb{E}[h(x_{k+1}) \mid x_1 = i] - h(i)}{k} + \frac{1}{k}\, \mathbb{E}\Big[ \sum_{s=1}^{k} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \;\longrightarrow\; \rho = \lim_{k \to \infty} \frac{1}{k}\, \mathbb{E}\Big[ \sum_{s=1}^{k} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$
Bellman Equations for Average Reward

The proof of the 2nd part of the proposition is similar; the same operations are performed on an inequality when a policy other than the optimal one is taken.

Proposition 5 defines two Bellman equations that can be used to:
- evaluate a given policy
- generate the optimal policy

Again, the 2nd Bellman (optimality) equation cannot be solved by linear algebra methods. There are two algorithms (with several modifications):

Policy Iteration
Value Iteration

(times needed are similar to the discounted rewards case)
Bellman Equations for Semi-Markov Decision Problems

Additional assumption:

t(i,a,j) = deterministic transition time of going from state i to state j under action a. Times are arranged in the Transition Time Matrix (TTM):

$$t(i,a,j) = t(i,u(i),j) = T_u(i,j)$$

DTMDP (Deterministic Time MDP)

$\bar{t}(i,a)$ = average time spent in a transition from state i under action a:

$$\bar{t}(i,a) = \sum_{j \in S} p(i,a,j)\,t(i,a,j), \quad a = u(i)$$

Average reward of an SMDP:

$$\rho_u(i) = \lim_{k\to\infty}\frac{\mathbb{E}\big[\sum_{s=1}^{k} r(x_s,u(x_s),x_{s+1}) \,\big|\, x_1 = i\big]}{\mathbb{E}\big[\sum_{s=1}^{k} t(x_s,u(x_s),x_{s+1}) \,\big|\, x_1 = i\big]}
= \frac{\sum_{i=1}^{|S|} \Pi_u(i)\,\bar{r}(i,u(i))}{\sum_{i=1}^{|S|} \Pi_u(i)\,\bar{t}(i,u(i))}$$
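As a concrete illustration, the right-hand formula above can be evaluated for a small SMDP under a fixed policy. The 2-state matrices below and the use of power iteration to obtain the stationary distribution are assumptions for illustration, not part of the presentation:

```python
# Average reward of an SMDP under a fixed policy u, from the formula above:
# rho_u = sum_i Pi_u(i) rbar(i,u(i)) / sum_i Pi_u(i) tbar(i,u(i)).

P = [[0.5, 0.5], [0.2, 0.8]]   # TPM rows p(i,u(i),j)  (assumed values)
R = [[1.0, 3.0], [0.0, 2.0]]   # TRM rows r(i,u(i),j)
T = [[1.0, 2.0], [4.0, 1.0]]   # TTM rows t(i,u(i),j)

n = len(P)
rbar = [sum(P[i][j] * R[i][j] for j in range(n)) for i in range(n)]  # mean reward per jump
tbar = [sum(P[i][j] * T[i][j] for j in range(n)) for i in range(n)]  # mean time per jump

pi = [1.0 / n] * n             # stationary distribution Pi_u by power iteration
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

rho = (sum(pi[i] * rbar[i] for i in range(n))
       / sum(pi[i] * tbar[i] for i in range(n)))
```

For these numbers the stationary distribution is (2/7, 5/7), so rho_u = 12/11 ≈ 1.09 reward per unit of time.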
![Page 43: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/43.jpg)
Bellman Equations for Semi-Markov Decision Problems

Bellman Equation for SMDP to evaluate the policy u:

$$h(i) = \bar{r}(i,u(i)) - \rho_u\,\bar{t}(i,u(i)) + \sum_{j=1}^{|S|} p(i,u(i),j)\,h(j), \quad i \in S$$

Bellman Optimality Equation for SMDP:

$$J^*(i) = \max_{a \in A(i)}\Big\{\bar{r}(i,a) - \rho^*\,\bar{t}(i,a) + \sum_{j=1}^{|S|} p(i,a,j)\,J^*(j)\Big\}, \quad i \in S$$

Again there are two algorithms (with several modifications):
Policy Iteration
Value Iteration
![Page 44: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/44.jpg)
What’s wrong with Dynamic Programming?

We have efficient algorithms that are relatively easy to implement, with theoretically guaranteed convergence.

Why is DP used so rarely?
![Page 45: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/45.jpg)
It is cursed twice!

What’s wrong with Dynamic Programming?
Curse of dimensionality (Bellman ~1957)
|S| = 1,000 → 1,000,000 entries of TPM, TRM for each action

|S| = 1,000,000 → 10¹² entries of TPM, TRM for each action

(Each matrix made of 8 TB of data)
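The memory figures above follow from simple arithmetic, sketched here; the 8 bytes per double-precision entry is an assumption about the storage format:

```python
# One TPM (or TRM) for a single action has |S|^2 entries; at 8 bytes
# per double-precision entry this gives 8 TB of data for |S| = 10**6.
entries = 1_000_000 ** 2          # |S|^2 entries per matrix per action
terabytes = entries * 8 / 10 ** 12  # 8 bytes each, expressed in TB
```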
![Page 46: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/46.jpg)
It is cursed twice!

What’s wrong with Dynamic Programming?
Curse of modelling
TPM + TRM (+ TTM) = Theoretical model
Problem: How to obtain the values ?
a) pdf’s known: transition probabilities can be obtained by evaluating multiple integrals

b) unknown distributions: statistical evaluation of available data

c) known underlying distributions: matrices can be generated by simulation
![Page 47: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/47.jpg)
Generating matrices by simulation
Basic ideas
Keep counters and sums:
N(i,a) ~ incremented if action a is selected in state i
M(i,a,j) ~ incremented if action a selected in state i results in a transition to state j
R(i,a,j) ~ sum of rewards generated if action a selected in state i results in a transition to state j

$$p(i,a,j) = \frac{M(i,a,j)}{N(i,a)}, \qquad r(i,a,j) = \frac{R(i,a,j)}{M(i,a,j)}$$

Note: the sum R is not needed if the reward is updated by the Robbins-Monro algorithm ($r_s$ = generated sample):

$$r^{\,n+1}(i,a,j) = (1-\alpha^{n+1})\,r^{\,n}(i,a,j) + \alpha^{n+1}\,r_s(i,a,j), \qquad \alpha^{n+1} = 1/(n+1)$$
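The counters above can be sketched in code. Everything below — the toy simulator and the state/action sizes — is an assumed illustration, not part of the presentation:

```python
import random

S, A = 3, 2                    # states 0..2, actions 0..1 (assumed sizes)

def simulate(i, a):
    """Toy simulator, assumed for illustration: next state and reward."""
    j = random.randrange(S)
    return j, float(i + a - j)

N = [[0] * A for _ in range(S)]                        # N(i,a)
M = [[[0] * S for _ in range(A)] for _ in range(S)]    # M(i,a,j)
r = [[[0.0] * S for _ in range(A)] for _ in range(S)]  # running mean r(i,a,j)

random.seed(1)
for _ in range(20000):
    i, a = random.randrange(S), random.randrange(A)
    j, rs = simulate(i, a)
    N[i][a] += 1
    M[i][a][j] += 1
    alpha = 1.0 / M[i][a][j]   # Robbins-Monro step 1/(n+1)
    r[i][a][j] = (1 - alpha) * r[i][a][j] + alpha * rs  # sum R never stored

# Transition probabilities from the counters: p(i,a,j) = M(i,a,j)/N(i,a)
p = [[[M[i][a][j] / N[i][a] for j in range(S)]
      for a in range(A)] for i in range(S)]
```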
![Page 48: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/48.jpg)
Generating matrices by simulation
Robbins-Monro algorithm (1951)

$s_i$ = i-th sample
$X_n$ = mean computed from n samples

$$X_n = \frac{\sum_{i=1}^{n} s_i}{n} \;\Rightarrow\; (n+1)\,X_{n+1} = \sum_{i=1}^{n+1} s_i = n\,X_n + s_{n+1}$$

$$X_{n+1} = \frac{n\,X_n + s_{n+1}}{n+1} = \frac{n}{n+1}\,X_n + \frac{1}{n+1}\,s_{n+1}$$

$$X_{n+1} = (1-\alpha_{n+1})\,X_n + \alpha_{n+1}\,s_{n+1}, \qquad \alpha_{n+1} = 1/(n+1)$$
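The identity derived above is easy to check numerically; the sample values below are arbitrary:

```python
# The incremental (Robbins-Monro) update with step 1/(n+1) reproduces
# the ordinary sample mean, as derived above.
samples = [4.0, 7.0, 1.0, 10.0, 3.0]

X = 0.0
for n, s in enumerate(samples, start=1):
    alpha = 1.0 / n
    X = (1 - alpha) * X + alpha * s   # running mean, no sum stored

# X now equals sum(samples)/len(samples) up to rounding
```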
![Page 49: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/49.jpg)
Reinforcement Learning (RL)
Needed: A model-free algorithm based on the underlying distributions only (no need for the matrices). It would solve:

- the curse of modelling totally
- the curse of dimensionality partially (see later)
Solution: Q-learning algorithm (Watkins 1989)
Wikipedia:
… in computer science, reinforcement learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states …
Other names: neuro-dynamic programming, dynamic programming stochastic approximation, simulation-based dynamic programming, approximate dynamic programming
![Page 50: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/50.jpg)
Reinforcement Learning
Q-learning algorithm – derivation 1/2

Bellman optimality equation repeated:

$$J^*(i) = \max_{a \in A(i)}\Big\{\bar{r}(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\,J^*(j)\Big\}$$

We define the Q-factor for state i and action a:

$$Q(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\big[r(i,a,j) + \lambda\,J^*(j)\big]$$

Comparing these equations gives:

$$J^*(i) = \max_{a \in A(i)} Q(i,a)$$

So we get the Q-factor version of the Bellman equation:

$$Q(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big]$$

(All above algorithms have their Q-factor versions.)
![Page 51: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/51.jpg)
Reinforcement Learning
Q-learning algorithm – derivation 2/2

Q-factor Bellman equation expressed as the expectation of a sample:

$$Q(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big]
= \mathbb{E}\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big] = \mathbb{E}[\text{sample}]$$

If the Q-factor is the expectation of a sample, we can use simulation and the Robbins-Monro algorithm to get its approximate value:

$$Q^{n+1}(i,a) = (1-\alpha^{n+1})\,Q^{n}(i,a) + \alpha^{n+1}\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q^{n}(j,b)\Big], \qquad \alpha^{n+1} = 1/(n+1)$$

where the reward r(i,a,j) and the state j are generated by the simulator.
Reinforcement Learning
Q-learning algorithm:

1. Initialize the Q-factors, the visit factors and the number of jumps k to 0:

$$Q(i,a) = 0, \quad V(i,a) = 0 \quad \forall i \in S,\ a \in A(i); \qquad k = 0$$

Set step size 0 < s < 1 and k_max, and generate any initial state i.

2. Let the current state be i. Select action a ∈ A(i) with probability 1/|A(i)|.

3. Simulate action a in state i by generating next state j and reward r(i,a,j). Increment the visit factor and k, compute the step:

$$V(i,a) \leftarrow V(i,a) + 1, \qquad k \leftarrow k + 1, \qquad \alpha = s / V(i,a)$$

4. Update Q(i,a) by the Robbins-Monro formula:

$$Q(i,a) \leftarrow (1-\alpha)\,Q(i,a) + \alpha\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big]$$

5. If k < k_max, set i = j and go to step 2. Otherwise continue.

6. Find the optimal policy:

$$u(i) = \arg\max_{b \in A(i)} Q(i,b), \quad i \in S$$
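Steps 1–6 above can be sketched as follows. The toy 2-state MDP in `simulate`, the discount factor and the step constant are assumptions for illustration, not part of the presentation:

```python
import random

# Q-learning sketch: tabular Q-factors, uniform action selection,
# diminishing step size s/V(i,a), greedy policy extraction at the end.
S = [0, 1]
A = {0: [0, 1], 1: [0, 1]}
lam = 0.8                       # discount factor (assumed)
s_step, k_max = 0.5, 50000      # step constant and iteration budget (assumed)

def simulate(i, a):
    """Toy simulator assumed for illustration: action a usually leads to
    state a; reward 1 is earned for the transition from 1 to 1."""
    j = a if random.random() < 0.9 else 1 - a
    return j, 1.0 if (i == 1 and j == 1) else 0.0

Q = {(i, a): 0.0 for i in S for a in A[i]}   # step 1: Q-factors
V = {(i, a): 0 for i in S for a in A[i]}     # ... and visit factors
k, i = 0, 0

random.seed(0)
while k < k_max:                             # steps 2-5
    a = random.choice(A[i])                  # uniform selection, 1/|A(i)|
    j, r = simulate(i, a)                    # step 3: simulate the jump
    V[(i, a)] += 1
    k += 1
    alpha = s_step / V[(i, a)]               # diminishing step
    target = r + lam * max(Q[(j, b)] for b in A[j])
    Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * target   # step 4
    i = j                                    # step 5

policy = {i: max(A[i], key=lambda b: Q[(i, b)]) for i in S}  # step 6
```

For this toy MDP the greedy policy chooses action 1 in both states, since action 1 steers the chain toward the rewarded transition.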
![Page 53: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/53.jpg)
Reinforcement Learning
Q-learning algorithm – notes 1/3
1. Interface RL Algorithm – Simulator:

[Diagram: the RL Algorithm sends (i, a) to the Simulator; the Simulator returns j and r(i,a,j).]

2. k_max = (large) total number of iterations

3. Problem: heavily asynchronous updating of the Q-factors. Actions for a given state i are all selected with the same probability 1/|A(i)|, but the generated states j depend on the underlying distributions. Solution = smaller step. Some step updating formulae:

$$s^{n} = \frac{s^{1}}{n+1}, \quad s^{1} \le 1 \ \ (s^{1} = 1 \text{ for Robbins-Monro}); \qquad
s^{n+1} = s^{n}\,r, \quad \text{e.g. } s^{1} = 0.1,\ r = 1 - 10^{-6}$$
![Page 54: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/54.jpg)
Reinforcement Learning
Q-learning algorithm – notes 2/3
Synchronous convergence:

For synchronous updating the algorithm

$$Q^{k+1}(i,a) = (1-\alpha^{k})\,Q^{k}(i,a) + \alpha^{k}\big[H(Q^{k}) + w^{k}\big](i,a),$$

where $w^{k}$ is the (simulation) noise generated in the k-th iteration, converges to a fixed point of H with probability 1 provided it exists, if the following conditions are satisfied:

1. The function H is Lipschitz continuous and non-expansive w.r.t. the max norm.
2. The step size $\alpha^{k}$ satisfies:

$$\sum_{k=0}^{\infty} \alpha^{k} = \infty, \qquad \sum_{k=0}^{\infty} \big(\alpha^{k}\big)^{2} < \infty$$

3. The iterates $Q^{k}(i,a)$ are bounded.
4. For all states, for some constants A, B and some norm:

$$\mathbb{E}\big[w^{k}\big] = 0, \qquad \mathbb{E}\big[\|w^{k}\|^{2}\big] \le A + B\,\|Q^{k}(i,a)\|^{2}$$
![Page 55: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/55.jpg)
Reinforcement Learning
Q-learning algorithm – notes 3/3
Asynchronous convergence:
Let $Q^{m}(i,a)$ be the iterate that has been updated m times synchronously, and let $R^{n}(i,a)$ be the iterate that has been updated n times asynchronously. Then the error of asynchronism is defined to be:

$$\epsilon^{k}(i,a) = \big|R^{k}(i,a) - Q^{k}(i,a)\big|, \quad \forall (i,a)$$

The degree of asynchronism is the difference in the number of updatings |m − n| (also called the age gap). Also assume that $R^{0}(i,a) = Q^{0}(i,a)\ \forall (i,a)$. Then:

1. For a given ε > 0, there exists a rule for the step size such that the error of asynchronism in the k-th iteration for any state-action pair is less than ε.
2. With k → ∞, the error of asynchronism asymptotically converges to 0 with a diminishing step size, as long as the degree of asynchronism is finite.
![Page 56: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/56.jpg)
Reinforcement Learning - Summary

Needed:

- Simulator based on the underlying distributions, moving from state to state and generating rewards
- RL Algorithm (easy to implement)
- Enough time, but only a linear increase with the number of states: |S| = m, max_i{|A(i)|} = n ⇒ O(mn²) performance for a given average number of Q-factor updates per state-action pair
- Enough memory: |S| = m, |A(i)| = n ⇒ 2mn locations for Q-factors and visit factors: Problem

Curse of dimensionality solved partially for |A(i)| < |S|
![Page 57: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/57.jpg)
Response Surface Methods (RSM)
Idea 1: Instead of storing the value Q(i,a) for each state-action pair, find a function f(i,a,**b**) where **b** is the stored vector of parameters. The function f is the metamodel used to compute the Q-factor values:

$$Q(i,a) = f(i,a,\mathbf{b}), \quad \forall i, a$$

Steps:

1. Sampling from the function space.
2. Function fitting (regression).
3. Testing the metamodel.

Problem: knowledge of the metamodel is assumed, and regression is a model-based method. This assumption is generally not satisfied in DP.

Partial solution: trial & error with various metamodels.
![Page 58: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/58.jpg)
Response Surface Methods
Idea 2: Instead of storing the value Q(i,a) for each state-action pair, use the samples to train and validate a nonlinear neural network and store only its weights. To obtain the Q-factor value, connect the state-action pair to the input nodes; the output node then provides the Q-factor:

(i, a) → [neural network] → Q(i, a)

Here: neural network = model-free function approximation

(For small |A(i)| a separate network may be used for each action.)
![Page 59: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/59.jpg)
Response Surface Methods
Q-learning algorithm using a neural network:

1. Initialize the weights of the neural network to small random numbers (same for all actions). Set step size s, k_max, k = 0, and generate any initial state i.

2. Let the current state be i. Select action a ∈ A(i) with probability 1/|A(i)|.

3. Simulate action a in state i by generating next state j and reward r(i,a,j). Increment k.

4. Determine the output q of the neural network for inputs i and a. Also find the outputs for state j and all actions from A(j). Let the maximum of these outputs be q_next. Update q as follows:

$$q \leftarrow (1-s)\,q + s\,\big[r(i,a,j) + \lambda\,q_{next}\big]$$

5. Use q to update the neural network by an incremental algorithm. The new data piece has q as the output and i, a as the inputs. If k < k_max, set i = j and go to step 2. Otherwise continue.

6. The policy learned is stored in the weights of the neural network. Optimal actions for each state are those generating the maximum output of the neural network.
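The steps above can be sketched with a minimal hand-rolled network. The one-hidden-layer ReLU net, its size H = 8, the step size, and the toy simulator are all assumptions for illustration; a real application would use an ML library and more careful training:

```python
import random

# Q-learning with a tiny neural network in place of a Q-table: the
# one-hot-encoded pair (i, a) feeds the input nodes, the single output
# node returns Q(i, a); one SGD step per sample updates the weights.
S_N, A_N, lam, step = 2, 2, 0.8, 0.05   # sizes, discount, step (assumed)
H = 8                                    # hidden nodes (assumed)

random.seed(0)
W1 = [[random.uniform(-0.1, 0.1) for _ in range(S_N + A_N)] for _ in range(H)]
W2 = [random.uniform(-0.1, 0.1) for _ in range(H)]

def features(i, a):
    """One-hot encoding of the state-action pair."""
    x = [0.0] * (S_N + A_N)
    x[i] = 1.0
    x[S_N + a] = 1.0
    return x

def forward(i, a):
    """Network output q plus intermediates needed for the gradient."""
    x = features(i, a)
    h = [max(0.0, sum(w * xv for w, xv in zip(row, x))) for row in W1]  # ReLU
    return sum(v * hv for v, hv in zip(W2, h)), x, h

def train(i, a, target):
    """One SGD step on the squared error (q - target)^2."""
    q, x, h = forward(i, a)
    err = q - target
    for n in range(H):
        grad_hidden = err * W2[n] if h[n] > 0 else 0.0
        W2[n] -= step * err * h[n]
        for m in range(S_N + A_N):
            W1[n][m] -= step * grad_hidden * x[m]

def simulate(i, a):
    """Toy simulator assumed for illustration."""
    j = a if random.random() < 0.9 else 1 - a
    return j, 1.0 if (i == 1 and j == 1) else 0.0

i = 0
for _ in range(20000):
    a = random.randrange(A_N)
    j, r = simulate(i, a)
    q_next = max(forward(j, b)[0] for b in range(A_N))
    q = forward(i, a)[0]
    new_q = (1 - step) * q + step * (r + lam * q_next)  # step 4 update
    train(i, a, new_q)                                  # step 5: one SGD pass
    i = j
```

Only the 2·H·(S_N+A_N)/... network weights are stored, not one value per state-action pair, which is the whole point of the approach for large |S|.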
![Page 60: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/60.jpg)
Response Surface Methods
Neuro-Dynamic Programming - Notes
1. Spill-over happens when Q-factors other than the updated one are also affected when the network is being trained. Partial solution: divide the state space into compartments and use separate neural networks within the compartments.

2. Over-fitting of the neural network also affects Q-factors other than the updated one. Partial solution: in the training process perform only a few (sometimes only one) iterations.

3. Local optima are the danger when training a nonlinear neural network. Partial solution: divide the state space into compartments and use linear neural networks within the compartments.

A proper understanding of RL behavior when coupled with function approximation is still not available.
![Page 61: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/61.jpg)
Other Memory Saving Methods
State Aggregation
Idea: Lump states with similar characteristics (w.r.t. the reward) together into single states. A robust suggested method is often based on various encoding schemes.

Problems:

1. Grouping may result in losing the Markov property (a theorem giving conditions to preserve the Markov property is available).

2. Even if the Markov property is preserved, lumping may result in losing optimality of the solution.

3. Finding similarity w.r.t. the reward may be difficult (the TRM matrix is unknown). Application-dependent rules must be used.
![Page 62: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/62.jpg)
Other Memory Saving Methods
Interpolation Methods
Idea: Store representative Q-factors and determine the other Q-factors by some interpolation technique. Replace the nearest representative factor by the new one.

The unknown Q-factor is computed as an (un)weighted average of some close representatives (various norms and techniques are used).
![Page 63: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/63.jpg)
Conclusion
The combination of the RL algorithm with simulation solves the Curse of Modelling problem. Only the underlying distributions to be used in the simulator are required.

Response Surface methods, especially the use of Neural Networks, solve the Curse of Dimensionality problem.

Many technical problems remain; both successful applications and failures are reported.

Many important theoretical results have been obtained, but some areas, like the coupling of the RL algorithm with function approximation, are still open to further research.

The algorithms are not difficult to implement provided the underlying distributions can be obtained.

If the matrices are available and the number of states is not large, the use of classical methods based on Bellman equations is preferable.
![Page 64: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/64.jpg)
DP methods – Summary
| Method | Curse of Modelling Resolved | Curse of Dimensionality Resolved |
|---|---|---|
| Standard BE Algorithms | N | N |
| Standard Algorithms; Matrices by Simulation | Y | N |
| Reinforcement Learning Only | Y | Partially |
| Reinforcement Learning with Response Surface Method | Y | Y |
![Page 65: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/65.jpg)
Matlab functions currently available
Discounted Rewards by Policy Iteration
Discounted Rewards by Value Iteration
Average Reward by Policy Iteration
Average Reward by Value Iteration
(Contact me)
Thank You