Jaroslav Sklenář
2017
University of Malta, Department of Statistics & Operations Research
e-mail: [email protected]
Web: http://staff.um.edu.mt/jskl1/
Infinite Horizon Dynamic Programming Models
Dynamic Programming
History:
Richard Bellman (1920-1984)
Principle of Optimality
"For every stage and every decision that starts an optimal plan from this stage, the optimal plan consists of the given decision followed by the plan which is optimal with respect to the succeeding stage".
Bellman Equations
Used in various algorithms to evaluate a given policy and to find an optimal policy in acceptable time.
Infinite horizon DP models - presentation
• Markov Decision Process – assumptions, symbols
– Direct full enumeration algorithm
• Bellman Equations for discounted rewards case
– Evaluation of a given policy
– Finding an optimal policy by
• Policy iteration
• Value iteration
• Bellman Equations for average reward case and the Semi-Markov Decision Process
• Reinforcement Learning
• Response Surface and other memory saving methods
Four classes of DP models
Horizon \ Nature   Deterministic   Stochastic
Finite
Infinite

The finite-horizon deterministic class (a special type) was dealt with earlier.
(Figure: Multistage Decision Model – stage t graphically, from the finite-horizon deterministic DP models.)
Consequences of t → ∞

- Decision stages (though still present) lose their identity. They are equal, having the same states and decisions.
- We deal with homogeneous (stationary, time-independent) stochastic discrete-time systems.
- If time information is important for decision making, time is incorporated in the state. Ex.: time to flight departure.
- We cannot use the multivariate optimization point of view; only the optimal control view is available. Objective = optimal policy (a mapping of states to actions).
Infinite horizon DP models: general assumptions 1/2
- Entities involved: system, decision maker (controller).
- Actions are taken at discrete time points; we assume unit (constant) time intervals – will be relaxed.
- An action (typically, not necessarily) changes the system state and generates some reward/cost.

(Figure: a discrete-time dynamic stochastic system controlled by a policy; the policy issues an action (decision), the system reports its state back to the controller.)
Infinite horizon DP models: general assumptions 2/2
- Only homogeneous (time-independent) entities are considered.
- Transition rewards are deterministic (will be relaxed), bounded, and depend on the current state, action and next state.
- Markovian (in the general sense) property of the system: the probability of reaching a certain next state is given only by the current state and current action.
- Policy is deterministic; the action is given by the current state.
- The objective function to be maximized takes one of the forms:
  - average reward per unit time
  - total discounted reward from the infinite sequence of actions.
Together we have Markov Decision Process (MDP).
Infinite horizon DP models: symbols & notation 1/2
S = finite nonempty set of states: i, j ∈ S. (Identification of stages is not used.)
A(i) = finite nonempty set of actions available in state i. Often only decision-making states with |A(i)| > 1 are considered.
u(i) ∈ A(i) = action taken in state i under policy u.
p(i,a,j) = P[moving from state i to state j if action a is taken]. Probabilities are arranged in the Transition Probability Matrix (TPM): p(i,a,j) = p(i,u(i),j) = P_u(i,j).
r(i,a,j) = immediate reward of the transition from i to j under action a. Rewards are arranged in the Transition Reward Matrix (TRM): r(i,a,j) = r(i,u(i),j) = R_u(i,j).
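To make the notation concrete, the three-index quantities p(i,a,j) and r(i,a,j) and the policy-specific matrices P_u and R_u can be stored as plain arrays. A minimal Python sketch (ours, not from the slides; all numbers are invented):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number below is made up.
# P[i, j, a] = p(i, a, j),  R[i, j, a] = r(i, a, j)  (0-based indices)
P = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
P[0, :, 0] = [0.9, 0.1]; R[0, :, 0] = [1.0, 5.0]   # state 0, action 0
P[0, :, 1] = [0.2, 0.8]; R[0, :, 1] = [0.0, 8.0]   # state 0, action 1
P[1, :, 0] = [0.5, 0.5]; R[1, :, 0] = [2.0, 2.0]   # state 1, action 0
P[1, :, 1] = [0.7, 0.3]; R[1, :, 1] = [3.0, 1.0]   # state 1, action 1

u = [1, 0]                                          # a policy: u(i) = action in state i
Pu = np.array([P[i, :, u[i]] for i in range(2)])    # TPM P_u(i, j)
Ru = np.array([R[i, :, u[i]] for i in range(2)])    # TRM R_u(i, j)
print(Pu)                                           # each row sums to 1
```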
Infinite horizon DP models: symbols & notation 2/2
r(i,a) = expected immediate reward for state i and action a:

$$ r(i,a) = \sum_{j \in S} p(i,a,j)\, r(i,a,j), \quad a = u(i) $$

π_u(i) = limiting probability of state i under policy u. Under the ergodicity assumption we get the vector π_u by solving the equations

$$ \pi_u^T = \pi_u^T P_u, \qquad \pi_u^T e = 1 $$

ρ_u = average reward per transition (unit time) under policy u:

$$ \rho_u = \sum_{i \in S} \pi_u(i)\, r(i,u(i)) $$
Optimization by full enumeration #1
For each policy u:
  Compute expected immediate rewards for all states:

$$ r(i,u(i)) = \sum_{j \in S} p(i,u(i),j)\, r(i,u(i),j), \quad i \in S $$

  Compute limiting probabilities π_u:

$$ \pi_u^T = \pi_u^T P_u, \qquad \pi_u^T e = 1 $$

  Compute the average reward ρ_u per transition:

$$ \rho_u = \sum_{i \in S} \pi_u(i)\, r(i,u(i)) $$

Select the best policy.

Ex.: |S| = 100, |A(i)| = 2 for all states.
Number of different policies is 2^100 ≈ 1.2677 × 10^30.
At 1 μs per policy this takes about 4 × 10^16 years.
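The enumeration above can be sketched in a few lines (the course code is in Matlab; this is a Python sketch on an invented 2-state, 2-action instance):

```python
import numpy as np
from itertools import product

# Toy MDP, all numbers invented: P[i, j, a] = p(i, a, j), R[i, j, a] = r(i, a, j)
P = np.array([[[0.9, 0.2], [0.1, 0.8]],
              [[0.5, 0.7], [0.5, 0.3]]])
R = np.array([[[1.0, 0.0], [5.0, 8.0]],
              [[2.0, 3.0], [2.0, 1.0]]])
S, A = 2, 2

best_rho, best_u = -np.inf, None
for u in product(range(A), repeat=S):              # all |A|^|S| policies
    Pu = np.array([P[i, :, u[i]] for i in range(S)])
    ru = np.array([P[i, :, u[i]] @ R[i, :, u[i]] for i in range(S)])
    # limiting probabilities: pi^T = pi^T Pu together with pi^T e = 1
    M = np.vstack([Pu.T - np.eye(S), np.ones(S)])
    pi = np.linalg.lstsq(M, np.r_[np.zeros(S), 1.0], rcond=None)[0]
    rho = pi @ ru                                  # average reward per transition
    if rho > best_rho:
        best_rho, best_u = rho, u
print(best_u, best_rho)
```

Even at a microsecond per policy, the |A|^|S| growth makes this hopeless beyond toy sizes, which is what motivates the Bellman-equation algorithms.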
Bellman Equations (BE) for Discounted Rewards

Def: Value function vector for policy u and discounted rewards:

$$ J_u(i) = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{k} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big], \quad i \in S $$

where 0 < λ < 1 is the discounting factor and the expectation is over all trajectories.

Next we find Bellman equations that can be used to:
1) Compute the value function vector for a given policy
2) Find an optimal policy
Bellman Equations for Discounted Rewards
BE for evaluation of policy u:

(Figure: one transition, from state i to state j with probability p(i,u(i),j), collecting reward r(i,u(i),j); value functions J_u(i) and J_u(j).)

$$ J_u(i) = \mathbb{E}\big[ r(i,u(i),j) + \lambda J_u(j) \big] = \sum_{j=1}^{|S|} p(i,u(i),j)\big[ r(i,u(i),j) + \lambda J_u(j) \big] $$

$$ = \sum_{j=1}^{|S|} p(i,u(i),j)\, r(i,u(i),j) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J_u(j) $$

$$ J_u(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J_u(j), \quad i \in S $$

$$ J_u = r_u + \lambda P_u J_u \quad\Rightarrow\quad J_u = (I - \lambda P_u)^{-1} r_u $$

Optimization by full enumeration #2:
Compute the value function vector for each policy, select the best policy.
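The closed form J_u = (I − λP_u)^(-1) r_u can be checked numerically; this Python sketch (invented numbers) does what the Matlab line `(eye(S) - lambda*Pmu)\ri'` in the later listing does:

```python
import numpy as np

lam = 0.9                          # discounting factor λ
Pu = np.array([[0.2, 0.8],         # TPM of a fixed policy (invented numbers)
               [0.7, 0.3]])
ru = np.array([6.4, 2.4])          # expected immediate rewards r(i, u(i))

# Solve (I - λ Pu) Ju = ru; solving is cheaper and more stable than inverting
Ju = np.linalg.solve(np.eye(2) - lam * Pu, ru)

# Ju must be the fixed point of the policy's Bellman equation
assert np.allclose(Ju, ru + lam * Pu @ Ju)
print(Ju)
```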
Bellman Equations for Discounted Rewards

Problem: Fast algorithm to find an optimal policy

Definition of value function repeated:

$$ J_u(i) = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{k} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big], \quad i \in S $$

Lemma 1: The value function vector defined as

$$ J^*(i) = \max_{u}\{ J_u(i) \}, \quad i \in S $$

is the optimal value function vector associated with the optimal policy.

Justification: The above is clearly an upper bound on J*; for the existence of a common maximizer u_max, see later.
Bellman Equations for Discounted Rewards
We introduce the transformation T_u:

$$ T_u(J(i)) = T_u J(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J(j), \quad i \in S $$

$$ T_u^k J(i) = T_u\big( T_u^{k-1} J(i) \big), \quad k \ge 2 $$

Transformation T_u is based on the above BE for a given policy u:

$$ J_u(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J_u(j), \quad i \in S $$

$$ J_u(i) = T_u J_u(i), \quad i \in S $$
Bellman Equations for Discounted Rewards
Similarly we introduce the transformation T:

$$ T(J(i)) = TJ(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J(j) \Big\}, \quad i \in S $$

$$ T^k J(i) = T\big( T^{k-1} J(i) \big), \quad k \ge 2 $$

Transformation T is based on the Bellman optimality equation (see later). We also note that by selecting the policy u_max made of the maximizers in the above terms we get:

$$ u_{\max}(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J(j) \Big\}, \quad i \in S $$

$$ T_{u_{\max}} J(i) = TJ(i) $$
Bellman Equations for Discounted Rewards
Proposition 1: Both transformations are monotone:

$$ J(i) \le J'(i) \;\Rightarrow\; T^k J(i) \le T^k J'(i), \;\; T_u^k J(i) \le T_u^k J'(i), \quad i \in S, \; k > 0 $$

Proof by induction for T_u (similarly for T). For k = 1:

$$ T_u J(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J(j) \le r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, J'(j) = T_u J'(i) $$

If it holds for k = m:

$$ T_u^{m+1} J(i) = r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, T_u^m J(j) \le r(i,u(i)) + \lambda \sum_{j=1}^{|S|} p(i,u(i),j)\, T_u^m J'(j) = T_u^{m+1} J'(i) $$
Bellman Equations for Discounted Rewards
Proposition 2: Both transformations are contractive with respect to the max norm

$$ \|x\| = \max_i |x_i| $$

This means that for any two vectors J and J':

$$ \| TJ - TJ' \| \le \lambda \| J - J' \|, \qquad \| T_u J - T_u J' \| \le \lambda \| J - J' \| $$

where 0 < λ < 1 is the discounting factor.

Proof: We assume that TJ(i) ≥ TJ'(i) for all i. Also let a(i) be a maximizer:

$$ a(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J(j) \Big\}, \quad i \in S $$

Since a(i) is a maximizer, by the definition of the mapping T we have:

$$ TJ(i) = r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J(j), \quad i \in S $$
Bellman Equations for Discounted Rewards
Proof cont. Similarly, by replacing J by J', let b(i) be the maximizer:

$$ b(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J'(j) \Big\}, \quad i \in S $$

$$ TJ'(i) = r(i,b(i)) + \lambda \sum_{j=1}^{|S|} p(i,b(i),j)\, J'(j), \quad i \in S $$

Since b(i) maximizes the term in square brackets we have:

$$ TJ'(i) \ge r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J'(j), \quad i \in S $$

$$ -TJ'(i) \le -\Big[ r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J'(j) \Big], \quad i \in S $$
Bellman Equations for Discounted Rewards
Proof cont. Combining all the above we obtain (for all i):

$$ 0 \le TJ(i) - TJ'(i) \le \Big[ r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J(j) \Big] - \Big[ r(i,a(i)) + \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\, J'(j) \Big] $$

$$ = \lambda \sum_{j=1}^{|S|} p(i,a(i),j)\big[ J(j) - J'(j) \big] \le \lambda \sum_{j=1}^{|S|} p(i,a(i),j) \max_j |J(j) - J'(j)| = \lambda \max_j |J(j) - J'(j)| \sum_{j=1}^{|S|} p(i,a(i),j) $$

$$ = \lambda \max_j |J(j) - J'(j)| = \lambda \| J - J' \| $$

Thus we can write:

$$ TJ(i) - TJ'(i) \le \lambda \| J - J' \|, \quad i \in S $$
Bellman Equations for Discounted Rewards
Proof cont. Similarly, by assuming that TJ(i) ≤ TJ'(i) for all i, we obtain:

$$ TJ'(i) - TJ(i) \le \lambda \| J' - J \| = \lambda \| J - J' \|, \quad i \in S $$

The two inequalities together (both LHS are nonnegative)

$$ TJ'(i) - TJ(i) \le \lambda \| J - J' \|, \quad i \in S $$
$$ TJ(i) - TJ'(i) \le \lambda \| J - J' \|, \quad i \in S $$

give the result:

$$ \max_i | TJ(i) - TJ'(i) | \le \lambda \| J - J' \| $$
$$ \| TJ - TJ' \| \le \lambda \| J - J' \| $$

For the mapping T_u the proof is similar.
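The contraction can also be observed numerically: applying T to two different vectors shrinks their max-norm distance by a factor of at most λ at every step. A Python sketch (toy MDP, all numbers invented):

```python
import numpy as np

lam = 0.9                                      # discounting factor λ
# P[i, j, a] = p(i, a, j); rr[i, a] = expected immediate reward r(i, a)
P = np.array([[[0.9, 0.2], [0.1, 0.8]],
              [[0.5, 0.7], [0.5, 0.3]]])
rr = np.array([[1.4, 6.4],
               [2.0, 2.4]])

def T(J):
    """Bellman optimality operator: (TJ)(i) = max_a { r(i,a) + λ Σ_j p(i,a,j) J(j) }."""
    return np.max(rr + lam * np.einsum('ija,j->ia', P, J), axis=1)

J, Jp = np.zeros(2), np.array([10.0, -3.0])    # two arbitrary starting vectors
for _ in range(5):
    before = np.max(np.abs(J - Jp))            # ||J - J'|| before the step
    J, Jp = T(J), T(Jp)
    assert np.max(np.abs(J - Jp)) <= lam * before + 1e-12
```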
Bellman Equations for Discounted Rewards
Lemma 2: For a given vector h of dimension |S|:

$$ T_u^k h(i) = \mathbb{E}\Big[ \lambda^k h(x_{k+1}) + \sum_{s=1}^{k} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big], \quad i \in S $$

Verified for k = 1 (recall T_u J(i) = r(i,u(i)) + λ Σ_j p(i,u(i),j) J(j)):

$$ T_u h(i) = \sum_{j=1}^{|S|} p(i,u(i),j)\big[ r(i,u(i),j) + \lambda h(j) \big] = \mathbb{E}\Big[ \lambda h(x_2) + \sum_{s=1}^{1} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big] $$
Bellman Equations for Discounted Rewards
Lemma 2 verified for k = 2:

$$ T_u^2 h(i) = \sum_{j=1}^{|S|} p(i,u(i),j)\big[ r(i,u(i),j) + \lambda\, T_u h(j) \big] $$

$$ = \sum_{j=1}^{|S|} p(i,u(i),j)\Big[ r(i,u(i),j) + \lambda \sum_{l=1}^{|S|} p(j,u(j),l)\big[ r(j,u(j),l) + \lambda h(l) \big] \Big] $$

$$ = \sum_{j=1}^{|S|} p(i,u(i),j)\, r(i,u(i),j) + \lambda \sum_{j=1}^{|S|} \sum_{l=1}^{|S|} p(i,u(i),j)\, p(j,u(j),l)\, r(j,u(j),l) + \lambda^2 \sum_{j=1}^{|S|} \sum_{l=1}^{|S|} p(i,u(i),j)\, p(j,u(j),l)\, h(l) $$

$$ = \mathbb{E}\Big[ \lambda^2 h(x_3) + \sum_{s=1}^{2} \lambda^{s-1}\, r(x_s, u(x_s), x_{s+1}) \,\Big|\, x_1 = i \Big] $$

For the complete proof see Bertsekas, D.P., Dynamic Programming and Optimal Control.
Bellman Equations for Discounted Rewards
Proposition 3: For any bounded function h : S → ℝ the optimal value function vector satisfies:

$$ J^*(i) = \lim_{k \to \infty} T^k h(i), \quad i \in S $$

Proof: Split the sum after the first P terms:

$$ J_u(i) = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$

$$ = \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] + \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$

$$ = \mathbb{E}\Big[ \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] + \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$

Immediate rewards are finite: |r(x_s, u(x_s), x_{s+1})| ≤ M for all s.
Bellman Equations for Discounted Rewards
Applied to the 2nd term:

$$ \lim_{k \to \infty} \Big| \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \Big| \le \lim_{k \to \infty} \sum_{s=P+1}^{k} \lambda^{s-1} M = \frac{\lambda^P M}{1-\lambda} $$

We denote the limit by A:

$$ \lim_{k \to \infty} \mathbb{E}\Big[ \sum_{s=P+1}^{k} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] = A, \qquad |A| \le \frac{\lambda^P M}{1-\lambda}, \qquad -\frac{\lambda^P M}{1-\lambda} \le A \le \frac{\lambda^P M}{1-\lambda} $$

Adding J_u(i) to each side we get:

$$ J_u(i) - \frac{\lambda^P M}{1-\lambda} \;\le\; J_u(i) + A \;\le\; J_u(i) + \frac{\lambda^P M}{1-\lambda} $$
Bellman Equations for Discounted Rewards
Using A in the first equation, the above inequality becomes:

$$ J_u(i) - \frac{\lambda^P M}{1-\lambda} \le \mathbb{E}\Big[ \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \le J_u(i) + \frac{\lambda^P M}{1-\lambda} \qquad (1) $$

Since λ > 0 we have:

$$ -\max_i |h(i)| \le h(i) \le \max_i |h(i)| \;\Rightarrow\; -\lambda^P \max_i |h(i)| \le \lambda^P\, \mathbb{E}[h(x_{P+1})] \le \lambda^P \max_i |h(i)| \qquad (2) $$

Adding (1) and (2) gives:

$$ J_u(i) - \lambda^P \max_i |h(i)| - \frac{\lambda^P M}{1-\lambda} \le \mathbb{E}\Big[ \lambda^P h(x_{P+1}) + \sum_{s=1}^{P} \lambda^{s-1} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \le J_u(i) + \lambda^P \max_i |h(i)| + \frac{\lambda^P M}{1-\lambda} $$
Bellman Equations for Discounted Rewards
Now we use Lemma 2 for the middle term (with k = P):

$$ J_u(i) - \lambda^P \max_i |h(i)| - \frac{\lambda^P M}{1-\lambda} \le T_u^P h(i) \le J_u(i) + \lambda^P \max_i |h(i)| + \frac{\lambda^P M}{1-\lambda} $$

The above holds for any policy. Selecting the one that maximizes the above terms and using Lemma 1 we get:

$$ J^*(i) - \lambda^P \max_i |h(i)| - \frac{\lambda^P M}{1-\lambda} \le T^P h(i) \le J^*(i) + \lambda^P \max_i |h(i)| + \frac{\lambda^P M}{1-\lambda} $$

By taking the limit for P → ∞ (so that λ^P → 0) we get the result:

$$ J^*(i) \le \lim_{P \to \infty} T^P h(i) \le J^*(i) \;\Rightarrow\; \lim_{P \to \infty} T^P h(i) = J^*(i) $$
Bellman Equations for Discounted Rewards
The proposition

$$ J^*(i) = \lim_{k \to \infty} T^k h(i), \quad i \in S $$

defines the so-called Bellman Optimality Equation (BOE) used to find an optimal policy:

$$ J^*(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^*(j) \Big\}, \quad i \in S $$

$$ J^*(i) = TJ^*(i), \quad i \in S $$

As a direct solution of the BOE by linear algebra is not possible (because of the max operator), we use algorithms based on the two transformations. There are two algorithms (with several modifications):

Policy Iteration
Value Iteration
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Policy Iteration

Algorithm:
1. k = 1, select any policy u_k.
2. Evaluate policy u_k by solving the (basic) BE:

$$ J^k(i) = r(i,u_k(i)) + \lambda \sum_{j=1}^{|S|} p(i,u_k(i),j)\, J^k(j), \quad i \in S $$

3. Improvement step. Find a new policy u_{k+1} such that

$$ u_{k+1}(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^k(j) \Big\}, \quad i \in S $$

If possible leave u_{k+1}(i) = u_k(i).

4. If the policy did not change then u_k is optimal. Otherwise k = k+1, go to step 2.
function [mu,J,ri,Pmu,rr] = SDPpoliterDR(S,A,P,r,lambda)
mu = ones(S,1); % initial mu(i) = 1
for i=1:S % for all states i
for a=1:A % for all actions a
rr(i,a) = P(i,:,a)*r(i,:,a)'; % expected immediate rewards
end
end
improved = 1; % flag whether improvement happened
while improved == 1
improved = 0;
for i=1:S % for all states i
ri(i) = rr(i,mu(i)); % expected rewards for given policy
for j=1:S % for all states j
Pmu(i,j) = P(i,j,mu(i)); % TPM for given policy
end
end
J = (eye(S) - lambda*Pmu)\ri'; % computing J for given policy
for i=1:S % for all states i
for a=1:A % for all actions a
y(a) = rr(i,a) + lambda*(P(i,:,a)*J); % this is maximized
end
[J(i) newmui] = max(y); % maximization
if J(i)>y(mu(i)) % improvement ?
improved = 1;
mu(i) = newmui; % updating policy if improved
end
end
end
Bellman Equations for Discounted Rewards
Solving Bellman Optimality Equation by Policy Iteration
Notes on the algorithm
- Provides an optimal policy in a finite number of iterations. A formal convergence proof exists, but intuitively:
  - each iteration improves the policy
  - there is a finite number of policies
- Each iteration solves a system of |S| linear equations.
  Matlab: J = (eye(S) - lambda*Pmu)\ri';
- Summary: "relatively small number of complicated iterations"
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Value Iteration

Algorithm:
1. k = 1, select any vector J^1, specify ε > 0.
2. Compute:

$$ J^{k+1}(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^k(j) \Big\}, \quad i \in S $$

3. If

$$ \| J^{k+1} - J^k \| < \varepsilon\,(1-\lambda)/(2\lambda) $$

then go to step 4, otherwise k = k+1, go to step 2.

4. The optimal policy is given by:

$$ u_{\max}(i) = \arg\max_{a \in A(i)}\Big\{ r(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\, J^{k+1}(j) \Big\}, \quad i \in S $$
Bellman Equations for Discounted Rewards
Proposition 4: The value iteration algorithm generates an ε-optimal policy; that is, if J_u is the value function vector of the policy u provided by the algorithm and J* is the optimal value function vector, then:

$$ \| J_u - J^* \| \le \varepsilon $$

Proof: From the triangular inequality of a norm:

$$ \| J_u - J^* \| \le \| J_u - J^{k+1} \| + \| J^{k+1} - J^* \| $$

As u is made of the maximizing actions in each state, we have:

$$ T_u J^{k+1} = T J^{k+1}, \qquad J_u = T_u J_u $$

Using the above and the contraction property we obtain:
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Value Iteration

Proof – cont.

$$ \| J_u - J^{k+1} \| \le \| J_u - T_u J^{k+1} \| + \| T_u J^{k+1} - J^{k+1} \| = \| T_u J_u - T_u J^{k+1} \| + \| T J^{k+1} - T J^k \| $$

$$ \le \lambda \| J_u - J^{k+1} \| + \lambda \| J^{k+1} - J^k \| $$

(using T_u J^{k+1} = T J^{k+1} and J^{k+1} = T J^k). Rearranging the first and the last terms provides:

$$ \| J_u - J^{k+1} \| \le \frac{\lambda}{1-\lambda}\, \| J^{k+1} - J^k \| $$

Similarly we can obtain:

$$ \| J^{k+1} - J^* \| \le \frac{\lambda}{1-\lambda}\, \| J^{k+1} - J^k \| $$
Bellman Equations for Discounted Rewards
Solving the Bellman Optimality Equation by Value Iteration

Proof – cont. Inserting the above terms into the triangular inequality:

$$ \| J_u - J^* \| \le \| J_u - J^{k+1} \| + \| J^{k+1} - J^* \| \le \frac{2\lambda}{1-\lambda}\, \| J^{k+1} - J^k \| $$

Since the inequality

$$ \| J^{k+1} - J^k \| < \varepsilon\,(1-\lambda)/(2\lambda) $$

is tested in step 3 of the algorithm, we finally get:

$$ \| J_u - J^* \| \le \varepsilon $$
function [mu,J,rr] = SDPvaliterDR(S,A,P,r,lambda,epsilon)
J = zeros(S,1); % initial J(i) = 0
for i=1:S % for all states i
for a=1:A % for all actions a
rr(i,a) = P(i,:,a)*r(i,:,a)'; % expected immediate rewards
end
end
d = 0.5*epsilon*(1-lambda)/lambda; % norm limit
nrm = d+1; % initial norm > d
while nrm > d
K = J; % save old value functions
for i=1:S % for all states i
for a=1:A % for all actions a
y(a) = rr(i,a) + lambda*(P(i,:,a)*J); % this is maximized
end
[J(i) mu(i)] = max(y); % maximization
end
nrm = norm(J - K,inf); % ||.||inf norm
end
for i=1:S % the final result
for a=1:A % for all actions a
y(a) = rr(i,a) + lambda*(P(i,:,a)*J); % this is maximized
end
[J(i) mu(i)] = max(y); % maximization
end
Bellman Equations for Discounted Rewards
Solving Bellman Optimality Equation by Value Iteration
Notes on the algorithm
- Provides only an approximate solution, the so-called ε-optimal policy.
- The max-norm decreases with every iteration.
- Each iteration updates a vector and computes the norm,no complicated matrix operations.
- Summary: “relatively big number of simple iterations”
(modifications for faster convergence exist)
Comparison of the two methods – duration in CPU seconds
(Matlab R2007b; Intel Core 2 T5600 1.83 GHz, 1 GB RAM, ε = 0.0001)

|S|     |A|   Policy iteration   Value iteration
100     50    0.05               0.17
500     10    1.42               2.17
1000    5     9.86               3.53
1000    10    10.8               7.48
1500    5     32.4               7.72
2000    5     75.5               16.7
2500*   5     144.5              25.7
3000    2     245.7              11.4

* 10 matrices with 6.25 million entries each, about 500 MB of data
Bellman Equations for Average Reward

Def: ρ_u = average reward per transition (unit time) under policy u:

$$ \rho_u = \sum_{i \in S} \pi_u(i)\, r(i,u(i)) $$

Proposition 5: If a scalar ρ and a vector h of dimension |S| satisfy

$$ \rho + h(i) = r(i,u(i)) + \sum_{j=1}^{|S|} p(i,u(i),j)\, h(j), \quad i \in S $$

then ρ is the average reward associated with the policy u.

If a scalar ρ* and a vector J* of dimension |S| satisfy

$$ \rho^* + J^*(i) = \max_{a \in A(i)}\Big\{ r(i,a) + \sum_{j=1}^{|S|} p(i,a,j)\, J^*(j) \Big\}, \quad i \in S $$

then ρ* is the optimal average reward, J* is the optimal value function vector, and the policy u* made of the maximizers in the RHS of the equation is the optimal policy.
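For a fixed policy the first equation of Proposition 5 is a system of |S| linear equations in |S| + 1 unknowns (ρ and h); h is determined only up to an additive constant, so fixing one component, e.g. h(1) = 0, makes the system solvable by linear algebra. A Python sketch (ours; all numbers invented):

```python
import numpy as np

Pu = np.array([[0.2, 0.8],       # TPM of a fixed policy (invented numbers)
               [0.7, 0.3]])
ru = np.array([6.4, 2.4])        # expected immediate rewards r(i, u(i))
S = len(ru)

# rho + h(i) = r(i,u(i)) + sum_j p(i,u(i),j) h(j),  with h(0) fixed to 0.
# Unknown vector x = [rho, h(1), ..., h(S-1)].
M = np.zeros((S, S))
M[:, 0] = 1.0                              # coefficient of rho
M[:, 1:] = np.eye(S)[:, 1:] - Pu[:, 1:]    # coefficients of h(1..S-1)
x = np.linalg.solve(M, ru)
rho, h = x[0], np.r_[0.0, x[1:]]
print(rho)                                 # average reward of the policy
```

For this toy chain the stationary distribution is π_u = (7/15, 8/15), so the solved ρ coincides with π_u · r_u = 64/15.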
Bellman Equations for Average Reward

Proof (1st part, outline): We define the transformation L_u:

$$ L_u(J(i)) = L_u J(i) = r(i,u(i)) - \rho + \sum_{j=1}^{|S|} p(i,u(i),j)\, J(j), \quad i \in S $$

Then the 1st BE can be written as:

$$ h(i) = L_u h(i), \quad i \in S $$

By induction it is easy to prove the following:

$$ h(i) = L_u^k h(i), \quad i \in S, \; k \ge 1 $$

Similarly to the above Lemma 2 we have that:

$$ L_u^k h(i) = \mathbb{E}\Big[ h(x_{k+1}) + \sum_{s=1}^{k} \big( r(x_s,u(x_s),x_{s+1}) - \rho \big) \,\Big|\, x_1 = i \Big], \quad i \in S $$

Inserting this into the above equation, dividing by k and letting k → ∞ gives the result:

$$ \rho = \frac{\mathbb{E}[h(x_{k+1}) \mid x_1 = i] - h(i)}{k} + \frac{1}{k}\, \mathbb{E}\Big[ \sum_{s=1}^{k} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] \;\longrightarrow\; \rho = \lim_{k \to \infty} \frac{1}{k}\, \mathbb{E}\Big[ \sum_{s=1}^{k} r(x_s,u(x_s),x_{s+1}) \,\Big|\, x_1 = i \Big] $$
Bellman Equations for Average Reward

The proof of the 2nd part of the proposition is similar; the same operations are performed on an inequality when a policy other than the optimal one is taken.

Proposition 5 defines two Bellman equations that can be used to:
- evaluate a given policy
- generate the optimal policy

Again, the 2nd Bellman (optimality) equation cannot be solved by linear algebra methods. There are two algorithms (with several modifications):

Policy Iteration
Value Iteration

(times needed are similar to the discounted rewards case)
Bellman Equations for Semi-Markov Decision Problems

Additional assumption:

t(i,a,j) = deterministic transition time of going from state i to state j under action a. Times are arranged in the Transition Time Matrix (TTM):

$$t(i,a,j) = t(i,u(i),j) = T_u(i,j)$$

DTMDP (Deterministic Time MDP)

$\bar{t}(i,a)$ = average time spent in a transition from state i under action a:

$$\bar{t}(i,a) = \sum_{j \in S} p(i,a,j)\,t(i,a,j), \quad a = u(i)$$

Average reward of an SMDP:

$$\rho_u(i) = \lim_{k\to\infty}\frac{\mathbb{E}\big[\sum_{s=1}^{k} r(x_s,u(x_s),x_{s+1}) \,\big|\, x_1 = i\big]}{\mathbb{E}\big[\sum_{s=1}^{k} t(x_s,u(x_s),x_{s+1}) \,\big|\, x_1 = i\big]}
= \frac{\sum_{i=1}^{|S|} \Pi_u(i)\,\bar{r}(i,u(i))}{\sum_{i=1}^{|S|} \Pi_u(i)\,\bar{t}(i,u(i))}$$
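As a concrete illustration, the right-hand formula above can be evaluated for a small SMDP under a fixed policy. The 2-state matrices below and the use of power iteration to obtain the stationary distribution are assumptions for illustration, not part of the presentation:

```python
# Average reward of an SMDP under a fixed policy u, from the formula above:
# rho_u = sum_i Pi_u(i) rbar(i,u(i)) / sum_i Pi_u(i) tbar(i,u(i)).

P = [[0.5, 0.5], [0.2, 0.8]]   # TPM rows p(i,u(i),j)  (assumed values)
R = [[1.0, 3.0], [0.0, 2.0]]   # TRM rows r(i,u(i),j)
T = [[1.0, 2.0], [4.0, 1.0]]   # TTM rows t(i,u(i),j)

n = len(P)
rbar = [sum(P[i][j] * R[i][j] for j in range(n)) for i in range(n)]  # mean reward per jump
tbar = [sum(P[i][j] * T[i][j] for j in range(n)) for i in range(n)]  # mean time per jump

pi = [1.0 / n] * n             # stationary distribution Pi_u by power iteration
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

rho = (sum(pi[i] * rbar[i] for i in range(n))
       / sum(pi[i] * tbar[i] for i in range(n)))
```

For these numbers the stationary distribution is (2/7, 5/7), so rho_u = 12/11 ≈ 1.09 reward per unit of time.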
![Page 43: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/43.jpg)
Bellman Equations for Semi-Markov Decision Problems

Bellman Equation for SMDP to evaluate the policy u:

$$h(i) = \bar{r}(i,u(i)) - \rho_u\,\bar{t}(i,u(i)) + \sum_{j=1}^{|S|} p(i,u(i),j)\,h(j), \quad i \in S$$

Bellman Optimality Equation for SMDP:

$$J^*(i) = \max_{a \in A(i)}\Big\{\bar{r}(i,a) - \rho^*\,\bar{t}(i,a) + \sum_{j=1}^{|S|} p(i,a,j)\,J^*(j)\Big\}, \quad i \in S$$

Again there are two algorithms (with several modifications):
Policy Iteration
Value Iteration
![Page 44: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/44.jpg)
What’s wrong with Dynamic Programming?

We have efficient algorithms that are relatively easy to implement, with theoretically guaranteed convergence.

Why is DP used so rarely?
![Page 45: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/45.jpg)
It is cursed twice!

What’s wrong with Dynamic Programming?
Curse of dimensionality (Bellman ~1957)
|S| = 1,000 → 1,000,000 entries of TPM, TRM for each action

|S| = 1,000,000 → 10¹² entries of TPM, TRM for each action

(Each matrix made of 8 TB of data)
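The memory figures above follow from simple arithmetic, sketched here; the 8 bytes per double-precision entry is an assumption about the storage format:

```python
# One TPM (or TRM) for a single action has |S|^2 entries; at 8 bytes
# per double-precision entry this gives 8 TB of data for |S| = 10**6.
entries = 1_000_000 ** 2          # |S|^2 entries per matrix per action
terabytes = entries * 8 / 10 ** 12  # 8 bytes each, expressed in TB
```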
![Page 46: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/46.jpg)
It is cursed twice!

What’s wrong with Dynamic Programming?
Curse of modelling
TPM + TRM (+ TTM) = Theoretical model
Problem: How to obtain the values ?
a) pdf’s known: transition probabilities can be obtained by evaluating multiple integrals

b) unknown distributions: statistical evaluation of available data

c) known underlying distributions: matrices can be generated by simulation
![Page 47: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/47.jpg)
Generating matrices by simulation
Basic ideas
Keep counters and sums:
N(i,a) ~ incremented if action a is selected in state i
M(i,a,j) ~ incremented if action a selected in state i results in a transition to state j
R(i,a,j) ~ sum of rewards generated if action a selected in state i results in a transition to state j

$$p(i,a,j) = \frac{M(i,a,j)}{N(i,a)}, \qquad r(i,a,j) = \frac{R(i,a,j)}{M(i,a,j)}$$

Note: the sum R is not needed if the reward is updated by the Robbins-Monro algorithm ($r_s$ = generated sample):

$$r^{\,n+1}(i,a,j) = (1-\alpha^{n+1})\,r^{\,n}(i,a,j) + \alpha^{n+1}\,r_s(i,a,j), \qquad \alpha^{n+1} = 1/(n+1)$$
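The counters above can be sketched in code. Everything below — the toy simulator and the state/action sizes — is an assumed illustration, not part of the presentation:

```python
import random

S, A = 3, 2                    # states 0..2, actions 0..1 (assumed sizes)

def simulate(i, a):
    """Toy simulator, assumed for illustration: next state and reward."""
    j = random.randrange(S)
    return j, float(i + a - j)

N = [[0] * A for _ in range(S)]                        # N(i,a)
M = [[[0] * S for _ in range(A)] for _ in range(S)]    # M(i,a,j)
r = [[[0.0] * S for _ in range(A)] for _ in range(S)]  # running mean r(i,a,j)

random.seed(1)
for _ in range(20000):
    i, a = random.randrange(S), random.randrange(A)
    j, rs = simulate(i, a)
    N[i][a] += 1
    M[i][a][j] += 1
    alpha = 1.0 / M[i][a][j]   # Robbins-Monro step 1/(n+1)
    r[i][a][j] = (1 - alpha) * r[i][a][j] + alpha * rs  # sum R never stored

# Transition probabilities from the counters: p(i,a,j) = M(i,a,j)/N(i,a)
p = [[[M[i][a][j] / N[i][a] for j in range(S)]
      for a in range(A)] for i in range(S)]
```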
![Page 48: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/48.jpg)
Generating matrices by simulation
Robbins-Monro algorithm (1951)

$s_i$ = i-th sample
$X_n$ = mean computed from n samples

$$X_n = \frac{\sum_{i=1}^{n} s_i}{n} \;\Rightarrow\; (n+1)\,X_{n+1} = \sum_{i=1}^{n+1} s_i = n\,X_n + s_{n+1}$$

$$X_{n+1} = \frac{n\,X_n + s_{n+1}}{n+1} = \frac{n}{n+1}\,X_n + \frac{1}{n+1}\,s_{n+1}$$

$$X_{n+1} = (1-\alpha_{n+1})\,X_n + \alpha_{n+1}\,s_{n+1}, \qquad \alpha_{n+1} = 1/(n+1)$$
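The identity derived above is easy to check numerically; the sample values below are arbitrary:

```python
# The incremental (Robbins-Monro) update with step 1/(n+1) reproduces
# the ordinary sample mean, as derived above.
samples = [4.0, 7.0, 1.0, 10.0, 3.0]

X = 0.0
for n, s in enumerate(samples, start=1):
    alpha = 1.0 / n
    X = (1 - alpha) * X + alpha * s   # running mean, no sum stored

# X now equals sum(samples)/len(samples) up to rounding
```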
![Page 49: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/49.jpg)
Reinforcement Learning (RL)
Needed: A model-free algorithm based on the underlying distributions only (no need for the matrices). It would solve:

- the curse of modelling totally
- the curse of dimensionality partially (see later)
Solution: Q-learning algorithm (Watkins 1989)
Wikipedia:
… in computer science, reinforcement learning is a sub-area of machine learning concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states …
Other names: neuro-dynamic programming, dynamic programming stochastic approximation, simulation-based dynamic programming, approximate dynamic programming
![Page 50: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/50.jpg)
Reinforcement Learning
Q-learning algorithm – derivation 1/2

Bellman optimality equation repeated:

$$J^*(i) = \max_{a \in A(i)}\Big\{\bar{r}(i,a) + \lambda \sum_{j=1}^{|S|} p(i,a,j)\,J^*(j)\Big\}$$

We define the Q-factor for state i and action a:

$$Q(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\big[r(i,a,j) + \lambda\,J^*(j)\big]$$

Comparing these equations gives:

$$J^*(i) = \max_{a \in A(i)} Q(i,a)$$

So we get the Q-factor version of the Bellman equation:

$$Q(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big]$$

(All above algorithms have their Q-factor versions.)
![Page 51: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/51.jpg)
Reinforcement Learning
Q-learning algorithm – derivation 2/2

Q-factor Bellman equation expressed as the expectation of a sample:

$$Q(i,a) = \sum_{j=1}^{|S|} p(i,a,j)\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big]
= \mathbb{E}\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big] = \mathbb{E}[\text{sample}]$$

If the Q-factor is the expectation of a sample, we can use simulation and the Robbins-Monro algorithm to get its approximate value:

$$Q^{n+1}(i,a) = (1-\alpha^{n+1})\,Q^{n}(i,a) + \alpha^{n+1}\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q^{n}(j,b)\Big], \qquad \alpha^{n+1} = 1/(n+1)$$

where the reward r(i,a,j) and the state j are generated by the simulator.
Reinforcement Learning
Q-learning algorithm:

1. Initialize the Q-factors, the visit factors and the number of jumps k to 0:

$$Q(i,a) = 0, \quad V(i,a) = 0 \quad \forall i \in S,\ a \in A(i); \qquad k = 0$$

Set step size 0 < s < 1 and k_max, and generate any initial state i.

2. Let the current state be i. Select action a ∈ A(i) with probability 1/|A(i)|.

3. Simulate action a in state i by generating next state j and reward r(i,a,j). Increment the visit factor and k, compute the step:

$$V(i,a) \leftarrow V(i,a) + 1, \qquad k \leftarrow k + 1, \qquad \alpha = s / V(i,a)$$

4. Update Q(i,a) by the Robbins-Monro formula:

$$Q(i,a) \leftarrow (1-\alpha)\,Q(i,a) + \alpha\Big[r(i,a,j) + \lambda \max_{b \in A(j)} Q(j,b)\Big]$$

5. If k < k_max, set i = j and go to step 2. Otherwise continue.

6. Find the optimal policy:

$$u(i) = \arg\max_{b \in A(i)} Q(i,b), \quad i \in S$$
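Steps 1–6 above can be sketched as follows. The toy 2-state MDP in `simulate`, the discount factor and the step constant are assumptions for illustration, not part of the presentation:

```python
import random

# Q-learning sketch: tabular Q-factors, uniform action selection,
# diminishing step size s/V(i,a), greedy policy extraction at the end.
S = [0, 1]
A = {0: [0, 1], 1: [0, 1]}
lam = 0.8                       # discount factor (assumed)
s_step, k_max = 0.5, 50000      # step constant and iteration budget (assumed)

def simulate(i, a):
    """Toy simulator assumed for illustration: action a usually leads to
    state a; reward 1 is earned for the transition from 1 to 1."""
    j = a if random.random() < 0.9 else 1 - a
    return j, 1.0 if (i == 1 and j == 1) else 0.0

Q = {(i, a): 0.0 for i in S for a in A[i]}   # step 1: Q-factors
V = {(i, a): 0 for i in S for a in A[i]}     # ... and visit factors
k, i = 0, 0

random.seed(0)
while k < k_max:                             # steps 2-5
    a = random.choice(A[i])                  # uniform selection, 1/|A(i)|
    j, r = simulate(i, a)                    # step 3: simulate the jump
    V[(i, a)] += 1
    k += 1
    alpha = s_step / V[(i, a)]               # diminishing step
    target = r + lam * max(Q[(j, b)] for b in A[j])
    Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * target   # step 4
    i = j                                    # step 5

policy = {i: max(A[i], key=lambda b: Q[(i, b)]) for i in S}  # step 6
```

For this toy MDP the greedy policy chooses action 1 in both states, since action 1 steers the chain toward the rewarded transition.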
![Page 53: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/53.jpg)
Reinforcement Learning
Q-learning algorithm – notes 1/3
1. Interface RL Algorithm – Simulator:

[Diagram: the RL Algorithm sends (i, a) to the Simulator; the Simulator returns j and r(i,a,j).]

2. k_max = (large) total number of iterations

3. Problem: heavily asynchronous updating of the Q-factors. Actions for a given state i are all selected with the same probability 1/|A(i)|, but the generated states j depend on the underlying distributions. Solution = smaller step. Some step updating formulae:

$$s^{n} = \frac{s^{1}}{n+1}, \quad s^{1} \le 1 \ \ (s^{1} = 1 \text{ for Robbins-Monro}); \qquad
s^{n+1} = s^{n}\,r, \quad \text{e.g. } s^{1} = 0.1,\ r = 1 - 10^{-6}$$
![Page 54: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/54.jpg)
Reinforcement Learning
Q-learning algorithm – notes 2/3
Synchronous convergence:

For synchronous updating the algorithm

$$Q^{k+1}(i,a) = (1-\alpha^{k})\,Q^{k}(i,a) + \alpha^{k}\big[H(Q^{k}) + w^{k}\big](i,a),$$

where $w^{k}$ is the (simulation) noise generated in the k-th iteration, converges to a fixed point of H with probability 1 provided it exists, if the following conditions are satisfied:

1. The function H is Lipschitz continuous and non-expansive w.r.t. the max norm.
2. The step size $\alpha^{k}$ satisfies:

$$\sum_{k=0}^{\infty} \alpha^{k} = \infty, \qquad \sum_{k=0}^{\infty} \big(\alpha^{k}\big)^{2} < \infty$$

3. The iterates $Q^{k}(i,a)$ are bounded.
4. For all states, for some constants A, B and some norm:

$$\mathbb{E}\big[w^{k}\big] = 0, \qquad \mathbb{E}\big[\|w^{k}\|^{2}\big] \le A + B\,\|Q^{k}(i,a)\|^{2}$$
![Page 55: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/55.jpg)
Reinforcement Learning
Q-learning algorithm – notes 3/3
Asynchronous convergence:
Let $Q^{m}(i,a)$ be the iterate that has been updated m times synchronously, and let $R^{n}(i,a)$ be the iterate that has been updated n times asynchronously. Then the error of asynchronism is defined to be:

$$\epsilon^{k}(i,a) = \big|R^{k}(i,a) - Q^{k}(i,a)\big|, \quad \forall (i,a)$$

The degree of asynchronism is the difference in the number of updatings |m − n| (also called the age gap). Also assume that $R^{0}(i,a) = Q^{0}(i,a)\ \forall (i,a)$. Then:

1. For a given ε > 0, there exists a rule for the step size such that the error of asynchronism in the k-th iteration for any state-action pair is less than ε.
2. With k → ∞, the error of asynchronism asymptotically converges to 0 with a diminishing step size, as long as the degree of asynchronism is finite.
![Page 56: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/56.jpg)
Reinforcement Learning - Summary

Needed:

- Simulator based on the underlying distributions, moving from state to state and generating rewards
- RL Algorithm (easy to implement)
- Enough time, but only a linear increase with the number of states: |S| = m, max_i{|A(i)|} = n ⇒ O(mn²) performance for a given average number of Q-factor updates per state-action pair
- Enough memory: |S| = m, |A(i)| = n ⇒ 2mn locations for Q-factors and visit factors: Problem

Curse of dimensionality solved partially for |A(i)| < |S|
![Page 57: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/57.jpg)
Response Surface Methods (RSM)
Idea 1: Instead of storing the value Q(i,a) for each state-action pair, find a function f(i,a,**b**) where **b** is the stored vector of parameters. The function f is the metamodel used to compute the Q-factor values:

$$Q(i,a) = f(i,a,\mathbf{b}), \quad \forall i, a$$

Steps:

1. Sampling from the function space.
2. Function fitting (regression).
3. Testing the metamodel.

Problem: knowledge of the metamodel is assumed, and regression is a model-based method. This assumption is generally not satisfied in DP.

Partial solution: trial & error with various metamodels.
![Page 58: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/58.jpg)
Response Surface Methods
Idea 2: Instead of storing the value Q(i,a) for each state-action pair, use the samples to train and validate a nonlinear neural network and store only its weights. To obtain the Q-factor value, connect the state-action pair to the input nodes; the output node then provides the Q-factor:

(i, a) → [neural network] → Q(i, a)

Here: neural network = model-free function approximation

(For small |A(i)| a separate network may be used for each action.)
![Page 59: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/59.jpg)
Response Surface Methods
Q-learning algorithm using a neural network:

1. Initialize the weights of the neural network to small random numbers (same for all actions). Set step size s, k_max, k = 0, and generate any initial state i.

2. Let the current state be i. Select action a ∈ A(i) with probability 1/|A(i)|.

3. Simulate action a in state i by generating next state j and reward r(i,a,j). Increment k.

4. Determine the output q of the neural network for inputs i and a. Also find the outputs for state j and all actions from A(j). Let the maximum of these outputs be q_next. Update q as follows:

$$q \leftarrow (1-s)\,q + s\,\big[r(i,a,j) + \lambda\,q_{next}\big]$$

5. Use q to update the neural network by an incremental algorithm. The new data piece has q as the output and i, a as the inputs. If k < k_max, set i = j and go to step 2. Otherwise continue.

6. The policy learned is stored in the weights of the neural network. Optimal actions for each state are those generating the maximum output of the neural network.
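The steps above can be sketched with a minimal hand-rolled network. The one-hidden-layer ReLU net, its size H = 8, the step size, and the toy simulator are all assumptions for illustration; a real application would use an ML library and more careful training:

```python
import random

# Q-learning with a tiny neural network in place of a Q-table: the
# one-hot-encoded pair (i, a) feeds the input nodes, the single output
# node returns Q(i, a); one SGD step per sample updates the weights.
S_N, A_N, lam, step = 2, 2, 0.8, 0.05   # sizes, discount, step (assumed)
H = 8                                    # hidden nodes (assumed)

random.seed(0)
W1 = [[random.uniform(-0.1, 0.1) for _ in range(S_N + A_N)] for _ in range(H)]
W2 = [random.uniform(-0.1, 0.1) for _ in range(H)]

def features(i, a):
    """One-hot encoding of the state-action pair."""
    x = [0.0] * (S_N + A_N)
    x[i] = 1.0
    x[S_N + a] = 1.0
    return x

def forward(i, a):
    """Network output q plus intermediates needed for the gradient."""
    x = features(i, a)
    h = [max(0.0, sum(w * xv for w, xv in zip(row, x))) for row in W1]  # ReLU
    return sum(v * hv for v, hv in zip(W2, h)), x, h

def train(i, a, target):
    """One SGD step on the squared error (q - target)^2."""
    q, x, h = forward(i, a)
    err = q - target
    for n in range(H):
        grad_hidden = err * W2[n] if h[n] > 0 else 0.0
        W2[n] -= step * err * h[n]
        for m in range(S_N + A_N):
            W1[n][m] -= step * grad_hidden * x[m]

def simulate(i, a):
    """Toy simulator assumed for illustration."""
    j = a if random.random() < 0.9 else 1 - a
    return j, 1.0 if (i == 1 and j == 1) else 0.0

i = 0
for _ in range(20000):
    a = random.randrange(A_N)
    j, r = simulate(i, a)
    q_next = max(forward(j, b)[0] for b in range(A_N))
    q = forward(i, a)[0]
    new_q = (1 - step) * q + step * (r + lam * q_next)  # step 4 update
    train(i, a, new_q)                                  # step 5: one SGD pass
    i = j
```

Only the 2·H·(S_N+A_N)/... network weights are stored, not one value per state-action pair, which is the whole point of the approach for large |S|.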
![Page 60: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/60.jpg)
Response Surface Methods
Neuro-Dynamic Programming - Notes
1. Spill-over happens when Q-factors other than the updated one are also affected when the network is being trained. Partial solution: divide the state space into compartments and use separate neural networks within the compartments.

2. Over-fitting of the neural network also affects Q-factors other than the updated one. Partial solution: in the training process perform only a few (sometimes only one) iterations.

3. Local optima are the danger when training a nonlinear neural network. Partial solution: divide the state space into compartments and use linear neural networks within the compartments.

A proper understanding of RL behavior when coupled with function approximation is still not available.
![Page 61: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/61.jpg)
Other Memory Saving Methods
State Aggregation
Idea: Lump states with similar characteristics (w.r.t. the reward) together into single states. A robust suggested method is often based on various encoding schemes.

Problems:

1. Grouping may result in losing the Markov property (a theorem giving conditions to preserve the Markov property is available).

2. Even if the Markov property is preserved, lumping may result in losing optimality of the solution.

3. Finding similarity w.r.t. the reward may be difficult (the TRM matrix is unknown). Application-dependent rules must be used.
![Page 62: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/62.jpg)
Other Memory Saving Methods
Interpolation Methods
Idea: Store representative Q-factors and determine the other Q-factors by some interpolation technique. Replace the nearest representative factor by the new one.

The unknown Q-factor is computed as an (un)weighted average of some close representatives (various norms and techniques are used).
![Page 63: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/63.jpg)
Conclusion
The combination of the RL algorithm with simulation solves the Curse of Modelling problem. Only the underlying distributions to be used in the simulator are required.

Response Surface methods, especially the use of Neural Networks, solve the Curse of Dimensionality problem.

Many technical problems remain; both successful applications and failures are reported.

Many important theoretical results have been obtained, but some areas, like the coupling of the RL algorithm with function approximation, are still open to further research.

The algorithms are not difficult to implement provided the underlying distributions can be obtained.

If the matrices are available and the number of states is not large, the use of classical methods based on Bellman equations is preferable.
![Page 64: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/64.jpg)
DP methods – Summary
| Method | Curse of Modelling Resolved | Curse of Dimensionality Resolved |
|---|---|---|
| Standard BE Algorithms | N | N |
| Standard Algorithms; Matrices by Simulation | Y | N |
| Reinforcement Learning Only | Y | Partially |
| Reinforcement Learning with Response Surface Method | Y | Y |
![Page 65: Infinite Horizon Dynamic Programming Models · Infinite horizon DP models: general assumptions 2/2 - Only homogenous (time independent) entities considered. - Transition rewards are](https://reader034.vdocument.in/reader034/viewer/2022050213/5f5f80c1f0bfc0614c760b40/html5/thumbnails/65.jpg)
Matlab functions currently available
Discounted Rewards by Policy Iteration
Discounted Rewards by Value Iteration
Average Reward by Policy Iteration
Average Reward by Value Iteration
(Contact me)
Thank You