
Page 1

Topics in Reinforcement Learning: Rollout and Approximate Policy Iteration

ASU, CSE 691, Spring 2020

Dimitri P. Bertsekas, dbertsek@asu.edu

Lecture 4

Bertsekas Reinforcement Learning 1 / 27

Page 2

Outline

1 Review of Approximation in Value Space and Rollout

2 Review of On-Line Rollout for Deterministic Finite-State Problems

3 Stochastic Rollout and Monte Carlo Tree Search

4 On-Line Rollout for Deterministic Infinite-Spaces Problems

5 Model Predictive Control

Bertsekas Reinforcement Learning 2 / 27

Page 3

Recall Approximation in Value Space and the Three Approximations

ONE-STEP LOOKAHEAD

$$\min_{u_k}\; E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}(x_{k+1}) \big\}$$

The three approximations: approximate the minimization, approximate the expected value E{·}, and approximate the cost-to-go $\tilde J_{k+1}$.

MULTISTEP LOOKAHEAD

$$\min_{u_k,\, \mu_{k+1}, \ldots, \mu_{k+\ell-1}}\; E\Bigg\{ g_k(x_k, u_k, w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\big(x_m, \mu_m(x_m), w_m\big) + \tilde J_{k+\ell}(x_{k+\ell}) \Bigg\}$$

The first ℓ steps are treated exactly; the "future" is summarized by the terminal cost approximation $\tilde J_{k+\ell}$.

Approximations:

Computation of $\tilde J_{k+\ell}$ (could itself be approximate): simple choices, parametric approximation, problem approximation, rollout.

DP minimization: replace E{·} with nominal values (certainty equivalent control); limited simulation (Monte Carlo tree search).

[Figure-label residue omitted: selective-depth lookahead tree, neural-network cost approximation, and truncated-rollout diagrams.]

Bertsekas Reinforcement Learning 4 / 27
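To make the multistep lookahead scheme concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption, not the lecture's example: the toy system, the costs, and the quadratic terminal approximation J̃ are invented. It enumerates all ℓ-step control sequences for a deterministic problem, scores each by the first ℓ stage costs plus J̃ at the leaf, and applies only the first control of the best sequence.

```python
from itertools import product

# Toy deterministic problem (illustrative assumptions, not the lecture's example)
ELL = 3                            # lookahead depth ell
CONTROLS = [-1, 0, 1]              # admissible controls
f = lambda x, u: x + u             # system x_{k+1} = f(x_k, u_k)
g = lambda x, u: x * x + abs(u)    # stage cost g_k(x_k, u_k)
J_tilde = lambda x: 2 * x * x      # terminal cost approximation J~_{k+ell}

def lookahead_control(x):
    """Multistep lookahead: minimize the first ell stage costs plus J~ at the
    leaf over all ell-step control sequences; apply only the first control."""
    best_u, best_cost = None, float("inf")
    for seq in product(CONTROLS, repeat=ELL):
        y, cost = x, 0.0
        for u in seq:              # accumulate the first ell stage costs
            cost += g(y, u)
            y = f(y, u)
        cost += J_tilde(y)         # "future" summarized by J~
        if cost < best_cost:
            best_u, best_cost = seq[0], cost
    return best_u

print(lookahead_control(4))        # steers the state toward 0: prints -1
```

The exhaustive enumeration grows as |U|^ℓ, which is why the slide lists Monte Carlo tree search and other selective-depth schemes as ways to limit the lookahead tree.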

Page 4

The Pure Form of Rollout: For a Suboptimal Base Policy π, Use $\tilde J_{k+\ell}(x_{k+\ell}) = J_{k+\ell,\pi}(x_{k+\ell})$

[Figure-label residue omitted: ℓ-step lookahead followed by simulation of the base policy π, and truncated rollout with a terminal cost approximation.]

Use a suboptimal/heuristic policy at the end of limited lookahead.

The suboptimal policy is called the base policy.

The lookahead policy is called the rollout policy.

The aim of rollout is policy improvement (i.e., the rollout policy performs better than the base policy); true under some assumptions. In practice: good performance, very reliable, very simple to implement.

Rollout in its "standard" forms involves simulation and on-line implementation (a simulation sketch follows this slide).

The simulation can be prohibitively expensive (so further approximations may be needed), particularly for stochastic problems and multistep lookahead.

Bertsekas Reinforcement Learning 5 / 27
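As a concrete illustration of the pure form of rollout, here is a minimal Monte Carlo sketch in Python. The toy inventory model and all names in it (demand, stage_cost, base_policy, and so on) are assumptions invented for the sketch; only the rollout logic follows the slide: at $x_k$, estimate each control's Q-factor by simulating the base policy to the horizon, then apply the minimizing control on line.

```python
import random

# Toy stochastic inventory problem (illustrative assumptions)
N = 10                         # horizon
CONTROLS = [0, 1, 2]           # admissible orders u_k
STOCK_MAX = 5

def demand():                  # disturbance w_k
    return random.choice([0, 1, 2])

def stage_cost(x, u, w):       # g_k: order cost plus holding/shortage cost
    return u + abs(x + u - w)

def f(x, u, w):                # system equation x_{k+1} = f_k(x_k, u_k, w_k)
    return max(0, min(STOCK_MAX, x + u - w))

def base_policy(x, k):         # suboptimal base policy: reorder when empty
    return 1 if x == 0 else 0

def base_cost_to_go(x, k):     # one simulated trajectory of the base policy
    total = 0.0
    for m in range(k, N):
        u, w = base_policy(x, m), demand()
        total += stage_cost(x, u, w)
        x = f(x, u, w)
    return total               # terminal cost g_N is taken to be 0 here

def rollout_control(x, k, num_sims=200):
    # Monte Carlo Q-factor: first-stage cost plus base-policy cost-to-go,
    # averaged over sampled disturbances (approximates E{.})
    def q(u):
        samples = [stage_cost(x, u, w) + base_cost_to_go(f(x, u, w), k + 1)
                   for w in (demand() for _ in range(num_sims))]
        return sum(samples) / num_sims
    return min(CONTROLS, key=q)

# On-line play: the rollout policy is applied along a single trajectory
x = 2
for k in range(N):
    u = rollout_control(x, k)
    w = demand()
    x = f(x, u, w)
    print(f"k={k}: rollout control u={u}, next state x={x}")
```

The averaging is exactly the kind of simulation cost the last bullet warns about: each decision costs |U| × num_sims base-policy trajectories, which is what motivates truncation, certainty equivalence, and the other approximations of the previous slide.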

Page 5

Deterministic Rollout: At $x_{k+1}$, Use a Heuristic with Cost $H_{k+1}(x_{k+1})$

[Figure-label residue omitted: rollout along a trajectory, with the Q-factor of each control at $x_k$ evaluated by running the base heuristic from the corresponding next state; traveling salesman example with the nearest neighbor heuristic.]

At state $x_k$, for every pair $(x_k, u_k)$, $u_k \in U_k(x_k)$, we generate a Q-factor

$$\tilde Q_k(x_k, u_k) = g_k(x_k, u_k) + H_{k+1}\big(f_k(x_k, u_k)\big)$$

using the base heuristic [$H_{k+1}(x_{k+1})$ is the heuristic cost starting from $x_{k+1}$].

We select the control $u_k$ with minimal Q-factor.

We move to the next state $x_{k+1}$, and continue (a sketch of the full loop follows this slide).

The scheme is very general: the heuristic can be anything (stage- or state-dependent)! It need not correspond to a policy.

Bertsekas Reinforcement Learning 7 / 27
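Here is a minimal Python sketch of that loop, under invented problem data (the one-dimensional system, the costs, and the deliberately poor "do nothing" base heuristic are assumptions, not the lecture's example). The base heuristic here happens to be a legitimate policy, so by the next two slides the rollout cost can be no worse than the heuristic cost.

```python
# Toy deterministic problem (illustrative assumptions)
N = 6
CONTROLS = [-1, 0, 1]

def f(x, u, k):            # deterministic system x_{k+1} = f_k(x_k, u_k)
    return x + u

def g(x, u, k):            # stage cost g_k(x_k, u_k)
    return x * x + 0.5 * abs(u)

def g_N(x):                # terminal cost
    return 10 * abs(x)

def base_heuristic(x, k):  # deliberately poor base heuristic: "do nothing"
    return 0

def H(x, k):               # H_k(x_k): heuristic cost of completing from x_k
    total = 0.0
    for m in range(k, N):
        u = base_heuristic(x, m)
        total += g(x, u, m)
        x = f(x, u, m)
    return total + g_N(x)

# Deterministic rollout: at each x_k compute the Q-factor
# Q~_k(x_k,u_k) = g_k(x_k,u_k) + H_{k+1}(f_k(x_k,u_k)) for every control,
# move along the minimizing control, and continue.
x, total = 4, 0.0
for k in range(N):
    u = min(CONTROLS, key=lambda u: g(x, u, k) + H(f(x, u, k), k + 1))
    total += g(x, u, k)
    x = f(x, u, k)
print("rollout cost:", total + g_N(x), "vs. heuristic cost:", H(4, 0))
```

On this toy instance the rollout policy drives the state to 0 and achieves cost 32.0, against 136.0 for the base heuristic, previewing the cost improvement result established next.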

Page 6

Criteria for Cost Improvement of a Rollout Algorithm: Sequential Consistency and Sequential Improvement

Remember: any heuristic (no matter how inconsistent or silly) is in principle admissible to use as a base heuristic.

So it is important to identify properties guaranteeing that the rollout policy performs no worse than the base heuristic.

Two such conditions are sequential consistency and sequential improvement.

The base heuristic is sequentially consistent if and only if it can be implemented with a legitimate DP policy $\{\mu_0, \ldots, \mu_{N-1}\}$. Example: greedy heuristics tend to be sequentially consistent.

A sequentially consistent heuristic is also sequentially improving. See next slide.

We will see that any heuristic can be "fortified" to become sequentially improving.

Bertsekas Reinforcement Learning 8 / 27
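For reference, a formal restatement of sequential consistency, in the slides' notation for the deterministic system $x_{k+1} = f_k(x_k, u_k)$ (a paraphrase of the textbook definition, not a quote from this slide): the base heuristic is sequentially consistent if, whenever it generates a trajectory $(x_k, x_{k+1}, \ldots, x_N)$ starting from $x_k$, it generates the trajectory $(x_{k+1}, \ldots, x_N)$ starting from $x_{k+1}$. Equivalently, its costs obey the policy recursion

$$H_k(x_k) = g_k\big(x_k, \mu_k(x_k)\big) + H_{k+1}\Big(f_k\big(x_k, \mu_k(x_k)\big)\Big), \qquad H_N = g_N,$$

for the policy $\{\mu_0, \ldots, \mu_{N-1}\}$ that implements it.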

Page 7

Policy Improvement for Sequentially Improving Heuristics

Sequential improvement holds if for all $x_k$ (best heuristic Q-factor ≤ heuristic cost):

$$\min_{u_k \in U_k(x_k)} \Big[ g_k(x_k, u_k) + H_{k+1}\big(f_k(x_k, u_k)\big) \Big] \le H_k(x_k),$$

where $H_k(x_k)$ is the cost of the trajectory generated by the heuristic starting from $x_k$. True for a sequentially consistent heuristic [$H_k(x_k)$ is one of the Q-factors minimized on the left].

Cost improvement property for a sequentially improving base heuristic:

Let the rollout policy be $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$, and let $J_{k,\pi}(x_k)$ denote its cost starting from $x_k$. Then for all $x_k$ and $k$, $J_{k,\pi}(x_k) \le H_k(x_k)$.

Proof by induction: It holds for $k = N$, since $J_{N,\pi} = H_N = g_N$. Assume that it holds for index $k+1$. We show that it holds for index $k$:

$$\begin{aligned}
J_{k,\pi}(x_k) &= g_k\big(x_k, \mu_k(x_k)\big) + J_{k+1,\pi}\Big(f_k\big(x_k, \mu_k(x_k)\big)\Big) && \text{(by the DP equation)} \\
&\le g_k\big(x_k, \mu_k(x_k)\big) + H_{k+1}\Big(f_k\big(x_k, \mu_k(x_k)\big)\Big) && \text{(by the induction hypothesis)} \\
&= \min_{u_k \in U_k(x_k)} \Big[ g_k(x_k, u_k) + H_{k+1}\big(f_k(x_k, u_k)\big) \Big] && \text{(by the definition of rollout)} \\
&\le H_k(x_k) && \text{(by the sequential improvement condition)}
\end{aligned}$$

Bertsekas Reinforcement Learning 9 / 27
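A quick numerical sanity check of this property, as a self-contained Python sketch (the tiny deterministic problem is an invented assumption). The base heuristic is implemented by a legitimate policy, hence sequentially consistent and therefore sequentially improving, and the check verifies $J_{k,\pi}(x_k) \le H_k(x_k)$ by exact recursion over every state and stage.

```python
# Tiny deterministic problem (illustrative assumptions)
N, STATES, CONTROLS = 5, range(-3, 4), [-1, 0, 1]
clip = lambda x: max(-3, min(3, x))
f = lambda x, u: clip(x + u)        # system equation
g = lambda x, u: x * x + abs(u)     # stage cost
g_N = lambda x: 5 * abs(x)          # terminal cost

def mu_base(x, k):                  # legitimate policy defining the heuristic
    return 0 if x == 0 else (-1 if x > 0 else 1)

def H(x, k):                        # heuristic cost H_k(x): follow mu_base
    if k == N:
        return g_N(x)
    u = mu_base(x, k)
    return g(x, u) + H(f(x, u), k + 1)

def J_pi(x, k):                     # rollout policy cost J_{k,pi}(x)
    if k == N:
        return g_N(x)
    u = min(CONTROLS, key=lambda u: g(x, u) + H(f(x, u), k + 1))
    return g(x, u) + J_pi(f(x, u), k + 1)

# Cost improvement J_{k,pi}(x_k) <= H_k(x_k) at all states and stages
assert all(J_pi(x, k) <= H(x, k) for k in range(N + 1) for x in STATES)
print("cost improvement verified at every state and stage")
```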

Page 8

A Working Break: Challenge Question

[Figure-label residue omitted: chess match (timid/bold play) and machine repair diagrams; the statement of the challenge question did not survive extraction.]

2nd Game / Timid Play 2nd Game / Bold Play

1st Game / Timid Play 1st Game / Bold Play pd 1� pd pw 1� pw

0 � 0 1 � 0 0 � 1 1.5 � 0.5 1 � 1 0.5 � 1.5 0 � 2

System xk+1 = fk(xk, uk, wk) uk = µk(xk) µk wk xk

3 5 2 4 6 2

1

Complete Tours Current Partial Tour Next Cities Next States

Nearest Neighbor Heuristic

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E!

gk+1(xk+1, uk+1, wk+1)

+Jk+2

"fk+1(xk+1, uk+1, wk+1)

#$,

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E!gk(xk, uk, wk)+Jk+1(xk+1)

$

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π"T

(λ)µ (Φr)

#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

%nj=1 pij(u)

"g(i, u, j) + J(j)

#Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Complete Tours Current Partial Tour Next Cities Next States

Nearest Neighbor Heuristic Move to the Right

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E!

gk+1(xk+1, uk+1, wk+1)

+Jk+2

"fk+1(xk+1, uk+1, wk+1)

#$,

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E!gk(xk, uk, wk)+Jk+1(xk+1)

$

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π"T

(λ)µ (Φr)

#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

%nj=1 pij(u)

"g(i, u, j) + J(j)

#Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Complete Tours Current Partial Tour Next Cities Next States

Nearest Neighbor Heuristic Move to the Right Possible Path

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E!

gk+1(xk+1, uk+1, wk+1)

+Jk+2

"fk+1(xk+1, uk+1, wk+1)

#$,

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E!gk(xk, uk, wk)+Jk+1(xk+1)

$

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π"T

(λ)µ (Φr)

#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

%nj=1 pij(u)

"g(i, u, j) + J(j)

#Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction

Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1

1

Time

Space

Walk on a line of length 2N starting at position 0. At each of N steps, move one unit to the left or one unit to the right.

The objective is to land at a position i with small cost g(i) after N steps.

Question: Consider a base heuristic that takes steps to the right only. Is it sequentially consistent or sequentially improving? How will the rollout perform compared to the base heuristic? (A simulation sketch is given below.)

Compare with a superheuristic, i.e., a combination of two heuristics: 1) Move only to the right, and 2) Move only to the left. This base heuristic chooses the path of best cost.
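To experiment with these questions, here is a minimal Python sketch (not from the lecture; the cost g, the horizon N, and all function names are illustrative assumptions). It compares the right-only base heuristic with its rollout, and with the rollout of the left/right superheuristic.

# Minimal sketch: rollout for the walk problem (illustrative helper code).
# State = current position; after N steps the cost g(position) is paid.

def completion_cost(pos, steps_left, g, step):
    # Cost of completing the walk by always moving `step` (+1 right, -1 left).
    return g(pos + step * steps_left)

def rollout(g, N, base):
    # base(pos, steps_left, g): heuristic cost-to-go estimate from pos.
    pos = 0
    for k in range(N):
        steps_left = N - k - 1
        # Q-factor of each move = heuristic cost from the resulting position.
        q = {step: base(pos + step, steps_left, g) for step in (-1, +1)}
        pos += min(q, key=q.get)
    return g(pos)

right_only = lambda pos, s, g: completion_cost(pos, s, g, +1)    # base heuristic
super_h = lambda pos, s, g: min(completion_cost(pos, s, g, +1),
                                completion_cost(pos, s, g, -1))  # superheuristic

N = 10
g = lambda i: abs(i + 4)            # example cost: best landing position is i = -4
print("right-only heuristic cost:", g(N))   # the heuristic alone ends at position N
print("rollout of right-only:", rollout(g, N, right_only))
print("rollout of superheuristic:", rollout(g, N, super_h))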

Bertsekas Reinforcement Learning 10 / 27


A Counterexample to Cost Improvement

[Figure: counterexample trajectory tree, showing the optimal trajectory chosen by the base heuristic at x0, the high-cost transition chosen by the heuristic at x∗1, and the rollout choice u0 leading to x1.]

Suppose at x0 there is a unique optimal trajectory (x0, u∗0, x∗1, u∗1, x∗2).

Suppose the base heuristic produces this optimal trajectory starting at x0.

Rollout uses the base heuristic to construct a trajectory from each candidate next state: x∗1 (reached under u∗0) and x1 (reached under the alternative control u0).

Suppose the heuristic's trajectory starting from x∗1 is "bad" (has high cost).

Then the rollout algorithm rejects the optimal control u∗0 in favor of the other control u0, and moves to a nonoptimal next state x1 = f0(x0, u0).

So without some safeguard, the rollout can deviate from an already available good trajectory.

This suggests a possible remedy: follow the currently best trajectory whenever rollout suggests switching to an inferior one. (A small numeric instance is sketched below.)
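To make the failure concrete, here is a small numeric instance (the arc costs below are invented for illustration and are not from the lecture). The base heuristic is optimal from x0 but picks a high-cost continuation from x∗1, so the rollout Q-factor comparison at x0 rejects u∗0:

# Hypothetical one-step rollout comparison at x0 (all costs invented).
# From x0: control u*_0 leads to x*_1 with stage cost 0; u_0 leads to x_1
# with stage cost 1.
first_stage_cost = {"u*_0": 0, "u_0": 1}
# Base heuristic completions: from x*_1 it takes a cost-10 arc (although a
# cost-0 continuation exists); from x_1 its completion costs 1.
heuristic_completion = {"u*_0": 10, "u_0": 1}

q = {u: first_stage_cost[u] + heuristic_completion[u] for u in first_stage_cost}
print(q)                  # {'u*_0': 10, 'u_0': 2}
print(min(q, key=q.get))  # u_0: rollout rejects the optimal control u*_0,
                          # although the heuristic alone achieves the optimal cost 0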

Bertsekas Reinforcement Learning 11 / 27


Fortified Rollout: Restores Cost Improvement for Base Heuristics that are not Sequentially Improving

[Figure: fortified rollout, showing the permanent trajectory Pk constructed up to stage k and the tentative best trajectory Tk from xk to xN.]

Upon reaching state xk, the fortified rollout algorithm stores the permanent trajectory

Pk = {x0, u0, . . . , uk−1, xk}

that has been constructed up to stage k, and also stores a tentative best trajectory

Tk = {xk, uk, xk+1, uk+1, . . . , uN−1, xN}

The tentative best trajectory is such that Pk ∪ Tk is the best complete trajectory computed up to stage k of the algorithm.

Initially, P0 = {x0} and T0 is the trajectory computed by the base heuristic from x0.

At each step, fortified rollout follows the best trajectory computed thus far (a minimal sketch is given below).
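A minimal Python sketch of this bookkeeping for a deterministic N-stage problem (the interface functions step, stage_cost, controls, base_heuristic, and terminal_cost are assumptions for illustration, not from the lecture):

# Sketch of fortified rollout for a deterministic N-stage problem (illustrative only).
# Assumed interface:
#   step(k, x, u)        -> next state f_k(x, u)
#   stage_cost(k, x, u)  -> g_k(x, u)
#   controls(k, x)       -> iterable of feasible controls U_k(x)
#   base_heuristic(k, x) -> (control sequence, total cost-to-go incl. terminal cost);
#                           base_heuristic(N, x) should return ([], terminal_cost(x))
#   terminal_cost(x)     -> g_N(x)

def fortified_rollout(x0, N, step, stage_cost, controls, base_heuristic, terminal_cost):
    perm = [x0]                                       # permanent trajectory P_k
    tent_controls, tent_cost = base_heuristic(0, x0)  # tentative best trajectory T_0
    x, accrued = x0, 0.0
    for k in range(N):
        # Rollout step: Q-factor of u = cost so far + stage cost + heuristic completion.
        best_u, best_q = None, float("inf")
        for u in controls(k, x):
            _, h_cost = base_heuristic(k + 1, step(k, x, u))
            q = accrued + stage_cost(k, x, u) + h_cost
            if q < best_q:
                best_u, best_q = u, q
        if best_q < tent_cost:
            # Rollout strictly improves on the stored best trajectory: adopt it.
            u = best_u
            tent_controls = base_heuristic(k + 1, step(k, x, u))[0]
            tent_cost = best_q
        else:
            # Fortification: otherwise follow the tentative best trajectory.
            u, tent_controls = tent_controls[0], tent_controls[1:]
        accrued += stage_cost(k, x, u)
        x = step(k, x, u)
        perm += [u, x]                                # extend P_{k+1}
    return perm, accrued + terminal_cost(x)

With this safeguard, in the deterministic case the realized cost never exceeds the cost of the base heuristic applied at x0, even when the heuristic is not sequentially improving.

Bertsekas Reinforcement Learning 12 / 27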


Illustration of Fortified Algorithm

[Figure: illustration of fortified rollout. The base heuristic at x0 generates the optimal trajectory (x0, u∗0, x∗1, u∗1, x∗2, u∗2, x∗3); the labels mark the permanent trajectory, the tentative trajectory, the rollout choice, and a high-cost transition chosen by the heuristic at x∗1.]

Fortified rollout stores as the initial tentative best trajectory the unique optimal trajectory (x0, u∗0, x∗1, u∗1, x∗2) generated by the base heuristic at x0.

Starting at x0, it runs the heuristic from x∗1 and from x1, and (even though the heuristic prefers x1 to x∗1) it discards the control u0 in favor of u∗0, which is dictated by the tentative best trajectory.

It then sets the permanent trajectory to (x0, u∗0, x∗1) and the tentative best trajectory to (x∗1, u∗1, x∗2).


Page 12:

Multistep Deterministic Rollout with Terminal Cost Approximation

[Figure: multistep deterministic rollout — an ℓ-step lookahead tree (a shortest path problem) from the current state xk, followed by truncated rollout with the base heuristic and a terminal cost approximation; labels include the permanent trajectory P̄k, the tentative trajectory T̄k, and the cost function approximation J̃k+ℓ.]

Terminal cost approximation saves computation, but the cost improvement guarantee is lost.

We can prove cost improvement, assuming sequential consistency and a special property of the terminal cost function approximation that resembles sequential improvement (more on this when we discuss infinite horizon rollout).

It is not necessarily true that longer lookahead leads to improved performance, although it usually does (similar counterexamples as in the last lecture).

It is not necessarily true that increasing the length of the rollout leads to improved performance (some examples indicate this). Moreover, long rollout is costly.

Experimentation with the length of the rollout and with the terminal cost function approximation is recommended (a sketch of the truncated scheme follows below).
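To fix ideas, here is a minimal sketch of the truncated scheme for a deterministic problem: one-step lookahead, at most m steps of the base policy, and then a terminal cost approximation J̃. All interface names (`f`, `g`, `U`, `base_policy`, `J_tilde`) are hypothetical placeholders, not code from the text; m and J̃ are exactly the quantities one would experiment with.

```python
import math

def truncated_rollout_control(k, x, m, f, g, U, base_policy, J_tilde):
    """One-step lookahead, m steps of the base policy, then terminal cost J_tilde.

    Hypothetical interfaces (deterministic problem for simplicity):
      f(k, x, u) -> next state; g(k, x, u) -> stage cost; U(k, x) -> controls;
      base_policy(j, x) -> control; J_tilde(j, x) -> terminal cost approximation.
    """
    best_u, best_q = None, math.inf
    for u in U(k, x):
        q, xj = g(k, x, u), f(k, x, u)
        for j in range(k + 1, k + m + 1):        # truncated rollout: m steps
            uj = base_policy(j, xj)
            q, xj = q + g(j, xj, uj), f(j, xj, uj)
        q += J_tilde(k + m + 1, xj)              # cost beyond the truncation
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```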


Page 13:

Stochastic Rollout [Uses $\tilde J_{k+\ell}(x_{k+\ell}) = J_{k+\ell,\pi}(x_{k+\ell})$ and MC Simulation]

$$\min_{u_k,\,\mu_{k+1},\ldots,\mu_{k+\ell-1}} E\left\{ g_k(x_k,u_k,w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\bigl(x_m,\mu_m(x_m),w_m\bigr) + \tilde J_{k+\ell}(x_{k+\ell}) \right\}$$
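In the stochastic case, the Q-factors in this minimization must be estimated by Monte Carlo simulation: for each candidate first control, average the sampled first-stage cost plus the cost of following the base policy π thereafter. Below is a minimal sketch for the one-step lookahead case (ℓ = 1), with hypothetical interfaces (`f`, `g`, `U`, `sample_w`, `base_policy`, `J_tilde`); it is illustrative, not the lecture's code.

```python
import math

def stochastic_rollout_control(k, x, N, n_sims, f, g, U, sample_w,
                               base_policy, J_tilde=lambda j, x: 0.0):
    """Choose the control minimizing a Monte Carlo estimate of its Q-factor.

    Hypothetical interfaces (not from the lecture):
      f(k, x, u, w), g(k, x, u, w) -> system equation and stage cost;
      sample_w(k, x, u)            -> one disturbance sample;
      base_policy(j, x)            -> the base policy pi for stages j > k;
      J_tilde(j, x)                -> terminal cost approximation (0 by default).
    """
    best_u, best_q = None, math.inf
    for u in U(k, x):
        total = 0.0
        for _ in range(n_sims):                  # Monte Carlo averaging
            w = sample_w(k, x, u)
            cost, xj = g(k, x, u, w), f(k, x, u, w)
            for j in range(k + 1, N):            # simulate the base policy
                uj = base_policy(j, xj)
                wj = sample_w(j, xj, uj)
                cost, xj = cost + g(j, xj, uj, wj), f(j, xj, uj, wj)
            total += cost + J_tilde(N, xj)
        q = total / n_sims                       # sample mean Q-factor
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```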

[Figure: stochastic rollout at the current state xk — lookahead minimization over the first step, simulation with the fixed base policy and sampled Q-factors over the remaining stages, and parametric approximation at the end; Monte Carlo tree search and certainty equivalence shown as variants.]

Consider the pure case (no truncation, no terminal cost approximation). Assume that the base heuristic is a legitimate policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$ (i.e., it is sequentially consistent, in the context of deterministic problems).

Let $\tilde\pi = \{\tilde\mu_0, \ldots, \tilde\mu_{N-1}\}$ be the rollout policy. Then cost improvement is obtained:

$$J_{k,\tilde\pi}(x_k) \le J_{k,\pi}(x_k), \qquad \text{for all } x_k \text{ and } k.$$

Essentially the same induction proof as for the sequentially improving case applies (see the text).
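For reference, the induction step behind this claim can be sketched as follows (using sequential consistency, which makes $J_{k,\pi}$ the heuristic's cost-to-go, and the definition of the rollout control):

```latex
% Induction sketch (deterministic problem, sequentially consistent heuristic).
% Assume J_{k+1,\tilde\pi} \le J_{k+1,\pi}. Then, for every x_k,
\begin{align*}
J_{k,\tilde\pi}(x_k)
  &= g_k\bigl(x_k,\tilde\mu_k(x_k)\bigr)
     + J_{k+1,\tilde\pi}\Bigl(f_k\bigl(x_k,\tilde\mu_k(x_k)\bigr)\Bigr) \\
  &\le g_k\bigl(x_k,\tilde\mu_k(x_k)\bigr)
     + J_{k+1,\pi}\Bigl(f_k\bigl(x_k,\tilde\mu_k(x_k)\bigr)\Bigr)
     && \text{(induction hypothesis)} \\
  &= \min_{u \in U_k(x_k)}
     \Bigl[\, g_k(x_k,u) + J_{k+1,\pi}\bigl(f_k(x_k,u)\bigr) \Bigr]
     && \text{(definition of the rollout control } \tilde\mu_k\text{)} \\
  &\le g_k\bigl(x_k,\mu_k(x_k)\bigr)
     + J_{k+1,\pi}\Bigl(f_k\bigl(x_k,\mu_k(x_k)\bigr)\Bigr)
     = J_{k,\pi}(x_k).
\end{align*}
```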


Page 14:

Backgammon Example

[Figure: backgammon rollout — for each possible move, the average score is estimated by Monte Carlo simulation.]

Announced by Tesauro in 1996.

Truncated rollout with cost function approximation provided by TD-Gammon (a famous 1991 program by Tesauro, involving a neural network trained by a form of approximate policy iteration that uses "temporal differences").

Plays better than TD-Gammon, and better than any human.

Too slow for real-time play (without parallel hardware), due to excessive simulation time.


Page 15:

Monte Carlo Tree Search - Motivation

We have assumed equal simulation effort for the evaluation of the Q-factors of all controls at a state xk.

Drawbacks:

The trajectories may be too long because the horizon length N is large (or infinite, in an infinite horizon context).

Some controls uk may be clearly inferior to others and may not be worth as much sampling effort.

Some controls uk that appear to be promising may be worth exploring better through multistep lookahead.

Monte Carlo tree search (MCTS) is a "randomized" form of lookahead. MCTS aims to trade off computational economy against a hopefully small risk of degradation in performance.

It involves adaptive simulation (simulation effort adapted to the perceived quality of different controls).

It aims to balance exploitation (extra simulation effort on controls that look promising) and exploration (adequate exploration of the potential of all controls); one such scheme is sketched below.
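As one concrete (and hypothetical, not from the lecture) instance of such adaptive simulation, the sketch below allocates a budget of Q-factor samples among the controls at the current state with a UCB-style index, a standard MCTS ingredient: controls with a good sampled average get extra effort (exploitation), while the confidence term keeps every control being tried (exploration). Here `simulate_q` is a placeholder returning one sampled Q-factor (a cost).

```python
import math

def adaptive_q_sampling(controls, simulate_q, n_total, c=2.0):
    """UCB-style allocation of n_total Q-factor samples among controls.

    simulate_q(u) -> one noisy sample of the Q-factor (a cost) of control u.
    Returns the control with the smallest average sampled Q-factor.
    """
    counts = {u: 1 for u in controls}
    means = {u: simulate_q(u) for u in controls}     # one sample each to start
    for t in range(len(controls), n_total):
        # Lower-confidence-bound index (the mirror of UCB for rewards): favor
        # controls with a small sample mean (exploitation) or few samples
        # (exploration).
        u = min(controls,
                key=lambda v: means[v] - c * math.sqrt(math.log(t) / counts[v]))
        q = simulate_q(u)
        counts[u] += 1
        means[u] += (q - means[u]) / counts[u]       # incremental average
    return min(controls, key=lambda v: means[v])
```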


Page 16:

Monte Carlo Tree Search - Adaptive Simulation

[Figure: adaptive simulation — at the current state xk, sample Q-factors of the candidate controls (Control 1, Control 2, Control 3) are estimated by simulation, with more effort allocated to the controls that appear most promising.]

$

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π"T

(λ)µ (Φr)

#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

%nj=1 pij(u)

"g(i, u, j) + J(j)

#Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

max{0, ξ} J(x)

1

MCTS provides an economical sampling policy to estimate the Q-factors

Qk(xk, uk) = E{gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk))},   uk ∈ Uk(xk)

Assume that Uk(xk) contains a finite number of elements, say i = 1, . . . , m.

After the nth sampling period we have Qi,n, the empirical mean of the Q-factor of each control i (total sample value divided by the total number of samples corresponding to i). We view Qi,n as an exploitation index.

How do we use the estimates Qi,n to select the control to sample next?

Bertsekas Reinforcement Learning 19 / 27
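
As a concrete illustration, here is a minimal Python sketch of the empirical-mean estimation above. The names are hypothetical: step(x, u) stands for a simulator that samples wk and returns the pair (gk(x, u, wk), xk+1), and lookahead_cost(x) stands for the approximation Jk+1(x) (e.g., the cost of a base heuristic run from x).

def estimate_q_factors(x_k, controls, step, lookahead_cost, n_samples=100):
    """Empirical-mean Q-factor of each control at x_k (uniform sampling)."""
    q_bar = {}
    for u in controls:
        total = 0.0
        for _ in range(n_samples):
            cost, x_next = step(x_k, u)             # one sample of w_k
            total += cost + lookahead_cost(x_next)  # g_k plus cost-to-go approx
        q_bar[u] = total / n_samples                # the empirical mean Q_{i,n}
    return q_bar

This baseline samples every control equally; the question on the slide, and the UCB rule that follows, concern smarter ways to allocate the samples.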


MCTS Based on Statistical Tests

[Figure: at the current state, Controls 1, 2, 3 are compared via the indices Q1,n + R1,n, Q2,n + R2,n, Q3,n + R3,n.]

A good sampling policy balances exploitation (sample controls that seem most promising, i.e., have a small Qi,n) and exploration (sample controls with a small sample count).

A popular strategy: sample next the control i that minimizes the sum Qi,n + Ri,n, where Ri,n is an exploration index.

Ri,n is based on a confidence interval formula and depends on the sample count si of control i (the formula comes from the analysis of multiarmed bandit problems).

The UCB rule (upper confidence bound) sets Ri,n = −c√(log n / si), where c is a positive constant, selected empirically (values c ≈ √2 are suggested, assuming that Qi,n is normalized to take values in the range [−1, 0]).

MCTS with the UCB rule has been extended to multistep lookahead ... but AlphaZero has used a different (semi-heuristic) rule.

Bertsekas Reinforcement Learning 20 / 27
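
A minimal sketch of the UCB selection rule above, in the minimization convention of the slide: sample next the control that minimizes Qi,n + Ri,n with Ri,n = −c√(log n / si). The dictionaries q_bar and counts are hypothetical containers for the empirical means Qi,n and the sample counts si.

import math

def ucb_select(q_bar, counts, c=math.sqrt(2)):
    """Pick the control minimizing Q_{i,n} + R_{i,n}, R_{i,n} = -c*sqrt(log n / s_i)."""
    n = sum(counts.values())                # total number of samples so far
    best_u, best_index = None, float("inf")
    for u, q in q_bar.items():
        s_i = counts[u]
        if s_i == 0:                        # sample every control at least once
            return u
        index = q - c * math.sqrt(math.log(n) / s_i)   # Q_{i,n} + R_{i,n}
        if index < best_index:
            best_u, best_index = u, index
    return best_u

After the selected control is sampled once more, its empirical mean and sample count are updated and the rule is applied again.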


Classical Control Problems - Infinite Control Spaces

REGULATION PROBLEM: Keep the state near the origin

PATH PLANNING: Keep the state close to a trajectory

Need to deal with state and control constraints; linear-quadratic is often inadequate.

Bertsekas Reinforcement Learning 22 / 27


On-Line Rollout for Deterministic Infinite-Spaces Problems

[Figure: at stage k, the control uk is followed by an (ℓ − 1)-stage base heuristic minimization over stages k + 1, . . . , k + ℓ − 1.]

Suppose the control space is infinite. One possibility is discretization of Uk(xk), but this requires an excessive number of Q-factors.

Another possibility is to use optimization heuristics that look (ℓ − 1) steps ahead.

Seamlessly combine the kth-stage minimization and the optimization heuristic into a single ℓ-stage deterministic optimization.

This problem can be solved by nonlinear programming/optimal control methods (e.g., quadratic programming, gradient-based methods).

This is the idea underlying model predictive control (MPC).

Bertsekas Reinforcement Learning 23 / 27
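
To illustrate the single ℓ-stage deterministic optimization, here is a sketch that stacks the controls (uk, . . . , uk+ℓ−1) into one vector and hands the rolled-forward ℓ-stage cost to a generic gradient-based solver. The deterministic dynamics f(x, u) and stage cost g(x, u) are assumed given (scalar controls for simplicity), and scipy's BFGS stands in for whatever nonlinear programming method suits the problem.

import numpy as np
from scipy.optimize import minimize

def ell_stage_cost(u_seq, x_k, f, g):
    """Cost of applying the control sequence u_seq from x_k (deterministic)."""
    x, total = x_k, 0.0
    for u in u_seq:              # roll the dynamics forward
        total += g(x, u)
        x = f(x, u)
    return total

def lookahead_control(x_k, ell, f, g):
    """Solve the ell-stage problem and return only the first control u_k."""
    u0 = np.zeros(ell)           # initial guess for (u_k, ..., u_{k+ell-1})
    res = minimize(ell_stage_cost, u0, args=(x_k, f, g), method="BFGS")
    return res.x[0]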


Model Predictive Control for Regulation Problems

[Figure: at state xk, the stage-k control uk is followed by an (ℓ − 1)-stage minimization that drives the state to xk+ℓ = 0.]

System: xk+1 = fk(xk, uk)

Cost per stage: gk(xk, uk) ≥ 0; the origin 0 is cost-free and absorbing.

State and control constraints: xk ∈ Xk, uk ∈ Uk(xk) for all k.

At xk, solve an ℓ-step lookahead version of the problem, requiring xk+ℓ = 0 while satisfying the state and control constraints.

If {uk, . . . , uk+ℓ−1} is the control sequence so obtained, apply uk.

Bertsekas Reinforcement Learning 25 / 27
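
A sketch of one MPC step above for the special case of a linear system xk+1 = Axk + Buk with quadratic stage cost and bounds on the control; the requirement xk+ℓ = 0 enters as an equality constraint. The matrices A, B, Q, R and the bound u_max are assumed given, and a generic SLSQP solve stands in for the quadratic program one would use in practice.

import numpy as np
from scipy.optimize import minimize

def mpc_control(x_k, ell, A, B, Q, R, u_max):
    """One MPC step: solve the ell-stage problem with x_{k+ell} = 0, apply u_k."""
    m = B.shape[1]

    def rollforward(u_flat):
        x, pairs = x_k, []
        for t in range(ell):
            u = u_flat[t * m:(t + 1) * m]
            pairs.append((x, u))
            x = A @ x + B @ u
        return pairs, x                        # trajectory and terminal state

    def cost(u_flat):
        pairs, _ = rollforward(u_flat)
        return sum(x @ Q @ x + u @ R @ u for x, u in pairs)

    terminal = {"type": "eq", "fun": lambda u_flat: rollforward(u_flat)[1]}
    bounds = [(-u_max, u_max)] * (ell * m)     # control constraints
    res = minimize(cost, np.zeros(ell * m), method="SLSQP",
                   bounds=bounds, constraints=[terminal])
    return res.x[:m]                           # apply only u_k

At the next state xk+1 the same ℓ-stage problem is solved anew, which is what makes this an on-line, rollout-like policy.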


Relation to Rollout


�Computation of J :

Good approximation Poor Approximation �(⇠) = ln(1 + e⇠)

max{0, ⇠} J(x)

1

Sample Q-Factors Simulation Control 1 States xk+`

Complete Tours Current Partial Tour Next Cities Next States

Q1,n +R1,n Q2,n +R2,n Q3,n +R3,n Stage k Stages k+1, . . . , k+`�1

Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath

Jk+1(xk+1) = minuk+12Uk+1(xk+1)

En

gk+1(xk+1, uk+1, wk+1)

+Jk+2

�fk+1(xk+1, uk+1, wk+1)

�o,

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree `-StepShortest path problem xk xk States xk+1 States xk+2 u u0

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+`

Rollout, Model Predictive Control

b+k b�k Permanent trajectory P k Tentative trajectory T k

minuk

En

gk(xk, uk, wk)+Jk+1(xk+1)o

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u⇤0, . . . , u⇤

k, . . . , u⇤N�1} Simplify E{·}

Tail subproblem Time x⇤k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN (xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x0N

uk uk xk+1 xk+1 xN xN x0N

�r = ⇧�T

(�)µ (�r)

�⇧(Jµ) µ(i) 2 arg minu2U(i) Qµ(i, u, r)

Subspace M = {�r | r 2 <m} Based on Jµ(i, r) Jµk

minu2U(i)

Pnj=1 pij(u)

�g(i, u, j) + J(j)

�Computation of J :

Good approximation Poor Approximation �(⇠) = ln(1 + e⇠)

max{0, ⇠} J(x)

1

Sample Q-Factors Simulation Control 1 State xk+ℓ = 0

Complete Tours Current Partial Tour Next Cities Next States

Q1,n +R1,n Q2,n +R2,n Q3,n +R3,n Stage k Stages k+1, . . . , k+ ℓ−1

Base Heuristic Minimization Possible Path

Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E!

gk+1(xk+1, uk+1, wk+1)

+Jk+2

"fk+1(xk+1, uk+1, wk+1)

#$,

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E!gk(xk, uk, wk)+Jk+1(xk+1)

$

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π"T

(λ)µ (Φr)

#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

%nj=1 pij(u)

"g(i, u, j) + J(j)

#Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

1

Sample Q-Factors Simulation Control 1 State xk+ℓ = 0

Complete Tours Current Partial Tour Next Cities Next States

Q1,n +R1,n Q2,n +R2,n Q3,n +R3,n Stage k Stages k+1, . . . , k+ ℓ−1

Base Heuristic Minimization Possible Path

Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E!

gk+1(xk+1, uk+1, wk+1)

+Jk+2

"fk+1(xk+1, uk+1, wk+1)

#$,

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E!gk(xk, uk, wk)+Jk+1(xk+1)

$

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π"T

(λ)µ (Φr)

#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

%nj=1 pij(u)

"g(i, u, j) + J(j)

#Computation of J :

Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)

1

Control uk (ℓ − 1)-Stages Base Heuristic Minimization

Sample Q-Factors (ℓ − 1)-Stages State xk+ℓ = 0

Complete Tours Current Partial Tour Next Cities Next States

Q1,n +R1,n Q2,n +R2,n Q3,n +R3,n Stage k Stages k+1, . . . , k+ ℓ−1

Base Heuristic Minimization Possible Path

Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath

Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)

E!

gk+1(xk+1, uk+1, wk+1)

+Jk+2

"fk+1(xk+1, uk+1, wk+1)

#$,

2-Step Lookahead (onestep lookahead plus one step approx-imation)

Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′

Truncated Rollout Terminal Cost Approximation J

Parametric approximation Neural nets Discretization

Parametric approximation Neural nets Discretization

Cost Function Approximation Jk+ℓ

Rollout, Model Predictive Control

b+k b−

k Permanent trajectory P k Tentative trajectory T k

minuk

E!gk(xk, uk, wk)+Jk+1(xk+1)

$

Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1

Optimal control sequence {u∗0, . . . , u∗

k, . . . , u∗N−1} Simplify E{·}

Tail subproblem Time x∗k Future Stages Terminal Cost k N

Stage k Future Stages Terminal Cost gN(xN )

Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N

uk uk xk+1 xk+1 xN xN x′N

Φr = Π"T

(λ)µ (Φr)

#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)

Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk

minu∈U(i)

%nj=1 pij(u)

"g(i, u, j) + J(j)

#Computation of J :

1

[Figure: MPC viewed as rollout. At the current stage k the control uk is applied; the base heuristic then performs an (ℓ − 1)-stage minimization over stages k + 1, . . . , k + ℓ − 1, driving the state to xk+ℓ = 0.]

MPC is rollout whose base heuristic is the (ℓ − 1)-step minimization (with 0 a cost-free and absorbing state).

This heuristic is sequentially improving (though not sequentially consistent), i.e.,

min_{uk ∈ Uk(xk)} [ gk(xk, uk) + Hk+1(fk(xk, uk)) ] ≤ Hk(xk),

because (opt. cost to reach 0 in ℓ steps) ≤ (opt. cost to reach 0 in ℓ − 1 steps): any (ℓ − 1)-step trajectory that reaches 0 extends to an ℓ-step one at no extra cost, since 0 is cost-free and absorbing.

Sequential improvement implies "stability": ∑_{k=0}^∞ gk(xk, uk) ≤ H0(x0) < ∞, where {x0, u0, x1, u1, . . .} is the state and control sequence generated by MPC.
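In more detail, the stability bound follows by telescoping (a short step worth making explicit; it uses nonnegativity of the stage costs, as assumed in this MPC setting). Along the MPC trajectory the sequential improvement inequality reads gk(xk, uk) ≤ Hk(xk) − Hk+1(xk+1); summing over k = 0, . . . , K,

∑_{k=0}^{K} gk(xk, uk) ≤ H0(x0) − HK+1(xK+1) ≤ H0(x0),

since HK+1 ≥ 0, and letting K → ∞ gives the bound.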

Major issue: How do we know that the optimization underlying the base heuristic is solvable, i.e., that there exists ℓ such that we can drive xk+ℓ to 0 for all xk ∈ Xk while observing the state and control constraints? Methods of reachability of target tubes can be used for this (see the text).
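As a concrete illustration, here is a minimal Python sketch of this MPC/rollout loop for a deterministic scalar linear system. The system (a, b), costs (q, r), horizon ell, bound umax, and the helper names states/horizon_cost/mpc are all illustrative assumptions, not from the lecture; the ℓ-step problem with terminal constraint xk+ℓ = 0 is solved numerically with SciPy, whereas a serious implementation would use a structured QP solver.

    import numpy as np
    from scipy.optimize import minimize

    # Assumed system: x_{k+1} = a*x_k + b*u_k, stage cost q*x^2 + r*u^2.
    a, b, q, r = 1.2, 1.0, 1.0, 0.1      # open loop unstable (a > 1)
    ell, umax = 4, 2.0                   # lookahead horizon and control bound

    def states(x0, u):
        """States x_1, ..., x_ell generated from x0 by control sequence u."""
        out, x = [], x0
        for uk in u:
            x = a * x + b * uk
            out.append(x)
        return np.array(out)

    def horizon_cost(u, x0):
        """Sum of the ell stage costs g(x_k, u_k), k = 0, ..., ell - 1."""
        xs = np.concatenate(([x0], states(x0, u)[:-1]))
        return float(np.sum(q * xs**2 + r * np.asarray(u)**2))

    def mpc(x0):
        """Solve the ell-step problem with terminal constraint x_ell = 0;
        return the first control and the optimal ell-step cost."""
        res = minimize(horizon_cost, np.zeros(ell), args=(x0,),
                       bounds=[(-umax, umax)] * ell,
                       constraints={"type": "eq",
                                    "fun": lambda u: states(x0, u)[-1]},
                       method="SLSQP")
        return res.x[0], res.fun

    x, total = 1.0, 0.0
    J0 = mpc(x)[1]    # ell-step optimal cost from x0; <= heuristic cost H0(x0)
    for k in range(30):               # closed loop: apply first control, re-solve
        u0, _ = mpc(x)
        total += q * x**2 + r * u0**2
        x = a * x + b * u0

    # Stability bound: accumulated cost <= ell-step cost <= H0(x0).
    print(f"closed-loop cost {total:.4f} <= J0(x0) = {J0:.4f}; final x = {x:.1e}")

Note that feasibility of the terminal constraint at every visited state is exactly the "major issue" above; in this toy example it is easy to check (one step with u = −a x / b suffices when |a x / b| ≤ umax), but in general it calls for the target tube reachability methods mentioned above.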

Bertsekas Reinforcement Learning 26 / 27

Page 22: Topics in Reinforcement Learning: Rollout and Approximate Policy Iteration

About the Next Lecture

We will cover:
Rollout for multiagent problems

Constrained rollout

Discrete optimization problems and rollout

PLEASE READ AS MUCH OF THE REMAINDER OF CHAPTER 2 AS YOU CAN

PLEASE DOWNLOAD THE LATEST VERSION

Bertsekas Reinforcement Learning 27 / 27