TRANSCRIPT
Topics in Reinforcement Learning: Rollout and Approximate Policy Iteration
ASU, CSE 691, Spring 2020
Dimitri P. Bertsekas ([email protected])
Lecture 4
Bertsekas Reinforcement Learning 1 / 27
Outline
1 Review of Approximation in Value Space and Rollout
2 Review of On-Line Rollout for Deterministic Finite-State Problems
3 Stochastic Rollout and Monte Carlo Tree Search
4 On-Line Rollout for Deterministic Infinite Spaces Problems
5 Model Predictive Control
Recall Approximation in Value Space and the Three Approximations
    min_{u_k, μ_{k+1}, ..., μ_{k+ℓ−1}}  E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }
[Figure: approximation in value space — first ℓ steps of lookahead followed by a "future" cost approximation; position evaluation via features (material balance, mobility, safety, etc.) with linear weighting r′φ(x, v); selective-depth lookahead tree.]
Approximations in the DP minimization:
- Replace E{·} with nominal values (certainty equivalent control)
- Limited simulation (Monte Carlo tree search)

Computation of J̃_{k+ℓ} (could itself be approximate):
- Simple choices
- Parametric approximation
- Problem approximation
- Rollout
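The "replace E{·} with nominal values" option can be sketched in a few lines. This is an illustrative sketch, not code from the lecture; the system, stage cost, and cost-to-go approximation below are assumed stand-ins (the linear example x_{k+1} = 2x_k + u_k with |u_k| ≤ 1 and stage cost x_k² + u_k² appears later in the slides).

```python
def certainty_equivalent_control(x, controls, f, g, J_tilde, w_bar):
    """Deterministic one-step lookahead with w fixed at a nominal value w_bar."""
    return min(controls, key=lambda u: g(x, u, w_bar) + J_tilde(f(x, u, w_bar)))

# Assumed stand-in data: linear system x' = 2x + u, stage cost x^2 + u^2,
# discretized control set, and a quadratic cost-to-go guess.
u_ce = certainty_equivalent_control(
    x=1.0,
    controls=[-1.0, -0.5, 0.0, 0.5, 1.0],
    f=lambda x, u, w: 2 * x + u + w,
    g=lambda x, u, w: x * x + u * u,
    J_tilde=lambda x: x * x,
    w_bar=0.0,
)
```

With w fixed at the nominal value 0, the quantity minimized is 1 + u² + (2 + u)², so the chosen control sits at the constraint boundary u = −1.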
One-step lookahead:

    min_{u_k}  E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(x_{k+1}) }

ℓ-step lookahead:

    min_{u_k, μ_{k+1}, ..., μ_{k+ℓ−1}}  E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }
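As a concrete sketch of the one-step lookahead minimization, with the expectation over w estimated by Monte Carlo sampling — the problem data here are assumed stand-ins for illustration, not the lecture's:

```python
def one_step_lookahead(x, controls, f, g, J_tilde, sample_w, n_samples=100):
    """Pick the u minimizing a sampled estimate of
    E{ g(x, u, w) + J_tilde(f(x, u, w)) }."""
    def q_factor(u):
        total = 0.0
        for _ in range(n_samples):
            w = sample_w()
            total += g(x, u, w) + J_tilde(f(x, u, w))
        return total / n_samples
    return min(controls, key=q_factor)

# Stand-in problem data (assumed): linear system x' = 2x + u,
# stage cost x^2 + u^2, quadratic cost-to-go guess, |u| <= 1 discretized.
f = lambda x, u, w: 2 * x + u + w
g = lambda x, u, w: x * x + u * u
J_tilde = lambda x: x * x
controls = [-1.0, -0.5, 0.0, 0.5, 1.0]
u_best = one_step_lookahead(1.0, controls, f, g, J_tilde, sample_w=lambda: 0.0)
```

With a deterministic disturbance (w ≡ 0) the sampled Q-factor is exact, 1 + u² + (2 + u)², and the minimizing control is u = −1.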
The three approximations in

    min_{u_k}  E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(x_{k+1}) }

are: approximate the minimization, approximate E{·}, and approximate the computation of J̃_{k+1}.
[Figure: truncated rollout with terminal cost approximation J̃; example system x_{k+1} = 2x_k + u_k with control constraint |u_k| ≤ 1 and cost per stage x_k² + u_k²; target tube {X_0, X_1, ..., X_N}; traveling-salesman tours completed by a nearest-neighbor base heuristic.]
ONE-STEP LOOKAHEAD

    min_{u_k}  E{ g_k(x_k, u_k, w_k) + J̃_{k+1}(x_{k+1}) }

MULTISTEP LOOKAHEAD

    min_{u_k, μ_{k+1}, ..., μ_{k+ℓ−1}}  E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }

[Figure: one-step and ℓ-step lookahead trees; at the leaves, the "future" is summarized by the cost approximation J̃.]
The Pure Form of Rollout: For a Suboptimal Base Policy π, Use J̃_{k+ℓ}(x_{k+ℓ}) = J_{k+ℓ,π}(x_{k+ℓ})
    min_{u_k, μ_{k+1}, ..., μ_{k+ℓ−1}}  E{ g_k(x_k, u_k, w_k) + Σ_{m=k+1}^{k+ℓ−1} g_m(x_m, μ_m(x_m), w_m) + J̃_{k+ℓ}(x_{k+ℓ}) }
Use a suboptimal/heuristic policy at the end of the limited lookahead.
The suboptimal policy is called the base policy.
The lookahead policy is called the rollout policy.
The aim of rollout is policy improvement (i.e., the rollout policy performs better than the base policy); this is true under some assumptions. In practice: good performance, very reliable, very simple to implement.
Rollout in its "standard" forms involves simulation and on-line implementation.
The simulation can be prohibitively expensive (so further approximations may be needed), particularly for stochastic problems and multistep lookahead.
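The pure rollout scheme described above — one-step lookahead with the base policy's simulated cost-to-go standing in for J̃_{k+1} — can be sketched as follows. This is a minimal illustration under assumed problem data (finite horizon N, sampled disturbances), not the lecture's implementation:

```python
def rollout_control(x, k, N, controls, f, g, base_policy, sample_w, n_sims=50):
    """One-step lookahead using Monte Carlo simulation of the base policy
    from stage k+1 to the horizon N as the cost-to-go approximation."""

    def simulate_base(state, stage):
        # Run the base policy from `state` at `stage` to the end of the horizon.
        cost = 0.0
        for m in range(stage, N):
            w = sample_w()
            u = base_policy(state, m)
            cost += g(state, u, w)
            state = f(state, u, w)
        return cost

    def q_factor(u):
        total = 0.0
        for _ in range(n_sims):
            w = sample_w()
            total += g(x, u, w) + simulate_base(f(x, u, w), k + 1)
        return total / n_sims

    return min(controls, key=q_factor)

# Assumed stand-in data: linear system x' = 2x + u, stage cost x^2 + u^2,
# horizon N = 2, and the do-nothing base policy u = 0.
u_ro = rollout_control(
    x=1.0, k=0, N=2,
    controls=[-1.0, -0.5, 0.0, 0.5, 1.0],
    f=lambda x, u, w: 2 * x + u + w,
    g=lambda x, u, w: x * x + u * u,
    base_policy=lambda x, m: 0.0,
    sample_w=lambda: 0.0,
    n_sims=3,
)
```

With a deterministic disturbance the Monte Carlo estimate is exact; the cost improvement property (rollout no worse than the base policy) holds under the assumptions mentioned above.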
Deterministic Rollout: At xk+1, Use a Heuristic with Cost Hk+1(xk+1)
[Figure: deterministic rollout. At state x_k, each candidate control u_k leads to a next state x_{k+1}; the base heuristic is run from each x_{k+1}, and its cost H_{k+1}(x_{k+1}) stands in for the "future". The example shows stages 1, ..., N with stage costs c(0), ..., c(N).]
x0 x1 xk xN uk u0k u00
k xk+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . .
x0 x1 xk xN uk u0k u00
k xk+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . .
x0 x1 xk xN uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . .
x0 x1 xk xN uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . .
x0 x1 xk xN x0N x00
N uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . .
x0 x1 xk xN x0N x00
N uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors
x0 x1 xk xN x0N x00
N uk u0k u00
k xk+1 x0k+1 x00
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N − 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N − 1 c(N) c(N − 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x′N x′′
N uk u′k u′′
k xk+1 x′k+1 x′′
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) → 0 for all p-stable π from x0 with x0 ∈ X and π ∈ Pp,x0 Wp+ = {J ∈ J | J+ ≤ J} Wp+ from
within Wp+
Prob. u Prob. 1 − u Cost 1 Cost 1 − √u
J(1) = min!c, a + J(2)
"
J(2) = b + J(1)
J∗ Jµ Jµ′ Jµ′′Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; θk) f(x; θk+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x∗ = F (x∗) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N − 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N − 1 c(N) c(N − 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x′N x′′
N uk u′k u′′
k xk+1 x′k+1 x′′
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) → 0 for all p-stable π from x0 with x0 ∈ X and π ∈ Pp,x0 Wp+ = {J ∈ J | J+ ≤ J} Wp+ from
within Wp+
Prob. u Prob. 1 − u Cost 1 Cost 1 − √u
J(1) = min!c, a + J(2)
"
J(2) = b + J(1)
J∗ Jµ Jµ′ Jµ′′Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; θk) f(x; θk+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x∗ = F (x∗) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
[Figure: rollout with the nearest neighbor base heuristic — initial city, current partial tour, next cities / next states]
At state xk, for every pair (xk, uk), uk ∈ Uk(xk), we generate the Q-factor

Qk(xk, uk) = gk(xk, uk) + Hk+1(fk(xk, uk))

using the base heuristic [Hk+1(xk+1) is the heuristic cost starting from xk+1].
We select the control uk with minimal Q-factor.
We move to the next state xk+1, and continue.
The scheme is very general: the heuristic can be anything (stage- or state-dependent)! It need not correspond to a policy.
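The scheme above can be sketched in code (a minimal, generic sketch; the control set U, system f, stage cost g, and heuristic H are user-supplied placeholders, not from the slides):

```python
def rollout_step(x, k, U, f, g, H):
    """One step of rollout at state x, stage k, for a deterministic problem.

    U(k, x): available controls; f(k, x, u): next state; g(k, x, u): stage cost;
    H(k1, x1): base-heuristic cost starting from state x1 at stage k1.
    Returns the control with minimal Q-factor and the resulting next state.
    """
    best_u, best_q = None, float("inf")
    for u in U(k, x):
        # Q-factor: first-stage cost plus heuristic cost from the next state.
        q = g(k, x, u) + H(k + 1, f(k, x, u))
        if q < best_q:
            best_u, best_q = u, q
    return best_u, f(k, x, best_u)

# Tiny hypothetical illustration: integer state, controls +1/-1,
# stage cost |x|, heuristic cost |x| (all illustrative assumptions).
U = lambda k, x: (+1, -1)
f = lambda k, x, u: x + u
g = lambda k, x, u: abs(x)
H = lambda k, x: abs(x)
u, x1 = rollout_step(3, 0, U, f, g, H)   # from x = 3: Q(+1) = 7, Q(-1) = 5
```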
Bertsekas Reinforcement Learning 7 / 27
Criteria for Cost Improvement of a Rollout Algorithm - Sequential Consistency and Sequential Improvement
Remember: Any heuristic (no matter how inconsistent or silly) is in principle admissible for use as the base heuristic.
So it is important to identify properties guaranteeing that the rollout policy has no worse performance than the base heuristic.
Two such conditions are sequential consistency and sequential improvement.
The base heuristic is sequentially consistent if and only if it can be implemented with a legitimate DP policy {µ0, . . . , µN−1}. Example: Greedy heuristics tend to be sequentially consistent.
A sequentially consistent heuristic is also sequentially improving. See next slide.
We will see that any heuristic can be "fortified" to become sequentially improving.
Bertsekas Reinforcement Learning 8 / 27
Policy Improvement for Sequentially Improving Heuristics
Sequential improvement holds if for all xk (best heuristic Q-factor ≤ heuristic cost):

min_{uk ∈ Uk(xk)} [ gk(xk, uk) + Hk+1(fk(xk, uk)) ] ≤ Hk(xk),

where Hk(xk) is the cost of the trajectory generated by the heuristic starting from xk. True for a sequentially consistent heuristic [Hk(xk) = one of the Q-factors minimized on the left].
Cost improvement property for a sequentially improving base heuristic:

Let the rollout policy be π = {µ0, . . . , µN−1}, and let Jk,π(xk) denote its cost starting from xk. Then for all xk and k, Jk,π(xk) ≤ Hk(xk).

Proof by induction: It holds for k = N, since JN,π = HN = gN. Assume that it holds for index k + 1. We show that it holds for index k:

Jk,π(xk) = gk(xk, µk(xk)) + Jk+1,π(fk(xk, µk(xk)))   (by DP equation)
         ≤ gk(xk, µk(xk)) + Hk+1(fk(xk, µk(xk)))     (by induction hypothesis)
         = min_{uk ∈ Uk(xk)} [ gk(xk, uk) + Hk+1(fk(xk, uk)) ]   (by definition of rollout)
         ≤ Hk(xk)   (by sequential improvement condition)

Bertsekas Reinforcement Learning 9 / 27
A Working Break: Challenge Question
[Figure: walk on a line — axes Time (stages) and Space (positions −N to N), terminal costs g(i), "Move to the Right" base heuristic, a possible path]
Walk on a line of length 2N starting at position 0. At each of N steps, move one unit to the left or one unit to the right.
The objective is to land at a position i of small cost g(i) after N steps.
Question: Consider a base heuristic that takes steps to the right only. Is it sequentially consistent or sequentially improving? How will the rollout perform compared to the base heuristic?
Compare with a superheuristic/combination of two heuristics: 1) Move only to the right, and 2) Move only to the left. The combined base heuristic chooses the path of best cost.
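As a way to experiment with the question, here is a minimal sketch of rollout on this walk with the "always move right" base heuristic (the terminal cost g is an illustrative choice, not from the slides):

```python
def rollout_walk(N, g):
    """One-step-lookahead rollout on the walk, with base heuristic 'move right'.

    At each of the N stages, compare the Q-factors of the two moves
    (heuristic cost-to-go from the resulting position; stage costs are zero,
    only the terminal cost g at the final position matters).
    """
    def H(pos, steps_left):
        # Base heuristic: move right for all remaining steps.
        return g(pos + steps_left)

    pos = 0
    for k in range(N):
        steps_left = N - k - 1
        q_right = H(pos + 1, steps_left)
        q_left = H(pos - 1, steps_left)
        pos = pos + 1 if q_right <= q_left else pos - 1
    return g(pos), pos

# Illustrative terminal cost (an assumption): cheapest landing spot is at -2.
g = lambda i: abs(i + 2)

N = 4
base_cost = g(N)                          # base heuristic alone lands at N
rollout_cost, final_pos = rollout_walk(N, g)
print(base_cost, rollout_cost, final_pos)
```

With this g, the base heuristic lands at position N with cost g(N), while the rollout steers toward the cheap region, illustrating the cost-improvement property for a sequentially consistent base heuristic.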
Bertsekas Reinforcement Learning 10 / 27
A Counterexample to Cost Improvement
[Figure: the counterexample. The base heuristic at x0 chooses the optimal trajectory (x0, u∗0, x∗1, u∗1, x∗2, u∗2, x∗3); at x∗1 the heuristic chooses a high-cost transition, so the rollout choice at x0 is u0, leading to x1.]
Suppose at x0 there is a unique optimal trajectory (x0, u∗0, x∗1, u∗1, x∗2).
Suppose the base heuristic produces this optimal trajectory starting at x0.
Rollout uses the base heuristic to construct a trajectory starting from each of the next states x∗1 and x1.
Suppose the heuristic’s trajectory starting from x∗1 is “bad” (has high cost).
Then the rollout algorithm rejects the optimal control u∗0 in favor of the other control u0, and moves to a nonoptimal next state x1 = f0(x0, u0).
So without some safeguard, rollout can deviate from an already available good trajectory.
This suggests a possible remedy: follow the currently best trajectory whenever rollout suggests an inferior one.
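The remedy above can be sketched in code for a deterministic N-stage problem. This is a minimal sketch under stated assumptions: controls(k, x), f(k, x, u), g(k, x, u), terminal_cost(x), and a base heuristic heuristic(k, x) returning a completing control sequence and its cost-to-go (terminal cost included) are all hypothetical interfaces for the illustration.

```python
def fortified_rollout(x0, N, controls, f, g, terminal_cost, heuristic):
    # Tentative best trajectory T_0: the base heuristic's trajectory from x0
    best_seq, best_cost = heuristic(0, x0)
    x, traj, accrued = x0, [], 0.0
    for k in range(N):
        # Plain rollout step: Q-factor of each control via the base heuristic
        q_best, u_best = float("inf"), None
        for u in controls(k, x):
            _, tail_cost = heuristic(k + 1, f(k, x, u))
            q = g(k, x, u) + tail_cost
            if q < q_best:
                q_best, u_best = q, u
        # Fortification: keep the tentative best trajectory unless the
        # rollout trajectory strictly improves on its total cost
        if accrued + q_best < best_cost:
            u = u_best
            tail, _ = heuristic(k + 1, f(k, x, u))
            best_seq = traj + [u] + list(tail)  # new tentative best
            best_cost = accrued + q_best
        else:
            u = best_seq[k]  # follow the currently best trajectory
        traj.append(u)
        accrued += g(k, x, u)
        x = f(k, x, u)
    return traj, accrued + terminal_cost(x)


# Tiny illustrative instance (all names hypothetical): x_{k+1} = x_k + u_k,
# controls {0, 1}, stage cost u*(k+1), terminal cost (x-2)^2, and a greedy
# base heuristic that always applies u = 1.
N = 2
def controls(k, x): return [0, 1]
def f(k, x, u): return x + u
def g(k, x, u): return u * (k + 1)
def terminal_cost(x): return (x - 2) ** 2
def heuristic(k, x):
    seq, cost = [], 0.0
    for j in range(k, N):
        cost += g(j, x, 1); x = f(j, x, 1); seq.append(1)
    return seq, cost + terminal_cost(x)

traj, cost = fortified_rollout(0, N, controls, f, g, terminal_cost, heuristic)
```

On this instance the two rollout Q-factors tie at stage 0; plain rollout with first-found tie-breaking would end with total cost 3, while the fortified version sticks with the heuristic's trajectory at the tie and reaches the optimal cost 2 (the optimum over all four control sequences).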
Bertsekas Reinforcement Learning 11 / 27
Fortified Rollout: Restores Cost Improvement for Base Heuristics that are not Sequentially Improving
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk)
x0 x1 xk xN uk xk+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N � 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N � 1 c(N) c(N � 1) k k + 1
Heuristic Cost “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . .
x0 x1 xk xN uk xk+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) ! 0 for all p-stable ⇡ from x0 with x0 2 X and ⇡ 2 Pp,x0 Wp+ = {J 2 J | J+ J} Wp+ from
within Wp+
Prob. u Prob. 1 � u Cost 1 Cost 1 �pu
J(1) = min�c, a + J(2)
J(2) = b + J(1)
J⇤ Jµ Jµ0 Jµ00Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; ✓k) f(x; ✓k+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x⇤ = F (x⇤) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N − 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N − 1 c(N) c(N − 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x′N x′′
N uk u′k u′′
k xk+1 x′k+1 x′′
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) → 0 for all p-stable π from x0 with x0 ∈ X and π ∈ Pp,x0 Wp+ = {J ∈ J | J+ ≤ J} Wp+ from
within Wp+
Prob. u Prob. 1 − u Cost 1 Cost 1 − √u
J(1) = min!c, a + J(2)
"
J(2) = b + J(1)
J∗ Jµ Jµ′ Jµ′′Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; θk) f(x; θk+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x∗ = F (x∗) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Initial State 15 1 5 18 4 19 9 21 25 8 12 13 c(0) c(k) c(k + 1) c(N − 1) Parking Spaces
Stage 1 Stage 2 Stage 3 Stage N N − 1 c(N) c(N − 1) k k + 1
Heuristic Cost Heuristic “Future” System xk+1 = fk(xk, uk, wk) xk Observations
Belief State pk Controller µk Control uk = µk(pk) . . . Q-Factors Current State xk
x0 x1 xk xN x′N x′′
N uk u′k u′′
k xk+1 x′k+1 x′′
k+1
Initial State x0 s Terminal State t Length = 1
x0 a 0 1 2 t b C Destination
J(xk) → 0 for all p-stable π from x0 with x0 ∈ X and π ∈ Pp,x0 Wp+ = {J ∈ J | J+ ≤ J} Wp+ from
within Wp+
Prob. u Prob. 1 − u Cost 1 Cost 1 − √u
J(1) = min!c, a + J(2)
"
J(2) = b + J(1)
J∗ Jµ Jµ′ Jµ′′Jµ0 Jµ1 Jµ2 Jµ3 Jµ0
f(x; θk) f(x; θk+1) xk F (xk) F (x) xk+1 F (xk+1) xk+2 x∗ = F (x∗) Fµk(x) Fµk+1
(x)
Improper policy µ
Proper policy µ
1
Permanent trajectory Tk uk uk xk+1 xk+1
�r = ⇧�T
(�)µ (�r)
�⇧(Jµ) µ(i) 2 arg minu2U(i) Qµ(i, u, r)
Subspace M = {�r | r 2 <m} Based on Jµ(i, r) Jµk
minu2U(i)
Pnj=1 pij(u)
�g(i, u, j) + J(j)
�Computation of J :
Good approximation Poor Approximation �(⇠) = ln(1 + e⇠)
max{0, ⇠} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N �1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN�1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
�F (i)
�
R1 R2 R3 R` Rq�1 Rq r⇤q�1 r⇤3 Cost Jµ
�F (i)
�
I1 I2 I3 I` Iq�1 Iq r⇤2 r⇤3 Cost Jµ
�F (i)
�
Aggregate States Scoring Function V (i) J⇤(i) 0 n n� 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
�jf =
⇢1 if j 2 If
0 if j /2 If
1 10 20 30 40 50 I1 I2 I3 i J1(i)
1
Permanent trajectory Tk uk uk xk+1 xk+1
�r = ⇧�T
(�)µ (�r)
�⇧(Jµ) µ(i) 2 arg minu2U(i) Qµ(i, u, r)
Subspace M = {�r | r 2 <m} Based on Jµ(i, r) Jµk
minu2U(i)
Pnj=1 pij(u)
�g(i, u, j) + J(j)
�Computation of J :
Good approximation Poor Approximation �(⇠) = ln(1 + e⇠)
max{0, ⇠} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N �1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN�1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
�F (i)
�
R1 R2 R3 R` Rq�1 Rq r⇤q�1 r⇤3 Cost Jµ
�F (i)
�
I1 I2 I3 I` Iq�1 Iq r⇤2 r⇤3 Cost Jµ
�F (i)
�
Aggregate States Scoring Function V (i) J⇤(i) 0 n n� 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
�jf =
⇢1 if j 2 If
0 if j /2 If
1 10 20 30 40 50 I1 I2 I3 i J1(i)
1
Permanent trajectory Tk uk uk xk+1 xk+1
�r = ⇧�T
(�)µ (�r)
�⇧(Jµ) µ(i) 2 arg minu2U(i) Qµ(i, u, r)
Subspace M = {�r | r 2 <m} Based on Jµ(i, r) Jµk
minu2U(i)
Pnj=1 pij(u)
�g(i, u, j) + J(j)
�Computation of J :
Good approximation Poor Approximation �(⇠) = ln(1 + e⇠)
max{0, ⇠} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N �1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN�1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
�F (i)
�
R1 R2 R3 R` Rq�1 Rq r⇤q�1 r⇤3 Cost Jµ
�F (i)
�
I1 I2 I3 I` Iq�1 Iq r⇤2 r⇤3 Cost Jµ
�F (i)
�
Aggregate States Scoring Function V (i) J⇤(i) 0 n n� 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
�jf =
⇢1 if j 2 If
0 if j /2 If
1 10 20 30 40 50 I1 I2 I3 i J1(i)
1
Permanent trajectory Tk uk uk xk+1 xk+1
�r = ⇧�T
(�)µ (�r)
�⇧(Jµ) µ(i) 2 arg minu2U(i) Qµ(i, u, r)
Subspace M = {�r | r 2 <m} Based on Jµ(i, r) Jµk
minu2U(i)
Pnj=1 pij(u)
�g(i, u, j) + J(j)
�Computation of J :
Good approximation Poor Approximation �(⇠) = ln(1 + e⇠)
max{0, ⇠} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N �1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN�1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
�F (i)
�
R1 R2 R3 R` Rq�1 Rq r⇤q�1 r⇤3 Cost Jµ
�F (i)
�
I1 I2 I3 I` Iq�1 Iq r⇤2 r⇤3 Cost Jµ
�F (i)
�
Aggregate States Scoring Function V (i) J⇤(i) 0 n n� 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
�jf =
⇢1 if j 2 If
0 if j /2 If
1 10 20 30 40 50 I1 I2 I3 i J1(i)
1
Permanent trajectory Pk Tentative trajectory Tk
uk uk xk+1 xk+1 xN xN
Φr = Π!T
(λ)µ (Φr)
"Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
#nj=1 pij(u)
!g(i, u, j) + J(j)
"Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN−1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
!F (i)
"
R1 R2 R3 Rℓ Rq−1 Rq r∗q−1 r∗
3 Cost Jµ
!F (i)
"
I1 I2 I3 Iℓ Iq−1 Iq r∗2 r∗
3 Cost Jµ
!F (i)
"
Aggregate States Scoring Function V (i) J∗(i) 0 n n − 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
φjf =
$1 if j ∈ If
0 if j /∈ If
1
Permanent trajectory Pk Tentative trajectory Tk
uk uk xk+1 xk+1 xN xN
Φr = Π!T
(λ)µ (Φr)
"Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
#nj=1 pij(u)
!g(i, u, j) + J(j)
"Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
Cost 0 Cost g(i, u, j) Monte Carlo tree search First Step “Future”Feature Extraction
Node Subset S1 SN Aggr. States Stage 1 Stage 2 Stage 3 Stage N −1
Candidate (m+2)-Solutions (u1, . . . , um, um+1, um+2) (m+2)-Solution
Set of States (u1) Set of States (u1, u2) Set of States (u1, u2, u3)
Run the Heuristics From Each Candidate (m+2)-Solution (u1, . . . , um, um+1)
Set of States (u1) Set of States (u1, u2) Neural Network
Set of States u = (u1, . . . , uN ) Current m-Solution (u1, . . . , um)
Cost G(u) Heuristic N -Solutions u = (u1, . . . , uN−1)
Candidate (m + 1)-Solutions (u1, . . . , um, um+1)
Cost G(u) Heuristic N -Solutions
Piecewise Constant Aggregate Problem Approximation
Artificial Start State End State
Piecewise Constant Aggregate Problem Approximation
Feature Vector F (i) Aggregate Cost Approximation Cost Jµ
!F (i)
"
R1 R2 R3 Rℓ Rq−1 Rq r∗q−1 r∗
3 Cost Jµ
!F (i)
"
I1 I2 I3 Iℓ Iq−1 Iq r∗2 r∗
3 Cost Jµ
!F (i)
"
Aggregate States Scoring Function V (i) J∗(i) 0 n n − 1 State i Cost
function Jµ(i)I1 ... Iq I2 g(i, u, j)...
TD(1) Approximation TD(0) Approximation V1(i) and V0(i)
Aggregate Problem Approximation TD(0) Approximation V1(i) andV0(i)
φjf =
$1 if j ∈ If
0 if j /∈ If
1
[Figure labels: parking example with spaces c(0), . . . , c(N); belief state pk and controller µk with control uk = µk(pk); p-stable policies and regions Wp, W+; proper and improper policies µ]
[Figure labels: approximation in value space — approximate min, approximate E{·}, cost-to-go approximation Jk+1; ℓ-step lookahead tree, Monte Carlo tree search, certainty equivalence; truncated rollout with terminal cost approximation; traveling salesman tours with nearest neighbor heuristic]
[Figure labels: feature-based parametric architecture Jµ(F(i), r) ≈ Jµ(i) with feature map F(i) = (F1(i), . . . , Fs(i)) and weight vector r = (r1, . . . , rs); linear feature-based architecture Jµ(F(i), r) = Σℓ Fℓ(i)rℓ; aggregation/disaggregation probabilities and aggregate states I1, . . . , Iq; AlphaZero annotations (same algorithm learned multiple games — Go, Shogi; plays much better than all chess programs); VI convergence to Jp from within Wp and to J+ from within W+]
Upon reaching state xk it stores the permanent trajectory
Pk = {x0, u0, . . . , uk−1, xk}
that has been constructed up to stage k, and also stores a tentative best trajectory
Tk = {xk, uk, xk+1, uk+1, . . . , uN−1, xN}
The tentative best trajectory is such that Pk ∪ Tk is the best complete trajectory computed up to stage k of the algorithm.
Initially, P0 = {x0} and T0 is the trajectory computed by the base heuristic from x0.
At each step, the fortified rollout follows the best trajectory computed thus far.
Bertsekas Reinforcement Learning 12 / 27
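The bookkeeping above can be sketched in code. Below is a minimal illustration for a deterministic finite-horizon problem; the interface (`controls`, `step`, `stage_cost`, `terminal_cost`, and a `base_heuristic` returning a completing control sequence plus its cost-to-go) is an assumed toy interface, not notation from the lecture.

```python
import math

def fortified_rollout(x0, N, controls, step, stage_cost, terminal_cost,
                      base_heuristic):
    """Fortified rollout for a deterministic N-stage problem.

    base_heuristic(x, k) returns (tail_controls, cost): a control sequence
    completing the trajectory from state x at stage k, and its cost-to-go
    (terminal cost included).
    """
    # T_0: tentative best trajectory = the base heuristic's trajectory from x0;
    # tent_cost is the total cost of the permanent trajectory followed by T_k.
    tent_controls, tent_cost = base_heuristic(x0, 0)
    x, accrued, applied = x0, 0.0, []  # accrued = cost of permanent trajectory P_k
    for k in range(N):
        # Plain rollout step: one-step lookahead with base-heuristic completion.
        best_u, best_q, best_tail = None, math.inf, None
        for u in controls(x, k):
            tail, tail_cost = base_heuristic(step(x, u, k), k + 1)
            q = stage_cost(x, u, k) + tail_cost
            if q < best_q:
                best_u, best_q, best_tail = u, q, tail
        if accrued + best_q < tent_cost:
            # The rollout choice implies a strictly better complete trajectory:
            # adopt it as the new tentative best.
            u, tent_controls, tent_cost = best_u, best_tail, accrued + best_q
        else:
            # Fortification: keep following the tentative best trajectory.
            u, tent_controls = tent_controls[0], tent_controls[1:]
        accrued += stage_cost(x, u, k)
        applied.append(u)
        x = step(x, u, k)
    return applied, accrued + terminal_cost(x)

# Toy 2-stage shortest-path example where the base heuristic is good from x0
# but bad from node 1, so plain rollout would be misled while fortified
# rollout sticks with the tentative best trajectory 0 -> 1 -> 3 (cost 0).
edges = {(0, 1): 0.0, (0, 2): 1.0, (1, 3): 0.0, (1, 4): 9.0, (2, 5): 4.0}
succ = {(0, 0): [1, 2], (1, 1): [3, 4], (2, 1): [5]}
heur = {(0, 0): ([1, 3], 0.0), (1, 1): ([4], 9.0), (2, 1): ([5], 4.0),
        (3, 2): ([], 0.0), (4, 2): ([], 0.0), (5, 2): ([], 0.0)}

u_seq, cost = fortified_rollout(
    0, 2,
    controls=lambda x, k: succ[(x, k)],
    step=lambda x, u, k: u,                   # controls name the successor node
    stage_cost=lambda x, u, k: edges[(x, u)],
    terminal_cost=lambda x: 0.0,
    base_heuristic=lambda x, k: heur[(x, k)])
# u_seq == [1, 3], cost == 0.0 (plain rollout would take 0 -> 2 -> 5, cost 5)
```

The key line is the comparison `accrued + best_q < tent_cost`: the rollout control is accepted only if the complete trajectory it implies improves on the tentative best, which is what guarantees the fortified algorithm never does worse than the base heuristic from x0.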
Illustration of Fortified Algorithm
[Figure labels: base-heuristic (tentative best) trajectory (x0, u∗0, x∗1, u∗1, x∗2, u∗2, x∗3) vs. rollout choices (u1, x2, u2, x3); base policy, rollout policy, approximation in value space; truncated rollout policy µ for m steps with terminal cost approximation; linear system example xk+1 = 2xk + uk, |uk| ≤ 1, cost per stage x²k + u²k; target tube {X0, X1, . . . , XN} must be reachable]
[Figure labels: rollout trajectory (x0, u0, x1, u1, . . .) vs. optimal trajectory chosen by the base heuristic at x0; high-cost transition chosen by the heuristic; permanent trajectory, tentative trajectory, rollout choice]
Tentative Best
Corrected J J J* Cost Jµ
!F (i), r
"of i ≈ Jµ(i) Jµ(i) Feature Map
Jµ
!F (i), r
": Feature-based parametric architecture State
r: Vector of weights Original States Aggregate States
Position “value” Move “probabilities” Simplify E{·}Choose the Aggregation and Disaggregation Probabilities
Use a Neural Network or Other Scheme Form the Aggregate StatesI1 Iq
Use a Neural Scheme or Other Scheme System Equation or Simulator
Possibly Include “Handcrafted” Features
Generate Features F (i) of Formulate Aggregate Problem
Generate “Impoved” Policy µ by “Solving” the Aggregate Problem
Same algorithm learned multiple games (Go, Shogi)
Aggregate costs r∗ℓ Cost function J0(i) Cost function J1(j)
Approximation in a space of basis functions Plays much better thanall chess programs
Cost αkg(i, u, j) Transition probabilities pij(u) Wp
Controlled Markov Chain Evaluate Approximate Cost Jµ of
Evaluate Approximate Cost Jµ
!F (i)
"of
F (i) =!F1(i), . . . , Fs(i)
": Vector of Features of i
Jµ
!F (i)
": Feature-based architecture Final Features
If Jµ
!F (i), r
"=
#sℓ=1 Fℓ(i)rℓ it is a linear feature-based architecture
(r1, . . . , rs: Scalar weights)
Wp: Functions J ≥ Jp with J(xk) → 0 for all p-stable π
Wp′ : Functions J ≥ Jp′ with J(xk) → 0 for all p′-stable π
W+ =$J | J ≥ J+, J(t) = 0
%
VI converges to J+ from within W+
Cost: g(xk, uk) ≥ 0 VI converges to Jp from within Wp
1
Fortified rollout stores as initial tentative best trajectory the unique optimal trajectory (x0, u∗0, x∗1, u∗1, x∗2) generated by the base heuristic at x0.
Starting at x0, it runs the heuristic from x∗1 and from x1, and (even though the heuristic prefers x1 to x∗1) it discards the control u0 in favor of u∗0, which is dictated by the tentative best trajectory.
It then sets the permanent trajectory to (x0, u∗0, x∗1) and the tentative best trajectory to (x∗1, u∗1, x∗2).
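The bookkeeping above can be sketched in code — a minimal Python sketch (not from the lecture) for a deterministic finite-horizon problem; the problem interface f, g, gN, U and the base heuristic are hypothetical placeholders:

```python
def heuristic_completion(x, k, N, f, g, gN, heuristic):
    """Run the base heuristic from state x at stage k to the horizon N;
    return (list of controls applied, cost of the completed tail)."""
    us, cost = [], 0.0
    for m in range(k, N):
        u = heuristic(x, m)
        cost += g(x, u, m)
        x = f(x, u, m)
        us.append(u)
    return us, cost + gN(x)

def fortified_rollout(x0, N, U, f, g, gN, heuristic):
    """One-step lookahead rollout that follows the tentative best trajectory
    whenever the rollout choice cannot strictly improve on it."""
    # Initial tentative best trajectory: the heuristic's own trajectory from x0.
    best_us, best_cost = heuristic_completion(x0, 0, N, f, g, gN, heuristic)
    x, accrued, permanent = x0, 0.0, []       # permanent trajectory P_k
    for k in range(N):
        # Q-factor of each control: accrued cost + stage cost + heuristic tail.
        q = {}
        for u in U(x, k):
            tail_us, tail = heuristic_completion(f(x, u, k), k + 1, N,
                                                 f, g, gN, heuristic)
            q[u] = (accrued + g(x, u, k) + tail, tail_us)
        u_roll = min(q, key=lambda u: q[u][0])
        cost_roll, tail_us = q[u_roll]
        if cost_roll < best_cost:             # adopt a new tentative best trajectory
            best_cost = cost_roll
            best_us = permanent + [u_roll] + tail_us
        u = best_us[k]                        # otherwise fortify: follow tentative best
        accrued += g(x, u, k)
        permanent.append(u)
        x = f(x, u, k)
    return permanent, best_cost
```

By construction, the returned cost never exceeds the base heuristic's cost from x0: the tentative best trajectory is only ever replaced by a strictly cheaper one.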
Bertsekas Reinforcement Learning 13 / 27
Multistep Deterministic Rollout with Terminal Cost Approximation
[Figure: ℓ-step lookahead tree from xk through states xk+1, xk+2, followed by truncated rollout with the base heuristic and a terminal cost approximation J̃ at the end of the rollout]
Truncation and terminal cost approximation save computation, but the cost improvement property may be lost.
We can prove cost improvement by assuming sequential consistency together with a special property of the terminal cost function approximation that resembles sequential improvement (more on this when we discuss infinite horizon rollout).
It is not necessarily true that longer lookahead leads to improved performance, although it is usually true (there are counterexamples similar to those in the last lecture).
It is not necessarily true that increasing the length of the rollout leads to improved performance (some examples indicate this). Moreover, long rollout is computationally costly.
Experimentation with the length of the rollout and with the terminal cost function approximation is recommended.
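As a concrete reference point, truncated rollout with a terminal cost approximation can be sketched as below — an illustrative Python sketch (not from the lecture) for a deterministic problem; the interface U, f, g, base_policy, J_tilde is hypothetical:

```python
def truncated_rollout_control(x, k, U, f, g, base_policy, J_tilde, m):
    """One-step lookahead with m-step truncated rollout of the base policy,
    ending with a terminal cost approximation J_tilde at the truncation state."""
    best_u, best_q = None, float("inf")
    for u in U(x, k):
        nxt = f(x, u, k)
        q = g(x, u, k)
        for j in range(m):                     # m steps of the base policy
            v = base_policy(nxt, k + 1 + j)
            q += g(nxt, v, k + 1 + j)
            nxt = f(nxt, v, k + 1 + j)
        q += J_tilde(nxt)                      # truncate: approximate the rest
        if q < best_q:
            best_u, best_q = u, q
    return best_u, best_q
```

Both m and the quality of J_tilde are the tuning knobs the bullets above recommend experimenting with.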
Bertsekas Reinforcement Learning 14 / 27
Stochastic Rollout [Uses J̃k+ℓ(xk+ℓ) = Jk+ℓ,π(xk+ℓ) and MC Simulation]

$$
\min_{u_k,\,\mu_{k+1},\ldots,\mu_{k+\ell-1}}
E\left\{ g_k(x_k,u_k,w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\bigl(x_m,\mu_m(x_m),w_m\bigr) + \tilde J_{k+\ell}(x_{k+\ell}) \right\}
$$
Approximations: DP minimization; replacing E{·} with nominal values (certainty equivalent control); limited simulation (Monte Carlo tree search).
Computation of J̃k+ℓ (could be approximate): simple choices; parametric approximation; problem approximation; rollout.
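The Monte Carlo computation of J̃k+ℓ via simulation of the base policy can be sketched as follows — a simplified Python sketch with ℓ = 1 (not the lecture's code); all function arguments are hypothetical placeholders:

```python
import random

def mc_q_factor(x, u, k, N, f, g, gN, base_policy, sample_w, n_sims, rng):
    """Monte Carlo estimate of the Q-factor of control u at state x: average,
    over simulated disturbance sequences, of the first-stage cost plus the
    cost of following the base policy to the horizon."""
    total = 0.0
    for _ in range(n_sims):
        w = sample_w(rng)
        cost, xm = g(x, u, k, w), f(x, u, k, w)
        for s in range(k + 1, N):
            ws = sample_w(rng)
            v = base_policy(xm, s)
            cost += g(xm, v, s, ws)
            xm = f(xm, v, s, ws)
        total += cost + gN(xm)
    return total / n_sims

def stochastic_rollout_control(x, k, N, U, f, g, gN,
                               base_policy, sample_w, n_sims, rng):
    """Rollout control: minimize the sampled Q-factors over u (ℓ = 1 lookahead)."""
    return min(U(x, k), key=lambda u: mc_q_factor(x, u, k, N, f, g, gN,
                                                  base_policy, sample_w,
                                                  n_sims, rng))
```

Here each Q-factor is estimated with equal simulation effort n_sims per control — the assumption that Monte Carlo tree search, discussed later, relaxes.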
Consider the pure case (no truncation, no terminal cost approximation). Assume that the base heuristic is a legitimate policy π = {µ0, . . . , µN−1} (i.e., it is sequentially consistent, in the context of deterministic problems).
Let π̃ = {µ̃0, . . . , µ̃N−1} be the rollout policy. Then cost improvement is obtained:
Jk,π̃(xk) ≤ Jk,π(xk), for all xk and k.
The proof is by induction, essentially identical to the one for the sequentially improving case (see the text).
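The cost improvement property is easy to check empirically. The sketch below (illustrative Python, not from the text) compares the pure rollout policy against a fixed base policy — which is a sequentially consistent heuristic — on small random deterministic problems:

```python
import random

def policy_cost(x, k, N, f, g, gN, mu):
    """Cost J_{k,mu}(x) of following the fixed policy mu from x at stage k."""
    cost = 0.0
    for s in range(k, N):
        u = mu(x, s)
        cost += g(x, u, s)
        x = f(x, u, s)
    return cost + gN(x)

def rollout_cost(x0, N, U, f, g, gN, mu):
    """Cost of the pure rollout policy with base policy mu, starting at x0."""
    x, cost = x0, 0.0
    for k in range(N):
        # Rollout control: minimize stage cost + base-policy cost from next state.
        u = min(U(x, k), key=lambda v: g(x, v, k)
                + policy_cost(f(x, v, k), k + 1, N, f, g, gN, mu))
        cost += g(x, u, k)
        x = f(x, u, k)
    return cost + gN(x)

# Empirical check of J_{0,rollout}(x0) <= J_{0,mu}(x0) on random instances.
rng = random.Random(1)
for trial in range(20):
    N, S = 4, 5
    table = {(x, u, k): rng.randint(0, 9)
             for x in range(S) for u in range(S) for k in range(N)}
    f = lambda x, u, k: u                    # control u = next state
    g = lambda x, u, k, t=table: t[(x, u, k)]
    gN = lambda x: x
    U = lambda x, k: range(S)
    mu = lambda x, k: (x + 1) % S            # an arbitrary fixed base policy
    assert rollout_cost(0, N, U, f, g, gN, mu) <= policy_cost(0, 0, N, f, g, gN, mu)
```

The assertion holds on every instance, consistent with the induction argument above.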
Bertsekas Reinforcement Learning 16 / 27
Backgammon Example
[Figure: possible backgammon moves, each evaluated by average score via Monte Carlo simulation]
Announced by Tesauro in 1996.
Truncated rollout with cost function approximation provided by TD-Gammon (a famous 1991 program by Tesauro, involving a neural network trained by a form of approximate policy iteration that uses "Temporal Differences").
Plays better than TD-Gammon, and better than any human.
Too slow for real-time play (without parallel hardware), due to excessive simulation time.
Bertsekas Reinforcement Learning 17 / 27
Monte Carlo Tree Search - Motivation
We assumed equal effort for the evaluation of the Q-factors of all controls at a state xk.
Drawbacks:
The trajectories may be too long because the horizon length N is large (or infinite, in an infinite horizon context).
Some controls uk may be clearly inferior to others and may not be worth as much sampling effort.
Some controls uk that appear to be promising may be worth exploring more thoroughly through multistep lookahead.
Monte Carlo tree search (MCTS) is a "randomized" form of lookahead. MCTS aims to trade off computational economy against a hopefully small risk of degradation in performance.
It involves adaptive simulation (simulation effort adapted to the perceived quality of different controls).
It aims to balance exploitation (extra simulation effort on controls that look promising) and exploration (adequate exploration of the potential of all controls).
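Adaptive simulation can be illustrated with a UCB1-style index — a standard bandit rule, used here as an illustrative stand-in for the lecture's adaptive simulation schemes; the sampler interface is hypothetical:

```python
import math

def adaptive_q_sampling(q_samplers, n_total, c=2.0):
    """Allocate n_total simulations across controls with a UCB-style index
    for cost minimization: repeatedly sample the control with the smallest
    mean - c*sqrt(log t / n), balancing exploitation and exploration.
    q_samplers: dict mapping each control to a zero-argument function that
    returns one sampled Q-factor."""
    # One warm-up sample per control; stats maps control -> [count, sum].
    stats = {u: [1, sampler()] for u, sampler in q_samplers.items()}
    for t in range(len(q_samplers) + 1, n_total + 1):
        u = min(stats, key=lambda v: stats[v][1] / stats[v][0]
                - c * math.sqrt(math.log(t) / stats[v][0]))
        stats[u][0] += 1
        stats[u][1] += q_samplers[u]()
    means = {u: s / n for u, (n, s) in stats.items()}
    return min(means, key=means.get), stats
```

Controls whose sampled Q-factors look poor receive few further simulations, while the exploration bonus guarantees that no control is abandoned prematurely — exactly the exploitation/exploration trade-off described above.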
Bertsekas Reinforcement Learning 18 / 27
Monte Carlo Tree Search - Adaptive Simulation
[Figure: adaptive simulation of sample Q-factors at the current state xk — controls uk, u′k, u′′k lead to states xk+1, x′k+1, x′′k+1, from which trajectories are simulated out to xN, x′N, x′′N]
#$,
2-Step Lookahead (onestep lookahead plus one step approx-imation)
Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′
Truncated Rollout Terminal Cost Approximation J
Parametric approximation Neural nets Discretization
Parametric approximation Neural nets Discretization
Cost Function Approximation Jk+ℓ
Rollout, Model Predictive Control
b+k b−
k Permanent trajectory P k Tentative trajectory T k
minuk
E!gk(xk, uk, wk)+Jk+1(xk+1)
$
Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
1
Sample Q-Factors Simulation Control 1 Control 2 Control 3
Complete Tours Current Partial Tour Next Cities Next States
Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath
Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)
E!
gk+1(xk+1, uk+1, wk+1)
+Jk+2
"fk+1(xk+1, uk+1, wk+1)
#$,
2-Step Lookahead (onestep lookahead plus one step approx-imation)
Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′
Truncated Rollout Terminal Cost Approximation J
Parametric approximation Neural nets Discretization
Parametric approximation Neural nets Discretization
Cost Function Approximation Jk+ℓ
Rollout, Model Predictive Control
b+k b−
k Permanent trajectory P k Tentative trajectory T k
minuk
E!gk(xk, uk, wk)+Jk+1(xk+1)
$
Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
1
Sample Q-Factors Simulation Control 1 Control 2 Control 3
Complete Tours Current Partial Tour Next Cities Next States
Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath
Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)
E!
gk+1(xk+1, uk+1, wk+1)
+Jk+2
"fk+1(xk+1, uk+1, wk+1)
#$,
2-Step Lookahead (onestep lookahead plus one step approx-imation)
Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′
Truncated Rollout Terminal Cost Approximation J
Parametric approximation Neural nets Discretization
Parametric approximation Neural nets Discretization
Cost Function Approximation Jk+ℓ
Rollout, Model Predictive Control
b+k b−
k Permanent trajectory P k Tentative trajectory T k
minuk
E!gk(xk, uk, wk)+Jk+1(xk+1)
$
Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
1
Sample Q-Factors Simulation Control 1 Control 2 Control 3
Complete Tours Current Partial Tour Next Cities Next States
Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath
Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)
E!
gk+1(xk+1, uk+1, wk+1)
+Jk+2
"fk+1(xk+1, uk+1, wk+1)
#$,
2-Step Lookahead (onestep lookahead plus one step approx-imation)
Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′
Truncated Rollout Terminal Cost Approximation J
Parametric approximation Neural nets Discretization
Parametric approximation Neural nets Discretization
Cost Function Approximation Jk+ℓ
Rollout, Model Predictive Control
b+k b−
k Permanent trajectory P k Tentative trajectory T k
minuk
E!gk(xk, uk, wk)+Jk+1(xk+1)
$
Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
Good approximation Poor Approximation σ(ξ) = ln(1 + eξ)
max{0, ξ} J(x)
1
MCTS provides an economical sampling policy to estimate the Q-factors

Qk(xk, uk) = E{gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk))},   uk ∈ Uk(xk)

Assume that Uk(xk) contains a finite number of elements, say i = 1, . . . , m
After the nth sampling period we have Qi,n, the empirical mean of the Q-factor of each control i (total sample value divided by the total number of samples corresponding to i). We view Qi,n as an exploitation index.
How do we use the estimates Qi,n to select the control to sample next?
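As an illustration (not from the lecture), the empirical means Qi,n can be maintained as running sums and counts; the cost model in sample_q below is a hypothetical stand-in for one-stage cost plus simulated cost-to-go:

```python
import random

class QEstimator:
    """Maintains the empirical mean Q-factor Q_{i,n} for controls i = 0..m-1."""
    def __init__(self, m):
        self.total = [0.0] * m   # running sum of sampled Q-factors per control
        self.count = [0] * m     # number of samples s_i per control

    def add_sample(self, i, q_sample):
        self.total[i] += q_sample
        self.count[i] += 1

    def mean(self, i):
        # empirical mean: total sample value / number of samples for control i
        return self.total[i] / self.count[i] if self.count[i] else float("inf")

# Hypothetical sampler: stage cost plus noisy simulated cost-to-go
def sample_q(i, rng):
    w = rng.gauss(0.0, 1.0)
    return (i - 1.0) ** 2 + 0.1 * w   # control i = 1 has the smallest true cost

rng = random.Random(0)
est = QEstimator(3)
for n in range(300):
    i = n % 3                          # round-robin sampling for this sketch
    est.add_sample(i, sample_q(i, rng))

best = min(range(3), key=est.mean)     # control with smallest exploitation index
```

Round-robin sampling is used only to keep the sketch short; the slides that follow replace it with a statistical selection rule.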
Bertsekas Reinforcement Learning 19 / 27
MCTS Based on Statistical Tests
[Figure: MCTS at the current state xk — simulation generates sample Q-factors for Controls 1, 2, 3, each scored by the sum Qi,n + Ri,n of its exploitation and exploration indices.]
A good sampling policy balances exploitation (sample controls that seem most promising, i.e., have a small Qi,n) and exploration (sample controls with a small sample count).
A popular strategy: sample next the control i that minimizes the sum Qi,n + Ri,n, where Ri,n is an exploration index.
Ri,n is based on a confidence interval formula and depends on the sample count si of control i (it comes from the analysis of multiarmed bandit problems).
The UCB rule (upper confidence bound) sets Ri,n = −c √(log n / si), where c is a positive constant, selected empirically (values c ≈ √2 are suggested, assuming that Qi,n is normalized to take values in the range [−1, 0]).
MCTS with the UCB rule has been extended to multistep lookahead ... but AlphaZero has used a different (semi-heuristic) rule.
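A minimal sketch of the UCB rule as stated above (cost minimization, so the exploration index enters with a minus sign); the noise level and the cost values, normalized to [−1, 0], are illustrative assumptions:

```python
import math
import random

def ucb_select(q_mean, count, n, c=math.sqrt(2)):
    """Pick the control minimizing Q_{i,n} + R_{i,n}, R_{i,n} = -c*sqrt(log n / s_i).
    Controls with no samples yet (s_i = 0) are sampled first."""
    for i, s in enumerate(count):
        if s == 0:
            return i
    scores = [q_mean[i] - c * math.sqrt(math.log(n) / count[i])
              for i in range(len(count))]
    return min(range(len(count)), key=scores.__getitem__)

# Illustrative use: Q-factors in [-1, 0]; control 2 has the smallest true cost
rng = random.Random(1)
true_cost = [-0.3, -0.5, -0.8]
total, count = [0.0] * 3, [0] * 3
for n in range(1, 1001):
    i = ucb_select([total[j] / count[j] if count[j] else 0.0 for j in range(3)],
                   count, n)
    total[i] += true_cost[i] + 0.1 * rng.gauss(0.0, 1.0)  # noisy Q-factor sample
    count[i] += 1

best = max(range(3), key=count.__getitem__)  # most frequently sampled control
```

After enough sampling periods, the rule concentrates its samples on the control with the smallest Q-factor while still allocating a logarithmically growing number of samples to the others.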
Bertsekas Reinforcement Learning 20 / 27
Classical Control Problems - Infinite Control Spaces
REGULATION PROBLEM: Keep the state near the origin
PATH PLANNING: Keep the state close to a trajectory
We need to deal with state and control constraints; a linear-quadratic formulation is often inadequate
Bertsekas Reinforcement Learning 22 / 27
On-Line Rollout for Deterministic Infinite-Spaces Problems
[Figure: on-line rollout for a deterministic problem — at stage k, minimization over the control uk; over stages k+1, . . . , k+ℓ−1, an (ℓ−1)-stage minimization leading to the states xk+ℓ; from there, cost approximation by a base heuristic.]
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
1
Control uk (ℓ − 1)-Stages Base Heuristic Minimization
Sample Q-Factors (ℓ − 1)-Stages State xk+ℓ = 0
Complete Tours Current Partial Tour Next Cities Next States
Q1,n +R1,n Q2,n +R2,n Q3,n +R3,n Stage k Stages k+1, . . . , k+ ℓ−1
Base Heuristic Minimization Possible Path
Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath
Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)
E!
gk+1(xk+1, uk+1, wk+1)
+Jk+2
"fk+1(xk+1, uk+1, wk+1)
#$,
2-Step Lookahead (onestep lookahead plus one step approx-imation)
Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′
Truncated Rollout Terminal Cost Approximation J
Parametric approximation Neural nets Discretization
Parametric approximation Neural nets Discretization
Cost Function Approximation Jk+ℓ
Rollout, Model Predictive Control
b+k b−
k Permanent trajectory P k Tentative trajectory T k
minuk
E!gk(xk, uk, wk)+Jk+1(xk+1)
$
Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
1
Suppose the control space Uk (xk ) is infinite.
One possibility is discretization of Uk (xk ); but this requires an excessive number of Q-factors.
Another possibility is to use optimization heuristics that look (ℓ − 1) steps ahead.
Seamlessly combine the k th stage minimization and the optimization heuristic into a single ℓ-stage deterministic optimization.
Can solve it by nonlinear programming/optimal control methods (e.g., quadratic programming, gradient-based methods).
This is the idea underlying model predictive control (MPC).
Bertsekas Reinforcement Learning 23 / 27
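A minimal sketch of this idea, for a hypothetical scalar system xm+1 = xm + um with quadratic stage cost (an illustrative stand-in, not from the slides): the kth-stage minimization and the (ℓ − 1)-step lookahead are written as a single ℓ-stage deterministic cost in the controls, then minimized by plain gradient descent as a stand-in for the nonlinear programming methods mentioned above.

```python
import numpy as np

# Hypothetical scalar example: dynamics x_{m+1} = x_m + u_m,
# stage cost x^2 + u^2, terminal cost x^2.

def ell_stage_cost(u, x0):
    """Deterministic ell-stage cost as a function of the control sequence u."""
    x, total = x0, 0.0
    for um in u:
        total += x**2 + um**2
        x = x + um                 # deterministic dynamics
    return total + x**2            # terminal state cost

def solve_ell_stage(x0, ell=4, iters=2000, step=0.05):
    """Minimize the ell-stage cost over (u_k, ..., u_{k+ell-1}) by
    gradient descent with a central-difference gradient."""
    u = np.zeros(ell)
    eps = 1e-6
    for _ in range(iters):
        grad = np.zeros(ell)
        for i in range(ell):
            up, um = u.copy(), u.copy()
            up[i] += eps
            um[i] -= eps
            grad[i] = (ell_stage_cost(up, x0) - ell_stage_cost(um, x0)) / (2 * eps)
        u -= step * grad
    return u

u = solve_ell_stage(x0=2.0)
print(u[0])   # only the first control is applied; the rest are discarded
```

As in MPC, only u[0] would be applied at the current state; the remaining controls are recomputed at the next stage.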
Model Predictive Control for Regulation Problems
[Figure labels: system xk+1 = fk(xk, uk, wk), controller µk, control uk = µk(pk) — slide graphics residue omitted]
System: xk+1 = fk (xk , uk )
Cost per stage: gk (xk , uk ) ≥ 0; the origin 0 is cost-free and absorbing
State and control constraints: xk ∈ Xk , uk ∈ Uk (xk ) for all k
At xk , solve an ℓ-step lookahead version of the problem, requiring xk+ℓ = 0 while satisfying the state and control constraints.
If {uk , . . . , uk+ℓ−1} is the control sequence so obtained, apply uk .
Bertsekas Reinforcement Learning 25 / 27
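The ℓ-step lookahead problem above can be sketched for a hypothetical scalar linear system xm+1 = a xm + b um with quadratic stage cost (illustrative assumptions, not from the slides). Eliminating the dynamics leaves an equality-constrained quadratic program in the controls, with the terminal constraint xk+ℓ = 0, solved here via its KKT system:

```python
import numpy as np

def mpc_sequence(x0, a=1.2, b=1.0, q=1.0, r=0.1, ell=5):
    """Solve the ell-stage problem
         minimize   sum_m  q x_m^2 + r u_m^2
         subject to x_{m+1} = a x_m + b u_m,   x_ell = 0,
    for the scalar system, as an equality-constrained QP via KKT conditions."""
    # Forced/free response: x_{m+1} = c[m] + S[m] @ u
    c = x0 * np.array([a ** (m + 1) for m in range(ell)])
    S = np.zeros((ell, ell))
    for m in range(ell):
        for i in range(m + 1):
            S[m, i] = a ** (m - i) * b
    # Quadratic cost in u over the intermediate states x_1, ..., x_{ell-1}
    Q = q * (S[:-1].T @ S[:-1]) + r * np.eye(ell)
    g = q * (S[:-1].T @ c[:-1])
    # Terminal constraint x_ell = 0:  S[-1] @ u = -c[-1]
    A = S[-1:, :]
    kkt = np.block([[2 * Q, A.T], [A, np.zeros((1, 1))]])
    sol = np.linalg.solve(kkt, np.concatenate([-2 * g, [-c[-1]]]))
    return sol[:ell]          # optimal sequence; MPC applies only sol[0]

# The planned open-loop trajectory reaches the origin at stage ell:
u = mpc_sequence(x0=5.0)
x = 5.0
for um in u:
    x = 1.2 * x + 1.0 * um
print(abs(x) < 1e-8)   # True: terminal constraint x_{k+ell} = 0 is satisfied
```

In closed loop, only the first control of the sequence is applied at each state, and the ℓ-stage problem is re-solved at the next state.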
Relation to Rollout
[Figure labels: rollout and MPC lookahead diagrams — slide graphics residue omitted]
Cost Function Approximation Jk+ℓ
Rollout, Model Predictive Control
b+k b−
k Permanent trajectory P k Tentative trajectory T k
minuk
E!gk(xk, uk, wk)+Jk+1(xk+1)
$
Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
1
Control uk (ℓ − 1)-Stages Base Heuristic Minimization
Sample Q-Factors (ℓ − 1)-Stages State xk+ℓ = 0
Complete Tours Current Partial Tour Next Cities Next States
Q1,n +R1,n Q2,n +R2,n Q3,n +R3,n Stage k Stages k+1, . . . , k+ ℓ−1
Base Heuristic Minimization Possible Path
Simulation Nearest Neighbor Heuristic Move to the Right PossiblePath
Jk+1(xk+1) = minuk+1∈Uk+1(xk+1)
E!
gk+1(xk+1, uk+1, wk+1)
+Jk+2
"fk+1(xk+1, uk+1, wk+1)
#$,
2-Step Lookahead (onestep lookahead plus one step approx-imation)
Certainty equivalence Monte Carlo tree search Lookahead tree ℓ-StepShortest path problem xk xk States xk+1 States xk+2 u u′
Truncated Rollout Terminal Cost Approximation J
Parametric approximation Neural nets Discretization
Parametric approximation Neural nets Discretization
Cost Function Approximation Jk+ℓ
Rollout, Model Predictive Control
b+k b−
k Permanent trajectory P k Tentative trajectory T k
minuk
E!gk(xk, uk, wk)+Jk+1(xk+1)
$
Approximate Min Approximate E{·} Approximate Cost-to-Go Jk+1
Optimal control sequence {u∗0, . . . , u∗
k, . . . , u∗N−1} Simplify E{·}
Tail subproblem Time x∗k Future Stages Terminal Cost k N
Stage k Future Stages Terminal Cost gN(xN )
Control uk Cost gk(xk, uk) x0 xk xk+1 xN xN x′N
uk uk xk+1 xk+1 xN xN x′N
Φr = Π"T
(λ)µ (Φr)
#Π(Jµ) µ(i) ∈ arg minu∈U(i) Qµ(i, u, r)
Subspace M = {Φr | r ∈ ℜm} Based on Jµ(i, r) Jµk
minu∈U(i)
%nj=1 pij(u)
"g(i, u, j) + J(j)
#Computation of J :
1
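The one-step lookahead minimization with its three approximations can be sketched in code. This is a minimal illustration, not from the lecture: the dynamics f, stage cost g, cost-to-go approximation J_tilde, and control grid are all assumed for the example; the min is over a finite grid and E{·} is replaced by a Monte Carlo average.

```python
# Sketch of one-step lookahead with the three approximations:
# approximate min (finite control grid), approximate E{.} (Monte Carlo),
# approximate cost-to-go (a hypothetical J_tilde). All model details
# below are illustrative assumptions.
import random

random.seed(0)
CONTROLS = [-1.0, -0.5, 0.0, 0.5, 1.0]   # finite control grid

def f(x, u, w):                  # dynamics x_{k+1} = f_k(x_k, u_k, w_k)
    return 0.9 * x + u + w

def g(x, u, w):                  # stage cost g_k(x_k, u_k, w_k)
    return x**2 + 0.1 * u**2

def J_tilde(x):                  # cost-to-go approximation (assumed given)
    return 1.5 * x**2

def lookahead_control(x, num_samples=200):
    """argmin over the grid of (1/N) sum_w [ g(x,u,w) + J_tilde(f(x,u,w)) ]."""
    best_u, best_q = None, float("inf")
    for u in CONTROLS:
        q = 0.0
        for _ in range(num_samples):
            w = random.gauss(0.0, 0.1)   # disturbance samples
            q += g(x, u, w) + J_tilde(f(x, u, w))
        q /= num_samples                 # Monte Carlo estimate of the Q-factor
        if q < best_q:
            best_u, best_q = u, q
    return best_u

u = lookahead_control(2.0)               # picks the most negative grid control
```

The same skeleton accommodates the other approximations on the slide: replacing the sample average with a single nominal disturbance gives certainty equivalent control, and steering the samples toward promising controls gives a rudimentary Monte Carlo tree search.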
MPC is rollout whose base heuristic is the (ℓ − 1)-step minimization (0 is a cost-free and absorbing state).
This heuristic is sequentially improving (not sequentially consistent), i.e.,
min_{u_k ∈ U_k(x_k)} [ g_k(x_k, u_k) + H_{k+1}( f_k(x_k, u_k) ) ] ≤ H_k(x_k),
because (opt. cost to reach 0 in ℓ steps) ≤ (opt. cost to reach 0 in ℓ − 1 steps)
Sequential improvement implies “stability": Σ_{k=0}^∞ g_k(x_k, u_k) ≤ H_0(x_0) < ∞, where {x_0, u_0, x_1, u_1, . . .} is the state and control sequence generated by MPC.
Major issue: How do we know that the optimization of the base heuristic is solvable (e.g., that there exists ℓ such that we can drive x_{k+ℓ} to 0 for all x_k ∈ X_k while observing the state and control constraints)? Methods of reachability of target tubes can be used for this (see the text).
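The MPC scheme above can be sketched as rollout: at state x_k solve the ℓ-step problem of driving the state to 0, apply only the first control, and repeat. Below is a minimal sketch for an assumed scalar linear system with quadratic cost; the system, the stage cost, the horizon ELL, and all names are illustrative assumptions, and a generic constrained optimizer stands in for the ℓ-step minimization.

```python
# Sketch of MPC as rollout, assuming the scalar system x_{k+1} = x_k + u_k
# with stage cost g(x, u) = x^2 + u^2. The base heuristic H is the optimal
# cost to drive the state to 0 in a given number of steps.
import numpy as np
from scipy.optimize import minimize

ELL = 4  # lookahead horizon: drive the state to 0 in ELL steps

def heuristic_cost(x0, horizon):
    """Optimal cost (and controls) to reach 0 in `horizon` steps from x0."""
    def cost(u):
        x, c = x0, 0.0
        for uk in u:
            c += x**2 + uk**2
            x += uk
        return c
    def terminal(u):
        return x0 + np.sum(u)            # enforce x_{horizon} = 0
    res = minimize(cost, np.zeros(horizon), method="SLSQP",
                   constraints={"type": "eq", "fun": terminal})
    return res.fun, res.x

def mpc_control(x):
    """Rollout step: apply the first control of the ELL-step minimization."""
    _, u = heuristic_cost(x, ELL)
    return u[0]

# Closed-loop simulation. By sequential improvement, the partial sums of the
# stage costs stay below H_0(x_0), the base heuristic cost at the start.
x, total = 5.0, 0.0
H0, _ = heuristic_cost(x, ELL - 1)
for k in range(30):
    u = mpc_control(x)
    total += x**2 + u**2
    x += u
# after the loop, x is near 0 and total <= H0 (the stability bound)
```

The terminal equality constraint is what makes this the base-heuristic optimization discussed on the slide; for a constrained or nonlinear system its feasibility is exactly the reachability question raised above.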
Bertsekas Reinforcement Learning 26 / 27
About the Next Lecture
We will cover:
Rollout for multiagent problems
Constrained rollout
Discrete optimization problems and rollout
PLEASE READ AS MUCH OF THE REMAINDER OF CHAPTER 2 AS YOU CAN
PLEASE DOWNLOAD THE LATEST VERSION
Bertsekas Reinforcement Learning 27 / 27