MIT Dynamic Programming Lecture Slides
TRANSCRIPT
-
8/11/2019 MIT Dynamic Programming Lecture Slides
1/261
LECTURE SLIDES ON DYNAMIC PROGRAMMING
BASED ON LECTURES GIVEN AT THE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
CAMBRIDGE, MASS
FALL 2004
DIMITRI P. BERTSEKAS
These lecture slides are based on the book: Dynamic Programming and Optimal Control, 2nd edition, Vols. I and II, Athena Scientific, 2001, by Dimitri P. Bertsekas; see
http://www.athenasc.com/dpbook.html
Last Updated: December 2004
The slides are copyrighted, but may be freely reproduced and distributed for any noncommercial purpose.
-
6.231 DYNAMIC PROGRAMMING
LECTURE 1
LECTURE OUTLINE
Problem Formulation
Examples
The Basic Problem
Significance of Feedback
-
BASIC STRUCTURE OF STOCHASTIC DP
Discrete-time system: x_{k+1} = f_k(x_k, u_k, w_k), k = 0, 1, ..., N - 1
k: discrete time
x_k: state; summarizes past information that is relevant for future optimization
u_k: control; decision to be selected at time k from a given set
w_k: random parameter (also called disturbance or noise depending on the context)
N: horizon, or number of times control is applied
Cost function that is additive over time:
E{ g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) }
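The backward recursion implied by this additive-cost structure can be sketched in code for a finite state and control space. This is an illustrative sketch, not from the slides: the function `backward_dp` and its argument names are hypothetical, and the disturbance is assumed to take finitely many values with known probabilities.

```python
def backward_dp(states, controls, f, g, g_N, w_dist, N):
    """Backward DP for x_{k+1} = f_k(x, u, w), cost E{g_N(x_N) + sum g_k}.

    states: finite state list; controls(x): admissible control set U(x);
    w_dist: list of (w, prob) pairs for the disturbance.
    Returns cost-to-go tables J[k][x] and a policy mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)                      # terminal cost
    for k in range(N - 1, -1, -1):            # proceed backwards in time
        for x in states:
            best_u, best = None, float("inf")
            for u in controls(x):
                # expected stage cost plus cost-to-go of the successor state
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for w, p in w_dist)
                if cost < best:
                    best_u, best = u, cost
            J[k][x], mu[k][x] = best, best_u
    return J, mu
```

The table J[0] then contains the optimal expected cost from each starting state, and mu[k] is the feedback policy applied at time k.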
-
INVENTORY CONTROL EXAMPLE
[Figure: inventory system block diagram. Stock x_k at period k, stock u_k ordered at period k, demand w_k at period k; x_{k+1} = x_k + u_k - w_k; cost of period k: c u_k + r(x_k + u_k - w_k).]
Discrete-time system:
x_{k+1} = f_k(x_k, u_k, w_k) = x_k + u_k - w_k
Cost function that is additive over time:
E{ g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) } = E{ sum_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }
Optimization over policies: rules/functions u_k = μ_k(x_k) that map states to controls
-
ADDITIONAL ASSUMPTIONS
The set of values that the control u_k can take depends at most on x_k and not on prior x or u.
The probability distribution of w_k does not depend on past values w_{k-1}, ..., w_0, but may depend on x_k and u_k.
Otherwise past values of w or x would be useful for future optimization.
Sequence of events envisioned in period k:
x_k occurs according to x_k = f_{k-1}(x_{k-1}, u_{k-1}, w_{k-1})
u_k is selected with knowledge of x_k, i.e., u_k ∈ U(x_k)
w_k is random and generated according to a distribution P_{w_k}(x_k, u_k)
-
DETERMINISTIC FINITE-STATE PROBLEMS
Scheduling example: Find optimal sequence of operations A, B, C, D.
A must precede B, and C must precede D. Given startup costs S_A and S_C, and setup transition cost C_{mn} from operation m to operation n.
[Figure: state transition graph from the initial state, through the partial schedules A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA, to the complete schedules, with arcs labeled by the startup costs S_A, S_C and the transition costs C_{mn}.]
-
STOCHASTIC FINITE-STATE PROBLEMS
Example: Find two-game chess match strategy. Timid play draws with prob. p_d > 0 and loses with prob. 1 - p_d. Bold play wins with prob. p_w and loses with prob. 1 - p_w.
-
EXAMPLE
[Figure: shortest path formulation of the scheduling example. Origin node s = A; states AB, AC, AD, ABC, ABD, ACB, ACD, ADB, ADC, and the complete schedules ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, connected to an artificial terminal node t; arc lengths 1, 2, 3, 4, 5, 15, 20; nodes numbered 1-10.]

Iter. No. | Node Exiting OPEN | OPEN after Iteration | UPPER
0 | - | 1 | ∞
1 | 1 | 2, 7, 10 | ∞
2 | 2 | 3, 5, 7, 10 | ∞
3 | 3 | 4, 5, 7, 10 | ∞
4 | 4 | 5, 7, 10 | 43
5 | 5 | 6, 7, 10 | 43
6 | 6 | 7, 10 | 13
7 | 7 | 8, 10 | 13
8 | 8 | 9, 10 | 13
9 | 9 | 10 | 13
10 | 10 | Empty | 13

Note that some nodes never entered OPEN.
-
LABEL CORRECTING METHODS
Origin s, destination t, lengths a_ij that are ≥ 0.
d_i (label of i): length of the shortest path found thus far (initially d_i = ∞ except d_s = 0). The label d_i is implicitly associated with an s --> i path.
UPPER: label d_t of the destination.
OPEN list: contains "active" nodes (initially OPEN = {s}).
[Flowchart: REMOVE a node i from OPEN. For each arc (i, j):
Is d_i + a_ij < d_j? (Is the path s --> i --> j better than the current path s --> j?)
If YES: Is d_i + a_ij < UPPER? (Does the path s --> i --> j have a chance to be part of a shorter s --> t path?)
If YES: set d_j = d_i + a_ij and INSERT j into OPEN.]
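The flowchart above can be sketched in code. A minimal sketch, assuming a finite graph given as an adjacency dict; the function name and representation are illustrative, not from the slides.

```python
import math

def label_correcting(graph, s, t):
    """Label correcting shortest path, following the slide's flowchart.

    graph: dict node -> list of (neighbor, arc_length), with lengths >= 0."""
    nodes = set(graph) | {j for arcs in graph.values() for j, _ in arcs}
    d = {node: math.inf for node in nodes}   # labels (initially inf, except d_s = 0)
    d[s] = 0.0
    upper = math.inf                         # UPPER: label d_t of the destination
    open_list = [s]                          # OPEN: active nodes
    while open_list:
        i = open_list.pop()                  # REMOVE a node i from OPEN
        for j, a_ij in graph.get(i, []):
            # Is s -> i -> j better than the current s -> j path, and can it
            # be part of a shorter s -> t path?
            if d[i] + a_ij < min(d[j], upper):
                d[j] = d[i] + a_ij           # set d_j = d_i + a_ij
                if j == t:
                    upper = d[j]
                elif j not in open_list:
                    open_list.append(j)      # INSERT j into OPEN
    return upper
```

Here `pop()` removes from the top of OPEN, so this particular sketch uses a depth-first selection rule; other node selection rules are discussed below.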
-
6.231 DYNAMIC PROGRAMMING
LECTURE 4
LECTURE OUTLINE
Label correcting methods for shortest paths
Variants of label correcting methods
Branch-and-bound as a shortest path algorithm
-
LABEL CORRECTING METHODS
Origin s, destination t, lengths a_ij that are ≥ 0.
d_i (label of i): length of the shortest path found thus far (initially d_i = ∞ except d_s = 0). The label d_i is implicitly associated with an s --> i path.
UPPER: label d_t of the destination.
OPEN list: contains "active" nodes (initially OPEN = {s}).
[Flowchart: REMOVE a node i from OPEN. For each arc (i, j):
Is d_i + a_ij < d_j? (Is the path s --> i --> j better than the current path s --> j?)
If YES: Is d_i + a_ij < UPPER? (Does the path s --> i --> j have a chance to be part of a shorter s --> t path?)
If YES: set d_j = d_i + a_ij and INSERT j into OPEN.]
-
VALIDITY OF LABEL CORRECTING METHODS
Proposition: If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination.
Proof: (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j.
(2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates.
(3) Let (s, j_1, j_2, ..., j_k, t) be a shortest path and let d* be the shortest distance. If UPPER > d* at termination, UPPER will also be larger than the length of all the paths (s, j_1, ..., j_m), m = 1, ..., k, throughout the algorithm. Hence, node j_k will never enter the OPEN list with d_{j_k} equal to the shortest distance from s to j_k. Similarly, node j_{k-1} will never enter the OPEN list with d_{j_{k-1}} equal to the shortest distance from s to j_{k-1}. Continue to j_1 to get a contradiction.
-
MAKING THE METHOD EFFICIENT
Reduce the value of UPPER as quickly as possible:
Try to discover "good" s --> t paths early in the course of the algorithm.
Keep the number of reentries into OPEN low:
Try to remove from OPEN nodes with small label first.
Heuristic rationale: if d_i is small, then d_j (when set to d_i + a_ij) will be accordingly small, so reentrance of j in the OPEN list is less likely.
Reduce the overhead for selecting the node to be removed from OPEN.
These objectives are often in conflict. They give rise to a large variety of distinct implementations. Good practical strategies try to strike a compromise between low overhead and small label node selection.
-
NODE SELECTION METHODS
Depth-first search: remove from the top of OPEN and insert at the top of OPEN.
Has low memory storage properties (OPEN is not too long). Reduces UPPER quickly.
[Figure: depth-first numbering of a search tree from the origin node s to the destination node t.]
Best-first search (Dijkstra): remove from OPEN a node with minimum value of label.
Interesting property: each node will be inserted in OPEN at most once.
Many implementations/approximations
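Best-first selection is commonly implemented with a binary heap. A minimal sketch (names are illustrative, not from the slides); stale heap entries are skipped rather than removed, a standard workaround since binary heaps do not support cheap re-labeling.

```python
import heapq
import math

def best_first(graph, s, t):
    """Best-first (Dijkstra-like) node selection: always remove from OPEN
    a node with minimum label, maintained here in a binary heap.

    graph: dict node -> list of (neighbor, arc_length), with lengths >= 0."""
    d = {s: 0.0}
    open_heap = [(0.0, s)]
    while open_heap:
        di, i = heapq.heappop(open_heap)     # node with minimum label exits OPEN
        if i == t:
            return di                        # with min-label selection, t's label is final
        if di > d.get(i, math.inf):
            continue                         # stale heap entry; node was re-labeled
        for j, a_ij in graph.get(i, []):
            if di + a_ij < d.get(j, math.inf):
                d[j] = di + a_ij
                heapq.heappush(open_heap, (d[j], j))
    return math.inf
```

With this selection rule each node's label is final when the node exits OPEN, which is the "inserted at most once" property noted above.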
-
ADVANCED INITIALIZATION
Instead of starting from d_i = ∞ for all i ≠ s, start with
d_i = length of some path from s to i (or d_i = ∞)
OPEN = { i ≠ t | d_i < ∞ }
Motivation: get a small starting value of UPPER.
No node with shortest distance ≥ the initial value of UPPER will enter OPEN.
Good practical idea: run a heuristic (or use common sense) to get a good starting path P from s to t.
Use as UPPER the length of P, and as d_i the path distances of all nodes i along P.
Very useful also in reoptimization, where we solve the same problem with slightly different data.
-
VARIANTS OF LABEL CORRECTING METHODS
If a lower bound h_j of the true shortest distance from j to t is known, use the test
d_i + a_ij + h_j < UPPER
for admission to OPEN.
-
BRANCH-AND-BOUND METHOD
Problem: Minimize f(x) over a finite set of feasible solutions X.
Idea of branch-and-bound: partition the feasible set into smaller subsets, and then calculate certain bounds on the attainable cost within some of the subsets to eliminate from further consideration other subsets.
Bounding Principle
Given two subsets Y_1 ⊂ X and Y_2 ⊂ X, suppose that we have bounds
f_1 ≤ min_{x ∈ Y_1} f(x) (a lower bound),   f_2 ≥ min_{x ∈ Y_2} f(x) (an upper bound).
Then, if f_2 ≤ f_1, the solutions in Y_1 may be disregarded since their cost cannot be smaller than the cost of the best solution in Y_2.
The B+B algorithm can be viewed as a label correcting algorithm, where lower bounds define the arc costs, and upper bounds are used to strengthen the test for admission to OPEN.
-
SHORTEST PATH IMPLEMENTATION
Acyclic graph/partition of X into subsets (typically a tree). The leafs consist of single solutions.
Upper/lower bounds f̄_Y and f_Y for the minimum cost over each subset Y can be calculated.
The lower bound of a leaf {x} is f(x).
Each arc (Y, Z) has length f_Z - f_Y (difference of lower bounds).
Shortest distance from X to Y = f_Y - f_X.
Distance from origin X to a leaf {x} is f(x) - f_X.
Shortest path from X to the set of leafs gives the optimal cost and optimal solution.
UPPER is the smallest f(x) out of leaf nodes {x} examined so far.
[Figure: tree partition of X = {1,2,3,4,5} into {1,2,3} and {4,5}, and then into the leaves {1}, {2}, {3}, {4}, {5}.]
-
BRANCH-AND-BOUND ALGORITHM
Step 1: Remove a node Y from OPEN. For each child Y_j of Y, do the following: if f_{Y_j}
-
6.231 DYNAMIC PROGRAMMING
LECTURE 5
LECTURE OUTLINE
Examples of stochastic DP problems
Linear-quadratic problems
Inventory control
-
LINEAR-QUADRATIC PROBLEMS
System: x_{k+1} = A_k x_k + B_k u_k + w_k
Quadratic cost
E_{w_k, k=0,1,...,N-1} { x_N' Q_N x_N + sum_{k=0}^{N-1} ( x_k' Q_k x_k + u_k' R_k u_k ) }
where Q_k ≥ 0 and R_k > 0 (in the positive (semi)definite sense).
w_k are independent and zero mean.
DP algorithm:
J_N(x_N) = x_N' Q_N x_N,
J_k(x_k) = min_{u_k} E { x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) }
Key facts:
J_k(x_k) is quadratic
Optimal policy {μ_0*, ..., μ_{N-1}*} is linear: μ_k*(x_k) = L_k x_k
Similar treatment of a number of variants
-
DERIVATION
By induction verify that μ_k*(x_k) = L_k x_k, J_k(x_k) = x_k' K_k x_k + constant, where the L_k are matrices given by
L_k = -(B_k' K_{k+1} B_k + R_k)^{-1} B_k' K_{k+1} A_k,
and where the K_k are symmetric positive semidefinite matrices given by
K_N = Q_N,
K_k = A_k' ( K_{k+1} - K_{k+1} B_k (B_k' K_{k+1} B_k + R_k)^{-1} B_k' K_{k+1} ) A_k + Q_k.
This is called the discrete-time Riccati equation. Just like DP, it starts at the terminal time N and proceeds backwards.
Certainty equivalence holds (optimal policy is the same as when w_k is replaced by its expected value E{w_k} = 0).
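The Riccati recursion above can be checked numerically in the scalar case, where the transposes disappear. A sketch under the assumption of time-invariant scalar A, B, Q, R; for the matrix case one would substitute linear-algebra operations.

```python
def riccati_recursion(A, B, Q, R, QN, N):
    """Scalar discrete-time Riccati recursion:
    K_N = Q_N,
    K_k = A (K_{k+1} - K_{k+1} B (B K_{k+1} B + R)^{-1} B K_{k+1}) A + Q,
    with gains L_k = -(B K_{k+1} B + R)^{-1} B K_{k+1} A.
    Like the DP algorithm, it starts at k = N and proceeds backwards."""
    K = [0.0] * (N + 1)
    L = [0.0] * N
    K[N] = QN
    for k in range(N - 1, -1, -1):
        Kn = K[k + 1]
        L[k] = -(B * Kn * A) / (B * Kn * B + R)
        K[k] = A * (Kn - Kn * B * B * Kn / (B * Kn * B + R)) * A + Q
    return K, L
```

Running it with a long horizon also illustrates the convergence of K_k discussed on the next slides.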
-
ASYMPTOTIC BEHAVIOR OF RICCATI EQUATION
Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C), where Q = C'C.
The Riccati equation converges, lim_{k→∞} K_k = K, where K is pos. definite and is the unique (within the class of pos. semidefinite matrices) solution of the algebraic Riccati equation
K = A' ( K - K B (B' K B + R)^{-1} B' K ) A + Q
The corresponding steady-state controller μ*(x) = Lx, where
L = -(B' K B + R)^{-1} B' K A,
is stable in the sense that the matrix (A + BL) of the closed-loop system
x_{k+1} = (A + BL) x_k + w_k
satisfies lim_{k→∞} (A + BL)^k = 0.
-
GRAPHICAL PROOF FOR SCALAR SYSTEMS
[Figure: graphical solution of the scalar Riccati equation. The curve F(P) starts at Q for P = 0, has a pole at P = -R/B², approaches the asymptote A²R/B² + Q, and intersects the 45° line at the steady-state solution P*; the iterates P_k, P_{k+1} move toward P*.]
Riccati equation (with P_k = K_{N-k}):
P_{k+1} = A² ( P_k - B² P_k² / (B² P_k + R) ) + Q,
or P_{k+1} = F(P_k), where
F(P) = A² R P / (B² P + R) + Q.
Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.
-
RANDOM SYSTEM MATRICES
Suppose that {A_0, B_0}, ..., {A_{N-1}, B_{N-1}} are not known but rather are independent random matrices that are also independent of the w_k.
DP algorithm is
J_N(x_N) = x_N' Q_N x_N,
J_k(x_k) = min_{u_k} E_{w_k, A_k, B_k} { x_k' Q_k x_k + u_k' R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) }
Optimal policy μ_k*(x_k) = L_k x_k, where
L_k = -( R_k + E{B_k' K_{k+1} B_k} )^{-1} E{B_k' K_{k+1} A_k},
and where the matrices K_k are given by
K_N = Q_N,
K_k = E{A_k' K_{k+1} A_k} - E{A_k' K_{k+1} B_k} ( R_k + E{B_k' K_{k+1} B_k} )^{-1} E{B_k' K_{k+1} A_k} + Q_k
-
PROPERTIES
Certainty equivalence may not hold.
Riccati equation may not converge to a steady-state.
[Figure: the scalar map F(P) in the random-matrix case, with pole at P = -R/E{B²}; F(P) may stay above the 45° line, in which case the iteration P_{k+1} = F(P_k) diverges.]
We have P_{k+1} = F(P_k), where
F(P) = E{A²} R P / ( E{B²} P + R ) + Q + T P² / ( E{B²} P + R ),
T = E{A²} E{B²} - (E{A})² (E{B})²
-
INVENTORY CONTROL
x_k: stock, u_k: inventory purchased, w_k: demand
x_{k+1} = x_k + u_k - w_k, k = 0, 1, ..., N - 1
Minimize
E { sum_{k=0}^{N-1} ( c u_k + r(x_k + u_k - w_k) ) }
where, for some p > 0 and h > 0,
r(x) = p max(0, -x) + h max(0, x)
DP algorithm:
J_N(x_N) = 0,
J_k(x_k) = min_{u_k ≥ 0} [ c u_k + H(x_k + u_k) + E{ J_{k+1}(x_k + u_k - w_k) } ]
where H(x + u) = E{ r(x + u - w) }.
-
OPTIMAL POLICY
DP algorithm can be written as
J_N(x_N) = 0,
J_k(x_k) = min_{u_k ≥ 0} G_k(x_k + u_k) - c x_k,
where
G_k(y) = c y + H(y) + E{ J_{k+1}(y - w) }.
If G_k is convex and lim_{|x|→∞} G_k(x) → ∞, we have
μ_k*(x_k) = S_k - x_k if x_k < S_k, and 0 if x_k ≥ S_k,
where S_k minimizes G_k(y).
This is shown, assuming that c < p, by showing that J_k is convex for all k, and
lim_{|x|→∞} J_k(x) → ∞
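The base-stock structure of the optimal policy can be observed numerically. A sketch assuming integer stock levels on a truncated grid, a finite demand distribution, and a cap u_max on orders; states that fall off the grid are assigned zero cost-to-go, which is a truncation assumption of this example, not part of the slides.

```python
def inventory_dp(c, p, h, demand, N, x_grid, u_max):
    """Finite-horizon inventory DP on an integer stock grid.

    demand: list of (w, prob) pairs; r(x) = p*max(0, -x) + h*max(0, x).
    The grid and the order cap u_max are truncations of this sketch."""
    def r(x):
        return p * max(0, -x) + h * max(0, x)

    J = {x: 0.0 for x in x_grid}                  # J_N = 0
    policy = []
    for _ in range(N):                            # backwards: k = N-1, ..., 0
        Jn, mu = {}, {}
        for x in x_grid:
            best_u, best = 0, float("inf")
            for u in range(u_max + 1):
                y = x + u                         # stock after ordering
                cost = c * u + sum(pr * (r(y - w) + J.get(y - w, 0.0))
                                   for w, pr in demand)
                if cost < best - 1e-12:           # strict improvement: smallest u wins ties
                    best_u, best = u, cost
            Jn[x], mu[x] = best, best_u
        J, policy = Jn, [mu] + policy
    return J, policy
```

Away from the grid boundaries, the computed policy orders up to a fixed level S_k when the stock is below it, and orders nothing otherwise.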
-
JUSTIFICATION
Graphical inductive proof that J_k is convex.
[Figure: the convex function c y + H(y) attains its minimum at S_{N-1}; the resulting J_{N-1}(x_{N-1}) is linear (slope -c) for x_{N-1} < S_{N-1} and equals H(x_{N-1}) for x_{N-1} ≥ S_{N-1}.]
-
6.231 DYNAMIC PROGRAMMING
LECTURE 6
LECTURE OUTLINE
Stopping problems
Scheduling problems
Other applications
-
PURE STOPPING PROBLEMS
Two possible controls:
Stop (incur a one-time stopping cost, and move to cost-free and absorbing stop state)
Continue [using x_{k+1} = f_k(x_k, w_k) and incurring the cost-per-stage]
Each policy consists of a partition of the set of states x_k into two regions:
Stop region, where we stop
Continue region, where we continue
[Figure: state space divided into a stop region and a continue region, with transitions into the absorbing stop state.]
-
EXAMPLE: ASSET SELLING
A person has an asset, and at k = 0, 1, ..., N - 1 receives a random offer w_k.
May accept w_k and invest the money at fixed rate of interest r, or reject w_k and wait for w_{k+1}. Must accept the last offer w_{N-1}.
DP algorithm (x_k: current offer, T: stop state):
J_N(x_N) = x_N if x_N ≠ T, and 0 if x_N = T,
J_k(x_k) = max[ (1 + r)^{N-k} x_k, E{ J_{k+1}(w_k) } ] if x_k ≠ T, and 0 if x_k = T
Optimal policy:
accept the offer x_k if x_k > α_k,
reject the offer x_k if x_k < α_k,
where
α_k = E{ J_{k+1}(w_k) } / (1 + r)^{N-k}.
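The thresholds α_k can be computed with the normalized values V_k(x) = J_k(x)/(1 + r)^{N-k}, for which α_k = E_w{V_{k+1}(w)}/(1 + r). A sketch assuming a finite offer distribution; the function name is illustrative.

```python
def asset_selling_thresholds(offers, r, N):
    """Compute the acceptance thresholds alpha_k, k = 0, ..., N-1.

    offers: list of (w, prob) pairs for the i.i.d. offer distribution.
    Uses V_N(x) = x and V_k(x) = max(x, E{V_{k+1}(w)}/(1+r))."""
    ev = sum(p * w for w, p in offers)       # E{V_N(w)} = E{w}
    alphas = [None] * N
    for k in range(N - 1, -1, -1):
        alphas[k] = ev / (1 + r)             # alpha_k = E_w{V_{k+1}(w)}/(1+r)
        # E{V_k(w)}, needed at the next (earlier) step
        ev = sum(p * max(w, alphas[k]) for w, p in offers)
    return alphas
```

The computed sequence is nonincreasing in k, in line with the monotonicity result on the next slide: the more offers remain, the choosier one can afford to be.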
-
FURTHER ANALYSIS
[Figure: the thresholds α_1, ..., α_{N-1} plotted against k; ACCEPT above the threshold curve, REJECT below.]
Can show that α_k ≥ α_{k+1} for all k.
Proof: Let V_k(x_k) = J_k(x_k)/(1 + r)^{N-k} for x_k ≠ T. Then the DP algorithm is
V_N(x_N) = x_N and
V_k(x_k) = max[ x_k, (1 + r)^{-1} E_w{ V_{k+1}(w) } ].
We have α_k = E_w{ V_{k+1}(w) }/(1 + r), so it is enough to show that V_k(x) ≥ V_{k+1}(x) for all x and k. Start with V_{N-1}(x) ≥ V_N(x) and use the monotonicity property of DP.
We can also show that α_k → ā as k → -∞. Suggests that for an infinite horizon the optimal policy is stationary.
-
GENERAL STOPPING PROBLEMS
At time k, we may stop at cost t(x_k) or choose a control u_k ∈ U(x_k) and continue:
J_N(x_N) = t(x_N),
J_k(x_k) = min[ t(x_k), min_{u_k ∈ U(x_k)} E{ g(x_k, u_k, w_k) + J_{k+1}(f(x_k, u_k, w_k)) } ]
Optimal to stop at time k for states x in the set
T_k = { x | t(x) ≤ min_{u ∈ U(x)} E{ g(x, u, w) + J_{k+1}(f(x, u, w)) } }
Since J_{N-1}(x) ≤ J_N(x), we have J_k(x) ≤ J_{k+1}(x) for all k, so
T_0 ⊂ · · · ⊂ T_k ⊂ T_{k+1} ⊂ · · · ⊂ T_{N-1}.
Interesting case is when all the T_k are equal (to T_{N-1}, the set where it is better to stop than to go one step and stop). Can be shown to be true if
f(x, u, w) ∈ T_{N-1}, for all x ∈ T_{N-1}, u ∈ U(x), and w.
-
SCHEDULING PROBLEMS
Set of tasks to perform; the ordering is subject to optimal choice.
Costs depend on the order.
There may be stochastic uncertainty, and precedence and resource availability constraints.
Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)
Some special problems admit a simple quasi-analytical solution method:
Optimal policy has an "index form", i.e., each task has an easily calculable index, and it is optimal to select the task that has the maximum value of index (multi-armed bandit problems - to be discussed later)
Some problems can be solved by an interchange argument (start with some schedule, interchange two adjacent tasks, and see what happens)
-
EXAMPLE: THE QUIZ PROBLEM
Given a list of N questions. If question i is answered correctly (given probability p_i), we receive reward R_i; if not, the quiz terminates. Choose order of questions to maximize expected reward.
Let i and j be the kth and (k+1)st questions in an optimally ordered list
L = (i_0, ..., i_{k-1}, i, j, i_{k+2}, ..., i_{N-1})
E{reward of L} = E{ reward of {i_0, ..., i_{k-1}} }
+ p_{i_0} · · · p_{i_{k-1}} ( p_i R_i + p_i p_j R_j )
+ p_{i_0} · · · p_{i_{k-1}} p_i p_j E{ reward of {i_{k+2}, ..., i_{N-1}} }
Consider the list with i and j interchanged:
L' = (i_0, ..., i_{k-1}, j, i, i_{k+2}, ..., i_{N-1})
Since L is optimal, E{reward of L} ≥ E{reward of L'}, so it follows that
p_i R_i + p_i p_j R_j ≥ p_j R_j + p_j p_i R_i, or
p_i R_i / (1 - p_i) ≥ p_j R_j / (1 - p_j).
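The resulting index rule (answer questions in order of decreasing p_i R_i / (1 - p_i)) is easy to implement; function names here are illustrative.

```python
def optimal_quiz_order(questions):
    """Order questions by the index p_i R_i / (1 - p_i), largest first,
    which the interchange argument shows is optimal.

    questions: list of (p, R) pairs with 0 <= p < 1."""
    return sorted(questions, key=lambda q: q[0] * q[1] / (1 - q[0]),
                  reverse=True)

def expected_reward(ordered):
    """E{reward} = sum over positions of (product of earlier p's) * p_i * R_i."""
    total, prob = 0.0, 1.0
    for p, R in ordered:
        total += prob * p * R    # reach this question only if all earlier ones succeed
        prob *= p
    return total
```

Swapping any two adjacent questions in the sorted list can only decrease the expected reward, which is exactly the interchange inequality above.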
-
MINIMAX CONTROL
Consider the basic problem with the difference that the disturbance w_k, instead of being random, is just known to belong to a given set W_k(x_k, u_k).
Find policy π that minimizes the cost
J_π(x_0) = max_{w_k ∈ W_k(x_k, μ_k(x_k)), k=0,1,...,N-1} [ g_N(x_N) + sum_{k=0}^{N-1} g_k(x_k, μ_k(x_k), w_k) ]
The DP algorithm takes the form
J_N(x_N) = g_N(x_N),
J_k(x_k) = min_{u_k ∈ U(x_k)} max_{w_k ∈ W_k(x_k, u_k)} [ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]
(Exercise 1.5 in the text, solution posted on the www).
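The minimax DP recursion differs from the stochastic one only in replacing the expectation over w with a maximum. A sketch for finite state, control, and disturbance sets (illustrative names; W(x, u) is assumed finite so the max is over a finite set).

```python
def minimax_dp(states, U, W, f, g, g_N, N):
    """Minimax DP:
    J_k(x) = min_{u in U(x)} max_{w in W(x,u)} [g_k(x,u,w) + J_{k+1}(f_k(x,u,w))].

    U(x): admissible controls; W(x, u): finite disturbance set (no probabilities)."""
    J = {x: g_N(x) for x in states}          # J_N
    for k in range(N - 1, -1, -1):
        J_new = {}
        for x in states:
            J_new[x] = min(
                max(g(k, x, u, w) + J[f(k, x, u, w)] for w in W(x, u))
                for u in U(x))
        J = J_new
    return J                                 # J_0
```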
-
UNKNOWN-BUT-BOUNDED CONTROL
For each k, keep the state x_k of the controlled system
x_{k+1} = f_k(x_k, μ_k(x_k), w_k)
inside a given set X_k, the target set at time k.
This is a minimax control problem, where the cost at stage k is
g_k(x_k) = 0 if x_k ∈ X_k, and 1 if x_k ∉ X_k.
We must reach at time k the set
X̄_k = { x_k | J_k(x_k) = 0 }
in order to be able to maintain the state within the subsequent target sets.
Start with X̄_N = X_N, and for k = 0, 1, ..., N - 1,
X̄_k = { x_k ∈ X_k | there exists u_k ∈ U_k(x_k) such that f_k(x_k, u_k, w_k) ∈ X̄_{k+1}, for all w_k ∈ W_k(x_k, u_k) }
-
6.231 DYNAMIC PROGRAMMING
LECTURE 7
LECTURE OUTLINE
Deterministic continuous-time optimal control
Examples
Connection with the calculus of variations
The Hamilton-Jacobi-Bellman equation as a continuous-time limit of the DP algorithm
The Hamilton-Jacobi-Bellman equation as a sufficient condition
Examples
-
PROBLEM FORMULATION
We have a continuous-time dynamic system
ẋ(t) = f( x(t), u(t) ), 0 ≤ t ≤ T, x(0): given,
where
x(t) ∈ ℝⁿ is the state vector at time t
u(t) ∈ U ⊂ ℝᵐ is the control vector at time t, U is the control constraint set
T is the terminal time.
Any admissible control trajectory { u(t) | t ∈ [0, T] } (piecewise continuous function { u(t) | t ∈ [0, T] } with u(t) ∈ U for all t ∈ [0, T]) uniquely determines { x(t) | t ∈ [0, T] }.
Find an admissible control trajectory { u(t) | t ∈ [0, T] } and corresponding state trajectory { x(t) | t ∈ [0, T] } that minimize a cost function of the form
h( x(T) ) + ∫_0^T g( x(t), u(t) ) dt
f, h, g are assumed continuously differentiable.
-
EXAMPLE I
Motion control: a unit mass moves on a line under the influence of a force u.
x(t) = ( x_1(t), x_2(t) ): position and velocity of the mass at time t.
Problem: from a given ( x_1(0), x_2(0) ), bring the mass near a given final position-velocity pair (x̄_1, x̄_2) at time T, in the sense:
minimize | x_1(T) - x̄_1 |² + | x_2(T) - x̄_2 |²
subject to the control constraint
|u(t)| ≤ 1, for all t ∈ [0, T].
The problem fits the framework with
ẋ_1(t) = x_2(t), ẋ_2(t) = u(t),
h( x(T) ) = | x_1(T) - x̄_1 |² + | x_2(T) - x̄_2 |²,
g( x(t), u(t) ) = 0, for all t ∈ [0, T].
-
EXAMPLE II
A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 - u(t) to production of a storable good. Thus x(t) evolves according to
ẋ(t) = γ u(t) x(t),
where γ > 0 is a given constant.
The producer wants to maximize the total amount of product stored
∫_0^T ( 1 - u(t) ) x(t) dt
subject to
0 ≤ u(t) ≤ 1, for all t ∈ [0, T].
The initial production rate x(0) is a given positive number.
-
EXAMPLE III (CALCULUS OF VARIATIONS)
[Figure: a curve x(t) with slope ẋ(t) = u(t), from a given point (0, α) to a given vertical line at t = T; Length = ∫_0^T sqrt( 1 + (u(t))² ) dt.]
Find a curve from a given point to a given line that has minimum length.
The problem is
minimize ∫_0^T sqrt( 1 + (ẋ(t))² ) dt
subject to x(0) = α.
Reformulation as an optimal control problem:
minimize ∫_0^T sqrt( 1 + (u(t))² ) dt
subject to ẋ(t) = u(t), x(0) = α.
-
HAMILTON-JACOBI-BELLMAN EQUATION I
We discretize [0, T] at times 0, δ, 2δ, ..., Nδ, where δ = T/N, and we let
x_k = x(kδ), u_k = u(kδ), k = 0, 1, ..., N.
We also discretize the system and cost:
x_{k+1} = x_k + f(x_k, u_k) · δ,   h(x_N) + sum_{k=0}^{N-1} g(x_k, u_k) · δ.
We write the DP algorithm for the discretized problem:
J*(Nδ, x) = h(x),
J*(kδ, x) = min_{u ∈ U} [ g(x, u) · δ + J*( (k+1)δ, x + f(x, u) · δ ) ]
Assume J* is differentiable and Taylor-expand:
J*(kδ, x) = min_{u ∈ U} [ g(x, u) · δ + J*(kδ, x) + ∇_t J*(kδ, x) · δ + ∇_x J*(kδ, x)' f(x, u) · δ + o(δ) ].
-
HAMILTON-JACOBI-BELLMAN EQUATION II
Let J*(t, x) be the optimal cost-to-go of the continuous problem. Assuming the limit is valid,
lim_{k→∞, δ→0, kδ=t} J*(kδ, x) = J*(t, x), for all t, x,
we obtain, for all t, x:
0 = min_{u ∈ U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]
with the boundary condition J*(T, x) = h(x).
This is the Hamilton-Jacobi-Bellman (HJB) equation, a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J*(t, x) (assuming J* is differentiable and the preceding informal limiting procedure is valid).
It is hard to tell a priori if J*(t, x) is differentiable. So we use the HJB Eq. as a verification tool; if we can solve it for a differentiable J*(t, x), then:
J* is the optimal cost-to-go function
The control μ*(t, x) that minimizes in the RHS for each (t, x) defines an optimal control
-
VERIFICATION/SUFFICIENCY THEOREM
Suppose V(t, x) is a solution to the HJB equation; that is, V is continuously differentiable in t and x, and is such that, for all t, x:
0 = min_{u ∈ U} [ g(x, u) + ∇_t V(t, x) + ∇_x V(t, x)' f(x, u) ],
V(T, x) = h(x), for all x.
Suppose also that μ*(t, x) attains the minimum above for all t and x.
Let { x*(t) | t ∈ [0, T] } and u*(t) = μ*( t, x*(t) ), t ∈ [0, T], be the corresponding state and control trajectories.
Then
V(t, x) = J*(t, x), for all t, x,
and
{ u*(t) | t ∈ [0, T] } is optimal.
-
PROOF
Let { ( û(t), x̂(t) ) | t ∈ [0, T] } be any admissible control-state trajectory. We have, for all t ∈ [0, T]:
0 ≤ g( x̂(t), û(t) ) + ∇_t V( t, x̂(t) ) + ∇_x V( t, x̂(t) )' f( x̂(t), û(t) )
Using the system equation x̂'(t) = f( x̂(t), û(t) ), the RHS of the above is equal to
g( x̂(t), û(t) ) + (d/dt) V( t, x̂(t) )
Integrating this expression over t ∈ [0, T],
0 ≤ ∫_0^T g( x̂(t), û(t) ) dt + V( T, x̂(T) ) - V( 0, x̂(0) ).
Using V(T, x) = h(x) and x̂(0) = x(0), we have
V( 0, x(0) ) ≤ h( x̂(T) ) + ∫_0^T g( x̂(t), û(t) ) dt.
If we use u*(t) and x*(t) in place of û(t) and x̂(t), the inequalities become equalities, and
V( 0, x(0) ) = h( x*(T) ) + ∫_0^T g( x*(t), u*(t) ) dt.
-
EXAMPLE OF THE HJB EQUATION
Consider the scalar system ẋ(t) = u(t), with |u(t)| ≤ 1 and cost (1/2)( x(T) )². The HJB equation is
0 = min_{|u| ≤ 1} [ ∇_t V(t, x) + ∇_x V(t, x) · u ], for all t, x,
with the terminal condition V(T, x) = (1/2) x².
Evident candidate for optimality: μ*(t, x) = -sgn(x). Corresponding cost-to-go
J*(t, x) = (1/2) ( max{ 0, |x| - (T - t) } )².
We verify that J* solves the HJB Eq., and that u = -sgn(x) attains the min in the RHS. Indeed,
∇_t J*(t, x) = max{ 0, |x| - (T - t) },
∇_x J*(t, x) = sgn(x) · max{ 0, |x| - (T - t) }.
Substituting, the HJB Eq. becomes
0 = min_{|u| ≤ 1} [ 1 + sgn(x) · u ] · max{ 0, |x| - (T - t) }
-
LINEAR QUADRATIC PROBLEM
Consider the n-dimensional linear system
ẋ(t) = A x(t) + B u(t),
and the quadratic cost
x(T)' Q_T x(T) + ∫_0^T ( x(t)' Q x(t) + u(t)' R u(t) ) dt
The HJB equation is
0 = min_{u ∈ ℝᵐ} [ x'Qx + u'Ru + ∇_t V(t, x) + ∇_x V(t, x)'( Ax + Bu ) ]
with the terminal condition V(T, x) = x' Q_T x. We try a solution of the form
V(t, x) = x' K(t) x, K(t): n × n symmetric,
and show that V(t, x) solves the HJB equation if
K̇(t) = -K(t)A - A'K(t) + K(t)BR⁻¹B'K(t) - Q
with the terminal condition K(T) = Q_T.
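For scalar A, B, Q, R the Riccati differential equation above reads K̇ = -2AK + K²B²/R - Q and can be integrated backward from K(T) = Q_T. A numerical sketch with explicit Euler; the step count and tolerance are arbitrary choices of this example.

```python
def riccati_ode_scalar(A, B, Q, R, QT, T, steps=20000):
    """Integrate the scalar Riccati ODE
    K'(t) = -2*A*K + (K*B)**2 / R - Q,  K(T) = QT,
    backwards from t = T to t = 0 with explicit Euler."""
    dt = T / steps
    K = QT
    for _ in range(steps):
        dK = -2 * A * K + (K * B) ** 2 / R - Q   # scalar form of the Riccati ODE
        K -= dK * dt                             # step from t to t - dt
    return K                                     # approximately K(0)
```

For a long horizon T, K(0) approaches the steady-state solution of 0 = -2AK + K²B²/R - Q.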
-
6.231 DYNAMIC PROGRAMMING
LECTURE 8
LECTURE OUTLINE
Deterministic continuous-time optimal control
From the HJB equation to the Pontryagin Minimum Principle
Examples
-
THE HJB EQUATION
Continuous-time dynamic system
ẋ(t) = f( x(t), u(t) ), 0 ≤ t ≤ T, x(0): given
Cost function
h( x(T) ) + ∫_0^T g( x(t), u(t) ) dt
J*(t, x): optimal cost-to-go from x at time t
HJB equation: for all (t, x),
0 = min_{u ∈ U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]
with the boundary condition J*(T, x) = h(x).
Verification theorem: if we can find a solution, it must be equal to the optimal cost-to-go function. Also, a (closed-loop) policy μ*(t, x) such that μ*(t, x) attains the min for each (t, x) is optimal.
-
HJB EQ. ALONG AN OPTIMAL TRAJECTORY
Observation I: an optimal control-state trajectory pair { ( u*(t), x*(t) ) | t ∈ [0, T] } satisfies, for all t ∈ [0, T],
u*(t) = arg min_{u ∈ U} [ g( x*(t), u ) + ∇_x J*( t, x*(t) )' f( x*(t), u ) ]   (*)
Observation II: to obtain an optimal control trajectory { u*(t) | t ∈ [0, T] } via this equation, we don't need to know ∇_x J*(t, x) for all (t, x) - only the time function
p(t) = ∇_x J*( t, x*(t) ), t ∈ [0, T].
It turns out that calculating p(t) is often easier than calculating J*(t, x) or ∇_x J*(t, x) for all (t, x).
Pontryagin's minimum principle is just Eq. (*) together with an equation for calculating p(t), called the adjoint equation.
Also, Pontryagin's minimum principle is valid much more generally, even in cases where J*(t, x) is not differentiable and the HJB has no solution.
-
DERIVING THE ADJOINT EQUATION
The HJB equation holds as an identity for all (t, x), so it can be differentiated [the gradient of the RHS with respect to (t, x) is identically 0].
We need a tool for differentiation of minimum functions.
Lemma: Let F(t, x, u) be a continuously differentiable function of t ∈ ℝ, x ∈ ℝⁿ, and u ∈ ℝᵐ, and let U be a convex subset of ℝᵐ. Assume that μ*(t, x) is a continuously differentiable function such that
μ*(t, x) = arg min_{u ∈ U} F(t, x, u), for all t, x.
Then
∇_t { min_{u ∈ U} F(t, x, u) } = ∇_t F( t, x, μ*(t, x) ), for all t, x,
∇_x { min_{u ∈ U} F(t, x, u) } = ∇_x F( t, x, μ*(t, x) ), for all t, x.
-
DIFFERENTIATING THE HJB EQUATION I
We set to zero the gradient with respect to x and t of the function
g( x, μ*(t, x) ) + ∇_t J*(t, x) + ∇_x J*(t, x)' f( x, μ*(t, x) )
and we rely on the Lemma to disregard the terms involving the derivatives of μ*(t, x) with respect to t and x.
We obtain, for all (t, x):
0 = ∇_x g( x, μ*(t, x) ) + ∇²_{xt} J*(t, x) + ∇²_{xx} J*(t, x) f( x, μ*(t, x) ) + ∇_x f( x, μ*(t, x) ) ∇_x J*(t, x),
0 = ∇²_{tt} J*(t, x) + ∇²_{xt} J*(t, x)' f( x, μ*(t, x) ),
where ∇_x f( x, μ*(t, x) ) is the n × n matrix whose (i, j)th entry is ∂f_j/∂x_i:
∇_x f = [ ∂f_1/∂x_1 · · · ∂f_n/∂x_1 ; ... ; ∂f_1/∂x_n · · · ∂f_n/∂x_n ]
-
DIFFERENTIATING THE HJB EQUATION II
The preceding equations hold for all (t, x). We specialize them along an optimal state and control trajectory { ( x*(t), u*(t) ) | t ∈ [0, T] }, where u*(t) = μ*( t, x*(t) ) for all t ∈ [0, T].
We have ẋ*(t) = f( x*(t), u*(t) ), so the terms
∇²_{xt} J*( t, x*(t) ) + ∇²_{xx} J*( t, x*(t) ) f( x*(t), u*(t) ),
∇²_{tt} J*( t, x*(t) ) + ∇²_{xt} J*( t, x*(t) )' f( x*(t), u*(t) )
are equal to the total derivatives
(d/dt) ∇_x J*( t, x*(t) ),  (d/dt) ∇_t J*( t, x*(t) ),
and we have
0 = ∇_x g( x*(t), u*(t) ) + (d/dt) ∇_x J*( t, x*(t) ) + ∇_x f( x*(t), u*(t) ) ∇_x J*( t, x*(t) ),
0 = (d/dt) ∇_t J*( t, x*(t) ).
-
CONCLUSION FROM DIFFERENTIATING THE HJB
Define
p(t) = ∇_x J*( t, x*(t) ) and p₀(t) = ∇_t J*( t, x*(t) )
We have the adjoint equation
ṗ(t) = -∇_x f( x*(t), u*(t) ) p(t) - ∇_x g( x*(t), u*(t) )
and
ṗ₀(t) = 0,
or equivalently,
p₀(t) = constant, for all t ∈ [0, T].
Note also that, by definition, J*( T, x*(T) ) = h( x*(T) ), so we have the following boundary condition at the terminal time:
p(T) = ∇h( x*(T) )
-
NOTATIONAL SIMPLIFICATION
Define the Hamiltonian function
H(x, u, p) = g(x, u) + p' f(x, u)
The adjoint equation becomes
ṗ(t) = -∇_x H( x*(t), u*(t), p(t) )
The HJB equation becomes
0 = min_{u ∈ U} H( x*(t), u, p(t) ) + p₀(t) = H( x*(t), u*(t), p(t) ) + p₀(t),
so since p₀(t) = constant, there is a constant C such that
H( x*(t), u*(t), p(t) ) = C, for all t ∈ [0, T].
-
PONTRYAGIN MINIMUM PRINCIPLE
The preceding (highly informal) derivation is summarized as follows:
Minimum Principle: Let { u*(t) | t ∈ [0, T] } be an optimal control trajectory and let { x*(t) | t ∈ [0, T] } be the corresponding state trajectory. Let also p(t) be the solution of the adjoint equation
ṗ(t) = -∇_x H( x*(t), u*(t), p(t) ),
with the boundary condition
p(T) = ∇h( x*(T) ).
Then, for all t ∈ [0, T],
u*(t) = arg min_{u ∈ U} H( x*(t), u, p(t) ).
Furthermore, there is a constant C such that
H( x*(t), u*(t), p(t) ) = C, for all t ∈ [0, T].
-
2-POINT BOUNDARY PROBLEM VIEW
The minimum principle is a necessary condition for optimality and can be used to identify candidates for optimality.
We need to solve for x*(t) and p(t) the differential equations
ẋ*(t) = f( x*(t), u*(t) ),
ṗ(t) = -∇_x H( x*(t), u*(t), p(t) ),
with split boundary conditions:
x*(0): given, p(T) = ∇h( x*(T) ).
The control trajectory is implicitly determined from x*(t) and p(t) via the equation
u*(t) = arg min_{u ∈ U} H( x*(t), u, p(t) ).
This 2-point boundary value problem can be addressed with a variety of numerical methods.
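One such numerical method is simple shooting: guess the unknown initial costate p(0), integrate the state and adjoint equations forward, and adjust the guess until the terminal condition p(T) = ∇h( x*(T) ) holds. A sketch on a hypothetical scalar example (ẋ = u, cost ∫ u²/2 dt + (1/2)( x(T) - 1 )², so u* = -p and ṗ = 0); everything here is illustrative, not from the slides.

```python
def shoot(p0, T, steps=1000):
    """Integrate x' = u* = -p, p' = 0 forward from x(0) = 0 with guess p(0) = p0;
    return the boundary mismatch p(T) - dh/dx(x(T)) = p0 - (x(T) - 1)."""
    dt = T / steps
    x, p = 0.0, p0
    for _ in range(steps):
        x += -p * dt            # x' = f(x, u*) with u* = argmin_u H = -p
        # p' = -dH/dx = 0, so p stays constant
    return p - (x - 1.0)

def shooting_method(T, lo=-10.0, hi=10.0):
    """Bisection on the unknown initial costate p(0) until the terminal
    condition holds (assumes the mismatch changes sign on [lo, hi])."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if shoot(lo, T) * shoot(mid, T) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For this example the mismatch is linear in p(0), so bisection converges to the exact root p(0) = -1/(1 + T).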
-
ANALYTICAL EXAMPLE I
minimize ∫_0^T sqrt( 1 + (u(t))² ) dt
subject to
ẋ(t) = u(t), x(0) = α.
Hamiltonian is
H(x, u, p) = sqrt( 1 + u² ) + p u,
and adjoint equation is ṗ(t) = 0 with p(T) = 0.
Hence, p(t) = 0 for all t ∈ [0, T], so minimization of the Hamiltonian gives
u*(t) = arg min_u sqrt( 1 + u² ) = 0, for all t ∈ [0, T].
Therefore, ẋ*(t) = 0 for all t, implying that x*(t) is constant. Using the initial condition x*(0) = α, it follows that x*(t) = α for all t.
-
ANALYTICAL EXAMPLE II
Optimal production problem:
maximize ∫_0^T ( 1 - u(t) ) x(t) dt
subject to 0 ≤ u(t) ≤ 1 for all t, and
ẋ(t) = γ u(t) x(t), x(0) > 0: given.
Hamiltonian: H(x, u, p) = (1 - u) x + p γ u x.
Adjoint equation is
ṗ(t) = -γ u*(t) p(t) - 1 + u*(t), p(T) = 0.
Maximization of the Hamiltonian over u ∈ [0, 1]:
u*(t) = 0 if p(t) < 1/γ, and 1 if p(t) ≥ 1/γ.
Since p(T) = 0, for t close to T we have p(t) < 1/γ and u*(t) = 0. Therefore, for t near T the adjoint equation has the form ṗ(t) = -1.
-
ANALYTICAL EXAMPLE II (CONTINUED)
[Figure: p(t) grows linearly going backward from p(T) = 0 (slope -1) and reaches the value 1/γ at t = T - 1/γ.]
For t = T - 1/γ, p(t) is equal to 1/γ, so u*(t) changes to u*(t) = 1.
Geometrical construction:
[Figure: the costate p(t) together with the resulting control, u*(t) = 1 for t < T - 1/γ and u*(t) = 0 for t > T - 1/γ.]
-
6.231 DYNAMIC PROGRAMMING
LECTURE 9
LECTURE OUTLINE
Deterministic continuous-time optimal control
Variants of the Pontryagin Minimum Principle
Fixed terminal state
Free terminal time
Examples
Discrete-Time Minimum Principle
-
REVIEW
Continuous-time dynamic system
ẋ(t) = f( x(t), u(t) ), 0 ≤ t ≤ T, x(0): given
Cost function
h( x(T) ) + ∫_0^T g( x(t), u(t) ) dt
J*(t, x): optimal cost-to-go from x at time t
HJB equation/verification theorem: for all (t, x),
0 = min_{u ∈ U} [ g(x, u) + ∇_t J*(t, x) + ∇_x J*(t, x)' f(x, u) ]
with the boundary condition J*(T, x) = h(x).
Adjoint equation/vector: to compute an optimal state-control trajectory { ( u*(t), x*(t) ) } it is enough to know
p(t) = ∇_x J*( t, x*(t) ), t ∈ [0, T].
Pontryagin theorem gives an equation for p(t).
-
NEC. CONDITION: PONTRYAGIN MIN. PRINCIPLE
Define the Hamiltonian function
H(x, u, p) = g(x, u) + p' f(x, u).
Minimum Principle: Let { u*(t) | t ∈ [0, T] } be an optimal control trajectory and let { x*(t) | t ∈ [0, T] } be the corresponding state trajectory. Let also p(t) be the solution of the adjoint equation
ṗ(t) = -∇_x H( x*(t), u*(t), p(t) ),
with the boundary condition
p(T) = ∇h( x*(T) ).
Then, for all t ∈ [0, T],
u*(t) = arg min_{u ∈ U} H( x*(t), u, p(t) ).
Furthermore, there is a constant C such that
H( x*(t), u*(t), p(t) ) = C, for all t ∈ [0, T].
-
VARIATIONS: FIXED TERMINAL STATE
Suppose that in addition to the initial state x(0), the final state x(T) is given.
Then the informal derivation of the adjoint equation still holds, but the terminal condition J*(T, x) = h(x) of the HJB equation is not true anymore.
In effect,
J*(T, x) = 0 if x = x*(T), ∞ otherwise.
So J*(T, x) cannot be differentiated with respect to x, and the terminal boundary condition p(T) = ∇h(x*(T)) for the adjoint equation does not hold.
As compensation, we have the extra condition
x(T): given,
thus maintaining the balance between boundary conditions and unknowns.
Generalization: Some components of the terminal state are fixed.
EXAMPLE WITH FIXED TERMINAL STATE
Consider finding the curve of minimum length connecting two points (0, α) and (T, β). We have
ẋ(t) = u(t), x(0) = α, x(T) = β,
and the cost is ∫₀ᵀ √(1 + u(t)²) dt.
[Figure: the optimal trajectory x*(t), a straight line from (0, α) to (T, β)]
The adjoint equation is ṗ(t) = 0, implying that
p(t) = constant, for all t ∈ [0, T].
Minimizing the Hamiltonian √(1 + u²) + p(t)u:
u*(t) = constant, for all t ∈ [0, T].
So the optimal {x*(t) | t ∈ [0, T]} is a straight line.
VARIATIONS: FREE TERMINAL TIME
Initial state and/or the terminal state are given, but the terminal time T is subject to optimization.
Let {(x*(t), u*(t)) | t ∈ [0, T]} be an optimal state-control trajectory pair and let T* be the optimal terminal time. Then x*(t), u*(t) would still be optimal if T were fixed at T*, so
u*(t) = arg min_{u∈U} H(x*(t), u, p(t)), for all t ∈ [0, T*],
where p(t) is given by the adjoint equation.
In addition: H(x*(t), u*(t), p(t)) = 0 for all t [instead of H(x*(t), u*(t), p(t)) ≡ constant].
Justification: We have
∇t J*(t, x*(t)) |_{t=0} = 0.
Along the optimal trajectory, the HJB equation gives
∇t J*(t, x*(t)) = −H(x*(t), u*(t), p(t)), for all t,
so H(x*(0), u*(0), p(0)) = 0. Since H is constant along the optimal trajectory, H = 0 for all t.
MINIMUM-TIME EXAMPLE I
Unit mass moves horizontally: ÿ(t) = u(t), where y(t): position, u(t): force, u(t) ∈ [−1, 1].
Given the initial position-velocity (y(0), ẏ(0)), bring the object to (y(T), ẏ(T)) = (0, 0) so that the time of transfer is minimum. Thus, we want to
minimize T = ∫₀ᵀ 1 dt.
Let the state variables be
x1(t) = y(t), x2(t) = ẏ(t),
so the system equation is
ẋ1(t) = x2(t), ẋ2(t) = u(t).
Initial state (x1(0), x2(0)): given, and x1(T) = 0, x2(T) = 0.
MINIMUM-TIME EXAMPLE II
If {u*(t) | t ∈ [0, T]} is optimal, u*(t) must minimize the Hamiltonian for each t, i.e.,
u*(t) = arg min_{−1≤u≤1} [ 1 + p1(t)x2(t) + p2(t)u ].
Therefore
u*(t) = 1 if p2(t) < 0, and u*(t) = −1 if p2(t) ≥ 0.
MINIMUM-TIME EXAMPLE IV
For intervals where u*(t) ≡ 1, the system moves along the curves on which
x1(t) − (1/2) x2(t)²: constant.
For intervals where u*(t) ≡ −1, the system moves along the curves on which
x1(t) + (1/2) x2(t)²: constant.
[Figure: (a) trajectories in the (x1, x2) plane for u(t) ≡ 1; (b) trajectories for u(t) ≡ −1]
MINIMUM-TIME EXAMPLE V
To bring the system from the initial state x(0) to the origin with at most one switch, we use the following switching curve.
[Figure: the switching curve in the (x1, x2) plane, with the regions where u*(t) ≡ 1 and u*(t) ≡ −1, and an initial state (x1(0), x2(0))]
(a) If the initial state lies above the switching curve, use u*(t) ≡ −1 until the state hits the switching curve; then use u*(t) ≡ 1.
(b) If the initial state lies below the switching curve, use u*(t) ≡ 1 until the state hits the switching curve; then use u*(t) ≡ −1.
(c) If the initial state lies on the top (bottom) part of the switching curve, use u*(t) ≡ −1 [u*(t) ≡ 1, respectively].
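A quick numerical sanity check of the switching-curve law (a sketch; Euler integration and the hypothetical start (1, 0) are not from the slides). The switching function s(x) = x1 + (1/2)·x2·|x2| is positive above the curve and zero on it:

```python
# Simulate the minimum-time bang-bang law for the double integrator.
def bang_bang(x1, x2):
    s = x1 + 0.5 * x2 * abs(x2)   # switching function
    if s > 0:
        return -1.0               # above the curve: decelerate
    if s < 0:
        return 1.0                # below the curve: accelerate
    return -1.0 if x2 > 0 else 1.0  # on the switching curve: ride it to the origin

x1, x2, dt, t = 1.0, 0.0, 1e-3, 0.0
while t < 4.0 and abs(x1) + abs(x2) > 0.02:
    u = bang_bang(x1, x2)
    x1, x2 = x1 + dt * x2, x2 + dt * u   # Euler step of x1' = x2, x2' = u
    t += dt
print(round(t, 2))  # roughly 2.0, the optimal transfer time from (1, 0)
```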
DISCRETE-TIME MINIMUM PRINCIPLE
Minimize J(u) = gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk), subject to uk ∈ Uk ⊂ ℜ^m, with Uk: convex, and
xk+1 = fk(xk, uk), k = 0, . . . , N − 1, x0: given.
Introduce the Hamiltonian function
Hk(xk, uk, pk+1) = gk(xk, uk) + p'k+1 fk(xk, uk)
Suppose {(u*k, x*k+1) | k = 0, . . . , N − 1} are optimal. Then for all k,
∇uk Hk(x*k, u*k, pk+1)' (uk − u*k) ≥ 0, for all uk ∈ Uk,
where p1, . . . , pN are obtained from the adjoint equation
pk = ∇xk fk · pk+1 + ∇xk gk,
with the terminal condition pN = ∇gN(x*N).
If, in addition, the Hamiltonian Hk is a convex function of uk for any fixed xk and pk+1, we have
u*k = arg min_{uk∈Uk} Hk(x*k, uk, pk+1), for all k.
DERIVATION
We develop an expression for the gradient ∇J(u). Using the chain rule,
∇uk J(u) = ∇uk fk · ∇xk+1 fk+1 · · · ∇xN−1 fN−1 · ∇gN
+ ∇uk fk · ∇xk+1 fk+1 · · · ∇xN−2 fN−2 · ∇xN−1 gN−1
+ · · ·
+ ∇uk fk · ∇xk+1 gk+1
+ ∇uk gk,
where all gradients are evaluated along u and the corresponding state trajectory.
Introduce the discrete-time adjoint equation
pk = ∇xk fk · pk+1 + ∇xk gk, k = 1, . . . , N − 1,
with terminal condition pN = ∇gN.
Verify that, for all k,
∇uk J(u0, . . . , uN−1) = ∇uk Hk(xk, uk, pk+1)
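The identity ∇uk J = ∇uk Hk can be verified numerically on a tiny instance; the scalar linear system and quadratic costs below are hypothetical, chosen only so the gradients are easy to code:

```python
# Verify the gradient formula of the derivation on f_k(x,u) = a*x + b*u,
# g_k(x,u) = x**2 + u**2, g_N(x) = x**2 (hypothetical instance).
a, b, N = 0.9, 0.5, 4
u = [0.3, -0.2, 0.1, 0.4]
x0 = 1.0

def trajectory(u):
    xs = [x0]
    for k in range(N):
        xs.append(a * xs[k] + b * u[k])
    return xs

def cost(u):
    xs = trajectory(u)
    return xs[N] ** 2 + sum(xs[k] ** 2 + u[k] ** 2 for k in range(N))

# Adjoint recursion: p_k = a*p_{k+1} + 2*x_k, with p_N = 2*x_N.
xs = trajectory(u)
p = [0.0] * (N + 1)
p[N] = 2 * xs[N]
for k in range(N - 1, 0, -1):
    p[k] = a * p[k + 1] + 2 * xs[k]

# Gradient via the Hamiltonian: grad_{u_k} H_k = 2*u_k + b*p_{k+1}.
grad_H = [2 * u[k] + b * p[k + 1] for k in range(N)]

# Compare against central finite differences of J.
eps = 1e-6
grad_fd = []
for k in range(N):
    up, um = u[:], u[:]
    up[k] += eps
    um[k] -= eps
    grad_fd.append((cost(up) - cost(um)) / (2 * eps))

print(max(abs(g1 - g2) for g1, g2 in zip(grad_H, grad_fd)))  # tiny (round-off level)
```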
6.231 DYNAMIC PROGRAMMING
LECTURE 10
LECTURE OUTLINE
Problems with imperfect state info
Reduction to the perfect state info case
Machine repair example
BASIC PROBLEM WITH IMPERFECT STATE INFO
Same as the basic problem of Chapter 1 with one difference: the controller, instead of knowing xk, receives at each time k an observation of the form
z0 = h0(x0, v0), zk = hk(xk, uk−1, vk), k ≥ 1
The observation zk belongs to some space Zk. The random observation disturbance vk is characterized by a probability distribution
Pvk(· | xk, . . . , x0, uk−1, . . . , u0, wk−1, . . . , w0, vk−1, . . . , v0)
The initial state x0 is also random and characterized by a probability distribution Px0.
The probability distribution Pwk(· | xk, uk) of wk is given, and it may depend explicitly on xk and uk but not on w0, . . . , wk−1, v0, . . . , vk−1.
The control uk is constrained to a given subset Uk (this subset does not depend on xk, which is not assumed known).
INFORMATION VECTOR AND POLICIES
Denote by Ik the information vector, i.e., the information available at time k:
Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1), k ≥ 1, I0 = z0.
We consider policies π = {μ0, μ1, . . . , μN−1}, where each function μk maps the information vector Ik into a control uk and
μk(Ik) ∈ Uk, for all Ik, k ≥ 0.
We want to find a policy π that minimizes
Jπ = E_{x0, wk, vk, k=0,...,N−1} { gN(xN) + Σ_{k=0}^{N−1} gk(xk, μk(Ik), wk) }
subject to the equations
xk+1 = fk(xk, μk(Ik), wk), k ≥ 0,
z0 = h0(x0, v0), zk = hk(xk, μk−1(Ik−1), vk), k ≥ 1
EXAMPLE: MULTIACCESS COMMUNICATION I
Collection of transmitting stations sharing a common channel, synchronized to transmit packets of data at integer times.
xk: backlog at the beginning of slot k.
ak: random number of packet arrivals in slot k.
tk: the number of packets transmitted in slot k.
xk+1 = xk + ak − tk.
At the kth slot, each of the xk packets in the system is transmitted with probability uk (common for all packets). If two or more packets are transmitted simultaneously, they collide.
So tk = 1 (a success) with probability xk uk (1 − uk)^(xk−1), and tk = 0 (idle or collision) otherwise.
Imperfect state info: The stations can observe the channel and determine whether in any one slot there was a collision (two or more packets), a success (one packet), or an idle (no packets).
EXAMPLE: MULTIACCESS COMMUNICATION II
Information vector at time k: the entire history (up to k) of successes, idles, and collisions (as well as u0, u1, . . . , uk−1). Mathematically, zk+1, the observation at the end of the kth slot, is
zk+1 = vk+1,
where vk+1 yields an idle with probability (1 − uk)^xk, a success with probability xk uk (1 − uk)^(xk−1), and a collision otherwise.
If we had perfect state information, the DP algorithm would be
Jk(xk) = gk(xk) + min_{0<uk≤1} E_{ak} { p(xk, uk) Jk+1(xk + ak − 1) + (1 − p(xk, uk)) Jk+1(xk + ak) },
where p(xk, uk) is the success probability xk uk (1 − uk)^(xk−1).
The optimal (perfect state information) policy would be to select the value of uk that maximizes p(xk, uk), so
μk(xk) = 1/xk, for all xk ≥ 1.
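The claim that uk = 1/xk maximizes the success probability can be confirmed with a small grid search (a sketch, for the hypothetical backlog x = 5):

```python
# Grid search over u for the success probability p(x, u) = x*u*(1-u)**(x-1).
def p_success(x, u):
    return x * u * (1.0 - u) ** (x - 1)

x = 5
us = [i / 1000.0 for i in range(1, 1001)]
best_u = max(us, key=lambda u: p_success(x, u))
print(best_u)  # close to 1/x = 0.2
```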
REFORMULATION AS A PERFECT INFO PROBLEM
We have
Ik+1 = (Ik, zk+1, uk), k = 0, 1, . . . , N − 2, I0 = z0.
View this as a dynamic system with state Ik, control uk, and random disturbance zk+1.
We have
P(zk+1 | Ik, uk) = P(zk+1 | Ik, uk, z0, z1, . . . , zk),
since z0, z1, . . . , zk are part of the information vector Ik. Thus the probability distribution of zk+1 depends explicitly only on the state Ik and control uk, and not on the prior disturbances zk, . . . , z0.
Write
E{ gk(xk, uk, wk) } = E{ E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk } },
so the cost per stage of the new system is
g̃k(Ik, uk) = E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk }
DP ALGORITHM
Writing the DP algorithm for the (reformulated) perfect state info problem and doing the algebra:
Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }
for k = 0, 1, . . . , N − 2, and for k = N − 1,
JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1} { gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 }
The optimal cost J* is given by
J* = E_{z0}{ J0(z0) }.
MACHINE REPAIR EXAMPLE I
A machine can be in one of two states, denoted P (good state) and P̄ (bad state).
At the end of each period the machine is inspected.
Two possible inspection outcomes: G (probably good state) and B (probably bad state).
Transition probabilities: from P the machine stays in P with probability 2/3 and moves to P̄ with probability 1/3; from P̄ it stays in P̄ with probability 1.
Inspection probabilities: outcome G with probability 3/4 in state P and 1/4 in state P̄; outcome B with probability 1/4 in state P and 3/4 in state P̄.
[Figure: state transition and inspection outcome diagrams]
Possible actions after each inspection:
C: Continue operation of the machine.
S: Stop the machine, determine its state, and if in P̄ bring it back to the good state P.
Cost per stage:
g(P, C) = 0, g(P̄, C) = 2, g(P, S) = 1, g(P̄, S) = 1
MACHINE REPAIR EXAMPLE II
The information vector at times 0 and 1 is
I0 = z0, I1 = (z0, z1, u0),
and we seek functions μ0(I0), μ1(I1) that minimize
E_{x0, w0, w1, v0, v1} { g(x0, μ0(z0)) + g(x1, μ1(z0, z1, μ0(z0))) }.
DP algorithm: Start with J2(I2) = 0. For k = 0, 1, take the min over the two actions, C and S:
Jk(Ik) = min[ P(xk = P | Ik) g(P, C) + P(xk = P̄ | Ik) g(P̄, C) + E_{zk+1}{ Jk+1(Ik, C, zk+1) | Ik, C },
P(xk = P | Ik) g(P, S) + P(xk = P̄ | Ik) g(P̄, S) + E_{zk+1}{ Jk+1(Ik, S, zk+1) | Ik, S } ]
MACHINE REPAIR EXAMPLE IV
(2) For I1 = (B, G, S):
P(x1 = P̄ | B, G, S) = P(x1 = P̄ | G, G, S) = 1/7,
J1(B, G, S) = 2/7, μ1*(B, G, S) = C.
(3) For I1 = (G, B, S):
P(x1 = P̄ | G, B, S) = ((1/3)(3/4)) / ((2/3)(1/4) + (1/3)(3/4)) = 3/5,
J1(G, B, S) = 1, μ1*(G, B, S) = S.
Similarly, for all possible I1, we compute J1(I1) and μ1*(I1), which is to continue (u1 = C) if the last inspection was G, and to stop otherwise.
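The two posteriors can be reproduced by direct Bayes' rule arithmetic (a sketch using exact fractions):

```python
# Bayes computation of the posteriors used above. After action S the machine
# starts the period in P; it stays in P w.p. 2/3 and moves to Pbar w.p. 1/3.
# Inspection: B is observed w.p. 1/4 in P and 3/4 in Pbar.
from fractions import Fraction as F

prior = {"P": F(2, 3), "Pbar": F(1, 3)}          # state after one period from P
p_obs = {("P", "G"): F(3, 4), ("P", "B"): F(1, 4),
         ("Pbar", "G"): F(1, 4), ("Pbar", "B"): F(3, 4)}

def posterior_bad(z):
    # P(x1 = Pbar | z1 = z, u0 = S) via Bayes' rule
    num = prior["Pbar"] * p_obs[("Pbar", z)]
    den = num + prior["P"] * p_obs[("P", z)]
    return num / den

print(posterior_bad("G"))  # 1/7
print(posterior_bad("B"))  # 3/5
```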
6.231 DYNAMIC PROGRAMMING
LECTURE 11
LECTURE OUTLINE
Review of DP for imperfect state info
Linear quadratic problems
Separation of estimation and control
REVIEW: PROBLEM WITH IMPERFECT STATE INFO
Instead of knowing xk, we receive observations
z0 = h0(x0, v0), zk = hk(xk, uk−1, vk), k ≥ 1
Ik: information vector available at time k:
I0 = z0, Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1), k ≥ 1
Optimization over policies π = {μ0, μ1, . . . , μN−1}, where μk(Ik) ∈ Uk, for all Ik and k.
Find a policy π that minimizes
Jπ = E_{x0, wk, vk, k=0,...,N−1} { gN(xN) + Σ_{k=0}^{N−1} gk(xk, μk(Ik), wk) }
subject to the equations
xk+1 = fk(xk, μk(Ik), wk), k ≥ 0,
z0 = h0(x0, v0), zk = hk(xk, μk−1(Ik−1), vk), k ≥ 1
DP ALGORITHM
Reformulate to the perfect state info problem, and write the DP algorithm:
Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }
for k = 0, 1, . . . , N − 2, and for k = N − 1,
JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1} { gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 }
The optimal cost J* is given by
J* = E_{z0}{ J0(z0) }.
LINEAR-QUADRATIC PROBLEMS
System: xk+1 = Ak xk + Bk uk + wk
Quadratic cost:
E_{wk, k=0,1,...,N−1} { x'N QN xN + Σ_{k=0}^{N−1} (x'k Qk xk + u'k Rk uk) }
where Qk ≥ 0 and Rk > 0.
Observations:
zk = Ck xk + vk, k = 0, 1, . . . , N − 1.
w0, . . . , wN−1, v0, . . . , vN−1: independent, zero mean.
Key fact to show: the optimal policy {μ0*, . . . , μN−1*} is of the form
μk*(Ik) = Lk E{xk | Ik}
Lk: same as for the perfect state info case
Estimation problem and control problem can be solved separately
DP ALGORITHM I
Last stage N − 1 (suppressing the index N − 1):
JN−1(IN−1) = min_{uN−1} E_{xN−1, wN−1} { x'N−1 Q xN−1 + u'N−1 R uN−1 + (AxN−1 + BuN−1 + wN−1)' Q (AxN−1 + BuN−1 + wN−1) | IN−1, uN−1 }
Since E{wN−1 | IN−1} = E{wN−1} = 0, the minimization involves
min_{uN−1} [ u'N−1 (B'QB + R) uN−1 + 2 E{xN−1 | IN−1}' A'QB uN−1 ]
The minimization yields the optimal μ*N−1:
u*N−1 = μ*N−1(IN−1) = LN−1 E{xN−1 | IN−1},
where
LN−1 = −(B'QB + R)^(−1) B'QA
DP ALGORITHM II
Substituting in the DP algorithm:
JN−1(IN−1) = E_{xN−1}{ x'N−1 KN−1 xN−1 | IN−1 }
+ E_{xN−1}{ (xN−1 − E{xN−1 | IN−1})' PN−1 (xN−1 − E{xN−1 | IN−1}) | IN−1 }
+ E_{wN−1}{ w'N−1 QN wN−1 },
where the matrices KN−1 and PN−1 are given by
PN−1 = A'N−1 QN BN−1 (RN−1 + B'N−1 QN BN−1)^(−1) B'N−1 QN AN−1,
KN−1 = A'N−1 QN AN−1 − PN−1 + QN−1.
Note the structure of JN−1: in addition to the quadratic and constant terms, it involves a quadratic in the estimation error
xN−1 − E{xN−1 | IN−1}
DP ALGORITHM III
DP equation for period N − 2:
JN−2(IN−2) = min_{uN−2} E_{xN−2, wN−2, zN−1} { x'N−2 Q xN−2 + u'N−2 R uN−2 + JN−1(IN−1) | IN−2, uN−2 }
= E{ x'N−2 Q xN−2 | IN−2 }
+ min_{uN−2} [ u'N−2 R uN−2 + E{ x'N−1 KN−1 xN−1 | IN−2, uN−2 } ]
+ E{ (xN−1 − E{xN−1 | IN−1})' PN−1 (xN−1 − E{xN−1 | IN−1}) | IN−2, uN−2 }
+ E_{wN−1}{ w'N−1 QN wN−1 }.
Key point: We have excluded the next to last term from the minimization with respect to uN−2.
This term turns out to be independent of uN−2.
QUALITY OF ESTIMATION LEMMA
For every k, there is a function Mk such that we have
xk − E{xk | Ik} = Mk(x0, w0, . . . , wk−1, v0, . . . , vk),
independently of the policy being used.
The following simplified version of the lemma conveys the main idea.
Simplified Lemma: Let r, u, z be random variables such that r and u are independent, and let x = r + u. Then
x − E{x | z, u} = r − E{r | z}.
Proof: We have
x − E{x | z, u} = r + u − E{r + u | z, u}
= r + u − E{r | z, u} − u
= r − E{r | z, u}
= r − E{r | z}.
APPLYING THE QUALITY OF ESTIMATION LEMMA
Using the lemma,
xN−1 − E{xN−1 | IN−1} = ξN−1,
where
ξN−1: function of x0, w0, . . . , wN−2, v0, . . . , vN−1.
Since ξN−1 is independent of uN−2, the conditional expectation of ξ'N−1 PN−1 ξN−1 satisfies
E{ξ'N−1 PN−1 ξN−1 | IN−2, uN−2} = E{ξ'N−1 PN−1 ξN−1 | IN−2}
and is independent of uN−2.
So the minimization in the DP algorithm yields
u*N−2 = μ*N−2(IN−2) = LN−2 E{xN−2 | IN−2}
FINAL RESULT
Continuing similarly (using also the quality of estimation lemma):
μk*(Ik) = Lk E{xk | Ik},
where Lk is the same as for perfect state info:
Lk = −(Rk + B'k Kk+1 Bk)^(−1) B'k Kk+1 Ak,
with Kk generated from KN = QN, using
Kk = A'k Kk+1 Ak − Pk + Qk,
Pk = A'k Kk+1 Bk (Rk + B'k Kk+1 Bk)^(−1) B'k Kk+1 Ak
[Figure: block diagram of the optimal controller: the measurement zk = Ck xk + vk feeds an estimator that produces E{xk | Ik}, which is multiplied by the gain Lk to generate uk for the system xk+1 = Ak xk + Bk uk + wk]
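The coupled recursions for Kk and Lk are easy to iterate; the following sketch uses a scalar stationary system with hypothetical values A = B = Q = R = 1, for which the steady-state K happens to be the golden ratio:

```python
# Backward Riccati recursion K_k = A'K A - P + Q with
# P = A'K B (R + B'K B)^{-1} B'K A, scalar case A = B = Q = R = 1.
A, B, Q, R = 1.0, 1.0, 1.0, 1.0
K = Q                       # K_N = Q_N
for _ in range(100):
    P = A * K * B / (R + B * K * B) * B * K * A
    K = A * K * A - P + Q
L = -(B * K * A) / (R + B * K * B)   # steady-state gain

print(round(K, 6))  # converges to the golden ratio (1 + sqrt(5))/2 ≈ 1.618034
print(round(L, 6))  # ≈ -0.618034
```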
SEPARATION INTERPRETATION
The optimal controller can be decomposed into
(a) An estimator, which uses the data to generate the conditional expectation E{xk | Ik}.
(b) An actuator, which multiplies E{xk | Ik} by the gain matrix Lk and applies the control input uk = Lk E{xk | Ik}.
Generically, the estimate x̂ of a random vector x given some information (random vector) I, which minimizes the mean squared error
E_x{ ‖x − x̂‖² | I } = E{ ‖x‖² | I } − 2 E{x | I}' x̂ + ‖x̂‖²,
is x̂ = E{x | I} (set to zero the derivative with respect to x̂ of the above quadratic form).
The estimator portion of the optimal controller is optimal for the problem of estimating the state xk assuming the control is not subject to choice.
The actuator portion is optimal for the control problem assuming perfect state information.
STEADY STATE/IMPLEMENTATION ASPECTS
As N → ∞, the solution of the Riccati equation converges to a steady state and Lk → L.
If x0, wk, and vk are Gaussian, E{xk | Ik} is a linear function of Ik and is generated by a nice recursive algorithm, the Kalman filter.
The Kalman filter involves also a Riccati equation, so for N → ∞, and a stationary system, it also has a steady-state structure.
Thus, for Gaussian uncertainty, the solution is nice and possesses a steady state. For non-Gaussian uncertainty, computing E{xk | Ik} may be very difficult, so a suboptimal solution is typically used.
Most common suboptimal controller: Replace E{xk | Ik} by the estimate produced by the Kalman filter (act as if x0, wk, and vk are Gaussian).
It can be shown that this controller is optimal within the class of controllers that are linear functions of Ik.
6.231 DYNAMIC PROGRAMMING
LECTURE 12
LECTURE OUTLINE
DP for imperfect state info
Sufficient statistics
Conditional state distribution as a sufficient statistic
Finite-state systems
Examples
REVIEW: PROBLEM WITH IMPERFECT STATE INFO
Instead of knowing xk, we receive observations
z0 = h0(x0, v0), zk = hk(xk, uk−1, vk), k ≥ 1
Ik: information vector available at time k:
I0 = z0, Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1), k ≥ 1
Optimization over policies π = {μ0, μ1, . . . , μN−1}, where μk(Ik) ∈ Uk, for all Ik and k.
Find a policy π that minimizes
Jπ = E_{x0, wk, vk, k=0,...,N−1} { gN(xN) + Σ_{k=0}^{N−1} gk(xk, μk(Ik), wk) }
subject to the equations
xk+1 = fk(xk, μk(Ik), wk), k ≥ 0,
z0 = h0(x0, v0), zk = hk(xk, μk−1(Ik−1), vk), k ≥ 1
DP ALGORITHM
DP algorithm:
Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }
for k = 0, 1, . . . , N − 2, and for k = N − 1,
JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1} { gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 }
The optimal cost J* is given by
J* = E_{z0}{ J0(z0) }.
SUFFICIENT STATISTICS
Suppose that we can find a function Sk(Ik) such that the right-hand side of the DP algorithm can be written in terms of some function Hk as
min_{uk∈Uk} Hk(Sk(Ik), uk).
Such a function Sk is called a sufficient statistic.
An optimal policy obtained by the preceding minimization can be written as
μk*(Ik) = μ̄k(Sk(Ik)),
where μ̄k is an appropriate function.
Example of a sufficient statistic: Sk(Ik) = Ik.
Another important sufficient statistic:
Sk(Ik) = P_{xk|Ik}
DP ALGORITHM IN TERMS OF P_{xk|Ik}
It turns out that P_{xk|Ik} is generated recursively by a dynamic system (estimator) of the form
P_{xk+1|Ik+1} = Φk(P_{xk|Ik}, uk, zk+1)
for a suitable function Φk.
The DP algorithm can be written as
J̄k(P_{xk|Ik}) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + J̄k+1( Φk(P_{xk|Ik}, uk, zk+1) ) | Ik, uk }
[Figure: block diagram of the system xk+1 = fk(xk, uk, wk), the measurement zk = hk(xk, uk−1, vk), the estimator Φk−1 producing P_{xk|Ik}, and the actuator μk]
EXAMPLE: A SEARCH PROBLEM
At each period, decide to search or not search a site that may contain a treasure.
If we search and a treasure is present, we find it with probability β and remove it from the site.
Treasure's worth: V. Cost of search: C.
States: treasure present & treasure not present.
Each search can be viewed as an observation of the state.
Denote
pk: probability that the treasure is present at the start of time k,
with p0 given. pk evolves at time k according to the equation
pk+1 = pk if not search,
0 if search and find treasure,
pk(1 − β) / (pk(1 − β) + 1 − pk) if search and no treasure.
SEARCH PROBLEM (CONTINUED)
DP algorithm:
J̄k(pk) = max[ 0, −C + pk β V + (1 − pk β) J̄k+1( pk(1 − β) / (pk(1 − β) + 1 − pk) ) ],
with J̄N(pN) = 0.
Can be shown by induction that the functions J̄k satisfy
J̄k(pk) = 0, for all pk ≤ C/(βV).
Furthermore, it is optimal to search at period k if and only if
pk β V ≥ C
(expected reward from the next search ≥ the cost of the search).
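The DP recursion and the threshold pk βV ≥ C can be checked by evaluating J̄k directly (a sketch with hypothetical values β = 0.5, V = 10, C = 2, N = 10):

```python
# Recursive evaluation of the search DP.
from functools import lru_cache

beta, V, C, N = 0.5, 10.0, 2.0, 10

@lru_cache(maxsize=None)
def J(k, p):
    if k == N:
        return 0.0
    # posterior probability after an unsuccessful search
    p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p)
    search = -C + p * beta * V + (1 - p * beta) * J(k + 1, round(p_next, 12))
    return max(0.0, search)

threshold = C / (beta * V)   # = 0.4
print(J(0, 0.3))      # 0.0: below the threshold, never search
print(J(0, 0.5) > 0)  # True: above the threshold, searching pays
```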
FINITE-STATE SYSTEMS
Suppose the system is a finite-state Markov chain, with states 1, . . . , n.
Then the conditional probability distribution P_{xk|Ik} is a vector
( P(xk = 1 | Ik), . . . , P(xk = n | Ik) ).
The DP algorithm can be executed over the n-dimensional simplex (the state space does not expand with increasing k).
When the control and observation spaces are also finite sets, it turns out that the cost-to-go functions J̄k in the DP algorithm are piecewise linear and concave (Exercise 5.7).
This is conceptually important and also (moderately) useful in practice.
INSTRUCTION EXAMPLE
Teaching a student some item. Possible states are L: Item learned, or L̄: Item not learned.
Possible decisions: T: Terminate the instruction, or T̄: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.
The test has two possible outcomes: R: Student gives a correct answer, or R̄: Student gives an incorrect answer.
Probabilistic structure:
[Figure: state transition probabilities (t, 1 − t, 1) between L̄ and L, and test outcome probabilities (r, 1 − r, 1) for outcomes R and R̄]
Cost of instruction: I per period.
Cost of terminating instruction: 0 if the student has learned the item, and C > 0 if not.
INSTRUCTION EXAMPLE III
Write the DP algorithm as
J̄k(pk) = min[ (1 − pk)C, I + Ak(pk) ],
where
Ak(pk) = P(zk+1 = R | Ik) J̄k+1( Φ(pk, R) ) + P(zk+1 = R̄ | Ik) J̄k+1( Φ(pk, R̄) ),
with Φ(pk, zk+1) denoting the updated conditional probability of state L.
Can show by induction that the Ak(p) are piecewise linear, concave, monotonically decreasing, with
Ak−1(p) ≤ Ak(p) ≤ Ak+1(p), for all p ∈ [0, 1].
[Figure: plot over p ∈ [0, 1] of the termination cost (1 − p)C and the curves I + AN−1(p), I + AN−2(p), I + AN−3(p); the crossing points define termination thresholds that approach 1 − I/C]
6.231 DYNAMIC PROGRAMMING
LECTURE 13
LECTURE OUTLINE
Suboptimal control
Certainty equivalent control
Implementations and approximations
Issues in adaptive control
PRACTICAL DIFFICULTIES OF DP
The curse of modeling
The curse of dimensionality
Exponential growth of the computational and storage requirements as the number of state variables and control variables increases
Quick explosion of the number of states in combinatorial problems
Intractability of imperfect state information problems
There may be real-time solution constraints. A family of problems may be addressed. The data of the problem to be solved is given with little advance notice.
The problem data may change as the system is controlled, creating a need for on-line replanning.
CERTAINTY EQUIVALENT CONTROL (CEC)
Replace the stochastic problem with a deterministic problem.
At each time k, the uncertain quantities are fixed at some typical values.
Implementation for an imperfect info problem. At each time k:
(1) Compute a state estimate x̄k(Ik) given the current information vector Ik.
(2) Fix the wi, i ≥ k, at some w̄i(xi, ui). Solve the deterministic problem:
minimize gN(xN) + Σ_{i=k}^{N−1} gi(xi, ui, w̄i(xi, ui))
subject to xk = x̄k(Ik) and for i ≥ k,
ui ∈ Ui, xi+1 = fi(xi, ui, w̄i(xi, ui)).
(3) Use as control the first element in the optimal control sequence found.
ALTERNATIVE IMPLEMENTATION
Let {μd0(x0), . . . , μdN−1(xN−1)} be an optimal controller obtained from the DP algorithm for the deterministic problem
minimize gN(xN) + Σ_{k=0}^{N−1} gk(xk, μk(xk), w̄k(xk, uk))
subject to xk+1 = fk(xk, μk(xk), w̄k(xk, uk)), μk(xk) ∈ Uk.
The CEC applies at time k the control input
μ̄k(Ik) = μdk(x̄k(Ik))
[Figure: block diagram of the CEC: the measurement zk = hk(xk, uk−1, vk) feeds an estimator producing x̄k(Ik), and the actuator applies uk = μdk(x̄k(Ik)) to the system xk+1 = fk(xk, uk, wk)]
PARTIALLY STOCHASTIC CEC
Instead of fixing all future disturbances to their typical values, fix only some, and treat the rest as stochastic.
Important special case: Treat an imperfect state information problem as one of perfect state information, using an estimate x̄k(Ik) of xk as if it were exact.
Multiaccess Communication Example: Consider controlling the slotted Aloha system (discussed in Ch. 5) by optimally choosing the probability of transmission of waiting packets. This is a hard problem of imperfect state info, whose perfect state info version is easy.
Natural partially stochastic CEC:
μ̃k(Ik) = min[ 1, 1/x̄k(Ik) ],
where x̄k(Ik) is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is Ik).
THE PROBLEM OF IDENTIFIABILITY
Suppose we consider two phases:
A parameter identification phase (compute an estimate θ̂ of the unknown parameter θ)
A control phase (apply the control that would be optimal if θ̂ were true).
A fundamental difficulty: the control process may make some of the unknown parameters invisible to the identification process.
Example: Consider the scalar system
xk+1 = axk + buk + wk, k = 0, 1, . . . , N − 1,
with the cost E{ Σ_{k=1}^{N} (xk)² }. If a and b are known, the optimal control law is μk(xk) = −(a/b)xk.
If a and b are not known and we try to estimate them while applying some nominal control law μk(xk) = λxk, the closed-loop system is
xk+1 = (a + bλ)xk + wk,
so identification can at best find (a + bλ) but not the values of both a and b.
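This can be illustrated by simulation (a sketch; the values a = 0.8, b = 0.5 and the feedback gain λ = −1 are hypothetical, chosen so the closed loop is stable): under u = λx, least squares pins down only the combination a + bλ.

```python
# Identifiability under the feedback u_k = lam * x_k.
import random

random.seed(0)
a, b, lam = 0.8, 0.5, -1.0
xs, ys = [], []
x = 1.0
for _ in range(5000):
    u = lam * x
    x_next = a * x + b * u + random.gauss(0.0, 0.1)
    xs.append(x)
    ys.append(x_next)
    x = x_next

# With u = lam*x the regressors x and u are collinear, so least squares can
# only identify c = a + b*lam (regression of x_{k+1} on x_k alone):
c = sum(p * q for p, q in zip(xs, ys)) / sum(v * v for v in xs)
# Minimum-norm split of c between a_hat and b_hat (pseudo-inverse solution):
a_hat = c / (1 + lam * lam)
b_hat = c * lam / (1 + lam * lam)

print(abs((a_hat + b_hat * lam) - (a + b * lam)) < 0.05)  # True: combination found
print(abs(a_hat - a) > 0.2)                               # True: a alone is not
```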
CEC AND IDENTIFIABILITY I
Suppose we have P{xk+1 | xk, uk, θ} and we use a control law that is optimal for known θ:
μk(Ik) = μk*(x̂k, θ̂k), with θ̂k: estimate of θ.
There are three systems of interest:
(a) The system (perhaps falsely) believed by the controller to be true, which evolves probabilistically according to
P{xk+1 | xk, μk*(x̂k, θ̂k), θ̂k}.
(b) The true closed-loop system, which evolves probabilistically according to
P{xk+1 | xk, μk*(x̂k, θ̂k), θ}.
(c) The optimal closed-loop system that corresponds to the true value of the parameter, which evolves probabilistically according to
P{xk+1 | xk, μk*(x̂k, θ), θ}.
CEC AND IDENTIFIABILITY II
[Figure: the three systems: believed to be true, P{xk+1 | xk, μ*(x̂k, θ̂k), θ̂k}; true closed-loop, P{xk+1 | xk, μ*(x̂k, θ̂k), θ}; optimal closed-loop, P{xk+1 | xk, μ*(x̂k, θ), θ}]
Difficulty: There is a built-in mechanism for the parameter estimates to converge to a wrong value.
Assume that for some θ̂ ≠ θ and all xk+1, xk,
P{xk+1 | xk, μ*(xk, θ̂), θ̂} = P{xk+1 | xk, μ*(xk, θ̂), θ},
i.e., there is a false value of the parameter for which the system under closed-loop control looks exactly as if the false value were true.
Then, if the controller estimates at some time the parameter to be θ̂, subsequent data will tend to reinforce this erroneous estimate.
REMEDY TO IDENTIFIABILITY PROBLEM
Introduce noise in the control applied, i.e., occasionally deviate from the CEC actions.
This provides a means to escape from wrong estimates.
However, introducing noise in the control may be difficult to implement in practice.
Under some special circumstances, i.e., the self-tuning control context discussed in the book, the CEC is optimal in the limit, even if the parameter estimates converge to the wrong values.
All of this touches upon some of the most sophisticated aspects of adaptive control.
6.231 DYNAMIC PROGRAMMING
LECTURE 14
LECTURE OUTLINE
Limited lookahead policies
Performance bounds
Computational aspects
Problem approximation approach
Vehicle routing example
Heuristic cost-to-go approximation
Computer chess
LIMITED LOOKAHEAD POLICIES
One-step lookahead (1SL) policy: At each k and state xk, use the control μ̄k(xk) that attains the minimum in
min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1(fk(xk, uk, wk)) },
where
J̃N = gN,
J̃k+1: approximation to the true cost-to-go Jk+1.
Two-step lookahead policy: At each k and xk, use the control μ̃k(xk) attaining the minimum above, where the function J̃k+1 is obtained using a 1SL approximation (solve a 2-step DP problem).
If J̃k+1 is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.
Sometimes one also replaces Uk(xk) above with a subset of most promising controls Ūk(xk).
As the length of lookahead increases, the required computation quickly explodes.
PERFORMANCE BOUNDS
Let J̄k(xk) be the cost-to-go from (xk, k) of the 1SL policy, based on functions J̃k.
Assume that for all (xk, k), we have
Ĵk(xk) ≤ J̃k(xk), (*)
where ĴN = gN and for all k,
Ĵk(xk) = min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + J̃k+1(fk(xk, uk, wk)) }
[so Ĵk(xk) is computed along with μ̄k(xk)]. Then
J̄k(xk) ≤ Ĵk(xk), for all (xk, k).
Important application: When J̃k is the cost-to-go of some heuristic policy (then the 1SL policy is called the rollout policy).
The bound can be extended to the case where there is a δk in the RHS of (*). Then
J̄k(xk) ≤ J̃k(xk) + δk + · · · + δN−1
COMPUTATIONAL ASPECTS
Sometimes nonlinear programming can be used to calculate the 1SL or the multistep version [particularly when Uk(xk) is not a discrete set]. Connection with the methodology of stochastic programming.
The choice of the approximating functions J̃k is critical, and they are calculated with a variety of methods.
Some approaches:
(a) Problem Approximation: Approximate the optimal cost-to-go with some cost derived from a related but simpler problem.
(b) Heuristic Cost-to-Go Approximation: Approximate the optimal cost-to-go with a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (Neuro-Dynamic Programming).
(c) Rollout Approach: Approximate the optimal cost-to-go with the cost of some suboptimal policy, which is calculated either analytically or by simulation.
PROBLEM APPROXIMATION
Many (problem-dependent) possibilities:
Replace uncertain quantities by nominal values, or simplify the calculation of expected values by limited simulation.
Simplify difficult constraints or dynamics.
Example of enforced decomposition: Route m vehicles that move over a graph. Each node has a value. The first vehicle that passes through the node collects its value. Maximize the total collected value, subject to initial and final time constraints (plus time windows and other constraints).
Usually the 1-vehicle version of the problem is much simpler. This motivates an approximation obtained by solving single vehicle problems.
1SL scheme: At time k and state xk (position of vehicles and collected value nodes), consider all possible kth moves by the vehicles, and at the resulting states approximate the optimal value-to-go with the value collected by optimizing the vehicle routes one-at-a-time.
HEURISTIC COST-TO-GO APPROXIMATION
Use a cost-to-go approximation from a parametric class J̃(x, r), where x is the current state and r = (r1, . . . , rm) is a vector of tunable scalars (weights).
By adjusting the weights, one can change the shape of the approximation J̃ so that it is reasonably close to the true optimal cost-to-go function.
Two key issues:
The choice of the parametric class J̃(x, r) (the approximation architecture).
The method for tuning the weights (training the architecture).
Successful application strongly depends on how these issues are handled, and on insight about the problem.
Sometimes a simulator is used, particularly when there is no mathematical model of the system.
APPROXIMATION ARCHITECTURES
Divided into linear and nonlinear [i.e., linear or nonlinear dependence of J̃(x, r) on r].
Linear architectures are easier to train, but nonlinear ones (e.g., neural networks) are richer.
Architectures based on feature extraction:
[Figure: the state x is mapped by a feature extraction mapping to a feature vector y, which feeds a cost approximator with parameter vector r, producing the cost approximation J̃(y, r)]
Ideally, the features will encode much of the nonlinearity that is inherent in the cost-to-go being approximated, and the approximation may be quite accurate without a complicated architecture.
Sometimes the state space is partitioned, and local features are introduced for each subset of the partition (they are 0 outside the subset).
With a well-chosen feature vector y(x), we can use a linear architecture
J̃(x, r) = Ĵ(y(x), r) = Σ_i ri yi(x)
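Training a linear architecture reduces to least squares in r; the following sketch fits hypothetical features (1, x, x²) to sampled costs of a made-up quadratic cost function J(x) = 3 + 2x²:

```python
# Least-squares training of a linear feature architecture J(x, r) = sum r_i*y_i(x).
def features(x):
    return (1.0, x, x * x)

samples = [(x / 10.0, 3.0 + 2.0 * (x / 10.0) ** 2) for x in range(-20, 21)]

# Solve the 3x3 normal equations A r = b by Gauss-Jordan elimination.
A = [[sum(features(x)[i] * features(x)[j] for x, _ in samples) for j in range(3)]
     for i in range(3)]
b = [sum(features(x)[i] * c for x, c in samples) for i in range(3)]
for i in range(3):
    piv = A[i][i]
    A[i] = [v / piv for v in A[i]]
    b[i] /= piv
    for j in range(3):
        if j != i:
            f = A[j][i]
            A[j] = [vj - f * vi for vj, vi in zip(A[j], A[i])]
            b[j] -= f * b[i]

r = b
print([round(v, 6) for v in r])  # ≈ [3.0, 0.0, 2.0]: the weights recover the cost
```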
COMPUTER CHESS I
Programs use a feature-based position evaluator that assigns a score to each move/position.
[Diagram: Position Evaluator — feature extraction (features: material balance, mobility, safety, etc.) followed by a weighting of features that produces a score]
Most often the weighting of features is linear, but multistep lookahead is involved.
Most often the training is done by trial and error.
Additional features:
Depth-first search
Variable-depth search when dynamic positions are involved
Alpha-beta pruning
COMPUTER CHESS II
[Diagram: multistep lookahead tree from position P (White to move), with candidate moves M1 and M2 leading to positions P1–P4, leaf scores backed up through alternating Black-to-move and White-to-move levels, and several branches marked Cutoff by alpha-beta pruning]
Alpha-beta pruning: As the move scores are evaluated by depth-first search, branches whose consideration (based on the calculations so far) cannot possibly change the optimal move are neglected.
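The pruning rule above can be sketched on a toy game tree. This is an illustrative example (hand-built tree, not the evaluator of any real chess program): leaves hold scores, internal nodes are lists of children, and a branch is cut off once it cannot change the optimal move.

```python
# Minimal alpha-beta pruning sketch on a hand-built game tree.
# Leaves hold scores; internal nodes are lists of children.

def alphabeta(node, alpha, beta, maximizing):
    if not isinstance(node, list):          # leaf: return its score
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:               # cutoff: this branch cannot
                break                       # change the optimal move
        return value
    else:
        value = float('inf')
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:               # symmetric cutoff at min nodes
                break
        return value

# Depth-2 tree: the maximizer moves first, then the minimizer.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, float('-inf'), float('inf'), True))  # 3
```

In this tree the second subtree is cut off after its first leaf: once a reply of value 2 is found, that subtree can yield at most 2, below the 3 already guaranteed by the first subtree.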
6.231 DYNAMIC PROGRAMMING
LECTURE 15
LECTURE OUTLINE
Rollout algorithms
Cost improvement property
Discrete deterministic problems
Sequential consistency and greedy algorithms
Sequential improvement
ROLLOUT ALGORITHMS
One-step lookahead policy: At each k and state xk, use the control µ~k(xk) that attains the minimum

min_{uk ∈ Uk(xk)} E{ gk(xk, uk, wk) + J~k+1(fk(xk, uk, wk)) },

where J~N = gN and J~k+1 is an approximation to the true cost-to-go Jk+1.
Rollout algorithm: The case where J~k is the cost-to-go of some heuristic policy (called the base policy).
Cost improvement property (to be shown): The rollout algorithm achieves no worse (and usually much better) cost than the base heuristic starting from the same state.
Main difficulty: Calculating J~k(xk) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.
May involve Monte Carlo simulation if the problem is stochastic.
Things improve in the deterministic case.
EXAMPLE: THE QUIZ PROBLEM
A person is given N questions; answering question i correctly has probability pi, with reward vi.
The quiz terminates at the first incorrect answer. Problem: Choose the ordering of questions so as to maximize the total expected reward.
Assuming no other constraints, it is optimal to use the index policy: questions should be answered in decreasing order of the index of preference pi vi/(1 − pi). With minor changes in the problem, the index policy need not be optimal. Examples:
A limit (< N) on the maximum number of questions that can be answered.
Time windows, sequence-dependent rewards, precedence constraints.
Rollout with the index policy as base policy: Convenient because at a given state (the subset of questions already answered), the index policy and its expected reward can be easily calculated.
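This rollout step can be sketched directly. The expected reward of an ordering (q1, q2, . . .) is Σ_j (p_{q1} · · · p_{q_{j−1}}) p_{qj} v_{qj}, since all earlier questions must be answered correctly to reach question j. The numerical data below are hypothetical; with no extra constraints, the index policy is optimal, so the rollout's first choice matches it.

```python
# Sketch: the quiz problem's index policy and one rollout step that uses
# it as the base heuristic (hypothetical p and v values).

def expected_reward(order, p, v):
    """Expected total reward of answering questions in the given order."""
    total, survive = 0.0, 1.0
    for i in order:
        total += survive * p[i] * v[i]
        survive *= p[i]                 # must be right to reach next question
    return total

def index_order(questions, p, v):
    """Base heuristic: decreasing index of preference p_i v_i / (1 - p_i)."""
    return sorted(questions, key=lambda i: p[i] * v[i] / (1 - p[i]),
                  reverse=True)

def rollout_first_question(questions, p, v):
    """Try each question first, complete with the index policy, pick best."""
    best_q, best_val = None, float('-inf')
    for q in questions:
        rest = index_order([i for i in questions if i != q], p, v)
        val = expected_reward([q] + rest, p, v)
        if val > best_val:
            best_q, best_val = q, val
    return best_q, best_val

p = [0.9, 0.5, 0.8]
v = [1.0, 4.0, 2.0]
base = index_order([0, 1, 2], p, v)     # indices 9, 4, 8 -> order [0, 2, 1]
q, val = rollout_first_question([0, 1, 2], p, v)
print(base, q, round(val, 4))           # [0, 2, 1] 0 3.78
```

Under a constraint such as a limit on the number of answerable questions, the index policy may become suboptimal, and the same rollout step can then strictly improve on it.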
COST IMPROVEMENT PROPERTY
Let
Jk(xk): cost-to-go of the rollout policy
Hk(xk): cost-to-go of the base policy

We claim that Jk(xk) ≤ Hk(xk) for all xk and k.

Proof by induction: We have JN(xN) = HN(xN) for all xN. Assume that

Jk+1(xk+1) ≤ Hk+1(xk+1), for all xk+1.

Then, for all xk,

Jk(xk) = E{ gk(xk, µ~k(xk), wk) + Jk+1(fk(xk, µ~k(xk), wk)) }
≤ E{ gk(xk, µ~k(xk), wk) + Hk+1(fk(xk, µ~k(xk), wk)) }
≤ E{ gk(xk, µk(xk), wk) + Hk+1(fk(xk, µk(xk), wk)) }
= Hk(xk),

where the first inequality uses the induction hypothesis, and the second uses the fact that the rollout control µ~k(xk) minimizes this expression over all controls, including the base policy's control µk(xk).
EXAMPLE: THE BREAKTHROUGH PROBLEM
Given a binary tree with N stages.
Each arc is either free or blocked (crossed out in the figure).
Problem: Find a free path from the root to the leaves (such as the one shown with thick lines).
Base heuristic (greedy): Follow the right branch if free; else follow the left branch if free.
For large N and a given probability of a free branch: the rollout algorithm requires O(N) times more computation, but has an O(N) times larger probability of finding a free path than the greedy algorithm.
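A small simulation makes the comparison concrete. This is a hypothetical setup, and the rollout below is one simple variant: it moves to a child from which the greedy base heuristic completes a free path, and declares failure if no such child exists. With this variant, rollout succeeds on every tree on which greedy does, so the cost improvement property holds instance by instance.

```python
import random

# Simulation sketch of the breakthrough problem: greedy base heuristic
# vs. rollout. The depth-N binary tree is generated implicitly: each arc
# (node, side) is free with probability p_free; a node is its path string.

def make_free(p_free, rng):
    """Lazily sampled arc status, cached so both policies see the same tree."""
    cache = {}
    def free(node, side):
        key = (node, side)
        if key not in cache:
            cache[key] = rng.random() < p_free
        return cache[key]
    return free

def greedy(node, depth, N, free):
    """Base heuristic: follow the right branch if free, else the left."""
    while depth < N:
        if free(node, 'R'):
            node, depth = node + 'R', depth + 1
        elif free(node, 'L'):
            node, depth = node + 'L', depth + 1
        else:
            return False                     # stuck before reaching a leaf
    return True

def rollout(N, free):
    """Move to a child from which greedy completes a free path, if any."""
    node, depth = '', 0
    while depth < N:
        for side in ('R', 'L'):
            if free(node, side) and greedy(node + side, depth + 1, N, free):
                node, depth = node + side, depth + 1
                break
        else:
            return False
    return True

rng = random.Random(0)
N, p_free, trials = 10, 0.7, 300
g_wins = r_wins = 0
for _ in range(trials):
    free = make_free(p_free, rng)
    g_wins += greedy('', 0, N, free)
    r_wins += rollout(N, free)
print(r_wins >= g_wins)  # True: rollout is never worse on any single tree
```

The extra cost of rollout is visible in the code: each of the O(N) steps runs the O(N)-step greedy from up to two children, giving the O(N)-fold computation overhead quoted above.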
DISCRETE DETERMINISTIC PROBLEMS
Any discrete optimization problem (with a finite number of choices/feasible solutions) can be represented as a sequential decision process by using a tree.
The leaves of the tree correspond to the feasible solutions.
The problem can be solved by DP, starting from the leaves and going back towards the root.
Example: Traveling salesman problem. Find a minimum-cost tour that goes exactly once through each of N cities.
[Figure: decision tree for the traveling salesman problem with four cities A, B, C, D. From origin node s = A, the partial tours AB, AC, AD branch into ABC, ABD, ACB, ACD, ADB, ADC, and then into the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB.]
A CLASS OF GENERAL DISCRETE PROBLEMS
Generic problem: Given a graph with
directed arcs,
a special node s called the origin,
a set of terminal nodes, called destinations, and a cost g(i) for each destination i.
Find a minimum-cost path starting at the origin and ending at one of the destination nodes.
Base heuristic: For any nondestination node i, it constructs a path (i, i1, . . . , im, ī) starting at i and ending at one of the destination nodes ī. We call ī the projection of i, and we denote H(i) = g(ī).
Rollout algorithm: Start at the origin; at each step, choose the successor node with the least-cost projection.
[Figure: a path s, i1, . . . , im−1, im built by the rollout algorithm; the neighbors j1, j2, j3, j4 of im are shown with their projections p(j1), p(j2), p(j3), p(j4) to destination nodes]
EXAMPLE: ONE-DIMENSIONAL WALK
A person takes either a unit step to the left or a unit step to the right. Minimize the cost g(i) of the point i where he will end up after N steps.
[Figure: terminal cost g(i) plotted over the points i = −N, . . . , N − 2, N; the walk starts at (0, 0) and after N steps ends at some point ī between (N, −N) and (N, N)]
Base heuristic: Always go to the right. Rollout finds the rightmost local minimum.
Base heuristic: Compare "always go to the right" and "always go to the left," and choose the best of the two. Rollout finds a global minimum.
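The first claim can be checked numerically. With the "always go right" base heuristic, continuing from position i with s steps left ends at i + s, so rollout at each step compares the terminal costs g(i + 1 + s) and g(i − 1 + s). The terminal cost table below is a hypothetical example with its rightmost local minimum at i = 2 and its global minimum at i = −4.

```python
# Sketch of the one-dimensional walk (N = 4 steps, start at 0): rollout
# over the "always go right" base heuristic finds the rightmost local
# minimum of the terminal cost g, not necessarily the global minimum.

def rollout_walk(N, g):
    i = 0
    for k in range(N):
        steps_left = N - k - 1
        # Heuristic completion from each successor: keep going right,
        # ending at position (successor + steps_left).
        cost_right = g[i + 1 + steps_left]
        cost_left = g[i - 1 + steps_left]
        i = i + 1 if cost_right <= cost_left else i - 1
    return i

# Hypothetical terminal cost on the reachable endpoints -4, -2, 0, 2, 4:
# rightmost local minimum at i = 2, global minimum at i = -4.
g = {-4: 0, -2: 2, 0: 3, 2: 1, 4: 5}
print(rollout_walk(4, g))  # 2  (the rightmost local minimum, not -4)
```

Running the same rollout with the best-of-two base heuristic (completions to both i + s and i − s) would instead reach the global minimum at −4, matching the second claim.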
SEQUENTIAL CONSISTENCY
The base heuristic is sequentially consistent if, for every node i, whenever it generates the path (i, i1, . . . , im, ī) starting at i, it also generates the path (i1, . . . , im, ī) starting at the node i1 (i.e., all nodes of its path have the same projection).
The prime example of a sequentially consistent heuristic is a greedy algorithm. It uses an estimate F(i) of the optimal cost starting from i.
At the typical step, given a path (i, i1, . . . , im), where im is not a destination, the algorithm adds to the path a node im+1 such that

im+1 = arg min_{j ∈ N(im)} F(j),

where N(im) is the set of downstream neighbors of im.
If the base heuristic is sequentially consistent, the cost of the rollout algorithm is no more than the cost of the base heuristic. In particular, if (s, i1, . . . , im) is the rollout path, we have

H(s) ≥ H(i1) ≥ · · · ≥ H(im−1) ≥ H(im),

where H(i) = cost of the heuristic starting from i.
6.231 DYNAMIC PROGRAMMING
LECTURE 16
LECTURE OUTLINE
More on rollout algorithms
Simulation-based methods
Approximations of rollout algorithms
Rolling horizon approximations
Discretization issues
Other suboptimal approaches
ROLLOUT ALGORITHMS
Rollout policy: At each k and state xk, use the control µ~k(xk) that attains the minimum

min_{uk ∈ Uk(xk)} Qk(xk, uk),

where

Qk(xk, uk) = E{ gk(xk, uk, wk) + Hk+1(fk(xk, uk, wk)) }

and Hk+1(xk+1) is the cost-to-go of the heuristic.
Qk(xk, uk) is called the Q-factor of (xk, uk), and for a stochastic problem, its computation may involve Monte Carlo simulation.
Potential difficulty: To minimize the Q-factor over uk, we must form Q-factor differences Qk(xk, u) − Qk(xk, u′). This differencing often amplifies the simulation error in the calculation of the Q-factors.
Potential remedy: Compare any two controls u and u′ by simulating the difference Qk(xk, u) − Qk(xk, u′) directly.
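The remedy amounts to using common random numbers: the same noise sample w serves both controls, so the noise largely cancels in the difference. The one-stage quadratic cost model below is a hypothetical illustration, not from the slides.

```python
import random

# Toy illustration of the remedy: estimate Q(x, u) - Q(x, u') either from
# independent samples, or by simulating the difference directly with
# common random numbers (the same w for both controls). Hypothetical
# model: Q(x, u) = E[(x + u - w)^2] with w ~ Uniform(0, 1).

def stage_cost(x, u, w):
    return (x + u - w) ** 2

def diff_independent(x, u1, u2, n, rng):
    """Each control gets its own noise samples; errors do not cancel."""
    s = 0.0
    for _ in range(n):
        s += stage_cost(x, u1, rng.uniform(0, 1))
        s -= stage_cost(x, u2, rng.uniform(0, 1))
    return s / n

def diff_common(x, u1, u2, n, rng):
    """One noise sample per trial, reused for both controls."""
    s = 0.0
    for _ in range(n):
        w = rng.uniform(0, 1)
        s += stage_cost(x, u1, w) - stage_cost(x, u2, w)
    return s / n

rng = random.Random(1)
n = 2000
d_ind = diff_independent(0.0, 0.2, 0.3, n, rng)
d_com = diff_common(0.0, 0.2, 0.3, n, rng)
# True difference: E[(0.2 - w)^2 - (0.3 - w)^2] = -0.05 + 0.2 E[w] = 0.05.
print(round(d_ind, 3), round(d_com, 3))
```

Per sample, the common-random-numbers difference is −0.05 + 0.2w, whose variance is far smaller than that of the difference of two independent squared terms, so d_com is typically a much tighter estimate of 0.05 than d_ind.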
ROLLING HORIZON APPROACH
This is an l-step lookahead policy where the cost-to-go approximation is just 0.
Alternatively, the cost-to-go approximation is the terminal cost function gN.
A short rolling horizon saves computation.
Paradox: It is not true that a longer rolling horizon always improves performance.
Example: At the initial state, there are two controls available (1 and 2). At every other state, there is only one control.
[Figure: from the current state, control 1 leads to the optimal trajectory, which shows high cost within the l-stage lookahead but low cost beyond it; control 2 leads to a trajectory with low cost over the l stages of the horizon, followed by high cost beyond it]
ROLLING HORIZON COMBINED WITH ROLLOUT
We can use a rolling horizon approximation in calculating the cost-to-go of the base heuristic.
Because the heuristic is suboptimal, the rationale for a long rolling horizon becomes weaker.
Example: N-stage stopping problem where the stopping cost is 0, the continuation cost is either −ε or 1, where 0 < ε < 1/N, and the first state with continuation cost equal to 1 is state m. Then the optimal policy is to stop at state m, and the optimal cost is −εm.
[Figure: states 0, 1, 2, . . . , m, . . . , N in a line, each with a transition to the stopped state; the continuation cost is 1 at state m]
Consider the heuristic that continues at every state, and the rollout policy that is based on this heuristic, with a rolling horizon of l ≤ m steps. It will continue up to the first m − l + 1 stages, thus compiling a cost of −(m − l + 1)ε. The rollout performance improves as l becomes shorter!
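This example can be verified numerically. The sketch below (with hypothetical values N = 20, m = 10, ε = 0.01) implements the rollout rule: continue while the "always continue" heuristic, truncated to an l-stage horizon, makes continuing look cheaper than stopping (cost 0).

```python
# Numerical check of the stopping example: states 0, ..., N; stopping
# cost 0; continuation cost -eps at every state except cost 1 at state m.

def continuation_cost(j, m, eps):
    return 1.0 if j == m else -eps

def horizon_cost(k, m, eps, l, N):
    """Cost of continuing from state k under the heuristic for l stages."""
    return sum(continuation_cost(j, m, eps) for j in range(k, min(k + l, N)))

def rollout_total_cost(m, eps, l, N):
    """Rollout with rolling horizon l: continue while that looks cheaper
    than stopping (which costs 0); return the total cost accrued."""
    total, k = 0.0, 0
    while k < N and horizon_cost(k, m, eps, l, N) < 0:
        total += continuation_cost(k, m, eps)
        k += 1
    return total

N, m, eps = 20, 10, 0.01
for l in (8, 4, 2):
    print(l, round(rollout_total_cost(m, eps, l, N), 4))
# Prints -(m - l + 1) * eps for each l:
# 8 -0.03
# 4 -0.07
# 2 -0.09
```

As soon as the horizon window first covers state m (at state m − l + 1), the lookahead sees the cost 1 and stops; a shorter l lets the rollout continue longer through the −ε states, so the cost improves as l shrinks.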
GENERAL APPROACH FOR DISCRETIZATION I
Given a discrete-time system with state space S, consider a finite subset S~; for example, S~ could be a finite grid within a continuous state space S. Assume stationarity for convenience, i.e., that the system equation and cost per stage are the same for all times.
We define an approximation to the original problem, with state space S~, as follows:
Express each x ∈ S as a convex combination of states in S~, i.e.,

x = Σ_{xi ∈ S~} φi(x) xi, where φi(x) ≥ 0 and Σi φi(x) = 1
Define a reduced dynamic system with state space S~, whereby from each xi ∈ S~ we move to x = f(xi, u, w) according to the system equation of the original problem, and then move to xj ∈ S~ with probabilities φj(x).
Define similarly the corresponding cost per stage of the transitions of the reduced system.
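The convex-combination step can be sketched in one dimension, where the weights φi(x) reduce to linear interpolation between the two neighboring grid points. The grid below is a hypothetical example; the resulting weights are exactly the transition probabilities used by the reduced system.

```python
import bisect

# Sketch of the discretization step: express a continuous state x as a
# convex combination of grid points, giving weights phi_i(x) >= 0 that
# sum to 1. One-dimensional sorted grid, linear interpolation.

def convex_weights(x, grid):
    """Return {grid point: weight} with nonnegative weights summing to 1."""
    if x <= grid[0]:
        return {grid[0]: 1.0}
    if x >= grid[-1]:
        return {grid[-1]: 1.0}
    j = bisect.bisect_right(grid, x)        # grid[j-1] <= x < grid[j]
    lo, hi = grid[j - 1], grid[j]
    t = (x - lo) / (hi - lo)
    return {lo: 1.0 - t, hi: t}

grid = [0.0, 1.0, 2.0, 3.0]
w = convex_weights(1.25, grid)
print(w)                                    # {1.0: 0.75, 2.0: 0.25}
# Check: the weights reproduce x as a convex combination of grid states.
print(sum(phi * xi for xi, phi in w.items()))   # 1.25
```

In the reduced system, after moving to x = f(xi, u, w), one jumps to grid point xj with probability φj(x); in higher dimensions the same idea yields barycentric weights over the vertices of the enclosing cell.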
GENERAL APPROACH FOR DISCRETIZATION II