
LECTURE SLIDES - DYNAMIC PROGRAMMING

BASED ON LECTURES GIVEN AT THE

MASSACHUSETTS INST. OF TECHNOLOGY

CAMBRIDGE, MASS

FALL 2011

DIMITRI P. BERTSEKAS

These lecture slides are based on the book:

“Dynamic Programming and Optimal Control: 3rd edition,” Vols. 1 and 2, Athena Scientific, 2007, by Dimitri P. Bertsekas; see

http://www.athenasc.com/dpbook.html


6.231 DYNAMIC PROGRAMMING

LECTURE 1

LECTURE OUTLINE

•   Problem Formulation

•   Examples

•   The Basic Problem

•   Significance of Feedback


DP AS AN OPTIMIZATION METHODOLOGY

•   Generic optimization problem:

min_{u∈U} g(u)

where u is the optimization/decision variable, g(u) is the cost function, and U is the constraint set

•   Categories of problems:

−   Discrete (U is finite) or continuous

−   Linear (g is linear and U is polyhedral) or nonlinear

−   Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter w, which is averaged, i.e., it has the form

g(u) = E_w{ G(u, w) }

where w is a random parameter.

•   DP can deal with complex stochastic problems where information about w becomes available in stages, and the decisions are also made in stages and make use of this information.


BASIC STRUCTURE OF STOCHASTIC DP

•   Discrete-time system

xk+1 = fk(xk, uk, wk),   k = 0, 1, . . . , N − 1

−   k: Discrete time

−   xk: State; summarizes past information that is relevant for future optimization

−   uk: Control; decision to be selected at time k from a given set

−   wk: Random parameter (also called disturbance or noise depending on the context)

−   N: Horizon or number of times control is applied

•   Cost function that is additive over time

gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk)

•  Alternative system description: P(xk+1 | xk, uk)

xk+1 = wk   with   P(wk | xk, uk) = P(xk+1 | xk, uk)


INVENTORY CONTROL EXAMPLE

[Figure: inventory system block diagram; stock xk at period k, order uk, demand wk, next stock xk+1 = xk + uk − wk; cost of period k: c·uk + r(xk + uk − wk)]

•   Discrete-time system

xk+1  = f k(xk, uk, wk) = xk + uk − wk

•   Cost function that is additive over time

gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) = E{ Σ_{k=0}^{N−1} ( c·uk + r(xk + uk − wk) ) }

•  Optimization over policies: Rules/functions uk = µk(xk) that map states to controls


ADDITIONAL ASSUMPTIONS

•   The set of values that the control uk can take depends at most on xk and not on prior x or u

•   Probability distribution of wk does not depend on past values wk−1, . . . , w0, but may depend on xk and uk

−   Otherwise past values of w or x would be useful for future optimization

•   Sequence of events envisioned in period k:

−   xk occurs according to

xk = fk−1(xk−1, uk−1, wk−1)

−   uk is selected with knowledge of xk, i.e.,

uk ∈ Uk(xk)

−   wk is random and generated according to a distribution

Pwk(· | xk, uk)


DETERMINISTIC FINITE-STATE PROBLEMS

•   Scheduling example: Find optimal sequence of operations A, B, C, D

•   A must precede B, and C must precede D

•   Given startup costs SA and SC, and setup transition cost Cmn from operation m to operation n

[Figure: state-transition graph of the scheduling problem; from the initial state, arcs labeled by the startup costs SA, SC and the transition costs Cmn lead through the partial schedules (A, C, AB, AC, CA, CD, . . .) to the complete schedules]


STOCHASTIC FINITE-STATE PROBLEMS

•  Example: Find two-game chess match strategy

•   Timid play draws with prob. pd > 0 and loses with prob. 1 − pd. Bold play wins with prob. pw < 1/2 and loses with prob. 1 − pw

[Figure: transition diagrams for the 1st and 2nd game under timid play (draw with prob. pd, loss with prob. 1 − pd) and bold play (win with prob. pw, loss with prob. 1 − pw), showing the possible match scores after each game]


BASIC PROBLEM

•   System xk+1 = fk(xk, uk, wk), k = 0, . . . , N − 1

•   Control constraints uk ∈ Uk(xk)

•   Probability distribution Pk(· | xk, uk) of wk

•   Policies π = {µ0, . . . , µN−1}, where µk maps states xk into controls uk = µk(xk) and is such that µk(xk) ∈ Uk(xk) for all xk

•   Expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

•   Optimal cost function

J∗(x0) = min_π Jπ(x0)

•   Optimal policy  π∗ satisfies

J π∗(x0) = J ∗(x0)

When produced by DP,  π∗ is independent of  x0.


SIGNIFICANCE OF FEEDBACK

•   Open-loop versus closed-loop policies

[Figure: closed-loop system; the controller µk observes the state xk and applies uk = µk(xk) to the system xk+1 = fk(xk, uk, wk), which is driven by the disturbance wk]

•   In deterministic problems open loop is as good as closed loop

•   Value of information; chess match example

•  Example of open-loop policy: Play always bold

•   Consider the closed-loop policy:   Play timid if and only if you are ahead

[Figure: transition diagram of the closed-loop policy: bold play in the 1st game; in the 2nd game, timid play if ahead (1-0) and bold play if behind (0-1)]


VARIANTS OF DP PROBLEMS

•   Continuous-time problems

•   Imperfect state information problems

•   Infinite horizon problems

•   Suboptimal control


LECTURE BREAKDOWN

•   Finite Horizon Problems (Vol. 1, Ch. 1-6)

−  Ch. 1: The DP algorithm (2 lectures)

−  Ch. 2: Deterministic finite-state problems (1 lecture)

−  Ch. 3: Deterministic continuous-time problems (1 lecture)

−  Ch. 4: Stochastic DP problems (2 lectures)

−  Ch. 5: Imperfect state information problems (2 lectures)

−  Ch. 6: Suboptimal control (2 lectures)

•  Infinite Horizon Problems - Simple (Vol. 1, Ch. 7, 3 lectures)

•   Infinite Horizon Problems - Advanced (Vol. 2)

−  Ch. 1: Discounted problems - Computational methods (3 lectures)

−  Ch. 2: Stochastic shortest path problems (1 lecture)

−  Ch. 6: Approximate DP (6 lectures)


A NOTE ON THESE SLIDES

•   These slides are a teaching aid, not a text

•   Don't expect a rigorous mathematical development or precise mathematical statements

•  Figures are meant to convey and enhance ideas, not to express them precisely

•  Omitted proofs and a much fuller discussion can be found in the texts, which these slides follow


6.231 DYNAMIC PROGRAMMING

LECTURE 2

LECTURE OUTLINE

•   The basic problem

•   Principle of optimality

•   DP example: Deterministic problem

•   DP example: Stochastic problem

•  The general DP algorithm

•   State augmentation


BASIC PROBLEM

•   System xk+1 = fk(xk, uk, wk), k = 0, . . . , N − 1

•   Control constraints uk ∈ Uk(xk)

•   Probability distribution Pk(· | xk, uk) of wk

•   Policies π = {µ0, . . . , µN−1}, where µk maps states xk into controls uk = µk(xk) and is such that µk(xk) ∈ Uk(xk) for all xk

•   Expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

•   Optimal cost function

J∗(x0) = min_π Jπ(x0)

•   Optimal policy  π∗ is one that satisfies

J π∗(x0) = J ∗(x0)


PRINCIPLE OF OPTIMALITY

•   Let π∗ = {µ∗0, µ∗1, . . . , µ∗N−1} be an optimal policy

•   Consider the “tail subproblem” whereby we are at xi at time i and wish to minimize the “cost-to-go” from time i to time N

gN(xN) + Σ_{k=i}^{N−1} gk(xk, µk(xk), wk)

and the “tail policy” {µ∗i, µ∗i+1, . . . , µ∗N−1}

[Figure: time line from 0 to N; the tail subproblem starts at state xi at time i]

•   Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)

•   DP first solves ALL tail subproblems of the final stage

•  At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length


DETERMINISTIC SCHEDULING EXAMPLE

•   Find optimal sequence of operations A, B, C, D (A must precede B and C must precede D)

[Figure: the scheduling graph of the earlier slide, now annotated with the arc costs and, next to each state, the optimal cost-to-go obtained by the backward DP recursion]

•  Start from the last tail subproblem and go backwards

•   At each state-time pair, we record the optimal cost-to-go and the optimal decision
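A minimal Python sketch of this backward pass over a staged deterministic graph is given below; the stage/arc data at the end are illustrative placeholders, not the costs of the figure.

    # Backward DP on a deterministic finite-state problem given as a staged graph.
    # stages[k] lists the states at time k; arc_cost[k][(i, j)] is the cost of the
    # arc i -> j between stage k and stage k+1.

    def backward_dp(stages, arc_cost, terminal_cost):
        N = len(stages) - 1
        J = {x: terminal_cost[x] for x in stages[N]}        # J_N(x) = terminal cost
        policy = {}
        for k in range(N - 1, -1, -1):                      # go backwards
            J_new = {}
            for i in stages[k]:
                # minimize arc cost plus cost-to-go over the successors of i
                best = min((c + J[j], j) for (ii, j), c in arc_cost[k].items() if ii == i)
                J_new[i], policy[(k, i)] = best[0], best[1]
            J = J_new
        return J, policy                                    # J[initial state] = optimal cost

    # Tiny hypothetical instance (2 stages)
    stages = [['s'], ['A', 'C'], ['AB', 'AC', 'CA', 'CD']]
    arc_cost = [{('s', 'A'): 5, ('s', 'C'): 3},
                {('A', 'AB'): 2, ('A', 'AC'): 6, ('C', 'CA'): 4, ('C', 'CD'): 3}]
    terminal_cost = {'AB': 1, 'AC': 3, 'CA': 1, 'CD': 5}
    J, policy = backward_dp(stages, arc_cost, terminal_cost)
    print(J['s'], policy)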


STOCHASTIC INVENTORY EXAMPLE

[Figure: inventory system block diagram; stock xk at period k, order uk, demand wk, next stock xk+1 = xk + uk − wk; cost of period k: c·uk + r(xk + uk − wk)]

•  Tail Subproblems of Length 1:

JN−1(xN−1) = min_{uN−1≥0} E_{wN−1}{ c·uN−1 + r(xN−1 + uN−1 − wN−1) }

•  Tail Subproblems of Length N − k:

Jk(xk) = min_{uk≥0} E_{wk}{ c·uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) }

•   J 0(x0) is opt. cost of initial state  x0


DP ALGORITHM

•  Start with

JN(xN) = gN(xN),

and go backwards using

Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) },   k = 0, 1, . . . , N − 1.

•   Then J0(x0), generated at the last step, is equal to the optimal cost J∗(x0). Also, the policy

π∗ = {µ∗0, . . . , µ∗N−1},

where µ∗k(xk) minimizes in the right side above for each xk and k, is optimal

•  Justification: Proof by induction that Jk(xk) is equal to J∗k(xk), defined as the optimal cost of the tail subproblem that starts at time k at state xk

•   Note:

−  ALL the tail subproblems are solved (in addition to the original problem)

−   Intensive computational requirements
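The recursion above can be written almost verbatim for a problem with a small finite state space. The following Python sketch does so; the inventory instance at the end (capacity 2, demand uniform on {0, 1}, state clamped to {0, 1, 2}) is an illustrative assumption, not data from the text.

    # Generic finite-horizon stochastic DP over an enumerated state space.
    # f(k,x,u,w): next state; g(k,x,u,w): stage cost; gN(x): terminal cost;
    # W(k,x,u): list of (disturbance, probability) pairs.

    def dp(N, states, controls, f, g, gN, W):
        J = {x: gN(x) for x in states}                 # J_N = g_N
        policy = [dict() for _ in range(N)]
        for k in range(N - 1, -1, -1):                 # backward in time
            Jk = {}
            for x in states:
                best_u, best_val = None, float('inf')
                for u in controls(k, x):
                    # expected stage cost plus cost-to-go of the next state
                    val = sum(p * (g(k, x, u, w) + J[f(k, x, u, w)]) for w, p in W(k, x, u))
                    if val < best_val:
                        best_u, best_val = u, val
                Jk[x], policy[k][x] = best_val, best_u
            J = Jk
        return J, policy                               # J[x0] = optimal cost from x0

    # Hypothetical inventory instance
    states = [0, 1, 2]
    c, price_p, hold_h = 1.0, 3.0, 1.0
    r = lambda x: price_p * max(0, -x) + hold_h * max(0, x)
    J, policy = dp(
        N=3, states=states,
        controls=lambda k, x: [u for u in states if x + u <= 2],
        f=lambda k, x, u, w: max(0, min(2, x + u - w)),   # clamp the stock to {0,1,2}
        g=lambda k, x, u, w: c * u + r(x + u - w),
        gN=lambda x: 0.0,
        W=lambda k, x, u: [(0, 0.5), (1, 0.5)])
    print(J, policy[0])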


PROOF OF THE INDUCTION STEP

•   Let πk = {µk, µk+1, . . . , µN−1} denote a tail policy from time k onward

•   Assume that Jk+1(xk+1) = J∗k+1(xk+1). Then

J∗k(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }

   = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }

   = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + J∗k+1(fk(xk, µk(xk), wk)) }

   = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) }

   = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }

   = Jk(xk)


LINEAR-QUADRATIC ANALYTICAL EXAMPLE

[Figure: material with initial temperature x0 passes through Oven 1 (temperature u0), exits at temperature x1, passes through Oven 2 (temperature u1), and exits at final temperature x2]

•  System

xk+1 = (1 − a)xk + a·uk,   k = 0, 1,

where a is a given scalar from the interval (0, 1)

•  Cost

r(x2 − T)^2 + u0^2 + u1^2

where r is a given positive scalar

•   DP Algorithm:

J2(x2) = r(x2 − T)^2

J1(x1) = min_{u1} [ u1^2 + r((1 − a)x1 + a·u1 − T)^2 ]

J0(x0) = min_{u0} [ u0^2 + J1((1 − a)x0 + a·u0) ]
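A short sketch that carries out this two-oven recursion symbolically (assuming the sympy package is available), so the closed-form minimizers and J0 can be inspected:

    # Symbolic DP for the two-oven example: each stage cost is quadratic in the
    # control, so setting the derivative to zero gives the minimizer in closed form.
    import sympy as sp

    x0, x1, x2, u0, u1, a, r, T = sp.symbols('x0 x1 x2 u0 u1 a r T', real=True)

    J2 = r * (x2 - T)**2

    # Stage 1: substitute x2 = (1 - a) x1 + a u1 and minimize over u1
    expr1 = u1**2 + J2.subs(x2, (1 - a)*x1 + a*u1)
    u1_star = sp.solve(sp.diff(expr1, u1), u1)[0]      # unique stationary point of a convex quadratic
    J1 = sp.simplify(expr1.subs(u1, u1_star))

    # Stage 0: substitute x1 = (1 - a) x0 + a u0 and minimize over u0
    expr0 = u0**2 + J1.subs(x1, (1 - a)*x0 + a*u0)
    u0_star = sp.solve(sp.diff(expr0, u0), u0)[0]
    J0 = sp.simplify(expr0.subs(u0, u0_star))

    print('u1* =', sp.simplify(u1_star))
    print('J0(x0) =', J0)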


STATE AUGMENTATION

•   When assumptions of the basic problem are violated (e.g., disturbances are correlated, cost is nonadditive, etc.), reformulate/augment the state

•   Example: Time lags

xk+1 = f k(xk, xk−1, uk, wk)

•   Introduce an additional state variable yk = xk−1. The new system takes the form

(xk+1, yk+1) = ( fk(xk, yk, uk, wk), xk )

View x̃k = (xk, yk) as the new state.

•   DP algorithm for the reformulated problem:

Jk(xk, xk−1) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, xk−1, uk, wk), xk ) }


6.231 DYNAMIC PROGRAMMING

LECTURE 3

LECTURE OUTLINE

•  Deterministic finite-state DP problems

•  Backward shortest path algorithm

•   Forward shortest path algorithm

•   Shortest path examples

•  Alternative shortest path algorithms


DETERMINISTIC FINITE-STATE PROBLEM

[Figure: states arranged by stages 0, 1, 2, . . . , N − 1, N, with initial state s; an artificial terminal node t is added, and the terminal arcs into t carry cost equal to the terminal cost]

•  States  <==>  Nodes

•   Controls  <==>  Arcs

•   Control sequences (open-loop)  <==>  paths from initial state to terminal states

•   a_ij^k: Cost of transition from state i ∈ Sk to state j ∈ Sk+1 at time k (view it as “length” of the arc)

•   a_it^N: Terminal cost of state i ∈ SN

•  Cost of control sequence  <==>  Cost of the corresponding path (view it as “length” of the path)


BACKWARD AND FORWARD DP ALGORITHM

•   DP algorithm:

JN(i) = a_it^N,   i ∈ SN,

Jk(i) = min_{j∈Sk+1} [ a_ij^k + Jk+1(j) ],   i ∈ Sk,  k = 0, . . . , N − 1

The optimal cost is J0(s) and is equal to the length of the shortest path from s to t

•   Observation: An optimal path s → t is also an optimal path t → s in a “reverse” shortest path problem where the direction of each arc is reversed and its length is left unchanged

•   Forward DP algorithm (= backward DP algorithm for the reverse problem):

J̃N(j) = a_sj^0,   j ∈ S1,

J̃k(j) = min_{i∈SN−k} [ a_ij^{N−k} + J̃k+1(i) ],   j ∈ SN−k+1

The optimal cost is J̃0(t) = min_{i∈SN} [ a_it^N + J̃1(i) ]

•   View J̃k(j) as the optimal cost-to-arrive to state j from initial state s


A NOTE ON FORWARD DP ALGORITHMS

•   There is no forward DP algorithm for stochastic problems

•   Mathematically, for stochastic problems, we cannot restrict ourselves to open-loop sequences, so the shortest path viewpoint fails

•   Conceptually, in the presence of uncertainty, the concept of “optimal cost-to-arrive” at a state xk does not make sense. For example, it may be impossible to guarantee (with prob. 1) that any given state can be reached

•   By contrast, even in stochastic problems, the concept of “optimal cost-to-go” from any state xk makes clear sense


GENERIC SHORTEST PATH PROBLEMS

• {1, 2, . . . , N, t}: nodes of a graph (t: the destination)

•   aij: cost of moving from node i to node j

•  Find a shortest (minimum cost) path from each node i to node t

•   Assumption: All cycles have nonnegative length. Then an optimal path need not take more than N moves

•   We formulate the problem as one where we require exactly N moves but allow degenerate moves from a node i to itself with cost aii = 0

Jk(i) = opt. cost of getting from i to t in N − k moves

J0(i): Cost of the optimal path from i to t.

•   DP algorithm:

Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2,

with JN−1(i) = ait,   i = 1, 2, . . . , N


EXAMPLE

[Figure: (a) a shortest path example with nodes 1, . . . , 4, destination node 5, and the arc lengths shown; (b) the corresponding cost-to-go values Jk(i) for each node i and stage k produced by the DP recursion]

JN−1(i) = ait,   i = 1, 2, . . . , N,

Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2.


ESTIMATION / HIDDEN MARKOV MODELS

•  Markov chain with transition probabilities  pij

•   State transitions are hidden from view

•   For each transition, we get an (independent) observation

•  r(z; i, j): Prob. the observation takes value z when the state transition is from i to j

•   Trajectory estimation problem: Given the observation sequence ZN = {z1, z2, . . . , zN}, what is the “most likely” state transition sequence X̂N = {x̂0, x̂1, . . . , x̂N} [one that maximizes p(XN | ZN) over all XN = {x0, x1, . . . , xN}].

[Figure: trellis of hidden state transitions x0 → x1 → · · · → xN, drawn between an artificial origin node s and terminal node t]


VITERBI ALGORITHM

•   We have

p(XN | ZN) = p(XN, ZN) / p(ZN)

where p(XN, ZN) and p(ZN) are the unconditional probabilities of occurrence of (XN, ZN) and ZN

•   Maximizing p(XN | ZN) is equivalent to maximizing ln(p(XN, ZN))

•   We have

p(XN, ZN) = π_{x0} Π_{k=1}^{N} p_{xk−1 xk} r(zk; xk−1, xk)

so the problem is equivalent to

minimize  −ln(π_{x0}) − Σ_{k=1}^{N} ln( p_{xk−1 xk} r(zk; xk−1, xk) )

over all possible sequences {x0, x1, . . . , xN}.

•  This is a shortest path problem.
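A minimal Python sketch of the resulting shortest path (Viterbi) recursion; the chain, observation model, and observation sequence at the end are hypothetical.

    # Viterbi: propagate the lengths of the best paths to each state and backtrack.
    import math

    def viterbi(pi, p, r_obs, z):
        """pi[i]: prior of initial state i;  p[i][j]: transition prob;
           r_obs(z, i, j): prob of observation z on transition i -> j;
           z: observations z_1, ..., z_N.  Returns the most likely state sequence."""
        n = len(pi)
        D = [-math.log(pi[i]) for i in range(n)]      # best path lengths to x_0 = i
        back = []
        for zk in z:
            D_new, choices = [], []
            for j in range(n):
                cand = [(D[i] - math.log(p[i][j] * r_obs(zk, i, j)), i) for i in range(n)]
                best = min(cand)
                D_new.append(best[0]); choices.append(best[1])
            D, back = D_new, back + [choices]
        xN = min(range(n), key=lambda j: D[j])        # best final state
        seq = [xN]
        for choices in reversed(back):                # backtrack
            seq.append(choices[seq[-1]])
        return list(reversed(seq))                    # (x_0, x_1, ..., x_N)

    # Hypothetical two-state chain, two observation symbols
    pi = [0.6, 0.4]
    p = [[0.7, 0.3], [0.4, 0.6]]
    r_obs = lambda z, i, j: [[0.9, 0.1], [0.2, 0.8]][j][z]   # here it depends only on the new state j
    print(viterbi(pi, p, r_obs, z=[0, 1, 1]))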


GENERAL SHORTEST PATH ALGORITHMS

•   There are many non-DP shortest path algorithms. They can all be used to solve deterministic finite-state problems

•   They may be preferable to DP if they avoid calculating the optimal cost-to-go of EVERY state

•  This is essential for problems with HUGE state spaces. Such problems arise for example in combinatorial optimization

[Figure: tree of partial tours for a 4-city traveling salesman example; from origin node s = A, arcs with the shown costs lead through the partial tours AB, AC, AD to the complete tours (ABCD, ABDC, . . .), which are connected to an artificial terminal node t]


LABEL CORRECTING METHODS

•  Given: Origin s, destination t, lengths aij ≥ 0.

•   Idea is to progressively discover shorter paths from the origin s to every other node i

•   Notation:

−   di (label of i): Length of the shortest path found (initially ds = 0, di = ∞ for i ≠ s)

−  UPPER: The label dt of the destination

−   OPEN list: Contains nodes that are currently active in the sense that they are candidates for further examination (initially OPEN = {s})

Label Correcting Algorithm

Step 1 (Node Removal):  Remove a node i from OPEN and for each child j of i, do step 2

Step 2 (Node Insertion Test):  If di + aij < min{dj, UPPER}, set dj = di + aij and set i to be the parent of j. In addition, if j ≠ t, place j in OPEN if it is not already in OPEN, while if j = t, set UPPER to the new value di + ait of dt

Step 3 (Termination Test):  If OPEN is empty, terminate; else go to step 1
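A minimal Python sketch of the three steps above; removing the last node of OPEN (depth-first) is just one admissible choice, and the example graph is hypothetical.

    # Label correcting method for a shortest path from s to t with lengths a_ij >= 0.
    import math

    def label_correcting(arcs, s, t):
        """arcs[i]: list of (j, a_ij) pairs with a_ij >= 0."""
        d = {s: 0.0}                       # labels; a missing key means +infinity
        parent = {}
        UPPER = math.inf
        OPEN = [s]
        while OPEN:                        # Step 3: stop when OPEN is empty
            i = OPEN.pop()                 # Step 1: remove a node from OPEN
            for j, a_ij in arcs.get(i, []):
                dj_new = d[i] + a_ij
                if dj_new < min(d.get(j, math.inf), UPPER):   # Step 2: insertion test
                    d[j], parent[j] = dj_new, i
                    if j != t:
                        if j not in OPEN:
                            OPEN.append(j)
                    else:
                        UPPER = dj_new
        return UPPER, parent               # UPPER = shortest distance from s to t

    # Hypothetical graph
    arcs = {'s': [('a', 1), ('b', 4)], 'a': [('b', 1), ('t', 6)], 'b': [('t', 1)]}
    print(label_correcting(arcs, 's', 't'))   # shortest path s -> a -> b -> t, length 3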


VISUALIZATION/EXPLANATION

•   Given: Origin s, destination t, lengths aij ≥ 0

•   di (label of i): Length of the shortest path found thus far (initially ds = 0, di = ∞ for i ≠ s). The label di is implicitly associated with an s → i path

•  UPPER: The label  dt  of the destination

•   OPEN list: Contains “active” nodes (initiallyOPEN={s})

[Figure: flowchart of one iteration: a node i is removed from OPEN; for each child j, test whether di + aij < dj (is the path s → i → j better than the current path s → j?) and whether di + aij < UPPER (does the path s → i → j have a chance to be part of a shorter s → t path?); if both hold, set dj = di + aij and insert j into OPEN]


EXAMPLE

[Figure: the traveling salesman tree of the earlier slide, with its nodes numbered 1–10 as used in the table below]

Iter. No.   Node Exiting OPEN   OPEN after Iteration   UPPER
0           -                   1                      ∞
1           1                   2, 7, 10               ∞
2           2                   3, 5, 7, 10            ∞
3           3                   4, 5, 7, 10            ∞
4           4                   5, 7, 10               43
5           5                   6, 7, 10               43
6           6                   7, 10                  13
7           7                   8, 10                  13
8           8                   9, 10                  13
9           9                   10                     13
10          10                  Empty                  13

•   Note that some nodes never entered OPEN


VALIDITY OF LABEL CORRECTING METHOD

Proposition:  If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination

Proof:  (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j

(2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates

(3) Let (s, j1, j2, . . . , jk, t) be a shortest path and let d∗ be the shortest distance. If UPPER > d∗ at termination, UPPER will also be larger than the length of all the paths (s, j1, . . . , jm), m = 1, . . . , k, throughout the algorithm. Hence, node jk will never enter the OPEN list with djk equal to the shortest distance from s to jk. Similarly node jk−1 will never enter the OPEN list with djk−1 equal to the shortest distance from s to jk−1. Continue to j1 to get a contradiction


6.231 DYNAMIC PROGRAMMING

LECTURE 4

LECTURE OUTLINE

•   Deterministic continuous-time optimal control

•   Examples

•   Connection with the calculus of variations

•   The Hamilton-Jacobi-Bellman equation as acontinuous-time limit of the DP algorithm

•   The Hamilton-Jacobi-Bellman equation as asufficient condition

•   Examples


PROBLEM FORMULATION

•   Continuous-time dynamic system:

ẋ(t) = f(x(t), u(t)),   0 ≤ t ≤ T,   x(0): given,

where

−  x(t) ∈ ℜn: state vector at time t

−  u(t) ∈ U ⊂ ℜm: control vector at time t

−  U: control constraint set

−  T: terminal time

•  Admissible control trajectories {u(t) | t ∈ [0, T]}: piecewise continuous functions {u(t) | t ∈ [0, T]} with u(t) ∈ U for all t ∈ [0, T]; they uniquely determine {x(t) | t ∈ [0, T]}

•   Problem: Find an admissible control trajectory {u(t) | t ∈ [0, T]} and corresponding state trajectory {x(t) | t ∈ [0, T]} that minimize the cost

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

•  f, h, g are assumed continuously differentiable


EXAMPLE I

•   Motion control:  A unit mass moves on a line under the influence of a force u

•   x(t) = (x1(t), x2(t)): position and velocity of the mass at time t

•  Problem: From a given x1(0), x2(0), bring the mass “near” a given final position-velocity pair (x̄1, x̄2) at time T in the sense:

minimize  |x1(T) − x̄1|^2 + |x2(T) − x̄2|^2

subject to the control constraint

|u(t)| ≤ 1,   for all t ∈ [0, T]

•  The problem fits the framework with

ẋ1(t) = x2(t),   ẋ2(t) = u(t),

h(x(T)) = |x1(T) − x̄1|^2 + |x2(T) − x̄2|^2,

g(x(t), u(t)) = 0,   for all t ∈ [0, T]


EXAMPLE II

•  A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 − u(t) to production of a storable good. Thus x(t) evolves according to

ẋ(t) = γ u(t) x(t),

where γ > 0 is a given constant

•   The producer wants to maximize the total amount of product stored

∫_0^T (1 − u(t)) x(t) dt

subject to

0 ≤ u(t) ≤ 1,   for all t ∈ [0, T]

•   The initial production rate x(0) is a given positive number


EXAMPLE III (CALCULUS OF VARIATIONS)

[Figure: a curve x(t) from a given point (0, α) to a given vertical line at t = T, with slope ẋ(t) = u(t) and length ∫_0^T √(1 + (u(t))^2) dt]

•   Find a curve from a given point to a given line that has minimum length

•  The problem is

minimize  ∫_0^T √(1 + (ẋ(t))^2) dt

subject to  x(0) = α

•   Reformulation as an optimal control problem:

minimize  ∫_0^T √(1 + (u(t))^2) dt

subject to  ẋ(t) = u(t),   x(0) = α


HAMILTON-JACOBI-BELLMAN EQUATION I

•   We discretize [0, T] at times 0, δ, 2δ, . . . , Nδ, where δ = T/N, and we let

xk = x(kδ),   uk = u(kδ),   k = 0, 1, . . . , N

•   We also discretize the system and cost:

xk+1 = xk + f(xk, uk)·δ,       h(xN) + Σ_{k=0}^{N−1} g(xk, uk)·δ

•   We write the DP algorithm for the discretized problem

J∗(Nδ, x) = h(x),

J∗(kδ, x) = min_{u∈U} [ g(x, u)·δ + J∗((k+1)·δ, x + f(x, u)·δ) ].

•   Assume J∗ is differentiable and Taylor-expand:

J∗(kδ, x) = min_{u∈U} [ g(x, u)·δ + J∗(kδ, x) + ∇tJ∗(kδ, x)·δ + ∇xJ∗(kδ, x)′ f(x, u)·δ + o(δ) ]

•   Cancel J∗(kδ, x), divide by δ, and take the limit


HAMILTON-JACOBI-BELLMAN EQUATION II

•   Let J∗(t, x) be the optimal cost-to-go of the continuous problem. Assuming the limit is valid,

lim_{k→∞, δ→0, kδ=t} J∗(kδ, x) = J∗(t, x),   for all t, x,

we obtain for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇tJ∗(t, x) + ∇xJ∗(t, x)′ f(x, u) ]

with the boundary condition J∗(T, x) = h(x)

•   This is the Hamilton-Jacobi-Bellman (HJB) equation – a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J∗(t, x) (assuming J∗ is differentiable and the preceding informal limiting procedure is valid)

•   Hard to tell a priori if J∗(t, x) is differentiable

•  So we use the HJB Eq. as a verification tool; if we can solve it for a differentiable J∗(t, x), then:

−   J∗ is the optimal cost-to-go function

−   The control µ∗(t, x) that minimizes in the RHS for each (t, x) defines an optimal control


VERIFICATION/SUFFICIENCY THEOREM

•   Suppose V(t, x) is a solution to the HJB equation; that is, V is continuously differentiable in t and x, and is such that for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇tV(t, x) + ∇xV(t, x)′ f(x, u) ],

V(T, x) = h(x),   for all x

•  Suppose also that µ∗(t, x) attains the minimum above for all t and x

•  Let {x∗(t) | t ∈ [0, T]} and u∗(t) = µ∗(t, x∗(t)), t ∈ [0, T], be the corresponding state and control trajectories

•   Then

V(t, x) = J∗(t, x),   for all t, x,

and {u∗(t) | t ∈ [0, T]} is optimal


PROOF

Let {(u(t), x(t)) | t ∈ [0, T]} be any admissible control-state trajectory. We have for all t ∈ [0, T]

0 ≤ g(x(t), u(t)) + ∇tV(t, x(t)) + ∇xV(t, x(t))′ f(x(t), u(t)).

Using the system equation ẋ(t) = f(x(t), u(t)), the RHS of the above is equal to

g(x(t), u(t)) + (d/dt) V(t, x(t))

Integrating this expression over t ∈ [0, T],

0 ≤ ∫_0^T g(x(t), u(t)) dt + V(T, x(T)) − V(0, x(0)).

Using V(T, x) = h(x), we have

V(0, x(0)) ≤ h(x(T)) + ∫_0^T g(x(t), u(t)) dt.

If we use u∗(t) and x∗(t) in place of u(t) and x(t), the inequalities become equalities, and

V(0, x(0)) = h(x∗(T)) + ∫_0^T g(x∗(t), u∗(t)) dt


EXAMPLE OF THE HJB EQUATION

Consider the scalar system ẋ(t) = u(t), with |u(t)| ≤ 1 and cost (1/2)(x(T))^2. The HJB equation is

0 = min_{|u|≤1} [ ∇tV(t, x) + ∇xV(t, x)·u ],   for all t, x,

with the terminal condition V(T, x) = (1/2)x^2

•   Evident candidate for optimality: µ∗(t, x) = −sgn(x). Corresponding cost-to-go

J∗(t, x) = (1/2) ( max{ 0, |x| − (T − t) } )^2.

•  We verify that J∗ solves the HJB Eq., and that u = −sgn(x) attains the min in the RHS. Indeed,

∇tJ∗(t, x) = max{ 0, |x| − (T − t) },

∇xJ∗(t, x) = sgn(x) · max{ 0, |x| − (T − t) }.

Substituting, the HJB Eq. becomes

0 = min_{|u|≤1} [ 1 + sgn(x)·u ] max{ 0, |x| − (T − t) }


LINEAR QUADRATIC PROBLEM

Consider the n-dimensional linear system

ẋ(t) = Ax(t) + Bu(t),

and the quadratic cost

x(T)′Q_T x(T) + ∫_0^T ( x(t)′Qx(t) + u(t)′Ru(t) ) dt

The HJB equation is

0 = min_{u∈ℜm} [ x′Qx + u′Ru + ∇tV(t, x) + ∇xV(t, x)′(Ax + Bu) ],

with the terminal condition V(T, x) = x′Q_T x. We try a solution of the form

V(t, x) = x′K(t)x,   K(t): n × n symmetric,

and show that V(t, x) solves the HJB equation if

K̇(t) = −K(t)A − A′K(t) + K(t)BR^{−1}B′K(t) − Q

with the terminal condition K(T) = Q_T


6.231 DYNAMIC PROGRAMMING

LECTURE 5

LECTURE OUTLINE

•   Examples of stochastic DP problems

•   Linear-quadratic problems

•   Inventory control


LINEAR-QUADRATIC PROBLEMS

•   System:  xk+1 = Ak xk + Bk uk + wk

•   Quadratic cost

E_{wk, k=0,1,...,N−1} { x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0 (in the positive (semi)definite sense).

•   wk are independent and zero mean

•  DP algorithm:

JN(xN) = x′N QN xN,

Jk(xk) = min_{uk} E_{wk}{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }

•   Key facts:

−   Jk(xk) is quadratic

−   Optimal policy {µ∗0, . . . , µ∗N−1} is linear:

µ∗k(xk) = Lk xk

−  Similar treatment of a number of variants


DERIVATION

•   By induction verify that

µ∗k(xk) = Lk xk,     Jk(xk) = x′k Kk xk + constant,

where Lk are matrices given by

Lk = −(B′k Kk+1 Bk + Rk)^{−1} B′k Kk+1 Ak,

and where Kk are symmetric positive semidefinite matrices given by

KN = QN,

Kk = A′k ( Kk+1 − Kk+1 Bk (B′k Kk+1 Bk + Rk)^{−1} B′k Kk+1 ) Ak + Qk.

•  This is called the discrete-time Riccati equation.

•   Just like DP, it starts at the terminal time N and proceeds backwards.

•   Certainty equivalence holds (optimal policy is the same as when wk is replaced by its expected value E{wk} = 0).
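A small numpy sketch of this backward Riccati recursion and the gains Lk for a time-invariant problem; the double-integrator data are illustrative.

    # Discrete-time Riccati recursion: K_N = Q_N, then K_k and L_k backwards in time.
    import numpy as np

    def riccati_gains(A, B, Q, R, QN, N):
        K = QN
        gains = []
        for _ in range(N):                                 # k = N-1 down to 0
            S = B.T @ K @ B + R
            L = -np.linalg.solve(S, B.T @ K @ A)           # L_k = -(B'K B + R)^{-1} B'K A
            K = A.T @ (K - K @ B @ np.linalg.solve(S, B.T @ K)) @ A + Q
            gains.append(L)
        return K, list(reversed(gains))                    # K ~ K_0, gains[k] = L_k

    A = np.array([[1.0, 1.0], [0.0, 1.0]])                 # double integrator (illustrative)
    B = np.array([[0.0], [1.0]])
    Q = np.eye(2); R = np.array([[1.0]]); QN = np.eye(2)
    K0, L = riccati_gains(A, B, Q, R, QN, N=20)
    print(L[0])                                            # close to the steady-state gain for long horizons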


ASYMPTOTIC BEHAVIOR OF RICCATI EQ.

•   Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C) where Q = C′C

•  The Riccati equation converges: lim_{k→−∞} Kk = K, where K is pos. definite, and is the unique (within the class of pos. semidefinite matrices) solution of the algebraic Riccati equation

K = A′( K − KB(B′KB + R)^{−1}B′K )A + Q

•  The corresponding steady-state controller µ∗(x) = Lx, where

L = −(B′KB + R)^{−1}B′KA,

is stable in the sense that the matrix (A + BL) of the closed-loop system

xk+1 = (A + BL)xk + wk

satisfies lim_{k→∞}(A + BL)^k = 0.


GRAPHICAL PROOF FOR SCALAR SYSTEMS

[Figure: graph of F(P) against the 45-degree line; the iterates Pk, Pk+1 converge to the positive fixed point P∗; F has horizontal asymptote A^2 R/B^2 + Q and vertical asymptote at P = −R/B^2]

•   Riccati equation (with Pk = KN−k):

Pk+1 = A^2 ( Pk − B^2 Pk^2 / (B^2 Pk + R) ) + Q,

or Pk+1 = F(Pk), where

F(P) = A^2 R P / (B^2 P + R) + Q.

•   Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.


RANDOM SYSTEM MATRICES

•   Suppose that {A0, B0}, . . . , {AN−1, BN−1} are not known but rather are independent random matrices that are also independent of the wk

•   DP algorithm is

JN(xN) = x′N QN xN,

Jk(xk) = min_{uk} E_{wk, Ak, Bk}{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }

•   Optimal policy µ∗k(xk) = Lk xk, where

Lk = −( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak},

and where the matrices Kk are given by

KN = QN,

Kk = E{A′k Kk+1 Ak} − E{A′k Kk+1 Bk} ( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak} + Qk


PROPERTIES

•   Certainty equivalence may not hold

•  Riccati equation may not converge to a steady-state

[Figure: graph of F(P) against the 45-degree line for the random-matrices case, with vertical asymptote at P = −R/E{B^2}; when the term T below is large, F stays above the 45-degree line and the iteration diverges]

•   We have Pk+1 = F(Pk), where

F(P) = E{A^2} R P / ( E{B^2} P + R ) + Q + T P^2 / ( E{B^2} P + R ),

T = E{A^2} E{B^2} − (E{A})^2 (E{B})^2


INVENTORY CONTROL

•   xk: stock, uk: inventory purchased, wk: demand

xk+1 = xk + uk − wk,   k = 0, 1, . . . , N − 1

•   Minimize

E{ Σ_{k=0}^{N−1} ( c·uk + r(xk + uk − wk) ) }

where, for some p > 0 and h > 0,

r(x) = p max(0, −x) + h max(0, x)

•   DP algorithm:

JN(xN) = 0,

Jk(xk) = min_{uk≥0} [ c·uk + H(xk + uk) + E{ Jk+1(xk + uk − wk) } ],

where H(x + u) = E{r(x + u − w)}.


OPTIMAL POLICY

•   DP algorithm can be written as

JN(xN) = 0,

Jk(xk) = min_{uk≥0} Gk(xk + uk) − c·xk,

where

Gk(y) = c·y + H(y) + E{ Jk+1(y − w) }.

•  If Gk is convex and lim_{|x|→∞} Gk(x) → ∞, we have

µ∗k(xk) = Sk − xk   if xk < Sk,
          0             if xk ≥ Sk,

where Sk minimizes Gk(y).

•  This is shown, assuming that c < p, by showing that Jk is convex for all k, and

lim_{|x|→∞} Jk(x) → ∞
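A rough numerical sketch of this computation: Gk is evaluated on a grid of stock levels (with linear interpolation off the grid) and Sk is read off as its minimizer; all parameter values are illustrative.

    # Basestock levels S_k for the inventory problem, computed on a grid.
    import numpy as np

    c, p, h, N = 1.0, 3.0, 1.0, 8
    demand = np.arange(5); probs = np.full(5, 0.2)       # demand uniform on {0,...,4}
    r = lambda x: p * np.maximum(0, -x) + h * np.maximum(0, x)

    grid = np.arange(-10, 21)                            # possible stock levels x_k
    J = np.zeros_like(grid, dtype=float)                 # J_N = 0
    S_levels = []
    for k in range(N - 1, -1, -1):
        # G_k(y) = c y + H(y) + E{ J_{k+1}(y - w) }; np.interp clamps off-grid values
        G = np.array([c * y + np.sum(probs * (r(y - demand) +
                      np.interp(y - demand, grid, J))) for y in grid])
        S_levels.append(grid[np.argmin(G)])              # S_k minimizes G_k
        # J_k(x) = min_{u>=0} G_k(x+u) - c x = min_{y>=x} G_k(y) - c x
        J = np.array([G[grid >= x].min() - c * x for x in grid])
    print(list(reversed(S_levels)))                      # S_0, ..., S_{N-1}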


JUSTIFICATION

•   Graphical inductive proof that  J k   is convex.

[Figure: graphs of H(y) and cy + H(y), with SN−1 the minimizing point of cy + H(y), and the resulting convex function JN−1(xN−1)]


6.231 DYNAMIC PROGRAMMING

LECTURE 6

LECTURE OUTLINE

•   Stopping problems

•   Scheduling problems

•   Other applications


PURE STOPPING PROBLEMS

•   Two possible controls:

−   Stop (incur a one-time stopping cost, and move to a cost-free and absorbing stop state)

−   Continue [using xk+1 = fk(xk, wk) and incurring the cost-per-stage]

•  Each policy consists of a partition of the set of states xk into two regions:

−   Stop region, where we stop

−   Continue region, where we continue

[Figure: the state space partitioned into a STOP region and a CONTINUE region, with the absorbing stop state]


EXAMPLE: ASSET SELLING

•  A person has an asset, and at k = 0, 1, . . . , N − 1 receives a random offer wk

•   May accept wk and invest the money at fixed rate of interest r, or reject wk and wait for wk+1. Must accept the last offer wN−1

•  DP algorithm (xk: current offer, T: stop state):

JN(xN) = xN   if xN ≠ T,
         0     if xN = T,

Jk(xk) = max{ (1 + r)^{N−k} xk, E{Jk+1(wk)} }   if xk ≠ T,
         0                                       if xk = T.

•   Optimal policy:

accept the offer xk if xk > αk,

reject the offer xk if xk < αk,

where

αk = E{Jk+1(wk)} / (1 + r)^{N−k}.
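A small numerical sketch that runs this recursion and prints the thresholds αk, assuming (for illustration) offers uniformly distributed on [0, 1] approximated by a grid.

    # Acceptance thresholds alpha_k for the asset selling problem.
    import numpy as np

    N, r = 10, 0.05
    offers = np.linspace(0.0, 1.0, 101)                  # grid of possible offers w_k
    probs = np.full(offers.size, 1.0 / offers.size)

    J = offers.copy()                                    # J_N(x) = x for x != T
    alphas = []
    for k in range(N - 1, -1, -1):
        EJ = np.sum(probs * J)                           # E{ J_{k+1}(w) }
        alphas.append(EJ / (1.0 + r)**(N - k))           # alpha_k
        J = np.maximum((1.0 + r)**(N - k) * offers, EJ)  # J_k(x) = max{(1+r)^{N-k} x, E J_{k+1}(w)}
    print(list(reversed(alphas)))                        # alpha_0, ..., alpha_{N-1}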


FURTHER ANALYSIS

[Figure: the thresholds α1, α2, . . . , αN−1 plotted against k, decreasing with k and separating the ACCEPT region (offers above αk) from the REJECT region (offers below αk)]

•   Can show that  αk ≥ αk+1   for all  k

•   Proof: Let Vk(xk) = Jk(xk)/(1 + r)^{N−k} for xk ≠ T. Then the DP algorithm is VN(xN) = xN and

Vk(xk) = max{ xk, (1 + r)^{−1} E_w{ Vk+1(w) } }.

We have αk = E_w{Vk+1(w)}/(1 + r), so it is enough to show that Vk(x) ≥ Vk+1(x) for all x and k. Start with VN−1(x) ≥ VN(x) and use the monotonicity property of DP.

•   We can also show that αk → ā as k → −∞. This suggests that for an infinite horizon the optimal policy is stationary.


GENERAL STOPPING PROBLEMS

•   At time k, we may stop at cost t(xk) or choose a control uk ∈ U(xk) and continue

JN(xN) = t(xN),

Jk(xk) = min[ t(xk),  min_{uk∈U(xk)} E{ g(xk, uk, wk) + Jk+1(f(xk, uk, wk)) } ]

•   Optimal to stop at time k for states x in the set

Tk = { x | t(x) ≤ min_{u∈U(x)} E{ g(x, u, w) + Jk+1(f(x, u, w)) } }

•   Since JN−1(x) ≤ JN(x), we have Jk(x) ≤ Jk+1(x) for all k, so

T0 ⊂ · · · ⊂ Tk ⊂ Tk+1 ⊂ · · · ⊂ TN−1.

•  Interesting case is when all the Tk are equal (to TN−1, the set where it is better to stop than to go one step and stop). Can be shown to be true if

f(x, u, w) ∈ TN−1,   for all x ∈ TN−1, u ∈ U(x), w.


SCHEDULING PROBLEMS

•  Set of tasks to perform, the ordering is subject to optimal choice.

•   Costs depend on the order

•  There may be stochastic uncertainty, and precedence and resource availability constraints

•   Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)

•   Some special problems admit a simple quasi-analytical solution method

−   Optimal policy has an “index form”, i.e., each task has an easily calculable “cost index”, and it is optimal to select the task that has the minimum value of index (multiarmed bandit problems - to be discussed later)

−   Some problems can be solved by an “interchange argument” (start with some schedule, interchange two adjacent tasks, and see what happens)


EXAMPLE: THE QUIZ PROBLEM

•   Given a list of N questions. If question i is answered correctly (given probability pi), we receive reward Ri; if not, the quiz terminates. Choose order of questions to maximize expected reward.

•   Let i and j be the kth and (k + 1)st questions in an optimally ordered list

L = (i0, . . . , ik−1, i, j, ik+2, . . . , iN−1)

E{reward of L} = E{ reward of {i0, . . . , ik−1} }
      + p_{i0} · · · p_{ik−1} ( pi Ri + pi pj Rj )
      + p_{i0} · · · p_{ik−1} pi pj E{ reward of {ik+2, . . . , iN−1} }

Consider the list with i and j interchanged

L′ = (i0, . . . , ik−1, j, i, ik+2, . . . , iN−1)

Since L is optimal, E{reward of L} ≥ E{reward of L′}, so it follows that pi Ri + pi pj Rj ≥ pj Rj + pj pi Ri, or

pi Ri / (1 − pi) ≥ pj Rj / (1 − pj).
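A tiny sketch that checks the index rule on a small hypothetical instance by comparing it with brute-force enumeration of all orderings.

    # Quiz problem: ordering by decreasing p_i R_i / (1 - p_i) should maximize
    # the expected reward (up to ties).
    from itertools import permutations

    p = [0.9, 0.5, 0.7, 0.2]            # hypothetical success probabilities
    R = [1.0, 4.0, 2.0, 10.0]           # hypothetical rewards

    def expected_reward(order):
        total, prob = 0.0, 1.0
        for i in order:                 # answer questions until the first miss
            total += prob * p[i] * R[i]
            prob *= p[i]
        return total

    best = max(permutations(range(len(p))), key=expected_reward)
    index_rule = sorted(range(len(p)), key=lambda i: p[i] * R[i] / (1 - p[i]), reverse=True)
    print(best, index_rule, expected_reward(best))   # the two orderings should agree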


MINIMAX CONTROL

•  Consider the basic problem with the difference that the disturbance wk, instead of being random, is just known to belong to a given set Wk(xk, uk).

•   Find policy π that minimizes the cost

Jπ(x0) = max_{wk∈Wk(xk,µk(xk)), k=0,1,...,N−1} [ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) ]

•  The DP algorithm takes the form

JN(xN) = gN(xN),

Jk(xk) = min_{uk∈U(xk)} max_{wk∈Wk(xk,uk)} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ]

(Exercise 1.5 in the text, solution posted on the www).


UNKNOWN-BUT-BOUNDED CONTROL

•   For each k, keep the xk of the controlled system

xk+1 = fk(xk, µk(xk), wk)

inside a given set Xk, the target set at time k.

•   This is a minimax control problem, where the cost at stage k is

gk(xk) = 0 if xk ∈ Xk,   1 if xk ∉ Xk.

•  We must reach at time k the set

X̄k = { xk | Jk(xk) = 0 }

in order to be able to maintain the state within the subsequent target sets.

•  Start with X̄N = XN, and for k = 0, 1, . . . , N − 1,

X̄k = { xk ∈ Xk | there exists uk ∈ Uk(xk) such that fk(xk, uk, wk) ∈ X̄k+1, for all wk ∈ Wk(xk, uk) }
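A small Python sketch of this set recursion for a finite state space; the scalar example at the end (additive dynamics, bounded disturbance) is hypothetical.

    # Backward recursion of the sets X̄_k: keep the states from which some control
    # maps every possible disturbance into X̄_{k+1}.

    def target_tube(N, X, U, W, f):
        """X[k]: target set at time k; U(k, x): controls; W(k, x, u): disturbances;
           f(k, x, u, w): next state."""
        tube = [set(X[N])]                       # X̄_N = X_N
        for k in range(N - 1, -1, -1):
            nxt = tube[0]                        # X̄_{k+1}
            Xbar_k = {x for x in X[k]
                      if any(all(f(k, x, u, w) in nxt for w in W(k, x, u))
                             for u in U(k, x))}
            tube.insert(0, Xbar_k)
        return tube                              # tube[k] = X̄_k

    # Hypothetical scalar example: x_{k+1} = x_k + u_k + w_k, u in {-1,0,1}, |w| <= 1,
    # target sets X_k = {-2, ..., 2}
    N = 3
    X = [set(range(-2, 3)) for _ in range(N + 1)]
    tube = target_tube(N, X, lambda k, x: [-1, 0, 1], lambda k, x, u: [-1, 0, 1],
                       lambda k, x, u, w: x + u + w)
    print(tube[0])                               # states from which the whole tube can be tracked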


6.231 DYNAMIC PROGRAMMING

LECTURE 7

LECTURE OUTLINE

•   Problems with imperfect state info

•   Reduction to the perfect state info case

•   Linear quadratic problems

•   Separation of estimation and control


BASIC PROBL. W/ IMPERFECT STATE INFO

•   Same as basic problem of Chapter 1 with one difference: the controller, instead of knowing xk, receives at each time k an observation of the form

z0 = h0(x0, v0),   zk = hk(xk, uk−1, vk),   k ≥ 1

•   The observation zk belongs to some space Zk.

•  The random observation disturbance vk is characterized by a probability distribution

Pvk(· | xk, . . . , x0, uk−1, . . . , u0, wk−1, . . . , w0, vk−1, . . . , v0)

•   The initial state x0 is also random and characterized by a probability distribution Px0.

•   The probability distribution Pwk(· | xk, uk) of wk is given, and it may depend explicitly on xk and uk but not on w0, . . . , wk−1, v0, . . . , vk−1.

•   The control uk is constrained to a given subset Uk (this subset does not depend on xk, which is not assumed known).


INFORMATION VECTOR AND POLICIES

•   Denote by Ik the information vector, i.e., the information available at time k:

Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1),  k ≥ 1,      I0 = z0

•   We consider policies π = {µ0, µ1, . . . , µN−1}, where each function µk maps the information vector Ik into a control uk and

µk(Ik) ∈ Uk,   for all Ik, k ≥ 0

•  We want to find a policy π that minimizes

Jπ = E_{x0, wk, vk, k=0,...,N−1} { gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations

xk+1 = fk(xk, µk(Ik), wk),   k ≥ 0,

z0 = h0(x0, v0),   zk = hk(xk, µk−1(Ik−1), vk),   k ≥ 1


REFORMULATION AS PERFECT INFO PROBL.

•   We have

Ik+1 = (Ik, zk+1, uk),  k = 0, 1, . . . , N − 2,      I0 = z0

View this as a dynamic system with state Ik, control uk, and random disturbance zk+1

•   We have

P(zk+1 | Ik, uk) = P(zk+1 | Ik, uk, z0, z1, . . . , zk),

since z0, z1, . . . , zk are part of the information vector Ik. Thus the probability distribution of zk+1 depends explicitly only on the state Ik and control uk and not on the prior “disturbances” zk, . . . , z0

•   Write

E{ gk(xk, uk, wk) } = E{ E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk } }

so the cost per stage of the new system is

g̃k(Ik, uk) = E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk }


DP ALGORITHM

•   Writing the DP algorithm for the (reformulated) perfect state info problem and doing the algebra:

Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }

for k = 0, 1, . . . , N − 2, and for k = N − 1,

JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1} { gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 }

•   The optimal cost J∗ is given by

J∗ = E_{z0}{ J0(z0) }


LINEAR-QUADRATIC PROBLEMS

•   System:  xk+1 = Ak xk + Bk uk + wk

•   Quadratic cost

E_{wk, k=0,1,...,N−1} { x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0

•   Observations

zk = Ck xk + vk,   k = 0, 1, . . . , N − 1

•   w0, . . . , wN−1, v0, . . . , vN−1 indep. zero mean

•   Key fact to show:

−  Optimal policy {µ∗0, . . . , µ∗N−1} is of the form:

µ∗k(Ik) = Lk E{xk | Ik}

Lk: same as for the perfect state info case

−  Estimation problem and control problem can be solved separately


DP ALGORITHM I

•   Last stage N − 1 (suppressing index N − 1):

JN−1(IN−1) = min_{uN−1} E_{xN−1, wN−1} { x′N−1 Q xN−1 + u′N−1 R uN−1
    + (A xN−1 + B uN−1 + wN−1)′ Q (A xN−1 + B uN−1 + wN−1) | IN−1, uN−1 }

•   Since E{wN−1 | IN−1} = E{wN−1} = 0, the minimization involves

min_{uN−1} [ u′N−1(B′QB + R)uN−1 + 2 E{xN−1 | IN−1}′ A′QB uN−1 ]

The minimization yields the optimal µ∗N−1:

u∗N−1 = µ∗N−1(IN−1) = LN−1 E{xN−1 | IN−1}

where

LN−1 = −(B′QB + R)^{−1} B′QA


DP ALGORITHM II

•   Substituting in the DP algorithm

JN−1(IN−1) = E_{xN−1}{ x′N−1 KN−1 xN−1 | IN−1 }
    + E_{xN−1}{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−1 }
    + E_{wN−1}{ w′N−1 QN wN−1 },

where the matrices KN−1 and PN−1 are given by

PN−1 = A′N−1 QN BN−1 (RN−1 + B′N−1 QN BN−1)^{−1} B′N−1 QN AN−1,

KN−1 = A′N−1 QN AN−1 − PN−1 + QN−1

•   Note the structure of JN−1: in addition to the quadratic and constant terms, it involves a quadratic in the estimation error

xN−1 − E{xN−1 | IN−1}


DP ALGORITHM III

•   DP equation for period N − 2:

JN−2(IN−2) = min_{uN−2} E_{xN−2, wN−2, zN−1} { x′N−2 Q xN−2 + u′N−2 R uN−2 + JN−1(IN−1) | IN−2, uN−2 }

    = E{ x′N−2 Q xN−2 | IN−2 }
      + min_{uN−2} [ u′N−2 R uN−2 + E{ x′N−1 KN−1 xN−1 | IN−2, uN−2 }
      + E{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−2, uN−2 } ]
      + E_{wN−1}{ w′N−1 QN wN−1 }

•   Key point:  We have excluded the next to last term from the minimization with respect to uN−2

•  This term turns out to be independent of uN−2


QUALITY OF ESTIMATION LEMMA

•   For every k, there is a function Mk such that we have

xk − E{xk | Ik} = Mk(x0, w0, . . . , wk−1, v0, . . . , vk),

independently of the policy being used

•   Using the lemma,

xN−1 − E{xN−1 | IN−1} = ξN−1,

where

ξN−1: function of x0, w0, . . . , wN−2, v0, . . . , vN−1

•   Since ξN−1 is independent of uN−2, the conditional expectation of ξ′N−1 PN−1 ξN−1 satisfies

E{ξ′N−1 PN−1 ξN−1 | IN−2, uN−2} = E{ξ′N−1 PN−1 ξN−1 | IN−2}

and is independent of uN−2.

•  So minimization in the DP algorithm yields

u∗N−2 = µ∗N−2(IN−2) = LN−2 E{xN−2 | IN−2}


FINAL RESULT

•   Continuing similarly (using also the quality of estimation lemma)

µ∗k(Ik) = Lk E{xk | Ik},

where Lk is the same as for perfect state info:

Lk = −(Rk + B′k Kk+1 Bk)^{−1} B′k Kk+1 Ak,

with Kk generated using the Riccati equation:

KN = QN,      Kk = A′k Kk+1 Ak − Pk + Qk,

Pk = A′k Kk+1 Bk (Rk + B′k Kk+1 Bk)^{−1} B′k Kk+1 Ak

[Figure: block diagram of the optimal controller; an estimator uses the observations zk = Ck xk + vk and the delayed controls uk−1 to produce E{xk | Ik}, and the actuator applies uk = Lk E{xk | Ik} to the system xk+1 = Ak xk + Bk uk + wk]


SEPARATION INTERPRETATION

•  The optimal controller can be decomposed into

(a) An estimator, which uses the data to generate the conditional expectation E{xk | Ik}.

(b) An actuator, which multiplies E{xk | Ik} by the gain matrix Lk and applies the control input uk = Lk E{xk | Ik}.

•  Generically, the estimate x̂ of a random vector x given some information (random vector) I, which minimizes the mean squared error

E_x{ ‖x − x̂‖^2 | I } = E{ ‖x‖^2 | I } − 2 E{x | I}′ x̂ + ‖x̂‖^2

is x̂ = E{x | I} (set to zero the derivative with respect to x̂ of the above quadratic form).

•  The estimator portion of the optimal controller is optimal for the problem of estimating the state xk assuming the control is not subject to choice.

•   The actuator portion is optimal for the control problem assuming perfect state information.


STEADY STATE/IMPLEMENTATION ASPECTS

•   As N → ∞, the solution of the Riccati equation converges to a steady state and Lk → L.

•   If x0, wk, and vk are Gaussian, E{xk | Ik} is a linear function of Ik and is generated by a nice recursive algorithm, the Kalman filter.

•   The Kalman filter involves also a Riccati equation, so for N → ∞, and a stationary system, it also has a steady-state structure.

•   Thus, for Gaussian uncertainty, the solution is nice and possesses a steady state.

•  For non-Gaussian uncertainty, computing E{xk | Ik} may be very difficult, so a suboptimal solution is typically used.

•   Most common suboptimal controller: Replace E{xk | Ik} by the estimate produced by the Kalman filter (act as if x0, wk, and vk are Gaussian).

•   It can be shown that this controller is optimal within the class of controllers that are linear functions of Ik.


6.231 DYNAMIC PROGRAMMING

LECTURE 8

LECTURE OUTLINE

•   DP for imperfect state info

•   Sufficient statistics

•   Conditional state distribution as a sufficientstatistic

•   Finite-state systems

•   Examples


REVIEW: IMPERFECT STATE INFO PROBLEM

•   Instead of knowing xk, we receive observations

z0 = h0(x0, v0),   zk = hk(xk, uk−1, vk),   k ≥ 1

•   Ik: information vector available at time k:

I0 = z0,   Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1),   k ≥ 1

•  Optimization over policies π = {µ0, µ1, . . . , µN−1}, where µk(Ik) ∈ Uk, for all Ik and k.

•   Find a policy π that minimizes

Jπ = E_{x0, wk, vk, k=0,...,N−1} { gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations

xk+1 = fk(xk, µk(Ik), wk),   k ≥ 0,

z0 = h0(x0, v0),   zk = hk(xk, µk−1(Ik−1), vk),   k ≥ 1


DP ALGORITHM

•   DP algorithm:

Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }

for k = 0, 1, . . . , N − 2, and for k = N − 1,

JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1} { gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 }

•   The optimal cost J∗ is given by

J∗ = E_{z0}{ J0(z0) }.


SUFFICIENT STATISTICS

•   Suppose that we can find a function Sk(Ik) such that the right-hand side of the DP algorithm can be written in terms of some function Hk as

min_{uk∈Uk} Hk( Sk(Ik), uk ).

•  Such a function Sk is called a sufficient statistic.

•   An optimal policy obtained by the preceding minimization can be written as

µ∗k(Ik) = µ̄k( Sk(Ik) ),

where µ̄k is an appropriate function.

•   Example of a sufficient statistic:  Sk(Ik) = Ik

•  Another important sufficient statistic:  Sk(Ik) = P_{xk|Ik}


DP ALGORITHM IN TERMS OF P_{xk|Ik}

•  It turns out that P_{xk|Ik} is generated recursively by a dynamic system (estimator) of the form

P_{xk+1|Ik+1} = Φk( P_{xk|Ik}, uk, zk+1 )

for a suitable function Φk

•   DP algorithm can be written as

Jk(P_{xk|Ik}) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1( Φk(P_{xk|Ik}, uk, zk+1) ) | Ik, uk }

[Figure: block diagram; the system xk+1 = fk(xk, uk, wk) and measurement zk = hk(xk, uk−1, vk) feed an estimator Φk−1 that produces P_{xk|Ik}, and the actuator µk maps P_{xk|Ik} to the control uk]


EXAMPLE: A SEARCH PROBLEM

•   At each period, decide to search or not search a site that may contain a treasure.

•   If we search and a treasure is present, we find it with prob. β and remove it from the site.

•  Treasure’s worth: V. Cost of search: C

•  States: treasure present & treasure not present

•  Each search can be viewed as an observation of the state

•   Denote

pk : prob. of treasure present at the start of time k

with p0 given.

•   pk evolves at time k according to the equation

pk+1 = pk                                     if not search,
       0                                      if search and find treasure,
       pk(1 − β) / ( pk(1 − β) + 1 − pk )     if search and no treasure.


SEARCH PROBLEM (CONTINUED)

•   DP algorithm

Jk(pk) = max[ 0, −C + pk β V + (1 − pk β) Jk+1( pk(1 − β) / ( pk(1 − β) + 1 − pk ) ) ],

with JN(pN) = 0.

•   Can be shown by induction that the functions Jk satisfy

Jk(pk) = 0,   for all pk ≤ C / (βV)

•   Furthermore, it is optimal to search at period k if and only if

pk β V ≥ C

(expected reward from the next search ≥ the cost of the search)
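A short numerical sketch of this DP recursion on a grid of beliefs, confirming the threshold C/(βV); the parameter values are illustrative.

    # Treasure search: belief update and DP recursion on a belief grid.
    import numpy as np

    beta, V, C, N = 0.6, 10.0, 1.0, 5
    grid = np.linspace(0.0, 1.0, 201)

    def next_belief(p):
        return p * (1 - beta) / (p * (1 - beta) + 1 - p)   # search, treasure not found

    J = np.zeros_like(grid)                                # J_N(p) = 0
    for k in range(N - 1, -1, -1):
        cont = -C + grid * beta * V + (1 - grid * beta) * np.interp(next_belief(grid), grid, J)
        J = np.maximum(0.0, cont)                          # max{ not search, search }

    # It is optimal to search where the "search" term is positive; the smallest such
    # belief should be approximately C/(beta V)
    searched = grid[J > 0]
    print(searched.min() if searched.size else None, C / (beta * V))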


FINITE-STATE SYSTEMS - POMDP

•   Suppose the system is a finite-state Markov chain, with states 1, . . . , n.

•   Then the conditional probability distribution P_{xk|Ik} is a vector

( P(xk = 1 | Ik), . . . , P(xk = n | Ik) )

•  The DP algorithm can be executed over the n-dimensional simplex (state space is not expanding with increasing k)

•   When the control and observation spaces are also finite sets the problem is called a POMDP (Partially Observed Markov Decision Problem).

•   For POMDP it turns out that the cost-to-go functions Jk in the DP algorithm are piecewise linear and concave (Exercise 5.7).

•  This is conceptually important. It is also useful in practice because it forms the basis for approximations.


INSTRUCTION EXAMPLE

•   Teaching a student some item. Possible states are L: Item learned, or L̄: Item not learned.

•   Possible decisions: T: Terminate the instruction, or T̄: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.

•  The test has two possible outcomes: R: Student gives a correct answer, or R̄: Student gives an incorrect answer.

[Figure: probabilistic structure; the state moves from L̄ to L with probability t and stays at L̄ with probability 1 − t, while L is absorbing; the test result is R with probability 1 if the state is L, and R with probability r (R̄ with probability 1 − r) if the state is L̄]

•  Cost of instruction is I per period

•   Cost of terminating instruction: 0 if the student has learned the item, and C > 0 if not.


INSTRUCTION EXAMPLE II

•   Let pk: prob. the student has learned the item given the test results so far

pk = P(xk | Ik) = P(xk = L | z0, z1, . . . , zk).

•   Using Bayes’ rule we can obtain

pk+1 = Φ(pk, zk+1)
     = [1 − (1 − t)(1 − pk)] / [1 − (1 − t)(1 − r)(1 − pk)]   if zk+1 = R,
     = 0                                                       if zk+1 = R̄.

•   DP algorithm:

Jk(pk) = min[ (1 − pk)C,  I + E_{zk+1}{ Jk+1( Φ(pk, zk+1) ) } ],

starting with

JN−1(pN−1) = min[ (1 − pN−1)C,  I + (1 − t)(1 − pN−1)C ].


INSTRUCTION EXAMPLE III

•   Write the DP algorithm as

Jk(pk) = min[ (1 − pk)C,  I + Ak(pk) ],

where

Ak(pk) = P(zk+1 = R | Ik) Jk+1( Φ(pk, R) ) + P(zk+1 = R̄ | Ik) Jk+1( Φ(pk, R̄) )

•  Can show by induction that Ak(p) are piecewise linear, concave, monotonically decreasing, with

Ak−1( p) ≤ Ak( p) ≤ Ak+1( p),   for all p ∈ [0, 1].

[Figure: the curves I + AN−1(p), I + AN−2(p), I + AN−3(p) plotted against p ∈ [0, 1], together with the termination cost; their crossing points aN−1, aN−2, aN−3 give the thresholds at which termination becomes optimal.]


6.231 DYNAMIC PROGRAMMING

LECTURE 9

LECTURE OUTLINE

•   Suboptimal control

•   Cost approximation methods: Classification

•   Certainty equivalent control: An example

•   Limited lookahead policies

•   Performance bounds

•   Problem approximation approach

•   Parametric cost-to-go approximation


PRACTICAL DIFFICULTIES OF DP

•   The curse of modeling

•   The curse of dimensionality

−   Exponential growth of the computational and storage requirements as the number of state variables and control variables increases

−   Quick explosion of the number of states in combinatorial problems

−  Intractability of imperfect state information problems

•  There may be real-time solution constraints

−  A family of problems may be addressed. The data of the problem to be solved is given with little advance notice

−  The problem data may change as the system is controlled – need for on-line replanning


COST-TO-GO FUNCTION APPROXIMATION

•   Use a policy computed from the DP equation where the optimal cost-to-go function  Jk+1  is replaced by an approximation  J̃k+1.  (Sometimes  E{gk}  is also replaced by an approximation.)

•   Apply  µk(xk), which attains the minimum in

min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

•   There are several ways to compute  J̃k+1:

−  Off-line approximation: The entire function  J̃k+1 is computed for every k, before the control process begins.

−   On-line approximation: Only the values J̃k+1(xk+1) at the relevant next states  xk+1 are computed and used to compute  uk  just after the current state  xk  becomes known.

−   Simulation-based methods: These are off-line and on-line methods that share the common characteristic that they are based on Monte-Carlo simulation. Some of these methods are suitable for problems of very large size.


CERTAINTY EQUIVALENT CONTROL (CEC)

•   Idea:   Replace the stochastic problem with a deterministic problem

•   At each time  k, the uncertain quantities are fixed at some “typical” values

•  On-line implementation   for a perfect state

info problem. At each time  k:

(1) Fix the  wi,  i ≥ k, at some  w̄i. Solve the deterministic problem:

minimize   gN(xN) + Σ_{i=k}^{N−1} gi(xi, ui, w̄i)

where xk  is known, and  ui ∈ Ui,  xi+1 = fi(xi, ui, w̄i).

(2) Use as control the first element in the optimal control sequence found.

•   So we apply  µk(xk)  that minimizes

gk(xk, uk, w̄k) + J̃k+1( fk(xk, uk, w̄k) ),

where  J̃k+1  is the optimal cost of the corresponding deterministic problem.
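
•   A minimal Python sketch (not from the text) of on-line CEC for a small finite control set; f, g, g_N, controls, and the nominal disturbances w_bar are user-supplied assumptions, and the brute-force search over control sequences is only meant to illustrate the idea:

import itertools

def cec_control(x_k, k, N, f, g, g_N, controls, w_bar):
    best_cost, best_u = float("inf"), None
    for seq in itertools.product(controls, repeat=N - k):
        x, cost = x_k, 0.0
        for i, u in enumerate(seq):                 # simulate with nominal disturbances
            cost += g(k + i, x, u, w_bar[k + i])
            x = f(k + i, x, u, w_bar[k + i])
        cost += g_N(x)
        if cost < best_cost:
            best_cost, best_u = cost, seq[0]
    return best_u                                   # apply only the first control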


ALTERNATIVE OFF-LINE IMPLEMENTATION

•   Let  {µd0(x0), . . . , µdN−1(xN−1)}  be an optimal controller obtained from the DP algorithm for the deterministic problem

minimize   gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), w̄k )

subject to   xk+1 = fk( xk, µk(xk), w̄k ),   µk(xk) ∈ Uk

•   The CEC applies at time  k  the control input  µdk(xk).

•   In an imperfect info version,  xk  is replaced by an estimate  x̄k(Ik).

[Block diagram: system  xk+1 = fk(xk, uk, wk)  with measurement  zk = hk(xk, uk−1, vk); a delay/estimator produces  x̄k(Ik)  from the measurements, and the actuator applies  uk = µdk( x̄k(Ik) ).]


PARTIALLY STOCHASTIC CEC

•  Instead of fixing all  future disturbances to their typical values, fix only some, and treat the rest as stochastic.

•   Important special case:  Treat an imperfect state information problem as one of perfect state information, using an estimate  x̄k(Ik) of  xk  as if it were exact.

•  Multiaccess communication example:  Consider controlling the slotted Aloha system (Example 5.1.1 in the text) by optimally choosing the probability of transmission of waiting packets. This is a hard problem of imperfect state info, whose perfect state info version is easy.

•   Natural partially stochastic CEC:

µk(Ik) = min[ 1,  1 / x̄k(Ik) ],

where  x̄k(Ik) is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is  Ik).


GENERAL COST-TO-GO APPROXIMATION

•   One-step lookahead (1SL) policy: At each k  and state  xk, use the control  µk(xk) that attains the minimum in

min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where

−  J̃N = gN.

−  J̃k+1: approximation to the true cost-to-go Jk+1

•   Two-step lookahead policy: At each  k  and  xk, use the control µk(xk) attaining the minimum above, where the function  J̃k+1  is obtained using a 1SL approximation (solve a 2-step DP problem).

•   If  J̃k+1  is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.

•  Sometimes one also replaces  Uk(xk) above with a subset of “most promising controls”  Ūk(xk).

•   As the length of lookahead increases, the required computation quickly explodes.
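
•   A minimal Python sketch (not from the text) of a one-step lookahead policy, with the expectation replaced by a sample average over user-supplied disturbance samples; all arguments are assumptions:

def one_step_lookahead(x_k, k, controls, f, g, J_tilde, w_samples):
    def q_value(u):
        # Monte-Carlo estimate of E{ g_k(x,u,w) + J_tilde_{k+1}(f_k(x,u,w)) }
        return sum(g(k, x_k, u, w) + J_tilde(k + 1, f(k, x_k, u, w))
                   for w in w_samples) / len(w_samples)
    return min(controls, key=q_value)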


PERFORMANCE BOUNDS FOR 1SL

•   Let J̄k(xk) be the cost-to-go from (xk, k) of the 1SL policy, based on functions  J̃k.

•   Assume that for all (xk, k), we have

Ĵk(xk) ≤ J̃k(xk),   (*)

where  ĴN = gN  and for all  k,

Ĵk(xk) = min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

[so  Ĵk(xk) is computed along with  µ̄k(xk)]. Then

J̄k(xk) ≤ Ĵk(xk),   for all (xk, k).

•  Important application:   When  J̃k  is the cost-to-go of some heuristic policy (then the 1SL policy is called the  rollout  policy).

•   The bound can be extended to the case where there is a  δk  in the RHS of (*). Then

J̄k(xk) ≤ Ĵk(xk) + δk + · · · + δN−1


COMPUTATIONAL ASPECTS

•  Sometimes nonlinear programming can be used to calculate the 1SL or the multistep version [particularly when  Uk(xk) is not a discrete set]. Connection with stochastic programming methods.

•   The choice of the approximating functions  J̃k  is critical, and they are calculated in a variety of ways.

•   Some approaches:

(a)   Problem Approximation: Approximate the optimal cost-to-go with some cost derived from a related but simpler problem

(b)  Parametric Cost-to-Go Approximation: Approximate the optimal cost-to-go with a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (Neuro-Dynamic Programming)

(c)   Rollout Approach: Approximate the optimal cost-to-go with the cost of some suboptimal policy, which is calculated either analytically or by simulation


PROBLEM APPROXIMATION

•   Many (problem-dependent) possibilities

−  Replace uncertain quantities by nominal values, or simplify the calculation of expected values by limited simulation

−  Simplify difficult constraints or dynamics

•   Example of   enforced decomposition: Route  m  vehicles that move over a graph. Each node has a “value.” The first vehicle that passes through the node collects its value. Max the total collected value, subject to initial and final time constraints (plus time windows and other constraints).

•   Usually the 1-vehicle version of the problem is much simpler. This motivates an approximation obtained by solving single vehicle problems.

•   1SL scheme: At time  k  and state  xk  (position of vehicles and “collected value nodes”), consider all possible  kth moves by the vehicles, and at the resulting states we approximate the optimal value-to-go with the value collected by optimizing the vehicle routes one-at-a-time


PARAMETRIC COST-TO-GO APPROXIMATION

•  Use a cost-to-go approximation from a parametric class  J̃(x, r) where  x  is the current state and  r = (r1, . . . , rm) is a vector of “tunable” scalars (weights).

•   By adjusting the weights, one can change the “shape” of the approximation  J̃  so that it is reasonably close to the true optimal cost-to-go function.

•   Two key issues:

−   The choice of the parametric class  J̃(x, r) (the approximation architecture).

−   The method for tuning the weights (“training” the architecture).

•  Successful application strongly depends on how these issues are handled, and on insight about the problem.

•   Sometimes a simulator is used, particularly when there is no mathematical model of the system.


APPROXIMATION ARCHITECTURES

•   Divided in linear and nonlinear [i.e., linear or nonlinear dependence of  J̃(x, r) on  r].

•  Linear architectures are easier to train, but non-linear ones (e.g., neural networks) are richer.

• Architectures based on feature extraction

[Diagram: state x → feature extraction mapping → feature vector y → cost approximator with parameter vector r → cost approximation J̃(y, r)]

•  Ideally, the features will encode much of the nonlinearity that is inherent in the cost-to-go function being approximated, and the approximation may be quite accurate without a complicated architecture.

•   Sometimes the state space is partitioned, and “local” features are introduced for each subset of the partition (they are 0 outside the subset).

•  With a well-chosen feature vector  y(x), we can use a linear architecture

J̃(x, r) = Ĵ( y(x), r ) = Σi ri yi(x)


AN EXAMPLE - COMPUTER CHESS

•  Programs use a feature-based position evaluator that assigns a score to each move/position

[Diagram: position evaluator = feature extraction (features: material balance, mobility, safety, etc.) followed by a weighting of features that produces a score]

•   Many context-dependent special features.

•   Most often the weighting of features is linear but multistep lookahead is involved.

•   Most often the training is done by trial and error.


6.231 DYNAMIC PROGRAMMING

LECTURE 10

LECTURE OUTLINE

•   Rollout algorithms

•   Cost improvement property

•  Discrete deterministic problems

•  Approximations of rollout algorithms

•  Discretization of continuous time

•   Discretization of continuous space

•   Other suboptimal approaches


ROLLOUT ALGORITHMS

•   One-step lookahead policy:  At each  k  and state xk, use the control  µk(xk) that attains the minimum in

min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where

−  J̃N = gN.

−  J̃k+1: approximation to the true cost-to-go Jk+1

•  Rollout algorithm: When  J̃k  is the cost-to-go of some heuristic policy (called the  base policy)

•  Cost improvement property (to be shown): The rollout algorithm achieves no worse (and usually much better) cost than the base heuristic starting from the same state.

•   Main difficulty: Calculating  J̃k(xk) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.

−   May involve Monte Carlo simulation if the problem is stochastic.

−  Things improve in the deterministic case.
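
•   A minimal Python sketch (not from the text) of a rollout control computed with Monte-Carlo simulation of a user-supplied base policy; all function arguments (f, g, g_N, base_policy, sample_w) are assumptions:

def rollout_control(x_k, k, N, controls, f, g, g_N, base_policy, sample_w, n_sims=100):
    def heuristic_cost(x, start):
        total = 0.0
        for _ in range(n_sims):                     # simulate the base policy to the end
            xs, c = x, 0.0
            for t in range(start, N):
                u = base_policy(t, xs)
                w = sample_w(t, xs, u)
                c += g(t, xs, u, w)
                xs = f(t, xs, u, w)
            total += c + g_N(xs)
        return total / n_sims

    def q_value(u):
        total = 0.0
        for _ in range(n_sims):                     # one step with u, then the heuristic
            w = sample_w(k, x_k, u)
            total += g(k, x_k, u, w) + heuristic_cost(f(k, x_k, u, w), k + 1)
        return total / n_sims

    return min(controls, key=q_value)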


EXAMPLE: THE QUIZ PROBLEM

•   A person is given  N  questions; answering correctly question  i  has probability  pi, reward  vi. The quiz terminates at the first incorrect answer.

•   Problem: Choose the ordering of questions so as to maximize the total expected reward.

•   Assuming no other constraints, it is optimal to use the  index policy: Answer questions in decreasing order of  pi vi/(1 − pi).

•   With minor changes in the problem, the index policy need not be optimal. Examples:

−   A limit (< N) on the maximum number of questions that can be answered.

−   Time windows, sequence-dependent rewards, precedence constraints.

•  Rollout with the index policy as base policy: Convenient because at a given state (subset of questions already answered), the index policy and its expected reward can be easily calculated.

•   Very effective for solving the quiz problem and important generalizations in scheduling (see Bertsekas and Castanon, J. of Heuristics, Vol. 5, 1999).


COST IMPROVEMENT PROPERTY

•   Let

J̄k(xk): Cost-to-go of the rollout policy

Hk(xk): Cost-to-go of the base policy

•   We claim that  J̄k(xk) ≤ Hk(xk) for all  xk,  k

•  Proof by induction: We have J̄N(xN) = HN(xN) for all  xN. Assume that

J̄k+1(xk+1) ≤ Hk+1(xk+1),   ∀ xk+1.

Then, for all  xk,

J̄k(xk) = E{ gk( xk, µ̄k(xk), wk ) + J̄k+1( fk( xk, µ̄k(xk), wk ) ) }

≤ E{ gk( xk, µ̄k(xk), wk ) + Hk+1( fk( xk, µ̄k(xk), wk ) ) }

≤ E{ gk( xk, µk(xk), wk ) + Hk+1( fk( xk, µk(xk), wk ) ) }

= Hk(xk)

−  Induction hypothesis ==>  1st inequality

−  Min selection of  µ̄k(xk) ==> 2nd inequality

−   Definition of  Hk, µk  ==>  last equality


DISCRETE DETERMINISTIC PROBLEMS

•   Any discrete optimization problem (with a finite number of choices/feasible solutions) can be represented sequentially by breaking down the decision process into stages.

•  A tree/shortest path representation. The leaves of the tree correspond to the feasible solutions.

•  Decisions can be made in stages.

−  May complete partial solutions, one stage at a time.

−   May apply rollout with any heuristic that can complete a partial solution.

−  No costly stochastic simulation needed.

•   Example: Traveling salesman problem. Find a minimum cost tour that goes exactly once through each of  N  cities.

[Tree diagram: traveling salesman problem with four cities A, B, C, D; origin node A branches to partial tours AB, AC, AD, then to ABC, ABD, ACB, ACD, ADB, ADC, with the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB at the leaves.]


EXAMPLE: THE BREAKTHROUGH PROBLEM

[Figure: binary tree with N stages growing from the root; blocked arcs are crossed out, and a free root-to-leaf path is shown with thick lines.]

•  Given a binary tree with  N   stages.

•  Each arc is either free or is blocked (crossed out in the figure).

•  Problem: Find a free path from the root to the leaves (such as the one shown with thick lines).

•   Base heuristic (greedy): Follow the right branch if free; else follow the left branch if free.

•   This is a rare rollout instance that admits a detailed analysis.

•   For large  N  and given prob. of a free branch: the rollout algorithm requires  O(N) times more computation, but has  O(N) times larger prob. of finding a free path than the greedy algorithm.


DET. EXAMPLE: ONE-DIMENSIONAL WALK

•   A person takes either a unit step to the left or a unit step to the right. Minimize the cost  g(i) of the point  i  where he will end up after  N  steps.

[Figure: the cost  g(i)  plotted over the possible final positions  i ∈ {−N, . . . , N}; the walk is depicted on a grid of (stage, position) pairs from (0, 0) out to stage N, between (N, −N) and (N, N).]

•  Base heuristic: Always go to the right. Rollout finds the rightmost  local minimum.

•   Base heuristic: Compare always going to the right and always going to the left. Choose the best of the two. Rollout finds a  global minimum.


ROLLING HORIZON APPROACH

•   This is an  l-step lookahead policy where the cost-to-go approximation is just 0.

•   Alternatively, the cost-to-go approximation is the terminal cost function  gN.

•  A short rolling horizon saves computation.

•   “Paradox”: It is not true that a longer rolling horizon always improves performance.

•   Example:   At the initial state, there are two controls available (1 and 2). At every other state, there is only one control.

[Figure: from the current state, control 1 leads to the optimal trajectory (high cost over the first  l  stages, low cost afterwards), while control 2 leads to low cost within the  l-stage horizon but high cost beyond it.]


ROLLING HORIZON WITH ROLLOUT

•   We can use a rolling horizon approximation in calculating the cost-to-go of the base heuristic.

•   Because the heuristic is suboptimal, the rationale for a long rolling horizon becomes weaker.

•  Example: N-stage stopping problem where the stopping cost is 0, the continuation cost is either  −ε  or 1, where 0 < ε << 1, and the first state with continuation cost equal to 1 is state m. Then the optimal policy is to stop at state  m, and the optimal cost is  −mε.

[Figure: states 0, 1, 2, . . . , m, . . . , N in a line, with continuation cost −ε before state m and cost 1 from state m on; from every state there is a transition to a stopped state.]

•  Consider the heuristic that continues at every state, and the rollout policy that is based on this heuristic, with a rolling horizon of  l ≤ m  steps.

•  It will continue up to the first  m − l + 1 stages, thus compiling a cost of  −(m − l + 1)ε. The rollout performance improves as  l  becomes shorter!

•   Limited vision may work to our advantage!


MODEL PREDICTIVE CONTROL (MPC)

•  Special case of rollout for controlling linear deterministic systems (extensions to nonlinear/stochastic are similar)

−   System:   xk+1 = Axk + Buk

−   Quadratic cost per stage:   x′kQxk + u′kRuk

−   Constraints:   xk ∈ X,  uk ∈ U(xk)

•   Assumption: For any x0 ∈ X there is a feasible state-control sequence that brings the system to 0 in  m  steps, i.e.,  xm = 0

• MPC at state xk  solves an m-step optimal control problem with constraint  xk+m = 0, i.e., finds a sequence uk, . . . , uk+m−1  that minimizes

Σ_{ℓ=0}^{m−1} ( x′k+ℓ Q xk+ℓ + u′k+ℓ R uk+ℓ )

subject to  xk+m = 0

•   Then applies the first control uk  (and repeats at the next state  xk+1)

•   MPC is rollout with the heuristic derived from the corresponding (m − 1)-step optimal control problem

•   Key Property of MPC:   Since the heuristic is stable, the rollout is also stable (by the policy improvement property).
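
•   A minimal Python sketch (not from the text) of the m-step MPC problem, assuming the cvxpy package is available; the state/control constraints X and U(xk) are omitted for brevity:

import numpy as np
import cvxpy as cp

def mpc_control(x_k, A, B, Q, R, m):
    n, p = B.shape
    x = cp.Variable((m + 1, n))
    u = cp.Variable((m, p))
    constraints = [x[0] == x_k, x[m] == 0]          # terminal constraint x_{k+m} = 0
    cost = 0
    for t in range(m):
        constraints.append(x[t + 1] == A @ x[t] + B @ u[t])
        cost += cp.quad_form(x[t], Q) + cp.quad_form(u[t], R)
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[0]          # apply only the first control, then re-solve at x_{k+1}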


DISCRETIZATION

•   If the state space and/or control space is continuous/infinite, it must be replaced by a finite discretization.

•   Need for consistency, i.e., as the discretization becomes finer, the cost-to-go functions of the discretized problem converge to those of the continuous problem.

•   Pitfall with discretizing continuous time: The control constraint set may change a lot as we pass to the discrete-time approximation.

•   Example:  Let

ẋ(t) = u(t),

with control constraint u(t) ∈ {−1, 1}. The reachable states after time  δ  are  x(t + δ) = x(t) + u,  with  u ∈ [−δ, δ].

•   Compare it with the reachable states after we discretize the system naively:   x(t + δ) = x(t) + δu(t),  with  u(t) ∈ {−1, 1}.

•   “Convexification effect” of continuous time: a discrete control constraint set in continuous-time differential systems is equivalent to a continuous control constraint set when the system is looked at discrete times.


SPACE DISCRETIZATION I

•   Given a discrete-time system with state space S, consider a finite subset  S̄; for example  S̄  could be a finite grid within a continuous state space  S.

•   Difficulty:   f(x, u, w) ∉ S̄   for  x ∈ S̄.

•  We define an approximation to the original problem, with state space  S̄, as follows:

•   Express each  x ∈ S  as a convex combination of states in  S̄, i.e.,

x = Σ_{xi∈S̄} γi(x) xi,   where  γi(x) ≥ 0,  Σi γi(x) = 1

•   Define a “reduced” dynamic system with state space  S̄, whereby from each  xi ∈ S̄  we move to  x = f(xi, u, w)  according to the system equation of the original problem, and then move to  xj ∈ S̄  with probabilities  γj(x).

•  Define similarly the corresponding cost per stage of the transitions of the reduced system.

•  Note application to finite-state POMDP (Par-

tially Observed Markov Decision Problems)
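
•   A minimal Python sketch (not from the text) of one way to choose the coefficients γi(x) on a one-dimensional grid S̄, by linear interpolation between the two neighboring grid points:

import numpy as np

def gamma_coefficients(x, grid):
    # grid: increasing 1-D array of grid states; returns {grid index: weight},
    # with the weights nonnegative and summing to 1
    j = np.searchsorted(grid, x)
    if j == 0:
        return {0: 1.0}
    if j == len(grid):
        return {len(grid) - 1: 1.0}
    lam = (x - grid[j - 1]) / (grid[j] - grid[j - 1])
    return {j - 1: 1.0 - lam, j: lam}

print(gamma_coefficients(0.37, np.linspace(0, 1, 11)))   # ≈ {3: 0.3, 4: 0.7}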


SPACE DISCRETIZATION II

•   Let J̄k(xi) be the optimal cost-to-go of the “reduced” problem from each state  xi ∈ S̄  and time  k  onward.

•  Approximate the optimal cost-to-go of any  x ∈ S  for the original problem by

J̃k(x) = Σ_{xi∈S̄} γi(x) J̄k(xi),

and use one-step lookahead based on  J̃k.

•   The choice of coefficients  γi(x) is in principle arbitrary, but should aim at consistency, i.e., as the number of states in  S̄  increases,  J̃k(x) should converge to the optimal cost-to-go of the original problem.

•   Interesting observation:  While the original problem may be deterministic, the reduced problem is always stochastic.

•   Generalization:   The set  S̄  may be any finite set (not a subset of  S) as long as the coefficients  γi(x) admit a meaningful interpretation that quantifies the degree of association of  x  with  xi.


OTHER SUBOPTIMAL APPROACHES

•   Minimize the DP equation error: Approximate the optimal cost-to-go functions Jk(xk) with J̃k(xk, rk), where rk is a parameter vector, chosen to minimize some form of error in the DP equations

−  Can be done sequentially going backwards in time (approximate  J̃k  using an approximation of  J̃k+1).

•   Direct approximation of control policies: For a subset of states  xi,  i = 1, . . . , m, find

µ̃k(xi) = arg min_{uk∈Uk(xi)}  E{ g(xi, uk, wk) + J̃k+1( fk(xi, uk, wk), rk+1 ) }.

Then find µ̃k(xk, sk),  where  sk  is a vector of parameters obtained by solving the problem

min_s Σ_{i=1}^m || µ̃k(xi) − µ̃k(xi, s) ||².

•   Approximation in policy space:   Do not bother with cost-to-go approximations. Parametrize the policies as µ̃k(xk, sk),  and minimize the cost function of the problem over the parameters  sk.


6.231 DYNAMIC PROGRAMMING

LECTURE 11

LECTURE OUTLINE

•   Infinite horizon problems

•   Stochastic shortest path problems

•   Bellman’s equation

•  Dynamic programming – value iteration

•   Discounted problems as special case of SSP


TYPES OF INFINITE HORIZON PROBLEMS

•  Same as the basic problem, but:

−  The number of stages is infinite.

−  The system is stationary.

•   Total cost problems: Minimize

Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) }

−   Stochastic shortest path problems (α = 1, finite-state system with a termination state)

−   Discounted problems (α < 1, bounded cost per stage)

−  Discounted and undiscounted problems with unbounded cost per stage

•   Average cost problems

lim_{N→∞} (1/N) E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} g( xk, µk(xk), wk ) }

•  Infinite horizon characteristics: Challenging analysis, elegance of solutions and algorithms


PREVIEW OF INFINITE HORIZON RESULTS

•   Key issue:   The relation between the infinite and finite horizon optimal cost-to-go functions.

•   Illustration:  Let  α = 1 and  JN(x) denote the optimal cost of the  N-stage problem, generated after  N  DP iterations, starting from  J0(x) ≡ 0

Jk+1(x) = min_{u∈U(x)} E_w{ g(x, u, w) + Jk( f(x, u, w) ) },  ∀ x

•   Typical results for total cost problems:

−   Convergence of the DP algorithm (value iteration):

J∗(x) = lim_{N→∞} JN(x),  ∀ x

−  Bellman’s Equation holds for all  x:

J∗(x) = min_{u∈U(x)} E_w{ g(x, u, w) + J∗( f(x, u, w) ) }

−   If  µ(x) minimizes in Bellman’s Eq., the pol-icy {µ , µ , . . .}  is optimal.

•   Bellman’s Eq. holds for all types of problems. The other results are true for SSP (and bounded/discounted; unusual exceptions for other problems).


STOCHASTIC SHORTEST PATH PROBLEMS

•  Assume a finite-state system: States 1, . . . , n and a special cost-free termination state  t

−  Transition probabilities  pij(u)

−   Control constraints  u ∈ U (i)

−   Cost of policy  π = {µ0, µ1, . . .}  is

Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} g( xk, µk(xk) )  |  x0 = i }

− Optimal policy if  J π(i) = J ∗(i) for all  i.

−  Special notation: For stationary policies π  ={µ , µ , . . .}, we use  J µ(i) in place of  J π(i).

•  Assumption (Termination inevitable): There exists an integer  m  such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than  m  stages; for all  π, we have

ρπ = max_{i=1,...,n} P{ xm ≠ t | x0 = i, π } < 1


FINITENESS OF POLICY COST FUNCTIONS

•   Let

ρ = max_π ρπ.

Note that  ρπ  depends only on the first  m  components of the policy  π, so that  ρ < 1.

•  For any  π  and any initial state  i

P{ x2m ≠ t | x0 = i, π } = P{ x2m ≠ t | xm ≠ t, x0 = i, π } × P{ xm ≠ t | x0 = i, π } ≤ ρ²

and similarly

P{ xkm ≠ t | x0 = i, π } ≤ ρ^k,   i = 1, . . . , n

•   So

E{ cost between times km and (k + 1)m − 1 } ≤ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)|

and

|Jπ(i)| ≤ Σ_{k=0}^∞ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)| = ( m/(1 − ρ) ) max_{i=1,...,n, u∈U(i)} |g(i, u)|


MAIN RESULT

•   Given any initial conditions   J 0(1), . . . , J  0(n),the sequence  J k(i) generated by the DP iteration

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•  Bellman’s equation has J ∗(i) as unique solution:

J∗(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i

J∗(t) = 0

•   A stationary policy  µ  is optimal if and only if for every state  i,  µ(i) attains the minimum in Bellman’s equation.

•  Key proof idea: The “tail” of the cost series,

Σ_{k=mK}^∞ g( xk, µk(xk) )

vanishes as  K  increases to ∞.
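
•   A minimal Python sketch (not from the text) of the value iteration above, assuming P[u] holds the transition probabilities among states 1, . . . , n (the remaining probability from each row goes to the cost-free termination state t) and g is an n×m array of stage costs:

import numpy as np

def ssp_value_iteration(P, g, n_iters=10000, tol=1e-9):
    n, m = g.shape
    J = np.zeros(n)                                              # arbitrary J_0
    for _ in range(n_iters):
        Q = g + np.stack([P[u] @ J for u in range(m)], axis=1)   # Q(i, u)
        J_new = Q.min(axis=1)
        done = np.max(np.abs(J_new - J)) < tol
        J = J_new
        if done:
            break
    return J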


OUTLINE OF PROOF THAT  J N  → J ∗

•   Assume for simplicity that  J0(i) = 0 for all  i, and for any  K ≥ 1, write the cost of any policy  π  as

Jπ(x0) = E{ Σ_{k=0}^{mK−1} g( xk, µk(xk) ) } + E{ Σ_{k=mK}^∞ g( xk, µk(xk) ) }

≤ E{ Σ_{k=0}^{mK−1} g( xk, µk(xk) ) } + Σ_{k=K}^∞ ρ^k m max_{i,u} |g(i, u)|

Take the minimum of both sides over  π  to obtain

J∗(x0) ≤ JmK(x0) + ( ρ^K/(1 − ρ) ) m max_{i,u} |g(i, u)|.

Similarly, we have

JmK(x0) − ( ρ^K/(1 − ρ) ) m max_{i,u} |g(i, u)| ≤ J∗(x0).

It follows that  lim_{K→∞} JmK(x0) = J∗(x0).

•   It can be seen that  JmK(x0) and  JmK+k(x0) converge to the same limit for  k = 1, . . . , m − 1 (since  k  extra steps far into the future don’t matter), so

JN(x0) → J∗(x0)


EXAMPLE

•   Minimizing the  E {Time to Termination}: Let

g(i, u) = 1,   ∀  i = 1, . . . , n, u ∈ U (i)

•  Under our assumptions, the costs J∗(i) uniquely solve Bellman’s equation, which has the form

J∗(i) = min_{u∈U(i)}[ 1 + Σ_{j=1}^n pij(u) J∗(j) ],   i = 1, . . . , n

•   In the special case where there is only one control at each state,  J∗(i) is the mean first passage time from i to t. These times, denoted mi, are the unique solution of the equations

mi = 1 + Σ_{j=1}^n pij mj,   i = 1, . . . , n.
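
•   A minimal Python sketch (not from the text) of this linear system for a hypothetical two-state example; P holds the pij among the nonterminal states, with the remaining probability going to t:

import numpy as np

P = np.array([[0.5, 0.3], [0.2, 0.4]])    # illustrative p_ij among states 1..n
n = P.shape[0]
m = np.linalg.solve(np.eye(n) - P, np.ones(n))   # solve m = 1 + P m
print(m)                                  # mean first passage times to t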


DISCOUNTED PROBLEMS

•   Assume a discount factor  α < 1.

•   Conversion to an SSP problem.

[Figure: the original Markov chain (states i, j with transition probabilities pij(u), pii(u), pjj(u), pji(u)) and the equivalent SSP, in which all transition probabilities are scaled by α and a termination state is reached from every state with probability 1 − α.]

•  Value iteration converges to J ∗ for all initial J 0:

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•   J ∗ is the unique solution of Bellman’s equation:

J∗(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i

•   A stationary policy  µ  is optimal if and only if for every state  i,  µ(i) attains the minimum in Bellman’s equation.


6.231 DYNAMIC PROGRAMMING

LECTURE 12

LECTURE OUTLINE

•   Review of stochastic shortest path problems

•   Computational methods for SSP

−  Value iteration

−   Policy iteration

−   Linear programming

•   Computational methods for discounted prob-lems


STOCHASTIC SHORTEST PATH PROBLEMS

•  Assume a finite-state system: States 1, . . . , n and a special cost-free termination state  t

−  Transition probabilities  pij(u)

−   Control constraints  u ∈ U (i)

−   Cost of policy  π = {µ0, µ1, . . .}  is

Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} g( xk, µk(xk) )  |  x0 = i }

− Optimal policy if  J π(i) = J ∗(i) for all  i.

−  Special notation: For stationary policies π  ={µ , µ , . . .}, we use  J µ(i) in place of  J π(i).

•  Assumption (Termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than  m  stages; for all  π, we have

ρπ = max_{i=1,...,n} P{ xm ≠ t | x0 = i, π } < 1


MAIN RESULT

•   Given any initial conditions   J 0(1), . . . , J  0(n),the sequence  J k(i) generated by value iteration

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•  Bellman’s equation has J ∗(i) as unique solution:

J∗(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i

•   A stationary policy  µ  is optimal if and only if for every state  i,  µ(i) attains the minimum in Bellman’s equation.

•  Key proof idea: The “tail” of the cost series,

Σ_{k=mK}^∞ g( xk, µk(xk) )

vanishes as  K  increases to ∞.


BELLMAN’S EQUATION FOR A SINGLE POLICY

•   Consider a stationary policy  µ

•   Jµ(i),  i = 1, . . . , n, are the unique solution of the linear system of  n  equations

Jµ(i) = g( i, µ(i) ) + Σ_{j=1}^n pij( µ(i) ) Jµ(j),   ∀ i = 1, . . . , n

•   Proof:   This is just Bellman’s equation for a modified/restricted problem where there is only one policy, the stationary policy µ, i.e., the control constraint set at state  i  is  U(i) = {µ(i)}

•  The equation provides a way to compute  Jµ(i),  i = 1, . . . , n, but the computation is substantial for large n  [O(n³)]

•  For large  n, value iteration may be preferable. (Typical case of a large linear system of equations, where an iterative method may be better than a direct solution method.)

•   For VERY large  n, exact methods cannot be applied, and approximations are needed. (We will discuss these later.)


POLICY ITERATION

•   It generates a sequence  µ1, µ2, . . . of stationary policies, starting with any stationary policy  µ0.

•   At the typical iteration, given  µk, we perform a policy evaluation step, which computes the Jµk(i) as the solution of the (linear) system of equations

J(i) = g( i, µk(i) ) + Σ_{j=1}^n pij( µk(i) ) J(j),   i = 1, . . . , n,

in the  n  unknowns  J(1), . . . , J(n). We then perform a  policy improvement step, which computes a new policy  µk+1 as

µk+1(i) = arg min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jµk(j) ],  ∀ i

•  The algorithm stops when Jµk(i) = Jµk+1(i) for all  i

•  Note the connection with the rollout algorithm, which is just a single policy iteration
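
•   A minimal Python sketch (not from the text) of policy iteration, assuming P[u] holds the transition probabilities among the nonterminal states 1, . . . , n (the remainder goes to t) and g is an n×m cost array:

import numpy as np

def policy_iteration(P, g):
    n, m = g.shape
    mu = np.zeros(n, dtype=int)                      # initial stationary policy
    while True:
        # policy evaluation: solve J = g_mu + P_mu J
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = g[np.arange(n), mu]
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # policy improvement
        Q = g + np.stack([P[u] @ J for u in range(m)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):
            return mu, J
        mu = mu_new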


JUSTIFICATION OF POLICY ITERATION

•   We can show that Jµk+1(i) ≤ Jµk(i) for all  i, k

•   Fix  k  and consider the sequence generated by

JN+1(i) = g( i, µk+1(i) ) + Σ_{j=1}^n pij( µk+1(i) ) JN(j)

where J0(i) = Jµk(i). We have

J0(i) = g( i, µk(i) ) + Σ_{j=1}^n pij( µk(i) ) J0(j)

≥ g( i, µk+1(i) ) + Σ_{j=1}^n pij( µk+1(i) ) J0(j) = J1(i)

Using the monotonicity property of DP,

J 0(i) ≥ J 1(i) ≥ · · · ≥ J N (i) ≥ J N +1(i) ≥ · · · ,   ∀ i

Since  JN(i) → Jµk+1(i) as  N → ∞, we obtain Jµk(i) = J0(i) ≥ Jµk+1(i) for all i. Also if Jµk(i) = Jµk+1(i) for all  i,  Jµk  solves Bellman’s equation and is therefore equal to  J∗

•   A policy cannot be repeated, there are finitely many stationary policies, so the algorithm terminates with an optimal policy


LINEAR PROGRAMMING

•  We claim that J∗ is the “largest” J that satisfies the constraint

J(i) ≤ g(i, u) + Σ_{j=1}^n pij(u) J(j),   (1)

for all  i = 1, . . . , n and  u ∈ U(i).

•   Proof:   If we use value iteration to generate a sequence of vectors  Jk = ( Jk(1), . . . , Jk(n) )  starting with a  J0  such that

J0(i) ≤ min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) J0(j) ],   ∀ i

Then,  Jk(i) ≤ Jk+1(i) for all  k  and  i  (monotonicity property of DP) and  Jk → J∗, so that J0(i) ≤ J∗(i) for all  i.

•   So J∗ = ( J∗(1), . . . , J∗(n) ) is the solution of the linear program of maximizing  Σ_{i=1}^n J(i)  subject to the constraint (1).
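
•   A minimal Python sketch (not from the text) of this linear program, assuming scipy is available; P[u] holds the transition probabilities among the nonterminal states and g is an n×m cost array:

import numpy as np
from scipy.optimize import linprog

def ssp_lp(P, g):
    n, m = g.shape
    A_ub, b_ub = [], []
    for i in range(n):
        for u in range(m):
            row = -P[u][i].astype(float)     # -sum_j p_ij(u) J(j)
            row[i] += 1.0                    # + J(i)
            A_ub.append(row)
            b_ub.append(g[i, u])             # J(i) - sum_j p_ij(u) J(j) <= g(i,u)
    res = linprog(c=-np.ones(n),             # maximize sum_i J(i)
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x                             # equals J* under the SSP assumptions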


DISCOUNTED PROBLEMS

•   Assume a discount factor  α < 1.

•   Conversion to an SSP problem.

[Figure: the original Markov chain (states i, j with transition probabilities pij(u), pii(u), pjj(u), pji(u)) and the equivalent SSP, in which all transition probabilities are scaled by α and a termination state is reached from every state with probability 1 − α.]

•  Value iteration converges to J ∗ for all initial J 0:

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•   J ∗ is the unique solution of Bellman’s equation:

J∗(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i


DISCOUNTED PROBLEMS (CONTINUED)

•  Policy iteration converges finitely to an optimal policy, and linear programming works.

•   Example:   Asset selling over an infinite horizon. If accepted, the offer  xk  of period  k  is invested at a rate of interest  r.

•   By depreciating the sale amount to period 0 dollars, we view (1 + r)^{−k} xk  as the reward for selling the asset in period  k  at a price  xk, where r > 0 is the rate of interest. So the discount factor is  α = 1/(1 + r).

•   J∗ is the unique solution of Bellman’s equation

J∗(x) = max[ x,  E{ J∗(w) } / (1 + r) ].

•   An optimal policy is to sell if and only if the current offer  xk  is greater than or equal to  ᾱ, where

ᾱ = E{ J∗(w) } / (1 + r).


6.231 DYNAMIC PROGRAMMING

LECTURE 13

LECTURE OUTLINE

•   Average cost per stage problems

•  Connection with stochastic shortest path prob-lems

•   Bellman’s equation

•   Value iteration

•   Policy iteration


AVERAGE COST PER STAGE PROBLEM

•   Stationary system with a finite number of states and controls

•   Minimize over policies  π = {µ0, µ1,...}

Jπ(x0) = lim_{N→∞} (1/N) E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} g( xk, µk(xk), wk ) }

•   Important characteristics (not shared by other types of infinite horizon problems)

− For any fixed K, the cost incurred up to time K does not matter (only the state that we are at at time  K  matters)

−  If all states “communicate,” the optimal cost is independent of the initial state [if we can go from  i  to  j  in finite expected time, we must have J∗(i) ≤ J∗(j)]. So J∗(i) ≡ λ∗ for all  i.

−  Because “communication” issues are so important, the methodology relies heavily on Markov chain theory.


CONNECTION WITH SSP

•   Assumption: State n is such that for all initial states and all policies,  n  will be visited infinitely often (with probability 1).

•   Divide the sequence of generated states into cycles marked by successive visits to  n.

•   Each of the cycles can be viewed as a state trajectory of a corresponding stochastic shortest path problem with  n  as the termination state.

[Figure: the original chain (states i, j and special state n, with transition probabilities pij(u), pin(u), pjn(u), pni(u), pnj(u), pnn(u)) and the corresponding SSP, in which the transitions into the special state n are redirected to an artificial termination state.]

•   Let the cost at  i  of the SSP be  g(i, u) − λ∗

•   We will show that

Av. Cost Probl.   ≡   A Min Cost Cycle Probl.   ≡   SSP Probl.


CONNECTION WITH SSP (CONTINUED)

•   Consider a  minimum cycle cost problem: Find a stationary policy  µ  that minimizes the  expected cost per transition within a cycle

Cnn(µ) / Nnn(µ),

where for a fixed  µ,

Cnn(µ) :  E{ cost from  n  up to the first return to  n }

Nnn(µ) :  E{ time from  n  up to the first return to  n }

•   Intuitively,   C nn(µ)/N nn(µ) = average cost of µ, and optimal cycle cost = λ∗, so

Cnn(µ) − Nnn(µ) λ∗ ≥ 0,

with equality if  µ  is optimal.

•   Thus, the optimal  µ  must minimize over  µ  the expression  Cnn(µ) − Nnn(µ) λ∗, which is the expected cost of  µ  starting from  n  in the SSP with stage costs  g(i, u) − λ∗.

•  Also: Optimal SSP Cost = 0.


BELLMAN’S EQUATION

•   Let h∗(i) be the optimal cost of this SSP problem when starting at the nontermination states  i = 1, . . . , n. Then  h∗(n) = 0, and  h∗(1), . . . , h∗(n)  solve uniquely the corresponding Bellman’s equation

h∗(i) = min_{u∈U(i)}[ g(i, u) − λ∗ + Σ_{j=1}^{n−1} pij(u) h∗(j) ],  ∀ i

•   If  µ∗ is an optimal stationary policy for the SSP problem, we have (since the optimal SSP cost = 0)

h∗(n) = Cnn(µ∗) − Nnn(µ∗) λ∗ = 0

•   Combining these equations, we have

λ∗ + h∗(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) h∗(j) ],  ∀ i

•  If  µ∗(i) attains the min for each i, µ∗ is optimal.


MORE ON THE CONNECTION WITH SSP

•   Interpretation of  h∗(i) as a  relative or differential cost: It is the minimum of

E{ cost to reach  n  from  i  for the first time }
− E{ cost if the stage cost were  λ∗  and not  g(i, u) }

•   We don’t know  λ∗, so we can’t solve the average cost problem as an SSP problem. But similar value and policy iteration algorithms are possible.

•   Example: A manufacturer at each time:

−  Receives an order with prob.  p and no order with prob. 1 − p.

−   May process all unfilled orders at cost  K > 0, or process no order at all. The cost per unfilled order at each time is  c > 0.

− The maximum number of orders that can remain unfilled is  n.

−  Find a processing policy that minimizes the total expected cost per stage.


EXAMPLE (CONTINUED)

•   State = number of unfilled orders. State 0 is the special state for the SSP formulation.

•   Bellman’s equation: For states i = 0, 1, . . . , n − 1

λ∗ + h∗(i) = min[ K + (1 − p)h∗(0) + p h∗(1),  ci + (1 − p)h∗(i) + p h∗(i + 1) ],

and for state  n

λ∗ + h∗(n) = K + (1 − p)h∗(0) + p h∗(1)

•   Optimal policy: Process  i  unfilled orders if 

K +(1− p)h∗(0)+ ph∗(1) ≤ ci+(1− p)h∗(i)+ ph∗(i+1).

•  Intuitively,  h∗(i) is monotonically nondecreasing with  i  (interpret  h∗(i) as optimal costs-to-go for the associated SSP problem). So a threshold policy is optimal: process the orders if their number exceeds some threshold integer  m∗.


VALUE ITERATION

•   Natural value iteration method: Generate the optimal k-stage costs by the DP algorithm starting with any  J0:

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•   Result: limk→∞ J k(i)/k = λ∗ for all  i.

•   Proof outline: Let J∗k be so generated from the initial condition  J∗0 = h∗. Then, by induction,

J∗k(i) = k λ∗ + h∗(i),   ∀ i, ∀ k.

On the other hand,

| Jk(i) − J∗k(i) | ≤ max_{j=1,...,n} | J0(j) − h∗(j) |,   ∀ i

since  Jk(i) and  J∗k(i) are optimal costs for two k-stage problems that differ only in the terminal cost functions, which are  J0  and h∗.


RELATIVE VALUE ITERATION

•   The value iteration method just described has two drawbacks:

−   Since typically some components of  Jk  diverge to ∞ or −∞, calculating lim_{k→∞} Jk(i)/k is numerically cumbersome.

−  The method will not compute a corresponding differential cost vector  h∗.

•   We can bypass both difficulties by subtracting a constant from all components of the vector  Jk, so that the difference, call it hk, remains bounded.

•   Relative value iteration algorithm:   Pick anystate s, and iterate according to

hk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) hk(j) ] − min_{u∈U(s)}[ g(s, u) + Σ_{j=1}^n psj(u) hk(j) ],   ∀ i

•   Then we can show  hk → h∗  (under an extra assumption).
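
•   A minimal Python sketch (not from the text) of relative value iteration; P[u] (n×n stochastic matrices) and g (n×m costs) are hypothetical inputs, and s is the fixed reference state:

import numpy as np

def relative_value_iteration(P, g, s=0, n_iters=2000, tol=1e-9):
    n, m = g.shape
    h = np.zeros(n)
    for _ in range(n_iters):
        Q = g + np.stack([P[u] @ h for u in range(m)], axis=1)
        Th = Q.min(axis=1)
        h_new = Th - Th[s]               # subtract the value at the fixed state s
        if np.max(np.abs(h_new - h)) < tol:
            return Th[s], h_new          # (estimate of lambda*, differential costs)
        h = h_new
    return Th[s], h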


POLICY ITERATION

•   At the typical iteration, we have a stationary  µk.

•   Policy evaluation: Compute λk and hk(i) of  µk, using the  n + 1 equations  hk(n) = 0 and

λk + hk(i) = g( i, µk(i) ) + Σ_{j=1}^n pij( µk(i) ) hk(j),  ∀ i

•   Policy improvement: (For the λk-SSP) Find for all  i

µk+1(i) = arg min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) hk(j) ]

•   If  λk+1 = λk and hk+1(i) = hk(i) for all i, stop; otherwise, repeat with  µk+1 replacing µk.

•   Result:  For each  k, we either have  λk+1 < λk

or we have policy improvement for the  λk-SSP:

λk+1 = λk, hk+1(i) ≤ hk(i), i = 1, . . . , n .

The algorithm terminates with an optimal policy.


6.231 DYNAMIC PROGRAMMING

LECTURE 14

LECTURE OUTLINE

•   Control of continuous-time Markov chains – Semi-Markov problems

•  Problem formulation – Equivalence to discrete-time problems

•   Discounted problems

•  Average cost problems


CONTINUOUS-TIME MARKOV CHAINS

•   Stationary system with a finite number of states and controls

•   State transitions occur at discrete times

•   Control is applied at these discrete times and stays constant between transitions

•   Time between transitions is random

•   Cost accumulates in continuous time (may also be incurred at the time of transition)

•   Example: Admission control in a system with restricted capacity (e.g., a communication link)

−   Customer arrivals: a Poisson process

−  Customers entering the system depart after an exponentially distributed time

−  Upon arrival we must decide whether to admit or to block a customer

−  There is a cost for blocking a customer

−   For each customer that is in the system, there is a customer-dependent reward per unit time

−  Minimize time-discounted or average cost


PROBLEM FORMULATION

•   x(t) and  u(t): State and control at time  t

•   tk: Time of the  kth transition (t0 = 0)

•   xk  = x(tk);   x(t) = xk   for tk ≤ t < tk+1.

•   uk  = u(tk);   u(t) = uk   for  tk ≤ t < tk+1.

•   No transition probabilities; instead  transition distributions (quantify the uncertainty about  both  transition time and next state)

Qij(τ, u) = P {tk+1−tk  ≤ τ, xk+1  = j  | xk  = i, uk  = u}

•   Two important formulas:

(1) Transition probabilities are specified by

pij(u) = P{ xk+1 = j | xk = i, uk = u } = lim_{τ→∞} Qij(τ, u)

(2) The Cumulative Distribution Function (CDF) of  τ  given  i, j, u  is (assuming  pij(u) > 0)

P{ tk+1 − tk ≤ τ | xk = i, xk+1 = j, uk = u } = Qij(τ, u) / pij(u)

Thus,  Qij(τ, u) can be viewed as a “scaled CDF”


EXPONENTIAL TRANSITION DISTRIBUTIONS

•   Important example of transition distributions:

Qij(τ, u) = pij(u)( 1 − e^{−νi(u)τ} ),

where pij(u) are transition probabilities, and νi(u) is called the  transition rate  at state  i.

•   Interpretation:  If the system is in state  i  and control  u  is applied

−  the next state will be j with probability pij(u)

−   the time between the transition to state  i  and the transition to the next state  j  is exponentially distributed with parameter νi(u) (independently of  j):

P{ transition time interval > τ | i, u } = e^{−νi(u)τ}

•  The exponential distribution is  memoryless. This implies that for a given policy, the system is a continuous-time Markov chain (the future depends on the past only through the current state).

•   Without the memoryless property, the Markov property holds only at the times of transition.


COST STRUCTURES

•   There is a cost  g(i, u) per unit time, i.e.

g(i, u)dt = the cost incurred in time  dt

•   There may be an extra “instantaneous” cost  ĝ(i, u) at the time of a transition (let’s ignore this for the moment)

•   Total discounted cost of  π = {µ0, µ1, . . .} starting from state  i  (with discount factor  β > 0)

lim_{N→∞} E{ Σ_{k=0}^{N−1} ∫_{tk}^{tk+1} e^{−βt} g( xk, µk(xk) ) dt  |  x0 = i }

•   Average cost per unit time

lim_{N→∞} ( 1/E{tN} ) E{ Σ_{k=0}^{N−1} ∫_{tk}^{tk+1} g( xk, µk(xk) ) dt  |  x0 = i }

•  We will see that  both problems have equivalent discrete-time versions.


A NOTE ON NOTATION

•  The scaled CDF Qij(τ, u) can be used to model discrete, continuous, and mixed distributions for the transition time  τ.

•  Generally, expected values of functions of  τ  can be written as integrals involving  dQij(τ, u). For example, the conditional expected value of  τ  given  i,  j, and  u  is written as

E{ τ | i, j, u } = ∫_0^∞ τ dQij(τ, u) / pij(u)

•  If  Qij(τ, u) is continuous with respect to  τ, its derivative

qij(τ, u) = dQij(τ, u)/dτ

can be viewed as a “scaled” density function. Expected values of functions of  τ  can then be written in terms of  qij(τ, u). For example

E{ τ | i, j, u } = ∫_0^∞ τ qij(τ, u)/pij(u) dτ

•   If  Qij(τ, u) is discontinuous and “staircase-like,” expected values can be written as summations.


DISCOUNTED CASE - COST CALCULATION

•   For a policy  π = {µ0, µ1, . . .}, write

Jπ(i) = E{1st transition cost} + E{ e^{−βτ} Jπ1(j) | i, µ0(i) }

where  Jπ1(j) is the cost-to-go of the policy  π1 = {µ1, µ2, . . .}

G(i, u) = E j

E τ {1st transition cost | j}

=

n

j=1

 pij(u)   ∞0   τ 

0

e−βtg(i, u)dt dQij(τ, u)

 pij(u)

=

nj=1

   ∞0

1 − e−βτ 

β g(i, u)dQij(τ, u)

•  Thus the  E{1st transition cost}  is

G( i, µ0(i) ) = g( i, µ0(i) ) Σ_{j=1}^n ∫_0^∞ ( (1 − e^{−βτ})/β ) dQij( τ, µ0(i) )

(The summation term can be viewed as a “discounted length of the transition interval  t1 − t0”.)


COST CALCULATION (CONTINUED)

•   Also the expected (discounted) cost from the next state  j  is

E{ e^{−βτ} Jπ1(j) | i, µ0(i) } = E_j{ E{ e^{−βτ} | i, µ0(i), j } Jπ1(j) | i, µ0(i) }

= Σ_{j=1}^n pij( µ0(i) ) ( ∫_0^∞ e^{−βτ} dQij( τ, µ0(i) ) / pij( µ0(i) ) ) Jπ1(j)

= Σ_{j=1}^n mij( µ0(i) ) Jπ1(j)

where mij(u) is given by

mij(u) = ∫_0^∞ e^{−βτ} dQij(τ, u) < ∫_0^∞ dQij(τ, u) = pij(u)

and can be viewed as the “effective discount factor” [the analog of  α pij(u) in the discrete-time case].

•   So Jπ(i) can be written as

Jπ(i) = G( i, µ0(i) ) + Σ_{j=1}^n mij( µ0(i) ) Jπ1(j)

i.e., the (continuous-time discounted) cost of the 1st period, plus the (continuous-time discounted) cost-to-go from the next state.


EQUIVALENCE TO AN SSP

•   Similar to the discrete-time case, introduce a stochastic shortest path problem with an artificial termination state  t

•   Under control u, from state i the system moves to state  j  with probability  mij(u) and to the termination state t with probability 1 − Σ_{j=1}^n mij(u)

•   Bellman’s equation: For  i = 1, . . . , n,

J∗(i) = min_{u∈U(i)}[ G(i, u) + Σ_{j=1}^n mij(u) J∗(j) ]

•  If in addition to the cost per unit time  g, thereis an extra (instantaneous) one-stage cost g(i, u),

Bellman’s equation becomes

J ∗(i) = minu∈U (i)

g(i, u) + G(i, u) +

nj=1

mij(u)J ∗( j)


MANUFACTURER’S EXAMPLE REVISITED

•  A manufacturer receives orders with interarrival times uniformly distributed in [0, τmax].

•  He may process all unfilled orders at cost K > 0, or process none. The cost per unit time of an unfilled order is  c. The maximum number of unfilled orders is  n.

•   The nonzero transition distributions are

Qi1(τ, Fill) = Qi(i+1)(τ, Not Fill) = min[ 1, τ/τmax ]

G(i, Fill) = 0, G(i, Not Fill) = γ c i,

where

γ  =n

j=1

   ∞0

1 − e−βτ β 

  dQij(τ, u) =   τ max

0

1 − e−βτ βτ max

dτ 

•   There is an “instantaneous” cost

ĝ(i, Fill) = K,   ĝ(i, Not Fill) = 0


MANUFACTURER’S EXAMPLE CONTINUED

•   The “effective discount factors”  mij(u) in Bellman’s equation are

mi1(Fill) = mi(i+1)(Not Fill) = α,

where

α = ∫_0^∞ e^{−βτ} dQij(τ, u) = ∫_0^{τmax} ( e^{−βτ}/τmax ) dτ = ( 1 − e^{−β τmax} )/( β τmax )

•  Bellman’s equation has the form

J∗(i) = min[ K + α J∗(1),  γ c i + α J∗(i + 1) ],   i = 1, 2, . . .

•   As in the discrete-time case, we can conclude that there exists an optimal threshold  i∗:

fill the orders   <==>   their number  i  exceeds  i∗


AVERAGE COST

•   Minimize

lim_{N→∞} ( 1/E{tN} ) E{ ∫_0^{tN} g( x(t), u(t) ) dt }

assuming there is a special state that is “recurrent under all policies”

•   Total expected cost of a transition

G(i, u) = g(i, u)τ i(u),

where τ i(u): Expected transition time.

•  We now apply the SSP argument used for the discrete-time case. Divide the trajectory into cycles marked by successive visits to n. The cost at (i, u) is  G(i, u) − λ∗ τi(u), where  λ∗ is the optimal expected cost per unit time. Each cycle is viewed as a state trajectory of a corresponding SSP problem with the termination state being essentially  n.

•   So Bellman’s Eq. for the average cost problem:

h∗(i) = min_{u∈U(i)}[ G(i, u) − λ∗ τi(u) + Σ_{j=1}^n pij(u) h∗(j) ]


MANUFACTURER EXAMPLE/AVERAGE COST

•   The expected transition times are

τi(Fill) = τi(Not Fill) = τmax/2

the expected transition cost is

G(i, Fill) = 0,   G(i, Not Fill) = c i τmax/2

and there is also the “instantaneous” cost

ĝ(i, Fill) = K,   ĝ(i, Not Fill) = 0

•   Bellman’s equation:

h∗(i) = min[ K − λ∗ τmax/2 + h∗(1),  c i τmax/2 − λ∗ τmax/2 + h∗(i + 1) ]

•   Again it can be shown that a threshold policyis optimal.


6.231 DYNAMIC PROGRAMMING

LECTURE 15

LECTURE OUTLINE

•   We start a nine-lecture sequence on advanced infinite horizon DP and approximate solution methods

•   We allow an infinite state space, so the stochastic shortest path framework cannot be used any more

•  Results are rigorous assuming a countable disturbance space

−   This includes deterministic problems with arbitrary state space, and countable state Markov chains

−   Otherwise the mathematics of measure theory make analysis difficult, although the final results are essentially the same as for a countable disturbance space

•   The discounted problem is the proper starting point for this analysis

•   The central mathematical structure is that the DP mapping is a contraction mapping (instead of existence of a termination state)


DISCOUNTED PROBLEMS/BOUNDED COST

•   Stationary system with arbitrary state space

xk+1 = f (xk, uk, wk), k = 0, 1, . . .

•   Cost of a policy  π = {µ0, µ1, . . .}

Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) }

with α < 1, and for some M, we have |g(x, u, w)| ≤ M  for all (x, u, w)

•   Shorthand notation for DP mappings (operateon functions of state to produce other functions)

(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ( f(x, u, w) ) },  ∀ x

TJ is the optimal cost function for the one-stage problem with stage cost  g  and terminal cost  αJ.

•   For any stationary policy  µ

(TµJ)(x) = E_w{ g( x, µ(x), w ) + αJ( f( x, µ(x), w ) ) },  ∀ x
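
•   A minimal Python sketch (not from the text) of the mappings T and Tµ for a finite-state, finite-control discounted problem; P[u] (transition matrices) and g (n×m costs) are hypothetical inputs:

import numpy as np

def T(J, P, g, alpha):
    # (TJ)(x) = min_u [ g(x,u) + alpha * sum_y p_xy(u) J(y) ]
    m = len(P)
    Q = np.stack([g[:, u] + alpha * (P[u] @ J) for u in range(m)], axis=1)
    return Q.min(axis=1)

def T_mu(J, mu, P, g, alpha):
    # (T_mu J)(x) = g(x, mu(x)) + alpha * sum_y p_xy(mu(x)) J(y)
    n = len(J)
    return np.array([g[x, mu[x]] + alpha * (P[mu[x]][x] @ J) for x in range(n)])

# repeated application J, T(J), T(T(J)), ... is value iteration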


“SHORTHAND” THEORY – A SUMMARY

•   Cost function expressions [with  J 0(x) ≡ 0]

Jπ(x) = lim_{k→∞} (Tµ0 Tµ1 · · · Tµk J0)(x),    Jµ(x) = lim_{k→∞} (Tµ^k J0)(x)

•   Bellman’s equation: J ∗ = T J ∗,   J µ  = T µJ µ

•  Optimality condition:

µ: optimal   <==> T µJ ∗ = T J ∗

•   Value iteration: For any (bounded)  J  and all  x,

J∗(x) = lim_{k→∞} (T^k J)(x)

•   Policy iteration: Given  µk,

−   Policy evaluation: Find J µk   by solving

J µk  = T µkJ µk

−   Policy improvement: Find  µk+1 such that

T µk+1 J µk  = T J µk


TWO KEY PROPERTIES

•   Monotonicity property: For any functions J and  J′ such that  J(x) ≤ J′(x) for all  x, and any  µ

(T J )(x) ≤ (T J ′)(x),   ∀  x,

(T µJ )(x) ≤ (T µJ ′)(x),   ∀  x.

•   Constant Shift property:   For any  J, any scalar  r, and any  µ

( T(J + re) )(x) = (TJ)(x) + αr,   ∀ x,

( Tµ(J + re) )(x) = (TµJ)(x) + αr,   ∀ x,

where e  is the unit function [e(x) ≡ 1].


CONVERGENCE OF VALUE ITERATION

•   If  J0 ≡ 0,

J∗(x) = lim_{N→∞} (T^N J0)(x),   for all x

Proof:   For any initial state  x0, and policy  π = {µ0, µ1, . . .},

Jπ(x0) = E{ Σ_{k=0}^∞ α^k g( xk, µk(xk), wk ) }

= E{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) } + E{ Σ_{k=N}^∞ α^k g( xk, µk(xk), wk ) }

The tail portion satisfies

| E{ Σ_{k=N}^∞ α^k g( xk, µk(xk), wk ) } | ≤ α^N M / (1 − α),

where  M ≥ |g(x, u, w)|. Take the min over  π  of both sides.   Q.E.D.


BELLMAN’S EQUATION

•  The optimal cost function J∗ satisfies Bellman’s Eq., i.e.  J∗ = TJ∗.

Proof: For all  x  and  N,

J∗(x) − α^N M/(1 − α) ≤ (T^N J0)(x) ≤ J∗(x) + α^N M/(1 − α),

where  J0(x) ≡ 0 and  M ≥ |g(x, u, w)|. Applying T  to this relation, and using Monotonicity and Constant Shift,

(T J ∗)(x) −  αN +1

M 1 − α

  ≤ (T N +1J 0)(x)

≤ (T J ∗)(x) + αN +1M 

1 − α

Taking the limit as  N 

 → ∞ and using the fact

limN →∞

(T N +1J 0)(x) = J ∗(x)

we obtain  J ∗ = T J ∗.   Q.E.D.


THE CONTRACTION PROPERTY

•   Contraction property: For any bounded functions J and J′, and any µ,

max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|,

max_x |(T_µJ)(x) − (T_µJ′)(x)| ≤ α max_x |J(x) − J′(x)|.

Proof: Denote c = max_{x∈S} |J(x) − J′(x)|. Then

J(x) − c ≤ J′(x) ≤ J(x) + c,   ∀ x

Apply T to both sides, and use the Monotonicity and Constant Shift properties:

(TJ)(x) − αc ≤ (TJ′)(x) ≤ (TJ)(x) + αc,   ∀ x

Hence

|(TJ)(x) − (TJ′)(x)| ≤ αc,   ∀ x.

Q.E.D.


IMPLICATIONS OF CONTRACTION PROPERTY

•   We can strengthen our earlier result:

•   Bellman’s equation J = TJ has a unique solution, namely J∗, and for any bounded J, we have

lim_{k→∞} (T^k J)(x) = J∗(x),   ∀ x

Proof: Use

max_x |(T^k J)(x) − J∗(x)| = max_x |(T^k J)(x) − (T^k J∗)(x)| ≤ α^k max_x |J(x) − J∗(x)|

•   Special Case: For each stationary µ, J_µ is the unique solution of J = T_µJ and

lim_{k→∞} (T_µ^k J)(x) = J_µ(x),   ∀ x,

for any bounded J.

•   Convergence rate: For all k,

max_x |(T^k J)(x) − J∗(x)| ≤ α^k max_x |J(x) − J∗(x)|


NEC. AND SUFFICIENT OPT. CONDITION

•   A stationary policy µ is optimal if and only if µ(x) attains the minimum in Bellman’s equation for each x; i.e.,

TJ∗ = T_µJ∗.

Proof: If TJ∗ = T_µJ∗, then using Bellman’s equation (J∗ = TJ∗), we have

J∗ = T_µJ∗,

so by uniqueness of the fixed point of T_µ, we obtain J∗ = J_µ; i.e., µ is optimal.

•   Conversely, if the stationary policy µ is optimal, we have J∗ = J_µ, so

J∗ = T_µJ∗.

Combining this with Bellman’s equation (J∗ = TJ∗), we obtain TJ∗ = T_µJ∗.   Q.E.D.


COMPUTATIONAL METHODS

•   Value iteration and variants

−   Gauss-Seidel version

−   Approximate value iteration

•   Policy iteration and variants

−  Combination with value iteration

−   Modified policy iteration

−  Asynchronous policy iteration

•   Linear programming

maximize   Σ_{i=1}^n J(i)

subject to   J(i) ≤ g(i, u) + α Σ_{j=1}^n p_ij(u) J(j),   ∀ (i, u)

•   Approximate linear programming: use in place of J(i) a low-dim. basis function representation

J(i, r) = Σ_{k=1}^m r_k w_k(i)

and low-dim. LP (with many constraints)
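•   A hedged sketch of the exact LP above for a small n-state problem using scipy.optimize.linprog (my formulation, not code from the text): since linprog minimizes, we maximize Σ_i J(i) by minimizing its negative, and each pair (i, u) contributes one constraint J(i) − α Σ_j p_ij(u) J(j) ≤ g(i, u):

    import numpy as np
    from scipy.optimize import linprog

    def solve_dp_lp(P, g, alpha):
        """Exact LP: maximize sum_i J(i) s.t. J(i) <= g(i,u) + alpha*sum_j p_ij(u) J(j)."""
        n, m = g.shape
        A_ub, b_ub = [], []
        for u in range(m):
            # J(i) - alpha * sum_j p_ij(u) J(j) <= g(i, u), one row per state i
            A_ub.append(np.eye(n) - alpha * P[u])
            b_ub.append(g[:, u])
        A_ub = np.vstack(A_ub)
        b_ub = np.concatenate(b_ub)
        c = -np.ones(n)                       # maximize sum_i J(i)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
        return res.x                          # approximately J*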


6.231 DYNAMIC PROGRAMMING

LECTURE 16

LECTURE OUTLINE

•   Review of basic theory of discounted problems

•  Monotonicity and contraction properties

•   Contraction mappings in DP

•   Discounted problems: Countable state space with unbounded costs

•   Generalized discounted DP


DISCOUNTED PROBLEMS/BOUNDED COST

•   Stationary system with arbitrary state space

xk+1 = f (xk, uk, wk), k = 0, 1, . . .

•   Cost of a policy  π = {µ0, µ1, . . .}

J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(x_k, µ_k(x_k), w_k) }

with α < 1, and for some M, we have |g(x,u,w)| ≤ M for all (x,u,w)

•   Shorthand notation for DP mappings (operate on functions of state to produce other functions)

(TJ)(x) = min_{u∈U(x)} E_w{ g(x,u,w) + αJ(f(x,u,w)) },   ∀ x

TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost αJ.

•   For any stationary policy µ

(T_µJ)(x) = E_w{ g(x, µ(x), w) + αJ(f(x, µ(x), w)) },   ∀ x


MAJOR PROPERTIES

•   Monotonicity property: For any functions J and J′ on the state space X such that J(x) ≤ J′(x) for all x ∈ X, and any µ

(TJ)(x) ≤ (TJ′)(x),   ∀ x ∈ X,

(T_µJ)(x) ≤ (T_µJ′)(x),   ∀ x ∈ X.

•   Contraction property: For any bounded functions J and J′, and any µ,

max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|,

max_x |(T_µJ)(x) − (T_µJ′)(x)| ≤ α max_x |J(x) − J′(x)|.

•   The contraction property can be written in shorthand as

‖TJ − TJ′‖ ≤ α‖J − J′‖,    ‖T_µJ − T_µJ′‖ ≤ α‖J − J′‖,

where for any bounded function J, we denote by ‖·‖ the sup-norm

‖J‖ = max_{x∈X} |J(x)|.


CONTRACTION MAPPINGS

•   Given a real vector space Y with a norm ‖·‖ (i.e., ‖y‖ ≥ 0 for all y ∈ Y, ‖y‖ = 0 if and only if y = 0, ‖ay‖ = |a|‖y‖ for all scalars a and y ∈ Y, and ‖y + z‖ ≤ ‖y‖ + ‖z‖ for all y, z ∈ Y)

•   A function F : Y → Y is said to be a contraction mapping if for some ρ ∈ (0, 1), we have

‖Fy − Fz‖ ≤ ρ‖y − z‖,   for all y, z ∈ Y.

ρ is called the modulus of contraction of F.

•   For m > 1, we say that F is an m-stage contraction if F^m is a contraction.

•   Important example: Let X be a set (e.g., state space in DP), v : X → ℜ be a positive-valued function. Let B(X) be the set of all functions J : X → ℜ such that J(s)/v(s) is bounded over s.

•   We define a norm on B(X), called the weighted sup-norm, by

‖J‖ = max_{s∈X} |J(s)| / v(s).

•   Important special case: The discounted problem mappings T and T_µ [for v(s) ≡ 1, ρ = α].


A DP-LIKE CONTRACTION MAPPING

•   Let X = {1, 2, . . .}, and let F : B(X) → B(X) be a linear mapping of the form

(FJ)(i) = b(i) + Σ_{j∈X} a(i, j) J(j),   ∀ i

where b(i) and a(i, j) are some scalars. Then F is a contraction with modulus ρ if

( Σ_{j∈X} |a(i, j)| v(j) ) / v(i) ≤ ρ,   ∀ i

•   Let F : B(X) → B(X) be a mapping of the form

(FJ)(i) = min_{µ∈M} (F_µJ)(i),   ∀ i

where M is a parameter set, and for each µ ∈ M, F_µ is a contraction mapping from B(X) to B(X) with modulus ρ. Then F is a contraction mapping with modulus ρ.


CONTRACTION MAPPING FIXED-POINT TH.

•   Contraction Mapping Fixed-Point Theorem: If F : B(X) → B(X) is a contraction with modulus ρ ∈ (0, 1), then there exists a unique J∗ ∈ B(X) such that

J∗ = FJ∗.

Furthermore, if J is any function in B(X), then {F^k J} converges to J∗ and we have

‖F^k J − J∗‖ ≤ ρ^k ‖J − J∗‖,   k = 1, 2, . . . .

•   Similar result if F is an m-stage contraction mapping.

•   This is a special case of a general result for contraction mappings F : Y → Y over normed vector spaces Y that are complete: every sequence {y_k} that is Cauchy (satisfies ‖y_m − y_n‖ → 0 as m, n → ∞) converges.

•   The space B(X) is complete (see the text for a proof).


GENERAL FORMS OF DISCOUNTED DP

•   We consider an abstract form of discounted DP based on contraction mappings

•   Denote R(X): set of real-valued functions J : X → ℜ, and let H : X × U × R(X) → ℜ be a given mapping. We consider the mapping

(TJ)(x) = min_{u∈U(x)} H(x, u, J),   ∀ x ∈ X.

•   We assume that (TJ)(x) > −∞ for all x ∈ X, so T maps R(X) into R(X).

•   For each “policy” µ ∈ M, we consider the mapping T_µ : R(X) → R(X) defined by

(T_µJ)(x) = H(x, µ(x), J),   ∀ x ∈ X.

•   We want to find a function J∗ ∈ R(X) such that

J∗(x) = min_{u∈U(x)} H(x, u, J∗),   ∀ x ∈ X

and a policy µ∗ such that T_µ∗ J∗ = TJ∗.


EXAMPLES

•   Discounted problems

H(x, u, J) = E{ g(x,u,w) + αJ(f(x,u,w)) }

•   Discounted Semi-Markov Problems

H(x, u, J) = G(x, u) + Σ_{y=1}^n m_xy(u) J(y)

where m_xy are “discounted” transition probabilities, defined by the transition distributions

•   Shortest Path Problems

H(x, u, J) = a_xu + J(u) if u ≠ d,   a_xd if u = d

where d is the destination

•   Minimax Problems

H(x, u, J) = max_{w∈W(x,u)} { g(x,u,w) + αJ(f(x,u,w)) }


GENERAL FORMS OF DISCOUNTED DP

•   Monotonicity assumption: If J, J′ ∈ R(X) and J ≤ J′, then

H(x, u, J) ≤ H(x, u, J′),   ∀ x ∈ X, u ∈ U(x)

•   Contraction assumption:

−   For every J ∈ B(X), the functions T_µJ and TJ belong to B(X).

−   For some α ∈ (0, 1) and all J, J′ ∈ B(X), H satisfies

|H(x, u, J) − H(x, u, J′)| ≤ α max_{y∈X} |J(y) − J′(y)|

for all x ∈ X and u ∈ U(x).

•   We can show all the standard analytical and computational results of discounted DP based on these two assumptions

•   With just the monotonicity assumption (as in the shortest path problem) we can still show various forms of the basic results under appropriate assumptions (like in the SSP problem)


RESULTS USING CONTRACTION

•   The mappings T_µ and T are sup-norm contraction mappings with modulus α over B(X), and have unique fixed points in B(X), denoted J_µ and J∗, respectively (cf. Bellman’s equation). Proof: From the contraction property of H.

•   For any J ∈ B(X) and µ ∈ M,

lim_{k→∞} T_µ^k J = J_µ,    lim_{k→∞} T^k J = J∗.

(cf. convergence of value iteration). Proof: From the contraction property of T_µ and T.

•   We have T_µJ∗ = TJ∗ if and only if J_µ = J∗ (cf. optimality condition). Proof: If T_µJ∗ = TJ∗, then T_µJ∗ = J∗, implying J∗ = J_µ. Conversely, if J_µ = J∗, then T_µJ∗ = T_µJ_µ = J_µ = J∗ = TJ∗.

•   For any J ∈ B(X) and µ ∈ M,

‖J_µ − J‖ ≤ ‖T_µJ − J‖ / (1 − α)

(this is a useful bound for J_µ)


RESULTS USING MON. AND CONTRACTION

•   Optimality of fixed point:

J∗(x) = min_{µ∈M} J_µ(x),   ∀ x ∈ X

•   Furthermore, for every ε > 0, there exists µ_ε ∈ M such that

J∗(x) ≤ J_µε(x) ≤ J∗(x) + ε,   ∀ x ∈ X

•   Nonstationary policies: Consider the set Π of all sequences π = {µ_0, µ_1, . . .} with µ_k ∈ M for all k, and define

J_π(x) = lim inf_{k→∞} (T_µ0 T_µ1 ··· T_µk J)(x),   ∀ x ∈ X,

with J being any function (the choice of J does not matter)

•   We have

J∗(x) = min_{π∈Π} J_π(x),   ∀ x ∈ X


6.231 DYNAMIC PROGRAMMING

LECTURE 17

LECTURE OUTLINE

•   Review of computational theory of discounted problems

•   Value iteration (VI), policy iteration (PI)

•   Q-Learning

•   Optimistic PI

•   Asynchronous VI

•   Asynchronous PI

•   Computational methods for generalized discounted DP


DISCOUNTED PROBLEMS/BOUNDED COST

•   Stationary system with arbitrary state space

xk+1 = f (xk, uk, wk), k = 0, 1, . . .

•   Cost of a policy  π = {µ0, µ1, . . .}

J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(x_k, µ_k(x_k), w_k) }

with α < 1, and for some M, we have |g(x,u,w)| ≤ M for all (x,u,w)

•   Shorthand notation for DP mappings (operate on functions of state to produce other functions)

(TJ)(x) = min_{u∈U(x)} E_w{ g(x,u,w) + αJ(f(x,u,w)) },   ∀ x

TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost αJ.

•   For any stationary policy µ

(T_µJ)(x) = E_w{ g(x, µ(x), w) + αJ(f(x, µ(x), w)) },   ∀ x


“SHORTHAND” THEORY – A SUMMARY

•   Cost function expressions [with  J 0(x) ≡ 0]

J_π(x) = lim_{k→∞} (T_µ0 T_µ1 ··· T_µk J_0)(x),    J_µ(x) = lim_{k→∞} (T_µ^k J_0)(x)

•   Bellman’s equation: J∗ = TJ∗,   J_µ = T_µ J_µ

•   Optimality condition:

µ: optimal   <==>   T_µ J∗ = TJ∗

•   Value iteration: For any (bounded) J and all x,

J∗(x) = lim_{k→∞} (T^k J)(x)

•   Policy iteration: Given  µk,

−   Policy evaluation: Find J µk   by solving

J µk  = T µkJ µk

−   Policy improvement: Find  µk+1 such that

T µk+1 J µk  = T J µk


INTERPRETATION OF VI AND PI

[Figure: one-dimensional interpretation of VI and PI on the 45-degree-line diagram. Value iteration generates TJ_0, T^2 J_0, . . . converging to the fixed point J∗ = TJ∗; policy iteration alternates policy evaluation (J_µ1 = T_µ1 J_µ1) with policy improvement.]


Q-LEARNING I

•   We can write Bellman’s equation as

J∗(i) = min_{u∈U(i)} Q∗(i, u),   i = 1, . . . , n,

where Q∗ is the unique solution of

Q∗(i, u) = Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α min_{v∈U(j)} Q∗(j, v) )

•   Q∗(i, u) is called the optimal Q-factor of (i, u)

•   Q-factors have an optimal cost interpretation in an “augmented” problem whose states are i and (i, u), u ∈ U(i), where i ∈ {1, . . . , n}

•   We can equivalently write the VI method as

J_{k+1}(i) = min_{u∈U(i)} Q_{k+1}(i, u),   i = 1, . . . , n,

where Q_{k+1} is generated for all i and u ∈ U(i) by

Q_{k+1}(i, u) = Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α min_{v∈U(j)} Q_k(j, v) )
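•   A minimal sketch (my notation, not from the slides) of the Q-factor form of VI above, for a finite problem with transition costs stored as g3[u, i, j] = g(i, u, j) and per-control transition matrices P[u]:

    import numpy as np

    def q_value_iteration(P, g3, alpha, iters=500):
        """Q_{k+1}(i,u) = sum_j p_ij(u) [ g(i,u,j) + alpha * min_v Q_k(j,v) ]."""
        m, n, _ = g3.shape                    # g3[u, i, j] = g(i, u, j)
        Q = np.zeros((n, m))
        for _ in range(iters):
            J = Q.min(axis=1)                 # J_k(j) = min_v Q_k(j, v)
            Q = np.stack([(P[u] * (g3[u] + alpha * J[None, :])).sum(axis=1)
                          for u in range(m)], axis=1)
        return Q, Q.min(axis=1), Q.argmin(axis=1)   # Q*, J*, greedy policy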


Q-LEARNING II

•   Q-factors are no different than costs

•   They satisfy a Bellman equation Q = FQ where

(FQ)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α min_{v∈U(j)} Q(j, v) )

•   VI and PI for Q-factors are mathematically equivalent to VI and PI for costs

•   They require an equal amount of computation ... they just need more storage

•   Having optimal Q-factors is convenient when implementing an optimal policy on-line by

µ∗(i) = arg min_{u∈U(i)} Q∗(i, u)

•   Once Q∗(i, u) are known, the model [g and p_ij(u)] is not needed. Model-free operation

•   Later we will see how stochastic/sampling methods can be used to calculate (approximations of) Q∗(i, u) using a simulator of the system (no model needed)


APPROXIMATE PI

•   Suppose that the policy evaluation is approximate, according to

max_x |J_k(x) − J_µk(x)| ≤ δ,   k = 0, 1, . . .

and policy improvement is approximate, according to

max_x |(T_µk+1 J_k)(x) − (TJ_k)(x)| ≤ ε,   k = 0, 1, . . .

where δ and ε are some positive scalars.

•   Error Bound: The sequence {µ_k} generated by approximate policy iteration satisfies

lim sup_{k→∞} max_{x∈S} ( J_µk(x) − J∗(x) ) ≤ (ε + 2αδ) / (1 − α)^2

•   Typical practical behavior: The method makes steady progress up to a point and then the iterates J_µk oscillate within a neighborhood of J∗.


OPTIMISTIC PI

•   This is PI, where policy evaluation is carried out by a finite number of VI

•   Shorthand definition: For some integers m_k

T_µk J_k = TJ_k,    J_{k+1} = T_µk^{m_k} J_k,    k = 0, 1, . . .

−   If m_k ≡ 1 it becomes VI

−   If m_k = ∞ it becomes PI

−   For intermediate values of m_k, it is generally more efficient than either VI or PI

[Figure: 45-degree-line interpretation of optimistic PI. Starting from J_0, the policy µ_0 is obtained from TJ_0 = T_µ0 J_0; approximate policy evaluation applies T_µ0 a finite number of times (e.g., J_1 = T_µ0^2 J_0) before the next policy improvement, instead of going all the way to J_µ0 = T_µ0 J_µ0.]
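•   A rough sketch of optimistic PI under the shorthand definition above, reusing the T_mu and greedy_policy helpers sketched earlier (assumed, not from the slides); the exact policy evaluation is replaced by m_k value iteration sweeps with T_µk:

    def optimistic_policy_iteration(P, g, alpha, m_k=5, iters=200):
        """T_muk J_k = T J_k (improvement), J_{k+1} = T_muk^{m_k} J_k (evaluation)."""
        J = np.zeros(g.shape[0])
        for _ in range(iters):
            mu = greedy_policy(J, P, g, alpha)   # mu_k attains the min in T J_k
            for _ in range(m_k):                 # approximate evaluation: m_k VI steps
                J = T_mu(J, P, g, alpha, mu)
        return J, mu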


GENERAL FORMS OF DISCOUNTED DP

•   We consider an abstract form of discounted DP

•   Denote R(X): set of real-valued functions J : X → ℜ, and let H : X × U × R(X) → ℜ be a given mapping. We consider the mapping

(TJ)(x) = min_{u∈U(x)} H(x, u, J),   ∀ x ∈ X.

•   For each µ, we consider

(T_µJ)(x) = H(x, µ(x), J),   ∀ x ∈ X.

•   We want to find a function J∗ such that

J∗(x) = min_{u∈U(x)} H(x, u, J∗),   ∀ x ∈ X

and a µ∗ such that T_µ∗ J∗ = TJ∗.

•   Discounted, Discounted Semi-Markov, Minimax:

H(x, u, J) = E{ g(x,u,w) + αJ(f(x,u,w)) }

H(x, u, J) = G(x, u) + Σ_{y=1}^n m_xy(u) J(y)

H(x, u, J) = max_{w∈W(x,u)} { g(x,u,w) + αJ(f(x,u,w)) }


ASSUMPTIONS AND RESULTS

•   Monotonicity assumption: If J, J′ ∈ R(X) and J ≤ J′, then

H(x, u, J) ≤ H(x, u, J′),   ∀ x ∈ X, u ∈ U(x)

•   Contraction assumption:

−   For every J ∈ B(X), the functions T_µJ and TJ belong to B(X).

−   For some α ∈ (0, 1) and all J, J′ ∈ B(X), H satisfies

|H(x, u, J) − H(x, u, J′)| ≤ α max_{y∈X} |J(y) − J′(y)|

for all x ∈ X and u ∈ U(x).

•   Standard algorithmic results extend:

−   Generalized VI converges to J∗, the unique fixed point of T

−   Generalized PI and optimistic PI generate {µ_k} such that

lim_{k→∞} ‖J_µk − J∗‖ = 0


6.231 DYNAMIC PROGRAMMING

LECTURE 18

LECTURE OUTLINE

•   Analysis of asynchronous VI and PI for generalized discounted DP

•   Undiscounted problems

•   Stochastic shortest path problems (SSP)

•   Proper and improper policies

•   Analysis and computational methods for SSP

•   Pathologies of SSP


ASYNCHRONOUS ALGORITHMS

•   Motivation for asynchronous algorithms

−   Faster convergence

−   Parallel and distributed computation

−   Simulation-based implementations

•   General framework: Partition X into disjoint nonempty subsets X_1, . . . , X_m, and use a separate processor ℓ updating J(x) for x ∈ X_ℓ

•   Let J be partitioned as

J = (J_1, . . . , J_m),

where J_ℓ is the restriction of J on the set X_ℓ.

•   Synchronous algorithm:

J_ℓ^{t+1}(x) = T(J_1^t, . . . , J_m^t)(x),   x ∈ X_ℓ,  ℓ = 1, . . . , m

•   Asynchronous algorithm: For some subsets of times R_ℓ,

J_ℓ^{t+1}(x) = T(J_1^{τ_ℓ1(t)}, . . . , J_m^{τ_ℓm(t)})(x)   if t ∈ R_ℓ,
J_ℓ^{t+1}(x) = J_ℓ^t(x)   if t ∉ R_ℓ

where t − τ_ℓj(t) are communication “delays”


ONE-STATE-AT-A-TIME ITERATIONS

•   Important special case: Assume n “states”, a separate processor for each state, and no delays

•   Generate a sequence of states {x_0, x_1, . . .}, generated in some way, possibly by simulation (each state is generated infinitely often)

•   Asynchronous VI:

J_ℓ^{t+1} = T(J_1^t, . . . , J_n^t)(ℓ)   if ℓ = x_t,
J_ℓ^{t+1} = J_ℓ^t   if ℓ ≠ x_t,

where T(J_1^t, . . . , J_n^t)(ℓ) denotes the ℓ-th component of the vector

T(J_1^t, . . . , J_n^t) = TJ^t,

and for simplicity we write J_ℓ^t instead of J_ℓ^t(ℓ)

•   The special case where

{x_0, x_1, . . .} = {1, . . . , n, 1, . . . , n, 1, . . .}

is the Gauss-Seidel method

•   We can show that J^t → J∗ under the contraction assumption
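•   A minimal sketch (my notation, no communication delays, same finite-model arrays as in the earlier sketches) of the one-state-at-a-time iteration above: only the component J(ℓ) with ℓ = x_t is updated at time t, the other components are left as they are; sweeping the states cyclically gives the Gauss-Seidel method:

    import numpy as np

    def async_value_iteration(P, g, alpha, state_sequence):
        """Update J(l) in place for l = x_t; other components are left unchanged."""
        n, m = g.shape
        J = np.zeros(n)
        for x_t in state_sequence:            # each state must appear infinitely often
            J[x_t] = min(g[x_t, u] + alpha * P[u][x_t] @ J for u in range(m))
        return J

    # Gauss-Seidel sweep: J = async_value_iteration(P, g, alpha, list(range(n)) * 100)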


ASYNCHRONOUS CONV. THEOREM I

•   Assume that for all ℓ, j = 1, . . . , m, R_ℓ is infinite and lim_{t→∞} τ_ℓj(t) = ∞

•   Proposition: Let T have a unique fixed point J∗, and assume that there is a sequence of nonempty subsets S(k) ⊂ R(X) with S(k + 1) ⊂ S(k) for all k, and with the following properties:

(1)   Synchronous Convergence Condition: Every sequence {J^k} with J^k ∈ S(k) for each k, converges pointwise to J∗. Moreover, we have

TJ ∈ S(k + 1),   ∀ J ∈ S(k),  k = 0, 1, . . . .

(2)   Box Condition: For all k, S(k) is a Cartesian product of the form

S(k) = S_1(k) × ··· × S_m(k),

where S_ℓ(k) is a set of real-valued functions on X_ℓ, ℓ = 1, . . . , m.

Then for every J ∈ S(0), the sequence {J^t} generated by the asynchronous algorithm converges pointwise to J∗.


ASYNCHRONOUS CONV. THEOREM II

•   Interpretation of assumptions:

[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k + 1) containing J∗, with J = (J_1, J_2) and component sets S_1(0), S_2(0).]

A synchronous iteration from any J in S(k) moves into S(k + 1) (component-by-component)

•   Convergence mechanism:

[Figure: the same nested sets, with separate J_1 iterations and J_2 iterations.]

Key: “Independent” component-wise improvement. An asynchronous component iteration from any J in S(k) moves into the corresponding component portion of S(k + 1)


UNDISCOUNTED PROBLEMS

•   System: x_{k+1} = f(x_k, u_k, w_k)

•   Cost of a policy π = {µ_0, µ_1, . . .}

J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N−1} g(x_k, µ_k(x_k), w_k) }

•   Shorthand notation for DP mappings

(TJ)(x) = min_{u∈U(x)} E_w{ g(x,u,w) + J(f(x,u,w)) },   ∀ x

•   For any stationary policy µ

(T_µJ)(x) = E_w{ g(x, µ(x), w) + J(f(x, µ(x), w)) },   ∀ x

•   Neither T nor T_µ are contractions in general, but their monotonicity is helpful.

•   SSP problems provide a “soft boundary” between the easy finite-state discounted problems and the hard undiscounted problems.

−   They share features of both.

−   Some of the nice theory is recovered because of the termination state.


SSP THEORY SUMMARY I

•   As earlier, we have a cost-free termination state t, a finite number of states 1, . . . , n, and a finite number of controls, but we will make weaker assumptions.

•   Mappings T and T_µ (modified to account for the termination state t):

(TJ)(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) J(j) ],   i = 1, . . . , n,

(T_µJ)(i) = g(i, µ(i)) + Σ_{j=1}^n p_ij(µ(i)) J(j),   i = 1, . . . , n.

•   Definition: A stationary policy µ is called proper, if under µ, from every state i, there is a positive probability path that leads to t.

•   Important fact: If µ is proper, T_µ is a contraction with respect to some weighted max norm

max_i (1/v_i) |(T_µJ)(i) − (T_µJ′)(i)| ≤ ρ_µ max_i (1/v_i) |J(i) − J′(i)|

•   T is similarly a contraction if all µ are proper (the case discussed in the text, Ch. 7, Vol. I).


SSP THEORY SUMMARY II

•   The theory can be pushed one step further. Assume that:

(a) There exists at least one proper policy

(b) For each improper µ, J_µ(i) = ∞ for some i

•   Then T is not necessarily a contraction, but:

−   J∗ is the unique solution of Bellman’s Equ.

−   µ∗ is optimal if and only if T_µ∗ J∗ = TJ∗

−   lim_{k→∞} (T^k J)(i) = J∗(i) for all i

−   Policy iteration terminates with an optimal policy, if started with a proper policy

•   Example: Deterministic shortest path problem with a single destination t.

−   States   <=>   nodes; Controls   <=>   arcs

−  Termination state   <=>   the destination

−   Assumption (a)   <=>   every node is con-nected to the destination

−   Assumption (b)   <=>   all cycle costs  > 0


SSP ANALYSIS I

•   For a proper policy µ, J_µ is the unique fixed point of T_µ, and T_µ^k J → J_µ for all J (holds by the theory of Vol. I, Section 7.2)

•   A stationary µ satisfying J ≥ T_µJ for some J must be proper - true because

J ≥ T_µ^k J = P_µ^k J + Σ_{m=0}^{k−1} P_µ^m g_µ

and some component of the term on the right blows up if µ is improper (by our assumptions).

•   Consequence: T can have at most one fixed point.

Proof: If J and J′ are two fixed points, select µ and µ′ such that J = TJ = T_µJ and J′ = TJ′ = T_µ′J′. By the preceding assertion, µ and µ′ must be proper, and J = J_µ and J′ = J_µ′. Also

J = T^k J ≤ T_µ′^k J → J_µ′ = J′

Similarly, J′ ≤ J, so J = J′.


SSP ANALYSIS II

•   We now show that T has a fixed point, and also that policy iteration converges.

•   Generate a sequence of proper policies {µ^k} by policy iteration starting from a proper policy µ^0.

•   µ^1 is proper and J_µ0 ≥ J_µ1 since

J_µ0 = T_µ0 J_µ0 ≥ TJ_µ0 = T_µ1 J_µ0 ≥ T_µ1^k J_µ0 ≥ J_µ1

•   Thus {J_µk} is nonincreasing, some policy µ is repeated, with J_µ = TJ_µ. So J_µ is a fixed point of T.

•   Next show T^k J → J_µ for all J, i.e., value iteration converges to the same limit as policy iteration. (Sketch: True if J = J_µ; argue using the properness of µ to show that the terminal cost difference J − J_µ does not matter.)

•   To show J_µ = J∗, for any π = {µ_0, µ_1, . . .}

T_µ0 ··· T_µk−1 J_0 ≥ T^k J_0,

where J_0 ≡ 0. Take lim sup as k → ∞, to obtain J_π ≥ J_µ, so µ is optimal and J_µ = J∗.


SSP ANALYSIS III

•   Contraction Property: If all policies are proper (the assumption of Section 7.1, Vol. I), T_µ and T are contractions with respect to a weighted sup norm.

Proof: Consider a new SSP problem where the transition probabilities are the same as in the original, but the transition costs are all equal to −1. Let J be the corresponding optimal cost vector. For all µ,

J(i) = −1 + min_{u∈U(i)} Σ_{j=1}^n p_ij(u) J(j) ≤ −1 + Σ_{j=1}^n p_ij(µ(i)) J(j)

For v_i = −J(i), we have v_i ≥ 1, and for all µ,

Σ_{j=1}^n p_ij(µ(i)) v_j ≤ v_i − 1 ≤ ρ v_i,   i = 1, . . . , n,

where

ρ = max_{i=1,...,n} (v_i − 1)/v_i < 1.

This implies contraction of T_µ and T by the results of the preceding lecture.


PATHOLOGIES II: DETERM. SHORTEST PATHS

•   If there is a cycle with cost < 0, Bellman’s equation has no solution [among functions J with −∞ < J(i) < ∞ for all i]. Example:

[Figure: nodes 1, 2 and destination t; arc 1 → 2 with cost 0, arc 2 → 1 with cost −1, arc 2 → t with cost 1.]

•   We have J∗(1) = J∗(2) = −∞.

•   Bellman’s equation is

J(1) = J(2),    J(2) = min[ −1 + J(1), 1 ].

•   There is no solution [among functions J with −∞ < J(i) < ∞ for all i].

•   Bellman’s equation has as solution J∗(1) = J∗(2) = −∞ [within the larger class of functions J(·) that can take the value −∞ for some (or all) states]. This situation can be generalized (see Chapter 3 of Vol. 2 of the text).


PATHOLOGIES III: THE BLACKMAILER

•   Two states, state 1 and the termination state t.

•   At state 1, choose a control u ∈ (0, 1] (the blackmail amount demanded) at a cost −u, and move to t with probability u^2, or stay in 1 with probability 1 − u^2.

•   Every stationary policy is proper, but the control set is not finite.

•   For any stationary µ with µ(1) = u, we have

J_µ(1) = −u + (1 − u^2) J_µ(1)

from which J_µ(1) = −1/u

•   Thus J∗(1) = −∞, and there is no optimal stationary policy.

•   It turns out that a nonstationary policy is optimal: demand µ_k(1) = γ/(k + 1) at time k, with γ ∈ (0, 1/2).

−   The blackmailer requests diminishing amounts over time, which add to ∞.

−   The probability of the victim’s refusal diminishes at a much faster rate, so the probability that the victim stays forever compliant is strictly positive.


SSP ALGORITHMS

•   All the basic algorithms have counterparts

•   “Easy” case: All policies proper, in which case the mappings T and T_µ are contractions

•   Even in the general case all basic algorithms have satisfactory counterparts

•  Value iteration, policy iteration

•   Q-learning

•   Optimistic policy iteration

•  Asynchronous value iteration

•   Asynchronous policy iteration


6.231 DYNAMIC PROGRAMMING

LECTURE 19

LECTURE OUTLINE

•   We begin a lecture series on approximate DP for large/intractable problems.

•   Check the updated/expanded version of Chapter 6, Vol. 2, at http://web.mit.edu/dimitrib/www/dpchapter.html

•   Today we classify/overview the main approaches:

−   Rollout/Simulation-based single policy iteration (will not discuss this further)

−   Approximation in value space (approximate policy iteration, Q-Learning, Bellman error approach, approximate LP)

−   Approximation in value space using problem approximation (simplification - aggregation - limited lookahead) - will not discuss much

−   Approximation in policy space (policy parametrization, gradient methods)


GENERAL ORIENTATION

•   We will mainly adopt an n-state discounted model (the easiest case - but think of HUGE n).

•   Extensions to SSP and average cost are possible (but more quirky). We will set these aside for later.

•   Other than manual/trial-and-error approaches (e.g., as in computer chess), the only other approaches are simulation-based. They are collectively known as “neuro-dynamic programming” or “reinforcement learning”.

•   Simulation is essential for large state spaces because of its (potential) computational complexity advantage in computing sums/expectations involving a very large number of terms.

•   Simulation also comes in handy when an analytical model of the system is unavailable, but a simulation/computer model is possible.

•   Simulation-based methods are of three types:

−  Rollout (we will not discuss further)

−   Approximation in value space

− Approximation in policy space


APPROXIMATION IN POLICY SPACE

•   A brief discussion; we will return to it at the end.

•   We parameterize the set of policies by a vector r = (r_1, . . . , r_s) and we optimize the cost over r.

•   Discounted problem example:

−   Each value of r defines a stationary policy, with cost starting at state i denoted by J_i(r).

−   Use a gradient (or other) method to minimize over r

J(r) = Σ_{i=1}^n q(i) J_i(r),

where (q(1), . . . , q(n)) is some probability distribution over the states.

•   In a special case of this approach, the parameterization of the policies is indirect, through an approximate cost function.

−   A cost approximation architecture parameterized by r defines a policy dependent on r via the minimization in Bellman’s equation.


APPROX. IN VALUE SPACE - APPROACHES

•   Approximate PI (Policy evaluation/Policy improvement)

−   Uses simulation algorithms to approximate the cost J_µ of the current policy µ

−   Projected equation and aggregation approaches

•   Approximation of the optimal cost function J∗

−   Q-Learning: Use a simulation algorithm to approximate the optimal costs J∗(i) or the Q-factors

Q∗(i, u) = g(i, u) + α Σ_{j=1}^n p_ij(u) J∗(j)

−   Bellman error approach: Find r to

min_r E_i{ ( J̃(i, r) − (T J̃)(i, r) )^2 }

where E_i{·} is taken with respect to some distribution

−   Approximate LP (discussed earlier - supplemented with clever schemes to overcome the large number of constraints issue)


POLICY EVALUATE/POLICY IMPROVE

•   General structure

[Figure: block diagram of the approximate PI structure. A system simulator produces state samples i; a decision generator computes the decision µ(i) of the current policy using cost-to-go values J̃(j, r) supplied by a cost-to-go approximator; the samples feed a cost approximation algorithm that produces the next J̃(j, r).]

•   J̃(j, r) is the cost approximation for the preceding policy, used by the decision generator to compute the current policy µ [whose cost is approximated by J̃(j, r) using simulation]

•   There are several cost approximation/policy evaluation algorithms

•   There are several important issues relating to the design of each block (to be discussed in the future).


POLICY EVALUATION APPROACHES I

•   Approximate the cost of the current policy by using a simulation method.

−   Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares

−   Indirect policy evaluation

−   An example of indirect approach: Solving the projected equation Φr = ΠT_µ(Φr) where Π is projection with respect to a suitable weighted Euclidean norm

[Figure: the direct method projects the cost vector J_µ onto the subspace S spanned by the basis functions (ΠJ_µ); the indirect method solves the projected form of Bellman’s equation Φr = ΠT_µ(Φr).]

•  Batch and incremental methods

•  Regular and optimistic policy iteration


POLICY EVALUATION APPROACHES II

•   Projected equation methods

−   TD(λ): Stochastic iterative algorithm for solving Φr = ΠT_µ(Φr)

−   LSPE(λ): A simulation-based form of projected value iteration

Φr_{k+1} = ΠT_µ(Φr_k) + simulation noise

[Figure: projected value iteration (PVI) applies T to Φr_k and projects the value iterate T(Φr_k) = g + αPΦr_k back onto the subspace S spanned by the basis functions; least squares policy evaluation (LSPE) does the same up to simulation error.]

−   LSTD(λ): Solves a simulation-based approximation using a linear system solver (e.g., Gaussian elimination/Matlab)

•   Aggregation methods: Solve

Φr = DT_µ(Φr)

where the rows of D and Φ are probability distributions (e.g., D and Φ “aggregate” rows and columns of the linear system J = T_µJ).


THEORETICAL BASIS OF APPROXIMATE PI

•   If policies are approximately evaluated using an approximation architecture:

max_i |J̃(i, r_k) − J_µk(i)| ≤ δ,   k = 0, 1, . . .

•   If policy improvement is also approximate,

max_i |(T_µk+1 J̃)(i, r_k) − (T J̃)(i, r_k)| ≤ ε,   k = 0, 1, . . .

•   Error Bound: The sequence {µ_k} generated by approximate policy iteration satisfies

lim sup_{k→∞} max_i ( J_µk(i) − J∗(i) ) ≤ (ε + 2αδ) / (1 − α)^2

•   Typical practical behavior: The method makes steady progress up to a point and then the iterates J_µk oscillate within a neighborhood of J∗.


THE USE OF SIMULATION

•   Simulation advantage: Compute (approximately) sums with a very large number of terms.

•   Example: Projection by Monte Carlo Simulation: Compute the projection ΠJ of J ∈ ℜ^n on the subspace S = {Φr | r ∈ ℜ^s}, with respect to a weighted Euclidean norm ‖·‖_v.

•   Find Φr∗, where

r∗ = arg min_{r∈ℜ^s} ‖Φr − J‖_v^2 = arg min_{r∈ℜ^s} Σ_{i=1}^n v_i ( φ(i)′r − J(i) )^2

•   Setting to 0 the gradient at r∗,

r∗ = ( Σ_{i=1}^n v_i φ(i)φ(i)′ )^{-1} Σ_{i=1}^n v_i φ(i) J(i)

•   Approximate by simulation the two “expected values”

r_k = ( Σ_{t=1}^k φ(i_t)φ(i_t)′ )^{-1} Σ_{t=1}^k φ(i_t) J(i_t)

•   Equivalent least squares alternative:

r_k = arg min_{r∈ℜ^s} Σ_{t=1}^k ( φ(i_t)′r − J(i_t) )^2
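•   A sketch of the simulation-based projection above (illustrative names, not from the text; it assumes the weights v are available to sample from): sample states i_t with probabilities proportional to v_i, then form the two sample sums and solve:

    import numpy as np

    def simulated_projection(Phi, J, v, num_samples, rng=np.random.default_rng(0)):
        """Approximate r* = (sum_i v_i phi(i)phi(i)')^{-1} sum_i v_i phi(i) J(i)."""
        n, s = Phi.shape
        p = v / v.sum()                        # sample states with probabilities ~ v_i
        idx = rng.choice(n, size=num_samples, p=p)
        A = sum(np.outer(Phi[i], Phi[i]) for i in idx)   # ~ sum_t phi(i_t) phi(i_t)'
        b = sum(Phi[i] * J[i] for i in idx)              # ~ sum_t phi(i_t) J(i_t)
        return np.linalg.solve(A, b)           # r_k; Phi @ r_k approximates Pi J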


THE ISSUE OF EXPLORATION

•   To evaluate a policy µ, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under µ.

•   As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.

•   This seriously impacts the improved policy µ.

•   This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is “relatively small” (e.g., a deterministic system).

•   One possibility to guarantee adequate exploration: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset.

•   Another possibility: Occasionally generating transitions that use a randomly selected control rather than the one dictated by the policy µ.

•   Other methods, to be discussed later, use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).


APPROXIMATING Q-FACTORS

•   The approach described so far for policy evaluation requires calculating expected values for all controls u ∈ U(i) (and knowledge of p_ij(u)).

•   Model-free alternative: Approximate Q-factors

Q(i, u, r) ≈ Σ_{j=1}^n p_ij(u) ( g(i,u,j) + αJ_µ(j) )

and use for policy improvement the minimization

µ(i) = arg min_{u∈U(i)} Q(i, u, r)

•   r is an adjustable parameter vector and Q(i, u, r) is a parametric architecture, such as

Q(i, u, r) = Σ_{k=1}^m r_k φ_k(i, u)

•   Can use any method for constructing cost approximations, e.g., TD(λ).

•   Use the Markov chain with states (i, u) - p_ij(µ(i)) is the transition probability to (j, µ(i)), 0 to other (j, u′).

•   Major concern: Acutely diminished exploration.


6.231 DYNAMIC PROGRAMMING

LECTURE 20

LECTURE OUTLINE

•   Discounted problems - Approximate policy evaluation/policy improvement

•   Direct approach - Least squares

•   Indirect approach - The projected equation

•   Contraction properties - Error bounds

•   Matrix form of the projected equation

•   Simulation-based implementation

•   LSTD and LSPE methods


APPROXIMATE PI

[Figure: approximate PI loop. Guess an initial policy; evaluate the approximate cost J̃_µ(r) = Φr using simulation; generate an “improved” policy µ̄; repeat.]

•   Linear cost function approximation

J̃(r) = Φr

where Φ is a full rank n × s matrix with the basis functions as columns, and ith row denoted φ(i)′.

•   Policy “improvement”

µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i,u,j) + αφ(j)′r )


POLICY EVALUATION

•   Focus on policy evaluation: approximate the cost of the current policy by using a simulation method.

−   Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares

−   Indirect policy evaluation - solving the projected equation Φr = ΠT_µ(Φr) where Π is projection with respect to a suitable weighted Euclidean norm

[Figure: the direct method projects the cost vector J_µ onto the subspace S spanned by the basis functions; the indirect method solves the projected form of Bellman’s equation Φr = ΠT_µ(Φr).]

•   Recall that projection can be implemented by simulation and least squares


DIRECT POLICY EVALUATION

•   Suppose we can implement in a simulator the current policy µ, and want to calculate J_µ by simulation.

•   Generate by simulation sample costs. Then:

J_µ(i) ≈ (1/M_i) Σ_{m=1}^{M_i} c(i, m)

c(i, m): mth (noisy) sample cost starting from state i

•   Approximating well each J_µ(i) is impractical for a large state space. Instead, a “compact representation” J̃_µ(i, r) is used, where r is a tunable parameter vector.

•   Direct approach: Calculate an optimal value r∗ of r by a least squares fit

r∗ = arg min_r Σ_{i=1}^n Σ_{m=1}^{M_i} ( c(i, m) − J̃_µ(i, r) )^2

•   Note that this is much easier when the architecture is linear - but this is not a requirement.


IMPLEMENTATION OF DIRECT APPROACH

•   Focus on a batch: an N-transition portion (i_0, . . . , i_N) of a simulated trajectory

•   We view the numbers

Σ_{t=k}^{N−1} α^{t−k} g(i_t, µ(i_t), i_{t+1}),   k = 0, . . . , N − 1,

as cost samples, one per initial state i_0, . . . , i_{N−1}

•   Solve the least squares problem

min_r (1/2) Σ_{k=0}^{N−1} ( J̃(i_k, r) − Σ_{t=k}^{N−1} α^{t−k} g(i_t, µ(i_t), i_{t+1}) )^2

•   If J̃(i_k, r) is linear in r, this problem can be solved by matrix inversion.

•   Another possibility: Use a gradient-type method [applies also to cases where J̃(i_k, r) is nonlinear]

•   A variety of gradient methods: Batch and incremental
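•   A minimal sketch of the batch least squares fit above for a linear architecture J̃(i, r) = φ(i)′r (array names are my own): compute the discounted tail costs along the batch by a backward recursion, then solve the linear least squares problem:

    import numpy as np

    def direct_evaluation_batch(Phi, trajectory, stage_costs, alpha):
        """Fit r to tail costs: min_r sum_k ( phi(i_k)' r - sum_{t>=k} alpha^{t-k} g_t )^2."""
        N = len(stage_costs)                   # transitions i_0 -> ... -> i_N
        tails = np.zeros(N)                    # tails[k] = sum_{t=k}^{N-1} alpha^{t-k} g_t
        acc = 0.0
        for k in reversed(range(N)):
            acc = stage_costs[k] + alpha * acc
            tails[k] = acc
        X = Phi[trajectory[:N]]                # rows phi(i_k)'
        r, *_ = np.linalg.lstsq(X, tails, rcond=None)
        return r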


INDIRECT POLICY EVALUATION

•   For the current policy µ, consider the corresponding mapping T:

(TJ)(i) = Σ_{j=1}^n p_ij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

•   The solution J_µ of Bellman’s equation J = TJ is approximated by the solution of

Φr = ΠT(Φr)

[Figure: indirect method - solving the projected form of Bellman’s equation Φr = ΠT(Φr) on the subspace S spanned by the basis functions.]


WEIGHTED EUCLIDEAN PROJECTIONS

•   Consider a weighted Euclidean norm

‖J‖_v = ( Σ_{i=1}^n v_i J(i)^2 )^{1/2},

where v is a vector of positive weights v_1, . . . , v_n.

•   Let Π denote the projection operation onto

S = {Φr | r ∈ ℜ^s}

with respect to this norm, i.e., for any J ∈ ℜ^n,

ΠJ = Φr∗

where

r∗ = arg min_{r∈ℜ^s} ‖J − Φr‖_v^2


KEY QUESTIONS AND RESULTS

•   Does the projected equation have a solution?

•   Under what conditions is the mapping ΠT a contraction, so ΠT has a unique fixed point?

•   Assuming ΠT has a unique fixed point Φr∗, how close is Φr∗ to J_µ?

•   Assumption: P has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive

ξ_j = lim_{N→∞} (1/N) Σ_{k=1}^N P(i_k = j | i_0 = i) > 0

•   Proposition: ΠT is a contraction of modulus α with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ = (ξ_1, . . . , ξ_n) is the steady-state probability vector. The unique fixed point Φr∗ of ΠT satisfies

‖J_µ − Φr∗‖_ξ ≤ (1/√(1 − α^2)) ‖J_µ − ΠJ_µ‖_ξ


PRELIMINARIES: PROJECTION PROPERTIES

•   Important property of the projection Π on S with weighted Euclidean norm ‖·‖_v. For all J ∈ ℜ^n, J̄ ∈ S, the Pythagorean Theorem holds:

‖J − J̄‖_v^2 = ‖J − ΠJ‖_v^2 + ‖ΠJ − J̄‖_v^2

Proof: Geometrically, (J − ΠJ) and (ΠJ − J̄) are orthogonal in the scaled geometry of the norm ‖·‖_v, where two vectors x, y ∈ ℜ^n are orthogonal if Σ_{i=1}^n v_i x_i y_i = 0. Expand the quadratic in the RHS below:

‖J − J̄‖_v^2 = ‖(J − ΠJ) + (ΠJ − J̄)‖_v^2

•   The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,

‖ΠJ − ΠJ̄‖_v ≤ ‖J − J̄‖_v,   for all J, J̄ ∈ ℜ^n.

To see this, note that

‖Π(J − J̄)‖_v^2 ≤ ‖Π(J − J̄)‖_v^2 + ‖(I − Π)(J − J̄)‖_v^2 = ‖J − J̄‖_v^2


PROOF OF CONTRACTION PROPERTY

•   Lemma: We have

‖Pz‖_ξ ≤ ‖z‖_ξ,   z ∈ ℜ^n

Proof: Let p_ij be the components of P. For all z ∈ ℜ^n, we have

‖Pz‖_ξ^2 = Σ_{i=1}^n ξ_i ( Σ_{j=1}^n p_ij z_j )^2 ≤ Σ_{i=1}^n ξ_i Σ_{j=1}^n p_ij z_j^2
         = Σ_{j=1}^n Σ_{i=1}^n ξ_i p_ij z_j^2 = Σ_{j=1}^n ξ_j z_j^2 = ‖z‖_ξ^2,

where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property Σ_{i=1}^n ξ_i p_ij = ξ_j of the steady-state probabilities.

•   Using the lemma, the nonexpansiveness of Π, and the definition TJ = g + αPJ, we have

‖ΠTJ − ΠT J̄‖_ξ ≤ ‖TJ − T J̄‖_ξ = α‖P(J − J̄)‖_ξ ≤ α‖J − J̄‖_ξ

for all J, J̄ ∈ ℜ^n. Hence ΠT is a contraction of modulus α.


PROOF OF ERROR BOUND

•   Let Φr∗ be the fixed point of ΠT. We have

‖J_µ − Φr∗‖_ξ ≤ (1/√(1 − α^2)) ‖J_µ − ΠJ_µ‖_ξ.

Proof: We have

‖J_µ − Φr∗‖_ξ^2 = ‖J_µ − ΠJ_µ‖_ξ^2 + ‖ΠJ_µ − Φr∗‖_ξ^2
               = ‖J_µ − ΠJ_µ‖_ξ^2 + ‖ΠTJ_µ − ΠT(Φr∗)‖_ξ^2
               ≤ ‖J_µ − ΠJ_µ‖_ξ^2 + α^2 ‖J_µ − Φr∗‖_ξ^2,

where

−   The first equality uses the Pythagorean Theorem

−   The second equality holds because J_µ is the fixed point of T and Φr∗ is the fixed point of ΠT

−   The inequality uses the contraction property of ΠT.

Q.E.D.


MATRIX FORM OF PROJECTED EQUATION

•   Its solution is the vector J = Φr∗, where r∗ solves the problem

min_{r∈ℜ^s} ‖Φr − (g + αPΦr∗)‖_ξ^2.

•   Setting to 0 the gradient with respect to r of this quadratic, we obtain

Φ′Ξ( Φr∗ − (g + αPΦr∗) ) = 0,

where Ξ is the diagonal matrix with the steady-state probabilities ξ_1, . . . , ξ_n along the diagonal.

•   This is just the orthogonality condition: The error Φr∗ − (g + αPΦr∗) is “orthogonal” to the subspace spanned by the columns of Φ.

•   Equivalently, Cr∗ = d, where

C = Φ′Ξ(I − αP)Φ,    d = Φ′Ξg.
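•   For a small problem where P, g, ξ, and Φ are available explicitly, the matrix form above can be evaluated directly; a numpy sketch (my notation, exact rather than simulation-based):

    import numpy as np

    def projected_equation_solution(Phi, P, g, xi, alpha):
        """C = Phi' Xi (I - alpha P) Phi,  d = Phi' Xi g,  r* = C^{-1} d."""
        Xi = np.diag(xi)                       # diagonal steady-state weights
        C = Phi.T @ Xi @ (np.eye(len(g)) - alpha * P) @ Phi
        d = Phi.T @ Xi @ g
        r_star = np.linalg.solve(C, d)
        return r_star                          # Phi @ r_star is the fixed point of Pi T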


PROJECTED EQUATION: SOLUTION METHOD

•   Matrix inversion: r∗ = C^{-1} d

•   Projected Value Iteration (PVI) method:

Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)

Converges to r∗ because ΠT is a contraction.

[Figure: PVI on the subspace S spanned by the basis functions - the value iterate T(Φr_k) = g + αPΦr_k is projected back onto S to obtain Φr_{k+1}.]

•   PVI can be written as:

r_{k+1} = arg min_{r∈ℜ^s} ‖Φr − (g + αPΦr_k)‖_ξ^2

By setting to 0 the gradient with respect to r,

Φ′Ξ( Φr_{k+1} − (g + αPΦr_k) ) = 0,

which yields

r_{k+1} = r_k − (Φ′ΞΦ)^{-1}(Cr_k − d)


SIMULATION-BASED IMPLEMENTATIONS

•   Key idea: Calculate simulation-based approximations based on k samples

C_k ≈ C,    d_k ≈ d

•   Matrix inversion r∗ = C^{-1} d is approximated by

r_k = C_k^{-1} d_k

This is the LSTD (Least Squares Temporal Differences) method.

•   The PVI method r_{k+1} = r_k − (Φ′ΞΦ)^{-1}(Cr_k − d) is approximated by

r_{k+1} = r_k − G_k (C_k r_k − d_k)

where

G_k ≈ (Φ′ΞΦ)^{-1}

This is the LSPE (Least Squares Policy Evaluation) method.

•   Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).


SIMULATION MECHANICS

•   We generate an infinitely long trajectory (i_0, i_1, . . .) of the Markov chain, so states i and transitions (i, j) appear with long-term frequencies ξ_i and p_ij.

•   After generating the transition (i_t, i_{t+1}), we compute the row φ(i_t)′ of Φ and the cost component g(i_t, i_{t+1}).

•   We form

C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) ( φ(i_t) − αφ(i_{t+1}) )′ ≈ Φ′Ξ(I − αP)Φ

d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1}) ≈ Φ′Ξg

Also in the case of LSPE

G_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) φ(i_t)′ ≈ Φ′ΞΦ

•   Convergence proof is simple: Use the law of large numbers.

•   Note that C_k, d_k, and G_k can be formed incrementally.
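•   A sketch (illustrative, single trajectory, λ = 0 case) of the LSTD estimates above: accumulate C_k and d_k from the simulated transitions and solve C_k r = d_k:

    import numpy as np

    def lstd_from_trajectory(Phi, trajectory, stage_costs, alpha):
        """C_k ~ Phi' Xi (I - alpha P) Phi and d_k ~ Phi' Xi g from one long trajectory."""
        s = Phi.shape[1]
        C = np.zeros((s, s))
        d = np.zeros(s)
        for t in range(len(trajectory) - 1):
            i, j = trajectory[t], trajectory[t + 1]
            phi_i, phi_j = Phi[i], Phi[j]
            C += np.outer(phi_i, phi_i - alpha * phi_j)  # phi(i_t)(phi(i_t)-alpha phi(i_{t+1}))'
            d += phi_i * stage_costs[t]                  # phi(i_t) g(i_t, i_{t+1})
        k = len(trajectory) - 1
        C, d = C / k, d / k                    # the 1/(k+1) scaling cancels in C^{-1} d
        return np.linalg.solve(C, d)           # LSTD estimate of r*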


6.231 DYNAMIC PROGRAMMING

LECTURE 21

LECTURE OUTLINE

•   Review of projected equation methods

•   Optimistic versions

•   Multistep projected equation methods

•   Bias-variance tradeoff 

•   Exploration-enhanced implementations

•   Oscillations


REVIEW: PROJECTED BELLMAN EQUATION

•   For a fixed policy µ to be evaluated, consider the corresponding mapping T:

(TJ)(i) = Σ_{j=1}^n p_ij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

or more compactly,

TJ = g + αPJ

•   Approximate Bellman’s equation J = TJ by Φr = ΠT(Φr) or the matrix form/orthogonality condition Cr∗ = d, where

C = Φ′Ξ(I − αP)Φ,    d = Φ′Ξg.

[Figure: indirect method - solving the projected form of Bellman’s equation Φr = ΠT(Φr) on the subspace S spanned by the basis functions.]


PROJECTED EQUATION METHODS

•   Matrix inversion: r∗ = C^{-1} d

•   Iterative Projected Value Iteration (PVI) method:

Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)

Converges to r∗ if ΠT is a contraction. True if Π is projection with respect to ‖·‖_ξ, where ξ is the steady-state distribution.

•   Simulation-Based Implementations: Calculate approximations based on k simulation samples

C_k ≈ C,    d_k ≈ d

•   LSTD: r∗ = C^{-1} d is approximated by

r_k = C_k^{-1} d_k

•   LSPE:

r_{k+1} = r_k − G_k (C_k r_k − d_k)

where G_k ≈ (Φ′ΞΦ)^{-1}. Converges to r∗ if ΠT is a contraction.

•   Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).


OPTIMISTIC VERSIONS

•   Use coarse approximations C_k ≈ C and d_k ≈ d, based on few simulation samples

•   Evaluate (coarsely) the current policy µ, then do a policy improvement

•   Very complex behavior (see the subsequent discussion on oscillations)

•   Often approaches the limit more quickly (as optimistic methods often do)

•   The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling)

•   LSPE tends to cope better because of its iterative nature

•   A stepsize γ ∈ (0, 1] in LSPE may be useful to damp the effect of simulation noise

r_{k+1} = r_k − γ G_k (C_k r_k − d_k)


MULTISTEP METHODS

•   Introduce a multistep version of Bellman’s equation J = T^(λ)J, where for λ ∈ [0, 1),

T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

Geometrically weighted sum of powers of T.

•   Note that T^t is a contraction with modulus α^t, with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ is the steady-state probability vector of the Markov chain.

•   Hence T^(λ) is a contraction with modulus

α_λ = (1 − λ) Σ_{t=0}^∞ α^{t+1} λ^t = α(1 − λ)/(1 − αλ)

Note that α_λ → 0 as λ → 1.

•   T^t and T^(λ) have the same fixed point J_µ and

‖J_µ − Φr∗_λ‖_ξ ≤ (1/√(1 − α_λ^2)) ‖J_µ − ΠJ_µ‖_ξ

where Φr∗_λ is the fixed point of ΠT^(λ).

•   The fixed point Φr∗_λ depends on λ.


BIAS-VARIANCE TRADEOFF

•   Error bound: ‖J_µ − Φr∗_λ‖_ξ ≤ (1/√(1 − α_λ^2)) ‖J_µ − ΠJ_µ‖_ξ

•   Bias-variance tradeoff:

−   As λ ↑ 1, we have α_λ ↓ 0, so the error bound (and the quality of approximation) improves as λ ↑ 1. In fact

lim_{λ↑1} Φr∗_λ = ΠJ_µ

−   But simulation noise in approximating

T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

increases.

[Figure: in the subspace S = {Φr | r ∈ ℜ^s}, the solution Φr∗_λ of the projected equation Φr = ΠT_µ^(λ)(Φr) moves from a biased point (λ = 0) toward ΠJ_µ (λ = 1), while the simulation error grows with λ.]

•   Choice of  λ  is based on trial and error


MULTISTEP PROJECTED EQ. METHODS

•   The projected Bellman equation is

Φr = ΠT^(λ)(Φr)

•   In matrix form: C^(λ) r = d^(λ), where

C^(λ) = Φ′Ξ( I − αP^(λ) )Φ,    d^(λ) = Φ′Ξ g^(λ),

with

P^(λ) = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^{ℓ+1},    g^(λ) = Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^ℓ g

•   The LSTD(λ) method is

( C_k^(λ) )^{-1} d_k^(λ),

where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ).

•   The LSPE(λ) method is

r_{k+1} = r_k − γ G_k ( C_k^(λ) r_k − d_k^(λ) )

where G_k is a simulation-based approximation to (Φ′ΞΦ)^{-1}

•   TD(λ) is a simplified version of LSPE(λ) where G_k = I, and simpler approximations to C and d are used. It is more properly viewed as a stochastic iteration for solving the projected equation.


MORE ON MULTISTEP METHODS

•   The simulation process to obtain C_k^(λ) and d_k^(λ) is similar to the case λ = 0 (single simulation trajectory i_0, i_1, . . ., more complex formulas)

C_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} ( φ(i_m) − αφ(i_{m+1}) )′,

d_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} g(i_m, i_{m+1})

•   In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates).

•   Many different versions ... see the on-line chapter.

•   Note the λ-tradeoffs:

−   As λ ↑ 1, C_k^(λ) and d_k^(λ) contain more “simulation noise”, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)

−   The error bound ‖J_µ − Φr_λ‖_ξ becomes smaller

−   As λ ↑ 1, ΠT^(λ) becomes a contraction for arbitrary projection norm


POLICY ITERATION ISSUES

•   1st major issue: exploration. Common remedy is the off-policy approach: Replace P of the current policy with

P̄ = (I − B)P + BQ,

where B is a diagonal matrix with diagonal components β_i ∈ [0, 1] and Q is another transition probability matrix.

•   Then the LSTD and LSPE formulas must be modified ... otherwise the policy associated with P̄ (not P) is evaluated (see the on-line chapter).

•   2nd major issue: oscillation of policies

•   Analysis using the greedy partition: R_µ is the set of parameter vectors r for which µ is greedy with respect to J̃(·, r) = Φr

R_µ = { r | T_µ(Φr) = T(Φr) }

[Figure: the greedy partition of the parameter space into regions R_µk, R_µk+1, R_µk+2, R_µk+3; approximate PI can cycle among the corresponding parameter vectors r_µk, r_µk+1, r_µk+2, r_µk+3.]


MORE ON OSCILLATIONS/CHATTERING

•   In the case of optimistic policy iteration a different picture holds

[Figure: greedy partition regions R_µ1, R_µ2, R_µ3 with parameter vectors r_µ1, r_µ2, r_µ3; optimistic PI chatters among them.]

•   Oscillations are less violent, but the “limit” point is meaningless!

•   Fundamentally, oscillations are due to the lack of monotonicity of the projection operator, i.e., J ≤ J′ does not imply ΠJ ≤ ΠJ′.

•   If approximate PI uses policy evaluation

Φr = (WT_µ)(Φr)

with W a monotone operator, the generated policies converge (to a possibly nonoptimal limit).

•   The operator W used in the aggregation approach has this monotonicity property.


6.231 DYNAMIC PROGRAMMING

LECTURE 22

LECTURE OUTLINE

•  Aggregation as an approximation methodology

•   Aggregate problem

•   Examples of aggregation

•   Simulation-based aggregation

•   Q-Learning


PROBLEM APPROXIMATION - AGGREGATION

•   Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem. The simplification is often ad-hoc/problem dependent.

•   Aggregation is a systematic approach for problem approximation. Main elements:

−   Introduce a few “aggregate” states, viewed as the states of an “aggregate” system

−   Define transition probabilities and costs of the aggregate system, by relating original system states with aggregate states

−   Solve (exactly or approximately) the “aggregate” problem by any kind of value or policy iteration method (including simulation-based methods)

−   Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem

•   Hard aggregation example: Aggregate states are subsets of original system states, treated as if they all have the same cost.


AGGREGATION/DISAGGREGATION PROBS

•   The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices:

•   For each original system state j and aggregate state y, the aggregation probability φjy

−   Roughly, the "degree of membership of j in the aggregate state y."

−   In hard aggregation, φjy = 1 if state j belongs to aggregate state/subset y.

•   For each aggregate state x and original system state i, the disaggregation probability dxi

−   Roughly, the "degree to which i is representative of x."

−   In hard aggregation, one possibility is that all states i that belong to aggregate state/subset x have equal dxi.

[Figure: original system states i, j with transition probabilities pij(u); aggregate states x, y related to them through disaggregation probabilities dxi and aggregation probabilities φjy, giving

ˆpxy(u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) φjy ]


AGGREGATE TRANSITION PROBABILITIES

•   The transition probability from aggregate state x to aggregate state y under control u:

ˆpxy(u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) φjy,   or   P̂(u) = D P(u) Φ

where the rows of D and Φ are the disaggregation and aggregation probabilities.

•   The expected transition cost is

ĝ(x, u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) g(i, u, j),   or   ĝ = DPg

•   The optimal cost function of the aggregate problem, denoted R, is

R(x) = min_{u∈U} [ ĝ(x, u) + α Σ_{y} ˆpxy(u) R(y) ],   ∀ x

or R = min_u [ĝ + α P̂ R] - Bellman's equation for the aggregate problem.

•   The optimal cost function J* of the original problem is approximated by J̃ given by

J̃(j) = Σ_{y} φjy R(y),   ∀ j
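
•   A minimal Python sketch of forming P̂(u) = DP(u)Φ and ĝ, solving the aggregate Bellman equation by value iteration, and recovering J̃ = ΦR; the states, controls, probabilities, and costs are made-up illustrative data:

import numpy as np

def aggregate_problem(P, g, D, Phi):
    """Form P_hat(u) = D P(u) Phi and g_hat(x,u) = sum_i d_xi sum_j p_ij(u) g(i,u,j).

    P: dict u -> (n x n) transition matrix of the original system
    g: dict u -> (n x n) matrix of costs g(i, u, j)
    D: (m x n) disaggregation probs, Phi: (n x m) aggregation probs
    """
    P_hat = {u: D @ P[u] @ Phi for u in P}
    g_hat = {u: D @ np.sum(P[u] * g[u], axis=1) for u in P}   # m-vector per control
    return P_hat, g_hat

def solve_aggregate_bellman(P_hat, g_hat, alpha, iters=1000):
    """Value iteration on R(x) = min_u [ g_hat(x,u) + alpha * sum_y p_hat_xy(u) R(y) ]."""
    m = len(next(iter(g_hat.values())))
    R = np.zeros(m)
    for _ in range(iters):
        R = np.min([g_hat[u] + alpha * P_hat[u] @ R for u in P_hat], axis=0)
    return R

# Made-up 4-state, 2-control example with 2 aggregate states (hard aggregation)
rng = np.random.default_rng(0)
n, alpha = 4, 0.9
P = {u: rng.dirichlet(np.ones(n), size=n) for u in range(2)}
g = {u: rng.uniform(0, 1, size=(n, n)) for u in range(2)}
Phi = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
D = Phi.T / Phi.sum(axis=0)[:, None]          # uniform disaggregation within each subset
P_hat, g_hat = aggregate_problem(P, g, D, Phi)
R = solve_aggregate_bellman(P_hat, g_hat, alpha)
J_approx = Phi @ R                            # approximation of J* on the original states
print(R, J_approx)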


EXAMPLE I: HARD AGGREGATION

•   Group the original system states into subsets, and view each subset as an aggregate state

•   Aggregation probs.: φjy = 1 if j belongs to aggregate state y.

[Figure: 3 × 3 grid of original states 1-9, partitioned into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}, with aggregation matrix]

Φ =
[1 0 0 0]
[1 0 0 0]
[0 1 0 0]
[1 0 0 0]
[1 0 0 0]
[0 1 0 0]
[0 0 1 0]
[0 0 1 0]
[0 0 0 1]

•   Disaggregation probs.: There are many possibilities, e.g., all states i within aggregate state x have equal prob. dxi.

•   If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. Suggests grouping states with "roughly equal" cost into aggregates.

•   Soft aggregation (provides "soft boundaries" between aggregate states).
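
•   A small Python sketch of building Φ and one choice of D for the hard-aggregation partition pictured above (the 3 × 3 grid with x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}):

import numpy as np

# Partition of the 9 grid states into aggregate states, as in the figure above
groups = [[1, 2, 4, 5], [3, 6], [7, 8], [9]]
n, m = 9, len(groups)

Phi = np.zeros((n, m))      # aggregation probs: phi_jy = 1 if state j is in group y
for y, states in enumerate(groups):
    for j in states:
        Phi[j - 1, y] = 1.0

D = np.zeros((m, n))        # one choice of disaggregation probs: uniform within each group
for x, states in enumerate(groups):
    for i in states:
        D[x, i - 1] = 1.0 / len(states)

print(Phi)
print(D)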


EXAMPLE II: FEATURE-BASED AGGREGATION

•   If we know good features, it makes sense to group together states that have "similar features"

•   Essentially discretize the features and assign a weight to each discretization point

[Figure: states → feature extraction → features → aggregate states]

•   A general approach for passing from a feature-based state representation to an aggregation-based architecture

•   The aggregation-based architecture is more powerful (nonlinear in the features)

•   ... but may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture


EXAMPLE III: REP. STATES/COARSE GRID

•   Choose a collection of "representative" original system states, and associate each one of them with an aggregate state

[Figure: original state space with representative/aggregate states x, y1, y2, y3 and original system states j1, j2, j3]

•   Disaggregation probabilities are dxi = 1 if i is equal to representative state x.

•   Aggregation probabilities associate original system states with convex combinations of representative states:

j ∼ Σ_{y∈A} φjy y

•   Well-suited for Euclidean space discretization

•   Extends nicely to continuous state space, including the belief space of POMDP


APPROXIMATE PI BY AGGREGATION

•   Consider approximate policy iteration for the original problem, with policy evaluation done by aggregation.

•   Evaluation of policy µ: J̃ = ΦR, where R = DTµ(ΦR) (R is the vector of costs of the aggregate states corresponding to µ). Can be done by simulation.

•   Similar form to the projected equation ΦR = ΠTµ(ΦR) (ΦD in place of Π).

•   Advantages: It has no problem with exploration or with oscillations.

•   Disadvantage: The rows of D and Φ must be probability distributions.

[Figure: aggregation/disaggregation diagram as on the earlier slide, with ˆpxy(u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) φjy ]
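
•   A minimal Python sketch of this aggregation-based policy evaluation, R = DTµ(ΦR), for a discounted problem, solved here directly as a linear system; the matrices are illustrative placeholders, not from the slides:

import numpy as np

def evaluate_policy_by_aggregation(P_mu, g_mu, D, Phi, alpha):
    """Solve R = D T_mu(Phi R) = D (g_mu + alpha * P_mu @ Phi @ R) for R,
    then return the approximation J_tilde = Phi @ R of the cost of mu."""
    m = D.shape[0]
    R = np.linalg.solve(np.eye(m) - alpha * D @ P_mu @ Phi, D @ g_mu)
    return R, Phi @ R

# Illustrative data: 4 states, 2 aggregate states (hard aggregation), one policy mu
P_mu = np.array([[0.7, 0.3, 0.0, 0.0],
                 [0.1, 0.6, 0.3, 0.0],
                 [0.0, 0.2, 0.5, 0.3],
                 [0.2, 0.0, 0.0, 0.8]])
g_mu = np.array([1.0, 2.0, 0.5, 3.0])        # expected one-stage costs under mu
Phi = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
D = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
R, J_tilde = evaluate_policy_by_aggregation(P_mu, g_mu, D, Phi, alpha=0.9)
print(R, J_tilde)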


Q-LEARNING I

•   Q-learning has two motivations:

−   Dealing with multiple policies simultaneously

−   Using a model-free approach [no need to know pij(u), only be able to simulate them]

•   The Q-factors are defined by

Q*(i, u) = Σ_{j=1..n} pij(u) ( g(i, u, j) + αJ*(j) ),   ∀ (i, u)

•   Since J* = TJ*, we have J*(i) = min_{u∈U(i)} Q*(i, u), so the Q-factors solve the equation

Q*(i, u) = Σ_{j=1..n} pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q*(j, u′) )

•   Q*(i, u) can be shown to be the unique solution of this equation. Reason: This is Bellman's equation for a system whose states are the original states 1, . . . , n, together with all the pairs (i, u).

•   Value iteration: For all (i, u)

Q(i, u) := Σ_{j=1..n} pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q(j, u′) )
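
•   A minimal Python sketch of this value iteration on Q-factors (model-based; the 3-state, 2-control data are made up for illustration):

import numpy as np

def q_value_iteration(P, g, alpha, iters=1000):
    """Q(i,u) := sum_j p_ij(u) [ g(i,u,j) + alpha * min_{u'} Q(j,u') ].

    P[u]: (n x n) transition matrix, g[u]: (n x n) costs g(i, u, j).
    """
    n, controls = P[0].shape[0], sorted(P)
    Q = np.zeros((n, len(controls)))
    for _ in range(iters):
        J = Q.min(axis=1)                    # J(j) = min_{u'} Q(j, u')
        Q = np.column_stack([(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in controls])
    return Q

# Illustrative 3-state, 2-control data
rng = np.random.default_rng(1)
P = {u: rng.dirichlet(np.ones(3), size=3) for u in (0, 1)}
g = {u: rng.uniform(0, 1, size=(3, 3)) for u in (0, 1)}
Q = q_value_iteration(P, g, alpha=0.9)
print(Q, Q.min(axis=1))   # Q-factors and J*(i) = min_u Q*(i, u)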


Q-LEARNING II

•   Use any probabilistic mechanism to select the sequence of pairs (ik, uk) [all pairs (i, u) are chosen infinitely often], and for each k, select jk according to pikj(uk).

•   At each k, the Q-learning algorithm updates Q(ik, uk) according to

Q(ik, uk) := (1 − γk(ik, uk)) Q(ik, uk) + γk(ik, uk) ( g(ik, uk, jk) + α min_{u′∈U(jk)} Q(jk, u′) )

•   The stepsize γk(ik, uk) must converge to 0 at a proper rate (e.g., like 1/k).
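
•   A minimal Python sketch of this tabular Q-learning iteration, with a 1/(visit count) stepsize; the simulator below is a made-up stand-in for the ability to sample transitions according to pij(u):

import numpy as np

def q_learning(sample, n, num_controls, alpha, num_iters=200_000, seed=0):
    """Tabular Q-learning: Q(i,u) := (1-gamma)Q(i,u) + gamma[g + alpha*min_u' Q(j,u')].

    sample(i, u) must return a pair (j, cost) drawn according to p_ij(u), g(i,u,j).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, num_controls))
    visits = np.zeros((n, num_controls))
    for _ in range(num_iters):
        i, u = rng.integers(n), rng.integers(num_controls)  # every pair chosen infinitely often
        j, cost = sample(i, u)
        visits[i, u] += 1
        gamma = 1.0 / visits[i, u]                          # stepsize -> 0 like 1/k
        Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (cost + alpha * Q[j].min())
    return Q

# Made-up 3-state, 2-control simulator
rng = np.random.default_rng(3)
P = {u: rng.dirichlet(np.ones(3), size=3) for u in (0, 1)}
G = {u: rng.uniform(0, 1, size=(3, 3)) for u in (0, 1)}
def sample(i, u, rng=np.random.default_rng(4)):
    j = rng.choice(3, p=P[u][i])
    return j, G[u][i, j]

Q = q_learning(sample, n=3, num_controls=2, alpha=0.9)
print(Q)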

•   Important mathematical point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the ordinary cost version of Bellman's equation:

J*(i) = min_{u∈U(i)} Σ_{j=1..n} pij(u) ( g(i, u, j) + αJ*(j) )

•   Q-learning can be shown to converge to the true/exact Q-factors (a sophisticated proof).

•   Major drawback: The large number of pairs (i, u) - no function approximation is used.


Q-FACTOR APPROXIMATIONS

•   Introduce basis function approximation for Q-factors:

Q̃(i, u, r) = φ(i, u)′r

•   We can use approximate policy iteration and LSPE/LSTD/TD for policy evaluation (the exploration issue is acute)

•   Optimistic policy iteration methods are frequently used on a heuristic basis.

•   Example: Generate an infinite trajectory {(ik, uk) | k = 0, 1, . . .}

•   At iteration k, given rk and state/control (ik, uk):

(1) Simulate the next transition (ik, ik+1) using the transition probabilities pikj(uk).

(2) Generate the control uk+1 from

uk+1 = arg min_{u∈U(ik+1)} Q̃(ik+1, u, rk)

(3) Update the parameter vector via

rk+1 = rk − (LSPE or TD-like correction)

•   Unclear validity. There is a solid basis for an important but very special case: optimal stopping (see the on-line chapter)
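
•   A rough Python sketch of steps (1)-(3) with a TD(0)-like correction (essentially SARSA with linear Q-factor features); the feature map, stepsize, and simulator are illustrative assumptions, and, as noted above, the scheme has no general convergence guarantee:

import numpy as np

def optimistic_q_iteration(sample, phi, controls, alpha, num_steps=100_000,
                           stepsize=0.01, seed=0):
    """Greedy trajectory-based update of r in Q_tilde(i,u,r) = phi(i,u)' r,
    using a TD(0)-like correction (one of the 'LSPE or TD-like' options)."""
    rng = np.random.default_rng(seed)
    r = np.zeros(len(phi(0, controls[0])))
    i, u = 0, controls[0]
    for _ in range(num_steps):
        j, cost = sample(i, u)                                # step (1)
        u_next = min(controls, key=lambda v: phi(j, v) @ r)   # step (2): greedy control
        td = cost + alpha * (phi(j, u_next) @ r) - phi(i, u) @ r
        r = r + stepsize * td * phi(i, u)                     # step (3): TD-like correction
        i, u = j, u_next
    return r

# Illustrative usage with a one-hot feature map (reduces to a tabular Q-factor)
rng = np.random.default_rng(5)
P = {u: rng.dirichlet(np.ones(3), size=3) for u in (0, 1)}
G = {u: rng.uniform(0, 1, size=(3, 3)) for u in (0, 1)}
def sample(i, u, rng=np.random.default_rng(6)):
    j = rng.choice(3, p=P[u][i])
    return j, G[u][i, j]
def phi(i, u):
    f = np.zeros(6)
    f[2 * i + u] = 1.0
    return f

r = optimistic_q_iteration(sample, phi, controls=[0, 1], alpha=0.9)
print(r)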


6.231 DYNAMIC PROGRAMMING

LECTURE 23

LECTURE OUTLINE

•   LSTD-like methods - Use to enhance exploration

•   Additional topics in ADP

•   Stochastic shortest path problems

•   Average cost problems

•   Nonlinear versions of the projected equation

•   Extension of Q-learning for optimal stopping

•   Basis function adaptation

•  Gradient-based approximation in policy space


REVIEW: PROJECTED BELLMAN EQUATION

•   For a fixed policy µ to be evaluated, the solution of Bellman's equation J = TJ is approximated by the solution of

Φr = ΠT (Φr)

whose solution is in turn obtained using a simulation-based method such as LSPE(λ), LSTD(λ), or TD(λ).

[Figure: subspace S spanned by the basis functions; T(Φr) is projected onto S, yielding Φr = ΠT(Φr). Indirect method: solving a projected form of Bellman's equation]

•   These ideas apply to other (linear) Bellman equations, e.g., for SSP and average cost.

•   Key Issue: Construct a framework where ΠT [or at least ΠT(λ)] is a contraction.


STOCHASTIC SHORTEST PATHS

•   Introduce the approximation subspace

S = {Φr | r ∈ ℜs}

and for a given proper policy, Bellman's equation and its projected version

J = TJ = g + PJ,   Φr = ΠT(Φr)

Also its λ-version

Φr = ΠT(λ)(Φr),   T(λ) = (1 − λ) Σ_{t=0..∞} λ^t T^{t+1}

•   Question: What should be the norm of projection?

•   Speculation based on the discounted case: It should be a weighted Euclidean norm with weight vector ξ = (ξ1, . . . , ξn), where ξi should be some type of long-term occupancy probability of state i (which can be generated by simulation).

•   But what does "long-term occupancy probability of a state" mean in the SSP context?

•   How do we generate infinite length trajectories given that termination occurs with prob. 1?


SIMULATION TRAJECTORIES FOR SSP

•   We envision simulation of trajectories up to termination, followed by restart at state i with some fixed probabilities q0(i) > 0.

•   Then the "long-term occupancy probability" of state i is proportional to

q(i) = Σ_{t=0..∞} qt(i),   i = 1, . . . , n,

where

qt(i) = P(it = i),   i = 1, . . . , n,   t = 0, 1, . . .

•   We use the projection norm

‖J‖q = ( Σ_{i=1..n} q(i) (J(i))² )^{1/2}

[Note that 0 < q(i) < ∞, but q is not a prob. distribution.]

•   We can show that ΠT(λ) is a contraction with respect to ‖ · ‖q (see the next slide).

•   LSTD(λ), LSPE(λ), and TD(λ) are possible.
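
•   A minimal Python sketch of estimating the weights q(i) by simulating trajectories to termination and restarting according to q0; the SSP data below (a continuation matrix plus per-state termination probabilities) are made up for illustration:

import numpy as np

def estimate_q_weights(P, term_prob, q0, num_restarts=20_000, seed=0):
    """Accumulate visit counts over trajectories that terminate w.p. 1 and restart
    according to q0; q(i) is the average total number of visits per trajectory."""
    rng = np.random.default_rng(seed)
    n = len(q0)
    counts = np.zeros(n)
    for _ in range(num_restarts):
        i = rng.choice(n, p=q0)                   # restart
        while True:
            counts[i] += 1
            if rng.random() < term_prob[i]:       # move to the termination state
                break
            i = rng.choice(n, p=P[i])             # move among nontermination states
    return counts / num_restarts                  # estimates q(i) = sum_t P(i_t = i)

# Made-up proper-policy SSP with 3 nontermination states; P is the transition
# matrix conditioned on not terminating at the current step
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.3, 0.3]])
term_prob = np.array([0.1, 0.2, 0.3])
q0 = np.array([0.4, 0.3, 0.3])
q = estimate_q_weights(P, term_prob, q0)
print(q)      # projection weights; note q is not a probability distribution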


CONTRACTION PROPERTY FOR SSP

•   We have q = Σ_{t=0..∞} qt, so

q′P = Σ_{t=0..∞} qt′P = Σ_{t=1..∞} qt′ = q′ − q0′

or

Σ_{i=1..n} q(i) pij = q(j) − q0(j),   ∀ j

•   To verify that ΠT is a contraction, we show that there exists β < 1 such that ‖Pz‖²q ≤ β‖z‖²q for all z ∈ ℜn.

•   For all z ∈ ℜn, we have

‖Pz‖²q = Σ_{i=1..n} q(i) ( Σ_{j=1..n} pij zj )²

≤ Σ_{i=1..n} q(i) Σ_{j=1..n} pij zj²

= Σ_{j=1..n} zj² Σ_{i=1..n} q(i) pij

= Σ_{j=1..n} ( q(j) − q0(j) ) zj²

= ‖z‖²q − ‖z‖²q0 ≤ β‖z‖²q

where

β = 1 − min_{j} q0(j) / q(j)


AVERAGE COST PROBLEMS

•   Consider a single policy to be evaluated, with a single recurrent class, no transient states, and steady-state probability vector ξ = (ξ1, . . . , ξn).

•   The average cost, denoted by η, is independent of the initial state:

η = lim_{N→∞} (1/N) E{ Σ_{k=0..N−1} g(xk, xk+1) | x0 = i },   ∀ i

•   Bellman's equation is J = FJ with

FJ = g − ηe + PJ

where e is the unit vector e = (1, . . . , 1).

•   The projected equation and its λ-version are

Φr = ΠF(Φr),   Φr = ΠF(λ)(Φr)

•   A problem here is that F is not a contraction with respect to any norm (since e = Pe).

•   However, ΠF(λ) turns out to be a contraction with respect to ‖ · ‖ξ assuming that e does not belong to S and λ > 0 [the case λ = 0 is exceptional, but can be handled - see the text].

•   LSTD(λ), LSPE(λ), and TD(λ) are possible.


GENERALIZATION/UNIFICATION

•   Consider approximate solution of x = T(x), where

T(x) = Ax + b,   A is n × n,   b ∈ ℜn

by solving the projected equation y = ΠT(y), where Π is projection on a subspace of basis functions (with respect to some Euclidean norm).

•   We will generalize from DP to the case where A is arbitrary, subject only to

I − ΠA : invertible

•   Benefits of generalization:

−   Unification/higher perspective for TD methods in approximate DP

−   An extension to a broad new area of applications, where a DP perspective may be helpful

•   Challenge: Dealing with less structure

− Lack of contraction

−  Absence of a Markov chain


GENERALIZED PROJECTED EQUATION

•   Let Π be projection with respect to

‖x‖ξ = ( Σ_{i=1..n} ξi xi² )^{1/2}

where ξ ∈ ℜn is a probability distribution with positive components.

•   If r* is the solution of the projected equation, we have Φr* = Π(AΦr* + b) or

r* = arg min_{r∈ℜs} Σ_{i=1..n} ξi ( φ(i)′r − Σ_{j=1..n} aij φ(j)′r* − bi )²

where φ(i)′ denotes the ith row of the matrix Φ.

•   Optimality condition/equivalent form:

Σ_{i=1..n} ξi φ(i) ( φ(i) − Σ_{j=1..n} aij φ(j) )′ r* = Σ_{i=1..n} ξi φ(i) bi

•   The two expected values can be approximated by simulation.
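
•   A minimal Python sketch of solving this optimality condition exactly in matrix form, C r* = d with C = Φ′Ξ(I − A)Φ and d = Φ′Ξb (the simulation-based version appears on a later slide); the data are arbitrary illustrative values, with A scaled small so that the equation is well posed:

import numpy as np

def projected_equation_solution(A, b, Phi, xi):
    """Solve Phi r* = Pi(A Phi r* + b), Pi = xi-weighted projection on range(Phi)."""
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(xi)) - A) @ Phi
    d = Phi.T @ Xi @ b
    return np.linalg.solve(C, d)

# Illustrative data: n = 5, s = 2 basis functions
rng = np.random.default_rng(7)
n, s = 5, 2
A = 0.5 * rng.dirichlet(np.ones(n), size=n)   # each row sums to 0.5: sup-norm contraction
b = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))
xi = np.full(n, 1.0 / n)
r_star = projected_equation_solution(A, b, Phi, xi)
print(Phi @ r_star)      # approximation of the solution of x = Ax + b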


SIMULATION MECHANISM

[Figure: row sampling generates the state sequence i0, i1, . . . , ik, ik+1, . . . according to ξ; column sampling generates the pairs (i0, j0), (i1, j1), . . . , (ik, jk), . . . according to P]

•   Row sampling: Generate a sequence {i0, i1, . . .} according to ξ, i.e., the relative frequency of each row i is ξi

•   Column sampling: Generate

(i0, j0), (i1, j1), . . .

according to some transition probability matrix P with

pij > 0   if   aij ≠ 0,

i.e., for each i, the relative frequency of (i, j) is pij

•   Row sampling may be done using a Markov chain with transition matrix Q (unrelated to P)

•   Row sampling may also be done without a Markov chain - just sample rows according to some known distribution ξ (e.g., a uniform)


ROW AND COLUMN SAMPLING

[Figure: row sampling of i0, i1, . . . , ik, ik+1, . . . according to ξ (may use a Markov chain Q); column sampling of (i0, j0), (i1, j1), . . . , (ik, jk), . . . according to a Markov chain P ∼ |A|]

•   Row sampling ∼ State Sequence Generation in DP. Affects:

−   The projection norm.

−   Whether ΠA is a contraction.

•   Column sampling ∼ Transition Sequence Generation in DP.

−   Can be totally unrelated to row sampling. Affects the sampling/simulation error.

−   "Matching" P with |A| is beneficial (has an effect like in importance sampling).

•   Independent row and column sampling allows exploration at will! Resolves the exploration problem that is critical in approximate policy iteration.


LSTD-LIKE METHOD

•   Optimality condition/equivalent form of the projected equation:

Σ_{i=1..n} ξi φ(i) ( φ(i) − Σ_{j=1..n} aij φ(j) )′ r* = Σ_{i=1..n} ξi φ(i) bi

•   The two expected values are approximated by row and column sampling (batch 0 → t).

•   We solve the linear equation

Σ_{k=0..t} φ(ik) ( φ(ik) − (aik jk / pik jk) φ(jk) )′ rt = Σ_{k=0..t} φ(ik) bik

•   We have rt → r*, regardless of ΠA being a contraction (by the law of large numbers; see the next slide).

•   An LSPE-like method is also possible, but requires that ΠA is a contraction.

•   Under the assumption Σ_{j=1..n} |aij| ≤ 1 for all i, there are conditions that guarantee contraction of ΠA; see the paper by Bertsekas and Yu, "Projected Equation Methods for Approximate Solution of Large Linear Systems," 2009.
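
•   A minimal Python sketch of the row/column sampling estimate and the resulting linear solve; the data (A, b, Φ, ξ, P) are illustrative, and the exact matrix solution sketched earlier can be used as a check:

import numpy as np

def lstd_like_solve(A, b, Phi, xi, P, t=100_000, seed=0):
    """Estimate C r = d from samples: rows i_k ~ xi, columns j_k ~ P[i_k, :]."""
    rng = np.random.default_rng(seed)
    n, s = Phi.shape
    C = np.zeros((s, s))
    d = np.zeros(s)
    for _ in range(t):
        i = rng.choice(n, p=xi)                   # row sampling
        j = rng.choice(n, p=P[i])                 # column sampling
        C += np.outer(Phi[i], Phi[i] - (A[i, j] / P[i, j]) * Phi[j])
        d += Phi[i] * b[i]
    return np.linalg.solve(C, d)

# Illustrative data; P has p_ij > 0 wherever a_ij != 0 (here P is uniform)
rng = np.random.default_rng(8)
n, s = 5, 2
A = 0.5 * rng.dirichlet(np.ones(n), size=n)
b = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))
xi = np.full(n, 1.0 / n)
P = np.full((n, n), 1.0 / n)
r_t = lstd_like_solve(A, b, Phi, xi, P)
print(r_t)    # approaches r* as t grows, whether or not Pi*A is a contraction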


JUSTIFICATION W/ LAW OF LARGE NUMBERS

•   We will match terms in the exact optimality condition and the simulation-based version.

•   Let ξti be the relative frequency of i in row sampling up to time t.

•   We have

(1/(t+1)) Σ_{k=0..t} φ(ik) φ(ik)′ = Σ_{i=1..n} ξti φ(i) φ(i)′ ≈ Σ_{i=1..n} ξi φ(i) φ(i)′

(1/(t+1)) Σ_{k=0..t} φ(ik) bik = Σ_{i=1..n} ξti φ(i) bi ≈ Σ_{i=1..n} ξi φ(i) bi

•   Let ˆptij be the relative frequency of (i, j) in column sampling up to time t.

(1/(t+1)) Σ_{k=0..t} (aik jk / pik jk) φ(ik) φ(jk)′ = Σ_{i=1..n} ξti Σ_{j=1..n} (ˆptij aij / pij) φ(i) φ(j)′ ≈ Σ_{i=1..n} ξi Σ_{j=1..n} aij φ(i) φ(j)′


LSPE FOR OPTIMAL STOPPING EXTENDED

•   Consider a system of the form

x = T(x) = Af(x) + b,

where f : ℜn → ℜn is a mapping with scalar components of the form f(x) = (f1(x1), . . . , fn(xn)).

•   Assume that each fi : ℜ → ℜ is nonexpansive:

|fi(xi) − fi(x̄i)| ≤ |xi − x̄i|,   ∀ i, xi, x̄i ∈ ℜ

This guarantees that T is a contraction with respect to any weighted Euclidean norm ‖ · ‖ξ whenever A is a contraction with respect to that norm.

•   Algorithms similar to LSPE [approximating Φrk+1 = ΠT(Φrk)] are then possible.

•   Special case: In the optimal stopping problem of Section 6.4, x is the Q-factor corresponding to the continuation action, α ∈ (0, 1) is a discount factor, fi(xi) = min{ci, xi}, and A = αP, where P is the transition matrix for continuing.

•   If Σ_{j=1..n} pij < 1 for some state i, and 0 ≤ P ≤ Q, where Q is an irreducible transition matrix, then Π((1 − γ)I + γT) is a contraction with respect to ‖ · ‖ξ for all γ ∈ (0, 1), even with α = 1.
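
•   A minimal Python sketch of the deterministic analog of LSPE for this optimal stopping system, iterating Φrk+1 = ΠT(Φrk) with exact projection (no simulation); the data are made up, and ξ is taken as the steady-state distribution of P so that ΠT is a ‖ · ‖ξ-contraction:

import numpy as np

def projected_iteration_stopping(P, b, c, Phi, xi, alpha, iters=1000):
    """Iterate Phi r_{k+1} = Pi T(Phi r_k) for T(x) = alpha * P @ min(c, x) + b,
    with Pi the xi-weighted projection onto range(Phi)."""
    Xi = np.diag(xi)
    M = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # r = M x gives Pi x = Phi r
    r = np.zeros(Phi.shape[1])
    for _ in range(iters):
        Tx = alpha * P @ np.minimum(c, Phi @ r) + b     # f_i(x_i) = min{c_i, x_i}
        r = M @ Tx
    return r

# Illustrative data: b = expected cost of continuing, c = cost of stopping
rng = np.random.default_rng(9)
n, alpha = 6, 0.95
P = rng.dirichlet(np.ones(n), size=n)      # transition matrix for continuing
b, c = rng.uniform(0, 1, n), rng.uniform(0, 5, n)
Phi = np.column_stack([np.ones(n), rng.standard_normal(n)])
xi = np.full(n, 1.0 / n)
for _ in range(1000):
    xi = xi @ P                            # steady-state distribution of P, so that
                                           # alpha*P (hence Pi T) contracts in ||.||_xi
r = projected_iteration_stopping(P, b, c, Phi, xi, alpha)
print(Phi @ r)     # approximate Q-factor of the continuation action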


BASIS FUNCTION ADAPTATION I

•   An important issue in ADP is how to select basis functions.

•   A possible approach is to introduce basis functions that are parametrized by a vector θ, and optimize over θ, i.e., solve the problem

min_{θ∈Θ} F( J̃(θ) )

where J̃(θ) is the solution of the projected equation.

•   One example is

F( J̃(θ) ) = ‖ J̃(θ) − T( J̃(θ) ) ‖²

•   Another example is

F( J̃(θ) ) = Σ_{i∈I} | J(i) − J̃(θ)(i) |²,

where I is a subset of states, and J(i), i ∈ I, are the costs of the policy at these states calculated directly by simulation.


BASIS FUNCTION ADAPTATION II

•   Some algorithm may be used to minimize F( J̃(θ) ) over θ.

•   A challenge here is that the algorithm should use low-dimensional calculations.

•   One possibility is to use a form of random search (the cross-entropy method); see the paper by Menache, Mannor, and Shimkin (Annals of Oper. Res., Vol. 134, 2005)

•   Another possibility is to use a gradient method. For this it is necessary to estimate the partial derivatives of J̃(θ) with respect to the components of θ.

•   It turns out that by differentiating the projected equation, these partial derivatives can be calculated using low-dimensional operations. See the paper by Menache, Mannor, and Shimkin, and the paper by Yu and Bertsekas (2009).


APPROXIMATION IN POLICY SPACE I

•   Consider an average cost problem, where the problem data are parametrized by a vector r, i.e., a cost vector g(r) and a transition probability matrix P(r). Let η(r) be the (scalar) average cost per stage, satisfying Bellman's equation

η(r)e + h(r) = g(r) + P(r)h(r)

where h(r) is the corresponding differential cost vector.

•   Consider minimizing η(r) over r (here the data dependence on control is encoded in the parametrization). We can try to solve the problem by nonlinear programming/gradient descent methods.

•   Important fact: If ∆η is the change in η due to a small change ∆r from a given r, we have

∆η = ξ′(∆g + ∆P h),

where ξ is the steady-state probability distribution/vector corresponding to P(r), and all the quantities above are evaluated at r:

∆η = η(r + ∆r) − η(r),

∆g = g(r + ∆r) − g(r),   ∆P = P(r + ∆r) − P(r)
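
•   A small Python sketch that checks ∆η ≈ ξ′(∆g + ∆P h) numerically on a made-up parametrized 3-state chain (the mixture parametrization below is arbitrary, purely for illustration):

import numpy as np

# Made-up parametrized average cost problem on 3 states
P0 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])
P1 = np.array([[0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]])
g0, g1 = np.array([1.0, 2.0, 0.5]), np.array([0.3, -0.2, 0.8])
e = np.ones(3)

def P(r): return (1 - r) * P0 + r * P1      # r in (0, 1): mixture parametrization
def g(r): return g0 + r * g1

def eta_h_xi(r):
    """Average cost, differential cost (normalized so xi'h = 0), steady-state dist."""
    xi = np.full(3, 1 / 3)
    for _ in range(2000):
        xi = xi @ P(r)                      # steady-state distribution of P(r)
    eta = xi @ g(r)
    # Bellman: eta*e + h = g + P h, solved with the normalization xi'h = 0
    h = np.linalg.solve(np.eye(3) - P(r) + np.outer(e, xi), g(r) - eta * e)
    return eta, h, xi

r, dr = 0.4, 1e-4
eta, h, xi = eta_h_xi(r)
eta2, _, _ = eta_h_xi(r + dr)
dg, dP = g(r + dr) - g(r), P(r + dr) - P(r)
print(eta2 - eta, xi @ (dg + dP @ h))       # the two sides agree to first order in dr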


APPROXIMATION IN POLICY SPACE II

•   Proof of the gradient formula: We have, by "differentiating" Bellman's equation,

∆η(r)·e + ∆h(r) = ∆g(r) + ∆P(r)h(r) + P(r)∆h(r)

By left-multiplying with ξ′,

ξ′∆η(r)·e + ξ′∆h(r) = ξ′( ∆g(r) + ∆P(r)h(r) ) + ξ′P(r)∆h(r)

Since ξ′∆η(r)·e = ∆η(r) and ξ′ = ξ′P(r), this equation simplifies to

∆η = ξ′(∆g + ∆P h)

•   Since we don't know ξ, we cannot implement a gradient-like method for minimizing η(r). An alternative is to use "sampled gradients", i.e., generate a simulation trajectory (i0, i1, . . .), and change r once in a while, in the direction of a simulation-based estimate of ξ′(∆g + ∆P h).

•   Important Fact: ∆η can be viewed as an expected value!

•   There is much research on this subject; see e.g., the work of Marbach and Tsitsiklis, and Konda and Tsitsiklis, and the refs given there.


6.231 DYNAMIC PROGRAMMING

OVERVIEW-EPILOGUE

LECTURE OUTLINE

•   Finite horizon problems

−  Deterministic vs Stochastic

−  Perfect vs Imperfect State Info

•   Infinite horizon problems

−  Stochastic shortest path problems

−  Discounted problems

−  Average cost problems


FINITE HORIZON PROBLEMS - ANALYSIS

•   Perfect state info

−   A general formulation - Basic problem, DP algorithm

−   A few nice problems admit analytical solution

•   Imperfect state info

−   Sufficient statistics - Reduction to perfect state info

−   Very few nice problems admit analytical solution

−   Finite-state problems admit reformulation as perfect state info problems whose states are prob. distributions (the belief vectors)


FINITE HORIZON PROBS - EXACT COMP. SOL

•   Deterministic finite-state problems

−   Equivalent to shortest path

−   A wealth of fast algorithms

−   Hard combinatorial problems are a special case (but # of states grows exponentially)

•   Stochastic perfect state info problems

−  The DP algorithm is the only choice

−  Curse of dimensionality is the big bottleneck

•   Imperfect state info problems

−  Forget it!

−  Only trivial examples admit an exact computational solution


FINITE HORIZON PROBS - APPROX. SOL.

•  Many techniques (and combinations thereof) to choose from

•   Simplification approaches

−   Certainty equivalence

−  Problem simplification

−  Rolling horizon

−  Aggregation - Coarse grid discretization

•  Limited lookahead combined with:

−   Rollout

−  MPC (an important special case)

−  Feature-based cost function approximation

•   Approximation in policy space

−   Gradient methods

−  Random search


INFINITE HORIZON PROBLEMS - ANALYSIS

•   A more extensive theory

•   Bellman's equation

•   Optimality conditions

•   Contraction mappings

•   A few nice problems admit analytical solution

•   Idiosyncrasies of average cost problems

•   Idiosyncrasies of problems with no underlying contraction

•   Elegant analysis


INF. HORIZON PROBS - EXACT COMP. SOL.

•   Value iteration

−   Variations (asynchronous, modified, etc)

•   Policy iteration

−  Variations (asynchronous, based on value iteration, etc)

•   Linear programming

•   Elegant algorithmic analysis

•   Curse of dimensionality is a major bottleneck


INFINITE HORIZON PROBS - ADP