
LECTURE SLIDES - DYNAMIC PROGRAMMING

BASED ON LECTURES GIVEN AT THE

MASSACHUSETTS INST. OF TECHNOLOGY

CAMBRIDGE, MASS

FALL 2011

DIMITRI P. BERTSEKAS

These lecture slides are based on the book:

“Dynamic Programming and Optimal Control: 3rd edition,” Vols. 1 and 2, Athena Scientific, 2007, by Dimitri P. Bertsekas; see

http://www.athenasc.com/dpbook.html


6.231 DYNAMIC PROGRAMMING

LECTURE 1

LECTURE OUTLINE

•   Problem Formulation

•   Examples

•   The Basic Problem

•   Significance of Feedback


DP AS AN OPTIMIZATION METHODOLOGY

•   Generic optimization problem:

min_{u∈U} g(u)

where u is the optimization/decision variable, g(u) is the cost function, and U is the constraint set

•   Categories of problems:

−   Discrete (U is finite) or continuous

−   Linear (g is linear and U is polyhedral) or nonlinear

−   Stochastic or deterministic: In stochastic problems the cost involves a stochastic parameter w, which is averaged, i.e., it has the form

g(u) = E_w{ G(u, w) }

where w is a random parameter.

•   DP can deal with complex stochastic problems where information about w becomes available in stages, and the decisions are also made in stages and make use of this information.


BASIC STRUCTURE OF STOCHASTIC DP

•   Discrete-time system

xk+1 = fk(xk, uk, wk),   k = 0, 1, . . . , N − 1

−   k: Discrete time

−   xk: State; summarizes past information that is relevant for future optimization

−   uk: Control; decision to be selected at time k from a given set

−   wk: Random parameter (also called disturbance or noise depending on the context)

−   N: Horizon or number of times control is applied

•   Cost function that is additive over time

gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk)

•  Alternative system description: P(xk+1 | xk, uk)

xk+1 = wk   with   P(wk | xk, uk) = P(xk+1 | xk, uk)


INVENTORY CONTROL EXAMPLE

[Figure: inventory system block diagram; stock xk at period k, order uk, demand wk, next stock xk+1 = xk + uk − wk; cost of period k: c·uk + r(xk + uk − wk)]

•   Discrete-time system

xk+1  = f k(xk, uk, wk) = xk + uk − wk

•   Cost function that is additive over time

gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk, wk) = E{ Σ_{k=0}^{N−1} ( c·uk + r(xk + uk − wk) ) }

•  Optimization over policies: Rules/functions uk = µk(xk) that map states to controls


ADDITIONAL ASSUMPTIONS

•   The set of values that the control uk can take depends at most on xk and not on prior x or u

•   Probability distribution of wk does not depend on past values wk−1, . . . , w0, but may depend on xk and uk

−   Otherwise past values of w or x would be useful for future optimization

•   Sequence of events envisioned in period k:

−   xk occurs according to

xk = fk−1(xk−1, uk−1, wk−1)

−   uk is selected with knowledge of xk, i.e.,

uk ∈ Uk(xk)

−   wk is random and generated according to a distribution

Pwk(· | xk, uk)


DETERMINISTIC FINITE-STATE PROBLEMS

•   Scheduling example: Find optimal sequence of operations A, B, C, D

•   A must precede B, and C must precede D

•   Given startup costs SA and SC, and setup transition cost Cmn from operation m to operation n

[Figure: state-transition graph of the scheduling problem; from the initial state, arcs labeled by the startup costs SA, SC and the transition costs Cmn lead through the partial schedules (A, C, AB, AC, CA, CD, . . .) to the complete schedules]


STOCHASTIC FINITE-STATE PROBLEMS

•  Example: Find two-game chess match strategy

•   Timid play draws with prob. pd > 0 and loses with prob. 1 − pd. Bold play wins with prob. pw < 1/2 and loses with prob. 1 − pw

[Figure: transition diagrams for the 1st and 2nd game under timid play (draw with prob. pd, loss with prob. 1 − pd) and bold play (win with prob. pw, loss with prob. 1 − pw), showing the possible match scores after each game]


BASIC PROBLEM

•   System xk+1 = fk(xk, uk, wk), k = 0, . . . , N − 1

•   Control constraints uk ∈ Uk(xk)

•   Probability distribution Pk(· | xk, uk) of wk

•   Policies π = {µ0, . . . , µN−1}, where µk maps states xk into controls uk = µk(xk) and is such that µk(xk) ∈ Uk(xk) for all xk

•   Expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

•   Optimal cost function

J∗(x0) = min_π Jπ(x0)

•   Optimal policy  π∗ satisfies

J π∗(x0) = J ∗(x0)

When produced by DP,  π∗ is independent of  x0.


SIGNIFICANCE OF FEEDBACK

•   Open-loop versus closed-loop policies

[Figure: closed-loop system; the controller µk observes the state xk and applies uk = µk(xk) to the system xk+1 = fk(xk, uk, wk), which is driven by the disturbance wk]

•   In deterministic problems open loop is as good as closed loop

•   Value of information; chess match example

•  Example of open-loop policy: Play always bold

•   Consider the closed-loop policy:   Play timid if and only if you are ahead

[Figure: transition diagram of the closed-loop policy: bold play in the 1st game; in the 2nd game, timid play if ahead (1-0) and bold play if behind (0-1)]


VARIANTS OF DP PROBLEMS

•   Continuous-time problems

•   Imperfect state information problems

•   Infinite horizon problems

•   Suboptimal control


LECTURE BREAKDOWN

•   Finite Horizon Problems (Vol. 1, Ch. 1-6)

−  Ch. 1: The DP algorithm (2 lectures)

−  Ch. 2: Deterministic finite-state problems (1 lecture)

−  Ch. 3: Deterministic continuous-time problems (1 lecture)

−  Ch. 4: Stochastic DP problems (2 lectures)

−  Ch. 5: Imperfect state information problems (2 lectures)

−  Ch. 6: Suboptimal control (2 lectures)

•  Infinite Horizon Problems - Simple (Vol. 1, Ch. 7, 3 lectures)

•   Infinite Horizon Problems - Advanced (Vol. 2)

−  Ch. 1: Discounted problems - Computational methods (3 lectures)

−  Ch. 2: Stochastic shortest path problems (1 lecture)

−  Ch. 6: Approximate DP (6 lectures)


A NOTE ON THESE SLIDES

•   These slides are a teaching aid, not a text

•   Don't expect a rigorous mathematical development or precise mathematical statements

•  Figures are meant to convey and enhance ideas, not to express them precisely

•  Omitted proofs and a much fuller discussion can be found in the texts, which these slides follow


6.231 DYNAMIC PROGRAMMING

LECTURE 2

LECTURE OUTLINE

•   The basic problem

•   Principle of optimality

•   DP example: Deterministic problem

•   DP example: Stochastic problem

•  The general DP algorithm

•   State augmentation


BASIC PROBLEM

•   System xk+1 = fk(xk, uk, wk), k = 0, . . . , N − 1

•   Control constraints uk ∈ Uk(xk)

•   Probability distribution Pk(· | xk, uk) of wk

•   Policies π = {µ0, . . . , µN−1}, where µk maps states xk into controls uk = µk(xk) and is such that µk(xk) ∈ Uk(xk) for all xk

•   Expected cost of π starting at x0 is

Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) }

•   Optimal cost function

J∗(x0) = min_π Jπ(x0)

•   Optimal policy  π∗ is one that satisfies

J π∗(x0) = J ∗(x0)


PRINCIPLE OF OPTIMALITY

•   Let π∗ = {µ∗0, µ∗1, . . . , µ∗N−1} be an optimal policy

•   Consider the “tail subproblem” whereby we are at xi at time i and wish to minimize the “cost-to-go” from time i to time N

gN(xN) + Σ_{k=i}^{N−1} gk(xk, µk(xk), wk)

and the “tail policy” {µ∗i, µ∗i+1, . . . , µ∗N−1}

[Figure: time line from 0 to N; the tail subproblem starts at state xi at time i]

•   Principle of optimality: The tail policy is optimal for the tail subproblem (optimization of the future does not depend on what we did in the past)

•   DP first solves ALL tail subproblems of the final stage

•  At the generic step, it solves ALL tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length


DETERMINISTIC SCHEDULING EXAMPLE

•   Find optimal sequence of operations A, B, C, D (A must precede B and C must precede D)

[Figure: the scheduling graph of the earlier slide, now annotated with the arc costs and, next to each state, the optimal cost-to-go obtained by the backward DP recursion]

•  Start from the last tail subproblem and go backwards

•   At each state-time pair, we record the optimal cost-to-go and the optimal decision
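A minimal Python sketch of this backward pass over a staged deterministic graph is given below; the stage/arc data at the end are illustrative placeholders, not the costs of the figure.

    # Backward DP on a deterministic finite-state problem given as a staged graph.
    # stages[k] lists the states at time k; arc_cost[k][(i, j)] is the cost of the
    # arc i -> j between stage k and stage k+1.

    def backward_dp(stages, arc_cost, terminal_cost):
        N = len(stages) - 1
        J = {x: terminal_cost[x] for x in stages[N]}        # J_N(x) = terminal cost
        policy = {}
        for k in range(N - 1, -1, -1):                      # go backwards
            J_new = {}
            for i in stages[k]:
                # minimize arc cost plus cost-to-go over the successors of i
                best = min((c + J[j], j) for (ii, j), c in arc_cost[k].items() if ii == i)
                J_new[i], policy[(k, i)] = best[0], best[1]
            J = J_new
        return J, policy                                    # J[initial state] = optimal cost

    # Tiny hypothetical instance (2 stages)
    stages = [['s'], ['A', 'C'], ['AB', 'AC', 'CA', 'CD']]
    arc_cost = [{('s', 'A'): 5, ('s', 'C'): 3},
                {('A', 'AB'): 2, ('A', 'AC'): 6, ('C', 'CA'): 4, ('C', 'CD'): 3}]
    terminal_cost = {'AB': 1, 'AC': 3, 'CA': 1, 'CD': 5}
    J, policy = backward_dp(stages, arc_cost, terminal_cost)
    print(J['s'], policy)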


STOCHASTIC INVENTORY EXAMPLE

[Figure: inventory system block diagram; stock xk at period k, order uk, demand wk, next stock xk+1 = xk + uk − wk; cost of period k: c·uk + r(xk + uk − wk)]

•  Tail Subproblems of Length 1:

JN−1(xN−1) = min_{uN−1≥0} E_{wN−1}{ c·uN−1 + r(xN−1 + uN−1 − wN−1) }

•  Tail Subproblems of Length N − k:

Jk(xk) = min_{uk≥0} E_{wk}{ c·uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) }

•   J 0(x0) is opt. cost of initial state  x0


DP ALGORITHM

•  Start with

JN(xN) = gN(xN),

and go backwards using

Jk(xk) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) },   k = 0, 1, . . . , N − 1.

•   Then J0(x0), generated at the last step, is equal to the optimal cost J∗(x0). Also, the policy

π∗ = {µ∗0, . . . , µ∗N−1},

where µ∗k(xk) minimizes in the right side above for each xk and k, is optimal

•  Justification: Proof by induction that Jk(xk) is equal to J∗k(xk), defined as the optimal cost of the tail subproblem that starts at time k at state xk

•   Note:

−  ALL the tail subproblems are solved (in addition to the original problem)

−   Intensive computational requirements
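The recursion above can be written almost verbatim for a problem with a small finite state space. The following Python sketch does so; the inventory instance at the end (capacity 2, demand uniform on {0, 1}, state clamped to {0, 1, 2}) is an illustrative assumption, not data from the text.

    # Generic finite-horizon stochastic DP over an enumerated state space.
    # f(k,x,u,w): next state; g(k,x,u,w): stage cost; gN(x): terminal cost;
    # W(k,x,u): list of (disturbance, probability) pairs.

    def dp(N, states, controls, f, g, gN, W):
        J = {x: gN(x) for x in states}                 # J_N = g_N
        policy = [dict() for _ in range(N)]
        for k in range(N - 1, -1, -1):                 # backward in time
            Jk = {}
            for x in states:
                best_u, best_val = None, float('inf')
                for u in controls(k, x):
                    # expected stage cost plus cost-to-go of the next state
                    val = sum(p * (g(k, x, u, w) + J[f(k, x, u, w)]) for w, p in W(k, x, u))
                    if val < best_val:
                        best_u, best_val = u, val
                Jk[x], policy[k][x] = best_val, best_u
            J = Jk
        return J, policy                               # J[x0] = optimal cost from x0

    # Hypothetical inventory instance
    states = [0, 1, 2]
    c, price_p, hold_h = 1.0, 3.0, 1.0
    r = lambda x: price_p * max(0, -x) + hold_h * max(0, x)
    J, policy = dp(
        N=3, states=states,
        controls=lambda k, x: [u for u in states if x + u <= 2],
        f=lambda k, x, u, w: max(0, min(2, x + u - w)),   # clamp the stock to {0,1,2}
        g=lambda k, x, u, w: c * u + r(x + u - w),
        gN=lambda x: 0.0,
        W=lambda k, x, u: [(0, 0.5), (1, 0.5)])
    print(J, policy[0])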


PROOF OF THE INDUCTION STEP

•   Let πk = {µk, µk+1, . . . , µN−1} denote a tail policy from time k onward

•   Assume that Jk+1(xk+1) = J∗k+1(xk+1). Then

J∗k(xk) = min_{(µk, πk+1)} E_{wk,...,wN−1}{ gk(xk, µk(xk), wk) + gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) }

   = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1}{ gN(xN) + Σ_{i=k+1}^{N−1} gi(xi, µi(xi), wi) } }

   = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + J∗k+1(fk(xk, µk(xk), wk)) }

   = min_{µk} E_{wk}{ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) }

   = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) }

   = Jk(xk)


LINEAR-QUADRATIC ANALYTICAL EXAMPLE

[Figure: material with initial temperature x0 passes through Oven 1 (temperature u0), exits at temperature x1, passes through Oven 2 (temperature u1), and exits at final temperature x2]

•  System

xk+1 = (1 − a)xk + a·uk,   k = 0, 1,

where a is a given scalar from the interval (0, 1)

•  Cost

r(x2 − T)^2 + u0^2 + u1^2

where r is a given positive scalar

•   DP Algorithm:

J2(x2) = r(x2 − T)^2

J1(x1) = min_{u1} [ u1^2 + r((1 − a)x1 + a·u1 − T)^2 ]

J0(x0) = min_{u0} [ u0^2 + J1((1 − a)x0 + a·u0) ]
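A short sketch that carries out this two-oven recursion symbolically (assuming the sympy package is available), so the closed-form minimizers and J0 can be inspected:

    # Symbolic DP for the two-oven example: each stage cost is quadratic in the
    # control, so setting the derivative to zero gives the minimizer in closed form.
    import sympy as sp

    x0, x1, x2, u0, u1, a, r, T = sp.symbols('x0 x1 x2 u0 u1 a r T', real=True)

    J2 = r * (x2 - T)**2

    # Stage 1: substitute x2 = (1 - a) x1 + a u1 and minimize over u1
    expr1 = u1**2 + J2.subs(x2, (1 - a)*x1 + a*u1)
    u1_star = sp.solve(sp.diff(expr1, u1), u1)[0]      # unique stationary point of a convex quadratic
    J1 = sp.simplify(expr1.subs(u1, u1_star))

    # Stage 0: substitute x1 = (1 - a) x0 + a u0 and minimize over u0
    expr0 = u0**2 + J1.subs(x1, (1 - a)*x0 + a*u0)
    u0_star = sp.solve(sp.diff(expr0, u0), u0)[0]
    J0 = sp.simplify(expr0.subs(u0, u0_star))

    print('u1* =', sp.simplify(u1_star))
    print('J0(x0) =', J0)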


STATE AUGMENTATION

•   When assumptions of the basic problem are violated (e.g., disturbances are correlated, cost is nonadditive, etc.), reformulate/augment the state

•   Example: Time lags

xk+1 = f k(xk, xk−1, uk, wk)

•   Introduce an additional state variable yk = xk−1. The new system takes the form

(xk+1, yk+1) = ( fk(xk, yk, uk, wk), xk )

View x̃k = (xk, yk) as the new state.

•   DP algorithm for the reformulated problem:

Jk(xk, xk−1) = min_{uk∈Uk(xk)} E_{wk}{ gk(xk, uk, wk) + Jk+1( fk(xk, xk−1, uk, wk), xk ) }


6.231 DYNAMIC PROGRAMMING

LECTURE 3

LECTURE OUTLINE

•  Deterministic finite-state DP problems

•  Backward shortest path algorithm

•   Forward shortest path algorithm

•   Shortest path examples

•  Alternative shortest path algorithms


DETERMINISTIC FINITE-STATE PROBLEM

[Figure: states arranged by stages 0, 1, 2, . . . , N − 1, N, with initial state s; an artificial terminal node t is added, and the terminal arcs into t carry cost equal to the terminal cost]

•  States  <==>  Nodes

•   Controls  <==>  Arcs

•   Control sequences (open-loop)  <==>  paths from initial state to terminal states

•   a_ij^k: Cost of transition from state i ∈ Sk to state j ∈ Sk+1 at time k (view it as “length” of the arc)

•   a_it^N: Terminal cost of state i ∈ SN

•  Cost of control sequence  <==>  Cost of the corresponding path (view it as “length” of the path)


BACKWARD AND FORWARD DP ALGORITHM

•   DP algorithm:

JN(i) = a_it^N,   i ∈ SN,

Jk(i) = min_{j∈Sk+1} [ a_ij^k + Jk+1(j) ],   i ∈ Sk,  k = 0, . . . , N − 1

The optimal cost is J0(s) and is equal to the length of the shortest path from s to t

•   Observation: An optimal path s → t is also an optimal path t → s in a “reverse” shortest path problem where the direction of each arc is reversed and its length is left unchanged

•   Forward DP algorithm (= backward DP algorithm for the reverse problem):

J̃N(j) = a_sj^0,   j ∈ S1,

J̃k(j) = min_{i∈SN−k} [ a_ij^{N−k} + J̃k+1(i) ],   j ∈ SN−k+1

The optimal cost is J̃0(t) = min_{i∈SN} [ a_it^N + J̃1(i) ]

•   View J̃k(j) as the optimal cost-to-arrive to state j from initial state s


A NOTE ON FORWARD DP ALGORITHMS

•   There is no forward DP algorithm for stochastic problems

•   Mathematically, for stochastic problems, we cannot restrict ourselves to open-loop sequences, so the shortest path viewpoint fails

•   Conceptually, in the presence of uncertainty, the concept of “optimal cost-to-arrive” at a state xk does not make sense. For example, it may be impossible to guarantee (with prob. 1) that any given state can be reached

•   By contrast, even in stochastic problems, the concept of “optimal cost-to-go” from any state xk makes clear sense


GENERIC SHORTEST PATH PROBLEMS

• {1, 2, . . . , N, t}: nodes of a graph (t: the destination)

•   aij: cost of moving from node i to node j

•  Find a shortest (minimum cost) path from each node i to node t

•   Assumption: All cycles have nonnegative length. Then an optimal path need not take more than N moves

•   We formulate the problem as one where we require exactly N moves but allow degenerate moves from a node i to itself with cost aii = 0

Jk(i) = opt. cost of getting from i to t in N − k moves

J0(i): Cost of the optimal path from i to t.

•   DP algorithm:

Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2,

with JN−1(i) = ait,   i = 1, 2, . . . , N


EXAMPLE

[Figure: (a) a shortest path example with nodes 1, . . . , 4, destination node 5, and the arc lengths shown; (b) the corresponding cost-to-go values Jk(i) for each node i and stage k produced by the DP recursion]

JN−1(i) = ait,   i = 1, 2, . . . , N,

Jk(i) = min_{j=1,...,N} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2.


ESTIMATION / HIDDEN MARKOV MODELS

•  Markov chain with transition probabilities  pij

•   State transitions are hidden from view

•   For each transition, we get an (independent) observation

•  r(z; i, j): Prob. the observation takes value z when the state transition is from i to j

•   Trajectory estimation problem: Given the observation sequence ZN = {z1, z2, . . . , zN}, what is the “most likely” state transition sequence X̂N = {x̂0, x̂1, . . . , x̂N} [one that maximizes p(XN | ZN) over all XN = {x0, x1, . . . , xN}].

[Figure: trellis of hidden state transitions x0 → x1 → · · · → xN, drawn between an artificial origin node s and terminal node t]


VITERBI ALGORITHM

•   We have

p(XN | ZN) = p(XN, ZN) / p(ZN)

where p(XN, ZN) and p(ZN) are the unconditional probabilities of occurrence of (XN, ZN) and ZN

•   Maximizing p(XN | ZN) is equivalent to maximizing ln(p(XN, ZN))

•   We have

p(XN, ZN) = π_{x0} Π_{k=1}^{N} p_{xk−1 xk} r(zk; xk−1, xk)

so the problem is equivalent to

minimize  −ln(π_{x0}) − Σ_{k=1}^{N} ln( p_{xk−1 xk} r(zk; xk−1, xk) )

over all possible sequences {x0, x1, . . . , xN}.

•  This is a shortest path problem.
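A minimal Python sketch of the resulting shortest path (Viterbi) recursion; the chain, observation model, and observation sequence at the end are hypothetical.

    # Viterbi: propagate the lengths of the best paths to each state and backtrack.
    import math

    def viterbi(pi, p, r_obs, z):
        """pi[i]: prior of initial state i;  p[i][j]: transition prob;
           r_obs(z, i, j): prob of observation z on transition i -> j;
           z: observations z_1, ..., z_N.  Returns the most likely state sequence."""
        n = len(pi)
        D = [-math.log(pi[i]) for i in range(n)]      # best path lengths to x_0 = i
        back = []
        for zk in z:
            D_new, choices = [], []
            for j in range(n):
                cand = [(D[i] - math.log(p[i][j] * r_obs(zk, i, j)), i) for i in range(n)]
                best = min(cand)
                D_new.append(best[0]); choices.append(best[1])
            D, back = D_new, back + [choices]
        xN = min(range(n), key=lambda j: D[j])        # best final state
        seq = [xN]
        for choices in reversed(back):                # backtrack
            seq.append(choices[seq[-1]])
        return list(reversed(seq))                    # (x_0, x_1, ..., x_N)

    # Hypothetical two-state chain, two observation symbols
    pi = [0.6, 0.4]
    p = [[0.7, 0.3], [0.4, 0.6]]
    r_obs = lambda z, i, j: [[0.9, 0.1], [0.2, 0.8]][j][z]   # here it depends only on the new state j
    print(viterbi(pi, p, r_obs, z=[0, 1, 1]))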


GENERAL SHORTEST PATH ALGORITHMS

•   There are many non-DP shortest path algorithms. They can all be used to solve deterministic finite-state problems

•   They may be preferable to DP if they avoid calculating the optimal cost-to-go of EVERY state

•  This is essential for problems with HUGE state spaces. Such problems arise for example in combinatorial optimization

[Figure: tree of partial tours for a 4-city traveling salesman example; from origin node s = A, arcs with the shown costs lead through the partial tours AB, AC, AD to the complete tours (ABCD, ABDC, . . .), which are connected to an artificial terminal node t]


LABEL CORRECTING METHODS

•  Given: Origin s, destination t, lengths aij ≥ 0.

•   Idea is to progressively discover shorter paths from the origin s to every other node i

•   Notation:

−   di (label of i): Length of the shortest path found (initially ds = 0, di = ∞ for i ≠ s)

−  UPPER: The label dt of the destination

−   OPEN list: Contains nodes that are currently active in the sense that they are candidates for further examination (initially OPEN = {s})

Label Correcting Algorithm

Step 1 (Node Removal):  Remove a node i from OPEN and for each child j of i, do step 2

Step 2 (Node Insertion Test):  If di + aij < min{dj, UPPER}, set dj = di + aij and set i to be the parent of j. In addition, if j ≠ t, place j in OPEN if it is not already in OPEN, while if j = t, set UPPER to the new value di + ait of dt

Step 3 (Termination Test):  If OPEN is empty, terminate; else go to step 1
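A minimal Python sketch of the three steps above; removing the last node of OPEN (depth-first) is just one admissible choice, and the example graph is hypothetical.

    # Label correcting method for a shortest path from s to t with lengths a_ij >= 0.
    import math

    def label_correcting(arcs, s, t):
        """arcs[i]: list of (j, a_ij) pairs with a_ij >= 0."""
        d = {s: 0.0}                       # labels; a missing key means +infinity
        parent = {}
        UPPER = math.inf
        OPEN = [s]
        while OPEN:                        # Step 3: stop when OPEN is empty
            i = OPEN.pop()                 # Step 1: remove a node from OPEN
            for j, a_ij in arcs.get(i, []):
                dj_new = d[i] + a_ij
                if dj_new < min(d.get(j, math.inf), UPPER):   # Step 2: insertion test
                    d[j], parent[j] = dj_new, i
                    if j != t:
                        if j not in OPEN:
                            OPEN.append(j)
                    else:
                        UPPER = dj_new
        return UPPER, parent               # UPPER = shortest distance from s to t

    # Hypothetical graph
    arcs = {'s': [('a', 1), ('b', 4)], 'a': [('b', 1), ('t', 6)], 'b': [('t', 1)]}
    print(label_correcting(arcs, 's', 't'))   # shortest path s -> a -> b -> t, length 3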


VISUALIZATION/EXPLANATION

•   Given: Origin s, destination t, lengths aij ≥ 0

•   di (label of i): Length of the shortest path found thus far (initially ds = 0, di = ∞ for i ≠ s). The label di is implicitly associated with an s → i path

•  UPPER: The label  dt  of the destination

•   OPEN list: Contains “active” nodes (initiallyOPEN={s})

[Figure: flowchart of one iteration: a node i is removed from OPEN; for each child j, test whether di + aij < dj (is the path s → i → j better than the current path s → j?) and whether di + aij < UPPER (does the path s → i → j have a chance to be part of a shorter s → t path?); if both hold, set dj = di + aij and insert j into OPEN]


EXAMPLE

[Figure: the traveling salesman tree of the earlier slide, with its nodes numbered 1–10 as used in the table below]

Iter. No.   Node Exiting OPEN   OPEN after Iteration   UPPER
0           -                   1                      ∞
1           1                   2, 7, 10               ∞
2           2                   3, 5, 7, 10            ∞
3           3                   4, 5, 7, 10            ∞
4           4                   5, 7, 10               43
5           5                   6, 7, 10               43
6           6                   7, 10                  13
7           7                   8, 10                  13
8           8                   9, 10                  13
9           9                   10                     13
10          10                  Empty                  13

•   Note that some nodes never entered OPEN


VALIDITY OF LABEL CORRECTING METHOD

Proposition:  If there exists at least one path from the origin to the destination, the label correcting algorithm terminates with UPPER equal to the shortest distance from the origin to the destination

Proof:  (1) Each time a node j enters OPEN, its label is decreased and becomes equal to the length of some path from s to j

(2) The number of possible distinct path lengths is finite, so the number of times a node can enter OPEN is finite, and the algorithm terminates

(3) Let (s, j1, j2, . . . , jk, t) be a shortest path and let d∗ be the shortest distance. If UPPER > d∗ at termination, UPPER will also be larger than the length of all the paths (s, j1, . . . , jm), m = 1, . . . , k, throughout the algorithm. Hence, node jk will never enter the OPEN list with djk equal to the shortest distance from s to jk. Similarly node jk−1 will never enter the OPEN list with djk−1 equal to the shortest distance from s to jk−1. Continue to j1 to get a contradiction


6.231 DYNAMIC PROGRAMMING

LECTURE 4

LECTURE OUTLINE

•   Deterministic continuous-time optimal control

•   Examples

•   Connection with the calculus of variations

•   The Hamilton-Jacobi-Bellman equation as acontinuous-time limit of the DP algorithm

•   The Hamilton-Jacobi-Bellman equation as asufficient condition

•   Examples


PROBLEM FORMULATION

•   Continuous-time dynamic system:

ẋ(t) = f(x(t), u(t)),   0 ≤ t ≤ T,   x(0): given,

where

−  x(t) ∈ ℜn: state vector at time t

−  u(t) ∈ U ⊂ ℜm: control vector at time t

−  U: control constraint set

−  T: terminal time

•  Admissible control trajectories {u(t) | t ∈ [0, T]}: piecewise continuous functions {u(t) | t ∈ [0, T]} with u(t) ∈ U for all t ∈ [0, T]; they uniquely determine {x(t) | t ∈ [0, T]}

•   Problem: Find an admissible control trajectory {u(t) | t ∈ [0, T]} and corresponding state trajectory {x(t) | t ∈ [0, T]} that minimize the cost

h(x(T)) + ∫_0^T g(x(t), u(t)) dt

•  f, h, g are assumed continuously differentiable


EXAMPLE I

•   Motion control:  A unit mass moves on a line under the influence of a force u

•   x(t) = (x1(t), x2(t)): position and velocity of the mass at time t

•  Problem: From a given x1(0), x2(0), bring the mass “near” a given final position-velocity pair (x̄1, x̄2) at time T in the sense:

minimize  |x1(T) − x̄1|^2 + |x2(T) − x̄2|^2

subject to the control constraint

|u(t)| ≤ 1,   for all t ∈ [0, T]

•  The problem fits the framework with

ẋ1(t) = x2(t),   ẋ2(t) = u(t),

h(x(T)) = |x1(T) − x̄1|^2 + |x2(T) − x̄2|^2,

g(x(t), u(t)) = 0,   for all t ∈ [0, T]


EXAMPLE II

•  A producer with production rate x(t) at time t may allocate a portion u(t) of his/her production rate to reinvestment and 1 − u(t) to production of a storable good. Thus x(t) evolves according to

ẋ(t) = γ u(t) x(t),

where γ > 0 is a given constant

•   The producer wants to maximize the total amount of product stored

∫_0^T (1 − u(t)) x(t) dt

subject to

0 ≤ u(t) ≤ 1,   for all t ∈ [0, T]

•   The initial production rate x(0) is a given positive number


EXAMPLE III (CALCULUS OF VARIATIONS)

[Figure: a curve x(t) from a given point (0, α) to a given vertical line at t = T, with slope ẋ(t) = u(t) and length ∫_0^T √(1 + (u(t))^2) dt]

•   Find a curve from a given point to a given line that has minimum length

•  The problem is

minimize  ∫_0^T √(1 + (ẋ(t))^2) dt

subject to  x(0) = α

•   Reformulation as an optimal control problem:

minimize  ∫_0^T √(1 + (u(t))^2) dt

subject to  ẋ(t) = u(t),   x(0) = α


HAMILTON-JACOBI-BELLMAN EQUATION I

•   We discretize [0, T] at times 0, δ, 2δ, . . . , Nδ, where δ = T/N, and we let

xk = x(kδ),   uk = u(kδ),   k = 0, 1, . . . , N

•   We also discretize the system and cost:

xk+1 = xk + f(xk, uk)·δ,       h(xN) + Σ_{k=0}^{N−1} g(xk, uk)·δ

•   We write the DP algorithm for the discretized problem

J∗(Nδ, x) = h(x),

J∗(kδ, x) = min_{u∈U} [ g(x, u)·δ + J∗((k+1)·δ, x + f(x, u)·δ) ].

•   Assume J∗ is differentiable and Taylor-expand:

J∗(kδ, x) = min_{u∈U} [ g(x, u)·δ + J∗(kδ, x) + ∇tJ∗(kδ, x)·δ + ∇xJ∗(kδ, x)′ f(x, u)·δ + o(δ) ]

•   Cancel J∗(kδ, x), divide by δ, and take the limit


HAMILTON-JACOBI-BELLMAN EQUATION II

•   Let J∗(t, x) be the optimal cost-to-go of the continuous problem. Assuming the limit is valid,

lim_{k→∞, δ→0, kδ=t} J∗(kδ, x) = J∗(t, x),   for all t, x,

we obtain for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇tJ∗(t, x) + ∇xJ∗(t, x)′ f(x, u) ]

with the boundary condition J∗(T, x) = h(x)

•   This is the Hamilton-Jacobi-Bellman (HJB) equation – a partial differential equation, which is satisfied for all time-state pairs (t, x) by the cost-to-go function J∗(t, x) (assuming J∗ is differentiable and the preceding informal limiting procedure is valid)

•   Hard to tell a priori if J∗(t, x) is differentiable

•  So we use the HJB Eq. as a verification tool; if we can solve it for a differentiable J∗(t, x), then:

−   J∗ is the optimal cost-to-go function

−   The control µ∗(t, x) that minimizes in the RHS for each (t, x) defines an optimal control


VERIFICATION/SUFFICIENCY THEOREM

•   Suppose V(t, x) is a solution to the HJB equation; that is, V is continuously differentiable in t and x, and is such that for all t, x,

0 = min_{u∈U} [ g(x, u) + ∇tV(t, x) + ∇xV(t, x)′ f(x, u) ],

V(T, x) = h(x),   for all x

•  Suppose also that µ∗(t, x) attains the minimum above for all t and x

•  Let {x∗(t) | t ∈ [0, T]} and u∗(t) = µ∗(t, x∗(t)), t ∈ [0, T], be the corresponding state and control trajectories

•   Then

V(t, x) = J∗(t, x),   for all t, x,

and {u∗(t) | t ∈ [0, T]} is optimal


PROOF

Let {(u(t), x(t)) | t ∈ [0, T]} be any admissible control-state trajectory. We have for all t ∈ [0, T]

0 ≤ g(x(t), u(t)) + ∇tV(t, x(t)) + ∇xV(t, x(t))′ f(x(t), u(t)).

Using the system equation ẋ(t) = f(x(t), u(t)), the RHS of the above is equal to

g(x(t), u(t)) + (d/dt) V(t, x(t))

Integrating this expression over t ∈ [0, T],

0 ≤ ∫_0^T g(x(t), u(t)) dt + V(T, x(T)) − V(0, x(0)).

Using V(T, x) = h(x), we have

V(0, x(0)) ≤ h(x(T)) + ∫_0^T g(x(t), u(t)) dt.

If we use u∗(t) and x∗(t) in place of u(t) and x(t), the inequalities become equalities, and

V(0, x(0)) = h(x∗(T)) + ∫_0^T g(x∗(t), u∗(t)) dt


EXAMPLE OF THE HJB EQUATION

Consider the scalar system ẋ(t) = u(t), with |u(t)| ≤ 1 and cost (1/2)(x(T))^2. The HJB equation is

0 = min_{|u|≤1} [ ∇tV(t, x) + ∇xV(t, x)·u ],   for all t, x,

with the terminal condition V(T, x) = (1/2)x^2

•   Evident candidate for optimality: µ∗(t, x) = −sgn(x). Corresponding cost-to-go

J∗(t, x) = (1/2) ( max{ 0, |x| − (T − t) } )^2.

•  We verify that J∗ solves the HJB Eq., and that u = −sgn(x) attains the min in the RHS. Indeed,

∇tJ∗(t, x) = max{ 0, |x| − (T − t) },

∇xJ∗(t, x) = sgn(x) · max{ 0, |x| − (T − t) }.

Substituting, the HJB Eq. becomes

0 = min_{|u|≤1} [ 1 + sgn(x)·u ] max{ 0, |x| − (T − t) }


LINEAR QUADRATIC PROBLEM

Consider the n-dimensional linear system

ẋ(t) = Ax(t) + Bu(t),

and the quadratic cost

x(T)′Q_T x(T) + ∫_0^T ( x(t)′Qx(t) + u(t)′Ru(t) ) dt

The HJB equation is

0 = min_{u∈ℜm} [ x′Qx + u′Ru + ∇tV(t, x) + ∇xV(t, x)′(Ax + Bu) ],

with the terminal condition V(T, x) = x′Q_T x. We try a solution of the form

V(t, x) = x′K(t)x,   K(t): n × n symmetric,

and show that V(t, x) solves the HJB equation if

K̇(t) = −K(t)A − A′K(t) + K(t)BR^{−1}B′K(t) − Q

with the terminal condition K(T) = Q_T


6.231 DYNAMIC PROGRAMMING

LECTURE 5

LECTURE OUTLINE

•   Examples of stochastic DP problems

•   Linear-quadratic problems

•   Inventory control


LINEAR-QUADRATIC PROBLEMS

•   System:  xk+1 = Ak xk + Bk uk + wk

•   Quadratic cost

E_{wk, k=0,1,...,N−1} { x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0 (in the positive (semi)definite sense).

•   wk are independent and zero mean

•  DP algorithm:

JN(xN) = x′N QN xN,

Jk(xk) = min_{uk} E_{wk}{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }

•   Key facts:

−   Jk(xk) is quadratic

−   Optimal policy {µ∗0, . . . , µ∗N−1} is linear:

µ∗k(xk) = Lk xk

−  Similar treatment of a number of variants


DERIVATION

•   By induction verify that

µ∗k(xk) = Lk xk,     Jk(xk) = x′k Kk xk + constant,

where Lk are matrices given by

Lk = −(B′k Kk+1 Bk + Rk)^{−1} B′k Kk+1 Ak,

and where Kk are symmetric positive semidefinite matrices given by

KN = QN,

Kk = A′k ( Kk+1 − Kk+1 Bk (B′k Kk+1 Bk + Rk)^{−1} B′k Kk+1 ) Ak + Qk.

•  This is called the discrete-time Riccati equation.

•   Just like DP, it starts at the terminal time N and proceeds backwards.

•   Certainty equivalence holds (optimal policy is the same as when wk is replaced by its expected value E{wk} = 0).
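A small numpy sketch of this backward Riccati recursion and the gains Lk for a time-invariant problem; the double-integrator data are illustrative.

    # Discrete-time Riccati recursion: K_N = Q_N, then K_k and L_k backwards in time.
    import numpy as np

    def riccati_gains(A, B, Q, R, QN, N):
        K = QN
        gains = []
        for _ in range(N):                                 # k = N-1 down to 0
            S = B.T @ K @ B + R
            L = -np.linalg.solve(S, B.T @ K @ A)           # L_k = -(B'K B + R)^{-1} B'K A
            K = A.T @ (K - K @ B @ np.linalg.solve(S, B.T @ K)) @ A + Q
            gains.append(L)
        return K, list(reversed(gains))                    # K ~ K_0, gains[k] = L_k

    A = np.array([[1.0, 1.0], [0.0, 1.0]])                 # double integrator (illustrative)
    B = np.array([[0.0], [1.0]])
    Q = np.eye(2); R = np.array([[1.0]]); QN = np.eye(2)
    K0, L = riccati_gains(A, B, Q, R, QN, N=20)
    print(L[0])                                            # close to the steady-state gain for long horizons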


ASYMPTOTIC BEHAVIOR OF RICCATI EQ.

•   Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A, B) and observability of (A, C) where Q = C′C

•  The Riccati equation converges: lim_{k→−∞} Kk = K, where K is pos. definite, and is the unique (within the class of pos. semidefinite matrices) solution of the algebraic Riccati equation

K = A′( K − KB(B′KB + R)^{−1}B′K )A + Q

•  The corresponding steady-state controller µ∗(x) = Lx, where

L = −(B′KB + R)^{−1}B′KA,

is stable in the sense that the matrix (A + BL) of the closed-loop system

xk+1 = (A + BL)xk + wk

satisfies lim_{k→∞}(A + BL)^k = 0.


GRAPHICAL PROOF FOR SCALAR SYSTEMS

[Figure: graph of F(P) against the 45-degree line; the iterates Pk, Pk+1 converge to the positive fixed point P∗; F has horizontal asymptote A^2 R/B^2 + Q and vertical asymptote at P = −R/B^2]

•   Riccati equation (with Pk = KN−k):

Pk+1 = A^2 ( Pk − B^2 Pk^2 / (B^2 Pk + R) ) + Q,

or Pk+1 = F(Pk), where

F(P) = A^2 R P / (B^2 P + R) + Q.

•   Note the two steady-state solutions, satisfying P = F(P), of which only one is positive.


RANDOM SYSTEM MATRICES

•   Suppose that {A0, B0}, . . . , {AN−1, BN−1} are not known but rather are independent random matrices that are also independent of the wk

•   DP algorithm is

JN(xN) = x′N QN xN,

Jk(xk) = min_{uk} E_{wk, Ak, Bk}{ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) }

•   Optimal policy µ∗k(xk) = Lk xk, where

Lk = −( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak},

and where the matrices Kk are given by

KN = QN,

Kk = E{A′k Kk+1 Ak} − E{A′k Kk+1 Bk} ( Rk + E{B′k Kk+1 Bk} )^{−1} E{B′k Kk+1 Ak} + Qk


PROPERTIES

•   Certainty equivalence may not hold

•  Riccati equation may not converge to a steady-state

[Figure: graph of F(P) against the 45-degree line for the random-matrices case, with vertical asymptote at P = −R/E{B^2}; when the term T below is large, F stays above the 45-degree line and the iteration diverges]

•   We have Pk+1 = F(Pk), where

F(P) = E{A^2} R P / ( E{B^2} P + R ) + Q + T P^2 / ( E{B^2} P + R ),

T = E{A^2} E{B^2} − (E{A})^2 (E{B})^2


INVENTORY CONTROL

•   xk: stock, uk: inventory purchased, wk: demand

xk+1 = xk + uk − wk,   k = 0, 1, . . . , N − 1

•   Minimize

E{ Σ_{k=0}^{N−1} ( c·uk + r(xk + uk − wk) ) }

where, for some p > 0 and h > 0,

r(x) = p max(0, −x) + h max(0, x)

•   DP algorithm:

JN(xN) = 0,

Jk(xk) = min_{uk≥0} [ c·uk + H(xk + uk) + E{ Jk+1(xk + uk − wk) } ],

where H(x + u) = E{r(x + u − w)}.


OPTIMAL POLICY

•   DP algorithm can be written as

JN(xN) = 0,

Jk(xk) = min_{uk≥0} Gk(xk + uk) − c·xk,

where

Gk(y) = c·y + H(y) + E{ Jk+1(y − w) }.

•  If Gk is convex and lim_{|x|→∞} Gk(x) → ∞, we have

µ∗k(xk) = Sk − xk   if xk < Sk,
          0             if xk ≥ Sk,

where Sk minimizes Gk(y).

•  This is shown, assuming that c < p, by showing that Jk is convex for all k, and

lim_{|x|→∞} Jk(x) → ∞
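A rough numerical sketch of this computation: Gk is evaluated on a grid of stock levels (with linear interpolation off the grid) and Sk is read off as its minimizer; all parameter values are illustrative.

    # Basestock levels S_k for the inventory problem, computed on a grid.
    import numpy as np

    c, p, h, N = 1.0, 3.0, 1.0, 8
    demand = np.arange(5); probs = np.full(5, 0.2)       # demand uniform on {0,...,4}
    r = lambda x: p * np.maximum(0, -x) + h * np.maximum(0, x)

    grid = np.arange(-10, 21)                            # possible stock levels x_k
    J = np.zeros_like(grid, dtype=float)                 # J_N = 0
    S_levels = []
    for k in range(N - 1, -1, -1):
        # G_k(y) = c y + H(y) + E{ J_{k+1}(y - w) }; np.interp clamps off-grid values
        G = np.array([c * y + np.sum(probs * (r(y - demand) +
                      np.interp(y - demand, grid, J))) for y in grid])
        S_levels.append(grid[np.argmin(G)])              # S_k minimizes G_k
        # J_k(x) = min_{u>=0} G_k(x+u) - c x = min_{y>=x} G_k(y) - c x
        J = np.array([G[grid >= x].min() - c * x for x in grid])
    print(list(reversed(S_levels)))                      # S_0, ..., S_{N-1}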


JUSTIFICATION

•   Graphical inductive proof that  J k   is convex.

[Figure: graphs of H(y) and cy + H(y), with SN−1 the minimizing point of cy + H(y), and the resulting convex function JN−1(xN−1)]


6.231 DYNAMIC PROGRAMMING

LECTURE 6

LECTURE OUTLINE

•   Stopping problems

•   Scheduling problems

•   Other applications


PURE STOPPING PROBLEMS

•   Two possible controls:

−   Stop (incur a one-time stopping cost, and move to a cost-free and absorbing stop state)

−   Continue [using xk+1 = fk(xk, wk) and incurring the cost-per-stage]

•  Each policy consists of a partition of the set of states xk into two regions:

−   Stop region, where we stop

−   Continue region, where we continue

[Figure: the state space partitioned into a STOP region and a CONTINUE region, with the absorbing stop state]


EXAMPLE: ASSET SELLING

•  A person has an asset, and at k = 0, 1, . . . , N − 1 receives a random offer wk

•   May accept wk and invest the money at fixed rate of interest r, or reject wk and wait for wk+1. Must accept the last offer wN−1

•  DP algorithm (xk: current offer, T: stop state):

JN(xN) = xN   if xN ≠ T,
         0     if xN = T,

Jk(xk) = max{ (1 + r)^{N−k} xk, E{Jk+1(wk)} }   if xk ≠ T,
         0                                       if xk = T.

•   Optimal policy:

accept the offer xk if xk > αk,

reject the offer xk if xk < αk,

where

αk = E{Jk+1(wk)} / (1 + r)^{N−k}.
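A small numerical sketch that runs this recursion and prints the thresholds αk, assuming (for illustration) offers uniformly distributed on [0, 1] approximated by a grid.

    # Acceptance thresholds alpha_k for the asset selling problem.
    import numpy as np

    N, r = 10, 0.05
    offers = np.linspace(0.0, 1.0, 101)                  # grid of possible offers w_k
    probs = np.full(offers.size, 1.0 / offers.size)

    J = offers.copy()                                    # J_N(x) = x for x != T
    alphas = []
    for k in range(N - 1, -1, -1):
        EJ = np.sum(probs * J)                           # E{ J_{k+1}(w) }
        alphas.append(EJ / (1.0 + r)**(N - k))           # alpha_k
        J = np.maximum((1.0 + r)**(N - k) * offers, EJ)  # J_k(x) = max{(1+r)^{N-k} x, E J_{k+1}(w)}
    print(list(reversed(alphas)))                        # alpha_0, ..., alpha_{N-1}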


FURTHER ANALYSIS

[Figure: the thresholds α1, α2, . . . , αN−1 plotted against k, decreasing with k and separating the ACCEPT region (offers above αk) from the REJECT region (offers below αk)]

•   Can show that  αk ≥ αk+1   for all  k

•   Proof: Let Vk(xk) = Jk(xk)/(1 + r)^{N−k} for xk ≠ T. Then the DP algorithm is VN(xN) = xN and

Vk(xk) = max{ xk, (1 + r)^{−1} E_w{ Vk+1(w) } }.

We have αk = E_w{Vk+1(w)}/(1 + r), so it is enough to show that Vk(x) ≥ Vk+1(x) for all x and k. Start with VN−1(x) ≥ VN(x) and use the monotonicity property of DP.

•   We can also show that αk → ā as k → −∞. This suggests that for an infinite horizon the optimal policy is stationary.


GENERAL STOPPING PROBLEMS

•   At time k, we may stop at cost t(xk) or choose a control uk ∈ U(xk) and continue

JN(xN) = t(xN),

Jk(xk) = min[ t(xk),  min_{uk∈U(xk)} E{ g(xk, uk, wk) + Jk+1(f(xk, uk, wk)) } ]

•   Optimal to stop at time k for states x in the set

Tk = { x | t(x) ≤ min_{u∈U(x)} E{ g(x, u, w) + Jk+1(f(x, u, w)) } }

•   Since JN−1(x) ≤ JN(x), we have Jk(x) ≤ Jk+1(x) for all k, so

T0 ⊂ · · · ⊂ Tk ⊂ Tk+1 ⊂ · · · ⊂ TN−1.

•  Interesting case is when all the Tk are equal (to TN−1, the set where it is better to stop than to go one step and stop). Can be shown to be true if

f(x, u, w) ∈ TN−1,   for all x ∈ TN−1, u ∈ U(x), w.


SCHEDULING PROBLEMS

•  Set of tasks to perform, the ordering is subject to optimal choice.

•   Costs depend on the order

•  There may be stochastic uncertainty, and precedence and resource availability constraints

•   Some of the hardest combinatorial problems are of this type (e.g., traveling salesman, vehicle routing, etc.)

•   Some special problems admit a simple quasi-analytical solution method

−   Optimal policy has an “index form”, i.e., each task has an easily calculable “cost index”, and it is optimal to select the task that has the minimum value of index (multiarmed bandit problems - to be discussed later)

−   Some problems can be solved by an “interchange argument” (start with some schedule, interchange two adjacent tasks, and see what happens)


EXAMPLE: THE QUIZ PROBLEM

•   Given a list of N questions. If question i is answered correctly (given probability pi), we receive reward Ri; if not, the quiz terminates. Choose order of questions to maximize expected reward.

•   Let i and j be the kth and (k + 1)st questions in an optimally ordered list

L = (i0, . . . , ik−1, i, j, ik+2, . . . , iN−1)

E{reward of L} = E{ reward of {i0, . . . , ik−1} }
      + p_{i0} · · · p_{ik−1} ( pi Ri + pi pj Rj )
      + p_{i0} · · · p_{ik−1} pi pj E{ reward of {ik+2, . . . , iN−1} }

Consider the list with i and j interchanged

L′ = (i0, . . . , ik−1, j, i, ik+2, . . . , iN−1)

Since L is optimal, E{reward of L} ≥ E{reward of L′}, so it follows that pi Ri + pi pj Rj ≥ pj Rj + pj pi Ri, or

pi Ri / (1 − pi) ≥ pj Rj / (1 − pj).
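A tiny sketch that checks the index rule on a small hypothetical instance by comparing it with brute-force enumeration of all orderings.

    # Quiz problem: ordering by decreasing p_i R_i / (1 - p_i) should maximize
    # the expected reward (up to ties).
    from itertools import permutations

    p = [0.9, 0.5, 0.7, 0.2]            # hypothetical success probabilities
    R = [1.0, 4.0, 2.0, 10.0]           # hypothetical rewards

    def expected_reward(order):
        total, prob = 0.0, 1.0
        for i in order:                 # answer questions until the first miss
            total += prob * p[i] * R[i]
            prob *= p[i]
        return total

    best = max(permutations(range(len(p))), key=expected_reward)
    index_rule = sorted(range(len(p)), key=lambda i: p[i] * R[i] / (1 - p[i]), reverse=True)
    print(best, index_rule, expected_reward(best))   # the two orderings should agree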


MINIMAX CONTROL

•  Consider the basic problem with the difference that the disturbance wk, instead of being random, is just known to belong to a given set Wk(xk, uk).

•   Find policy π that minimizes the cost

Jπ(x0) = max_{wk∈Wk(xk,µk(xk)), k=0,1,...,N−1} [ gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(xk), wk) ]

•  The DP algorithm takes the form

JN(xN) = gN(xN),

Jk(xk) = min_{uk∈U(xk)} max_{wk∈Wk(xk,uk)} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ]

(Exercise 1.5 in the text, solution posted on the www).


UNKNOWN-BUT-BOUNDED CONTROL

•   For each k, keep the xk of the controlled system

xk+1 = fk(xk, µk(xk), wk)

inside a given set Xk, the target set at time k.

•   This is a minimax control problem, where the cost at stage k is

gk(xk) = 0 if xk ∈ Xk,   1 if xk ∉ Xk.

•  We must reach at time k the set

X̄k = { xk | Jk(xk) = 0 }

in order to be able to maintain the state within the subsequent target sets.

•  Start with X̄N = XN, and for k = 0, 1, . . . , N − 1,

X̄k = { xk ∈ Xk | there exists uk ∈ Uk(xk) such that fk(xk, uk, wk) ∈ X̄k+1, for all wk ∈ Wk(xk, uk) }
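A small Python sketch of this set recursion for a finite state space; the scalar example at the end (additive dynamics, bounded disturbance) is hypothetical.

    # Backward recursion of the sets X̄_k: keep the states from which some control
    # maps every possible disturbance into X̄_{k+1}.

    def target_tube(N, X, U, W, f):
        """X[k]: target set at time k; U(k, x): controls; W(k, x, u): disturbances;
           f(k, x, u, w): next state."""
        tube = [set(X[N])]                       # X̄_N = X_N
        for k in range(N - 1, -1, -1):
            nxt = tube[0]                        # X̄_{k+1}
            Xbar_k = {x for x in X[k]
                      if any(all(f(k, x, u, w) in nxt for w in W(k, x, u))
                             for u in U(k, x))}
            tube.insert(0, Xbar_k)
        return tube                              # tube[k] = X̄_k

    # Hypothetical scalar example: x_{k+1} = x_k + u_k + w_k, u in {-1,0,1}, |w| <= 1,
    # target sets X_k = {-2, ..., 2}
    N = 3
    X = [set(range(-2, 3)) for _ in range(N + 1)]
    tube = target_tube(N, X, lambda k, x: [-1, 0, 1], lambda k, x, u: [-1, 0, 1],
                       lambda k, x, u, w: x + u + w)
    print(tube[0])                               # states from which the whole tube can be tracked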


6.231 DYNAMIC PROGRAMMING

LECTURE 7

LECTURE OUTLINE

•   Problems with imperfect state info

•   Reduction to the perfect state info case

•   Linear quadratic problems

•   Separation of estimation and control


BASIC PROBL. W/ IMPERFECT STATE INFO

•   Same as basic problem of Chapter 1 with one difference: the controller, instead of knowing xk, receives at each time k an observation of the form

z0 = h0(x0, v0),   zk = hk(xk, uk−1, vk),   k ≥ 1

•   The observation zk belongs to some space Zk.

•  The random observation disturbance vk is characterized by a probability distribution

Pvk(· | xk, . . . , x0, uk−1, . . . , u0, wk−1, . . . , w0, vk−1, . . . , v0)

•   The initial state x0 is also random and characterized by a probability distribution Px0.

•   The probability distribution Pwk(· | xk, uk) of wk is given, and it may depend explicitly on xk and uk but not on w0, . . . , wk−1, v0, . . . , vk−1.

•   The control uk is constrained to a given subset Uk (this subset does not depend on xk, which is not assumed known).


INFORMATION VECTOR AND POLICIES

•   Denote by Ik the information vector, i.e., the information available at time k:

Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1),  k ≥ 1,      I0 = z0

•   We consider policies π = {µ0, µ1, . . . , µN−1}, where each function µk maps the information vector Ik into a control uk and

µk(Ik) ∈ Uk,   for all Ik, k ≥ 0

•  We want to find a policy π that minimizes

Jπ = E_{x0, wk, vk, k=0,...,N−1} { gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations

xk+1 = fk(xk, µk(Ik), wk),   k ≥ 0,

z0 = h0(x0, v0),   zk = hk(xk, µk−1(Ik−1), vk),   k ≥ 1


REFORMULATION AS PERFECT INFO PROBL.

•   We have

Ik+1 = (Ik, zk+1, uk),  k = 0, 1, . . . , N − 2,      I0 = z0

View this as a dynamic system with state Ik, control uk, and random disturbance zk+1

•   We have

P(zk+1 | Ik, uk) = P(zk+1 | Ik, uk, z0, z1, . . . , zk),

since z0, z1, . . . , zk are part of the information vector Ik. Thus the probability distribution of zk+1 depends explicitly only on the state Ik and control uk and not on the prior “disturbances” zk, . . . , z0

•   Write

E{ gk(xk, uk, wk) } = E{ E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk } }

so the cost per stage of the new system is

g̃k(Ik, uk) = E_{xk, wk}{ gk(xk, uk, wk) | Ik, uk }


DP ALGORITHM

•   Writing the DP algorithm for the (reformulated) perfect state info problem and doing the algebra:

Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }

for k = 0, 1, . . . , N − 2, and for k = N − 1,

JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1} { gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 }

•   The optimal cost J∗ is given by

J∗ = E_{z0}{ J0(z0) }


LINEAR-QUADRATIC PROBLEMS

•   System:  xk+1 = Ak xk + Bk uk + wk

•   Quadratic cost

E_{wk, k=0,1,...,N−1} { x′N QN xN + Σ_{k=0}^{N−1} ( x′k Qk xk + u′k Rk uk ) }

where Qk ≥ 0 and Rk > 0

•   Observations

zk = Ck xk + vk,   k = 0, 1, . . . , N − 1

•   w0, . . . , wN−1, v0, . . . , vN−1 indep. zero mean

•   Key fact to show:

−  Optimal policy {µ∗0, . . . , µ∗N−1} is of the form:

µ∗k(Ik) = Lk E{xk | Ik}

Lk: same as for the perfect state info case

−  Estimation problem and control problem can be solved separately


DP ALGORITHM I

•   Last stage N − 1 (suppressing index N − 1):

JN−1(IN−1) = min_{uN−1} E_{xN−1, wN−1} { x′N−1 Q xN−1 + u′N−1 R uN−1
    + (A xN−1 + B uN−1 + wN−1)′ Q (A xN−1 + B uN−1 + wN−1) | IN−1, uN−1 }

•   Since E{wN−1 | IN−1} = E{wN−1} = 0, the minimization involves

min_{uN−1} [ u′N−1(B′QB + R)uN−1 + 2 E{xN−1 | IN−1}′ A′QB uN−1 ]

The minimization yields the optimal µ∗N−1:

u∗N−1 = µ∗N−1(IN−1) = LN−1 E{xN−1 | IN−1}

where

LN−1 = −(B′QB + R)^{−1} B′QA


DP ALGORITHM II

•   Substituting in the DP algorithm

JN−1(IN−1) = E_{xN−1}{ x′N−1 KN−1 xN−1 | IN−1 }
    + E_{xN−1}{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−1 }
    + E_{wN−1}{ w′N−1 QN wN−1 },

where the matrices KN−1 and PN−1 are given by

PN−1 = A′N−1 QN BN−1 (RN−1 + B′N−1 QN BN−1)^{−1} B′N−1 QN AN−1,

KN−1 = A′N−1 QN AN−1 − PN−1 + QN−1

•   Note the structure of JN−1: in addition to the quadratic and constant terms, it involves a quadratic in the estimation error

xN−1 − E{xN−1 | IN−1}


DP ALGORITHM III

•   DP equation for period N − 2:

JN−2(IN−2) = min_{uN−2} E_{xN−2, wN−2, zN−1} { x′N−2 Q xN−2 + u′N−2 R uN−2 + JN−1(IN−1) | IN−2, uN−2 }

    = E{ x′N−2 Q xN−2 | IN−2 }
      + min_{uN−2} [ u′N−2 R uN−2 + E{ x′N−1 KN−1 xN−1 | IN−2, uN−2 }
      + E{ ( xN−1 − E{xN−1 | IN−1} )′ PN−1 ( xN−1 − E{xN−1 | IN−1} ) | IN−2, uN−2 } ]
      + E_{wN−1}{ w′N−1 QN wN−1 }

•   Key point:  We have excluded the next to last term from the minimization with respect to uN−2

•  This term turns out to be independent of uN−2


QUALITY OF ESTIMATION LEMMA

•   For every k, there is a function Mk such that we have

xk − E{xk | Ik} = Mk(x0, w0, . . . , wk−1, v0, . . . , vk),

independently of the policy being used

•   Using the lemma,

xN−1 − E{xN−1 | IN−1} = ξN−1,

where

ξN−1: function of x0, w0, . . . , wN−2, v0, . . . , vN−1

•   Since ξN−1 is independent of uN−2, the conditional expectation of ξ′N−1 PN−1 ξN−1 satisfies

E{ξ′N−1 PN−1 ξN−1 | IN−2, uN−2} = E{ξ′N−1 PN−1 ξN−1 | IN−2}

and is independent of uN−2.

•  So minimization in the DP algorithm yields

u∗N−2 = µ∗N−2(IN−2) = LN−2 E{xN−2 | IN−2}


FINAL RESULT

•   Continuing similarly (using also the quality of estimation lemma)

µ∗k(Ik) = Lk E{xk | Ik},

where Lk is the same as for perfect state info:

Lk = −(Rk + B′k Kk+1 Bk)^{−1} B′k Kk+1 Ak,

with Kk generated using the Riccati equation:

KN = QN,      Kk = A′k Kk+1 Ak − Pk + Qk,

Pk = A′k Kk+1 Bk (Rk + B′k Kk+1 Bk)^{−1} B′k Kk+1 Ak

[Figure: block diagram of the optimal controller; an estimator uses the observations zk = Ck xk + vk and the delayed controls uk−1 to produce E{xk | Ik}, and the actuator applies uk = Lk E{xk | Ik} to the system xk+1 = Ak xk + Bk uk + wk]


SEPARATION INTERPRETATION

•  The optimal controller can be decomposed into

(a) An estimator, which uses the data to generate the conditional expectation E{xk | Ik}.

(b) An actuator, which multiplies E{xk | Ik} by the gain matrix Lk and applies the control input uk = Lk E{xk | Ik}.

•  Generically, the estimate x̂ of a random vector x given some information (random vector) I, which minimizes the mean squared error

E_x{ ‖x − x̂‖^2 | I } = E{ ‖x‖^2 | I } − 2 E{x | I}′ x̂ + ‖x̂‖^2

is x̂ = E{x | I} (set to zero the derivative with respect to x̂ of the above quadratic form).

•  The estimator portion of the optimal controller is optimal for the problem of estimating the state xk assuming the control is not subject to choice.

•   The actuator portion is optimal for the control problem assuming perfect state information.


STEADY STATE/IMPLEMENTATION ASPECTS

•   As N → ∞, the solution of the Riccati equation converges to a steady state and Lk → L.

•   If x0, wk, and vk are Gaussian, E{xk | Ik} is a linear function of Ik and is generated by a nice recursive algorithm, the Kalman filter.

•   The Kalman filter involves also a Riccati equation, so for N → ∞, and a stationary system, it also has a steady-state structure.

•   Thus, for Gaussian uncertainty, the solution is nice and possesses a steady state.

•  For non-Gaussian uncertainty, computing E{xk | Ik} may be very difficult, so a suboptimal solution is typically used.

•   Most common suboptimal controller: Replace E{xk | Ik} by the estimate produced by the Kalman filter (act as if x0, wk, and vk are Gaussian).

•   It can be shown that this controller is optimal within the class of controllers that are linear functions of Ik.


6.231 DYNAMIC PROGRAMMING

LECTURE 8

LECTURE OUTLINE

•   DP for imperfect state info

•   Sufficient statistics

•   Conditional state distribution as a sufficientstatistic

•   Finite-state systems

•   Examples


REVIEW: IMPERFECT STATE INFO PROBLEM

•   Instead of knowing xk, we receive observations

z0 = h0(x0, v0),   zk = hk(xk, uk−1, vk),   k ≥ 1

•   Ik: information vector available at time k:

I0 = z0,   Ik = (z0, z1, . . . , zk, u0, u1, . . . , uk−1),   k ≥ 1

•  Optimization over policies π = {µ0, µ1, . . . , µN−1}, where µk(Ik) ∈ Uk, for all Ik and k.

•   Find a policy π that minimizes

Jπ = E_{x0, wk, vk, k=0,...,N−1} { gN(xN) + Σ_{k=0}^{N−1} gk(xk, µk(Ik), wk) }

subject to the equations

xk+1 = fk(xk, µk(Ik), wk),   k ≥ 0,

z0 = h0(x0, v0),   zk = hk(xk, µk−1(Ik−1), vk),   k ≥ 1


DP ALGORITHM

•   DP algorithm:

Jk(Ik) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1(Ik, zk+1, uk) | Ik, uk }

for k = 0, 1, . . . , N − 2, and for k = N − 1,

JN−1(IN−1) = min_{uN−1∈UN−1} E_{xN−1, wN−1} { gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 }

•   The optimal cost J∗ is given by

J∗ = E_{z0}{ J0(z0) }.


SUFFICIENT STATISTICS

•   Suppose that we can find a function Sk(Ik) such that the right-hand side of the DP algorithm can be written in terms of some function Hk as

min_{uk∈Uk} Hk( Sk(Ik), uk ).

•  Such a function Sk is called a sufficient statistic.

•   An optimal policy obtained by the preceding minimization can be written as

µ∗k(Ik) = µ̄k( Sk(Ik) ),

where µ̄k is an appropriate function.

•   Example of a sufficient statistic:  Sk(Ik) = Ik

•  Another important sufficient statistic:  Sk(Ik) = P_{xk|Ik}


DP ALGORITHM IN TERMS OF P_{xk|Ik}

•  It turns out that P_{xk|Ik} is generated recursively by a dynamic system (estimator) of the form

P_{xk+1|Ik+1} = Φk( P_{xk|Ik}, uk, zk+1 )

for a suitable function Φk

•   DP algorithm can be written as

Jk(P_{xk|Ik}) = min_{uk∈Uk} E_{xk, wk, zk+1} { gk(xk, uk, wk) + Jk+1( Φk(P_{xk|Ik}, uk, zk+1) ) | Ik, uk }

[Figure: block diagram; the system xk+1 = fk(xk, uk, wk) and measurement zk = hk(xk, uk−1, vk) feed an estimator Φk−1 that produces P_{xk|Ik}, and the actuator µk maps P_{xk|Ik} to the control uk]


EXAMPLE: A SEARCH PROBLEM

•   At each period, decide to search or not search a site that may contain a treasure.

•   If we search and a treasure is present, we find it with prob. β and remove it from the site.

•  Treasure’s worth: V. Cost of search: C

•  States: treasure present & treasure not present

•  Each search can be viewed as an observation of the state

•   Denote

pk : prob. of treasure present at the start of time k

with p0 given.

•   pk evolves at time k according to the equation

pk+1 = pk                                     if not search,
       0                                      if search and find treasure,
       pk(1 − β) / ( pk(1 − β) + 1 − pk )     if search and no treasure.


SEARCH PROBLEM (CONTINUED)

•   DP algorithm

Jk(pk) = max[ 0, −C + pk β V + (1 − pk β) Jk+1( pk(1 − β) / ( pk(1 − β) + 1 − pk ) ) ],

with JN(pN) = 0.

•   Can be shown by induction that the functions Jk satisfy

Jk(pk) = 0,   for all pk ≤ C / (βV)

•   Furthermore, it is optimal to search at period k if and only if

pk β V ≥ C

(expected reward from the next search ≥ the cost of the search)
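A short numerical sketch of this DP recursion on a grid of beliefs, confirming the threshold C/(βV); the parameter values are illustrative.

    # Treasure search: belief update and DP recursion on a belief grid.
    import numpy as np

    beta, V, C, N = 0.6, 10.0, 1.0, 5
    grid = np.linspace(0.0, 1.0, 201)

    def next_belief(p):
        return p * (1 - beta) / (p * (1 - beta) + 1 - p)   # search, treasure not found

    J = np.zeros_like(grid)                                # J_N(p) = 0
    for k in range(N - 1, -1, -1):
        cont = -C + grid * beta * V + (1 - grid * beta) * np.interp(next_belief(grid), grid, J)
        J = np.maximum(0.0, cont)                          # max{ not search, search }

    # It is optimal to search where the "search" term is positive; the smallest such
    # belief should be approximately C/(beta V)
    searched = grid[J > 0]
    print(searched.min() if searched.size else None, C / (beta * V))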


FINITE-STATE SYSTEMS - POMDP

•   Suppose the system is a finite-state Markov chain, with states 1, . . . , n.

•   Then the conditional probability distribution P_{xk|Ik} is a vector

( P(xk = 1 | Ik), . . . , P(xk = n | Ik) )

•  The DP algorithm can be executed over the n-dimensional simplex (state space is not expanding with increasing k)

•   When the control and observation spaces are also finite sets the problem is called a POMDP (Partially Observed Markov Decision Problem).

•   For POMDP it turns out that the cost-to-go functions Jk in the DP algorithm are piecewise linear and concave (Exercise 5.7).

•  This is conceptually important. It is also useful in practice because it forms the basis for approximations.


INSTRUCTION EXAMPLE

•   Teaching a student some item. Possible states are L: Item learned, or L̄: Item not learned.

•   Possible decisions: T: Terminate the instruction, or T̄: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.

•  The test has two possible outcomes: R: Student gives a correct answer, or R̄: Student gives an incorrect answer.

[Figure: probabilistic structure; the state moves from L̄ to L with probability t and stays at L̄ with probability 1 − t, while L is absorbing; the test result is R with probability 1 if the state is L, and R with probability r (R̄ with probability 1 − r) if the state is L̄]

•  Cost of instruction is I per period

•   Cost of terminating instruction: 0 if the student has learned the item, and C > 0 if not.


INSTRUCTION EXAMPLE II

•   Let pk: prob. the student has learned the item given the test results so far

pk = P(xk | Ik) = P(xk = L | z0, z1, . . . , zk).

•   Using Bayes’ rule we can obtain

pk+1 = Φ(pk, zk+1)
     = [1 − (1 − t)(1 − pk)] / [1 − (1 − t)(1 − r)(1 − pk)]   if zk+1 = R,
     = 0                                                       if zk+1 = R̄.

•   DP algorithm:

Jk(pk) = min[ (1 − pk)C,  I + E_{zk+1}{ Jk+1( Φ(pk, zk+1) ) } ],

starting with

JN−1(pN−1) = min[ (1 − pN−1)C,  I + (1 − t)(1 − pN−1)C ].


INSTRUCTION EXAMPLE III

•   Write the DP algorithm as

Jk(pk) = min[ (1 − pk)C,  I + Ak(pk) ],

where

Ak(pk) = P(zk+1 = R | Ik) Jk+1( Φ(pk, R) ) + P(zk+1 = R̄ | Ik) Jk+1( Φ(pk, R̄) )

•  Can show by induction that Ak(p) are piecewise linear, concave, monotonically decreasing, with

Ak−1( p) ≤ Ak( p) ≤ Ak+1( p),   for all p ∈ [0, 1].

[Figure: the curves I + AN−1(p), I + AN−2(p), I + AN−3(p) plotted against p ∈ [0, 1], together with the termination cost; their crossing points aN−1, aN−2, aN−3 give the thresholds at which termination becomes optimal.]


6.231 DYNAMIC PROGRAMMING

LECTURE 9

LECTURE OUTLINE

•   Suboptimal control

•   Cost approximation methods: Classification

•   Certainty equivalent control: An example

•   Limited lookahead policies

•   Performance bounds

•   Problem approximation approach

•   Parametric cost-to-go approximation


PRACTICAL DIFFICULTIES OF DP

•   The curse of modeling

•   The curse of dimensionality

−   Exponential growth of the computational and storage requirements as the number of state variables and control variables increases

−   Quick explosion of the number of states in combinatorial problems

−  Intractability of imperfect state information problems

•  There may be real-time solution constraints

−  A family of problems may be addressed. The data of the problem to be solved is given with little advance notice

−  The problem data may change as the system is controlled – need for on-line replanning


COST-TO-GO FUNCTION APPROXIMATION

•   Use a policy computed from the DP equation where the optimal cost-to-go function  Jk+1  is replaced by an approximation  J̃k+1.  (Sometimes  E{gk}  is also replaced by an approximation.)

•   Apply  µk(xk), which attains the minimum in

min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

•   There are several ways to compute  J̃k+1:

−  Off-line approximation: The entire function  J̃k+1 is computed for every k, before the control process begins.

−   On-line approximation: Only the values J̃k+1(xk+1) at the relevant next states  xk+1 are computed and used to compute  uk  just after the current state  xk  becomes known.

−   Simulation-based methods: These are off-line and on-line methods that share the common characteristic that they are based on Monte-Carlo simulation. Some of these methods are suitable for problems of very large size.


CERTAINTY EQUIVALENT CONTROL (CEC)

•   Idea:   Replace the stochastic problem with a deterministic problem

•   At each time  k, the uncertain quantities are fixed at some “typical” values

•  On-line implementation   for a perfect state

info problem. At each time  k:

(1) Fix the  wi,  i ≥ k, at some  w̄i. Solve the deterministic problem:

minimize   gN(xN) + Σ_{i=k}^{N−1} gi(xi, ui, w̄i)

where xk  is known, and  ui ∈ Ui,  xi+1 = fi(xi, ui, w̄i).

(2) Use as control the first element in the optimal control sequence found.

•   So we apply  µk(xk)  that minimizes

gk(xk, uk, w̄k) + J̃k+1( fk(xk, uk, w̄k) ),

where  J̃k+1  is the optimal cost of the corresponding deterministic problem.
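
•   A minimal Python sketch (not from the text) of on-line CEC for a small finite control set; f, g, g_N, controls, and the nominal disturbances w_bar are user-supplied assumptions, and the brute-force search over control sequences is only meant to illustrate the idea:

import itertools

def cec_control(x_k, k, N, f, g, g_N, controls, w_bar):
    best_cost, best_u = float("inf"), None
    for seq in itertools.product(controls, repeat=N - k):
        x, cost = x_k, 0.0
        for i, u in enumerate(seq):                 # simulate with nominal disturbances
            cost += g(k + i, x, u, w_bar[k + i])
            x = f(k + i, x, u, w_bar[k + i])
        cost += g_N(x)
        if cost < best_cost:
            best_cost, best_u = cost, seq[0]
    return best_u                                   # apply only the first control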


ALTERNATIVE OFF-LINE IMPLEMENTATION

•   Let  {µd0(x0), . . . , µdN−1(xN−1)}  be an optimal controller obtained from the DP algorithm for the deterministic problem

minimize   gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), w̄k )

subject to   xk+1 = fk( xk, µk(xk), w̄k ),   µk(xk) ∈ Uk

•   The CEC applies at time  k  the control input  µdk(xk).

•   In an imperfect info version,  xk  is replaced by an estimate  x̄k(Ik).

[Block diagram: system  xk+1 = fk(xk, uk, wk)  with measurement  zk = hk(xk, uk−1, vk); a delay/estimator produces  x̄k(Ik)  from the measurements, and the actuator applies  uk = µdk( x̄k(Ik) ).]


PARTIALLY STOCHASTIC CEC

•  Instead of fixing all  future disturbances to their typical values, fix only some, and treat the rest as stochastic.

•   Important special case:  Treat an imperfect state information problem as one of perfect state information, using an estimate  x̄k(Ik) of  xk  as if it were exact.

•  Multiaccess communication example:  Consider controlling the slotted Aloha system (Example 5.1.1 in the text) by optimally choosing the probability of transmission of waiting packets. This is a hard problem of imperfect state info, whose perfect state info version is easy.

•   Natural partially stochastic CEC:

µk(Ik) = min[ 1,  1 / x̄k(Ik) ],

where  x̄k(Ik) is an estimate of the current packet backlog based on the entire past channel history of successes, idles, and collisions (which is  Ik).


GENERAL COST-TO-GO APPROXIMATION

•   One-step lookahead (1SL) policy: At each k  and state  xk, use the control  µk(xk) that attains the minimum in

min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where

−  J̃N = gN.

−  J̃k+1: approximation to the true cost-to-go Jk+1

•   Two-step lookahead policy: At each  k  and  xk, use the control µk(xk) attaining the minimum above, where the function  J̃k+1  is obtained using a 1SL approximation (solve a 2-step DP problem).

•   If  J̃k+1  is readily available and the minimization above is not too hard, the 1SL policy is implementable on-line.

•  Sometimes one also replaces  Uk(xk) above with a subset of “most promising controls”  Ūk(xk).

•   As the length of lookahead increases, the required computation quickly explodes.
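
•   A minimal Python sketch (not from the text) of a one-step lookahead policy, with the expectation replaced by a sample average over user-supplied disturbance samples; all arguments are assumptions:

def one_step_lookahead(x_k, k, controls, f, g, J_tilde, w_samples):
    def q_value(u):
        # Monte-Carlo estimate of E{ g_k(x,u,w) + J_tilde_{k+1}(f_k(x,u,w)) }
        return sum(g(k, x_k, u, w) + J_tilde(k + 1, f(k, x_k, u, w))
                   for w in w_samples) / len(w_samples)
    return min(controls, key=q_value)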


PERFORMANCE BOUNDS FOR 1SL

•   Let J̄k(xk) be the cost-to-go from (xk, k) of the 1SL policy, based on functions  J̃k.

•   Assume that for all (xk, k), we have

Ĵk(xk) ≤ J̃k(xk),   (*)

where  ĴN = gN  and for all  k,

Ĵk(xk) = min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) }

[so  Ĵk(xk) is computed along with  µ̄k(xk)]. Then

J̄k(xk) ≤ Ĵk(xk),   for all (xk, k).

•  Important application:   When  J̃k  is the cost-to-go of some heuristic policy (then the 1SL policy is called the  rollout  policy).

•   The bound can be extended to the case where there is a  δk  in the RHS of (*). Then

J̄k(xk) ≤ Ĵk(xk) + δk + · · · + δN−1


COMPUTATIONAL ASPECTS

•  Sometimes nonlinear programming can be used to calculate the 1SL or the multistep version [particularly when  Uk(xk) is not a discrete set]. Connection with stochastic programming methods.

•   The choice of the approximating functions  J̃k  is critical, and they are calculated in a variety of ways.

•   Some approaches:

(a)   Problem Approximation: Approximate the optimal cost-to-go with some cost derived from a related but simpler problem

(b)  Parametric Cost-to-Go Approximation: Approximate the optimal cost-to-go with a function of a suitable parametric form, whose parameters are tuned by some heuristic or systematic scheme (Neuro-Dynamic Programming)

(c)   Rollout Approach: Approximate the optimal cost-to-go with the cost of some suboptimal policy, which is calculated either analytically or by simulation


PROBLEM APPROXIMATION

•   Many (problem-dependent) possibilities

−  Replace uncertain quantities by nominal values, or simplify the calculation of expected values by limited simulation

−  Simplify difficult constraints or dynamics

•   Example of   enforced decomposition: Route  m  vehicles that move over a graph. Each node has a “value.” The first vehicle that passes through the node collects its value. Max the total collected value, subject to initial and final time constraints (plus time windows and other constraints).

•   Usually the 1-vehicle version of the problem is much simpler. This motivates an approximation obtained by solving single vehicle problems.

•   1SL scheme: At time  k  and state  xk  (position of vehicles and “collected value nodes”), consider all possible  kth moves by the vehicles, and at the resulting states we approximate the optimal value-to-go with the value collected by optimizing the vehicle routes one-at-a-time


PARAMETRIC COST-TO-GO APPROXIMATION

•  Use a cost-to-go approximation from a parametric class  J̃(x, r) where  x  is the current state and  r = (r1, . . . , rm) is a vector of “tunable” scalars (weights).

•   By adjusting the weights, one can change the “shape” of the approximation  J̃  so that it is reasonably close to the true optimal cost-to-go function.

•   Two key issues:

−   The choice of the parametric class  J̃(x, r) (the approximation architecture).

−   The method for tuning the weights (“training” the architecture).

•  Successful application strongly depends on how these issues are handled, and on insight about the problem.

•   Sometimes a simulator is used, particularly when there is no mathematical model of the system.


APPROXIMATION ARCHITECTURES

•   Divided in linear and nonlinear [i.e., linear or nonlinear dependence of  J̃(x, r) on  r].

•  Linear architectures are easier to train, but non-linear ones (e.g., neural networks) are richer.

• Architectures based on feature extraction

[Diagram: state x → feature extraction mapping → feature vector y → cost approximator with parameter vector r → cost approximation J̃(y, r)]

•  Ideally, the features will encode much of the nonlinearity that is inherent in the cost-to-go function being approximated, and the approximation may be quite accurate without a complicated architecture.

•   Sometimes the state space is partitioned, and “local” features are introduced for each subset of the partition (they are 0 outside the subset).

•  With a well-chosen feature vector  y(x), we can use a linear architecture

J̃(x, r) = Ĵ( y(x), r ) = Σi ri yi(x)


AN EXAMPLE - COMPUTER CHESS

•  Programs use a feature-based position evaluator that assigns a score to each move/position

[Diagram: position evaluator = feature extraction (features: material balance, mobility, safety, etc.) followed by a weighting of features that produces a score]

•   Many context-dependent special features.

•   Most often the weighting of features is linear but multistep lookahead is involved.

•   Most often the training is done by trial and error.


6.231 DYNAMIC PROGRAMMING

LECTURE 10

LECTURE OUTLINE

•   Rollout algorithms

•   Cost improvement property

•  Discrete deterministic problems

•  Approximations of rollout algorithms

•  Discretization of continuous time

•   Discretization of continuous space

•   Other suboptimal approaches


ROLLOUT ALGORITHMS

•   One-step lookahead policy:  At each  k  and state xk, use the control  µk(xk) that attains the minimum in

min_{uk∈Uk(xk)}  E{ gk(xk, uk, wk) + J̃k+1( fk(xk, uk, wk) ) },

where

−  J̃N = gN.

−  J̃k+1: approximation to the true cost-to-go Jk+1

•  Rollout algorithm: When  J̃k  is the cost-to-go of some heuristic policy (called the  base policy)

•  Cost improvement property (to be shown): The rollout algorithm achieves no worse (and usually much better) cost than the base heuristic starting from the same state.

•   Main difficulty: Calculating  J̃k(xk) may be computationally intensive if the cost-to-go of the base policy cannot be analytically calculated.

−   May involve Monte Carlo simulation if the problem is stochastic.

−  Things improve in the deterministic case.
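
•   A minimal Python sketch (not from the text) of a rollout control computed with Monte-Carlo simulation of a user-supplied base policy; all function arguments (f, g, g_N, base_policy, sample_w) are assumptions:

def rollout_control(x_k, k, N, controls, f, g, g_N, base_policy, sample_w, n_sims=100):
    def heuristic_cost(x, start):
        total = 0.0
        for _ in range(n_sims):                     # simulate the base policy to the end
            xs, c = x, 0.0
            for t in range(start, N):
                u = base_policy(t, xs)
                w = sample_w(t, xs, u)
                c += g(t, xs, u, w)
                xs = f(t, xs, u, w)
            total += c + g_N(xs)
        return total / n_sims

    def q_value(u):
        total = 0.0
        for _ in range(n_sims):                     # one step with u, then the heuristic
            w = sample_w(k, x_k, u)
            total += g(k, x_k, u, w) + heuristic_cost(f(k, x_k, u, w), k + 1)
        return total / n_sims

    return min(controls, key=q_value)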


EXAMPLE: THE QUIZ PROBLEM

•   A person is given  N  questions; answering correctly question  i  has probability  pi, reward  vi. The quiz terminates at the first incorrect answer.

•   Problem: Choose the ordering of questions so as to maximize the total expected reward.

•   Assuming no other constraints, it is optimal to use the  index policy: Answer questions in decreasing order of  pi vi/(1 − pi).

•   With minor changes in the problem, the index policy need not be optimal. Examples:

−   A limit (< N) on the maximum number of questions that can be answered.

−   Time windows, sequence-dependent rewards, precedence constraints.

•  Rollout with the index policy as base policy: Convenient because at a given state (subset of questions already answered), the index policy and its expected reward can be easily calculated.

•   Very effective for solving the quiz problem and important generalizations in scheduling (see Bertsekas and Castanon, J. of Heuristics, Vol. 5, 1999).


COST IMPROVEMENT PROPERTY

•   Let

J̄k(xk): Cost-to-go of the rollout policy

Hk(xk): Cost-to-go of the base policy

•   We claim that  J̄k(xk) ≤ Hk(xk) for all  xk,  k

•  Proof by induction: We have J̄N(xN) = HN(xN) for all  xN. Assume that

J̄k+1(xk+1) ≤ Hk+1(xk+1),   ∀ xk+1.

Then, for all  xk,

J̄k(xk) = E{ gk( xk, µ̄k(xk), wk ) + J̄k+1( fk( xk, µ̄k(xk), wk ) ) }

≤ E{ gk( xk, µ̄k(xk), wk ) + Hk+1( fk( xk, µ̄k(xk), wk ) ) }

≤ E{ gk( xk, µk(xk), wk ) + Hk+1( fk( xk, µk(xk), wk ) ) }

= Hk(xk)

−  Induction hypothesis ==>  1st inequality

−  Min selection of  µ̄k(xk) ==> 2nd inequality

−   Definition of  Hk, µk  ==>  last equality


DISCRETE DETERMINISTIC PROBLEMS

•   Any discrete optimization problem (with a finite number of choices/feasible solutions) can be represented sequentially by breaking down the decision process into stages.

•  A tree/shortest path representation. The leaves of the tree correspond to the feasible solutions.

•  Decisions can be made in stages.

−  May complete partial solutions, one stage at a time.

−   May apply rollout with any heuristic that can complete a partial solution.

−  No costly stochastic simulation needed.

•   Example: Traveling salesman problem. Find a minimum cost tour that goes exactly once through each of  N  cities.

[Tree diagram: traveling salesman problem with four cities A, B, C, D; origin node A branches to partial tours AB, AC, AD, then to ABC, ABD, ACB, ACD, ADB, ADC, with the complete tours ABCD, ABDC, ACBD, ACDB, ADBC, ADCB at the leaves.]


EXAMPLE: THE BREAKTHROUGH PROBLEM

[Figure: binary tree with N stages growing from the root; blocked arcs are crossed out, and a free root-to-leaf path is shown with thick lines.]

•  Given a binary tree with  N   stages.

•  Each arc is either free or is blocked (crossed out in the figure).

•  Problem: Find a free path from the root to the leaves (such as the one shown with thick lines).

•   Base heuristic (greedy): Follow the right branch if free; else follow the left branch if free.

•   This is a rare rollout instance that admits a detailed analysis.

•   For large  N  and given prob. of a free branch: the rollout algorithm requires  O(N) times more computation, but has  O(N) times larger prob. of finding a free path than the greedy algorithm.


DET. EXAMPLE: ONE-DIMENSIONAL WALK

•   A person takes either a unit step to the left or a unit step to the right. Minimize the cost  g(i) of the point  i  where he will end up after  N  steps.

[Figure: the cost  g(i)  plotted over the possible final positions  i ∈ {−N, . . . , N}; the walk is depicted on a grid of (stage, position) pairs from (0, 0) out to stage N, between (N, −N) and (N, N).]

•  Base heuristic: Always go to the right. Rollout finds the rightmost  local minimum.

•   Base heuristic: Compare always going to the right and always going to the left. Choose the best of the two. Rollout finds a  global minimum.


ROLLING HORIZON APPROACH

•   This is an  l-step lookahead policy where the cost-to-go approximation is just 0.

•   Alternatively, the cost-to-go approximation is the terminal cost function  gN.

•  A short rolling horizon saves computation.

•   “Paradox”: It is not true that a longer rolling horizon always improves performance.

•   Example:   At the initial state, there are two controls available (1 and 2). At every other state, there is only one control.

[Figure: from the current state, control 1 leads to the optimal trajectory (high cost over the first  l  stages, low cost afterwards), while control 2 leads to low cost within the  l-stage horizon but high cost beyond it.]


ROLLING HORIZON WITH ROLLOUT

•   We can use a rolling horizon approximation in calculating the cost-to-go of the base heuristic.

•   Because the heuristic is suboptimal, the rationale for a long rolling horizon becomes weaker.

•  Example: N-stage stopping problem where the stopping cost is 0, the continuation cost is either  −ε  or 1, where 0 < ε << 1, and the first state with continuation cost equal to 1 is state m. Then the optimal policy is to stop at state  m, and the optimal cost is  −mε.

[Figure: states 0, 1, 2, . . . , m, . . . , N in a line, with continuation cost −ε before state m and cost 1 from state m on; from every state there is a transition to a stopped state.]

•  Consider the heuristic that continues at every state, and the rollout policy that is based on this heuristic, with a rolling horizon of  l ≤ m  steps.

•  It will continue up to the first  m − l + 1 stages, thus compiling a cost of  −(m − l + 1)ε. The rollout performance improves as  l  becomes shorter!

•   Limited vision may work to our advantage!


MODEL PREDICTIVE CONTROL (MPC)

•  Special case of rollout for controlling linear deterministic systems (extensions to nonlinear/stochastic are similar)

−   System:   xk+1 = Axk + Buk

−   Quadratic cost per stage:   x′kQxk + u′kRuk

−   Constraints:   xk ∈ X,  uk ∈ U(xk)

•   Assumption: For any x0 ∈ X there is a feasible state-control sequence that brings the system to 0 in  m  steps, i.e.,  xm = 0

• MPC at state xk  solves an m-step optimal control problem with constraint  xk+m = 0, i.e., finds a sequence uk, . . . , uk+m−1  that minimizes

Σ_{ℓ=0}^{m−1} ( x′k+ℓ Q xk+ℓ + u′k+ℓ R uk+ℓ )

subject to  xk+m = 0

•   Then applies the first control uk  (and repeats at the next state  xk+1)

•   MPC is rollout with the heuristic derived from the corresponding (m − 1)-step optimal control problem

•   Key Property of MPC:   Since the heuristic is stable, the rollout is also stable (by the policy improvement property).
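
•   A minimal Python sketch (not from the text) of the m-step MPC problem, assuming the cvxpy package is available; the state/control constraints X and U(xk) are omitted for brevity:

import numpy as np
import cvxpy as cp

def mpc_control(x_k, A, B, Q, R, m):
    n, p = B.shape
    x = cp.Variable((m + 1, n))
    u = cp.Variable((m, p))
    constraints = [x[0] == x_k, x[m] == 0]          # terminal constraint x_{k+m} = 0
    cost = 0
    for t in range(m):
        constraints.append(x[t + 1] == A @ x[t] + B @ u[t])
        cost += cp.quad_form(x[t], Q) + cp.quad_form(u[t], R)
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[0]          # apply only the first control, then re-solve at x_{k+1}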


DISCRETIZATION

•   If the state space and/or control space is continuous/infinite, it must be replaced by a finite discretization.

•   Need for consistency, i.e., as the discretization becomes finer, the cost-to-go functions of the discretized problem converge to those of the continuous problem.

•   Pitfall with discretizing continuous time: The control constraint set may change a lot as we pass to the discrete-time approximation.

•   Example:  Let

ẋ(t) = u(t),

with control constraint u(t) ∈ {−1, 1}. The reachable states after time  δ  are  x(t + δ) = x(t) + u,  with  u ∈ [−δ, δ].

•   Compare it with the reachable states after we discretize the system naively:   x(t + δ) = x(t) + δu(t),  with  u(t) ∈ {−1, 1}.

•   “Convexification effect” of continuous time: a discrete control constraint set in continuous-time differential systems is equivalent to a continuous control constraint set when the system is looked at discrete times.


SPACE DISCRETIZATION I

•   Given a discrete-time system with state space S, consider a finite subset  S̄; for example  S̄  could be a finite grid within a continuous state space  S.

•   Difficulty:   f(x, u, w) ∉ S̄   for  x ∈ S̄.

•  We define an approximation to the original problem, with state space  S̄, as follows:

•   Express each  x ∈ S  as a convex combination of states in  S̄, i.e.,

x = Σ_{xi∈S̄} γi(x) xi,   where  γi(x) ≥ 0,  Σi γi(x) = 1

•   Define a “reduced” dynamic system with state space  S̄, whereby from each  xi ∈ S̄  we move to  x = f(xi, u, w)  according to the system equation of the original problem, and then move to  xj ∈ S̄  with probabilities  γj(x).

•  Define similarly the corresponding cost per stage of the transitions of the reduced system.

•  Note application to finite-state POMDP (Par-

tially Observed Markov Decision Problems)
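
•   A minimal Python sketch (not from the text) of one way to choose the coefficients γi(x) on a one-dimensional grid S̄, by linear interpolation between the two neighboring grid points:

import numpy as np

def gamma_coefficients(x, grid):
    # grid: increasing 1-D array of grid states; returns {grid index: weight},
    # with the weights nonnegative and summing to 1
    j = np.searchsorted(grid, x)
    if j == 0:
        return {0: 1.0}
    if j == len(grid):
        return {len(grid) - 1: 1.0}
    lam = (x - grid[j - 1]) / (grid[j] - grid[j - 1])
    return {j - 1: 1.0 - lam, j: lam}

print(gamma_coefficients(0.37, np.linspace(0, 1, 11)))   # ≈ {3: 0.3, 4: 0.7}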


SPACE DISCRETIZATION II

•   Let J̄k(xi) be the optimal cost-to-go of the “reduced” problem from each state  xi ∈ S̄  and time  k  onward.

•  Approximate the optimal cost-to-go of any  x ∈ S  for the original problem by

J̃k(x) = Σ_{xi∈S̄} γi(x) J̄k(xi),

and use one-step lookahead based on  J̃k.

•   The choice of coefficients  γi(x) is in principle arbitrary, but should aim at consistency, i.e., as the number of states in  S̄  increases,  J̃k(x) should converge to the optimal cost-to-go of the original problem.

•   Interesting observation:  While the original problem may be deterministic, the reduced problem is always stochastic.

•   Generalization:   The set  S̄  may be any finite set (not a subset of  S) as long as the coefficients  γi(x) admit a meaningful interpretation that quantifies the degree of association of  x  with  xi.


OTHER SUBOPTIMAL APPROACHES

•   Minimize the DP equation error: Approximate the optimal cost-to-go functions Jk(xk) with J̃k(xk, rk), where rk is a parameter vector, chosen to minimize some form of error in the DP equations

−  Can be done sequentially going backwards in time (approximate  J̃k  using an approximation of  J̃k+1).

•   Direct approximation of control policies: For a subset of states  xi,  i = 1, . . . , m, find

µ̃k(xi) = arg min_{uk∈Uk(xi)}  E{ g(xi, uk, wk) + J̃k+1( fk(xi, uk, wk), rk+1 ) }.

Then find µ̃k(xk, sk),  where  sk  is a vector of parameters obtained by solving the problem

min_s Σ_{i=1}^m || µ̃k(xi) − µ̃k(xi, s) ||².

•   Approximation in policy space:   Do not bother with cost-to-go approximations. Parametrize the policies as µ̃k(xk, sk),  and minimize the cost function of the problem over the parameters  sk.


6.231 DYNAMIC PROGRAMMING

LECTURE 11

LECTURE OUTLINE

•   Infinite horizon problems

•   Stochastic shortest path problems

•   Bellman’s equation

•  Dynamic programming – value iteration

•   Discounted problems as special case of SSP


TYPES OF INFINITE HORIZON PROBLEMS

•  Same as the basic problem, but:

−  The number of stages is infinite.

−  The system is stationary.

•   Total cost problems: Minimize

Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) }

−   Stochastic shortest path problems (α = 1, finite-state system with a termination state)

−   Discounted problems (α < 1, bounded cost per stage)

−  Discounted and undiscounted problems with unbounded cost per stage

•   Average cost problems

lim_{N→∞} (1/N) E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} g( xk, µk(xk), wk ) }

•  Infinite horizon characteristics: Challenging analysis, elegance of solutions and algorithms


PREVIEW OF INFINITE HORIZON RESULTS

•   Key issue:   The relation between the infinite and finite horizon optimal cost-to-go functions.

•   Illustration:  Let  α = 1 and  JN(x) denote the optimal cost of the  N-stage problem, generated after  N  DP iterations, starting from  J0(x) ≡ 0

Jk+1(x) = min_{u∈U(x)} E_w{ g(x, u, w) + Jk( f(x, u, w) ) },  ∀ x

•   Typical results for total cost problems:

−   Convergence of the DP algorithm (value iteration):

J∗(x) = lim_{N→∞} JN(x),  ∀ x

−  Bellman’s Equation holds for all  x:

J∗(x) = min_{u∈U(x)} E_w{ g(x, u, w) + J∗( f(x, u, w) ) }

−   If  µ(x) minimizes in Bellman’s Eq., the pol-icy {µ , µ , . . .}  is optimal.

•   Bellman’s Eq. holds for all types of problems. The other results are true for SSP (and bounded/discounted; unusual exceptions for other problems).


STOCHASTIC SHORTEST PATH PROBLEMS

•  Assume a finite-state system: States 1, . . . , n and a special cost-free termination state  t

−  Transition probabilities  pij(u)

−   Control constraints  u ∈ U (i)

−   Cost of policy  π = {µ0, µ1, . . .}  is

Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} g( xk, µk(xk) )  |  x0 = i }

− Optimal policy if  J π(i) = J ∗(i) for all  i.

−  Special notation: For stationary policies π  ={µ , µ , . . .}, we use  J µ(i) in place of  J π(i).

•  Assumption (Termination inevitable): There exists an integer  m  such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than  m  stages; for all  π, we have

ρπ = max_{i=1,...,n} P{ xm ≠ t | x0 = i, π } < 1


FINITENESS OF POLICY COST FUNCTIONS

•   Let

ρ = max_π ρπ.

Note that  ρπ  depends only on the first  m  components of the policy  π, so that  ρ < 1.

•  For any  π  and any initial state  i

P{ x2m ≠ t | x0 = i, π } = P{ x2m ≠ t | xm ≠ t, x0 = i, π } × P{ xm ≠ t | x0 = i, π } ≤ ρ²

and similarly

P{ xkm ≠ t | x0 = i, π } ≤ ρ^k,   i = 1, . . . , n

•   So

E{ cost between times km and (k + 1)m − 1 } ≤ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)|

and

|Jπ(i)| ≤ Σ_{k=0}^∞ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)| = ( m/(1 − ρ) ) max_{i=1,...,n, u∈U(i)} |g(i, u)|


MAIN RESULT

•   Given any initial conditions   J 0(1), . . . , J  0(n),the sequence  J k(i) generated by the DP iteration

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•  Bellman’s equation has J ∗(i) as unique solution:

J∗(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i

J∗(t) = 0

•   A stationary policy  µ  is optimal if and only if for every state  i,  µ(i) attains the minimum in Bellman’s equation.

•  Key proof idea: The “tail” of the cost series,

Σ_{k=mK}^∞ g( xk, µk(xk) )

vanishes as  K  increases to ∞.
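
•   A minimal Python sketch (not from the text) of the value iteration above, assuming P[u] holds the transition probabilities among states 1, . . . , n (the remaining probability from each row goes to the cost-free termination state t) and g is an n×m array of stage costs:

import numpy as np

def ssp_value_iteration(P, g, n_iters=10000, tol=1e-9):
    n, m = g.shape
    J = np.zeros(n)                                              # arbitrary J_0
    for _ in range(n_iters):
        Q = g + np.stack([P[u] @ J for u in range(m)], axis=1)   # Q(i, u)
        J_new = Q.min(axis=1)
        done = np.max(np.abs(J_new - J)) < tol
        J = J_new
        if done:
            break
    return J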


OUTLINE OF PROOF THAT  J N  → J ∗

•   Assume for simplicity that  J0(i) = 0 for all  i, and for any  K ≥ 1, write the cost of any policy  π  as

Jπ(x0) = E{ Σ_{k=0}^{mK−1} g( xk, µk(xk) ) } + E{ Σ_{k=mK}^∞ g( xk, µk(xk) ) }

≤ E{ Σ_{k=0}^{mK−1} g( xk, µk(xk) ) } + Σ_{k=K}^∞ ρ^k m max_{i,u} |g(i, u)|

Take the minimum of both sides over  π  to obtain

J∗(x0) ≤ JmK(x0) + ( ρ^K/(1 − ρ) ) m max_{i,u} |g(i, u)|.

Similarly, we have

JmK(x0) − ( ρ^K/(1 − ρ) ) m max_{i,u} |g(i, u)| ≤ J∗(x0).

It follows that  lim_{K→∞} JmK(x0) = J∗(x0).

•   It can be seen that  JmK(x0) and  JmK+k(x0) converge to the same limit for  k = 1, . . . , m − 1 (since  k  extra steps far into the future don’t matter), so

JN(x0) → J∗(x0)


EXAMPLE

•   Minimizing the  E {Time to Termination}: Let

g(i, u) = 1,   ∀  i = 1, . . . , n, u ∈ U (i)

•  Under our assumptions, the costs J∗(i) uniquely solve Bellman’s equation, which has the form

J∗(i) = min_{u∈U(i)}[ 1 + Σ_{j=1}^n pij(u) J∗(j) ],   i = 1, . . . , n

•   In the special case where there is only one control at each state,  J∗(i) is the mean first passage time from i to t. These times, denoted mi, are the unique solution of the equations

mi = 1 + Σ_{j=1}^n pij mj,   i = 1, . . . , n.
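
•   A minimal Python sketch (not from the text) of this linear system for a hypothetical two-state example; P holds the pij among the nonterminal states, with the remaining probability going to t:

import numpy as np

P = np.array([[0.5, 0.3], [0.2, 0.4]])    # illustrative p_ij among states 1..n
n = P.shape[0]
m = np.linalg.solve(np.eye(n) - P, np.ones(n))   # solve m = 1 + P m
print(m)                                  # mean first passage times to t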


DISCOUNTED PROBLEMS

•   Assume a discount factor  α < 1.

•   Conversion to an SSP problem.

[Figure: the original Markov chain (states i, j with transition probabilities pij(u), pii(u), pjj(u), pji(u)) and the equivalent SSP, in which all transition probabilities are scaled by α and a termination state is reached from every state with probability 1 − α.]

•  Value iteration converges to J ∗ for all initial J 0:

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•   J ∗ is the unique solution of Bellman’s equation:

J∗(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i

•   A stationary policy  µ  is optimal if and only if for every state  i,  µ(i) attains the minimum in Bellman’s equation.


6.231 DYNAMIC PROGRAMMING

LECTURE 12

LECTURE OUTLINE

•   Review of stochastic shortest path problems

•   Computational methods for SSP

−  Value iteration

−   Policy iteration

−   Linear programming

•   Computational methods for discounted prob-lems


STOCHASTIC SHORTEST PATH PROBLEMS

•  Assume a finite-state system: States 1, . . . , n and a special cost-free termination state  t

−  Transition probabilities  pij(u)

−   Control constraints  u ∈ U (i)

−   Cost of policy  π = {µ0, µ1, . . .}  is

Jπ(i) = lim_{N→∞} E{ Σ_{k=0}^{N−1} g( xk, µk(xk) )  |  x0 = i }

− Optimal policy if  J π(i) = J ∗(i) for all  i.

−  Special notation: For stationary policies π  ={µ , µ , . . .}, we use  J µ(i) in place of  J π(i).

•  Assumption (Termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than  m  stages; for all  π, we have

ρπ = max_{i=1,...,n} P{ xm ≠ t | x0 = i, π } < 1


MAIN RESULT

•   Given any initial conditions   J 0(1), . . . , J  0(n),the sequence  J k(i) generated by value iteration

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•  Bellman’s equation has J ∗(i) as unique solution:

J∗(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i

•   A stationary policy  µ  is optimal if and only if for every state  i,  µ(i) attains the minimum in Bellman’s equation.

•  Key proof idea: The “tail” of the cost series,

Σ_{k=mK}^∞ g( xk, µk(xk) )

vanishes as  K  increases to ∞.


BELLMAN’S EQUATION FOR A SINGLE POLICY

•   Consider a stationary policy  µ

•   Jµ(i),  i = 1, . . . , n, are the unique solution of the linear system of  n  equations

Jµ(i) = g( i, µ(i) ) + Σ_{j=1}^n pij( µ(i) ) Jµ(j),   ∀ i = 1, . . . , n

•   Proof:   This is just Bellman’s equation for a modified/restricted problem where there is only one policy, the stationary policy µ, i.e., the control constraint set at state  i  is  U(i) = {µ(i)}

•  The equation provides a way to compute  Jµ(i),  i = 1, . . . , n, but the computation is substantial for large n  [O(n³)]

•  For large  n, value iteration may be preferable. (Typical case of a large linear system of equations, where an iterative method may be better than a direct solution method.)

•   For VERY large  n, exact methods cannot be applied, and approximations are needed. (We will discuss these later.)


POLICY ITERATION

•   It generates a sequence  µ1, µ2, . . . of stationary policies, starting with any stationary policy  µ0.

•   At the typical iteration, given  µk, we perform a policy evaluation step, which computes the Jµk(i) as the solution of the (linear) system of equations

J(i) = g( i, µk(i) ) + Σ_{j=1}^n pij( µk(i) ) J(j),   i = 1, . . . , n,

in the  n  unknowns  J(1), . . . , J(n). We then perform a  policy improvement step, which computes a new policy  µk+1 as

µk+1(i) = arg min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jµk(j) ],  ∀ i

•  The algorithm stops when Jµk(i) = Jµk+1(i) for all  i

•  Note the connection with the rollout algorithm, which is just a single policy iteration
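
•   A minimal Python sketch (not from the text) of policy iteration, assuming P[u] holds the transition probabilities among the nonterminal states 1, . . . , n (the remainder goes to t) and g is an n×m cost array:

import numpy as np

def policy_iteration(P, g):
    n, m = g.shape
    mu = np.zeros(n, dtype=int)                      # initial stationary policy
    while True:
        # policy evaluation: solve J = g_mu + P_mu J
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = g[np.arange(n), mu]
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # policy improvement
        Q = g + np.stack([P[u] @ J for u in range(m)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):
            return mu, J
        mu = mu_new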


JUSTIFICATION OF POLICY ITERATION

•   We can show that Jµk+1(i) ≤ Jµk(i) for all  i, k

•   Fix  k  and consider the sequence generated by

JN+1(i) = g( i, µk+1(i) ) + Σ_{j=1}^n pij( µk+1(i) ) JN(j)

where J0(i) = Jµk(i). We have

J0(i) = g( i, µk(i) ) + Σ_{j=1}^n pij( µk(i) ) J0(j)

≥ g( i, µk+1(i) ) + Σ_{j=1}^n pij( µk+1(i) ) J0(j) = J1(i)

Using the monotonicity property of DP,

J 0(i) ≥ J 1(i) ≥ · · · ≥ J N (i) ≥ J N +1(i) ≥ · · · ,   ∀ i

Since  JN(i) → Jµk+1(i) as  N → ∞, we obtain Jµk(i) = J0(i) ≥ Jµk+1(i) for all i. Also if Jµk(i) = Jµk+1(i) for all  i,  Jµk  solves Bellman’s equation and is therefore equal to  J∗

•   A policy cannot be repeated, there are finitely many stationary policies, so the algorithm terminates with an optimal policy


LINEAR PROGRAMMING

•  We claim that J∗ is the “largest” J that satisfies the constraint

J(i) ≤ g(i, u) + Σ_{j=1}^n pij(u) J(j),   (1)

for all  i = 1, . . . , n and  u ∈ U(i).

•   Proof:   If we use value iteration to generate a sequence of vectors  Jk = ( Jk(1), . . . , Jk(n) )  starting with a  J0  such that

J0(i) ≤ min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) J0(j) ],   ∀ i

Then,  Jk(i) ≤ Jk+1(i) for all  k  and  i  (monotonicity property of DP) and  Jk → J∗, so that J0(i) ≤ J∗(i) for all  i.

•   So J∗ = ( J∗(1), . . . , J∗(n) ) is the solution of the linear program of maximizing  Σ_{i=1}^n J(i)  subject to the constraint (1).
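
•   A minimal Python sketch (not from the text) of this linear program, assuming scipy is available; P[u] holds the transition probabilities among the nonterminal states and g is an n×m cost array:

import numpy as np
from scipy.optimize import linprog

def ssp_lp(P, g):
    n, m = g.shape
    A_ub, b_ub = [], []
    for i in range(n):
        for u in range(m):
            row = -P[u][i].astype(float)     # -sum_j p_ij(u) J(j)
            row[i] += 1.0                    # + J(i)
            A_ub.append(row)
            b_ub.append(g[i, u])             # J(i) - sum_j p_ij(u) J(j) <= g(i,u)
    res = linprog(c=-np.ones(n),             # maximize sum_i J(i)
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x                             # equals J* under the SSP assumptions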


DISCOUNTED PROBLEMS

•   Assume a discount factor  α < 1.

•   Conversion to an SSP problem.

[Figure: the original Markov chain (states i, j with transition probabilities pij(u), pii(u), pjj(u), pji(u)) and the equivalent SSP, in which all transition probabilities are scaled by α and a termination state is reached from every state with probability 1 − α.]

•  Value iteration converges to J ∗ for all initial J 0:

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•   J ∗ is the unique solution of Bellman’s equation:

J∗(i) = min_{u∈U(i)}[ g(i, u) + α Σ_{j=1}^n pij(u) J∗(j) ],  ∀ i


DISCOUNTED PROBLEMS (CONTINUED)

•  Policy iteration converges finitely to an optimal policy, and linear programming works.

•   Example:   Asset selling over an infinite horizon. If accepted, the offer  xk  of period  k  is invested at a rate of interest  r.

•   By depreciating the sale amount to period 0 dollars, we view (1 + r)^{−k} xk  as the reward for selling the asset in period  k  at a price  xk, where r > 0 is the rate of interest. So the discount factor is  α = 1/(1 + r).

•   J∗ is the unique solution of Bellman’s equation

J∗(x) = max[ x,  E{ J∗(w) } / (1 + r) ].

•   An optimal policy is to sell if and only if the current offer  xk  is greater than or equal to  ᾱ, where

ᾱ = E{ J∗(w) } / (1 + r).


6.231 DYNAMIC PROGRAMMING

LECTURE 13

LECTURE OUTLINE

•   Average cost per stage problems

•  Connection with stochastic shortest path prob-lems

•   Bellman’s equation

•   Value iteration

•   Policy iteration


AVERAGE COST PER STAGE PROBLEM

•   Stationary system with a finite number of states and controls

•   Minimize over policies  π = {µ0, µ1,...}

Jπ(x0) = lim_{N→∞} (1/N) E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} g( xk, µk(xk), wk ) }

•   Important characteristics (not shared by other types of infinite horizon problems)

− For any fixed K, the cost incurred up to time K does not matter (only the state that we are at at time  K  matters)

−  If all states “communicate,” the optimal cost is independent of the initial state [if we can go from  i  to  j  in finite expected time, we must have J∗(i) ≤ J∗(j)]. So J∗(i) ≡ λ∗ for all  i.

−  Because “communication” issues are so important, the methodology relies heavily on Markov chain theory.


CONNECTION WITH SSP

•   Assumption: State n is such that for all initial states and all policies,  n  will be visited infinitely often (with probability 1).

•   Divide the sequence of generated states into cycles marked by successive visits to  n.

•   Each of the cycles can be viewed as a state trajectory of a corresponding stochastic shortest path problem with  n  as the termination state.

[Figure: the original chain (states i, j and special state n, with transition probabilities pij(u), pin(u), pjn(u), pni(u), pnj(u), pnn(u)) and the corresponding SSP, in which the transitions into the special state n are redirected to an artificial termination state.]

•   Let the cost at  i  of the SSP be  g(i, u) − λ∗

•   We will show that

Av. Cost Probl.   ≡   A Min Cost Cycle Probl.   ≡   SSP Probl.


CONNECTION WITH SSP (CONTINUED)

•   Consider a  minimum cycle cost problem: Find a stationary policy  µ  that minimizes the  expected cost per transition within a cycle

Cnn(µ) / Nnn(µ),

where for a fixed  µ,

Cnn(µ) :  E{ cost from  n  up to the first return to  n }

Nnn(µ) :  E{ time from  n  up to the first return to  n }

•   Intuitively,   C nn(µ)/N nn(µ) = average cost of µ, and optimal cycle cost = λ∗, so

Cnn(µ) − Nnn(µ) λ∗ ≥ 0,

with equality if  µ  is optimal.

•   Thus, the optimal  µ  must minimize over  µ  the expression  Cnn(µ) − Nnn(µ) λ∗, which is the expected cost of  µ  starting from  n  in the SSP with stage costs  g(i, u) − λ∗.

•  Also: Optimal SSP Cost = 0.


BELLMAN’S EQUATION

•   Let h∗(i) be the optimal cost of this SSP problem when starting at the nontermination states  i = 1, . . . , n. Then  h∗(n) = 0, and  h∗(1), . . . , h∗(n)  solve uniquely the corresponding Bellman’s equation

h∗(i) = min_{u∈U(i)}[ g(i, u) − λ∗ + Σ_{j=1}^{n−1} pij(u) h∗(j) ],  ∀ i

•   If  µ∗ is an optimal stationary policy for the SSP problem, we have (since the optimal SSP cost = 0)

h∗(n) = Cnn(µ∗) − Nnn(µ∗) λ∗ = 0

•   Combining these equations, we have

λ∗ + h∗(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) h∗(j) ],  ∀ i

•  If  µ∗(i) attains the min for each i, µ∗ is optimal.


MORE ON THE CONNECTION WITH SSP

•   Interpretation of  h∗(i) as a  relative or differential cost: It is the minimum of

E{ cost to reach  n  from  i  for the first time }
− E{ cost if the stage cost were  λ∗  and not  g(i, u) }

•   We don’t know  λ∗, so we can’t solve the average cost problem as an SSP problem. But similar value and policy iteration algorithms are possible.

•   Example: A manufacturer at each time:

−  Receives an order with prob.  p and no order with prob. 1 − p.

−   May process all unfilled orders at cost  K > 0, or process no order at all. The cost per unfilled order at each time is  c > 0.

− The maximum number of orders that can remain unfilled is  n.

−  Find a processing policy that minimizes the total expected cost per stage.


EXAMPLE (CONTINUED)

•   State = number of unfilled orders. State 0 is the special state for the SSP formulation.

•   Bellman’s equation: For states i = 0, 1, . . . , n − 1

λ∗ + h∗(i) = min[ K + (1 − p)h∗(0) + p h∗(1),  ci + (1 − p)h∗(i) + p h∗(i + 1) ],

and for state  n

λ∗ + h∗(n) = K + (1 − p)h∗(0) + p h∗(1)

•   Optimal policy: Process  i  unfilled orders if 

K +(1− p)h∗(0)+ ph∗(1) ≤ ci+(1− p)h∗(i)+ ph∗(i+1).

•  Intuitively,  h∗(i) is monotonically nondecreasing with  i  (interpret  h∗(i) as optimal costs-to-go for the associated SSP problem). So a threshold policy is optimal: process the orders if their number exceeds some threshold integer  m∗.


VALUE ITERATION

•   Natural value iteration method: Generate the optimal k-stage costs by the DP algorithm starting with any  J0:

Jk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],  ∀ i

•   Result: limk→∞ J k(i)/k = λ∗ for all  i.

•   Proof outline: Let J∗k be so generated from the initial condition  J∗0 = h∗. Then, by induction,

J∗k(i) = k λ∗ + h∗(i),   ∀ i, ∀ k.

On the other hand,

| Jk(i) − J∗k(i) | ≤ max_{j=1,...,n} | J0(j) − h∗(j) |,   ∀ i

since  Jk(i) and  J∗k(i) are optimal costs for two k-stage problems that differ only in the terminal cost functions, which are  J0  and h∗.


RELATIVE VALUE ITERATION

•   The value iteration method just described has two drawbacks:

−   Since typically some components of  Jk  diverge to ∞ or −∞, calculating lim_{k→∞} Jk(i)/k is numerically cumbersome.

−  The method will not compute a corresponding differential cost vector  h∗.

•   We can bypass both difficulties by subtracting a constant from all components of the vector  Jk, so that the difference, call it hk, remains bounded.

•   Relative value iteration algorithm:   Pick anystate s, and iterate according to

hk+1(i) = min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) hk(j) ] − min_{u∈U(s)}[ g(s, u) + Σ_{j=1}^n psj(u) hk(j) ],   ∀ i

•   Then we can show  hk → h∗  (under an extra assumption).
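
•   A minimal Python sketch (not from the text) of relative value iteration; P[u] (n×n stochastic matrices) and g (n×m costs) are hypothetical inputs, and s is the fixed reference state:

import numpy as np

def relative_value_iteration(P, g, s=0, n_iters=2000, tol=1e-9):
    n, m = g.shape
    h = np.zeros(n)
    for _ in range(n_iters):
        Q = g + np.stack([P[u] @ h for u in range(m)], axis=1)
        Th = Q.min(axis=1)
        h_new = Th - Th[s]               # subtract the value at the fixed state s
        if np.max(np.abs(h_new - h)) < tol:
            return Th[s], h_new          # (estimate of lambda*, differential costs)
        h = h_new
    return Th[s], h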


POLICY ITERATION

•   At the typical iteration, we have a stationary  µk.

•   Policy evaluation: Compute λk and hk(i) of  µk, using the  n + 1 equations  hk(n) = 0 and

λk + hk(i) = g( i, µk(i) ) + Σ_{j=1}^n pij( µk(i) ) hk(j),  ∀ i

•   Policy improvement: (For the λk-SSP) Find for all  i

µk+1(i) = arg min_{u∈U(i)}[ g(i, u) + Σ_{j=1}^n pij(u) hk(j) ]

•   If  λk+1 = λk and hk+1(i) = hk(i) for all i, stop; otherwise, repeat with  µk+1 replacing µk.

•   Result:  For each  k, we either have  λk+1 < λk

or we have policy improvement for the  λk-SSP:

λk+1 = λk, hk+1(i) ≤ hk(i), i = 1, . . . , n .

The algorithm terminates with an optimal policy.


6.231 DYNAMIC PROGRAMMING

LECTURE 14

LECTURE OUTLINE

•   Control of continuous-time Markov chains – Semi-Markov problems

•  Problem formulation – Equivalence to discrete-time problems

•   Discounted problems

•  Average cost problems


CONTINUOUS-TIME MARKOV CHAINS

•   Stationary system with a finite number of states and controls

•   State transitions occur at discrete times

•   Control is applied at these discrete times and stays constant between transitions

•   Time between transitions is random

•   Cost accumulates in continuous time (may also be incurred at the time of transition)

•   Example: Admission control in a system with restricted capacity (e.g., a communication link)

−   Customer arrivals: a Poisson process

−  Customers entering the system depart after an exponentially distributed time

−  Upon arrival we must decide whether to admit or to block a customer

−  There is a cost for blocking a customer

−   For each customer that is in the system, there is a customer-dependent reward per unit time

−  Minimize time-discounted or average cost


PROBLEM FORMULATION

•   x(t) and  u(t): State and control at time  t

•   tk: Time of the  kth transition (t0 = 0)

•   xk  = x(tk);   x(t) = xk   for tk ≤ t < tk+1.

•   uk  = u(tk);   u(t) = uk   for  tk ≤ t < tk+1.

•   No transition probabilities; instead  transition distributions (quantify the uncertainty about  both  transition time and next state)

Qij(τ, u) = P {tk+1−tk  ≤ τ, xk+1  = j  | xk  = i, uk  = u}

•   Two important formulas:

(1) Transition probabilities are specified by

pij(u) = P{ xk+1 = j | xk = i, uk = u } = lim_{τ→∞} Qij(τ, u)

(2) The Cumulative Distribution Function (CDF) of  τ  given  i, j, u  is (assuming  pij(u) > 0)

P{ tk+1 − tk ≤ τ | xk = i, xk+1 = j, uk = u } = Qij(τ, u) / pij(u)

Thus,  Qij(τ, u) can be viewed as a “scaled CDF”


EXPONENTIAL TRANSITION DISTRIBUTIONS

•   Important example of transition distributions:

Qij(τ, u) = pij(u)( 1 − e^{−νi(u)τ} ),

where pij(u) are transition probabilities, and νi(u) is called the  transition rate  at state  i.

•   Interpretation:  If the system is in state  i  and control  u  is applied

−  the next state will be j with probability pij(u)

−   the time between the transition to state  i  and the transition to the next state  j  is exponentially distributed with parameter νi(u) (independently of  j):

P{ transition time interval > τ | i, u } = e^{−νi(u)τ}

•  The exponential distribution is  memoryless. This implies that for a given policy, the system is a continuous-time Markov chain (the future depends on the past only through the current state).

•   Without the memoryless property, the Markov property holds only at the times of transition.


COST STRUCTURES

•   There is a cost  g(i, u) per unit time, i.e.

g(i, u)dt = the cost incurred in time  dt

•   There may be an extra “instantaneous” cost  ĝ(i, u) at the time of a transition (let’s ignore this for the moment)

•   Total discounted cost of  π = {µ0, µ1, . . .} starting from state  i  (with discount factor  β > 0)

lim_{N→∞} E{ Σ_{k=0}^{N−1} ∫_{tk}^{tk+1} e^{−βt} g( xk, µk(xk) ) dt  |  x0 = i }

•   Average cost per unit time

lim_{N→∞} ( 1/E{tN} ) E{ Σ_{k=0}^{N−1} ∫_{tk}^{tk+1} g( xk, µk(xk) ) dt  |  x0 = i }

•  We will see that  both problems have equivalent discrete-time versions.


A NOTE ON NOTATION

•  The scaled CDF Qij(τ, u) can be used to model discrete, continuous, and mixed distributions for the transition time  τ.

•  Generally, expected values of functions of  τ  can be written as integrals involving  dQij(τ, u). For example, the conditional expected value of  τ  given  i,  j, and  u  is written as

E{ τ | i, j, u } = ∫_0^∞ τ dQij(τ, u) / pij(u)

•  If  Qij(τ, u) is continuous with respect to  τ, its derivative

qij(τ, u) = dQij(τ, u)/dτ

can be viewed as a “scaled” density function. Expected values of functions of  τ  can then be written in terms of  qij(τ, u). For example

E{ τ | i, j, u } = ∫_0^∞ τ qij(τ, u)/pij(u) dτ

•   If  Qij(τ, u) is discontinuous and “staircase-like,” expected values can be written as summations.


DISCOUNTED CASE - COST CALCULATION

•   For a policy  π = {µ0, µ1, . . .}, write

Jπ(i) = E{1st transition cost} + E{ e^{−βτ} Jπ1(j) | i, µ0(i) }

where  Jπ1(j) is the cost-to-go of the policy  π1 = {µ1, µ2, . . .}

G(i, u) = E j

E τ {1st transition cost | j}

=

n

j=1

 pij(u)   ∞0   τ 

0

e−βtg(i, u)dt dQij(τ, u)

 pij(u)

=

nj=1

   ∞0

1 − e−βτ 

β g(i, u)dQij(τ, u)

•  Thus the  E{1st transition cost}  is

G( i, µ0(i) ) = g( i, µ0(i) ) Σ_{j=1}^n ∫_0^∞ ( (1 − e^{−βτ})/β ) dQij( τ, µ0(i) )

(The summation term can be viewed as a “discounted length of the transition interval  t1 − t0”.)


COST CALCULATION (CONTINUED)

•   Also the expected (discounted) cost from the next state  j  is

E{ e^{−βτ} Jπ1(j) | i, µ0(i) } = E_j{ E{ e^{−βτ} | i, µ0(i), j } Jπ1(j) | i, µ0(i) }

= Σ_{j=1}^n pij( µ0(i) ) ( ∫_0^∞ e^{−βτ} dQij( τ, µ0(i) ) / pij( µ0(i) ) ) Jπ1(j)

= Σ_{j=1}^n mij( µ0(i) ) Jπ1(j)

where mij(u) is given by

mij(u) = ∫_0^∞ e^{−βτ} dQij(τ, u) < ∫_0^∞ dQij(τ, u) = pij(u)

and can be viewed as the “effective discount factor” [the analog of  α pij(u) in the discrete-time case].

•   So Jπ(i) can be written as

Jπ(i) = G( i, µ0(i) ) + Σ_{j=1}^n mij( µ0(i) ) Jπ1(j)

i.e., the (continuous-time discounted) cost of the 1st period, plus the (continuous-time discounted) cost-to-go from the next state.


EQUIVALENCE TO AN SSP

•   Similar to the discrete-time case, introduce a stochastic shortest path problem with an artificial termination state  t

•   Under control u, from state i the system moves to state  j  with probability  mij(u) and to the termination state t with probability 1 − Σ_{j=1}^n mij(u)

•   Bellman’s equation: For  i = 1, . . . , n,

J∗(i) = min_{u∈U(i)}[ G(i, u) + Σ_{j=1}^n mij(u) J∗(j) ]

•  If in addition to the cost per unit time  g, thereis an extra (instantaneous) one-stage cost g(i, u),

Bellman’s equation becomes

J ∗(i) = minu∈U (i)

g(i, u) + G(i, u) +

nj=1

mij(u)J ∗( j)


MANUFACTURER’S EXAMPLE REVISITED

•  A manufacturer receives orders with interarrival times uniformly distributed in [0, τmax].

•  He may process all unfilled orders at cost K > 0, or process none. The cost per unit time of an unfilled order is  c. The maximum number of unfilled orders is  n.

•   The nonzero transition distributions are

Qi1(τ, Fill) = Qi(i+1)(τ, Not Fill) = min[ 1, τ/τmax ]

G(i, Fill) = 0, G(i, Not Fill) = γ c i,

where

γ  =n

j=1

   ∞0

1 − e−βτ β 

  dQij(τ, u) =   τ max

0

1 − e−βτ βτ max

dτ 

•   There is an “instantaneous” cost

ĝ(i, Fill) = K,   ĝ(i, Not Fill) = 0


MANUFACTURER’S EXAMPLE CONTINUED

•   The “effective discount factors”  mij(u) in Bellman’s equation are

mi1(Fill) = mi(i+1)(Not Fill) = α,

where

α = ∫_0^∞ e^{−βτ} dQij(τ, u) = ∫_0^{τmax} ( e^{−βτ}/τmax ) dτ = ( 1 − e^{−β τmax} )/( β τmax )

•  Bellman’s equation has the form

J∗(i) = min[ K + α J∗(1),  γ c i + α J∗(i + 1) ],   i = 1, 2, . . .

•   As in the discrete-time case, we can conclude that there exists an optimal threshold  i∗:

fill the orders   <==>   their number  i  exceeds  i∗


AVERAGE COST

•   Minimize

lim_{N→∞} ( 1/E{tN} ) E{ ∫_0^{tN} g( x(t), u(t) ) dt }

assuming there is a special state that is “recurrent under all policies”

•   Total expected cost of a transition

G(i, u) = g(i, u)τ i(u),

where τ i(u): Expected transition time.

•  We now apply the SSP argument used for the discrete-time case. Divide the trajectory into cycles marked by successive visits to n. The cost at (i, u) is  G(i, u) − λ∗ τi(u), where  λ∗ is the optimal expected cost per unit time. Each cycle is viewed as a state trajectory of a corresponding SSP problem with the termination state being essentially  n.

•   So Bellman’s Eq. for the average cost problem:

h∗(i) = min_{u∈U(i)}[ G(i, u) − λ∗ τi(u) + Σ_{j=1}^n pij(u) h∗(j) ]


MANUFACTURER EXAMPLE/AVERAGE COST

•   The expected transition times are

τi(Fill) = τi(Not Fill) = τmax/2

the expected transition cost is

G(i, Fill) = 0,   G(i, Not Fill) = c i τmax/2

and there is also the “instantaneous” cost

ĝ(i, Fill) = K,   ĝ(i, Not Fill) = 0

•   Bellman’s equation:

h∗(i) = min[ K − λ∗ τmax/2 + h∗(1),  c i τmax/2 − λ∗ τmax/2 + h∗(i + 1) ]

•   Again it can be shown that a threshold policyis optimal.


6.231 DYNAMIC PROGRAMMING

LECTURE 15

LECTURE OUTLINE

•   We start a nine-lecture sequence on advanced infinite horizon DP and approximate solution methods

•   We allow an infinite state space, so the stochastic shortest path framework cannot be used any more

•  Results are rigorous assuming a countable disturbance space

−   This includes deterministic problems with arbitrary state space, and countable state Markov chains

−   Otherwise the mathematics of measure theory make analysis difficult, although the final results are essentially the same as for a countable disturbance space

•   The discounted problem is the proper starting point for this analysis

•   The central mathematical structure is that the DP mapping is a contraction mapping (instead of existence of a termination state)


DISCOUNTED PROBLEMS/BOUNDED COST

•   Stationary system with arbitrary state space

xk+1 = f (xk, uk, wk), k = 0, 1, . . .

•   Cost of a policy  π = {µ0, µ1, . . .}

Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...}{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) }

with α < 1, and for some M, we have |g(x, u, w)| ≤ M  for all (x, u, w)

•   Shorthand notation for DP mappings (operateon functions of state to produce other functions)

(TJ)(x) = min_{u∈U(x)} E_w{ g(x, u, w) + αJ( f(x, u, w) ) },  ∀ x

TJ is the optimal cost function for the one-stage problem with stage cost  g  and terminal cost  αJ.

•   For any stationary policy  µ

(TµJ)(x) = E_w{ g( x, µ(x), w ) + αJ( f( x, µ(x), w ) ) },  ∀ x
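
•   A minimal Python sketch (not from the text) of the mappings T and Tµ for a finite-state, finite-control discounted problem; P[u] (transition matrices) and g (n×m costs) are hypothetical inputs:

import numpy as np

def T(J, P, g, alpha):
    # (TJ)(x) = min_u [ g(x,u) + alpha * sum_y p_xy(u) J(y) ]
    m = len(P)
    Q = np.stack([g[:, u] + alpha * (P[u] @ J) for u in range(m)], axis=1)
    return Q.min(axis=1)

def T_mu(J, mu, P, g, alpha):
    # (T_mu J)(x) = g(x, mu(x)) + alpha * sum_y p_xy(mu(x)) J(y)
    n = len(J)
    return np.array([g[x, mu[x]] + alpha * (P[mu[x]][x] @ J) for x in range(n)])

# repeated application J, T(J), T(T(J)), ... is value iteration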


“SHORTHAND” THEORY – A SUMMARY

•   Cost function expressions [with  J 0(x) ≡ 0]

Jπ(x) = lim_{k→∞} (Tµ0 Tµ1 · · · Tµk J0)(x),    Jµ(x) = lim_{k→∞} (Tµ^k J0)(x)

•   Bellman’s equation: J ∗ = T J ∗,   J µ  = T µJ µ

•  Optimality condition:

µ: optimal   <==> T µJ ∗ = T J ∗

•   Value iteration: For any (bounded)  J  and all  x,

J∗(x) = lim_{k→∞} (T^k J)(x)

•   Policy iteration: Given  µk,

−   Policy evaluation: Find J µk   by solving

J µk  = T µkJ µk

−   Policy improvement: Find  µk+1 such that

T µk+1 J µk  = T J µk


TWO KEY PROPERTIES

•   Monotonicity property: For any functions J and  J′ such that  J(x) ≤ J′(x) for all  x, and any  µ

(T J )(x) ≤ (T J ′)(x),   ∀  x,

(T µJ )(x) ≤ (T µJ ′)(x),   ∀  x.

•   Constant Shift property:   For any  J, any scalar  r, and any  µ

( T(J + re) )(x) = (TJ)(x) + αr,   ∀ x,

( Tµ(J + re) )(x) = (TµJ)(x) + αr,   ∀ x,

where e  is the unit function [e(x) ≡ 1].


CONVERGENCE OF VALUE ITERATION

•   If  J0 ≡ 0,

J∗(x) = lim_{N→∞} (T^N J0)(x),   for all x

Proof:   For any initial state  x0, and policy  π = {µ0, µ1, . . .},

Jπ(x0) = E{ Σ_{k=0}^∞ α^k g( xk, µk(xk), wk ) }

= E{ Σ_{k=0}^{N−1} α^k g( xk, µk(xk), wk ) } + E{ Σ_{k=N}^∞ α^k g( xk, µk(xk), wk ) }

The tail portion satisfies

| E{ Σ_{k=N}^∞ α^k g( xk, µk(xk), wk ) } | ≤ α^N M / (1 − α),

where  M ≥ |g(x, u, w)|. Take the min over  π  of both sides.   Q.E.D.


BELLMAN’S EQUATION

•  The optimal cost function J∗ satisfies Bellman’s Eq., i.e.  J∗ = TJ∗.

Proof: For all  x  and  N,

J∗(x) − α^N M/(1 − α) ≤ (T^N J0)(x) ≤ J∗(x) + α^N M/(1 − α),

where  J0(x) ≡ 0 and  M ≥ |g(x, u, w)|. Applying T  to this relation, and using Monotonicity and Constant Shift,

(T J ∗)(x) −  αN +1

M 1 − α

  ≤ (T N +1J 0)(x)

≤ (T J ∗)(x) + αN +1M 

1 − α

Taking the limit as  N 

 → ∞ and using the fact

limN →∞

(T N +1J 0)(x) = J ∗(x)

we obtain  J ∗ = T J ∗.   Q.E.D.


THE CONTRACTION PROPERTY

•   Contraction property: For any bounded functions J and J′, and any µ,

max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|,

max_x |(T_µJ)(x) − (T_µJ′)(x)| ≤ α max_x |J(x) − J′(x)|.

Proof: Denote c = max_{x∈S} |J(x) − J′(x)|. Then

J(x) − c ≤ J′(x) ≤ J(x) + c,   ∀ x

Apply T to both sides, and use the Monotonicity and Constant Shift properties:

(TJ)(x) − αc ≤ (TJ′)(x) ≤ (TJ)(x) + αc,   ∀ x

Hence

|(TJ)(x) − (TJ′)(x)| ≤ αc,   ∀ x.

Q.E.D.


IMPLICATIONS OF CONTRACTION PROPERTY

•   We can strengthen our earlier result:

•   Bellman’s equation J = TJ has a unique solution, namely J∗, and for any bounded J, we have

lim_{k→∞} (T^k J)(x) = J∗(x),   ∀ x

Proof: Use

max_x |(T^k J)(x) − J∗(x)| = max_x |(T^k J)(x) − (T^k J∗)(x)| ≤ α^k max_x |J(x) − J∗(x)|

•   Special Case: For each stationary µ, J_µ is the unique solution of J = T_µJ and

lim_{k→∞} (T_µ^k J)(x) = J_µ(x),   ∀ x,

for any bounded J.

•   Convergence rate: For all k,

max_x |(T^k J)(x) − J∗(x)| ≤ α^k max_x |J(x) − J∗(x)|


NEC. AND SUFFICIENT OPT. CONDITION

•   A stationary policy µ is optimal if and only if µ(x) attains the minimum in Bellman’s equation for each x; i.e.,

TJ∗ = T_µJ∗.

Proof: If TJ∗ = T_µJ∗, then using Bellman’s equation (J∗ = TJ∗), we have

J∗ = T_µJ∗,

so by uniqueness of the fixed point of T_µ, we obtain J∗ = J_µ; i.e., µ is optimal.

•   Conversely, if the stationary policy µ is optimal, we have J∗ = J_µ, so

J∗ = T_µJ∗.

Combining this with Bellman’s equation (J∗ = TJ∗), we obtain TJ∗ = T_µJ∗.   Q.E.D.


COMPUTATIONAL METHODS

•   Value iteration and variants

−   Gauss-Seidel version

−   Approximate value iteration

•   Policy iteration and variants

−  Combination with value iteration

−   Modified policy iteration

−  Asynchronous policy iteration

•   Linear programming

maximize   Σ_{i=1}^n J(i)

subject to   J(i) ≤ g(i, u) + α Σ_{j=1}^n p_ij(u) J(j),   ∀ (i, u)

•   Approximate linear programming: use in place of J(i) a low-dim. basis function representation

J(i, r) = Σ_{k=1}^m r_k w_k(i)

and low-dim. LP (with many constraints)
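•   A hedged sketch of the exact LP above for a small n-state problem using scipy.optimize.linprog (my formulation, not code from the text): since linprog minimizes, we maximize Σ_i J(i) by minimizing its negative, and each pair (i, u) contributes one constraint J(i) − α Σ_j p_ij(u) J(j) ≤ g(i, u):

    import numpy as np
    from scipy.optimize import linprog

    def solve_dp_lp(P, g, alpha):
        """Exact LP: maximize sum_i J(i) s.t. J(i) <= g(i,u) + alpha*sum_j p_ij(u) J(j)."""
        n, m = g.shape
        A_ub, b_ub = [], []
        for u in range(m):
            # J(i) - alpha * sum_j p_ij(u) J(j) <= g(i, u), one row per state i
            A_ub.append(np.eye(n) - alpha * P[u])
            b_ub.append(g[:, u])
        A_ub = np.vstack(A_ub)
        b_ub = np.concatenate(b_ub)
        c = -np.ones(n)                       # maximize sum_i J(i)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
        return res.x                          # approximately J*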


6.231 DYNAMIC PROGRAMMING

LECTURE 16

LECTURE OUTLINE

•   Review of basic theory of discounted problems

•  Monotonicity and contraction properties

•   Contraction mappings in DP

•   Discounted problems: Countable state space with unbounded costs

•   Generalized discounted DP


DISCOUNTED PROBLEMS/BOUNDED COST

•   Stationary system with arbitrary state space

xk+1 = f (xk, uk, wk), k = 0, 1, . . .

•   Cost of a policy  π = {µ0, µ1, . . .}

J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(x_k, µ_k(x_k), w_k) }

with α < 1, and for some M, we have |g(x,u,w)| ≤ M for all (x,u,w)

•   Shorthand notation for DP mappings (operate on functions of state to produce other functions)

(TJ)(x) = min_{u∈U(x)} E_w{ g(x,u,w) + αJ(f(x,u,w)) },   ∀ x

TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost αJ.

•   For any stationary policy µ

(T_µJ)(x) = E_w{ g(x, µ(x), w) + αJ(f(x, µ(x), w)) },   ∀ x


MAJOR PROPERTIES

•   Monotonicity property: For any functions J and J′ on the state space X such that J(x) ≤ J′(x) for all x ∈ X, and any µ

(TJ)(x) ≤ (TJ′)(x),   ∀ x ∈ X,

(T_µJ)(x) ≤ (T_µJ′)(x),   ∀ x ∈ X.

•   Contraction property: For any bounded functions J and J′, and any µ,

max_x |(TJ)(x) − (TJ′)(x)| ≤ α max_x |J(x) − J′(x)|,

max_x |(T_µJ)(x) − (T_µJ′)(x)| ≤ α max_x |J(x) − J′(x)|.

•   The contraction property can be written in shorthand as

‖TJ − TJ′‖ ≤ α‖J − J′‖,    ‖T_µJ − T_µJ′‖ ≤ α‖J − J′‖,

where for any bounded function J, we denote by ‖·‖ the sup-norm

‖J‖ = max_{x∈X} |J(x)|.


CONTRACTION MAPPINGS

•   Given a real vector space Y with a norm ‖·‖ (i.e., ‖y‖ ≥ 0 for all y ∈ Y, ‖y‖ = 0 if and only if y = 0, ‖ay‖ = |a|‖y‖ for all scalars a and y ∈ Y, and ‖y + z‖ ≤ ‖y‖ + ‖z‖ for all y, z ∈ Y)

•   A function F : Y → Y is said to be a contraction mapping if for some ρ ∈ (0, 1), we have

‖Fy − Fz‖ ≤ ρ‖y − z‖,   for all y, z ∈ Y.

ρ is called the modulus of contraction of F.

•   For m > 1, we say that F is an m-stage contraction if F^m is a contraction.

•   Important example: Let X be a set (e.g., state space in DP), v : X → ℜ be a positive-valued function. Let B(X) be the set of all functions J : X → ℜ such that J(s)/v(s) is bounded over s.

•   We define a norm on B(X), called the weighted sup-norm, by

‖J‖ = max_{s∈X} |J(s)| / v(s).

•   Important special case: The discounted problem mappings T and T_µ [for v(s) ≡ 1, ρ = α].


A DP-LIKE CONTRACTION MAPPING

•   Let X = {1, 2, . . .}, and let F : B(X) → B(X) be a linear mapping of the form

(FJ)(i) = b(i) + Σ_{j∈X} a(i, j) J(j),   ∀ i

where b(i) and a(i, j) are some scalars. Then F is a contraction with modulus ρ if

( Σ_{j∈X} |a(i, j)| v(j) ) / v(i) ≤ ρ,   ∀ i

•   Let F : B(X) → B(X) be a mapping of the form

(FJ)(i) = min_{µ∈M} (F_µJ)(i),   ∀ i

where M is a parameter set, and for each µ ∈ M, F_µ is a contraction mapping from B(X) to B(X) with modulus ρ. Then F is a contraction mapping with modulus ρ.


CONTRACTION MAPPING FIXED-POINT TH.

•   Contraction Mapping Fixed-Point Theorem: If F : B(X) → B(X) is a contraction with modulus ρ ∈ (0, 1), then there exists a unique J∗ ∈ B(X) such that

J∗ = FJ∗.

Furthermore, if J is any function in B(X), then {F^k J} converges to J∗ and we have

‖F^k J − J∗‖ ≤ ρ^k ‖J − J∗‖,   k = 1, 2, . . . .

•   Similar result if F is an m-stage contraction mapping.

•   This is a special case of a general result for contraction mappings F : Y → Y over normed vector spaces Y that are complete: every sequence {y_k} that is Cauchy (satisfies ‖y_m − y_n‖ → 0 as m, n → ∞) converges.

•   The space B(X) is complete (see the text for a proof).


GENERAL FORMS OF DISCOUNTED DP

•   We consider an abstract form of discounted DP based on contraction mappings

•   Denote R(X): set of real-valued functions J : X → ℜ, and let H : X × U × R(X) → ℜ be a given mapping. We consider the mapping

(TJ)(x) = min_{u∈U(x)} H(x, u, J),   ∀ x ∈ X.

•   We assume that (TJ)(x) > −∞ for all x ∈ X, so T maps R(X) into R(X).

•   For each “policy” µ ∈ M, we consider the mapping T_µ : R(X) → R(X) defined by

(T_µJ)(x) = H(x, µ(x), J),   ∀ x ∈ X.

•   We want to find a function J∗ ∈ R(X) such that

J∗(x) = min_{u∈U(x)} H(x, u, J∗),   ∀ x ∈ X

and a policy µ∗ such that T_µ∗ J∗ = TJ∗.


EXAMPLES

•   Discounted problems

H(x, u, J) = E{ g(x,u,w) + αJ(f(x,u,w)) }

•   Discounted Semi-Markov Problems

H(x, u, J) = G(x, u) + Σ_{y=1}^n m_xy(u) J(y)

where m_xy are “discounted” transition probabilities, defined by the transition distributions

•   Shortest Path Problems

H(x, u, J) = a_xu + J(u) if u ≠ d,   a_xd if u = d

where d is the destination

•   Minimax Problems

H(x, u, J) = max_{w∈W(x,u)} { g(x,u,w) + αJ(f(x,u,w)) }


GENERAL FORMS OF DISCOUNTED DP

•   Monotonicity assumption: If J, J′ ∈ R(X) and J ≤ J′, then

H(x, u, J) ≤ H(x, u, J′),   ∀ x ∈ X, u ∈ U(x)

•   Contraction assumption:

−   For every J ∈ B(X), the functions T_µJ and TJ belong to B(X).

−   For some α ∈ (0, 1) and all J, J′ ∈ B(X), H satisfies

|H(x, u, J) − H(x, u, J′)| ≤ α max_{y∈X} |J(y) − J′(y)|

for all x ∈ X and u ∈ U(x).

•   We can show all the standard analytical and computational results of discounted DP based on these two assumptions

•   With just the monotonicity assumption (as in the shortest path problem) we can still show various forms of the basic results under appropriate assumptions (like in the SSP problem)


RESULTS USING CONTRACTION

•   The mappings T_µ and T are sup-norm contraction mappings with modulus α over B(X), and have unique fixed points in B(X), denoted J_µ and J∗, respectively (cf. Bellman’s equation). Proof: From the contraction property of H.

•   For any J ∈ B(X) and µ ∈ M,

lim_{k→∞} T_µ^k J = J_µ,    lim_{k→∞} T^k J = J∗.

(cf. convergence of value iteration). Proof: From the contraction property of T_µ and T.

•   We have T_µJ∗ = TJ∗ if and only if J_µ = J∗ (cf. optimality condition). Proof: If T_µJ∗ = TJ∗, then T_µJ∗ = J∗, implying J∗ = J_µ. Conversely, if J_µ = J∗, then T_µJ∗ = T_µJ_µ = J_µ = J∗ = TJ∗.

•   For any J ∈ B(X) and µ ∈ M,

‖J_µ − J‖ ≤ ‖T_µJ − J‖ / (1 − α)

(this is a useful bound for J_µ)


RESULTS USING MON. AND CONTRACTION

•   Optimality of fixed point:

J∗(x) = min_{µ∈M} J_µ(x),   ∀ x ∈ X

•   Furthermore, for every ε > 0, there exists µ_ε ∈ M such that

J∗(x) ≤ J_µε(x) ≤ J∗(x) + ε,   ∀ x ∈ X

•   Nonstationary policies: Consider the set Π of all sequences π = {µ_0, µ_1, . . .} with µ_k ∈ M for all k, and define

J_π(x) = lim inf_{k→∞} (T_µ0 T_µ1 ··· T_µk J)(x),   ∀ x ∈ X,

with J being any function (the choice of J does not matter)

•   We have

J∗(x) = min_{π∈Π} J_π(x),   ∀ x ∈ X


6.231 DYNAMIC PROGRAMMING

LECTURE 17

LECTURE OUTLINE

•   Review of computational theory of discounted problems

•   Value iteration (VI), policy iteration (PI)

•   Q-Learning

•   Optimistic PI

•   Asynchronous VI

•   Asynchronous PI

•   Computational methods for generalized discounted DP


DISCOUNTED PROBLEMS/BOUNDED COST

•   Stationary system with arbitrary state space

xk+1 = f (xk, uk, wk), k = 0, 1, . . .

•   Cost of a policy  π = {µ0, µ1, . . .}

J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(x_k, µ_k(x_k), w_k) }

with α < 1, and for some M, we have |g(x,u,w)| ≤ M for all (x,u,w)

•   Shorthand notation for DP mappings (operate on functions of state to produce other functions)

(TJ)(x) = min_{u∈U(x)} E_w{ g(x,u,w) + αJ(f(x,u,w)) },   ∀ x

TJ is the optimal cost function for the one-stage problem with stage cost g and terminal cost αJ.

•   For any stationary policy µ

(T_µJ)(x) = E_w{ g(x, µ(x), w) + αJ(f(x, µ(x), w)) },   ∀ x


“SHORTHAND” THEORY – A SUMMARY

•   Cost function expressions [with  J 0(x) ≡ 0]

J_π(x) = lim_{k→∞} (T_µ0 T_µ1 ··· T_µk J_0)(x),    J_µ(x) = lim_{k→∞} (T_µ^k J_0)(x)

•   Bellman’s equation: J∗ = TJ∗,   J_µ = T_µ J_µ

•   Optimality condition:

µ: optimal   <==>   T_µ J∗ = TJ∗

•   Value iteration: For any (bounded) J and all x,

J∗(x) = lim_{k→∞} (T^k J)(x)

•   Policy iteration: Given  µk,

−   Policy evaluation: Find J µk   by solving

J µk  = T µkJ µk

−   Policy improvement: Find  µk+1 such that

T µk+1 J µk  = T J µk


INTERPRETATION OF VI AND PI

[Figure: one-dimensional interpretation of VI and PI on the 45-degree-line diagram. Value iteration generates TJ_0, T^2 J_0, . . . converging to the fixed point J∗ = TJ∗; policy iteration alternates policy evaluation (J_µ1 = T_µ1 J_µ1) with policy improvement.]


Q-LEARNING I

•   We can write Bellman’s equation as

J∗(i) = min_{u∈U(i)} Q∗(i, u),   i = 1, . . . , n,

where Q∗ is the unique solution of

Q∗(i, u) = Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α min_{v∈U(j)} Q∗(j, v) )

•   Q∗(i, u) is called the optimal Q-factor of (i, u)

•   Q-factors have an optimal cost interpretation in an “augmented” problem whose states are i and (i, u), u ∈ U(i), where i ∈ {1, . . . , n}

•   We can equivalently write the VI method as

J_{k+1}(i) = min_{u∈U(i)} Q_{k+1}(i, u),   i = 1, . . . , n,

where Q_{k+1} is generated for all i and u ∈ U(i) by

Q_{k+1}(i, u) = Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α min_{v∈U(j)} Q_k(j, v) )
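•   A minimal sketch (my notation, not from the slides) of the Q-factor form of VI above, for a finite problem with transition costs stored as g3[u, i, j] = g(i, u, j) and per-control transition matrices P[u]:

    import numpy as np

    def q_value_iteration(P, g3, alpha, iters=500):
        """Q_{k+1}(i,u) = sum_j p_ij(u) [ g(i,u,j) + alpha * min_v Q_k(j,v) ]."""
        m, n, _ = g3.shape                    # g3[u, i, j] = g(i, u, j)
        Q = np.zeros((n, m))
        for _ in range(iters):
            J = Q.min(axis=1)                 # J_k(j) = min_v Q_k(j, v)
            Q = np.stack([(P[u] * (g3[u] + alpha * J[None, :])).sum(axis=1)
                          for u in range(m)], axis=1)
        return Q, Q.min(axis=1), Q.argmin(axis=1)   # Q*, J*, greedy policy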


Q-LEARNING II

•   Q-factors are no different than costs

•   They satisfy a Bellman equation Q = FQ where

(FQ)(i, u) = Σ_{j=1}^n p_ij(u) ( g(i,u,j) + α min_{v∈U(j)} Q(j, v) )

•   VI and PI for Q-factors are mathematically equivalent to VI and PI for costs

•   They require an equal amount of computation ... they just need more storage

•   Having optimal Q-factors is convenient when implementing an optimal policy on-line by

µ∗(i) = arg min_{u∈U(i)} Q∗(i, u)

•   Once Q∗(i, u) are known, the model [g and p_ij(u)] is not needed. Model-free operation

•   Later we will see how stochastic/sampling methods can be used to calculate (approximations of) Q∗(i, u) using a simulator of the system (no model needed)


APPROXIMATE PI

•   Suppose that the policy evaluation is approximate, according to

max_x |J_k(x) − J_µk(x)| ≤ δ,   k = 0, 1, . . .

and policy improvement is approximate, according to

max_x |(T_µk+1 J_k)(x) − (TJ_k)(x)| ≤ ε,   k = 0, 1, . . .

where δ and ε are some positive scalars.

•   Error Bound: The sequence {µ_k} generated by approximate policy iteration satisfies

lim sup_{k→∞} max_{x∈S} ( J_µk(x) − J∗(x) ) ≤ (ε + 2αδ) / (1 − α)^2

•   Typical practical behavior: The method makes steady progress up to a point and then the iterates J_µk oscillate within a neighborhood of J∗.


OPTIMISTIC PI

•   This is PI, where policy evaluation is carried out by a finite number of VI

•   Shorthand definition: For some integers m_k

T_µk J_k = TJ_k,    J_{k+1} = T_µk^{m_k} J_k,    k = 0, 1, . . .

−   If m_k ≡ 1 it becomes VI

−   If m_k = ∞ it becomes PI

−   For intermediate values of m_k, it is generally more efficient than either VI or PI

[Figure: 45-degree-line interpretation of optimistic PI. Starting from J_0, the policy µ_0 is obtained from TJ_0 = T_µ0 J_0; approximate policy evaluation applies T_µ0 a finite number of times (e.g., J_1 = T_µ0^2 J_0) before the next policy improvement, instead of going all the way to J_µ0 = T_µ0 J_µ0.]
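•   A rough sketch of optimistic PI under the shorthand definition above, reusing the T_mu and greedy_policy helpers sketched earlier (assumed, not from the slides); the exact policy evaluation is replaced by m_k value iteration sweeps with T_µk:

    def optimistic_policy_iteration(P, g, alpha, m_k=5, iters=200):
        """T_muk J_k = T J_k (improvement), J_{k+1} = T_muk^{m_k} J_k (evaluation)."""
        J = np.zeros(g.shape[0])
        for _ in range(iters):
            mu = greedy_policy(J, P, g, alpha)   # mu_k attains the min in T J_k
            for _ in range(m_k):                 # approximate evaluation: m_k VI steps
                J = T_mu(J, P, g, alpha, mu)
        return J, mu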


GENERAL FORMS OF DISCOUNTED DP

•   We consider an abstract form of discounted DP

•   Denote R(X): set of real-valued functions J : X → ℜ, and let H : X × U × R(X) → ℜ be a given mapping. We consider the mapping

(TJ)(x) = min_{u∈U(x)} H(x, u, J),   ∀ x ∈ X.

•   For each µ, we consider

(T_µJ)(x) = H(x, µ(x), J),   ∀ x ∈ X.

•   We want to find a function J∗ such that

J∗(x) = min_{u∈U(x)} H(x, u, J∗),   ∀ x ∈ X

and a µ∗ such that T_µ∗ J∗ = TJ∗.

•   Discounted, Discounted Semi-Markov, Minimax:

H(x, u, J) = E{ g(x,u,w) + αJ(f(x,u,w)) }

H(x, u, J) = G(x, u) + Σ_{y=1}^n m_xy(u) J(y)

H(x, u, J) = max_{w∈W(x,u)} { g(x,u,w) + αJ(f(x,u,w)) }


ASSUMPTIONS AND RESULTS

•   Monotonicity assumption: If J, J′ ∈ R(X) and J ≤ J′, then

H(x, u, J) ≤ H(x, u, J′),   ∀ x ∈ X, u ∈ U(x)

•   Contraction assumption:

−   For every J ∈ B(X), the functions T_µJ and TJ belong to B(X).

−   For some α ∈ (0, 1) and all J, J′ ∈ B(X), H satisfies

|H(x, u, J) − H(x, u, J′)| ≤ α max_{y∈X} |J(y) − J′(y)|

for all x ∈ X and u ∈ U(x).

•   Standard algorithmic results extend:

−   Generalized VI converges to J∗, the unique fixed point of T

−   Generalized PI and optimistic PI generate {µ_k} such that

lim_{k→∞} ‖J_µk − J∗‖ = 0


6.231 DYNAMIC PROGRAMMING

LECTURE 18

LECTURE OUTLINE

•   Analysis of asynchronous VI and PI for generalized discounted DP

•   Undiscounted problems

•   Stochastic shortest path problems (SSP)

•   Proper and improper policies

•   Analysis and computational methods for SSP

•   Pathologies of SSP


ASYNCHRONOUS ALGORITHMS

•   Motivation for asynchronous algorithms

−   Faster convergence

−   Parallel and distributed computation

−   Simulation-based implementations

•   General framework: Partition X into disjoint nonempty subsets X_1, . . . , X_m, and use a separate processor ℓ updating J(x) for x ∈ X_ℓ

•   Let J be partitioned as

J = (J_1, . . . , J_m),

where J_ℓ is the restriction of J on the set X_ℓ.

•   Synchronous algorithm:

J_ℓ^{t+1}(x) = T(J_1^t, . . . , J_m^t)(x),   x ∈ X_ℓ,  ℓ = 1, . . . , m

•   Asynchronous algorithm: For some subsets of times R_ℓ,

J_ℓ^{t+1}(x) = T(J_1^{τ_ℓ1(t)}, . . . , J_m^{τ_ℓm(t)})(x)   if t ∈ R_ℓ,
J_ℓ^{t+1}(x) = J_ℓ^t(x)   if t ∉ R_ℓ

where t − τ_ℓj(t) are communication “delays”


ONE-STATE-AT-A-TIME ITERATIONS

•   Important special case: Assume n “states”, a separate processor for each state, and no delays

•   Generate a sequence of states {x_0, x_1, . . .}, generated in some way, possibly by simulation (each state is generated infinitely often)

•   Asynchronous VI:

J_ℓ^{t+1} = T(J_1^t, . . . , J_n^t)(ℓ)   if ℓ = x_t,
J_ℓ^{t+1} = J_ℓ^t   if ℓ ≠ x_t,

where T(J_1^t, . . . , J_n^t)(ℓ) denotes the ℓ-th component of the vector

T(J_1^t, . . . , J_n^t) = TJ^t,

and for simplicity we write J_ℓ^t instead of J_ℓ^t(ℓ)

•   The special case where

{x_0, x_1, . . .} = {1, . . . , n, 1, . . . , n, 1, . . .}

is the Gauss-Seidel method

•   We can show that J^t → J∗ under the contraction assumption
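•   A minimal sketch (my notation, no communication delays, same finite-model arrays as in the earlier sketches) of the one-state-at-a-time iteration above: only the component J(ℓ) with ℓ = x_t is updated at time t, the other components are left as they are; sweeping the states cyclically gives the Gauss-Seidel method:

    import numpy as np

    def async_value_iteration(P, g, alpha, state_sequence):
        """Update J(l) in place for l = x_t; other components are left unchanged."""
        n, m = g.shape
        J = np.zeros(n)
        for x_t in state_sequence:            # each state must appear infinitely often
            J[x_t] = min(g[x_t, u] + alpha * P[u][x_t] @ J for u in range(m))
        return J

    # Gauss-Seidel sweep: J = async_value_iteration(P, g, alpha, list(range(n)) * 100)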


ASYNCHRONOUS CONV. THEOREM I

•   Assume that for all ℓ, j = 1, . . . , m, R_ℓ is infinite and lim_{t→∞} τ_ℓj(t) = ∞

•   Proposition: Let T have a unique fixed point J∗, and assume that there is a sequence of nonempty subsets S(k) ⊂ R(X) with S(k + 1) ⊂ S(k) for all k, and with the following properties:

(1)   Synchronous Convergence Condition: Every sequence {J^k} with J^k ∈ S(k) for each k, converges pointwise to J∗. Moreover, we have

TJ ∈ S(k + 1),   ∀ J ∈ S(k),  k = 0, 1, . . . .

(2)   Box Condition: For all k, S(k) is a Cartesian product of the form

S(k) = S_1(k) × ··· × S_m(k),

where S_ℓ(k) is a set of real-valued functions on X_ℓ, ℓ = 1, . . . , m.

Then for every J ∈ S(0), the sequence {J^t} generated by the asynchronous algorithm converges pointwise to J∗.


ASYNCHRONOUS CONV. THEOREM II

•   Interpretation of assumptions:

[Figure: nested sets S(0) ⊃ S(k) ⊃ S(k + 1) containing J∗, with J = (J_1, J_2) and component sets S_1(0), S_2(0).]

A synchronous iteration from any J in S(k) moves into S(k + 1) (component-by-component)

•   Convergence mechanism:

[Figure: the same nested sets, with separate J_1 iterations and J_2 iterations.]

Key: “Independent” component-wise improvement. An asynchronous component iteration from any J in S(k) moves into the corresponding component portion of S(k + 1)


UNDISCOUNTED PROBLEMS

•   System: x_{k+1} = f(x_k, u_k, w_k)

•   Cost of a policy π = {µ_0, µ_1, . . .}

J_π(x_0) = lim_{N→∞} E_{w_k, k=0,1,...} { Σ_{k=0}^{N−1} g(x_k, µ_k(x_k), w_k) }

•   Shorthand notation for DP mappings

(TJ)(x) = min_{u∈U(x)} E_w{ g(x,u,w) + J(f(x,u,w)) },   ∀ x

•   For any stationary policy µ

(T_µJ)(x) = E_w{ g(x, µ(x), w) + J(f(x, µ(x), w)) },   ∀ x

•   Neither T nor T_µ are contractions in general, but their monotonicity is helpful.

•   SSP problems provide a “soft boundary” between the easy finite-state discounted problems and the hard undiscounted problems.

−   They share features of both.

−   Some of the nice theory is recovered because of the termination state.


SSP THEORY SUMMARY I

•   As earlier, we have a cost-free termination state t, a finite number of states 1, . . . , n, and a finite number of controls, but we will make weaker assumptions.

•   Mappings T and T_µ (modified to account for the termination state t):

(TJ)(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) J(j) ],   i = 1, . . . , n,

(T_µJ)(i) = g(i, µ(i)) + Σ_{j=1}^n p_ij(µ(i)) J(j),   i = 1, . . . , n.

•   Definition: A stationary policy µ is called proper, if under µ, from every state i, there is a positive probability path that leads to t.

•   Important fact: If µ is proper, T_µ is a contraction with respect to some weighted max norm

max_i (1/v_i) |(T_µJ)(i) − (T_µJ′)(i)| ≤ ρ_µ max_i (1/v_i) |J(i) − J′(i)|

•   T is similarly a contraction if all µ are proper (the case discussed in the text, Ch. 7, Vol. I).


SSP THEORY SUMMARY II

•   The theory can be pushed one step further. Assume that:

(a) There exists at least one proper policy

(b) For each improper µ, J_µ(i) = ∞ for some i

•   Then T is not necessarily a contraction, but:

−   J∗ is the unique solution of Bellman’s Equ.

−   µ∗ is optimal if and only if T_µ∗ J∗ = TJ∗

−   lim_{k→∞} (T^k J)(i) = J∗(i) for all i

−   Policy iteration terminates with an optimal policy, if started with a proper policy

•   Example: Deterministic shortest path problem with a single destination t.

−   States   <=>   nodes; Controls   <=>   arcs

−  Termination state   <=>   the destination

−   Assumption (a)   <=>   every node is con-nected to the destination

−   Assumption (b)   <=>   all cycle costs  > 0


SSP ANALYSIS I

•   For a proper policy µ, J_µ is the unique fixed point of T_µ, and T_µ^k J → J_µ for all J (holds by the theory of Vol. I, Section 7.2)

•   A stationary µ satisfying J ≥ T_µJ for some J must be proper - true because

J ≥ T_µ^k J = P_µ^k J + Σ_{m=0}^{k−1} P_µ^m g_µ

and some component of the term on the right blows up if µ is improper (by our assumptions).

•   Consequence: T can have at most one fixed point.

Proof: If J and J′ are two fixed points, select µ and µ′ such that J = TJ = T_µJ and J′ = TJ′ = T_µ′J′. By the preceding assertion, µ and µ′ must be proper, and J = J_µ and J′ = J_µ′. Also

J = T^k J ≤ T_µ′^k J → J_µ′ = J′

Similarly, J′ ≤ J, so J = J′.


SSP ANALYSIS II

•   We now show that T has a fixed point, and also that policy iteration converges.

•   Generate a sequence of proper policies {µ^k} by policy iteration starting from a proper policy µ^0.

•   µ^1 is proper and J_µ0 ≥ J_µ1 since

J_µ0 = T_µ0 J_µ0 ≥ TJ_µ0 = T_µ1 J_µ0 ≥ T_µ1^k J_µ0 ≥ J_µ1

•   Thus {J_µk} is nonincreasing, some policy µ is repeated, with J_µ = TJ_µ. So J_µ is a fixed point of T.

•   Next show T^k J → J_µ for all J, i.e., value iteration converges to the same limit as policy iteration. (Sketch: True if J = J_µ; argue using the properness of µ to show that the terminal cost difference J − J_µ does not matter.)

•   To show J_µ = J∗, for any π = {µ_0, µ_1, . . .}

T_µ0 ··· T_µk−1 J_0 ≥ T^k J_0,

where J_0 ≡ 0. Take lim sup as k → ∞, to obtain J_π ≥ J_µ, so µ is optimal and J_µ = J∗.


SSP ANALYSIS III

•   Contraction Property: If all policies are proper (the assumption of Section 7.1, Vol. I), T_µ and T are contractions with respect to a weighted sup norm.

Proof: Consider a new SSP problem where the transition probabilities are the same as in the original, but the transition costs are all equal to −1. Let J be the corresponding optimal cost vector. For all µ,

J(i) = −1 + min_{u∈U(i)} Σ_{j=1}^n p_ij(u) J(j) ≤ −1 + Σ_{j=1}^n p_ij(µ(i)) J(j)

For v_i = −J(i), we have v_i ≥ 1, and for all µ,

Σ_{j=1}^n p_ij(µ(i)) v_j ≤ v_i − 1 ≤ ρ v_i,   i = 1, . . . , n,

where

ρ = max_{i=1,...,n} (v_i − 1)/v_i < 1.

This implies contraction of T_µ and T by the results of the preceding lecture.


PATHOLOGIES II: DETERM. SHORTEST PATHS

•   If there is a cycle with cost < 0, Bellman’s equation has no solution [among functions J with −∞ < J(i) < ∞ for all i]. Example:

[Figure: nodes 1, 2 and destination t; arc 1 → 2 with cost 0, arc 2 → 1 with cost −1, arc 2 → t with cost 1.]

•   We have J∗(1) = J∗(2) = −∞.

•   Bellman’s equation is

J(1) = J(2),    J(2) = min[ −1 + J(1), 1 ].

•   There is no solution [among functions J with −∞ < J(i) < ∞ for all i].

•   Bellman’s equation has as solution J∗(1) = J∗(2) = −∞ [within the larger class of functions J(·) that can take the value −∞ for some (or all) states]. This situation can be generalized (see Chapter 3 of Vol. 2 of the text).


PATHOLOGIES III: THE BLACKMAILER

•   Two states, state 1 and the termination state t.

•   At state 1, choose a control u ∈ (0, 1] (the blackmail amount demanded) at a cost −u, and move to t with probability u^2, or stay in 1 with probability 1 − u^2.

•   Every stationary policy is proper, but the control set is not finite.

•   For any stationary µ with µ(1) = u, we have

J_µ(1) = −u + (1 − u^2) J_µ(1)

from which J_µ(1) = −1/u

•   Thus J∗(1) = −∞, and there is no optimal stationary policy.

•   It turns out that a nonstationary policy is optimal: demand µ_k(1) = γ/(k + 1) at time k, with γ ∈ (0, 1/2).

−   The blackmailer requests diminishing amounts over time, which add to ∞.

−   The probability of the victim’s refusal diminishes at a much faster rate, so the probability that the victim stays forever compliant is strictly positive.


SSP ALGORITHMS

•   All the basic algorithms have counterparts

•   “Easy” case: All policies proper, in which case the mappings T and T_µ are contractions

•   Even in the general case all basic algorithms have satisfactory counterparts

•  Value iteration, policy iteration

•   Q-learning

•   Optimistic policy iteration

•  Asynchronous value iteration

•   Asynchronous policy iteration


6.231 DYNAMIC PROGRAMMING

LECTURE 19

LECTURE OUTLINE

•   We begin a lecture series on approximate DP for large/intractable problems.

•   Check the updated/expanded version of Chapter 6, Vol. 2, at http://web.mit.edu/dimitrib/www/dpchapter.html

•   Today we classify/overview the main approaches:

−   Rollout/Simulation-based single policy iteration (will not discuss this further)

−   Approximation in value space (approximate policy iteration, Q-Learning, Bellman error approach, approximate LP)

−   Approximation in value space using problem approximation (simplification - aggregation - limited lookahead) - will not discuss much

−   Approximation in policy space (policy parametrization, gradient methods)


GENERAL ORIENTATION

•   We will mainly adopt an n-state discounted model (the easiest case - but think of HUGE n).

•   Extensions to SSP and average cost are possible (but more quirky). We will set these aside for later.

•   Other than manual/trial-and-error approaches (e.g., as in computer chess), the only other approaches are simulation-based. They are collectively known as “neuro-dynamic programming” or “reinforcement learning”.

•   Simulation is essential for large state spaces because of its (potential) computational complexity advantage in computing sums/expectations involving a very large number of terms.

•   Simulation also comes in handy when an analytical model of the system is unavailable, but a simulation/computer model is possible.

•   Simulation-based methods are of three types:

−  Rollout (we will not discuss further)

−   Approximation in value space

− Approximation in policy space


APPROXIMATION IN POLICY SPACE

•   A brief discussion; we will return to it at the end.

•   We parameterize the set of policies by a vector r = (r_1, . . . , r_s) and we optimize the cost over r.

•   Discounted problem example:

−   Each value of r defines a stationary policy, with cost starting at state i denoted by J_i(r).

−   Use a gradient (or other) method to minimize over r

J(r) = Σ_{i=1}^n q(i) J_i(r),

where (q(1), . . . , q(n)) is some probability distribution over the states.

•   In a special case of this approach, the parameterization of the policies is indirect, through an approximate cost function.

−   A cost approximation architecture parameterized by r defines a policy dependent on r via the minimization in Bellman’s equation.


APPROX. IN VALUE SPACE - APPROACHES

•   Approximate PI (Policy evaluation/Policy improvement)

−   Uses simulation algorithms to approximate the cost J_µ of the current policy µ

−   Projected equation and aggregation approaches

•   Approximation of the optimal cost function J∗

−   Q-Learning: Use a simulation algorithm to approximate the optimal costs J∗(i) or the Q-factors

Q∗(i, u) = g(i, u) + α Σ_{j=1}^n p_ij(u) J∗(j)

−   Bellman error approach: Find r to

min_r E_i{ ( J̃(i, r) − (T J̃)(i, r) )^2 }

where E_i{·} is taken with respect to some distribution

−   Approximate LP (discussed earlier - supplemented with clever schemes to overcome the large number of constraints issue)


POLICY EVALUATE/POLICY IMPROVE

•   General structure

[Figure: block diagram of the approximate PI structure. A system simulator produces state samples i; a decision generator computes the decision µ(i) of the current policy using cost-to-go values J̃(j, r) supplied by a cost-to-go approximator; the samples feed a cost approximation algorithm that produces the next J̃(j, r).]

•   J̃(j, r) is the cost approximation for the preceding policy, used by the decision generator to compute the current policy µ [whose cost is approximated by J̃(j, r) using simulation]

•   There are several cost approximation/policy evaluation algorithms

•   There are several important issues relating to the design of each block (to be discussed in the future).


POLICY EVALUATION APPROACHES I

•   Approximate the cost of the current policy by using a simulation method.

−   Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares

−   Indirect policy evaluation

−   An example of indirect approach: Solving the projected equation Φr = ΠT_µ(Φr) where Π is projection with respect to a suitable weighted Euclidean norm

[Figure: the direct method projects the cost vector J_µ onto the subspace S spanned by the basis functions (ΠJ_µ); the indirect method solves the projected form of Bellman’s equation Φr = ΠT_µ(Φr).]

•  Batch and incremental methods

•  Regular and optimistic policy iteration


POLICY EVALUATION APPROACHES II

•   Projected equation methods

−   TD(λ): Stochastic iterative algorithm for solving Φr = ΠT_µ(Φr)

−   LSPE(λ): A simulation-based form of projected value iteration

Φr_{k+1} = ΠT_µ(Φr_k) + simulation noise

[Figure: projected value iteration (PVI) applies T to Φr_k and projects the value iterate T(Φr_k) = g + αPΦr_k back onto the subspace S spanned by the basis functions; least squares policy evaluation (LSPE) does the same up to simulation error.]

−   LSTD(λ): Solves a simulation-based approximation using a linear system solver (e.g., Gaussian elimination/Matlab)

•   Aggregation methods: Solve

Φr = DT_µ(Φr)

where the rows of D and Φ are probability distributions (e.g., D and Φ “aggregate” rows and columns of the linear system J = T_µJ).


THEORETICAL BASIS OF APPROXIMATE PI

•   If policies are approximately evaluated using an approximation architecture:

max_i |J̃(i, r_k) − J_µk(i)| ≤ δ,   k = 0, 1, . . .

•   If policy improvement is also approximate,

max_i |(T_µk+1 J̃)(i, r_k) − (T J̃)(i, r_k)| ≤ ε,   k = 0, 1, . . .

•   Error Bound: The sequence {µ_k} generated by approximate policy iteration satisfies

lim sup_{k→∞} max_i ( J_µk(i) − J∗(i) ) ≤ (ε + 2αδ) / (1 − α)^2

•   Typical practical behavior: The method makes steady progress up to a point and then the iterates J_µk oscillate within a neighborhood of J∗.


THE USE OF SIMULATION

•   Simulation advantage: Compute (approximately) sums with a very large number of terms.

•   Example: Projection by Monte Carlo Simulation: Compute the projection ΠJ of J ∈ ℜ^n on the subspace S = {Φr | r ∈ ℜ^s}, with respect to a weighted Euclidean norm ‖·‖_v.

•   Find Φr∗, where

r∗ = arg min_{r∈ℜ^s} ‖Φr − J‖_v^2 = arg min_{r∈ℜ^s} Σ_{i=1}^n v_i ( φ(i)′r − J(i) )^2

•   Setting to 0 the gradient at r∗,

r∗ = ( Σ_{i=1}^n v_i φ(i)φ(i)′ )^{-1} Σ_{i=1}^n v_i φ(i) J(i)

•   Approximate by simulation the two “expected values”

r_k = ( Σ_{t=1}^k φ(i_t)φ(i_t)′ )^{-1} Σ_{t=1}^k φ(i_t) J(i_t)

•   Equivalent least squares alternative:

r_k = arg min_{r∈ℜ^s} Σ_{t=1}^k ( φ(i_t)′r − J(i_t) )^2
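•   A sketch of the simulation-based projection above (illustrative names, not from the text; it assumes the weights v are available to sample from): sample states i_t with probabilities proportional to v_i, then form the two sample sums and solve:

    import numpy as np

    def simulated_projection(Phi, J, v, num_samples, rng=np.random.default_rng(0)):
        """Approximate r* = (sum_i v_i phi(i)phi(i)')^{-1} sum_i v_i phi(i) J(i)."""
        n, s = Phi.shape
        p = v / v.sum()                        # sample states with probabilities ~ v_i
        idx = rng.choice(n, size=num_samples, p=p)
        A = sum(np.outer(Phi[i], Phi[i]) for i in idx)   # ~ sum_t phi(i_t) phi(i_t)'
        b = sum(Phi[i] * J[i] for i in idx)              # ~ sum_t phi(i_t) J(i_t)
        return np.linalg.solve(A, b)           # r_k; Phi @ r_k approximates Pi J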


THE ISSUE OF EXPLORATION

•   To evaluate a policy µ, we need to generate cost samples using that policy - this biases the simulation by underrepresenting states that are unlikely to occur under µ.

•   As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate.

•   This seriously impacts the improved policy µ.

•   This is known as inadequate exploration - a particularly acute difficulty when the randomness embodied in the transition probabilities is “relatively small” (e.g., a deterministic system).

•   One possibility to guarantee adequate exploration: Frequently restart the simulation and ensure that the initial states employed form a rich and representative subset.

•   Another possibility: Occasionally generating transitions that use a randomly selected control rather than the one dictated by the policy µ.

•   Other methods, to be discussed later, use two Markov chains (one is the chain of the policy and is used to generate the transition sequence, the other is used to generate the state sequence).


APPROXIMATING Q-FACTORS

•   The approach described so far for policy evaluation requires calculating expected values for all controls u ∈ U(i) (and knowledge of p_ij(u)).

•   Model-free alternative: Approximate Q-factors

Q(i, u, r) ≈ Σ_{j=1}^n p_ij(u) ( g(i,u,j) + αJ_µ(j) )

and use for policy improvement the minimization

µ(i) = arg min_{u∈U(i)} Q(i, u, r)

•   r is an adjustable parameter vector and Q(i, u, r) is a parametric architecture, such as

Q(i, u, r) = Σ_{k=1}^m r_k φ_k(i, u)

•   Can use any method for constructing cost approximations, e.g., TD(λ).

•   Use the Markov chain with states (i, u) - p_ij(µ(i)) is the transition probability to (j, µ(i)), 0 to other (j, u′).

•   Major concern: Acutely diminished exploration.


6.231 DYNAMIC PROGRAMMING

LECTURE 20

LECTURE OUTLINE

•   Discounted problems - Approximate policy evaluation/policy improvement

•   Direct approach - Least squares

•   Indirect approach - The projected equation

•   Contraction properties - Error bounds

•   Matrix form of the projected equation

•   Simulation-based implementation

•   LSTD and LSPE methods


APPROXIMATE PI

[Figure: approximate PI loop. Guess an initial policy; evaluate the approximate cost J̃_µ(r) = Φr using simulation; generate an “improved” policy µ̄; repeat.]

•   Linear cost function approximation

J̃(r) = Φr

where Φ is a full rank n × s matrix with the basis functions as columns, and ith row denoted φ(i)′.

•   Policy “improvement”

µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n p_ij(u) ( g(i,u,j) + αφ(j)′r )


POLICY EVALUATION

•   Focus on policy evaluation: approximate the cost of the current policy by using a simulation method.

−   Direct policy evaluation - Cost samples generated by simulation, and optimization by least squares

−   Indirect policy evaluation - solving the projected equation Φr = ΠT_µ(Φr) where Π is projection with respect to a suitable weighted Euclidean norm

[Figure: the direct method projects the cost vector J_µ onto the subspace S spanned by the basis functions; the indirect method solves the projected form of Bellman’s equation Φr = ΠT_µ(Φr).]

•   Recall that projection can be implemented by simulation and least squares


DIRECT POLICY EVALUATION

•   Suppose we can implement in a simulator the current policy µ, and want to calculate J_µ by simulation.

•   Generate by simulation sample costs. Then:

J_µ(i) ≈ (1/M_i) Σ_{m=1}^{M_i} c(i, m)

c(i, m): mth (noisy) sample cost starting from state i

•   Approximating well each J_µ(i) is impractical for a large state space. Instead, a “compact representation” J̃_µ(i, r) is used, where r is a tunable parameter vector.

•   Direct approach: Calculate an optimal value r∗ of r by a least squares fit

r∗ = arg min_r Σ_{i=1}^n Σ_{m=1}^{M_i} ( c(i, m) − J̃_µ(i, r) )^2

•   Note that this is much easier when the architecture is linear - but this is not a requirement.


IMPLEMENTATION OF DIRECT APPROACH

•   Focus on a batch: an N-transition portion (i_0, . . . , i_N) of a simulated trajectory

•   We view the numbers

Σ_{t=k}^{N−1} α^{t−k} g(i_t, µ(i_t), i_{t+1}),   k = 0, . . . , N − 1,

as cost samples, one per initial state i_0, . . . , i_{N−1}

•   Solve the least squares problem

min_r (1/2) Σ_{k=0}^{N−1} ( J̃(i_k, r) − Σ_{t=k}^{N−1} α^{t−k} g(i_t, µ(i_t), i_{t+1}) )^2

•   If J̃(i_k, r) is linear in r, this problem can be solved by matrix inversion.

•   Another possibility: Use a gradient-type method [applies also to cases where J̃(i_k, r) is nonlinear]

•   A variety of gradient methods: Batch and incremental
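•   A minimal sketch of the batch least squares fit above for a linear architecture J̃(i, r) = φ(i)′r (array names are my own): compute the discounted tail costs along the batch by a backward recursion, then solve the linear least squares problem:

    import numpy as np

    def direct_evaluation_batch(Phi, trajectory, stage_costs, alpha):
        """Fit r to tail costs: min_r sum_k ( phi(i_k)' r - sum_{t>=k} alpha^{t-k} g_t )^2."""
        N = len(stage_costs)                   # transitions i_0 -> ... -> i_N
        tails = np.zeros(N)                    # tails[k] = sum_{t=k}^{N-1} alpha^{t-k} g_t
        acc = 0.0
        for k in reversed(range(N)):
            acc = stage_costs[k] + alpha * acc
            tails[k] = acc
        X = Phi[trajectory[:N]]                # rows phi(i_k)'
        r, *_ = np.linalg.lstsq(X, tails, rcond=None)
        return r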


INDIRECT POLICY EVALUATION

•   For the current policy µ, consider the corresponding mapping T:

(TJ)(i) = Σ_{j=1}^n p_ij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

•   The solution J_µ of Bellman’s equation J = TJ is approximated by the solution of

Φr = ΠT(Φr)

[Figure: indirect method - solving the projected form of Bellman’s equation Φr = ΠT(Φr) on the subspace S spanned by the basis functions.]


WEIGHTED EUCLIDEAN PROJECTIONS

•   Consider a weighted Euclidean norm

‖J‖_v = ( Σ_{i=1}^n v_i J(i)^2 )^{1/2},

where v is a vector of positive weights v_1, . . . , v_n.

•   Let Π denote the projection operation onto

S = {Φr | r ∈ ℜ^s}

with respect to this norm, i.e., for any J ∈ ℜ^n,

ΠJ = Φr∗

where

r∗ = arg min_{r∈ℜ^s} ‖J − Φr‖_v^2


KEY QUESTIONS AND RESULTS

•   Does the projected equation have a solution?

•   Under what conditions is the mapping ΠT a contraction, so ΠT has a unique fixed point?

•   Assuming ΠT has a unique fixed point Φr∗, how close is Φr∗ to J_µ?

•   Assumption: P has a single recurrent class and no transient states, i.e., it has steady-state probabilities that are positive

ξ_j = lim_{N→∞} (1/N) Σ_{k=1}^N P(i_k = j | i_0 = i) > 0

•   Proposition: ΠT is a contraction of modulus α with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ = (ξ_1, . . . , ξ_n) is the steady-state probability vector. The unique fixed point Φr∗ of ΠT satisfies

‖J_µ − Φr∗‖_ξ ≤ (1/√(1 − α^2)) ‖J_µ − ΠJ_µ‖_ξ


PRELIMINARIES: PROJECTION PROPERTIES

•   Important property of the projection Π on S with weighted Euclidean norm ‖·‖_v. For all J ∈ ℜ^n, J̄ ∈ S, the Pythagorean Theorem holds:

‖J − J̄‖_v^2 = ‖J − ΠJ‖_v^2 + ‖ΠJ − J̄‖_v^2

Proof: Geometrically, (J − ΠJ) and (ΠJ − J̄) are orthogonal in the scaled geometry of the norm ‖·‖_v, where two vectors x, y ∈ ℜ^n are orthogonal if Σ_{i=1}^n v_i x_i y_i = 0. Expand the quadratic in the RHS below:

‖J − J̄‖_v^2 = ‖(J − ΠJ) + (ΠJ − J̄)‖_v^2

•   The Pythagorean Theorem implies that the projection is nonexpansive, i.e.,

‖ΠJ − ΠJ̄‖_v ≤ ‖J − J̄‖_v,   for all J, J̄ ∈ ℜ^n.

To see this, note that

‖Π(J − J̄)‖_v^2 ≤ ‖Π(J − J̄)‖_v^2 + ‖(I − Π)(J − J̄)‖_v^2 = ‖J − J̄‖_v^2


PROOF OF CONTRACTION PROPERTY

•   Lemma: We have

‖Pz‖_ξ ≤ ‖z‖_ξ,   z ∈ ℜ^n

Proof: Let p_ij be the components of P. For all z ∈ ℜ^n, we have

‖Pz‖_ξ^2 = Σ_{i=1}^n ξ_i ( Σ_{j=1}^n p_ij z_j )^2 ≤ Σ_{i=1}^n ξ_i Σ_{j=1}^n p_ij z_j^2
         = Σ_{j=1}^n Σ_{i=1}^n ξ_i p_ij z_j^2 = Σ_{j=1}^n ξ_j z_j^2 = ‖z‖_ξ^2,

where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property Σ_{i=1}^n ξ_i p_ij = ξ_j of the steady-state probabilities.

•   Using the lemma, the nonexpansiveness of Π, and the definition TJ = g + αPJ, we have

‖ΠTJ − ΠT J̄‖_ξ ≤ ‖TJ − T J̄‖_ξ = α‖P(J − J̄)‖_ξ ≤ α‖J − J̄‖_ξ

for all J, J̄ ∈ ℜ^n. Hence ΠT is a contraction of modulus α.


PROOF OF ERROR BOUND

•   Let Φr∗ be the fixed point of ΠT. We have

‖J_µ − Φr∗‖_ξ ≤ (1/√(1 − α^2)) ‖J_µ − ΠJ_µ‖_ξ.

Proof: We have

‖J_µ − Φr∗‖_ξ^2 = ‖J_µ − ΠJ_µ‖_ξ^2 + ‖ΠJ_µ − Φr∗‖_ξ^2
               = ‖J_µ − ΠJ_µ‖_ξ^2 + ‖ΠTJ_µ − ΠT(Φr∗)‖_ξ^2
               ≤ ‖J_µ − ΠJ_µ‖_ξ^2 + α^2 ‖J_µ − Φr∗‖_ξ^2,

where

−   The first equality uses the Pythagorean Theorem

−   The second equality holds because J_µ is the fixed point of T and Φr∗ is the fixed point of ΠT

−   The inequality uses the contraction property of ΠT.

Q.E.D.


MATRIX FORM OF PROJECTED EQUATION

•   Its solution is the vector J = Φr∗, where r∗ solves the problem

min_{r∈ℜ^s} ‖Φr − (g + αPΦr∗)‖_ξ^2.

•   Setting to 0 the gradient with respect to r of this quadratic, we obtain

Φ′Ξ( Φr∗ − (g + αPΦr∗) ) = 0,

where Ξ is the diagonal matrix with the steady-state probabilities ξ_1, . . . , ξ_n along the diagonal.

•   This is just the orthogonality condition: The error Φr∗ − (g + αPΦr∗) is “orthogonal” to the subspace spanned by the columns of Φ.

•   Equivalently, Cr∗ = d, where

C = Φ′Ξ(I − αP)Φ,    d = Φ′Ξg.
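•   For a small problem where P, g, ξ, and Φ are available explicitly, the matrix form above can be evaluated directly; a numpy sketch (my notation, exact rather than simulation-based):

    import numpy as np

    def projected_equation_solution(Phi, P, g, xi, alpha):
        """C = Phi' Xi (I - alpha P) Phi,  d = Phi' Xi g,  r* = C^{-1} d."""
        Xi = np.diag(xi)                       # diagonal steady-state weights
        C = Phi.T @ Xi @ (np.eye(len(g)) - alpha * P) @ Phi
        d = Phi.T @ Xi @ g
        r_star = np.linalg.solve(C, d)
        return r_star                          # Phi @ r_star is the fixed point of Pi T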


PROJECTED EQUATION: SOLUTION METHOD

•   Matrix inversion: r∗ = C^{-1} d

•   Projected Value Iteration (PVI) method:

Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)

Converges to r∗ because ΠT is a contraction.

[Figure: PVI on the subspace S spanned by the basis functions - the value iterate T(Φr_k) = g + αPΦr_k is projected back onto S to obtain Φr_{k+1}.]

•   PVI can be written as:

r_{k+1} = arg min_{r∈ℜ^s} ‖Φr − (g + αPΦr_k)‖_ξ^2

By setting to 0 the gradient with respect to r,

Φ′Ξ( Φr_{k+1} − (g + αPΦr_k) ) = 0,

which yields

r_{k+1} = r_k − (Φ′ΞΦ)^{-1}(Cr_k − d)


SIMULATION-BASED IMPLEMENTATIONS

•   Key idea: Calculate simulation-based approximations based on k samples

C_k ≈ C,    d_k ≈ d

•   Matrix inversion r∗ = C^{-1} d is approximated by

r_k = C_k^{-1} d_k

This is the LSTD (Least Squares Temporal Differences) method.

•   The PVI method r_{k+1} = r_k − (Φ′ΞΦ)^{-1}(Cr_k − d) is approximated by

r_{k+1} = r_k − G_k (C_k r_k − d_k)

where

G_k ≈ (Φ′ΞΦ)^{-1}

This is the LSPE (Least Squares Policy Evaluation) method.

•   Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).


SIMULATION MECHANICS

•   We generate an infinitely long trajectory (i_0, i_1, . . .) of the Markov chain, so states i and transitions (i, j) appear with long-term frequencies ξ_i and p_ij.

•   After generating the transition (i_t, i_{t+1}), we compute the row φ(i_t)′ of Φ and the cost component g(i_t, i_{t+1}).

•   We form

C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) ( φ(i_t) − αφ(i_{t+1}) )′ ≈ Φ′Ξ(I − αP)Φ

d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) g(i_t, i_{t+1}) ≈ Φ′Ξg

Also in the case of LSPE

G_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) φ(i_t)′ ≈ Φ′ΞΦ

•   Convergence proof is simple: Use the law of large numbers.

•   Note that C_k, d_k, and G_k can be formed incrementally.
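•   A sketch (illustrative, single trajectory, λ = 0 case) of the LSTD estimates above: accumulate C_k and d_k from the simulated transitions and solve C_k r = d_k:

    import numpy as np

    def lstd_from_trajectory(Phi, trajectory, stage_costs, alpha):
        """C_k ~ Phi' Xi (I - alpha P) Phi and d_k ~ Phi' Xi g from one long trajectory."""
        s = Phi.shape[1]
        C = np.zeros((s, s))
        d = np.zeros(s)
        for t in range(len(trajectory) - 1):
            i, j = trajectory[t], trajectory[t + 1]
            phi_i, phi_j = Phi[i], Phi[j]
            C += np.outer(phi_i, phi_i - alpha * phi_j)  # phi(i_t)(phi(i_t)-alpha phi(i_{t+1}))'
            d += phi_i * stage_costs[t]                  # phi(i_t) g(i_t, i_{t+1})
        k = len(trajectory) - 1
        C, d = C / k, d / k                    # the 1/(k+1) scaling cancels in C^{-1} d
        return np.linalg.solve(C, d)           # LSTD estimate of r*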


6.231 DYNAMIC PROGRAMMING

LECTURE 21

LECTURE OUTLINE

•   Review of projected equation methods

•   Optimistic versions

•   Multistep projected equation methods

•   Bias-variance tradeoff 

•   Exploration-enhanced implementations

•   Oscillations


REVIEW: PROJECTED BELLMAN EQUATION

•   For a fixed policy µ to be evaluated, consider the corresponding mapping T:

(TJ)(i) = Σ_{j=1}^n p_ij ( g(i, j) + αJ(j) ),   i = 1, . . . , n,

or more compactly,

TJ = g + αPJ

•   Approximate Bellman’s equation J = TJ by Φr = ΠT(Φr) or the matrix form/orthogonality condition Cr∗ = d, where

C = Φ′Ξ(I − αP)Φ,    d = Φ′Ξg.

[Figure: indirect method - solving the projected form of Bellman’s equation Φr = ΠT(Φr) on the subspace S spanned by the basis functions.]


PROJECTED EQUATION METHODS

•   Matrix inversion: r∗ = C^{-1} d

•   Iterative Projected Value Iteration (PVI) method:

Φr_{k+1} = ΠT(Φr_k) = Π(g + αPΦr_k)

Converges to r∗ if ΠT is a contraction. True if Π is projection with respect to ‖·‖_ξ, where ξ is the steady-state distribution.

•   Simulation-Based Implementations: Calculate approximations based on k simulation samples

C_k ≈ C,    d_k ≈ d

•   LSTD: r∗ = C^{-1} d is approximated by

r_k = C_k^{-1} d_k

•   LSPE:

r_{k+1} = r_k − G_k (C_k r_k − d_k)

where G_k ≈ (Φ′ΞΦ)^{-1}. Converges to r∗ if ΠT is a contraction.

•   Key fact: C_k, d_k, and G_k can be computed with low-dimensional linear algebra (of order s; the number of basis functions).


OPTIMISTIC VERSIONS

•   Use coarse approximations C_k ≈ C and d_k ≈ d, based on few simulation samples

•   Evaluate (coarsely) the current policy µ, then do a policy improvement

•   Very complex behavior (see the subsequent discussion on oscillations)

•   Often approaches the limit more quickly (as optimistic methods often do)

•   The matrix inversion/LSTD method has serious problems due to large simulation noise (because of limited sampling)

•   LSPE tends to cope better because of its iterative nature

•   A stepsize γ ∈ (0, 1] in LSPE may be useful to damp the effect of simulation noise

r_{k+1} = r_k − γ G_k (C_k r_k − d_k)


MULTISTEP METHODS

•   Introduce a multistep version of Bellman’s equation J = T^(λ)J, where for λ ∈ [0, 1),

T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

Geometrically weighted sum of powers of T.

•   Note that T^t is a contraction with modulus α^t, with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ is the steady-state probability vector of the Markov chain.

•   Hence T^(λ) is a contraction with modulus

α_λ = (1 − λ) Σ_{t=0}^∞ α^{t+1} λ^t = α(1 − λ)/(1 − αλ)

Note that α_λ → 0 as λ → 1.

•   T^t and T^(λ) have the same fixed point J_µ and

‖J_µ − Φr∗_λ‖_ξ ≤ (1/√(1 − α_λ^2)) ‖J_µ − ΠJ_µ‖_ξ

where Φr∗_λ is the fixed point of ΠT^(λ).

•   The fixed point Φr∗_λ depends on λ.


BIAS-VARIANCE TRADEOFF

•   Error bound: ‖J_µ − Φr∗_λ‖_ξ ≤ (1/√(1 − α_λ^2)) ‖J_µ − ΠJ_µ‖_ξ

•   Bias-variance tradeoff:

−   As λ ↑ 1, we have α_λ ↓ 0, so the error bound (and the quality of approximation) improves as λ ↑ 1. In fact

lim_{λ↑1} Φr∗_λ = ΠJ_µ

−   But simulation noise in approximating

T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

increases.

[Figure: in the subspace S = {Φr | r ∈ ℜ^s}, the solution Φr∗_λ of the projected equation Φr = ΠT_µ^(λ)(Φr) moves from a biased point (λ = 0) toward ΠJ_µ (λ = 1), while the simulation error grows with λ.]

•   Choice of  λ  is based on trial and error


MULTISTEP PROJECTED EQ. METHODS

•   The projected Bellman equation is

Φr = ΠT^(λ)(Φr)

•   In matrix form: C^(λ) r = d^(λ), where

C^(λ) = Φ′Ξ( I − αP^(λ) )Φ,    d^(λ) = Φ′Ξ g^(λ),

with

P^(λ) = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^{ℓ+1},    g^(λ) = Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^ℓ g

•   The LSTD(λ) method is

( C_k^(λ) )^{-1} d_k^(λ),

where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ).

•   The LSPE(λ) method is

r_{k+1} = r_k − γ G_k ( C_k^(λ) r_k − d_k^(λ) )

where G_k is a simulation-based approximation to (Φ′ΞΦ)^{-1}

•   TD(λ) is a simplified version of LSPE(λ) where G_k = I, and simpler approximations to C and d are used. It is more properly viewed as a stochastic iteration for solving the projected equation.


MORE ON MULTISTEP METHODS

•   The simulation process to obtain C_k^(λ) and d_k^(λ) is similar to the case λ = 0 (single simulation trajectory i_0, i_1, . . ., more complex formulas)

C_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} ( φ(i_m) − αφ(i_{m+1}) )′,

d_k^(λ) = (1/(k+1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} g(i_m, i_{m+1})

•   In the context of approximate policy iteration, we can use optimistic versions (few samples between policy updates).

•   Many different versions ... see the on-line chapter.

•   Note the λ-tradeoffs:

−   As λ ↑ 1, C_k^(λ) and d_k^(λ) contain more “simulation noise”, so more samples are needed for a close approximation of r_λ (the solution of the projected equation)

−   The error bound ‖J_µ − Φr_λ‖_ξ becomes smaller

−   As λ ↑ 1, ΠT^(λ) becomes a contraction for arbitrary projection norm


POLICY ITERATION ISSUES

•   1st major issue: exploration. Common remedy is the off-policy approach: Replace P of the current policy with

P̄ = (I − B)P + BQ,

where B is a diagonal matrix with diagonal components β_i ∈ [0, 1] and Q is another transition probability matrix.

•   Then the LSTD and LSPE formulas must be modified ... otherwise the policy associated with P̄ (not P) is evaluated (see the on-line chapter).

•   2nd major issue: oscillation of policies

•   Analysis using the greedy partition: R_µ is the set of parameter vectors r for which µ is greedy with respect to J̃(·, r) = Φr

R_µ = { r | T_µ(Φr) = T(Φr) }

[Figure: the greedy partition of the parameter space into regions R_µk, R_µk+1, R_µk+2, R_µk+3; approximate PI can cycle among the corresponding parameter vectors r_µk, r_µk+1, r_µk+2, r_µk+3.]


MORE ON OSCILLATIONS/CHATTERING

•   In the case of optimistic policy iteration a different picture holds

[Figure: greedy partition regions R_µ1, R_µ2, R_µ3 with parameter vectors r_µ1, r_µ2, r_µ3; optimistic PI chatters among them.]

•   Oscillations are less violent, but the “limit” point is meaningless!

•   Fundamentally, oscillations are due to the lack of monotonicity of the projection operator, i.e., J ≤ J′ does not imply ΠJ ≤ ΠJ′.

•   If approximate PI uses policy evaluation

Φr = (WT_µ)(Φr)

with W a monotone operator, the generated policies converge (to a possibly nonoptimal limit).

•   The operator W used in the aggregation approach has this monotonicity property.


6.231 DYNAMIC PROGRAMMING

LECTURE 22

LECTURE OUTLINE

•  Aggregation as an approximation methodology

•   Aggregate problem

•   Examples of aggregation

•   Simulation-based aggregation

•   Q-Learning


PROBLEM APPROXIMATION - AGGREGATION

•   Another major idea in ADP is to approximate the cost-to-go function of the problem with the cost-to-go function of a simpler problem. The simplification is often ad-hoc/problem dependent.

•   Aggregation is a systematic approach for problem approximation. Main elements:

−   Introduce a few “aggregate” states, viewed as the states of an “aggregate” system

−   Define transition probabilities and costs of the aggregate system, by relating original system states with aggregate states

−   Solve (exactly or approximately) the “aggregate” problem by any kind of value or policy iteration method (including simulation-based methods)

−   Use the optimal cost of the aggregate problem to approximate the optimal cost of the original problem

•   Hard aggregation example: Aggregate states are subsets of original system states, treated as if they all have the same cost.


AGGREGATION/DISAGGREGATION PROBS

•   The aggregate system transition probabilities are defined via two (somewhat arbitrary) choices:

•   For each original system state j and aggregate state y, the aggregation probability φjy

−   Roughly, the "degree of membership of j in the aggregate state y."

−   In hard aggregation, φjy = 1 if state j belongs to aggregate state/subset y.

•   For each aggregate state x and original system state i, the disaggregation probability dxi

−   Roughly, the "degree to which i is representative of x."

−   In hard aggregation, one possibility is that all states i that belong to aggregate state/subset x have equal dxi.

[Figure: original system states i, j with transition probabilities pij(u); aggregate states x, y related to them through disaggregation probabilities dxi and aggregation probabilities φjy, giving

ˆpxy(u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) φjy ]


AGGREGATE TRANSITION PROBABILITIES

•   The transition probability from aggregate state x to aggregate state y under control u:

ˆpxy(u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) φjy,   or   P̂(u) = D P(u) Φ

where the rows of D and Φ are the disaggregation and aggregation probabilities.

•   The expected transition cost is

ĝ(x, u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) g(i, u, j),   or   ĝ = DPg

•   The optimal cost function of the aggregate problem, denoted R, is

R(x) = min_{u∈U} [ ĝ(x, u) + α Σ_{y} ˆpxy(u) R(y) ],   ∀ x

or R = min_u [ĝ + α P̂ R] - Bellman's equation for the aggregate problem.

•   The optimal cost function J* of the original problem is approximated by J̃ given by

J̃(j) = Σ_{y} φjy R(y),   ∀ j
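
•   A minimal Python sketch of forming P̂(u) = DP(u)Φ and ĝ, solving the aggregate Bellman equation by value iteration, and recovering J̃ = ΦR; the states, controls, probabilities, and costs are made-up illustrative data:

import numpy as np

def aggregate_problem(P, g, D, Phi):
    """Form P_hat(u) = D P(u) Phi and g_hat(x,u) = sum_i d_xi sum_j p_ij(u) g(i,u,j).

    P: dict u -> (n x n) transition matrix of the original system
    g: dict u -> (n x n) matrix of costs g(i, u, j)
    D: (m x n) disaggregation probs, Phi: (n x m) aggregation probs
    """
    P_hat = {u: D @ P[u] @ Phi for u in P}
    g_hat = {u: D @ np.sum(P[u] * g[u], axis=1) for u in P}   # m-vector per control
    return P_hat, g_hat

def solve_aggregate_bellman(P_hat, g_hat, alpha, iters=1000):
    """Value iteration on R(x) = min_u [ g_hat(x,u) + alpha * sum_y p_hat_xy(u) R(y) ]."""
    m = len(next(iter(g_hat.values())))
    R = np.zeros(m)
    for _ in range(iters):
        R = np.min([g_hat[u] + alpha * P_hat[u] @ R for u in P_hat], axis=0)
    return R

# Made-up 4-state, 2-control example with 2 aggregate states (hard aggregation)
rng = np.random.default_rng(0)
n, alpha = 4, 0.9
P = {u: rng.dirichlet(np.ones(n), size=n) for u in range(2)}
g = {u: rng.uniform(0, 1, size=(n, n)) for u in range(2)}
Phi = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
D = Phi.T / Phi.sum(axis=0)[:, None]          # uniform disaggregation within each subset
P_hat, g_hat = aggregate_problem(P, g, D, Phi)
R = solve_aggregate_bellman(P_hat, g_hat, alpha)
J_approx = Phi @ R                            # approximation of J* on the original states
print(R, J_approx)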


EXAMPLE I: HARD AGGREGATION

•   Group the original system states into subsets, and view each subset as an aggregate state

•   Aggregation probs.: φjy = 1 if j belongs to aggregate state y.

[Figure: 3 × 3 grid of original states 1-9, partitioned into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}, with aggregation matrix]

Φ =
[1 0 0 0]
[1 0 0 0]
[0 1 0 0]
[1 0 0 0]
[1 0 0 0]
[0 1 0 0]
[0 0 1 0]
[0 0 1 0]
[0 0 0 1]

•   Disaggregation probs.: There are many possibilities, e.g., all states i within aggregate state x have equal prob. dxi.

•   If the optimal cost vector J* is piecewise constant over the aggregate states/subsets, hard aggregation is exact. Suggests grouping states with "roughly equal" cost into aggregates.

•   Soft aggregation (provides "soft boundaries" between aggregate states).
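
•   A small Python sketch of building Φ and one choice of D for the hard-aggregation partition pictured above (the 3 × 3 grid with x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}):

import numpy as np

# Partition of the 9 grid states into aggregate states, as in the figure above
groups = [[1, 2, 4, 5], [3, 6], [7, 8], [9]]
n, m = 9, len(groups)

Phi = np.zeros((n, m))      # aggregation probs: phi_jy = 1 if state j is in group y
for y, states in enumerate(groups):
    for j in states:
        Phi[j - 1, y] = 1.0

D = np.zeros((m, n))        # one choice of disaggregation probs: uniform within each group
for x, states in enumerate(groups):
    for i in states:
        D[x, i - 1] = 1.0 / len(states)

print(Phi)
print(D)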


EXAMPLE II: FEATURE-BASED AGGREGATION

•   If we know good features, it makes sense to group together states that have "similar features"

•   Essentially discretize the features and assign a weight to each discretization point

[Figure: states → feature extraction → features → aggregate states]

•   A general approach for passing from a feature-based state representation to an aggregation-based architecture

•   The aggregation-based architecture is more powerful (nonlinear in the features)

•   ... but may require many more aggregate states to reach the same level of performance as the corresponding linear feature-based architecture


EXAMPLE III: REP. STATES/COARSE GRID

•   Choose a collection of "representative" original system states, and associate each one of them with an aggregate state

[Figure: original state space with representative/aggregate states x, y1, y2, y3 and original system states j1, j2, j3]

•   Disaggregation probabilities are dxi = 1 if i is equal to representative state x.

•   Aggregation probabilities associate original system states with convex combinations of representative states:

j ∼ Σ_{y∈A} φjy y

•   Well-suited for Euclidean space discretization

•   Extends nicely to continuous state space, including the belief space of POMDP


APPROXIMATE PI BY AGGREGATION

•   Consider approximate policy iteration for the original problem, with policy evaluation done by aggregation.

•   Evaluation of policy µ: J̃ = ΦR, where R = DTµ(ΦR) (R is the vector of costs of the aggregate states corresponding to µ). Can be done by simulation.

•   Similar form to the projected equation ΦR = ΠTµ(ΦR) (ΦD in place of Π).

•   Advantages: It has no problem with exploration or with oscillations.

•   Disadvantage: The rows of D and Φ must be probability distributions.

[Figure: aggregation/disaggregation diagram as on the earlier slide, with ˆpxy(u) = Σ_{i=1..n} dxi Σ_{j=1..n} pij(u) φjy ]
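
•   A minimal Python sketch of this aggregation-based policy evaluation, R = DTµ(ΦR), for a discounted problem, solved here directly as a linear system; the matrices are illustrative placeholders, not from the slides:

import numpy as np

def evaluate_policy_by_aggregation(P_mu, g_mu, D, Phi, alpha):
    """Solve R = D T_mu(Phi R) = D (g_mu + alpha * P_mu @ Phi @ R) for R,
    then return the approximation J_tilde = Phi @ R of the cost of mu."""
    m = D.shape[0]
    R = np.linalg.solve(np.eye(m) - alpha * D @ P_mu @ Phi, D @ g_mu)
    return R, Phi @ R

# Illustrative data: 4 states, 2 aggregate states (hard aggregation), one policy mu
P_mu = np.array([[0.7, 0.3, 0.0, 0.0],
                 [0.1, 0.6, 0.3, 0.0],
                 [0.0, 0.2, 0.5, 0.3],
                 [0.2, 0.0, 0.0, 0.8]])
g_mu = np.array([1.0, 2.0, 0.5, 3.0])        # expected one-stage costs under mu
Phi = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
D = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
R, J_tilde = evaluate_policy_by_aggregation(P_mu, g_mu, D, Phi, alpha=0.9)
print(R, J_tilde)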


Q-LEARNING I

•   Q-learning has two motivations:

−   Dealing with multiple policies simultaneously

−   Using a model-free approach [no need to know pij(u), only be able to simulate them]

•   The Q-factors are defined by

Q*(i, u) = Σ_{j=1..n} pij(u) ( g(i, u, j) + αJ*(j) ),   ∀ (i, u)

•   Since J* = TJ*, we have J*(i) = min_{u∈U(i)} Q*(i, u), so the Q-factors solve the equation

Q*(i, u) = Σ_{j=1..n} pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q*(j, u′) )

•   Q*(i, u) can be shown to be the unique solution of this equation. Reason: This is Bellman's equation for a system whose states are the original states 1, . . . , n, together with all the pairs (i, u).

•   Value iteration: For all (i, u)

Q(i, u) := Σ_{j=1..n} pij(u) ( g(i, u, j) + α min_{u′∈U(j)} Q(j, u′) )
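
•   A minimal Python sketch of this value iteration on Q-factors (model-based; the 3-state, 2-control data are made up for illustration):

import numpy as np

def q_value_iteration(P, g, alpha, iters=1000):
    """Q(i,u) := sum_j p_ij(u) [ g(i,u,j) + alpha * min_{u'} Q(j,u') ].

    P[u]: (n x n) transition matrix, g[u]: (n x n) costs g(i, u, j).
    """
    n, controls = P[0].shape[0], sorted(P)
    Q = np.zeros((n, len(controls)))
    for _ in range(iters):
        J = Q.min(axis=1)                    # J(j) = min_{u'} Q(j, u')
        Q = np.column_stack([(P[u] * (g[u] + alpha * J)).sum(axis=1) for u in controls])
    return Q

# Illustrative 3-state, 2-control data
rng = np.random.default_rng(1)
P = {u: rng.dirichlet(np.ones(3), size=3) for u in (0, 1)}
g = {u: rng.uniform(0, 1, size=(3, 3)) for u in (0, 1)}
Q = q_value_iteration(P, g, alpha=0.9)
print(Q, Q.min(axis=1))   # Q-factors and J*(i) = min_u Q*(i, u)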


Q-LEARNING II

•   Use any probabilistic mechanism to select the sequence of pairs (ik, uk) [all pairs (i, u) are chosen infinitely often], and for each k, select jk according to pikj(uk).

•   At each k, the Q-learning algorithm updates Q(ik, uk) according to

Q(ik, uk) := (1 − γk(ik, uk)) Q(ik, uk) + γk(ik, uk) ( g(ik, uk, jk) + α min_{u′∈U(jk)} Q(jk, u′) )

•   The stepsize γk(ik, uk) must converge to 0 at a proper rate (e.g., like 1/k).
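
•   A minimal Python sketch of this tabular Q-learning iteration, with a 1/(visit count) stepsize; the simulator below is a made-up stand-in for the ability to sample transitions according to pij(u):

import numpy as np

def q_learning(sample, n, num_controls, alpha, num_iters=200_000, seed=0):
    """Tabular Q-learning: Q(i,u) := (1-gamma)Q(i,u) + gamma[g + alpha*min_u' Q(j,u')].

    sample(i, u) must return a pair (j, cost) drawn according to p_ij(u), g(i,u,j).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, num_controls))
    visits = np.zeros((n, num_controls))
    for _ in range(num_iters):
        i, u = rng.integers(n), rng.integers(num_controls)  # every pair chosen infinitely often
        j, cost = sample(i, u)
        visits[i, u] += 1
        gamma = 1.0 / visits[i, u]                          # stepsize -> 0 like 1/k
        Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (cost + alpha * Q[j].min())
    return Q

# Made-up 3-state, 2-control simulator
rng = np.random.default_rng(3)
P = {u: rng.dirichlet(np.ones(3), size=3) for u in (0, 1)}
G = {u: rng.uniform(0, 1, size=(3, 3)) for u in (0, 1)}
def sample(i, u, rng=np.random.default_rng(4)):
    j = rng.choice(3, p=P[u][i])
    return j, G[u][i, j]

Q = q_learning(sample, n=3, num_controls=2, alpha=0.9)
print(Q)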

•   Important mathematical point: In the Q-factor version of Bellman's equation the order of expectation and minimization is reversed relative to the ordinary cost version of Bellman's equation:

J*(i) = min_{u∈U(i)} Σ_{j=1..n} pij(u) ( g(i, u, j) + αJ*(j) )

•   Q-learning can be shown to converge to the true/exact Q-factors (a sophisticated proof).

•   Major drawback: The large number of pairs (i, u) - no function approximation is used.


Q-FACTOR APPROXIMATIONS

•   Introduce basis function approximation for Q-factors:

Q̃(i, u, r) = φ(i, u)′r

•   We can use approximate policy iteration and LSPE/LSTD/TD for policy evaluation (the exploration issue is acute)

•   Optimistic policy iteration methods are frequently used on a heuristic basis.

•   Example: Generate an infinite trajectory {(ik, uk) | k = 0, 1, . . .}

•   At iteration k, given rk and state/control (ik, uk):

(1) Simulate the next transition (ik, ik+1) using the transition probabilities pikj(uk).

(2) Generate the control uk+1 from

uk+1 = arg min_{u∈U(ik+1)} Q̃(ik+1, u, rk)

(3) Update the parameter vector via

rk+1 = rk − (LSPE or TD-like correction)

•   Unclear validity. There is a solid basis for an important but very special case: optimal stopping (see the on-line chapter)
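
•   A rough Python sketch of steps (1)-(3) with a TD(0)-like correction (essentially SARSA with linear Q-factor features); the feature map, stepsize, and simulator are illustrative assumptions, and, as noted above, the scheme has no general convergence guarantee:

import numpy as np

def optimistic_q_iteration(sample, phi, controls, alpha, num_steps=100_000,
                           stepsize=0.01, seed=0):
    """Greedy trajectory-based update of r in Q_tilde(i,u,r) = phi(i,u)' r,
    using a TD(0)-like correction (one of the 'LSPE or TD-like' options)."""
    rng = np.random.default_rng(seed)
    r = np.zeros(len(phi(0, controls[0])))
    i, u = 0, controls[0]
    for _ in range(num_steps):
        j, cost = sample(i, u)                                # step (1)
        u_next = min(controls, key=lambda v: phi(j, v) @ r)   # step (2): greedy control
        td = cost + alpha * (phi(j, u_next) @ r) - phi(i, u) @ r
        r = r + stepsize * td * phi(i, u)                     # step (3): TD-like correction
        i, u = j, u_next
    return r

# Illustrative usage with a one-hot feature map (reduces to a tabular Q-factor)
rng = np.random.default_rng(5)
P = {u: rng.dirichlet(np.ones(3), size=3) for u in (0, 1)}
G = {u: rng.uniform(0, 1, size=(3, 3)) for u in (0, 1)}
def sample(i, u, rng=np.random.default_rng(6)):
    j = rng.choice(3, p=P[u][i])
    return j, G[u][i, j]
def phi(i, u):
    f = np.zeros(6)
    f[2 * i + u] = 1.0
    return f

r = optimistic_q_iteration(sample, phi, controls=[0, 1], alpha=0.9)
print(r)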


6.231 DYNAMIC PROGRAMMING

LECTURE 23

LECTURE OUTLINE

•   LSTD-like methods - Use to enhance exploration

•   Additional topics in ADP

•   Stochastic shortest path problems

•   Average cost problems

•   Nonlinear versions of the projected equation

•   Extension of Q-learning for optimal stopping

•   Basis function adaptation

•  Gradient-based approximation in policy space


REVIEW: PROJECTED BELLMAN EQUATION

•   For a fixed policy µ to be evaluated, the solution of Bellman's equation J = TJ is approximated by the solution of

Φr = ΠT (Φr)

whose solution is in turn obtained using a simulation-based method such as LSPE(λ), LSTD(λ), or TD(λ).

[Figure: subspace S spanned by the basis functions; T(Φr) is projected onto S, yielding Φr = ΠT(Φr). Indirect method: solving a projected form of Bellman's equation]

•   These ideas apply to other (linear) Bellman equations, e.g., for SSP and average cost.

•   Key Issue: Construct a framework where ΠT [or at least ΠT(λ)] is a contraction.


STOCHASTIC SHORTEST PATHS

•   Introduce the approximation subspace

S = {Φr | r ∈ ℜs}

and for a given proper policy, Bellman's equation and its projected version

J = TJ = g + PJ,   Φr = ΠT(Φr)

Also its λ-version

Φr = ΠT(λ)(Φr),   T(λ) = (1 − λ) Σ_{t=0..∞} λ^t T^{t+1}

•   Question: What should be the norm of projection?

•   Speculation based on the discounted case: It should be a weighted Euclidean norm with weight vector ξ = (ξ1, . . . , ξn), where ξi should be some type of long-term occupancy probability of state i (which can be generated by simulation).

•   But what does "long-term occupancy probability of a state" mean in the SSP context?

•   How do we generate infinite length trajectories given that termination occurs with prob. 1?


SIMULATION TRAJECTORIES FOR SSP

•   We envision simulation of trajectories up to termination, followed by restart at state i with some fixed probabilities q0(i) > 0.

•   Then the "long-term occupancy probability" of state i is proportional to

q(i) = Σ_{t=0..∞} qt(i),   i = 1, . . . , n,

where

qt(i) = P(it = i),   i = 1, . . . , n,   t = 0, 1, . . .

•   We use the projection norm

‖J‖q = ( Σ_{i=1..n} q(i) (J(i))² )^{1/2}

[Note that 0 < q(i) < ∞, but q is not a prob. distribution.]

•   We can show that ΠT(λ) is a contraction with respect to ‖ · ‖q (see the next slide).

•   LSTD(λ), LSPE(λ), and TD(λ) are possible.
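
•   A minimal Python sketch of estimating the weights q(i) by simulating trajectories to termination and restarting according to q0; the SSP data below (a continuation matrix plus per-state termination probabilities) are made up for illustration:

import numpy as np

def estimate_q_weights(P, term_prob, q0, num_restarts=20_000, seed=0):
    """Accumulate visit counts over trajectories that terminate w.p. 1 and restart
    according to q0; q(i) is the average total number of visits per trajectory."""
    rng = np.random.default_rng(seed)
    n = len(q0)
    counts = np.zeros(n)
    for _ in range(num_restarts):
        i = rng.choice(n, p=q0)                   # restart
        while True:
            counts[i] += 1
            if rng.random() < term_prob[i]:       # move to the termination state
                break
            i = rng.choice(n, p=P[i])             # move among nontermination states
    return counts / num_restarts                  # estimates q(i) = sum_t P(i_t = i)

# Made-up proper-policy SSP with 3 nontermination states; P is the transition
# matrix conditioned on not terminating at the current step
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.3, 0.3]])
term_prob = np.array([0.1, 0.2, 0.3])
q0 = np.array([0.4, 0.3, 0.3])
q = estimate_q_weights(P, term_prob, q0)
print(q)      # projection weights; note q is not a probability distribution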


CONTRACTION PROPERTY FOR SSP

•   We have q = Σ_{t=0..∞} qt, so

q′P = Σ_{t=0..∞} qt′P = Σ_{t=1..∞} qt′ = q′ − q0′

or

Σ_{i=1..n} q(i) pij = q(j) − q0(j),   ∀ j

•   To verify that ΠT is a contraction, we show that there exists β < 1 such that ‖Pz‖²q ≤ β‖z‖²q for all z ∈ ℜn.

•   For all z ∈ ℜn, we have

‖Pz‖²q = Σ_{i=1..n} q(i) ( Σ_{j=1..n} pij zj )²

≤ Σ_{i=1..n} q(i) Σ_{j=1..n} pij zj²

= Σ_{j=1..n} zj² Σ_{i=1..n} q(i) pij

= Σ_{j=1..n} ( q(j) − q0(j) ) zj²

= ‖z‖²q − ‖z‖²q0 ≤ β‖z‖²q

where

β = 1 − min_{j} q0(j) / q(j)


AVERAGE COST PROBLEMS

•   Consider a single policy to be evaluated, with a single recurrent class, no transient states, and steady-state probability vector ξ = (ξ1, . . . , ξn).

•   The average cost, denoted by η, is independent of the initial state:

η = lim_{N→∞} (1/N) E{ Σ_{k=0..N−1} g(xk, xk+1) | x0 = i },   ∀ i

•   Bellman's equation is J = FJ with

FJ = g − ηe + PJ

where e is the unit vector e = (1, . . . , 1).

•   The projected equation and its λ-version are

Φr = ΠF(Φr),   Φr = ΠF(λ)(Φr)

•   A problem here is that F is not a contraction with respect to any norm (since e = Pe).

•   However, ΠF(λ) turns out to be a contraction with respect to ‖ · ‖ξ assuming that e does not belong to S and λ > 0 [the case λ = 0 is exceptional, but can be handled - see the text].

•   LSTD(λ), LSPE(λ), and TD(λ) are possible.


GENERALIZATION/UNIFICATION

•   Consider approximate solution of x = T(x), where

T(x) = Ax + b,   A is n × n,   b ∈ ℜn

by solving the projected equation y = ΠT(y), where Π is projection on a subspace of basis functions (with respect to some Euclidean norm).

•   We will generalize from DP to the case where A is arbitrary, subject only to

I − ΠA : invertible

•   Benefits of generalization:

−   Unification/higher perspective for TD methods in approximate DP

−   An extension to a broad new area of applications, where a DP perspective may be helpful

•   Challenge: Dealing with less structure

− Lack of contraction

−  Absence of a Markov chain


GENERALIZED PROJECTED EQUATION

•   Let Π be projection with respect to

‖x‖ξ = ( Σ_{i=1..n} ξi xi² )^{1/2}

where ξ ∈ ℜn is a probability distribution with positive components.

•   If r* is the solution of the projected equation, we have Φr* = Π(AΦr* + b) or

r* = arg min_{r∈ℜs} Σ_{i=1..n} ξi ( φ(i)′r − Σ_{j=1..n} aij φ(j)′r* − bi )²

where φ(i)′ denotes the ith row of the matrix Φ.

•   Optimality condition/equivalent form:

Σ_{i=1..n} ξi φ(i) ( φ(i) − Σ_{j=1..n} aij φ(j) )′ r* = Σ_{i=1..n} ξi φ(i) bi

•   The two expected values can be approximated by simulation.
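
•   A minimal Python sketch of solving this optimality condition exactly in matrix form, C r* = d with C = Φ′Ξ(I − A)Φ and d = Φ′Ξb (the simulation-based version appears on a later slide); the data are arbitrary illustrative values, with A scaled small so that the equation is well posed:

import numpy as np

def projected_equation_solution(A, b, Phi, xi):
    """Solve Phi r* = Pi(A Phi r* + b), Pi = xi-weighted projection on range(Phi)."""
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(xi)) - A) @ Phi
    d = Phi.T @ Xi @ b
    return np.linalg.solve(C, d)

# Illustrative data: n = 5, s = 2 basis functions
rng = np.random.default_rng(7)
n, s = 5, 2
A = 0.5 * rng.dirichlet(np.ones(n), size=n)   # each row sums to 0.5: sup-norm contraction
b = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))
xi = np.full(n, 1.0 / n)
r_star = projected_equation_solution(A, b, Phi, xi)
print(Phi @ r_star)      # approximation of the solution of x = Ax + b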


SIMULATION MECHANISM

[Figure: row sampling generates the state sequence i0, i1, . . . , ik, ik+1, . . . according to ξ; column sampling generates the pairs (i0, j0), (i1, j1), . . . , (ik, jk), . . . according to P]

•   Row sampling: Generate a sequence {i0, i1, . . .} according to ξ, i.e., the relative frequency of each row i is ξi

•   Column sampling: Generate

(i0, j0), (i1, j1), . . .

according to some transition probability matrix P with

pij > 0   if   aij ≠ 0,

i.e., for each i, the relative frequency of (i, j) is pij

•   Row sampling may be done using a Markov chain with transition matrix Q (unrelated to P)

•   Row sampling may also be done without a Markov chain - just sample rows according to some known distribution ξ (e.g., a uniform)


ROW AND COLUMN SAMPLING

[Figure: row sampling of i0, i1, . . . , ik, ik+1, . . . according to ξ (may use a Markov chain Q); column sampling of (i0, j0), (i1, j1), . . . , (ik, jk), . . . according to a Markov chain P ∼ |A|]

•   Row sampling ∼ State Sequence Generation in DP. Affects:

−   The projection norm.

−   Whether ΠA is a contraction.

•   Column sampling ∼ Transition Sequence Generation in DP.

−   Can be totally unrelated to row sampling. Affects the sampling/simulation error.

−   "Matching" P with |A| is beneficial (has an effect like in importance sampling).

•   Independent row and column sampling allows exploration at will! Resolves the exploration problem that is critical in approximate policy iteration.


LSTD-LIKE METHOD

•   Optimality condition/equivalent form of the projected equation:

Σ_{i=1..n} ξi φ(i) ( φ(i) − Σ_{j=1..n} aij φ(j) )′ r* = Σ_{i=1..n} ξi φ(i) bi

•   The two expected values are approximated by row and column sampling (batch 0 → t).

•   We solve the linear equation

Σ_{k=0..t} φ(ik) ( φ(ik) − (aik jk / pik jk) φ(jk) )′ rt = Σ_{k=0..t} φ(ik) bik

•   We have rt → r*, regardless of ΠA being a contraction (by the law of large numbers; see the next slide).

•   An LSPE-like method is also possible, but requires that ΠA is a contraction.

•   Under the assumption Σ_{j=1..n} |aij| ≤ 1 for all i, there are conditions that guarantee contraction of ΠA; see the paper by Bertsekas and Yu, "Projected Equation Methods for Approximate Solution of Large Linear Systems," 2009.
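
•   A minimal Python sketch of the row/column sampling estimate and the resulting linear solve; the data (A, b, Φ, ξ, P) are illustrative, and the exact matrix solution sketched earlier can be used as a check:

import numpy as np

def lstd_like_solve(A, b, Phi, xi, P, t=100_000, seed=0):
    """Estimate C r = d from samples: rows i_k ~ xi, columns j_k ~ P[i_k, :]."""
    rng = np.random.default_rng(seed)
    n, s = Phi.shape
    C = np.zeros((s, s))
    d = np.zeros(s)
    for _ in range(t):
        i = rng.choice(n, p=xi)                   # row sampling
        j = rng.choice(n, p=P[i])                 # column sampling
        C += np.outer(Phi[i], Phi[i] - (A[i, j] / P[i, j]) * Phi[j])
        d += Phi[i] * b[i]
    return np.linalg.solve(C, d)

# Illustrative data; P has p_ij > 0 wherever a_ij != 0 (here P is uniform)
rng = np.random.default_rng(8)
n, s = 5, 2
A = 0.5 * rng.dirichlet(np.ones(n), size=n)
b = rng.standard_normal(n)
Phi = rng.standard_normal((n, s))
xi = np.full(n, 1.0 / n)
P = np.full((n, n), 1.0 / n)
r_t = lstd_like_solve(A, b, Phi, xi, P)
print(r_t)    # approaches r* as t grows, whether or not Pi*A is a contraction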


JUSTIFICATION W/ LAW OF LARGE NUMBERS

•   We will match terms in the exact optimality condition and the simulation-based version.

•   Let ξti be the relative frequency of i in row sampling up to time t.

•   We have

(1/(t+1)) Σ_{k=0..t} φ(ik) φ(ik)′ = Σ_{i=1..n} ξti φ(i) φ(i)′ ≈ Σ_{i=1..n} ξi φ(i) φ(i)′

(1/(t+1)) Σ_{k=0..t} φ(ik) bik = Σ_{i=1..n} ξti φ(i) bi ≈ Σ_{i=1..n} ξi φ(i) bi

•   Let ˆptij be the relative frequency of (i, j) in column sampling up to time t.

(1/(t+1)) Σ_{k=0..t} (aik jk / pik jk) φ(ik) φ(jk)′ = Σ_{i=1..n} ξti Σ_{j=1..n} (ˆptij aij / pij) φ(i) φ(j)′ ≈ Σ_{i=1..n} ξi Σ_{j=1..n} aij φ(i) φ(j)′


LSPE FOR OPTIMAL STOPPING EXTENDED

•   Consider a system of the form

x = T(x) = Af(x) + b,

where f : ℜn → ℜn is a mapping with scalar components of the form f(x) = (f1(x1), . . . , fn(xn)).

•   Assume that each fi : ℜ → ℜ is nonexpansive:

|fi(xi) − fi(x̄i)| ≤ |xi − x̄i|,   ∀ i, xi, x̄i ∈ ℜ

This guarantees that T is a contraction with respect to any weighted Euclidean norm ‖ · ‖ξ whenever A is a contraction with respect to that norm.

•   Algorithms similar to LSPE [approximating Φrk+1 = ΠT(Φrk)] are then possible.

•   Special case: In the optimal stopping problem of Section 6.4, x is the Q-factor corresponding to the continuation action, α ∈ (0, 1) is a discount factor, fi(xi) = min{ci, xi}, and A = αP, where P is the transition matrix for continuing.

•   If Σ_{j=1..n} pij < 1 for some state i, and 0 ≤ P ≤ Q, where Q is an irreducible transition matrix, then Π((1 − γ)I + γT) is a contraction with respect to ‖ · ‖ξ for all γ ∈ (0, 1), even with α = 1.
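
•   A minimal Python sketch of the deterministic analog of LSPE for this optimal stopping system, iterating Φrk+1 = ΠT(Φrk) with exact projection (no simulation); the data are made up, and ξ is taken as the steady-state distribution of P so that ΠT is a ‖ · ‖ξ-contraction:

import numpy as np

def projected_iteration_stopping(P, b, c, Phi, xi, alpha, iters=1000):
    """Iterate Phi r_{k+1} = Pi T(Phi r_k) for T(x) = alpha * P @ min(c, x) + b,
    with Pi the xi-weighted projection onto range(Phi)."""
    Xi = np.diag(xi)
    M = np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # r = M x gives Pi x = Phi r
    r = np.zeros(Phi.shape[1])
    for _ in range(iters):
        Tx = alpha * P @ np.minimum(c, Phi @ r) + b     # f_i(x_i) = min{c_i, x_i}
        r = M @ Tx
    return r

# Illustrative data: b = expected cost of continuing, c = cost of stopping
rng = np.random.default_rng(9)
n, alpha = 6, 0.95
P = rng.dirichlet(np.ones(n), size=n)      # transition matrix for continuing
b, c = rng.uniform(0, 1, n), rng.uniform(0, 5, n)
Phi = np.column_stack([np.ones(n), rng.standard_normal(n)])
xi = np.full(n, 1.0 / n)
for _ in range(1000):
    xi = xi @ P                            # steady-state distribution of P, so that
                                           # alpha*P (hence Pi T) contracts in ||.||_xi
r = projected_iteration_stopping(P, b, c, Phi, xi, alpha)
print(Phi @ r)     # approximate Q-factor of the continuation action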


BASIS FUNCTION ADAPTATION I

•   An important issue in ADP is how to select basis functions.

•   A possible approach is to introduce basis functions that are parametrized by a vector θ, and optimize over θ, i.e., solve the problem

min_{θ∈Θ} F( J̃(θ) )

where J̃(θ) is the solution of the projected equation.

•   One example is

F( J̃(θ) ) = ‖ J̃(θ) − T( J̃(θ) ) ‖²

•   Another example is

F( J̃(θ) ) = Σ_{i∈I} | J(i) − J̃(θ)(i) |²,

where I is a subset of states, and J(i), i ∈ I, are the costs of the policy at these states calculated directly by simulation.


BASIS FUNCTION ADAPTATION II

•   Some algorithm may be used to minimize F( J̃(θ) ) over θ.

•   A challenge here is that the algorithm should use low-dimensional calculations.

•   One possibility is to use a form of random search (the cross-entropy method); see the paper by Menache, Mannor, and Shimkin (Annals of Oper. Res., Vol. 134, 2005)

•   Another possibility is to use a gradient method. For this it is necessary to estimate the partial derivatives of J̃(θ) with respect to the components of θ.

•   It turns out that by differentiating the projected equation, these partial derivatives can be calculated using low-dimensional operations. See the paper by Menache, Mannor, and Shimkin, and the paper by Yu and Bertsekas (2009).


APPROXIMATION IN POLICY SPACE I

•   Consider an average cost problem, where the problem data are parametrized by a vector r, i.e., a cost vector g(r) and a transition probability matrix P(r). Let η(r) be the (scalar) average cost per stage, satisfying Bellman's equation

η(r)e + h(r) = g(r) + P(r)h(r)

where h(r) is the corresponding differential cost vector.

•   Consider minimizing η(r) over r (here the data dependence on control is encoded in the parametrization). We can try to solve the problem by nonlinear programming/gradient descent methods.

•   Important fact: If ∆η is the change in η due to a small change ∆r from a given r, we have

∆η = ξ′(∆g + ∆P h),

where ξ is the steady-state probability distribution/vector corresponding to P(r), and all the quantities above are evaluated at r:

∆η = η(r + ∆r) − η(r),

∆g = g(r + ∆r) − g(r),   ∆P = P(r + ∆r) − P(r)
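
•   A small Python sketch that checks ∆η ≈ ξ′(∆g + ∆P h) numerically on a made-up parametrized 3-state chain (the mixture parametrization below is arbitrary, purely for illustration):

import numpy as np

# Made-up parametrized average cost problem on 3 states
P0 = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])
P1 = np.array([[0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.5, 0.2, 0.3]])
g0, g1 = np.array([1.0, 2.0, 0.5]), np.array([0.3, -0.2, 0.8])
e = np.ones(3)

def P(r): return (1 - r) * P0 + r * P1      # r in (0, 1): mixture parametrization
def g(r): return g0 + r * g1

def eta_h_xi(r):
    """Average cost, differential cost (normalized so xi'h = 0), steady-state dist."""
    xi = np.full(3, 1 / 3)
    for _ in range(2000):
        xi = xi @ P(r)                      # steady-state distribution of P(r)
    eta = xi @ g(r)
    # Bellman: eta*e + h = g + P h, solved with the normalization xi'h = 0
    h = np.linalg.solve(np.eye(3) - P(r) + np.outer(e, xi), g(r) - eta * e)
    return eta, h, xi

r, dr = 0.4, 1e-4
eta, h, xi = eta_h_xi(r)
eta2, _, _ = eta_h_xi(r + dr)
dg, dP = g(r + dr) - g(r), P(r + dr) - P(r)
print(eta2 - eta, xi @ (dg + dP @ h))       # the two sides agree to first order in dr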


APPROXIMATION IN POLICY SPACE II

•   Proof of the gradient formula: We have, by "differentiating" Bellman's equation,

∆η(r)·e + ∆h(r) = ∆g(r) + ∆P(r)h(r) + P(r)∆h(r)

By left-multiplying with ξ′,

ξ′∆η(r)·e + ξ′∆h(r) = ξ′( ∆g(r) + ∆P(r)h(r) ) + ξ′P(r)∆h(r)

Since ξ′∆η(r)·e = ∆η(r) and ξ′ = ξ′P(r), this equation simplifies to

∆η = ξ′(∆g + ∆P h)

•   Since we don't know ξ, we cannot implement a gradient-like method for minimizing η(r). An alternative is to use "sampled gradients", i.e., generate a simulation trajectory (i0, i1, . . .), and change r once in a while, in the direction of a simulation-based estimate of ξ′(∆g + ∆P h).

•   Important Fact: ∆η can be viewed as an expected value!

•   There is much research on this subject; see e.g., the work of Marbach and Tsitsiklis, and Konda and Tsitsiklis, and the refs given there.


6.231 DYNAMIC PROGRAMMING

OVERVIEW-EPILOGUE

LECTURE OUTLINE

•   Finite horizon problems

−  Deterministic vs Stochastic

−  Perfect vs Imperfect State Info

•   Infinite horizon problems

−  Stochastic shortest path problems

−  Discounted problems

−  Average cost problems


FINITE HORIZON PROBLEMS - ANALYSIS

•   Perfect state info

−   A general formulation - Basic problem, DP algorithm

−   A few nice problems admit analytical solution

•   Imperfect state info

−   Sufficient statistics - Reduction to perfect state info

−   Very few nice problems admit analytical solution

−   Finite-state problems admit reformulation as perfect state info problems whose states are prob. distributions (the belief vectors)


FINITE HORIZON PROBS - EXACT COMP. SOL

•   Deterministic finite-state problems

−   Equivalent to shortest path

−   A wealth of fast algorithms

−   Hard combinatorial problems are a special case (but # of states grows exponentially)

•   Stochastic perfect state info problems

−  The DP algorithm is the only choice

−  Curse of dimensionality is the big bottleneck

•   Imperfect state info problems

−  Forget it!

−  Only trivial examples admit an exact computational solution


FINITE HORIZON PROBS - APPROX. SOL.

•  Many techniques (and combinations thereof) to choose from

•   Simplification approaches

−   Certainty equivalence

−  Problem simplification

−  Rolling horizon

−  Aggregation - Coarse grid discretization

•  Limited lookahead combined with:

−   Rollout

−  MPC (an important special case)

−  Feature-based cost function approximation

•   Approximation in policy space

−   Gradient methods

−  Random search


INFINITE HORIZON PROBLEMS - ANALYSIS

•   A more extensive theory

•   Bellman's equation

•   Optimality conditions

•   Contraction mappings

•   A few nice problems admit analytical solution

•   Idiosyncrasies of average cost problems

•   Idiosyncrasies of problems with no underlying contraction

•   Elegant analysis


INF. HORIZON PROBS - EXACT COMP. SOL.

•   Value iteration

−   Variations (asynchronous, modified, etc)

•   Policy iteration

−  Variations (asynchronous, based on value iteration, etc)

•   Linear programming

•   Elegant algorithmic analysis

•   Curse of dimensionality is a major bottleneck


INFINITE HORIZON PROBS - ADP