
Dynamic Programming with Applications

Class Notes

René Caldentey
Stern School of Business, New York University

Spring 2011


Preface

These lecture notes are based on the material that my colleague Gustavo Vulcano uses in the Dynamic Programming Ph.D. course that he regularly teaches at the New York University Leonard N. Stern School of Business.

Part of this material is based on the widely used Dynamic Programming and Optimal Control textbook by Dimitri Bertsekas, including a set of lecture notes publicly available on the textbook's website: http://www.athenasc.com/dpbook.html

However, I have added some additional material on Optimal Control for deterministic systems (Chapter 1) and for point processes (Chapter 6). I have also tried to add more applications related to Operations Management.

The booklet is organized in six chapters. We will cover each chapter in a 3-hour lecture, except for Chapter 2, where we will spend two 3-hour lectures. The details of each session are presented in the syllabus. I hope that you will find the material useful!


Contents

1 Deterministic Optimal Control
  1.1 Introduction to Calculus of Variations
    1.1.1 Abstract Vector Space
    1.1.2 Classical Calculus of Variations
    1.1.3 Exercises
  1.2 Continuous-Time Optimal Control
  1.3 Pontryagin Minimum Principle
    1.3.1 Weak & Strong Extremals
    1.3.2 Necessary Conditions
  1.4 Deterministic Dynamic Programming
    1.4.1 Value Function
    1.4.2 DP's Partial Differential Equations
    1.4.3 Feedback Control
    1.4.4 The Linear-Quadratic Problem
  1.5 Extensions
    1.5.1 The Method of Characteristics for First-Order PDEs
    1.5.2 Optimal Control and Myopic Solution
  1.6 Exercises
  1.7 Exercises

2 Discrete Dynamic Programming
  2.1 Discrete-Time Formulation
    2.1.1 Markov Decision Processes
  2.2 Deterministic DP and the Shortest Path Problem
    2.2.1 Deterministic finite-state problem
    2.2.2 Backward and forward DP algorithms
    2.2.3 Generic shortest path problems
    2.2.4 Some shortest path applications
    2.2.5 Shortest path algorithms
    2.2.6 Alternative shortest path algorithms: Label correcting methods
    2.2.7 Exercises
  2.3 Stochastic Dynamic Programming
  2.4 The Dynamic Programming Algorithm
    2.4.1 Exercises
  2.5 Linear-Quadratic Regulator
    2.5.1 Preliminaries: Review of linear algebra and quadratic forms
    2.5.2 Problem setup
    2.5.3 Properties
    2.5.4 Derivation
    2.5.5 Asymptotic behavior of the Riccati equation
    2.5.6 Random system matrices
    2.5.7 On certainty equivalence
    2.5.8 Exercises
  2.6 Modular functions and monotone policies
    2.6.1 Lattices
    2.6.2 Supermodularity and increasing differences
    2.6.3 Parametric monotonicity
    2.6.4 Applications to DP
  2.7 Extensions
    2.7.1 The Value of Information
    2.7.2 State Augmentation
    2.7.3 Forecasts
    2.7.4 Multiplicative Cost Functional

3 Applications
  3.1 Inventory Control
    3.1.1 Problem setup
    3.1.2 Structure of the cost function
    3.1.3 Positive fixed cost and (s, S) policies
    3.1.4 Exercises
  3.2 Single-Leg Revenue Management
    3.2.1 System with observable disturbances
    3.2.2 Structure of the value function
    3.2.3 Structure of the optimal policy
    3.2.4 Computational complexity
    3.2.5 Airlines: Practical implementation
    3.2.6 Exercises
  3.3 Optimal Stopping and Scheduling Problems
    3.3.1 Optimal stopping problems
    3.3.2 General stopping problems and the one-step look ahead policy
    3.3.3 Scheduling problem
    3.3.4 Exercises

4 DP with Imperfect State Information
  4.1 Reduction to the perfect information case
  4.2 Linear-Quadratic Systems and Sufficient Statistics
    4.2.1 Linear-Quadratic systems
    4.2.2 Implementation aspects – Steady-state controller
    4.2.3 Sufficient statistics
    4.2.4 The conditional state distribution recursion
  4.3 Sufficient Statistics
    4.3.1 Conditional state distribution: Review of basics
    4.3.2 Finite-state systems
  4.4 Exercises

5 Infinite Horizon Problems
  5.1 Types of infinite horizon problems
    5.1.1 Preview of infinite horizon results
    5.1.2 Total cost problem formulation
  5.2 Stochastic shortest path problems
    5.2.1 Computational approaches
  5.3 Discounted problems
  5.4 Average cost-per-stage problems
    5.4.1 General setting
    5.4.2 Associated stochastic shortest path (SSP) problem
    5.4.3 Heuristic argument
    5.4.4 Bellman's equation
    5.4.5 Computational approaches
  5.5 Semi-Markov Decision Problems
    5.5.1 General setting
    5.5.2 Problem formulation
    5.5.3 Discounted cost problems
    5.5.4 Average cost problems
  5.6 Application: Multi-Armed Bandits
  5.7 Exercises

6 Point Process Control
  6.1 Basic Definitions
  6.2 Counting Processes
  6.3 Optimal Intensity Control
    6.3.1 Dynamic Programming for Intensity Control
  6.4 Applications to Revenue Management
    6.4.1 Model Description and HJB Equation
    6.4.2 Bounds and Heuristics

7 Papers and Additional Readings


Chapter 1

Deterministic Optimal Control

In this chapter, we discuss the basic Dynamic Programming framework in the context of deterministic, continuous-time, continuous-state-space control.

1.1 Introduction to Calculus of Variations

Given a function f : X → R, we are interested in characterizing a solution to

min_{x∈X} f(x),    [*]

where X is a finite-dimensional space, e.g., in classical calculus X ⊆ Rn.

If n = 1 and X = [a, b], then under some smoothness conditions we can characterize solutions to [*] through a set of necessary conditions.

Necessary conditions for a minimum at x∗:

- Interior point: f ′(x∗) = 0, f ′′(x∗) ≥ 0, and a < x∗ < b.

- Left Boundary: f ′(x∗) ≥ 0 and x∗ = a.

- Right Boundary: f ′(x∗) ≤ 0 and x∗ = b.

Existence: If f is continuous on [a, b] then it has a minimum on [a, b].

Uniqueness: If f is strictly convex on [a, b] then it has a unique minimum on [a, b].
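As a quick numerical illustration of these conditions, the following minimal Python sketch locates the minimizer of a sample function on [a, b] by grid search and checks the corresponding first- and second-order conditions by finite differences (the function f and the intervals are arbitrary illustrative choices, not taken from the notes):

```python
import numpy as np

def check_minimum_conditions(f, a, b, n=10_001, h=1e-6):
    """Locate the minimizer of f on [a, b] by grid search and report which of the
    necessary conditions (interior / left boundary / right boundary) applies."""
    xs = np.linspace(a, b, n)
    x_star = xs[np.argmin(f(xs))]
    d1 = (f(x_star + h) - f(x_star - h)) / (2 * h)                 # f'(x*)
    d2 = (f(x_star + h) - 2 * f(x_star) + f(x_star - h)) / h ** 2  # f''(x*)
    if np.isclose(x_star, a):
        print(f"left boundary:  x*={x_star:.3f}, f'(x*)={d1:.3f} (expect >= 0)")
    elif np.isclose(x_star, b):
        print(f"right boundary: x*={x_star:.3f}, f'(x*)={d1:.3f} (expect <= 0)")
    else:
        print(f"interior: x*={x_star:.3f}, f'(x*)={d1:.3f} (expect 0), f''(x*)={d2:.3f} (expect >= 0)")

# f(x) = (x - 0.3)^2: interior minimum on [0, 1], left-boundary minimum on [0.5, 1].
check_minimum_conditions(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
check_minimum_conditions(lambda x: (x - 0.3) ** 2, 0.5, 1.0)
```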

1.1.1 Abstract Vector Space

Consider a general optimization problem:

min_{x∈D} J(x)    [**]

where D is a subset of a vector space V.

We consider functions ζ = ζ(ε) : [a, b] → D such that the composite J ∘ ζ is differentiable. Suppose that x∗ ∈ D and J(x∗) ≤ J(x) for all x ∈ D. In addition, let ζ be such that ζ(ε∗) = x∗. Then (necessary conditions):


- Interior point: d/dε J(ζ(ε))|_{ε=ε∗} = 0, d²/dε² J(ζ(ε))|_{ε=ε∗} ≥ 0, and a < ε∗ < b.

- Left Boundary: d/dε J(ζ(ε))|_{ε=ε∗} ≥ 0 and ε∗ = a.

- Right Boundary: d/dε J(ζ(ε))|_{ε=ε∗} ≤ 0 and ε∗ = b.

How do we use these necessary conditions to identify “good candidates” for x∗?

Extremals and Gâteaux Variations

Definition 1.1.1

Let (V, ‖ · ‖) be a normed linear space and let D ⊆ V.

– We say that a point x∗ ∈ D is an extremal point for a real-valued function J on D if

J(x∗) ≤ J(x) for all x ∈ D, or J(x∗) ≥ J(x) for all x ∈ D.

– A point x0 ∈ D is called a local extremal point for J if for some ε > 0, x0 is an extremal point on D_ε(x0) := {x ∈ D : ‖x − x0‖ < ε}.

– A point x ∈ D is an internal (radial) point of D in the direction v ∈ V if

∃ ε(v) > 0 such that x + εv ∈ D for all |ε| < ε(v) (respectively, 0 ≤ ε < ε(v)).

– The directional derivative of order n of J at x in the direction v is denoted by

δ^n J(x; v) = d^n/dε^n J(x + εv)|_{ε=0}.

– J is Gâteaux-differentiable at x if x is an internal point in the direction v and δJ(x; v) exists for all v ∈ V.

Theorem 1.1.1 (Necessary Conditions) Let (V, ‖·‖) be a normed linear space. If J has a (local) extremal at a point x∗ on D, then δJ(x∗; v) = 0 for all v ∈ V such that (i) x∗ is an internal point in the direction v and (ii) δJ(x∗; v) exists.

This result is useful if there are "enough" directions v so that the condition δJ(x∗; v) = 0 can determine x∗.
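In practice, a Gâteaux variation can be approximated by a one-dimensional finite difference in ε. The sketch below does this for a discretized version of the functional J(y) = ∫ y²(x) dx of Problem 1.1.1 below; the grid, the candidate y, and the test direction v are arbitrary choices of mine and only illustrate the definition δJ(x; v) = d/dε J(x + εv)|_{ε=0}.

```python
import numpy as np

def gateaux_variation(J, y, v, eps=1e-6):
    """Central-difference approximation of delta J(y; v) = d/d(eps) J(y + eps*v) at eps = 0."""
    return (J(y + eps * v) - J(y - eps * v)) / (2 * eps)

# Discretize J(y) = int_a^b y(x)^2 dx on [a, b] = [0, 1] with a simple Riemann sum.
xs = np.linspace(0.0, 1.0, 201)
dx = xs[1] - xs[0]
J = lambda y: np.sum(y ** 2) * dx

y = xs ** 2               # a candidate function
v = np.sin(np.pi * xs)    # admissible direction: vanishes at both endpoints

# Analytically, delta J(y; v) = 2 * int y(x) v(x) dx; the two numbers should agree closely.
print(gateaux_variation(J, y, v), 2 * np.sum(y * v) * dx)
```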

Problem 1.1.1

1. Find the extremal points for

J(y) = ∫_a^b y^2(x) dx

on the domain D = {y ∈ C[a, b] : y(a) = α and y(b) = β}.

2. Find the extremal for

J(P) = ∫_a^b P(t) D(P(t)) dt

on the domain D = {P ∈ C[a, b] : P(t) ≤ ξ}.


Extremal with Constraints

Suppose that in a normed linear space (V, ‖·‖) we want to characterize extremal points for a real-valued function J on a domain D ⊆ V. Suppose that the domain is given by the level set D := {x ∈ V : G(x) = ψ}, where G is a real-valued function on V and ψ ∈ R.

Let x∗ be a (local) extremal point. We will assume that both J and G are defined in a neighborhood of x∗. We pick an arbitrary pair of directions v, w and define the mapping

F_{v,w}(r, s) := (ρ(r, s), σ(r, s)) = (J(x∗ + rv + sw), G(x∗ + rv + sw)),

which is well defined in a neighborhood of the origin.

Suppose F maps a neighborhood of 0 in the (r, s) plane onto a neighborhood of (ρ∗, σ∗) := (J(x∗), G(x∗)) in the (ρ, σ) plane. Then x∗ cannot be an extremal point of J.


This condition is assured if F has an inverse which is continuous at (ρ∗, σ∗).

Theorem 1.1.2 For x ∈ R^n and a neighborhood N(x), if a vector-valued function F : N(x) → R^n has continuous first partial derivatives in each component with nonvanishing Jacobian determinant at x, then F provides a continuously invertible mapping between a neighborhood of x and a region containing a full neighborhood of F(x).

In our case, x = 0 and the Jacobian matrix of F is given by

∇F(0, 0) = [ δJ(x∗; v)  δJ(x∗; w)
             δG(x∗; v)  δG(x∗; w) ]

Thus, if |∇F(0, 0)| ≠ 0, then x∗ cannot be an extremal point for J when constrained to the level set defined by G(x∗).


Definition 1.1.2 In a normed linear space (V, ‖·‖), the Gâteaux variations δJ(x; v) of a real-valued function J are said to be weakly continuous at x∗ ∈ V if, for each v ∈ V, δJ(x; v) → δJ(x∗; v) as x → x∗.

Theorem 1.1.3 (Lagrange) In a normed linear space (V, ‖·‖), if real-valued functions J and G are defined in a neighborhood of x∗, a (local) extremal point for J constrained by the level set G(x∗), and have weakly continuous Gâteaux variations there, then either

a) δG(x∗; w) = 0 for all w ∈ V, or

b) there exists a constant λ ∈ R such that δJ(x∗; v) = λ δG(x∗; v) for all v ∈ V.

Problem 1.1.2 Find the extremal for

J(P) = ∫_0^T P(t) D(P(t)) dt

on the domain D = {P ∈ C[0, T] : ∫_0^T D(P(t)) dt = I}.

1.1.2 Classical Calculus of Variations

Historical Background

The theory of Calculus of Variations has been the "classic" approach to solving dynamic optimization problems, dating back to the late 17th century. It started with the Brachistochrone problem proposed by Johann Bernoulli in 1696: find the planar curve which would provide the fastest time of transit to a particle sliding down it under the action of gravity (see Figure 1.1.2). Solutions were proposed by Jakob Bernoulli (Johann's brother), Newton, Leibniz, and L'Hospital, among others. Another classical example of the method of calculus of variations is the geodesic problem: find the shortest path in a given domain connecting two of its points (e.g., the shortest path on a sphere).

Figure 1.1.2: The Brachistochrone problem: find the curve which would provide the fastest time of transit to a particle sliding down it from point A to point B under the action of gravity.


More generally, calculus of variations problems involve finding (possibly multidimensional) curves x(t) with certain optimality properties. In general, the calculus of variations approach requires the differentiability of the functions that enter the problem in order to get "interior solutions".

The Simplest Problem in Calculus of Variations

J(x) = ∫_a^b L(t, x(t), ẋ(t)) dt,

where ẋ(t) = d/dt x(t). The variational integrand L is assumed to be smooth enough (e.g., at least C^2).

Example 1.1.1

– Geodesic: L = √(1 + ẋ^2)

– Brachistochrone: L = √( (1 + ẋ^2) / (x − α) )

– Minimal Surface of Revolution: L = x √(1 + ẋ^2).

Admissible Solutions: A function x(t) is called piecewise C^n on [a, b] if x(t) is C^{n−1} on [a, b] and x^(n)(t) is piecewise continuous on [a, b], i.e., continuous except on a finite number of points. We denote by H[a, b] the vector space of all real-valued piecewise C^1 functions on [a, b] and by H_e[a, b] the subspace of H[a, b] such that x(a) = x_a and x(b) = x_b for all x ∈ H_e[a, b].

Problem: min_{x ∈ H_e[a,b]} J(x).

Admissible Variations or Test Functions: Let Y[a, b] ⊆ H[a, b] be the subspace of piecewise C^1 functions y such that

y(a) = y(b) = 0.

We note that for x ∈ H_e[a, b], y ∈ Y[a, b], and ε ∈ R, the function x + εy ∈ H_e[a, b].

Theorem 1.1.4 Let J have a minimum on H_e[a, b] at x∗. Then

L_ẋ − ∫_a^t L_x dτ = constant for all t ∈ [a, b].    (1.1.1)

A function x∗(t) satisfying (1.1.1) is called an extremal.

Corollary 1.1.1 (Euler's Equation) Every extremal x∗ satisfies the differential equation

L_x = d/dt L_ẋ.
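For instance, with the geodesic integrand L = √(1 + ẋ^2), Euler's equation gives d/dt ( ẋ/√(1 + ẋ^2) ) = 0, so extremals are straight lines. The following small numerical check (a sketch with an arbitrary grid and endpoint values, not part of the notes) discretizes J and confirms that the straight line has a smaller cost than nearby admissible perturbations:

```python
import numpy as np

def arc_length(x, dt):
    """Discretized J(x) = int sqrt(1 + xdot^2) dt using forward differences."""
    xdot = np.diff(x) / dt
    return np.sum(np.sqrt(1.0 + xdot ** 2)) * dt

# Fixed endpoints x(0) = 0, x(1) = 2 on a uniform grid over [a, b] = [0, 1].
n = 200
dt = 1.0 / n
t = np.linspace(0.0, 1.0, n + 1)
straight = 2.0 * t                                    # the extremal: a straight line

rng = np.random.default_rng(0)
for _ in range(5):
    bump = np.sin(np.pi * t) * rng.normal(scale=0.2)  # perturbation vanishing at the endpoints
    assert arc_length(straight + bump, dt) >= arc_length(straight, dt)

print("straight-line cost:", arc_length(straight, dt))  # close to sqrt(1 + 2^2) = sqrt(5)
```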

Problem 1.1.3 (Production-Inventory Control)

Consider a firm that operates according to a make-to-stock policy during a planning horizon [0, T ]. The

company faces an exogenous and deterministic demand with intensity λ(t). Production is costly; if the

firm chooses a production rate µ at time t then the instantaneous production cost rate is c(t, µ). In


addition, there are holding and backordering costs. We denote by h(t, I) the holding/backordering cost

rate if the inventory position at time t is I. We suppose that the company starts with an initial inventory

I0 and tries to minimize total operating costs during the planning horizon of length T > 0 subject to

the requirement that the final inventory position at time T is IT .

a) Formulate the optimization problem as a calculus of variations problem.

b) What is Euler’s equation?

Sufficient Conditions: Weierstrass Method

Suppose that x∗ is an extremal for

J(x) = ∫_a^b f(t, x(t), ẋ(t)) dt := ∫_a^b f[x(t)] dt

on D = {x ∈ C^1[a, b] : x(a) = x∗(a), x(b) = x∗(b)}. Let x̄(t) ∈ D be an arbitrary feasible solution.

For each τ ∈ (a, b] we define the function Ψ(t; τ) on (a, τ) such that Ψ(t; τ) is an extremal function for f on (a, τ) whose graph joins (a, x∗(a)) to (τ, x̄(τ)), and such that Ψ(t; b) = x∗(t).

[Figure: the extremal x∗ from (a, x∗(a)) to (b, x∗(b)), an arbitrary feasible solution x̄, and the comparison extremal Ψ(·; τ) joining (a, x∗(a)) to (τ, x̄(τ)).]

We define

σ(τ) := −∫_a^τ f[Ψ(t; τ)] dt − ∫_τ^b f[x̄(t)] dt,

which has the following properties:

σ(a) = −∫_a^b f[x̄(t)] dt = −J(x̄) and σ(b) = −∫_a^b f[Ψ(t; b)] dt = −J(x∗).

Therefore, we have that

J(x̄) − J(x∗) = σ(b) − σ(a) = ∫_a^b σ̇(τ) dτ,

so that a sufficient condition for the optimality of x∗ is σ̇(τ) ≥ 0. That is,

Weierstrass' formula

σ̇(τ) = E(τ, x̄(τ), Ψ̇(τ; τ), dx̄(τ)/dτ)
      = f[x̄(τ)] − f(τ, x̄(τ), Ψ̇(τ; τ)) − f_ẋ(τ, x̄(τ), Ψ̇(τ; τ)) · (dx̄(τ)/dτ − Ψ̇(τ; τ)) ≥ 0.


1.1.3 Exercises

Exercise 1.1.1 (Convexity and Euler’s Equation) Let V be a linear vector space and D a subset of V.

A real-valued function f defined on D is said to be [strictly] convex on D if

f(y + v) − f(y) ≥ δf(y; v) for all y, y + v ∈ D,

[with equality if and only if v = 0], where δf(y; v) is the first Gâteaux variation of f at y in the direction v.

a) Prove the following: If f is [strictly] convex on D then each x∗ ∈ D for which δf(x∗; y) = 0 for

all x∗ + y ∈ D minimizes f on D [uniquely].

Let f = f(x, y, z) be a real-valued function on S = [a, b] × R^2. Assume that f and the partial derivatives f_y and f_z are defined and continuous on S. For all y ∈ C^1[a, b] we define the integral function

F(y) = ∫_a^b f(x, y(x), y′(x)) dx := ∫_a^b f[y(x)] dx,

where f[y(x)] = f(x, y(x), y′(x)).

b) Prove that the first Gâteaux variation of F is given by

δF(y; v) = ∫_a^b ( f_y[y(x)] v(x) + f_z[y(x)] v′(x) ) dx.

c) Let D be a domain in R^2. For two arbitrary real numbers α and β define

D_{α,β}[a, b] = {y ∈ C^1[a, b] : y(a) = α, y(b) = β, and (y(x), y′(x)) ∈ D for all x ∈ [a, b]}.

Prove that if f(x, y, z) is convex on [a, b] × D then

1. F(y) defined above is convex on D_{α,β}[a, b], and

2. each y ∈ D_{α,β}[a, b] for which

d/dx f_z[y(x)] = f_y[y(x)]    [*]

on (a, b) satisfies δF(y; v) = 0 for all y + v ∈ D_{α,β}[a, b].

Conclude that such a y satisfying [*] minimizes F on D_{α,β}[a, b]. That is, extremal solutions are minimizers.

Exercise 1.1.2 (du Bois-Reymond's Lemma) The proof of Euler's equation uses du Bois-Reymond's Lemma:

If h ∈ C[a, b] and ∫_a^b h(x) v′(x) dx = 0 for all v ∈ D_0 = {v ∈ C^1[a, b] : v(a) = v(b) = 0},

then h = constant on [a, b]. Using this lemma, prove the following more general results.


a) If g, h ∈ C[a, b] and ∫_a^b [g(x) v(x) + h(x) v′(x)] dx = 0 for all v ∈ D_0 = {v ∈ C^1[a, b] : v(a) = v(b) = 0}, then h ∈ C^1[a, b] and h′ = g.

b) If h ∈ C[a, b] and for some m = 1, 2, . . . we have ∫_a^b h(x) v^(m)(x) dx = 0 for all v ∈ D_0^(m) = {v ∈ C^m[a, b] : v^(k)(a) = v^(k)(b) = 0, k = 0, 1, 2, . . . , m − 1}, then on [a, b], h is a polynomial of degree ≤ m − 1.

Exercise 1.1.3 Suppose you have inherited a large sum S and plan to spend it so as to maximize your discounted cumulative utility for the next T units of time. Let u(t) be the amount that you expend at time t and let √u(t) be the instantaneous utility rate that you receive at time t. Let β be the discount factor that you use to discount future utility, i.e., the discounted value of expending u at time t is equal to exp(−βt)√u. Let α be the risk-free interest rate available on the market, i.e., one dollar today is equivalent to exp(αt) dollars t units of time in the future.

a) Formulate the control problem that maximizes the discounted cumulative utility given all necessary

constraints.

b) Find the optimal expenditure rate u(t) for all t ∈ [0, T ].

Exercise 1.1.4 (Production-Inventory Problem) Consider a make-to-stock manufacturing facility producing a single type of product. Initial inventory at time t = 0 is I0. The demand rate for the next selling season [0, T] is known and equal to λ(t), t ∈ [0, T]. We denote by µ(t) the production rate and by I(t) the inventory position. Suppose that due to poor inventory management there is a fixed proportion α of inventory that is lost per unit time. Thus, at time t the inventory I(t) increases at a rate µ(t) and decreases at a rate λ(t) + αI(t).

Suppose the company has set target values for the inventory and production rate during [0, T]. Let Ī and P̄ be these target values, respectively. Deviations from these values are costly, and the company uses the following cost function C(I, P) to evaluate a production-inventory strategy (P, I):

C(I, P) = ∫_0^T [ β^2 (Ī − I(t))^2 + (P̄ − P(t))^2 ] dt.

The objective of the company is to find an optimal production-inventory strategy that minimizes the cost function subject to the additional condition that I(T) = Ī.

a) Rewrite the cost function C(I, P ) as a function of the inventory position and its first derivative

only.

b) Find the optimal production-inventory strategy.


1.2 Continuous-Time Optimal Control

The Optimal Control problem that we study in this section, and in particular the optimality conditions that we derive (the HJB equation and the Pontryagin Minimum Principle), will provide us with an alternative and powerful method to solve the variational problems discussed in the previous section. This new method is not only useful as a solution technique but also as an insightful methodology to understand how dynamic programming works.

Compared to the method of Calculus of Variations, Optimal Control theory is a more modern and flexible approach that requires less stringent differentiability conditions and can handle corner solutions. In fact, calculus of variations problems can be reformulated as optimal control problems, as we show later in this section.

The first, and most fundamental, step in the derivation of these new solution techniques is the notion of a System Equation:

• System Equation (also called equation of motion or system dynamics):

ẋ(t) = f(t, x(t), u(t)), 0 ≤ t ≤ T, x(0) given,

i.e.,

dx_i(t)/dt = f_i(t, x(t), u(t)), i = 1, . . . , n,

where:

– x(t) ∈ R^n is the state vector at time t,

– ẋ(t) is the derivative of x(t) with respect to t,

– u(t) ∈ U ⊆ Rm is the control vector at time t,

– T is the terminal time.

• Assumptions:

– An admissible control trajectory is a piecewise continuous function u(t) ∈ U, ∀t ∈ [0, T], that does not involve an infinite value of u(t) (i.e., all jumps are of finite size).

U could be a bounded control set. For instance, U could be a compact set such as U = [0, 1], so that corner solutions (boundary solutions) could be admitted. When this feature is combined with jump discontinuities on the control path, an interesting phenomenon called a bang-bang solution may result, where the control alternates between corners.

– An admissible state trajectory x(t) is continuous, but it could have a finite number of corners; i.e., it must be piecewise differentiable. A sharp point on the state trajectory occurs at a time when the control trajectory makes a jump.

Like admissible control paths, admissible state paths must have a finite x(t) value for every t ∈ [0, T]. See Figure 1.2.1 for an illustration of a control path and the associated state path.

– The control trajectory {u(t) | t ∈ [0, T]} uniquely determines {x^u(t) | t ∈ [0, T]}. We will drop the superscript u from now on, but this dependence should be clear. In a more rigorous treatment, the issue of existence and uniqueness of the solution should be addressed more carefully.


Figure 1.2.1: Control and state paths for a continuous-time optimal control problem under the regularity assumptions.

• Objective: Find an admissible policy (control trajectory) {u(t) | t ∈ [0, T]} and corresponding state trajectory that optimizes a given functional J of the state x = (x_t : 0 ≤ t ≤ T). The following are some common formulations for the functional J and the associated optimal control problem.

Lagrange Problem:

min_{u∈U} J(x) = ∫_0^T g(t, x(t), u(t)) dt

subject to ẋ(t) = f(t, x(t), u(t)), x(0) = x0 (system dynamics)
φ(x(T)) = 0 (boundary conditions).

Mayer Problem:

min_{u∈U} h(x(T))

subject to ẋ(t) = f(t, x(t), u(t)), x(t0) = x0 (system dynamics)
φ(x(T)) = 0 (boundary conditions).

Bolza Problem:

min_{u∈U} h(x(T)) + ∫_0^T g(t, x(t), u(t)) dt

subject to ẋ(t) = f(t, x(t), u(t)), x(t0) = x0 (system dynamics)
φ(x(T)) = 0 (boundary conditions).

The functions f, h, g and φ are normally assumed to be continuously differentiable with respect to x, and f, g are continuous with respect to t and u.


Problem 1.2.1 Show that all three versions of the optimal control problem are equivalent.

Example 1.2.1 (Motion Control) A unit mass moves on a line under the influence of a force u.

Here, u = force = acceleration. (Recall from physics that force = mass × acceleration, with mass = 1 in

this case).

• State: x(t) = (x1(t), x2(t)), where x1 represents position and x2 represents velocity.

• Problem: From a given initial (x1(0), x2(0)), bring the mass near a given final position-velocity pair (x̄1, x̄2) at time T, in the sense that it minimizes

|x1(T) − x̄1|^2 + |x2(T) − x̄2|^2,

such that |u(t)| ≤ 1, ∀ t ∈ [0, T].

• System Equation:

ẋ1(t) = x2(t)
ẋ2(t) = u(t)

• Costs:

h(x(T)) = (x1(T) − x̄1)^2 + (x2(T) − x̄2)^2
g(x(t), u(t)) = 0, ∀t ∈ [0, T].
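A rough simulation of this system is easy to write and helps build intuition. The sketch below (not from the notes; the control law and the target pair are arbitrary illustrative choices) integrates the double integrator with a simple Euler scheme under a bang-bang control that accelerates for half the horizon and then brakes:

```python
import numpy as np

def simulate(u_func, x0, T=1.0, n=1000):
    """Euler discretization of x1dot = x2, x2dot = u for a feedback law u_func(t, x1, x2)."""
    dt = T / n
    x1, x2 = x0
    for k in range(n):
        u = np.clip(u_func(k * dt, x1, x2), -1.0, 1.0)   # enforce |u| <= 1
        x1, x2 = x1 + dt * x2, x2 + dt * u
    return x1, x2

bang_bang = lambda t, x1, x2: 1.0 if t < 0.5 else -1.0   # accelerate, then brake
x1_T, x2_T = simulate(bang_bang, x0=(0.0, 0.0))

x1_bar, x2_bar = 0.25, 0.0   # hypothetical target position-velocity pair
print("terminal cost:", (x1_T - x1_bar) ** 2 + (x2_T - x2_bar) ** 2)
```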

Example 1.2.2 (Resource Allocation) A producer with production rate x(t) at time t may allocate

a portion u(t) ∈ [0, 1] of her production rate to reinvestment (i.e., to increase the production rate) and

[1− u(t)] to produce a storable good. Assume a terminal cost h(x(T )) = 0.

• System Equation:

ẋ(t) = γ u(t) x(t), where γ > 0 is the reinvestment benefit, u(t) ∈ [0, 1].

• Problem: The producer wants to maximize the total amount of product stored:

max_{u(t)∈[0,1]} ∫_0^T (1 − u(t)) x(t) dt.

Assume x(0) is given.

Example 1.2.3 (An application of Calculus of Variations) Find a curve from a given point to a given vertical line that has minimum length. (Intuitively, this should be a straight line.) Figure 1.2.2 illustrates the formulation as an infinite sum of infinitely small hypotenuses.

• The problem in terms of calculus of variations is:

min ∫_0^T √(1 + (ẋ(t))^2) dt
s.t. x(0) = α.


Figure 1.2.2: Problem of finding a curve of minimum length from a given point to a given line, and its formulation as an optimal control problem.

• The corresponding optimal control problem is:

min_{u(t)} ∫_0^T √(1 + (u(t))^2) dt
s.t. ẋ(t) = u(t)
x(0) = α.

1.3 Pontryagin Minimum Principle

1.3.1 Weak & Strong Extremals

Let H[a, b] be the set of piecewise right-continuous functions with left limits (càdlàg). We define on H[a, b] two norms: for x ∈ H[a, b],

‖x‖ = sup_{t∈[a,b]} |x(t)| and ‖x‖_1 = ‖x‖ + ‖ẋ‖.

A set {x ∈ H[a, b] : ‖x − x∗‖_1 < ε} is called a weak neighborhood of x∗. A solution x∗ is called a weak solution if J(x∗) ≤ J(x) for all x in a weak neighborhood containing x∗.

A set {x ∈ H[a, b] : ‖x − x∗‖ < ε} is called a strong neighborhood of x∗. A solution x∗ is called a strong solution if J(x∗) ≤ J(x) for all x in a strong neighborhood containing x∗.

Example 1.3.1

min_x J(x) = ∫_{−1}^{1} (x(t) − sign(t))^2 dt + Σ_{t∈[−1,1]} (x(t) − x(t−))^2,

where x(t−) = lim_{τ↑t} x(τ).


1.3.2 Necessary Conditions

Given a control u ∈ U with corresponding trajectory x(t), we consider the following family of variations:

For a fixed direction v ∈ U, τ ∈ [0, T], and η > 0 small, we define the "strong" variation ξ of u(t) in the direction v by the function

ξ : {0 ≤ ε ≤ η} → U, ε ↦ ξ(ε) = u_ε,

where

u_ε(t) = v if t ∈ (τ − ε, τ], and u_ε(t) = u(t) if t ∈ [0, T] ∩ (τ − ε, τ]^c.

Figure 1.3.1: An example of strong and weak variations.

Lemma 1.3.1 For a real variable ε, let x_ε(t) be the solution of ẋ_ε(t) = f(t, x_ε(t), u(t)) on [0, T] with initial condition

x_ε(0) = x(0) + εy + o(ε).

Then,

x_ε(t) = x(t) + εδ(t) + o(t, ε),

where δ(t) is the solution of

δ̇(t) = f_x(t, x(t), u(t)) δ(t), t ∈ [0, T], and δ(0) = y.

Lemma 1.3.2 If x_ε are solutions to ẋ_ε(t) = f(t, x_ε(t), u_ε(t)) with the same initial condition x_ε(0) = x0, then

x_ε(t) = x(t) + εδ(t) + o(t, ε),

where δ(t) solves

δ(t) = 0 if 0 ≤ t < τ,
δ(t) = f(τ, x(τ), v) − f(τ, x(τ), u(τ)) + ∫_τ^t f_x(s, x(s), u(s)) δ(s) ds if τ ≤ t ≤ T.


Theorem 1.3.1 (Pontryagin Principle For Free Terminal Conditions)

- Mayer's formulation: Let P(t) be the solution of

Ṗ(t) = −P(t) f_x(t, x(t), u(t)), P(T) = −φ_x(x(T)).

A necessary condition for optimality of a control u is that

P(t) [f(t, x(t), v) − f(t, x(t), u(t))] ≤ 0

for each v ∈ U and t ∈ (0, T].

- Lagrange's formulation: We define the Hamiltonian H as

H(t, x, u) := P(t) f(t, x, u) − L(t, x, u),

where P(t) solves

Ṗ(t) = −∂H(t, x, u)/∂x

with boundary condition P(T) = 0. A necessary condition for a control u to be optimal is

H(t, x(t), v) − H(t, x(t), u(t)) ≤ 0 for all v ∈ U, t ∈ [0, T].
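These conditions suggest a simple numerical procedure, often called a forward-backward sweep (it is not described in the notes; this is only a generic sketch built on the Lagrange formulation above): integrate the state forward for the current control guess, integrate the adjoint P backward, replace the control by the pointwise maximizer of the Hamiltonian, and iterate. The scalar problem below — minimize ∫_0^1 (x^2 + u^2) dt with ẋ = u and x(0) = 1 — is a hypothetical example of mine.

```python
import numpy as np

# Hamiltonian for this instance: H(t, x, u) = P*u - (x^2 + u^2), maximized at u = P/2.
T, n = 1.0, 1000
dt = T / n
x0 = 1.0
u = np.zeros(n)                          # initial control guess

for _ in range(100):                     # forward-backward sweep iterations
    x = np.empty(n + 1); x[0] = x0       # forward pass: xdot = u
    for k in range(n):
        x[k + 1] = x[k] + dt * u[k]
    P = np.empty(n + 1); P[n] = 0.0      # backward pass: Pdot = -dH/dx = 2x, P(T) = 0
    for k in range(n - 1, -1, -1):
        P[k] = P[k + 1] - dt * (2.0 * x[k + 1])
    u = 0.5 * u + 0.5 * (P[:n] / 2.0)    # damped update toward the Hamiltonian maximizer

print("approximate optimal cost:", dt * np.sum(x[:n] ** 2 + u ** 2))  # analytic value: tanh(1) ~ 0.762
```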

Theorem 1.3.2 (Pontryagin Principle with Terminal Conditions)

- Mayer's formulation: Let P(t) be the solution of

Ṗ(t)′ = −P(t)′ f_x(t, x(t), u(t)), P(T)′ = −λ′ φ_x(T, x(T)).

A necessary condition for optimality of a control u ∈ U is that there exists λ, a nonzero k-dimensional vector with λ_1 ≤ 0, such that

P(t)′ [f(t, x(t), v) − f(t, x(t), u(t))] ≤ 0

P(T)′ f(T, x(T), u(T)) = −λ′ φ_t(T, x(T)).

Problem 1.3.1 Solve

min_u ∫_0^T (u(t) − 1) x(t) dt,

subject to ẋ(t) = γ u(t) x(t), x(0) = x0 > 0,

0 ≤ u(t) ≤ 1, for all t ∈ [0, T].


1.4 Deterministic Dynamic Programming

1.4.1 Value Function

Consider the following optimal control problem in Mayer’s form:

V(t0, x0) = inf_{u∈U} J(t1, x(t1))    (1.4.1)

subject to ẋ(t) = f(t, x(t), u(t)), x(t0) = x0 (state dynamics)    (1.4.2)

(t1, x(t1)) ∈ M (boundary conditions).    (1.4.3)

The terminal set M is a closed subset of R^{n+1}. The admissible control set U is assumed to be the set of piecewise continuous functions on [t0, t1]. The performance function J is assumed to be C^1. The function V(·, ·) is called the value function, and we shall use the convention V(t0, x0) = ∞ if the control problem above admits no feasible solution. We will denote by U(x0, t0) the set of feasible controls with initial condition (x0, t0), that is, the set of controls u such that the corresponding trajectory x satisfies (t1, x(t1)) ∈ M.

REMARK 1.4.1 For notational convenience, in this section the time horizon is denoted by the interval

[t0, t1] instead of [0, T ].

Proposition 1.4.1 Let u(t) ∈ U(x0, t0) be a feasible control and x(t) the corresponding trajectory. Then, for any t0 ≤ τ1 ≤ τ2 ≤ t1, V(τ1, x(τ1)) ≤ V(τ2, x(τ2)). That is, the value function is a nondecreasing function along any feasible trajectory.

Proof: Any control that is feasible for the initial condition (τ2, x(τ2)) can be preceded by u on [τ1, τ2] to produce a control that is feasible for (τ1, x(τ1)) and reaches the same terminal point, hence the same terminal cost J(t1, x(t1)). Taking the infimum over such controls yields V(τ1, x(τ1)) ≤ V(τ2, x(τ2)).

Corollary 1.4.1 The value function evaluated along any optimal trajectory is constant.

Proof: Let u∗ be an optimal control with corresponding trajectory x∗. Then V(t0, x0) = J(t1, x∗(t1)). In addition, for any t ∈ [t0, t1], u∗ is a feasible control starting at (t, x∗(t)) and so V(t, x∗(t)) ≤ J(t1, x∗(t1)). Finally, by Proposition 1.4.1, V(t0, x0) ≤ V(t, x∗(t)), so we conclude V(t, x∗(t)) = V(t0, x0) for all t ∈ [t0, t1].

According to the previous results, a necessary condition for optimality is that the value function is constant along the optimal trajectory. The following result provides a sufficient condition.


Theorem 1.4.1 Let W(s, y) be an extended real-valued function defined on R^{n+1} such that W(s, y) = J(s, y) for all (s, y) ∈ M. Given an initial condition (t0, x0), suppose that for any feasible trajectory x(t), the function W(t, x(t)) is finite and nondecreasing on [t0, t1]. If u∗ is a feasible control with corresponding trajectory x∗ such that W(t, x∗(t)) is constant, then u∗ is optimal and V(t0, x0) = W(t0, x0).

Proof: For any feasible trajectory x, W(t0, x0) ≤ W(t1, x(t1)) = J(t1, x(t1)). On the other hand, for x∗, W(t0, x0) = W(t1, x∗(t1)) = J(t1, x∗(t1)).

Corollary 1.4.2 Let u∗ be an optimal control with corresponding feasible trajectory x∗. Then the restriction of u∗ to [t, t1] is optimal for the control problem with initial condition (t, x∗(t)).

In many applications, the control problem is given in its Lagrange form

V(t0, x0) = inf_{u∈U(x0,t0)} ∫_{t0}^{t1} L(t, x(t), u(t)) dt    (1.4.4)

subject to ẋ(t) = f(t, x(t), u(t)), x(t0) = x0.    (1.4.5)

In this case, the following result is the analogue to Proposition 1.4.1.

Theorem 1.4.2 (Bellman's Principle of Optimality). Consider an optimal control problem in Lagrange form. For any u ∈ U(s, y) and its corresponding trajectory x,

V(s, y) ≤ ∫_s^τ L(t, x(t), u(t)) dt + V(τ, x(τ)).

Proof: Given u ∈ U(s, y), let ũ ∈ U(τ, x(τ)) be arbitrary. Define

û(t) = u(t) for s ≤ t ≤ τ, and û(t) = ũ(t) for τ ≤ t ≤ t1.

Thus, û ∈ U(s, y), so that

V(s, y) ≤ ∫_s^{t1} L(t, x̂(t), û(t)) dt = ∫_s^τ L(t, x(t), u(t)) dt + ∫_τ^{t1} L(t, x̂(t), ũ(t)) dt,    (1.4.6)

where x̂ is the trajectory generated by û (which coincides with x on [s, τ]). Since the inequality holds for any ũ ∈ U(τ, x(τ)), we conclude

V(s, y) ≤ ∫_s^τ L(t, x(t), u(t)) dt + V(τ, x(τ)).
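Bellman's principle is exactly what justifies computing the value function by a backward recursion on a discretized grid: V(t, x) ≈ min_u { L(x, u) Δt + V(t + Δt, x + f(t, x, u) Δt) }. The sketch below is a minimal illustration (the discretization scheme, the grids, and the scalar example ẋ = u, L = x^2 + u^2 with zero terminal cost are all choices of mine, not from the notes):

```python
import numpy as np

T, n_t = 1.0, 100
dt = T / n_t
xs = np.linspace(-2.0, 2.0, 161)          # state grid
us = np.linspace(-2.0, 2.0, 81)           # control grid

V = np.zeros_like(xs)                     # terminal condition V(T, x) = 0
for _ in range(n_t):                      # backward in time
    x_next = xs[:, None] + dt * us[None, :]                        # next state for each (x, u)
    V_next = np.interp(x_next.ravel(), xs, V).reshape(x_next.shape)
    stage = dt * (xs[:, None] ** 2 + us[None, :] ** 2)              # running cost L(x, u) * dt
    V = np.min(stage + V_next, axis=1)                              # Bellman update

# For this linear-quadratic instance, V(0, x) = tanh(T) * x^2, so V(0, 1) ~ 0.762.
print(np.interp(1.0, xs, V), np.tanh(T))
```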

Although the conditions given by Theorem 1.4.1 are sufficient, they do not provide a concrete way to construct an optimal solution. In the next section, we will provide a direct method to compute the value function.


1.4.2 DP’s Partial Differential Equations

Define the reachable set Q0 as

Q0 = {(s, y) ∈ R^{n+1} : U(s, y) ≠ ∅}.

This set defines the collection of initial conditions for which the optimal control problem is feasible.

Theorem 1.4.3 Let (s, y) be any interior point of Q0 at which V(s, y) is differentiable. Then V(s, y) satisfies

V_s(s, y) + V_y(s, y) f(s, y, v) ≥ 0 for all v ∈ U.

If there is an optimal u∗ ∈ U(s, y), then the PDE

min_{v∈U} { V_s(s, y) + V_y(s, y) f(s, y, v) } = 0

is satisfied and the minimum is achieved by the right limit u∗(s)^+ of the optimal control at s.

Proof: Pick any v ∈ U and let x_v(t) be the corresponding trajectory for s ≤ t ≤ s + ε, ε > 0 small. Given the initial condition (s, y), we define the feasible control u_ε as follows:

u_ε(t) = v for s ≤ t ≤ s + ε, and u_ε(t) = ũ(t) for s + ε ≤ t ≤ t1,

where ũ ∈ U(s + ε, x_v(s + ε)) is arbitrary. Note that for ε small, (s + ε, x_v(s + ε)) ∈ Q0 and so u_ε ∈ U(s, y). We denote by x_ε(t) the corresponding trajectory. By Proposition 1.4.1, V(t, x_ε(t)) is nondecreasing, hence

D^+ V(t, x_ε(t)) := lim_{h↓0} [ V(t + h, x_ε(t + h)) − V(t, x_ε(t)) ] / h ≥ 0

for any t at which the limit exists, in particular t = s. Thus, from the chain rule we get

D^+ V(s, x_ε(s)) = V_s(s, y) + V_y(s, y) D^+ x_ε(s) = V_s(s, y) + V_y(s, y) f(s, y, v).

The equalities use the identity x_ε(s) = y and the system dynamics equation D^+ x_ε(t) = f(t, x_ε, u_ε(t)^+).

If u∗ ∈ U(s, y) is an optimal control with trajectory x∗, then Corollary 1.4.1 implies V(t, x∗(t)) = J(t1, x∗(t1)) for all t ∈ [s, t1], so differentiating (from the right) this equality at t = s we conclude

V_s(s, y) + V_y(s, y) f(s, y, u∗(s)^+) = 0.

Corollary 1.4.3 (Hamilton-Jacobi-Bellman equation (HJB)) For a control problem given in Lagrange form (1.4.4)-(1.4.5), the value function at a point (s, y) ∈ int(Q0) satisfies

V_s(s, y) + V_y(s, y) f(s, y, v) + L(s, y, v) ≥ 0 for all v ∈ U.

If there exists an optimal control u∗, then the PDE

min_{v∈U} { V_s(s, y) + V_y(s, y) f(s, y, v) + L(s, y, v) } = 0

is satisfied and the minimum is achieved by the right limit u∗(s)^+ of the optimal control at s.


In many applications, instead of solving the HJB equation, a candidate for the value function is identified, say by inspection. It is important to be able to decide whether or not the proposed solution is in fact optimal.

Theorem 1.4.4 (Verification Theorem) Let W(s, y) be a C^1 solution to the partial differential equation

min_{v∈U} { W_s(s, y) + W_y(s, y) f(s, y, v) } = 0

with boundary condition W(s, y) = J(s, y) for all (s, y) ∈ M. Let (t0, x0) ∈ Q0, u ∈ U(t0, x0) and x the corresponding trajectory. Then, W(t, x(t)) is nondecreasing in t. If u∗ is a control in U(t0, x0) defined on [t0, t1∗] with corresponding trajectory x∗ such that for any t ∈ [t0, t1∗]

W_s(t, x∗(t)) + W_y(t, x∗(t)) f(t, x∗(t), u∗(t)) = 0,

then u∗ is an optimal control in U(t0, x0) and V(t0, x0) = W(t0, x0).

Example 1.4.1

min_{‖u‖≤1} J(t0, x0, u) = (1/2) (x(τ))^2

subject to ẋ(t) = u(t), x(t0) = x0,

where ‖u‖ = max_{0≤t≤τ} |u(t)|. The HJB equation is min_{|u|≤1} { V_t(t, x) + V_x(t, x) u } = 0 with boundary condition V(τ, x) = (1/2) x^2. We can solve this problem by inspection. Since the only cost is associated to the terminal state x(τ), an optimal control will try to make x(τ) as close to zero as possible, i.e.,

u∗(t, x) = −sgn(x) = 1 if x < 0, 0 if x = 0, −1 if x > 0.

(Bang-bang policy)

We should now verify that u∗ is in fact an optimal control. Let J∗(t, x) = J(t, x, u∗). Then, it is not hard to show that

J∗(t, x) = (1/2) (max{0, |x| − (τ − t)})^2,

which satisfies the boundary condition J∗(τ, x) = (1/2) x^2. In addition,

J∗_t(t, x) = (|x| − (τ − t))^+ and J∗_x(t, x) = sgn(x) (|x| − (τ − t))^+.

Therefore, for any u such that |u| ≤ 1 it follows that

J∗_t(t, x) + J∗_x(t, x) u = (1 + sgn(x) u) (|x| − (τ − t))^+ ≥ 0,

with the equality holding for u = u∗(t, x). Thus, J∗(t, x) is the value function and u∗ is optimal.
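A crude numerical sanity check of this verification argument (a sketch with my own discretization choices): simulate the bang-bang policy ẋ = −sgn(x) from a few initial conditions and compare the realized terminal cost with the closed-form J∗(t, x).

```python
import numpy as np

def simulate_bang_bang(t0, x0, tau=1.0, n=10_000):
    """Integrate xdot = -sgn(x) from (t0, x0) up to time tau; return the cost 0.5 * x(tau)^2."""
    dt = (tau - t0) / n
    x = x0
    for _ in range(n):
        x -= dt * np.sign(x)
    return 0.5 * x ** 2

def J_star(t, x, tau=1.0):
    return 0.5 * max(0.0, abs(x) - (tau - t)) ** 2

for t0, x0 in [(0.0, 2.0), (0.0, 0.3), (0.5, -1.2)]:
    print(simulate_bang_bang(t0, x0), J_star(t0, x0))   # the two columns should roughly agree
```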

1.4.3 Feedback Control

In the previous example, the notion of a feedback control policy was introduced. Specifically, a feedback control u is a mapping from R^{n+1} to U such that u = u(t, x) and the system dynamics ẋ = f(t, x, u(t, x)) has a unique solution for each initial condition (s, y) ∈ Q0. Given a feedback control u and an initial condition (s, y), we can define the trajectory x(t; s, y) as the solution to

ẋ = f(t, x, u(t, x)), x(s) = y.

The corresponding control policy is u(t) = u(t, x(t; s, y)).

A feedback control u∗ is an optimal feedback control if for any (s, y) ∈ Q0 the control u(t) = u∗(t, x(t; s, y)) solves the optimization problem (1.4.1)-(1.4.3) with initial condition (s, y).

Theorem 1.4.5 If there is an optimal feedback control u∗, and t1(s, y) and x(t1; s, y) are the terminal time and terminal state for the trajectory

ẋ = f(t, x, u∗(t, x)), x(s) = y,

then the value function V(s, y) is differentiable at each point at which t1(s, y) and x(t1; s, y) are differentiable with respect to (s, y).

Proof: From the optimality of u∗ we have that

V (s, y) = J(t1(s, y), x(t1(s, y); s, y)).

The result follows from this identity and the fact that J is C1.

1.4.4 The Linear-Quadratic Problem

Consider the following optimal control problem.

min x(T)′ Q_T x(T) + ∫_0^T [ x(t)′ Q x(t) + u(t)′ R u(t) ] dt    (1.4.7)

subject to ẋ(t) = A x(t) + B u(t),    (1.4.8)

where the n × n matrices Q_T and Q are symmetric positive semidefinite and the m × m matrix R is symmetric positive definite. The HJB equation for this problem is given by

min_{u∈R^m} { V_t(t, x) + V_x(t, x)′ (Ax + Bu) + x′Qx + u′Ru } = 0

with boundary condition V(T, x) = x′ Q_T x.

We guess a quadratic solution for the HJB equation. That is, we suppose that V(t, x) = x′ K(t) x for an n × n symmetric matrix K(t). If this is the case then

V_t(t, x) = x′ K̇(t) x and V_x(t, x) = 2 K(t) x.

Plugging these derivatives back into the HJB equation we get

min_{u∈R^m} { x′ K̇(t) x + 2x′ K(t) A x + 2x′ K(t) B u + x′ Q x + u′ R u } = 0.    (1.4.9)

Thus, the optimal control satisfies

2 B′ K(t) x + 2 R u = 0  =⇒  u∗ = −R^{−1} B′ K(t) x.


Substituting the value of u∗ in equation (1.4.9) we obtain the condition

x′ ( K̇(t) + K(t)A + A′K(t) − K(t) B R^{−1} B′ K(t) + Q ) x = 0 for all (t, x).

Therefore, for this to hold, the matrix K(t) must satisfy the continuous-time Riccati equation in matrix form

K̇(t) = −K(t)A − A′K(t) + K(t) B R^{−1} B′ K(t) − Q, with boundary condition K(T) = Q_T.    (1.4.10)

Reversing the argument, it can be shown that if K(t) solves (1.4.10) then W(t, x) = x′ K(t) x is a solution of the HJB equation, and so by the verification theorem we conclude that it is equal to the value function. In addition, the optimal feedback control is u∗(t, x) = −R^{−1} B′ K(t) x.
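Numerically, the Riccati equation (1.4.10) is usually integrated backward in time from K(T) = Q_T. A minimal sketch follows, assuming scipy is available; the matrices A, B, Q, R, Q_T and the horizon T are arbitrary illustrative data, not from the notes.

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [0.0, 0.0]])    # double integrator, chosen only as illustrative data
B = np.array([[0.0], [1.0]])
Q, QT, R = np.eye(2), np.eye(2), np.array([[1.0]])
T = 5.0

def riccati_rhs(k_flat):
    """Right-hand side of (1.4.10): Kdot = -K A - A' K + K B R^{-1} B' K - Q."""
    K = k_flat.reshape(2, 2)
    dK = -K @ A - A.T @ K + K @ B @ np.linalg.inv(R) @ B.T @ K - Q
    return dK.ravel()

# Integrate backward from K(T) = Q_T by reversing time (s = T - t).
sol = solve_ivp(lambda s, k: -riccati_rhs(k), (0.0, T), QT.ravel(), rtol=1e-8)
K0 = sol.y[:, -1].reshape(2, 2)                       # K(0)
gain = np.linalg.inv(R) @ B.T @ K0                    # u*(0, x) = -gain @ x
print("K(0) =\n", K0, "\nfeedback gain =", gain)
```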

1.5 Extensions

1.5.1 The Method of Characteristics for First-Order PDEs

First-Order Homogeneous Case

Consider the following first-order homogeneous PDE

u_t(t, x) + a(t, x) u_x(t, x) = 0, x ∈ R, t > 0,

with boundary condition u(0, x) = φ(x) for all x ∈ R. We assume that a and φ are "smooth enough" functions. A PDE problem in this form is referred to as a Cauchy problem.

We will investigate the solution to this problem using the method of characteristics. The characteristics of this PDE are curves in the x–t plane defined by

ẋ(t) = a(t, x(t)), x(0) = x0.    (1.5.1)

Let x = x(t) be a solution with x(0) = x0. Let u be a solution to the PDE; we want to study the evolution of u along x(t):

d/dt u(t, x(t)) = u_t(t, x(t)) + u_x(t, x(t)) ẋ(t) = u_t(t, x(t)) + u_x(t, x(t)) a(t, x(t)) = 0.

So, u(t, x) is constant along the characteristic curve x(t), that is,

u(t, x(t)) = u(0, x(0)) = φ(x0), ∀t > 0.    (1.5.2)

Thus, if we are able to solve the ODE (1.5.1) then we would be able to find the solution to the original PDE.

Example 1.5.1 Consider the Cauchy problem

u_t + x u_x = 0, x ∈ R, t > 0,

u(0, x) = φ(x), x ∈ R.

The characteristic curves are defined by

ẋ(t) = x(t), x(0) = x0,

so x(t) = x0 exp(t). So for a given (t, x) the characteristic passing through this point has initial condition x0 = x exp(−t). Since u(t, x(t)) = φ(x0), we conclude that u(t, x) = φ(x exp(−t)).

First-Order Non-Homogeneous Case

Consider the following nonhomogeneous problem:

u_t(t, x) + a(t, x) u_x(t, x) = b(t, x), x ∈ R, t > 0,

u(0, x) = φ(x), x ∈ R.

Again, the characteristic curves are given by

ẋ(t) = a(t, x(t)), x(0) = x0.    (1.5.3)

Thus, for a solution u(t, x) of the PDE, along a characteristic curve x(t) we have that

d/dt u(t, x(t)) = u_t(t, x(t)) + u_x(t, x(t)) ẋ(t) = u_t(t, x(t)) + u_x(t, x(t)) a(t, x(t)) = b(t, x(t)).

Hence, the solution to the PDE is given by

u(t, x(t)) = φ(x0) + ∫_0^t b(τ, x(τ)) dτ

along the characteristic (t, x(t)).

Example 1.5.2 Consider the Cauchy problem

u_t + u_x = x, x ∈ R, t > 0,

u(0, x) = φ(x), x ∈ R.

The characteristic curves are defined by

ẋ(t) = 1, x(0) = x0,

so x(t) = x0 + t. So for a given (t, x) the characteristic passing through this point has initial condition x0 = x − t. In addition, along a characteristic x(t) = x0 + t starting at x0, we have

u(t, x(t)) = φ(x0) + ∫_0^t x(τ) dτ = φ(x0) + x0 t + (1/2) t^2.

Thus, the solution to the PDE is given by

u(t, x) = φ(x − t) + (x − t/2) t.
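A quick numerical check of this example (a small sketch; the initial condition φ and the evaluation point are arbitrary choices of mine): integrate the running term along the characteristic through (t, x) and compare with the closed-form solution u(t, x) = φ(x − t) + (x − t/2) t.

```python
import numpy as np

phi = np.cos                      # arbitrary smooth initial condition u(0, x) = phi(x)

def u_along_characteristic(t, x, n=1000):
    """u(t, x) = phi(x0) + int_0^t x(s) ds along the characteristic x(s) = x0 + s, x0 = x - t."""
    x0 = x - t
    ds = t / n
    s = (np.arange(n) + 0.5) * ds           # midpoint rule (exact for the linear integrand)
    return phi(x0) + np.sum(x0 + s) * ds

def u_closed_form(t, x):
    return phi(x - t) + (x - t / 2.0) * t

t, x = 0.7, 1.3
print(u_along_characteristic(t, x), u_closed_form(t, x))   # should match
```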


Applications to Optimal Control

Given that the partial differential equation of dynamic programming is a first-order PDE, we can try to apply the method of characteristics to find the value function. In general, the HJB is not a standard first-order PDE because of the minimization that takes place, so we cannot just solve a simple first-order PDE to get the value function of dynamic programming. Nevertheless, in some situations it is possible to obtain good results, as the following example shows.

Example 1.5.3 (Method of Characteristics) Consider the optimal control problem

min_{‖u‖≤1} J(t0, x0, u) = (1/2) (x(τ))^2

subject to ẋ(t) = u(t), x(t0) = x0,

where ‖u‖ = max_{0≤t≤τ} |u(t)|.

A candidate for the value function W(t, x) should satisfy the HJB equation

min_{|u|≤1} { W_t(t, x) + W_x(t, x) u } = 0,

with boundary condition W(τ, x) = (1/2) x^2.

For a given u ∈ U, let us solve the PDE

W_t(t, x; u) + W_x(t, x; u) u = 0, W(τ, x; u) = (1/2) x^2.    (1.5.4)

A characteristic curve x(t) is found by solving

ẋ(t) = u, x(0) = x0,

so x(t) = x0 + ut. Since the solution to the PDE is constant along the characteristic curve, we have

W(t, x(t); u) = W(τ, x(τ); u) = (1/2) (x(τ))^2 = (1/2) (x0 + uτ)^2.

The characteristic passing through the point (t, x) has initial condition x0 = x − ut, so the general solution to the PDE (1.5.4) is

W(t, x; u) = (1/2) (x + (τ − t) u)^2.

Since our objective is to minimize the terminal cost, we can identify a policy by minimizing W(t, x; u) over u above. It is straightforward to see that the optimal control (in feedback form) satisfies

u∗(t, x) = −1 if x > τ − t, −x/(τ − t) if |x| ≤ τ − t, and 1 if x < t − τ.

The corresponding "candidate" for the value function, W∗(t, x) = W(t, x; u∗(t, x)), satisfies

W∗(t, x) = (1/2) (max{0, |x| − (τ − t)})^2,

which we already know satisfies the HJB equation.


1.5.2 Optimal Control and Myopic Solution

Consider the following deterministic control problem in Bolza form:

min_{u∈U} J(x(T)) + ∫_0^T L(x_t, u_t) dt

subject to ẋ(t) = f(x_t, u_t), x(0) = x0.

The functions f, J, and L are assumed to be "sufficiently" smooth.

The solution to this problem can be found by solving the associated Hamilton-Jacobi-Bellman equation

V_t(t, x) + min_{u∈U} { f(x, u) V_x(t, x) + L(x, u) } = 0

with boundary condition V(T, x) = J(x). The value function V(t, x) represents the optimal cost-to-go starting at time t in state x.

Suppose we fix the control u ∈ U and solve the first-order PDE

W_t(t, x; u) + f(x, u) W_x(t, x; u) + L(x, u) = 0, W(T, x; u) = J(x)    (1.5.5)

using the method of characteristics. That is, we solve the characteristic ODE ẋ(t) = f(x, u) and let x(t) = H(t; s, y, u) be the solution passing through the point (s, y), i.e., x(s) = H(s; s, y, u) = y.

By construction, along a characteristic curve (t, x(t)) the function W(t, x(t); u) satisfies

d/dt W(t, x(t); u) + L(x(t), u) = 0.

Therefore, after integration we have that

W(s, x(s); u) = W(T, x(T); u) + ∫_s^T L(x(t), u) dt = J(x(T)) + ∫_s^T L(x(t), u) dt,

where the second equality follows from the boundary condition for W. We can rewrite this last identity for the particular characteristic curve passing through (t, x) as follows:

W(t, x; u) = J(H(T; t, x, u)) + ∫_t^T L(H(s; t, x, u), u) ds.

Since the control u has been fixed so far, we call W(t, x; u) the static value function associated with control u. Now, if we view W(t, x; u) as a function of u, we can minimize this static value function. We define

u∗(t, x) = arg min_{u∈U} W(t, x; u) and 𝒱(t, x) = W(t, x; u∗(t, x)).

Proposition 1.5.1 Suppose that u∗(t, x) is an interior solution and that W(t, x; u) is sufficiently smooth so that u∗(t, x) satisfies

dW(t, x; u)/du |_{u=u∗(t,x)} = 0.    (1.5.6)

Then the function 𝒱(t, x) satisfies the PDE

𝒱_t(t, x) + f(x, u∗(t, x)) 𝒱_x(t, x) + L(x, u∗(t, x)) = 0    (1.5.7)

with boundary condition 𝒱(T, x) = J(x).


Proof: Let us rewrite the left-hand side of the PDE (1.5.7) in terms of W(t, x; u*(t, x)) to get
\[
\underbrace{\frac{\partial W(t, x; u^*)}{\partial t} + f(x, u^*)\,\frac{\partial W(t, x; u^*)}{\partial x} + L(x, u^*)}_{(a)}
\;+\;
\underbrace{\left[\frac{\partial u^*}{\partial t} + f(x, u^*)\,\frac{\partial u^*}{\partial x}\right]\frac{\partial W(t, x; u^*)}{\partial u}}_{(b)}.
\]
We note that by construction of the function W in equation (1.5.5), the expression denoted by (a) is equal to zero. In addition, the optimality condition (1.5.6) implies that (b) is also equal to zero. Therefore, 𝒱(t, x) satisfies the PDE (1.5.7). The boundary condition follows again from the definition of the value function W.

Given this result, the question that naturally arises is whether 𝒱(t, x) is in fact the value function (that is, whether 𝒱(t, x) = V(t, x)) and u*(t, x) is the corresponding optimal feedback control.

Unfortunately, this is not true in general. In fact, to prove that V(t, x) = 𝒱(t, x) we would need to show that
\[
u^*(t, x) = \arg\min_{u \in U}\bigl\{ f(x, u)\, \mathcal{V}_x(t, x) + L(x, u) \bigr\}.
\]

Since we have assumed that u*(t, x) is an interior solution, the first-order optimality condition for the minimization problem above is given by
\[
f_u(x, u^*(t, x))\, \mathcal{V}_x(t, x) + L_u(x, u^*(t, x)) = 0.
\]

Using the optimality condition (1.5.6) we have that
\[
\mathcal{V}_x(t, x) = W_x(t, x; u^*) = J'(H)\, H_x + \int_t^T L_x(H, u^*)\, H_x\, ds,
\]
where H = H(s; t, x, u*) and H_x = H_x(s; t, x, u*) denote the characteristic solution and its partial derivative with respect to x keeping u* = u*(t, x) fixed (in the first term they are evaluated at s = T). Thus, the first-order optimality condition that needs to be verified is

\[
f_u(x, u^*(t, x))\left( J'(H)\, H_x + \int_t^T L_x(H, u^*)\, H_x\, ds \right) + L_u(x, u^*(t, x)) = 0. \tag{1.5.8}
\]

On the other hand, the optimality condition (1.5.6) that u*(t, x) satisfies is
\[
J'(H)\, H_u + \int_t^T \bigl[ L_x(H, u^*)\, H_u + L_u(H, u^*) \bigr]\, ds = 0. \tag{1.5.9}
\]

It should be clear that condition (1.5.9) does not necessarily imply condition (1.5.8), and so 𝒱(t, x) and u*(t, x) are not guaranteed to be the value function and the optimal feedback control, respectively. The following example shows the suboptimality of u*(t, x).

Example 1.5.4 Consider the traditional linear-quadratic control problem
\[
\min_u \; x^2(T) + \int_0^T \bigl( x^2(t) + u^2(t) \bigr)\, dt
\]
subject to ẋ(t) = αx(t) + βu(t), x(0) = x0.

• Exact solution to the HJB equation: This problem is traditionally tackled by solving an associated Riccati differential equation. We suppose that the optimal control satisfies
\[
u(t, x) = -\beta\, k(t)\, x,
\]


where the function k(t) satisfies the Riccati ODE
\[
\dot{k}(t) + 2\alpha k(t) = \beta^2 k^2(t) - 1, \qquad k(T) = 1.
\]
We can get a particular solution assuming k(t) = k = constant. In this case,
\[
\beta^2 k^2 - 2\alpha k - 1 = 0 \;\Longrightarrow\; k_{\pm} = \frac{\alpha \pm \sqrt{\alpha^2 + \beta^2}}{\beta^2}.
\]

Now, let us define k(t) = z(t) + k₊; then the Riccati equation becomes
\[
\dot{z}(t) + 2(\alpha - \beta^2 k_+)\, z(t) = \beta^2 z^2(t)
\;\Longrightarrow\;
\frac{\dot{z}(t)}{z^2(t)} + \frac{2(\alpha - \beta^2 k_+)}{z(t)} = \beta^2.
\]
If we set w(t) = z⁻¹(t), so that ẇ = −ż/z², the last ODE is equivalent to
\[
\dot{w}(t) - 2(\alpha - \beta^2 k_+)\, w(t) = -\beta^2.
\]
This is a simple linear differential equation that can be solved using the integrating factor exp(−2(α − β²k₊)t), that is,
\[
\frac{d}{dt}\Bigl( \exp\bigl(-2(\alpha - \beta^2 k_+)\, t\bigr)\, w(t) \Bigr) = -\exp\bigl(-2(\alpha - \beta^2 k_+)\, t\bigr)\, \beta^2.
\]
The solution to this ODE is
\[
w(t) = \kappa \exp\bigl(2(\alpha - \beta^2 k_+)\, t\bigr) + \frac{\beta^2}{2(\alpha - \beta^2 k_+)},
\]
where κ is a constant of integration. Using the fact that α − β²k₊ = −√(α² + β²) and k(t) = k₊ + 1/w(t), we get
\[
k(t) = \frac{\alpha + \sqrt{\alpha^2 + \beta^2}}{\beta^2}
+ \frac{2\sqrt{\alpha^2 + \beta^2}}{2\kappa\sqrt{\alpha^2 + \beta^2}\, \exp\bigl(2(\alpha - \beta^2 k_+)\, t\bigr) - \beta^2}.
\]
The value of κ is obtained from the boundary condition k(T) = 1.

• Myopic Solution: If we solve the problem using the myopic approach described at the beginning of this section, we get that the characteristic curve is given by
\[
\dot{x}(t) = \alpha x(t) + \beta u \;\Longrightarrow\; \ln(\alpha x(t) + \beta u) = \alpha(t + A),
\]
with A a constant of integration. The characteristic passing through the point (t, x) satisfies A = ln(αx + βu)/α − t and is given by
\[
x(\tau) = \frac{(\alpha x + \beta u)\, \exp(\alpha(\tau - t)) - \beta u}{\alpha}.
\]

The static value function W(t, x; u) is given by
\[
W(t, x; u) = \left( \frac{(\alpha x + \beta u)\, \exp(\alpha(T - t)) - \beta u}{\alpha} \right)^2
+ \int_t^T \left[ \left( \frac{(\alpha x + \beta u)\, \exp(\alpha(\tau - t)) - \beta u}{\alpha} \right)^2 + u^2 \right] d\tau.
\]

If we compute the derivative of W(t, x; u) with respect to u and set it equal to zero we get, after some manipulations, that the optimal myopic solution is
\[
u^*(t, x) = -\left( \frac{\alpha\beta\bigl( 3\exp(2(T-t)) - 4\exp(T-t) + 1 \bigr)}{\beta^2\bigl( 3\exp(2(T-t)) - 8\exp(T-t) + 5 \bigr) + 2(T-t)(\alpha^2 + \beta^2)} \right) x.
\]
Interestingly, this myopic feedback control is linear in x, as is the optimal solution; however, it is clearly different and suboptimal.
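The gap between the two feedback laws can be checked numerically. The following Python sketch (an illustration, not part of the notes) integrates the Riccati ODE backward, simulates both the optimal feedback u = −βk(t)x and the myopic feedback above with a simple Euler scheme, and compares the resulting costs. The values of α, β, T, x0 and the grid size are arbitrary assumptions.

```python
# Compare optimal (Riccati) vs. myopic feedback for the LQ problem of Example 1.5.4.
import numpy as np

alpha, beta, T, x0 = 1.0, 1.0, 1.0, 2.0
n = 2000
dt = T / n
ts = np.linspace(0.0, T, n + 1)

# Backward Euler pass for the Riccati ODE  k' = beta^2 k^2 - 2*alpha*k - 1,  k(T) = 1.
k = np.empty(n + 1)
k[-1] = 1.0
for i in range(n, 0, -1):
    k[i - 1] = k[i] - dt * (beta**2 * k[i]**2 - 2 * alpha * k[i] - 1)

def myopic_u(t, x):
    # Myopic feedback from the text, linear in x.
    s = T - t
    num = alpha * beta * (3 * np.exp(2 * s) - 4 * np.exp(s) + 1)
    den = beta**2 * (3 * np.exp(2 * s) - 8 * np.exp(s) + 5) + 2 * s * (alpha**2 + beta**2)
    return -num / den * x if den > 0 else 0.0

def simulate(feedback):
    # Euler simulation of x' = alpha*x + beta*u with running cost x^2 + u^2.
    x, cost = x0, 0.0
    for i in range(n):
        u = feedback(ts[i], x)
        cost += (x**2 + u**2) * dt
        x += (alpha * x + beta * u) * dt
    return cost + x**2                      # add terminal cost x(T)^2

cost_opt = simulate(lambda t, x: -beta * k[int(round(t / dt))] * x)
cost_myopic = simulate(myopic_u)
print(f"optimal feedback cost ~ {cost_opt:.4f}")
print(f"myopic  feedback cost ~ {cost_myopic:.4f}")   # should be (weakly) larger
```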


The previous example shows that, in general, the use of a myopic policy produces suboptimal solutions. However, a question remains open: under what conditions is the myopic solution optimal? A general answer can be obtained by looking at the restrictions on the problem's data under which the optimality condition (1.5.8) is implied by condition (1.5.9).

In what follows we present one specific case for which the optimal solution is given by the myopic solution. Consider the control problem
\[
\min_{u \in U} \; J(x(T)) + \int_0^T L(u(t))\, dt
\]
subject to ẋ(t) = f(x(t), u(t)) := g(x(t))\,h(u(t)), x(0) = x0.

In this case, it can be shown that the characteristic curve passing through the point (t, x) is given by
\[
x(\tau) = G^{-1}\bigl( h(u)(\tau - t) + G(x) \bigr), \quad \text{where } G(x) := \int \frac{dx}{g(x)}.
\]

In this case, the static value function is
\[
W(t, x; u) = J\bigl( G^{-1}( h(u)(T - t) + G(x) ) \bigr) + L(u)\,(T - t),
\]

and the myopic solution satisfies dW(t, x; u)/du = 0, or equivalently
\begin{align*}
0 &= J'\bigl( G^{-1}( h(u)(T-t) + G(x) ) \bigr)\, h'(u)\,(T-t)\, (G^{-1})'\bigl( h(u)(T-t) + G(x) \bigr) + (T-t)\, L'(u) \\
\iff 0 &= J'\bigl( G^{-1}( h(u)(T-t) + G(x) ) \bigr)\, f_u(x, u)\, G'(x)\, (G^{-1})'\bigl( h(u)(T-t) + G(x) \bigr) + L'(u) \\
\iff 0 &= f_u(x, u)\, W_x(t, x; u) + L'(u).
\end{align*}

The second equivalence uses the identities G'(x) = 1/g(x) and f_u(x, u) = g(x)\,h'(u). Therefore, the optimal myopic policy u*(t, x) satisfies
\[
0 = f_u(x, u^*)\, \mathcal{V}_x(t, x) + L'(u^*),
\]
i.e., the first-order optimality condition (1.5.8).

Example 1.5.5 Consider the control problem
\[
\min_u \; x^2(T) + \int_0^T u^2(t)\, dt
\]
subject to ẋ(t) = x(t)\,u(t), x(0) = x0.

In this case, the characteristic passing through (t, x) is given by
\[
x(\tau) = x\, \exp(u(\tau - t)).
\]
The static value function is
\[
W(t, x; u) = x^2 \exp(2u(T - t)) + u^2\,(T - t).
\]

Minimizing W over u implies
\[
x^2 \exp(2u^*(T - t)) + u^* = 0,
\]
and the corresponding value function is
\[
V(t, x) = \mathcal{V}(t, x) = u^*(t, x)\,\bigl( u^*(t, x)\,(T - t) - 1 \bigr).
\]
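Since the optimality condition above has no closed-form solution for u*, it can be solved numerically. The following Python sketch (illustrative only; the parameter values are assumptions) finds u*(t, x) by bisection and evaluates the corresponding value function.

```python
# Solve x^2 * exp(2*u*(T - t)) + u = 0 for u by bisection, then evaluate V(t, x).
import math

def myopic_control(t, x, T):
    g = lambda u: x**2 * math.exp(2 * u * (T - t)) + u
    lo, hi = -x**2 - 1.0, 0.0          # g(0) >= 0 and g(lo) < 0, so a root lies in [lo, 0]
    for _ in range(100):               # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

T, t, x = 1.0, 0.0, 2.0
u_star = myopic_control(t, x, T)
V = u_star * (u_star * (T - t) - 1.0)
print(f"u*(t, x) = {u_star:.4f},  V(t, x) = {V:.4f}")
```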


Connecting the HJB Equation with Pontryagin Principle

We consider the optimal control problem in Lagrange form. In this case, the HJB equation is given by
\[
\min_{u \in U}\bigl\{ V_t(t, x) + V_x(t, x)\, f(t, x, u) + L(t, x, u) \bigr\} = 0,
\]
with boundary condition V(t1, x(t1)) = 0.

Let us define the so-called Hamiltonian
\[
H(t, x, u, \lambda) := \lambda\, f(t, x, u) - L(t, x, u).
\]

Thus, the HJB equation implies that the value function satisfies
\[
\max_{u \in U}\bigl\{ H(t, x, u, -V_x(t, x)) - V_t(t, x) \bigr\} = 0,
\]

and so the optimal control can be found by maximizing the Hamiltonian. Specifically, let x*(t) be the optimal trajectory and let P(t) = −Vx(t, x*(t)); then the optimal control satisfies the so-called Maximum Principle
\[
H(t, x^*(t), u^*(t), P(t)) \ge H(t, x^*(t), u, P(t)), \quad \text{for all } u \in U.
\]

In order to complete the connection with Pontryagin's principle we need to derive the adjoint equations. Let x*(t) be the optimal trajectory and consider a small perturbation x(t) such that
\[
x(t) = x^*(t) + \delta(t), \quad \text{where } |\delta(t)| < \varepsilon.
\]

First, we note that the HJB equation, together with the optimality of x* and its corresponding control u*, implies that
\[
H(t, x^*(t), u^*(t), -V_x(t, x^*(t))) - V_t(t, x^*(t)) \;\ge\; H(t, x(t), u^*(t), -V_x(t, x(t))) - V_t(t, x(t)).
\]

Therefore, the derivative of H(t, x(t), u*(t), −Vx(t, x(t))) − Vt(t, x(t)) with respect to x must be equal to zero at x*(t). Using the definition of H, this condition implies that
\[
-V_{xx}(t, x^*(t))\, f(t, x^*(t), u^*(t)) - V_x(t, x^*(t))\, f_x(t, x^*(t), u^*(t)) - L_x(t, x^*(t), u^*(t)) - V_{xt}(t, x^*(t)) = 0.
\]

In addition, using the dynamics of the system we get that
\[
\frac{d}{dt} V_x(t, x^*(t)) = V_{tx}(t, x^*(t)) + V_{xx}(t, x^*(t))\, f(t, x^*(t), u^*(t)),
\]
therefore
\[
\frac{d}{dt} V_x(t, x^*(t)) = -\bigl[ V_x(t, x^*(t))\, f_x(t, x^*(t), u^*(t)) + L_x(t, x^*(t), u^*(t)) \bigr].
\]

Finally, using the definition of P(t) and H, we conclude that P(t) satisfies the adjoint condition
\[
\dot{P}(t) = -\frac{\partial}{\partial x} H(t, x^*(t), u^*(t), P(t)).
\]

The boundary condition for P(t) is obtained from the boundary condition of the HJB equation, that is,
\[
P(t_1) = -V_x(t_1, x(t_1)) = 0. \quad \text{(transversality condition)}
\]


Economic Interpretation of the Maximum Principle

Let us again consider the control problem in Lagrange form. In this case the performance measure is
\[
V(t, x) = \min \int_t^{t_1} L(s, x(s), u(s))\, ds.
\]

The function L corresponds to the instantaneous "cost" rate. According to our definition of P(t) = −Vx(t, x(t)), we can interpret this quantity as the marginal profit associated with a small change in the state variable x. The economic interpretation of the Hamiltonian is as follows:

\begin{align*}
H\, dt &= P(t)\, f(t, x, u)\, dt - L(t, x, u)\, dt \\
       &= P(t)\, \dot{x}(t)\, dt - L(t, x, u)\, dt \\
       &= P(t)\, dx(t) - L(t, x, u)\, dt.
\end{align*}

The term −L(t, x, u) dt corresponds to the instantaneous profit made at time t in state x if control u is selected. We can view this profit as a direct contribution. The second term P(t) dx(t) represents the instantaneous profit that is generated by changing the state from x(t) to x(t) + dx(t). We can view this profit as an indirect contribution. Therefore H dt can be interpreted as the total contribution made from time t to t + dt given the state x(t) and the control u.

With this interpretation, the Maximum Principle simply states that an optimal control should try to maximize the total contribution at every time t. In other words, the Maximum Principle decouples the dynamic optimization problem into a series of static optimization problems, one for every time t.

Note also that if we integrate the adjoint equation we get
\[
P(t) = \int_t^{t_1} H_x\, ds.
\]

So P(t) is the cumulative gain obtained over [t, t1] from a marginal change in the state. In this respect, the adjoint variables behave in much the same way as dual variables in LP.

1.6 Exercises

Exercise 1.6.1 In class, we solved the following deterministic optimal control problem
\[
\min_{\|u\| \le 1} \; J(t_0, x_0, u) = \tfrac{1}{2}(x(\tau))^2
\]
subject to ẋ(t) = u(t), x(t0) = x0, where ‖u‖ = max_{0 ≤ t ≤ τ} |u(t)|, using the method of characteristics. In particular, we solved the open-loop HJB PDE
\[
W_t(t, x; u) + W_x(t, x; u)\, u = 0, \qquad W(\tau, x; u) = \tfrac{1}{2}x^2,
\]
for a fixed u, and then found the optimal closed-loop control by solving
\[
u^*(t, x) = \arg\min_{\|u\| \le 1} W(t, x; u)
\]


and computing the value function as V (t, x) = W (t, x;u∗(t, x)).

a) Explain why this methodology does not work in general. Provide a counterexample.

b) Which specific control problems can be solved using this open-loop approach?

c) Propose an algorithm that uses the open-loop solution to approximately solve a general deterministic optimal control problem.

Exercise 1.6.2 (Dynamic Pricing in Discrete Time)

Assume that we have x0 items of a certain type that we want to sell over a period of N days. On each day, we may sell at most one item. On the kth day, knowing the current number xk of remaining unsold items, we can set the selling price uk of a unit item to a nonnegative number of our choice; then, the probability qk(uk) of selling an item on the kth day depends on uk as follows:
\[
q_k(u_k) = \alpha \exp(-u_k),
\]
where 0 < α < 1 is a given scalar. The objective is to find the optimal price-setting policy so as to maximize the total expected revenue over N days. Let Vk(xk) be the optimal expected revenue from day k to the end if we have xk unsold units.

a) Assuming that for all k, the value function Vk(xk) is monotonically nondecreasing as a function of xk, prove that for xk > 0 the optimal prices have the form
\[
\mu_k^*(x_k) = 1 + V_{k+1}(x_k) - V_{k+1}(x_k - 1)
\]
and that
\[
V_k(x_k) = \alpha \exp(-\mu_k^*(x_k)) + V_{k+1}(x_k).
\]

b) Prove simultaneously by induction that, for all k, the value function Vk(xk) is indeed monotonically nondecreasing as a function of xk, that the optimal price µ*_k(xk) is monotonically nonincreasing as a function of xk, and that Vk(xk) is given in closed form by
\[
V_k(x_k) =
\begin{cases}
(N - k)\, \alpha \exp(-1) & \text{if } x_k \ge N - k, \\
\sum_{i=k}^{N - x_k} \alpha \exp(-\mu_i^*(x_k)) + x_k\, \alpha \exp(-1) & \text{if } 0 < x_k < N - k, \\
0 & \text{if } x_k = 0.
\end{cases}
\]

Exercise 1.6.3 Consider a deterministic optimal control problem in which u is a scalar controland x is also scalar. The dynamics are given by

f(t, x, u) = a(x) + b(x)u

where a(x) and b(x) are C2 functions. If

P (t) b(x(t)) = 0 on a time interval α ≤ t ≤ β,

the Hamiltonian does not depend on u and the problem is singular. Show that under these conditions

P (t) q(x) = 0, α ≤ t ≤ β,


where q(x) = b_x(x)\,a(x) − a_x(x)\,b(x). Show further that if
\[
P(t)\bigl[ q_x(x(t))\, b(x(t)) - b_x(x(t))\, q(x(t)) \bigr] \neq 0,
\]
then
\[
u(t) = -\frac{P(t)\bigl[ q_x(x(t))\, a(x(t)) - a_x(x(t))\, q(x(t)) \bigr]}{P(t)\bigl[ q_x(x(t))\, b(x(t)) - b_x(x(t))\, q(x(t)) \bigr]}.
\]

Exercise 1.6.4 The objective of this exercise is to characterize a particular family of learning functions. These learning functions are useful modelling devices for situations where there is an agent that tries to increase his or her level of "knowledge" about a certain phenomenon (such as customers' preferences or product quality) by applying a certain control or "effort". To fix ideas, in what follows knowledge will be represented by the variable x while effort will be represented by the variable e. For simplicity we will assume that knowledge takes values in the [0, 1] interval while effort is a nonnegative real variable. The family of learning functions that we are interested in here are those that can be derived from a specific subfamily that we call Additive Learning Functions. The formal definition of an Additive Learning Function¹ is as follows.

Definition 1.6.1 Consider a function L : R₊ × [0, 1] → [0, 1]. The function L is called an Additive Learning Function if it satisfies the following properties:

Additivity: L(e2 + e1, x) = L(e2, L(e1, x)) for all e1, e2 ∈ R₊ and x ∈ [0, 1].

Boundary Condition: L(0, x) = x for all x ∈ [0, 1].

Monotonicity: Le(e, x) = ∂L/∂e (e, x) > 0 for all (e, x) ∈ R₊ × [0, 1].

Satiation: lim_{e→∞} L(e, x) = 1 for all x ∈ [0, 1].

a) Prove the following. Suppose that L(e, x) is a C1 additive learning function. Then L(e, x) satisfies
\[
L_e(e, x) - L_e(0, x)\, L_x(e, x) = 0,
\]
where Le and Lx are the partial derivatives of L(e, x) with respect to e and x, respectively.

b) Using the method of characteristics, solve the PDE of part a) as a function of
\[
H(x) = \int -\frac{1}{L_e(0, x)}\, dx
\]
and prove that the solution is of the form
\[
L(e, x) := H^{-1}\bigl( H(x) - e \bigr).
\]

Consider the following optimal control problem:
\[
V(0, x) = \max_{p_t} \int_0^T p_t\, \lambda(p_t)\, x_t\, dt \tag{1.6.1}
\]
subject to
\[
\dot{x}_t = L_e(0, x_t)\, \lambda(p_t), \qquad x_0 = x \in [0, 1] \text{ given.} \tag{1.6.2}
\]
¹ This name is probably not standard since I do not know the relevant literature well enough.


Here, Le(0, x) is the partial derivative of the learning function L(e, x) with respect to e evaluated at (0, x). This problem corresponds to the case of a seller that tries to maximize cumulative revenue during the period [0, T]. The potential demand rate at time t is given by λ(pt), where pt is the price set by the seller at time t. However, only a fraction xt ∈ [0, 1] of the potential customers buy the product at time t. The dynamics of xt are given by (1.6.2).

c) Show that equation (1.6.2) can be rewritten as
\[
x_t = L(y_t, x), \quad \text{where } y_t := \int_0^t \lambda(p_s)\, ds,
\]
and use this fact to reformulate your control problem as follows:
\[
\max_{y_t} \int_0^T p(\dot{y}_t)\, \dot{y}_t\, L(y_t, x)\, dt \quad \text{subject to } y_0 = 0. \tag{1.6.3}
\]

d) Deduce that the optimality condition in this case is given by
\[
\dot{y}_t^2\, p'(\dot{y}_t)\, L(y_t, x) = \text{constant}. \tag{1.6.4}
\]

e) Solve the optimality condition for the case
\[
\lambda(p) = \lambda_0 \exp(-\alpha p) \quad \text{and} \quad L(e, x) = 1 + (x - 1)\exp(-\beta e), \qquad \alpha, \beta > 0.
\]

1.7 Exercises

Exercise 1.7.1 Solve the problem:
\[
\min \; (x(T))^2 + \int_0^T (u(t))^2\, dt
\]
subject to ẋ(t) = u(t), |u(t)| ≤ 1, for all t ∈ [0, T].

Calculate the cost-to-go function J*(t, x) and verify that it satisfies the HJB equation.

A young investor has earned in the stock market a large amount of money S and plans to spend it so as to maximize his enjoyment through the rest of his life without working. He estimates that he will live exactly T more years and that his capital x(t) should be reduced to zero at time T, i.e., x(T) = 0. Also, he models the evolution of his capital by the differential equation
\[
\frac{dx(t)}{dt} = \alpha x(t) - u(t),
\]
where x(0) = S is his initial capital, α > 0 is a given interest rate, and u(t) ≥ 0 is his rate of expenditure. The total enjoyment he will obtain is given by
\[
\int_0^T e^{-\beta t} \sqrt{u(t)}\, dt.
\]
Here β is some positive scalar, which serves to discount future enjoyment. Find the optimal {u(t) | t ∈ [0, T]}.


Exercise 1.7.2 Analyze the problem of finding a curve {x(t) | t ∈ [0, T]} that maximizes the area under x,
\[
\int_0^T x(t)\, dt,
\]
subject to the constraints
\[
x(0) = a, \qquad x(T) = b, \qquad \int_0^T \sqrt{1 + (\dot{x}(t))^2}\, dt = L,
\]
where a, b and L are given positive scalars. The last constraint is known as the "isoperimetric constraint": it requires that the length of the curve be L.

Hint: Introduce the system equations ẋ1 = u, ẋ2 = √(1 + u²), and view the problem as a fixed terminal state problem. Show that the optimal curve x(t) satisfies the condition sin φ(t) = (C1 − t)/C2, where φ(t) is the slope angle of the curve (i.e., tan φ(t) = ẋ(t)) and C1, C2 are constants. Under some assumptions on a, b, and L, the optimal curve is a circular arc.

Exercise 1.7.3 Let a, b and T be positive scalars, and let A = (0, a) and B = (T, b) be two points in a medium within which the velocity of propagation of light is proportional to the vertical coordinate. Thus the time it takes for light to propagate from A to B along a curve {x(t) | t ∈ [0, T]} is
\[
\int_0^T \frac{\sqrt{1 + (\dot{x}(t))^2}}{C\, x(t)}\, dt,
\]
where C is a given positive constant. Find the curve of minimum travel time of light from A to B, and show that it is an arc of a circle of the form
\[
(x(t))^2 + (t - d)^2 = D,
\]
where d and D are some constants.

Hint: Introduce the system equation ẋ = u, and consider a fixed initial/terminal state problem with x(0) = a and x(T) = b.

Exercise 1.7.4 Use the discrete time Minimum Principle to solve the following problem:

A farmer annually producing xk units of a certain crop stores (1 − uk)xk units of his production, where 0 ≤ uk ≤ 1, and invests the remaining ukxk units, thus increasing the next year's production to a level xk+1 given by
\[
x_{k+1} = x_k + w\, u_k\, x_k, \qquad k = 0, 1, \ldots, N - 1.
\]
The scalar w is fixed at a known deterministic value. The problem is to find the optimal investment policy that maximizes the total product stored over N years,
\[
x_N + \sum_{k=0}^{N-1} (1 - u_k)\, x_k.
\]

Show the optimality of the following policy that consists of constant functions:

1. If w > 1, µ∗0(x0) = · · · = µ∗N−1(xN−1) = 1.


2. If 0 < w < 1/N , µ∗0(x0) = · · · = µ∗N−1(xN−1) = 0.

3. If 1/N ≤ w ≤ 1,
µ∗0(x0) = · · · = µ∗N−k−1(xN−k−1) = 1, and µ∗N−k(xN−k) = · · · = µ∗N−1(xN−1) = 0,
where k is such that
\[
\frac{1}{k+1} < w \le \frac{1}{k}.
\]


Chapter 2

Discrete Dynamic Programming

Dynamic programming (DP) is a technique pioneered by Richard Bellman¹ in the 1950s to model and solve problems where decisions are made in stages² in order to optimize a particular functional (e.g., minimize a certain cost) that depends (possibly) on the entire evolution (trajectory) of the system over time as well as on the decisions that were made along the way. The distinctive feature of DP (and one that is useful to keep in mind) with respect to the method of Calculus of Variations discussed in the previous chapter is that instead of thinking of an optimal trajectory as a point in an appropriate space, DP constructs this optimal trajectory sequentially over time; in essence, DP is an algorithm.

A fundamental idea that emerges from DP is that in general decisions cannot be made myopically (that is, optimizing current performance) since a low cost now might mean a high cost in the future.

2.1 Discrete-Time Formulation

Let us introduce the basic DP model using one of the most classical examples in Operations Management, namely, the Inventory Control problem.

Example 2.1.1 (Inventory control) Consider the problem faced by a firm that must replenish periodically (e.g., every month) the level of inventory of a certain good. The inventory of this good is used to satisfy a (possibly stochastic) demand. The dynamics of this inventory system are depicted in Figure 2.1.1. There are two costs incurred per period: a per-unit purchasing cost c, and an inventory cost incurred at the end of a period, given by a function r(·), that accounts for either holding costs (incurred when a positive amount of inventory is carried over to the next period) or backlog costs (associated with unsatisfied demand that must be met in the future). The manager of this firm must decide at the beginning of every period k the amount of inventory to order (uk), based on the initial level of inventory in period k (xk) and the available forecast of future demands (wk, wk+1, . . . , wN−1); this forecast is captured by the underlying joint probability distribution of these future demands.

¹ For a brief historical account of the early developments of DP, including the origin of its name, see S. Dreyfus (2002), "Richard Bellman on the Birth of Dynamic Programming", Operations Research, Vol. 50, No. 1, Jan-Feb, 48-51.
² For the most part we will consider applications in which these different stages correspond to different moments in time.


Figure 2.1.1: System dynamics for the Inventory Control problem.

Assumptions:

1. Leadtime = 0 (i.e., instantaneous replenishment)

2. Independent demands w0, w1, . . . , wN−1

3. Fully backlogged demand

4. Zero terminal cost (i.e., free disposal gN (xN ) = 0)

The objective is to minimize the total cost over N periods, i.e.,
\[
\min_{u_0, \ldots, u_{N-1} \ge 0} \; \mathbb{E}_{w_0, w_1, \ldots, w_{N-1}} \left[ \sum_{k=0}^{N-1} \bigl( c\, u_k + r(x_k + u_k - w_k) \bigr) \right].
\]

We will prove that for convex cost functions r(·), the optimal policy is of the “order up to” form.
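As a preview of the machinery developed in this chapter, the following Python sketch computes the optimal policy numerically by backward induction for a small discretized instance of this problem. It is only an illustration: the linear holding/backlog cost r(y) = h·max(y, 0) + p·max(−y, 0), the uniform demand distribution, all parameter values, and the truncation of the state space to a finite grid are assumptions. The printed order-up-to levels x + µ0(x) illustrate the claimed structure.

```python
# Backward induction for a discretized version of the inventory problem of Example 2.1.1.
import numpy as np

N, c, h, p, d_max = 5, 1.0, 1.0, 3.0, 4
demands = np.arange(d_max + 1)
prob = np.full(d_max + 1, 1.0 / (d_max + 1))      # i.i.d. demand, uniform on {0, ..., d_max}
states = np.arange(-20, 21)                        # inventory levels (backlog allowed)
orders = np.arange(0, 21)                          # feasible order quantities

def r(y):
    return h * np.maximum(y, 0) + p * np.maximum(-y, 0)

V = np.zeros(len(states))                          # V_N = 0 (free disposal)
policy = []
for k in range(N - 1, -1, -1):
    V_new = np.empty_like(V)
    mu = np.empty(len(states), dtype=int)
    for i, x in enumerate(states):
        best = np.inf
        for u in orders:
            y = np.clip(x + u - demands, states[0], states[-1])   # next state, clipped to the grid
            idx = y - states[0]
            cost = c * u + np.dot(prob, r(x + u - demands) + V[idx])
            if cost < best:
                best, mu[i] = cost, u
        V_new[i] = best
    V, policy = V_new, [mu] + policy

# Order-up-to structure: x + mu_0(x) is (approximately) constant wherever mu_0(x) > 0.
mu0 = policy[0]
print([int(x + u) for x, u in zip(states, mu0) if u > 0][:10])
```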

The inventory problem highlights the following main features of our Basic Model:

1. An underlying discrete-time dynamic system

2. A finite horizon

3. A cost function that is additive over time

System dynamics are described by a sequence of states driven by a system equation

xk+1 = fk(xk, uk, wk), k = 0, 1, . . . , N − 1,

where:

• k is a discrete time index

• fk is the state transition function

• xk is the current state of the system. It could summarize past information relevant for future optimization when the system is not Markovian.


• uk is the control; decision variable to be selected at time k

• wk is a random parameter ("disturbance" or "noise") described by a probability distribution Pk(· | xk, uk)

• N is the length of the horizon; number of periods when control is applied

The per-period cost function is given by gk(xk, uk, wk). The total cost function is additive over time, with a total expected cost given by
\[
\mathbb{E}_{w_0, w_1, \ldots, w_{N-1}} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right],
\]
where the expectation is taken over the joint distribution of the random variables w0, w1, . . . , wN−1 involved.

The sequence of events in a period k is the following:

1. The system manager observes the current state xk

2. Decision uk is made

3. Random noise wk is realized. It could potentially depend on xk and uk (for example, think of a case where uk is price and wk is demand).

4. Cost gk(xk, uk, wk) is incurred

5. Transition xk+1 = fk(xk, uk, wk) occurs

If we think about tackling a possible solution to a discrete DP such as the Inventory Example 2.1.1, two somewhat extreme strategies can be considered:

1. Open Loop: Select all orders u0, u1, . . . , uN−1 at time k = 0.

2. Closed Loop: Sequential decision making; place an order uk at time k. Here, we gain information about the realization of demand on the fly.

Intuitively, in a deterministic DP setting in which the values of (w0, w1, . . . , wN−1) are known at time 0, open- and closed-loop strategies are equivalent because no uncertainty is revealed over time and hence there is no gain from waiting. However, in a stochastic environment postponing decisions can have a significant impact on the overall performance of a particular strategy. So closed-loop optimization is generally needed to solve a stochastic DP problem to optimality. In closed-loop optimization, we want to find an optimal rule (i.e., a policy) for selecting action uk in period k as a function of the state xk. So, we want to find a sequence of functions µk(xk) = uk, k = 0, 1, . . . , N − 1. The sequence π = {µ0, µ1, . . . , µN−1} is a policy or control law. For each policy π, we can associate a trajectory $x^\pi = (x^\pi_0, x^\pi_1, \ldots, x^\pi_N)$ that describes the evolution of the state of the system (e.g., units in inventory at the beginning of every period in Example 2.1.1) over time when the policy π has been chosen. Note that in general $x^\pi$ is a stochastic process. The corresponding performance of policy π = {µ0, µ1, . . . , µN−1} is given by

\[
J_\pi = \mathbb{E}_{w_0, w_1, \ldots, w_{N-1}} \left[ g_N(x^\pi_N) + \sum_{k=0}^{N-1} g_k\bigl(x^\pi_k, \mu_k(x^\pi_k), w_k\bigr) \right].
\]


If the initial state is fixed, i.e., $x^\pi_0 = x_0$ for every feasible policy π, then we denote the performance of policy π by Jπ(x0).

The objective of dynamic programming is to optimize Jπ over all policies π that satisfy the constraints of the problem.
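For a concrete feel of the objective Jπ, the following Python sketch (not part of the notes) estimates Jπ(x0) by Monte Carlo simulation for the inventory problem under a simple order-up-to policy; the base-stock level S, the costs, and the demand distribution are illustrative assumptions.

```python
# Monte Carlo estimate of J_pi(x0) for the inventory problem under mu_k(x) = max(S - x, 0).
import random

N, c, h, p, S, x0 = 5, 1.0, 1.0, 3.0, 4, 0

def r(y):                                   # holding/backlog cost
    return h * max(y, 0) + p * max(-y, 0)

def J_pi(x0, n_samples=100_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x, cost = x0, 0.0
        for k in range(N):
            u = max(S - x, 0)               # order-up-to policy mu_k(x)
            w = rng.randint(0, 4)           # demand w_k ~ Uniform{0, ..., 4}
            cost += c * u + r(x + u - w)    # per-period cost g_k
            x = x + u - w                   # system equation x_{k+1} = x_k + u_k - w_k
        total += cost                       # terminal cost g_N = 0
    return total / n_samples

print(f"Estimated J_pi(x0={x0}) for the order-up-to-{S} policy: {J_pi(x0):.3f}")
```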

2.1.1 Markov Decision Processes

There are situations where the state xk is naturally discrete, and its evolution can be modeled by a Markov chain. In these cases, the state transition function is described by the matrix of transition probabilities between the states:
\[
p_{ij}(u, k) = \mathbb{P}\{ x_{k+1} = j \mid x_k = i,\, u_k = u \}.
\]

Claim: Transition probabilities ⇐⇒ System equation

Proof: ⇒) Given a transition probability representation,
\[
p_{ij}(u, k) = \mathbb{P}\{ x_{k+1} = j \mid x_k = i,\, u_k = u \},
\]
we can cast it in terms of the basic DP framework as
\[
x_{k+1} = w_k, \quad \text{where } \mathbb{P}\{ w_k = j \mid x_k = i,\, u_k = u \} = p_{ij}(u, k).
\]
⇐) Given a discrete-state system equation xk+1 = fk(xk, uk, wk) and a probability distribution for wk, Pk(wk | xk, uk), we can get the following transition probability representation:
\[
p_{ij}(u, k) = P_k\bigl( W_k(i, u, j) \mid x_k = i,\, u_k = u \bigr),
\]
where the event Wk is defined as
\[
W_k(i, u, j) = \{ w \mid j = f_k(i, u, w) \}.
\]

Example 2.1.2 (Scheduling)

• Objective: Find the optimal sequence of operations A,B,C,D to produce a certain product.

• Precedence constraints: A −→ B, C −→ D.

• State definition: Set of operations already performed.

• Costs: Startup costs SA and SB incurred at time k = 0, and setup transition costs Cnm from

operation m to n.

This example is represented in Figure 2.1.2. The optimal solution is described by a path of minimum

cost that starts at the initial state and ends at some state at the terminal time. The cost of a path is

the sum of the labels in the arcs plus the terminal cost (label in the leaf).


Figure 2.1.2: System dynamics for the Scheduling problem.

Example 2.1.3 (Chess Game)

• Objective: Find the optimal two-game chess match strategy that maximizes the winning chances.

• Description of the match:

Each game can have two outcomes: Win (1 point for winner, 0 for loser), and Draw (1/2 point

for each player)

If the score is tied after two games, the match continues until one of them wins a game (“sudden

death”).

• State: vector with the two scores attained so far. It could also be the net score (difference between

the scores).

• Each player has two playing styles, and can choose one of the two at will in each game:

– Timid play: draws with probability pd > 0, and loses w.p. 1− pd.

– Bold play: wins w.p. pw > 0, and loses w.p. 1− pw.

• Observations: If there is a tie after the 2nd game, the player must play Bold. So, from an analytical

perspective, the problem is a two-period one. Also note that this is not a “game theory” setting,

since there is no best response here. The other player’s strategy is somehow captured by the

corresponding probabilities.

Using the equivalence between the system equation and the transition probability representation mentioned above, Figure 2.1.3 shows the transition probabilities for period k = 0, and Figure 2.1.4 shows the transition probabilities for the second stage of the match (i.e., k = 1), together with the costs of the terminal states.


Figure 2.1.3: Transition probability graph for period k = 0 for the chess match

Note that these terminal costs are negative because maximizing the probability of winning p is equivalent to minimizing −p (recall that we are working with min problems so far).

Figure 2.1.4: Transition probability graph for period k = 1 for the chess match.

One interesting feature of this problem (to be verified later) is that even if pw < 1/2, the player could still have more than 50% chance of winning the match.
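A small Python sketch (illustrative, not from the notes) makes this concrete: it computes the match-winning probability by backward recursion over the net score, with the timid/bold choice in each of the two games and a bold sudden-death game in case of a tie. The values pd = 0.9 and pw = 0.45 are example numbers.

```python
# DP over the net score for the two-game chess match of Example 2.1.3.
def match_win_prob(pd, pw):
    # Win probability at the end of game 2 as a function of the net score:
    # ahead -> 1, behind -> 0, tied -> sudden death played Bold -> pw.
    V2 = {n: (1.0 if n > 0 else (pw if n == 0 else 0.0)) for n in range(-2, 3)}

    def step(V):
        # One game earlier: choose Timid (draw w.p. pd) or Bold (win w.p. pw).
        return {n: max(pd * V[n] + (1 - pd) * V[n - 1],
                       pw * V[n + 1] + (1 - pw) * V[n - 1])
                for n in range(-1, 2)}

    V1 = step(V2)                                   # value before game 2, net score in {-1, 0, 1}
    timid = pd * V1[0] + (1 - pd) * V1[-1]          # Timid in game 1
    bold = pw * V1[1] + (1 - pw) * V1[-1]           # Bold in game 1
    return max(timid, bold)

print(match_win_prob(pd=0.9, pw=0.45))   # about 0.54 > 0.5, even though pw < 1/2
```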

2.2 Deterministic DP and the Shortest Path Problem

In this section, we focus on deterministic problems, i.e., problems where the value of each disturbance wk is known in advance at time 0. In deterministic problems, using feedback does not help in terms of cost reduction, and hence open-loop and closed-loop policies are equivalent.

Claim: In deterministic problems, minimizing cost over admissible policies {µ0, µ1, . . . , µN−1} (i.e., sequences of functions) leads to the same optimal cost as minimizing over sequences of control vectors {u0, u1, . . . , uN−1}.


Proof: Given a policy {µ0, µ1, . . . , µN−1} and an initial state x0, the future states are perfectly predictable through the equation
\[
x_{k+1} = f_k(x_k, \mu_k(x_k)), \qquad k = 0, 1, \ldots, N - 1,
\]

and the corresponding controls are perfectly predictable through the equation

uk = µk(xk), k = 0, 1, . . . , N − 1.

Thus, the cost achieved by an admissible policy {µ0, µ1, . . . , µN−1} for a deterministic problem is also achieved by the control sequence {u0, u1, . . . , uN−1} defined above.

Hence, we may restrict attention to sequences of controls without loss of optimality.

2.2.1 Deterministic finite-state problem

This type of problem can be represented by a graph (see Figure 2.2.1), where:

• States ⇐⇒ Nodes

• Each control applied over a state xk ⇐⇒ Arc from node xk.

So, every outgoing arc from a node xk represents one possible control uk. Among them, we have to choose the best one u∗k (i.e., the one that minimizes the cost from node xk onwards).

• Control sequences (open-loop) ⇐⇒ paths from initial state s to terminal states

• Final stage ⇐⇒ Artificial terminal node t

• Each state xN at stage N ⇐⇒ Connected to the terminal node t with an arc having cost gN(xN).

• One-step costs gk(i, uk) ⇐⇒ Cost of an arc $a^k_{ij}$ (cost of the transition from state i ∈ Sk to state j ∈ Sk+1 at time k, viewed as the "length of the arc") if uk forces the transition i → j.

Define $a^N_{it}$ as the terminal cost of state i ∈ SN.

Assume $a^k_{ij} = \infty$ if there is no control that drives the system from i to j.

• Cost of a control sequence ⇐⇒ Cost of the corresponding path (viewed as the "length of the path")

• The deterministic finite-state problem is equivalent to finding a shortest path from s to t.

2.2.2 Backward and forward DP algorithms

The usual backward DP algorithm takes the form:
\[
J_N(i) = a^N_{it}, \quad i \in S_N,
\]
\[
J_k(i) = \min_{j \in S_{k+1}} \bigl[ a^k_{ij} + J_{k+1}(j) \bigr], \quad i \in S_k, \; k = 0, 1, \ldots, N - 1.
\]
The optimal cost is J0(s) and is equal to the length of the shortest path from s to t.


Figure 2.2.1: Construction of a transition graph for a deterministic finite-state system.

• Observation: An optimal path s −→ t is also an optimal path t −→ s in a "reverse" shortest path problem where the direction of each arc is reversed and its length is left unchanged.

• The previous observation leads to the forward DP algorithm:
\[
\tilde{J}_N(j) = a^0_{sj}, \quad j \in S_1,
\]
\[
\tilde{J}_k(j) = \min_{i \in S_{N-k}} \bigl[ a^{N-k}_{ij} + \tilde{J}_{k+1}(i) \bigr], \quad j \in S_{N-k+1}, \; k = 1, \ldots, N - 1.
\]
The optimal cost is
\[
\tilde{J}_0(t) = \min_{i \in S_N} \bigl[ a^N_{it} + \tilde{J}_1(i) \bigr].
\]
Note that both algorithms yield the same result: $J_0(s) = \tilde{J}_0(t)$. Interpret $\tilde{J}_k(j)$ as the optimal cost-to-arrive at state j from the initial state s.

The following observations apply to the forward DP algorithm:

• There is no forward DP algorithm for stochastic problems.

• Mathematically, for stochastic problems, we cannot restrict ourselves to open-loop sequences, so the shortest path viewpoint fails.

• Conceptually, in the presence of uncertainty, the concept of "optimal cost-to-arrive" at a state xk does not make sense. The reason is that it may be impossible to guarantee (w.p. 1) that any given state can be reached.

• By contrast, even in stochastic problems, the concept of "optimal cost-to-go" from any state xk (in expectation) makes clear sense.

Conclusion: A deterministic finite-state problem is equivalent to a special type of shortest path problem and can be solved by either the ordinary (backward) DP algorithm or by an alternative forward DP algorithm.
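The following Python sketch (an illustration with made-up arc costs) runs both the backward and the forward DP recursions on a small staged graph and checks that they return the same shortest-path length; the forward pass is written in terms of costs-to-arrive Dk rather than the reversed-time indexing J̃ used above.

```python
# Backward and forward DP on a small staged transition graph.
import math

N = 3
S = {0: ["s"], 1: ["A", "B"], 2: ["C", "D"], 3: ["E", "F"]}          # states per stage
a = {0: {("s", "A"): 1, ("s", "B"): 4},
     1: {("A", "C"): 2, ("A", "D"): 6, ("B", "C"): 3, ("B", "D"): 1},
     2: {("C", "E"): 5, ("C", "F"): 2, ("D", "E"): 1, ("D", "F"): 7}}
g_N = {"E": 2, "F": 3}                                               # terminal costs a^N_{it}

# Backward DP: J_k(i) = min_j [ a^k_{ij} + J_{k+1}(j) ].
J = {i: g_N[i] for i in S[N]}
for k in range(N - 1, -1, -1):
    J = {i: min(a[k].get((i, j), math.inf) + J[j] for j in S[k + 1]) for i in S[k]}

# Forward DP (cost-to-arrive): D_k(j) = min_i [ a^{k-1}_{ij} + D_{k-1}(i) ].
D = {"s": 0.0}
for k in range(1, N + 1):
    D = {j: min(a[k - 1].get((i, j), math.inf) + D[i] for i in S[k - 1]) for j in S[k]}
forward_cost = min(D[i] + g_N[i] for i in S[N])

print(J["s"], forward_cost)   # both equal the shortest s -> t length
```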


2.2.3 Generic shortest path problems

Here, we are converting a shortest path problem to a deterministic finite-state problem. More formally, given a graph, we want to compute the shortest path from each node i to the final node t. How can we cast this into the DP framework?

• Let {1, 2, . . . , N, t} be the set of nodes of a graph, where t is the destination node.

• Let aij be the cost of moving from node i to node j.

• Objective: Find a shortest (minimum cost) path from each node i to node t.

• Assumption: All cycles have nonnegative length. Then, an optimal path need not take more than N moves (the depth of a tree).

• We formulate the problem as one where we require exactly N moves but allow degenerate moves from a node i to itself with cost aii = 0.

• In terms of the DP framework, we propose a formulation with N stages labeled 0, 1, . . . , N − 1. Denote:

Jk(i) = Optimal cost of getting from i to t in N − k moves

J0(i) = Cost of the optimal path from i to t in N moves.

• DP algorithm:
\[
J_k(i) = \min_{j = 1, 2, \ldots, N} \bigl\{ a_{ij} + J_{k+1}(j) \bigr\}, \quad k = 0, 1, \ldots, N - 2,
\]
with $J_{N-1}(i) = a_{it}$, i = 1, 2, . . . , N.

• The optimal policy when at node i after k moves is to move to a node j* such that
\[
j^* = \arg\min_{1 \le j \le N} \bigl\{ a_{ij} + J_{k+1}(j) \bigr\}.
\]

• If the optimal path from the algorithm contains degenerate moves from a node to itself, it means that the path in reality involves less than N moves.

Demonstration of the algorithm

Consider the problem exhibited in Figure 2.2.2, where the costs aij with i ≠ j are shown along the connecting line segments. The graph is represented as a non-directed one, meaning that the arc costs are the same in both directions, i.e., aij = aji.

Running the algorithm:

In this case, we have N = 4, so it is a 3-stage problem with 4 states:

1. Starting from stage N − 1 = 3, we compute JN−1(i) = ait, for i = 1, 2, 3, 4 and t = 5:

J3(1) = cost of getting from node 1 to node 5 = 2

J3(2) = cost of getting from node 2 to node 5 = 7

J3(3) = cost of getting from node 3 to node 5 = 5

J3(4) = cost of getting from node 4 to node 5 = 3

The numbers above represent the cost of getting from i to t in N − (N − 1) = 1 move.


Figure 2.2.2: Shortest path problem data. There are N = 4 states, and a destination node t = 5.

2. Proceeding backwards to stage N − 2 = 2, we have:
\[
J_2(1) = \min_{j=1,\ldots,4}\{a_{1j} + J_3(j)\} = \min\{0+2,\; 6+7,\; 5+5,\; 2+3\} = 2,
\]
\[
J_2(2) = \min_{j=1,\ldots,4}\{a_{2j} + J_3(j)\} = \min\{6+2,\; 0+7,\; 0.5+5,\; 5+3\} = 5.5,
\]
\[
J_2(3) = \min_{j=1,\ldots,4}\{a_{3j} + J_3(j)\} = \min\{5+2,\; 0.5+7,\; 0+5,\; 1+3\} = 4,
\]
\[
J_2(4) = \min_{j=1,\ldots,4}\{a_{4j} + J_3(j)\} = \min\{2+2,\; 5+7,\; 1+5,\; 0+3\} = 3.
\]

3. Proceeding backwards to stage N − 3 = 1, we have:
\[
J_1(1) = \min_{j=1,\ldots,4}\{a_{1j} + J_2(j)\} = \min\{0+2,\; 6+5.5,\; 5+4,\; 2+3\} = 2,
\]
\[
J_1(2) = \min_{j=1,\ldots,4}\{a_{2j} + J_2(j)\} = \min\{6+2,\; 0+5.5,\; 0.5+4,\; 5+3\} = 4.5,
\]
\[
J_1(3) = \min_{j=1,\ldots,4}\{a_{3j} + J_2(j)\} = \min\{5+2,\; 0.5+5.5,\; 0+4,\; 1+3\} = 4,
\]
\[
J_1(4) = \min_{j=1,\ldots,4}\{a_{4j} + J_2(j)\} = \min\{2+2,\; 5+5.5,\; 1+4,\; 0+3\} = 3.
\]

4. Finally, proceeding backwards to stage 0, we have:
\[
J_0(1) = \min_{j=1,\ldots,4}\{a_{1j} + J_1(j)\} = \min\{0+2,\; 6+4.5,\; 5+4,\; 2+3\} = 2,
\]
\[
J_0(2) = \min_{j=1,\ldots,4}\{a_{2j} + J_1(j)\} = \min\{6+2,\; 0+4.5,\; 0.5+4,\; 5+3\} = 4.5,
\]
\[
J_0(3) = \min_{j=1,\ldots,4}\{a_{3j} + J_1(j)\} = \min\{5+2,\; 0.5+4.5,\; 0+4,\; 1+3\} = 4,
\]
\[
J_0(4) = \min_{j=1,\ldots,4}\{a_{4j} + J_1(j)\} = \min\{2+2,\; 5+4.5,\; 1+4,\; 0+3\} = 3.
\]
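The same computations can be reproduced with a few lines of Python. The sketch below (not part of the notes) implements the recursion Jk(i) = min_j {a_ij + J_{k+1}(j)} with degenerate self-moves a_ii = 0 and the arc costs of Figure 2.2.2, and prints J2, J1 and J0.

```python
# Shortest path DP for the example of Figure 2.2.2 (destination node t = 5).
INF = float("inf")
N = 4
# Symmetric arc costs a_ij between nodes 1..4.
a = {(1, 2): 6, (1, 3): 5, (1, 4): 2, (2, 3): 0.5, (2, 4): 5, (3, 4): 1}
a.update({(j, i): c for (i, j), c in list(a.items())})
a.update({(i, i): 0 for i in range(1, N + 1)})       # degenerate self-moves
a_t = {1: 2, 2: 7, 3: 5, 4: 3}                        # terminal arcs to node 5

J = dict(a_t)                                         # J_{N-1}(i) = a_it
for k in range(N - 2, -1, -1):                        # k = N-2, ..., 0
    J = {i: min(a.get((i, j), INF) + J[j] for j in range(1, N + 1))
         for i in range(1, N + 1)}
    print(f"J_{k}: {J}")
# Final line prints J_0 = {1: 2, 2: 4.5, 3: 4, 4: 3}, matching the calculations above.
```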


Figure 2.2.3 shows the outcome of the shortest path (DP-type) algorithm applied over the graph inFigure 2.2.2.

Figure 2.2.3: Outcome of the shortest path algorithm. The arcs represent the optimal control to follow from a given state i (node i in the graph) at a particular stage k (where stage k means 4 − k transitions left to reach t = 5). When there is more than one arc going out of a node, it represents the availability of more than one optimal control. The label next to each node shows the cost-to-go starting at the corresponding (stage, state) position.

2.2.4 Some shortest path applications

Hidden Markov models and the Viterbi algorithm

Consider a Markov chain for which we do not observe the outcome of the transitions but rather we observe a signal or proxy that relates to each transition. The setting of the problem is the following:

• Markov chain (discrete time, finite number of states) with transition probabilities pij .

• State transitions are hidden from view.

• For each transition, we get an independent observation.

• Denote πi: Probability that the initial state is i.

• Denote r(z; i, j): Probability that the observation takes the value z when the state transition is from i to j.³

• Trajectory estimation problem: Given the observation sequence ZN = {z1, z2, . . . , zN}, what is the most likely (unobservable) transition sequence X̂N = {x̂0, x̂1, . . . , x̂N}? More formally: we are looking for the transition sequence X̂N that maximizes p(XN | ZN) over all XN = {x0, x1, . . . , xN}. We are using the notation X̂N to emphasize the fact that this is an estimated sequence: we do not observe the true sequence, but just a proxy for it given by ZN.

³ The probabilities pij and r(z; i, j) are assumed to be independent of time for notational convenience, but the methodology could be extended to time-dependent probabilities.


Viterbi algorithm

We know from conditional probability that
\[
\mathbb{P}(X_N \mid Z_N) = \frac{\mathbb{P}(X_N, Z_N)}{\mathbb{P}(Z_N)},
\]
for unconditional probabilities $\mathbb{P}(X_N, Z_N)$ and $\mathbb{P}(Z_N)$. Since $\mathbb{P}(Z_N)$ is a positive constant once ZN is known, we can just maximize $\mathbb{P}(X_N, Z_N)$, where
\begin{align*}
\mathbb{P}(X_N, Z_N) &= \mathbb{P}(x_0, x_1, \ldots, x_N, z_1, z_2, \ldots, z_N) \\
&= \pi_{x_0}\, \mathbb{P}(x_1, \ldots, x_N, z_1, \ldots, z_N \mid x_0) \\
&= \pi_{x_0}\, \mathbb{P}(x_1, z_1 \mid x_0)\, \mathbb{P}(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1) \\
&= \pi_{x_0}\, p_{x_0 x_1}\, r(z_1; x_0, x_1)\, \mathbb{P}(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1),
\end{align*}
and the last conditional probability factors in turn as $p_{x_1 x_2}\, r(z_2; x_1, x_2)\, \mathbb{P}(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)$.

Continuing in the same manner we obtain:
\[
\mathbb{P}(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^{N} p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k).
\]
Instead of working with this function, we will maximize $\log \mathbb{P}(X_N, Z_N)$, or equivalently solve
\[
\min_{x_0, x_1, \ldots, x_N} \left\{ -\log(\pi_{x_0}) - \sum_{k=1}^{N} \log\bigl( p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k) \bigr) \right\}.
\]
The outcome of this minimization problem is the estimated sequence X̂N = {x̂0, x̂1, . . . , x̂N}.

Transformation into a shortest path problem in a trellis diagram

We build the trellis diagram shown in Figure 2.2.4 as follows:

• Arc (s, x0) → Cost = − log πx0 .

• Arc (xN , t) → Cost = 0.

• Arc (xk−1, xk) → Cost = − log(pxk−1xkr(zk;xk−1, xk)).

The shortest path defines the estimated state sequence {x̂0, x̂1, . . . , x̂N}.

In practice, the shortest path is most conveniently constructed sequentially by forward DP: suppose that we have already computed the shortest distances Dk(xk) from s to all states xk, on the basis of the observation sequence z1, . . . , zk, and suppose that we observe zk+1. Then:
\[
D_{k+1}(x_{k+1}) = \min_{x_k : \, p_{x_k x_{k+1}} > 0} \Bigl\{ D_k(x_k) - \log\bigl( p_{x_k x_{k+1}}\, r(z_{k+1}; x_k, x_{k+1}) \bigr) \Bigr\}, \quad k = 0, 1, \ldots, N - 1,
\]
starting from $D_0(x_0) = -\log \pi_{x_0}$.
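The following Python sketch (illustrative only; the two-state chain, the observation model and the observed sequence are made-up assumptions) implements this forward recursion and backtracks the most likely state sequence.

```python
# Viterbi algorithm as a forward shortest-path DP on the trellis.
import math

states = [0, 1]
pi = [0.6, 0.4]                                  # initial distribution pi_i
p = [[0.7, 0.3], [0.4, 0.6]]                     # transition probabilities p_ij
# r[z][i][j]: probability of observing z on a transition i -> j (here it depends only on j).
r = {"a": [[0.9, 0.2], [0.9, 0.2]], "b": [[0.1, 0.8], [0.1, 0.8]]}
obs = ["a", "b", "b", "a"]                       # observed sequence z_1, ..., z_N

D = {i: -math.log(pi[i]) for i in states}        # D_0(x_0) = -log pi_{x_0}
parent = []
for z in obs:
    D_next, back = {}, {}
    for j in states:
        costs = {i: D[i] - math.log(p[i][j] * r[z][i][j]) for i in states if p[i][j] > 0}
        back[j] = min(costs, key=costs.get)
        D_next[j] = costs[back[j]]
    D, parent = D_next, parent + [back]

# Backtrack the most likely state sequence x_0, ..., x_N.
x = [min(D, key=D.get)]
for back in reversed(parent):
    x.append(back[x[-1]])
print(list(reversed(x)))
```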

Observations:

• The final estimated sequence X̂N corresponds to the shortest path from s to the final state x̂N that minimizes DN(xN) over the final set of possible states xN.


Figure 2.2.4: State estimation of a hidden Markov model viewed as a problem of finding a shortest path from s to t. There are N + 1 copies of the state space (recall that the number of states is finite). So, xk stands for any state in copy k of the state space. An arc connects xk−1 with xk if p_{x_{k-1} x_k} > 0.

• Advantage: It can be computed in real time, as new observations arrive.

• Applications of the Viterbi algorithm:

– Speech recognition, where the goal is to transcribe a spoken word sequence in terms of elementary speech units called "phonemes".

Setting:

∗ States of the hidden Markov model: phonemes.

∗ Given a sequence of recorded phonemes ZN = {z1, . . . , zN} (i.e., a noisy representation of words), try to find a phonemic sequence X̂N = {x̂1, . . . , x̂N} that maximizes the conditional probability P(XN | ZN) over all possible XN = {x1, . . . , xN}.

∗ The probabilities pxk−1xk and r(zk;xk−1, xk) can be experimentally obtained.

– Computerized recognition of handwriting.

2.2.5 Shortest path algorithms

Computational implications of the equivalence shortest path problems ⇐⇒ deterministic finite-state DP:

• We can use DP to solve general shortest path problems.

Although there are other methods with superior worst-case performance, DP could be preferred because it is highly parallelizable.

• There are many non-DP shortest path algorithms that can be used to solve deterministic finite-state problems.


– They may be preferable to DP if they avoid calculating the optimal cost-to-go at every state.

– This is essential for problems with huge state spaces (e.g., combinatorial optimization problems).

Example 2.2.1 (An example with a very large number of nodes: TSP)

The Traveling Salesman Problem (TSP) is about finding a tour (cycle) that passes exactly once through each city (node) of a graph and that minimizes the total cost. Consider for instance the problem described in Figure 2.2.5.

Figure 2.2.5: Basic graph for the TSP example with four cities.

To convert a TSP problem over a map (graph) with N nodes to a shortest path problem, build a new

execution graph as follows:

• Pick a city and set it as the initial node s.

• Associate a node with every sequence of n distinct cities, n ≤ N .

• Add an artificial terminal node t.

• A node representing a sequence of cities c1, c2, . . . , cn is connected with a node representing a

sequence c1, c2, . . . , cn, cn+1 with an arc with weight acncn+1 (length of the arc in the original

graph).

• Each sequence of N cities is connected to the terminal node through an arc with same cost as the

cost of the arc connecting the last city of the sequence and city s in the original graph.

Figure 2.2.6 shows the construction of the execution graph for the example described in Figure 2.2.5.

2.2.6 Alternative shortest path algorithms: Label correcting methods

Working on a shortest path execution graph such as the one in Figure 2.2.6, the idea of these methods is to progressively discover shorter paths from the origin s to every other node i.

• Given: Origin s, destination t, lengths aij ≥ 0.


Figure 2.2.6: Structure of the shortest path execution graph for the TSP example.

• Notation:

– Label di: Length of the shortest path found so far (initially ds = 0, di = ∞ for i ≠ s).

– Variable UPPER: Label dt of the destination.

– Set OPEN: Contains nodes that are currently active in the sense that they are candidates for further examination (initially, OPEN := {s}). It is sometimes called the candidate list.

– Function ParentOf(j): Saves the predecessor of j in the shortest path found so far from s

to j. At the end of the algorithm, proceeding backward from node t, it allows to rebuildthe shortest path from s.

Label Correcting Algorithm (LCA)

Step 1 Node removal: Remove a node i from OPEN and for each child j of i, do Step 2.

Step 2 Node insertion test: If di + aij < min{dj, UPPER}, set dj := di + aij and set ParentOf(j) := i.

In addition, if j ≠ t, set OPEN := OPEN ∪ {j}; while if j = t, set UPPER := dt.

Step 3 Termination test: If OPEN is empty, terminate; else go to Step 1.

As a clarification for Step 2, note that since OPEN is a set, if j, j ≠ t, is already in OPEN, then OPEN remains the same. Also, when j = t, note that UPPER takes the new value dt = di + ait that has just been updated. Figure 2.2.7 sketches the Label Correcting Algorithm.

The execution of the algorithm over the TSP example above is represented in Figure 2.2.8 and Table 2.1. Interestingly, note that several nodes of the execution graph never enter the OPEN set. Indeed, this computational reduction with respect to DP is what makes this method appealing.
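A compact Python sketch of the algorithm is given below (an illustration, not from the notes): OPEN is handled as a FIFO queue (the Bellman-Ford variant discussed later), the graph is an arbitrary small example, and ParentOf is used at the end to rebuild the shortest path.

```python
# Label Correcting Algorithm on a small directed graph with nonnegative arc lengths.
from collections import deque

def label_correcting(arcs, s, t):
    # arcs: dict node -> list of (child, arc length)
    d = {s: 0.0}
    parent = {}
    UPPER = float("inf")
    OPEN = deque([s])
    while OPEN:                                   # Step 3: stop when OPEN is empty
        i = OPEN.popleft()                        # Step 1: remove a node i from OPEN
        for j, a_ij in arcs.get(i, []):
            if d[i] + a_ij < min(d.get(j, float("inf")), UPPER):   # Step 2: insertion test
                d[j] = d[i] + a_ij
                parent[j] = i
                if j != t:
                    if j not in OPEN:
                        OPEN.append(j)
                else:
                    UPPER = d[t]
    # Rebuild the shortest path s -> t from ParentOf.
    path, node = [t], t
    while node != s:
        node = parent[node]
        path.append(node)
    return UPPER, list(reversed(path))

arcs = {"s": [("1", 2), ("2", 5)], "1": [("2", 1), ("t", 6)], "2": [("t", 2)]}
print(label_correcting(arcs, "s", "t"))            # (5.0, ['s', '1', '2', 't'])
```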

The following proposition establishes the validity of the Label Correcting Algorithm.


Figure 2.2.7: Sketch of the Label Correcting Algorithm.

Iter. | Node exiting OPEN | Label update / Observations | OPEN after | UPPER after
0 | - | - | {1} | ∞
1 | 1 | d2 := 5, d7 := 1, d10 := 15. ParentOf(2, 7, 10) := 1. | {2, 7, 10} | ∞
2 | 2 | d3 := 25, d5 := 9. ParentOf(3, 5) := 2. | {3, 5, 7, 10} | ∞
3 | 3 | d4 := 28. ParentOf(4) := 3. | {4, 5, 7, 10} | ∞
4 | 4 | Reached terminal node t. Set dt := 43. ParentOf(t) := 4. | {5, 7, 10} | 43
5 | 5 | d6 := d5 + 3 = 12. ParentOf(6) := 5. | {6, 7, 10} | 43
6 | 6 | Reached terminal node t. Set dt := 13. ParentOf(t) := 6. | {7, 10} | 13
7 | 7 | d8 := d7 + 3 = 4. Node ABC would have label d7 + 20 = 21 > UPPER, so it does not enter OPEN. ParentOf(8) := 7. | {8, 10} | 13
8 | 8 | d9 := 8. ParentOf(9) := 8. | {9, 10} | 13
9 | 9 | d9 + a9t = d9 + 5 = 13 ≥ UPPER. | {10} | 13
10 | 10 | If ADB is picked: d10 + 4 = 19 > UPPER. If ADC is picked: d10 + 3 = 18 > UPPER. | empty | 13

Table 2.1: The optimal solution ABDC is found after examining nodes 1 through 10 in Figure 2.2.8, in that order. The table also shows the successive contents of the OPEN list, the value of UPPER at the end of each iteration, and the actions taken during each iteration.

Proposition 2.2.1 If there exists at least one path from the origin to the destination in the exe-cution graph, the label correcting algorithm terminates with UPPER equal to the shortest distancefrom the origin to the destination. Otherwise, the algorithm terminates with UPPER=∞.

Proof: We proceed in three steps:

1. The algorithm terminates

Each time a node j enters OPEN, its label dj is decreased and becomes equal to the length of some path from s to j. The number of distinct lengths of paths from s to j that are smaller than any given number is finite. Hence, there can only be a finite number of label reductions.

2. Suppose that there is no path s→ t.


Figure 2.2.8: Labeling of the nodes in the execution graph when executing the LCA; the label next to each node corresponds to the iteration of the LCA at which it is examined.

Then a node i such that (i, t) is an arc cannot enter the OPEN list, because if that happened, since the paths are built starting from s, it would mean that there is a path s → i, which jointly with the arc (i, t) would determine a path s → t, which is a contradiction. Since this holds for all i adjacent to t in the basic graph, UPPER can never be reduced from its initial value ∞.

3. Suppose that there is a path s → t. Then, there is a shortest path s → t. Let (s, j1, j2, . . . , jk, t) be a shortest path, and let d* be the corresponding shortest distance. We will see that UPPER = d* upon termination.

Each subpath (s, j1, j2, . . . , jm), m = 1, . . . , k, must be a shortest path s → jm. If UPPER > d* at termination, then the same occurs throughout the algorithm (because UPPER is nonincreasing during the execution). So, UPPER is bigger than the length of all the subpaths s → jm (due to the nonnegative arc length assumption).

In particular, node jk can never enter the OPEN list with d_{jk} equal to the shortest distance s → jk. Indeed, if it did, then at the point when the algorithm picks jk from OPEN it would set d_t = d_{jk} + a_{jk t} = d*, and UPPER = d*, a contradiction.

Similarly, node j_{k−1} can never enter OPEN with d_{j_{k−1}} equal to the shortest distance s → j_{k−1}. Proceeding backward, j1 never enters OPEN with d_{j1} equal to the shortest distance s → j1, i.e., a_{s j1}. However, this happens at the first iteration of the algorithm, leading to a contradiction.

Therefore, UPPER will be equal to the shortest distance s→ t.


Specific Label Correcting Methods

Making the method efficient:

• Reduce the value of UPPER as quickly as possible (i.e., try to discover "good" s → t paths early in the course of the algorithm).

• Keep the number of reentries into OPEN low.

– Try to remove from OPEN nodes with small label first.

– Heuristic rationale: if di is small, then dj when set to di + aij will be accordingly small, so reentrance of j in the OPEN list is less likely.

• Reduce the overhead for selecting the node to be removed from OPEN.

• These objectives are often in conflict. They give rise to a large variety of distinct implementations.

Node selection methods:

• Breadth-first search: Also known as the Bellman-Ford method. The set OPEN is treated as an ordered list. FIFO policy.

• Depth-first search: The set OPEN is treated as an ordered list. LIFO policy. It often requires relatively little memory, especially for sparse (i.e., tree-like) graphs. It reduces UPPER quickly.

• Best-first search: Also known as the Dijkstra method. Remove from OPEN a node j with minimum value of label dj . In this way, each node will be inserted in OPEN at most once (see the sketch after this list).
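The three node-selection policies differ only in how a node is removed from OPEN. The following sketch is not part of the original notes: it implements the generic iteration on a hypothetical example graph (the `arcs` dictionary, the node names, and the `policy` argument are illustrative choices), returning the final value of UPPER; parent pointers are omitted for brevity.

```python
from collections import deque
import heapq

def label_correcting(arcs, s, t, policy="fifo"):
    """Generic label correcting shortest path from s to t.
    arcs: dict mapping node -> list of (successor, arc_length), lengths >= 0.
    policy: 'fifo' (Bellman-Ford), 'lifo' (depth-first), or 'best' (Dijkstra-like)."""
    INF = float("inf")
    d = {i: INF for i in arcs}           # labels
    d[s] = 0.0
    upper = INF                          # UPPER = length of best s->t path found so far
    # OPEN is a deque for FIFO/LIFO and a heap for best-first selection.
    open_list = deque([s]) if policy in ("fifo", "lifo") else [(0.0, s)]
    while open_list:
        if policy == "fifo":
            i = open_list.popleft()
        elif policy == "lifo":
            i = open_list.pop()
        else:
            _, i = heapq.heappop(open_list)
        for j, a_ij in arcs.get(i, []):
            if d[i] + a_ij < min(d.get(j, INF), upper):   # Step 2 test
                d[j] = d[i] + a_ij
                if j == t:
                    upper = d[j]                           # reduce UPPER
                elif policy in ("fifo", "lifo"):
                    open_list.append(j)
                else:
                    heapq.heappush(open_list, (d[j], j))
    return upper

# Hypothetical example: shortest path from 'A' to 'D' (length 4 via A-B-C-D).
arcs = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 6)], "C": [("D", 1)], "D": []}
print([label_correcting(arcs, "A", "D", p) for p in ("fifo", "lifo", "best")])
```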

Advanced initialization:

In order to get a small starting value of UPPER, instead of starting from di = ∞ for all i ≠ s, we can initialize the value of the labels di and the set OPEN as follows:

Start with

di := length of some path from s to i.

If there is no such path, set di = ∞. Then, set OPEN := {i ≠ t | di < ∞}.

• No node with shortest distance greater than or equal to the initial value of UPPER will enter OPEN.

• Good practical idea:

– Run a heuristic to get a “good” starting path P from s to t.

– Use as UPPER the length of P , and as di the path distances of all nodes i along P .


2.2.7 Exercises

Exercise 2.2.1 A decision maker must continually choose between two activities over a time interval [0, T ]. Choosing activity i at time t, where i = 1, 2, earns reward at a rate gi(t), and every switch between the two activities costs c > 0. Thus, for example, the reward for starting with activity 1, switching to 2 at time t1, and switching back to 1 at time t2 > t1, earns total reward

∫_0^{t1} g1(t) dt + ∫_{t1}^{t2} g2(t) dt + ∫_{t2}^{T} g1(t) dt − 2c.

We want to find a set of switching times that maximize the total reward. Assume that the function g1(t) − g2(t) changes sign a finite number of times in the interval [0, T ]. Formulate the problem as a finite horizon problem and write the corresponding DP algorithm.

Exercise 2.2.2 Assume that we have a vessel whose maximum weight capacity is z and whose cargo is to consist of different quantities of N different items. Let vi denote the value of the ith type of item, wi the weight of the ith type of item, and xi the number of items of type i that are loaded in the vessel. The problem is to find the most valuable cargo subject to the capacity constraint. Formulate this problem in terms of DP.

Exercise 2.2.3 Find a shortest path from each node to node 6 for the graph below by using the DP algorithm:

[Graph figure omitted: six nodes labeled 1–6 with the arc lengths shown in the original notes.]

Exercise 2.2.4 Air transportation is available between n cities, in some cases directly and in others through intermediate stops and change of carrier. The airfare between cities i and j is denoted by aij . We assume that aij = aji, and for notational convenience, we write aij = ∞ if there is no direct flight between i and j. The problem is to find the cheapest airfare for going between two cities, perhaps through intermediate stops. Let n = 6 and

a12 = 30, a13 = 60, a14 = 25, a15 = a16 = ∞,
a23 = a24 = a25 = ∞, a26 = 50,
a34 = 35, a35 = a36 = ∞,
a45 = 15, a46 = ∞,
a56 = 15.

Find the cheapest airfare from every city to every other city by using the DP algorithm.


Exercise 2.2.5 Label correcting with negative arc lengths. Consider the problem of finding a shortest path from node s to node t, and assume that all cycle lengths are nonnegative (instead of all arc lengths being nonnegative). Suppose that a scalar uj is known for each node j, which is an underestimate of the shortest distance from j to t (uj can be taken −∞ if no underestimate is known). Consider a modified version of the typical iteration of the label correcting algorithm discussed above, where Step 2 is replaced by the following:

Modified Step 2: If di + aij < min{dj , UPPER − uj}, set dj = di + aij and mark i as the parent of j. In addition, if j ≠ t, place j in OPEN if it is not already in OPEN, while if j = t, set UPPER to the new value di + ait of dt.

1. Show that the algorithm terminates with a shortest path, assuming there is at least one path from s to t.

2. Why is the Label Correcting Algorithm given in class a special case of the one here?

Exercise 2.2.6 We have a set of N objects, denoted 1, 2, . . . , N , which we want to group in clusters that consist of consecutive objects. For each cluster {i, i + 1, . . . , j}, there is an associated cost aij . We want to find a grouping of the objects in clusters such that the total cost is minimum. Formulate the problem as a shortest path problem, and write a DP algorithm for its solution. (Note: An example of this problem arises in typesetting programs, such as TeX/LaTeX, that break down a paragraph into lines in a way that optimizes the paragraph’s appearance.)

2.3 Stochastic Dynamic Programming

We present here a general problem of decision making under stochastic uncertainty over a finite number of stages. The components of the formulation are listed below:

• The discrete time dynamic system evolves according to the system equation

xk+1 = fk(xk, uk, wk), k = 0, 1, . . . , N − 1,

where

– the state xk is an element of a space Sk,

– the control uk verifies uk ∈ Uk(xk) ⊂ Ck, for a space Ck, and

– the random disturbance wk is an element of a space Dk.

• The random disturbance wk is characterized by a probability distribution Pk(·|xk, uk) that may depend explicitly on xk and uk. For now, we assume independent disturbances w0, w1, . . . , wN−1. In this case, since the system evolution from state to state is independent of the past, we have a Markov decision model.

• Admissible policies π = {µ0, µ1, . . . , µN−1}, where µk maps states xk into controls uk = µk(xk), and is such that µk(xk) ∈ Uk(xk) for all xk ∈ Sk.


• Given an initial state x0 and an admissible policy π = {µ0, µ1, . . . , µN−1}, the states xk and disturbances wk are random variables with distributions defined through the system equation

xk+1 = fk(xk, µk(xk), wk), k = 0, 1, . . . , N − 1,

• The stage cost function is given by gk(xk, µk(xk), wk).

• The expected cost of a policy π starting at x0 is

Jπ(x0) = E[ gN(xN) + ∑_{k=0}^{N−1} gk(xk, µk(xk), wk) ],

where the expectation is taken over the r.v. wk and xk.

An optimal policy π∗ is one that minimizes this cost; that is,

Jπ∗(x0) = min_{π∈Π} Jπ(x0),    (2.3.1)

where Π is the set of all admissible policies.

Observations:

• It is useful to see J∗ as a function that assigns to each initial state x0 the optimal cost J∗(x0), and call it the optimal cost function or optimal value function.

• When produced by DP, π∗ is typically independent of x0, because π∗ = {µ∗0, . . . , µ∗N−1} and µ∗0(x0) must be defined for all x0.

• Even though we will be using the “min” operator in (2.3.1), it should be understood that the correct formal formulation should be Jπ∗(x0) = inf_{π∈Π} Jπ(x0).

Example 2.3.1 (Control of a single server queue)

Consider a single server queueing system with the following features:

• Waiting room for n − 1 customers; an arrival finding n people in the system (n − 1 waiting and one in the server) leaves.

• Discrete random service time belonging to the set {1, 2, . . . , N}.

• Probability pm of having m arrivals at the beginning of a period, with m = 0, 1, 2, . . ..

• System offers two types of service, that can be chosen at the beginning of each period:

– Fast, with cost per period cf , and that finishes at the end of the current period w.p. qf ,

– Slow, with cost per period cs, and that finishes at the end of the current period w.p. qs.

61

Page 62: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

Assume qf > qs and cf > cs.

• A recent arrival cannot be immediately served.

• The system incurs two costs in every period: The service cost (either cf or cs), and a waiting time cost r(i) if there are i customers waiting at the beginning of a period.

• There is a terminal cost R(i) if there are i customers waiting in the final period (i.e., in period N).

• Problem: Choose the type of service at the beginning of each period (fast or slow) in order to minimize the total expected cost over N periods.

Intuitively, the optimal strategy must be of the threshold type: “When there are more than i customers in the system use fast; otherwise, use slow”.

In terms of DP terminology, we have:

• State xk: Number of customers in the system at the start of period k.

• Control uk: Type of service provided; either uk = uf (fast) or uk = us (slow)

• Cost per period k: For 0 ≤ k ≤ N − 1, gk(i, uk, wk) = r(i) + cf 1{uk = uf} + cs 1{uk = us}.⁴

For k = N , gN (i) = R(i).

According to the claim above, since states are discrete, transition probabilities are enough to describe the system dynamics:

• If the system is empty, then:

p0j(uf) = p0j(us) = pj , j = 0, 1, . . . , n − 1;  and  p0n(uf) = p0n(us) = ∑_{m=n}^{∞} pm.

In words, since customers cannot be served immediately, they accumulate and the system jumps to state j < n (if there were fewer than n arrivals), or to state n (if there are n or more).

• When the system is not empty (i.e., xk = i > 0), then

pij(uf) = 0, if j < i − 1 (we cannot have more than one service completion per period),
pij(uf) = qf p0, if j = i − 1 (current customer finishes in this period and nobody arrives),
pij(uf) = P{j − i + 1 arrivals, service completed} + P{j − i arrivals, service not completed}
        = qf pj−i+1 + (1 − qf) pj−i, if i − 1 < j < n − 1,
pi(n−1)(uf) = P{at least n − i arrivals, service completed} + P{n − 1 − i arrivals, service not completed}
        = qf ∑_{m=n−i}^{∞} pm + (1 − qf) pn−1−i,
pin(uf) = P{at least n − i arrivals, service not completed}
        = (1 − qf) ∑_{m=n−i}^{∞} pm.

For control us the formulas are analogous, with us and qs replacing uf and qf , respectively.

⁴Here, 1{A} is the indicator function, taking the value one if event A occurs, and zero otherwise.
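To make these transition formulas concrete, here is a small sketch that is not part of the original notes: it assembles the matrix of probabilities pij(uf) for hypothetical values of the capacity n, the completion probability qf, and the arrival distribution, and checks that each row sums to one.

```python
import numpy as np

def transition_matrix(n, q, p):
    """Transition probabilities p_ij(u) for the single-server queue example.
    n: capacity (states 0,...,n); q: prob. the service in progress completes this period;
    p: array with p[m] = prob. of m arrivals (entries must sum to one; arrivals beyond
    the last index are assumed to have probability zero)."""
    p = np.asarray(p, dtype=float)
    tail = lambda m: p[m:].sum() if m < len(p) else 0.0   # P{at least m arrivals}
    pm = lambda m: p[m] if 0 <= m < len(p) else 0.0
    P = np.zeros((n + 1, n + 1))
    # Empty system: arrivals accumulate, nobody can be served this period.
    for j in range(n):
        P[0, j] = pm(j)
    P[0, n] = tail(n)
    # Nonempty system with i customers present.
    for i in range(1, n + 1):
        for j in range(i - 1, n - 1):
            P[i, j] = q * pm(j - i + 1) + (1 - q) * pm(j - i)
        P[i, n - 1] = q * tail(n - i) + (1 - q) * pm(n - 1 - i)
        P[i, n] = (1 - q) * tail(n - i)
    return P

# Hypothetical data: capacity n = 4, fast service completes w.p. 0.8,
# arrival distribution over 0, 1, 2, 3 arrivals (slow service is analogous with qs).
P_fast = transition_matrix(n=4, q=0.8, p=[0.5, 0.3, 0.15, 0.05])
print(np.round(P_fast, 3))
print(P_fast.sum(axis=1))   # each row should sum to 1
```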


2.4 The Dynamic Programming Algorithm

The DP technique rests on a very intuitive idea, the Principle of Optimality. The name is due to Bellman (New York 1920 – Los Angeles 1984).

Principle of optimality. Let π∗ = {µ∗0, µ∗1, . . . , µ∗N−1} be an optimal policy for the basic problem, and assume that when using π∗ a given state xi occurs at time i with some positive probability. Consider the “tail subproblem” whereby we are at xi at time i and wish to minimize the cost-to-go from time i to time N ,

E[ gN(xN) + ∑_{k=i}^{N−1} gk(xk, µk(xk), wk) ].

Then, the truncated (tail) policy {µ∗i , µ∗i+1, . . . , µ∗N−1} is optimal for this subproblem.

Figure 2.4.1 illustrates the intuition of the Principle of Optimality. The DP algorithm is based on


Figure 2.4.1: Principle of Optimality: The tail policy is optimal for the tail subproblem.

this idea: It first solves all tail subproblems of the final stage, and then proceeds backwards, solving all tail subproblems of a given time length using the solution of the tail subproblems of shorter time length. Next, we introduce the DP algorithm with two examples.

Solution to Scheduling Example 2.1.2:

Consider the graph of costs for the previous Example 2 given in Figure 2.4.2. Applying the DP algorithm from the terminal nodes (stage N = 3), and proceeding backwards, we get the representation in Figure 2.4.3. At each state-time pair, we record the optimal cost-to-go and the optimal decision. For example, node AC has a cost of 5, and the optimal decision is to proceed to ACB (because it has the lowest stage k = 3 cost, starting from k = 2 and state AC). In terms of our formal notation for the cost, g3(ACB) = 1, for a terminal state x3 = ACB.

Solution to Inventory Example 2.1.1:

Consider again the stochastic inventory problem described in Example 1. The application of the DP algorithm is very similar to the deterministic case, except for the fact that now costs are computed as expected values.



Figure 2.4.2: Graph of one-step switching costs for Example 2.

• Tail subproblems of length 1: The optimal cost for the last period is

JN−1(xN−1) = min_{uN−1 ≥ 0} E_{wN−1} [ c uN−1 + r(xN−1 + uN−1 − wN−1) ].

Note:

– uN−1 = µN−1(xN−1) depends on xN−1.

– JN−1(xN−1) may be computed numerically and stored as a column of a table.

• Tail subproblems of length N − k: The optimal cost for period k is

Jk(xk) = min_{uk ≥ 0} E_{wk} [ c uk + r(xk + uk − wk) + Jk+1(xk + uk − wk) ],

where xk+1 = xk + uk − wk is the initial inventory of the next period.

• The value J0(x0) is the optimal expected cost when the initial stock at time 0 is x0.

If the number of attainable states xk is discrete with finite support [0, S], the output of the DP algorithm could be stored in two tables (one for the optimal cost Jk, and one for the optimal control uk), each table consisting of N columns labeled from k = 0 to k = N − 1, and S + 1 rows labeled from 0 to S. The tables are filled by the DP algorithm from right to left.

DP algorithm: Start with

JN(xN) = gN(xN),

and go backwards using the recursion

Jk(xk) = min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ],  k = 0, 1, . . . , N − 1,    (2.4.1)

where the expectation is taken with respect to the probability distribution of wk, which may depend on xk and uk.
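As an illustration of the recursion (2.4.1), the following sketch (not from the original notes) runs the backward pass for a small finite model; the inventory-style dynamics, costs, capacity, and demand distribution are hypothetical choices made only to produce a runnable example.

```python
def dp_backward(N, states, controls, disturbances, f, g, gN, prob):
    """Finite-horizon DP: returns J[k][x] and policy mu[k][x] for k = 0,...,N-1.
    f(x,u,w): next state; g(k,x,u,w): stage cost; gN(x): terminal cost;
    prob(w): probability of disturbance w (independent across stages)."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    J[N] = {x: gN(x) for x in states}
    for k in reversed(range(N)):                 # backward pass
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                val = sum(prob(w) * (g(k, x, u, w) + J[k + 1][f(x, u, w)])
                          for w in disturbances)
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu

# Hypothetical inventory example: states 0..4, order up to capacity 4, demand w in {0,1,2}.
S = range(5)
J, mu = dp_backward(
    N=3, states=S,
    controls=lambda x: range(5 - x),                      # cannot exceed capacity 4
    disturbances=[0, 1, 2],
    f=lambda x, u, w: max(min(x + u, 4) - w, 0),          # clipped inventory balance
    g=lambda k, x, u, w: u + 2 * max(w - x - u, 0) + 0.5 * max(x + u - w, 0),
    gN=lambda x: 0.0,
    prob=lambda w: {0: 0.3, 1: 0.4, 2: 0.3}[w],
)
print(J[0][2], mu[0][2])    # optimal expected cost and first-period order for x0 = 2
```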

Proposition 2.4.1 For every initial state x0, the optimal cost J∗(x0) of the basic problem is equal to J0(x0), given by the last step of the DP algorithm. Furthermore, if u∗k = µ∗k(xk) minimizes the RHS of (2.4.1) for each xk and k, the policy π∗ = {µ∗0, . . . , µ∗N−1} is optimal.



Figure 2.4.3: Transition graph for Example 2. Next to each node/state we show the cost of optimally completing the scheduling starting from that state. This is the optimal cost of the corresponding tail subproblem. The optimal cost for the original problem is equal to 10. The optimal schedule is CABD.

Observations:

• Justification: Proof by induction that Jk(xk) is equal to J∗k(xk), defined as the optimal cost of the tail subproblem that starts at time k at state xk.

• All the tail subproblems are solved in addition to the original problem. Observe the intensive computational requirements. The worst-case computational complexity is ∑_{k=0}^{N−1} |Sk| |Uk|, where |Sk| is the size of the state space in period k, and |Uk| is the size of the control space in period k. In particular, note that potentially we could need to search over the whole control space, although we just store the optimal one in each period-state pair.

Proof of Proposition 2.4.1

For this version of the proof, we need the following additional assumptions:

• The disturbance wk takes a finite or countable number of values

• The expected values of all stage costs are finite for every admissible policy π

• The functions Jk(xk) generated by the DP algorithm are finite for all states xk and periods k.

Informal argument

Let πk = {µk, µk+1, . . . , µN−1} denote a tail policy from time k onward.

• Border case: For k = N , define J∗N (xN ) = JN (xN ) = gN (xN ).


• For Jk+1(xk+1) generated by the DP algorithm and the optimal J∗k+1(xk+1), assume that Jk+1(xk+1) = J∗k+1(xk+1). Then

J∗k(xk) = min_{(µk, πk+1)} E_{wk,wk+1,...,wN−1} [ gk(xk, µk(xk), wk) + gN(xN) + ∑_{i=k+1}^{N−1} gi(xi, µi(xi), wi) ]

= min_{µk} E_{wk} [ gk(xk, µk(xk), wk) + min_{πk+1} E_{wk+1,...,wN−1} [ gN(xN) + ∑_{i=k+1}^{N−1} gi(xi, µi(xi), wi) ] ]
   (this is the “informal step”, since we are moving the min inside the E[·])

= min_{µk} E_{wk} [ gk(xk, µk(xk), wk) + J∗k+1(fk(xk, µk(xk), wk)) ]   (by def. of J∗k+1)

= min_{µk} E_{wk} [ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) ]   (by induction hypothesis (IH))

= min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ]

= Jk(xk),

where the second-to-last equality follows from converting a minimization problem over functions µk to a minimization problem over scalars uk. In symbols, for any function F of x and u, we have

min_{µ∈M} F(x, µ(x)) = min_{u∈U(x)} F(x, u),

where M is the set of all functions µ(x) such that µ(x) ∈ U(x) for all x.

A more formal argument

For any admissible policy π = {µ0, . . . , µN−1} and each k = 0, 1, . . . , N − 1, denote

πk = {µk, µk+1, . . . , µN−1}.

For k = 0, 1, . . . , N − 1, let J∗k(xk) be the optimal cost for the (N − k)-stage problem that starts at state xk and time k, and ends at time N ; that is,

J∗k(xk) = min_{πk} E[ gN(xN) + ∑_{i=k}^{N−1} gi(xi, µi(xi), wi) ].

For k = N , we define J∗N(xN) = gN(xN). We will show by backward induction that the functions J∗k are equal to the functions Jk generated by the DP algorithm, so that for k = 0 we get the desired result.

Start by defining, for any ε > 0 and for all k and xk, an admissible control µεk(xk) ∈ Uk(xk) for the DP recursion (2.4.1) such that

Ewk [gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk))] ≤ Jk(xk) + ε (2.4.2)

Because of our former assumptions, the functions Jk(xk) generated by the DP algorithm are well defined and finite for all k and xk ∈ Sk. Let J εk(xk) be the expected cost when using the policy {µεk, . . . , µεN−1}. We will show by induction that for all xk and k, it must hold that

Jk(xk) ≤ J εk(xk) ≤ Jk(xk) + (N − k)ε, (2.4.3)

J∗k (xk) ≤ J εk(xk) ≤ J∗k (xk) + (N − k)ε, (2.4.4)

Jk(xk) = J∗k (xk) (2.4.5)


• For k = N − 1, we have

JN−1(xN−1) = min_{uN−1 ∈ UN−1(xN−1)} E_{wN−1} [ gN−1(xN−1, uN−1, wN−1) + gN(xN) ],

J∗N−1(xN−1) = min_{πN−1} E_{wN−1} [ gN−1(xN−1, µN−1(xN−1), wN−1) + gN(xN) ].

Both minimizations guarantee the LHS inequalities in (2.4.3) and (2.4.4) when comparing against

J εN−1(xN−1) = E_{wN−1} [ gN−1(xN−1, µεN−1(xN−1), wN−1) + gN(xN) ],

with µεN−1(xN−1) ∈ UN−1(xN−1). The RHS inequalities there hold just by the construction in (2.4.2). By taking ε → 0 in equations (2.4.3) and (2.4.4), it is also seen that JN−1(xN−1) = J∗N−1(xN−1).

• Suppose that equations (2.4.3)-(2.4.5) hold for period k + 1. For period k, we have:

J εk(xk) = E_{wk} [ gk(xk, µεk(xk), wk) + J εk+1(fk(xk, µεk(xk), wk)) ]   (by definition of J εk(xk))
≤ E_{wk} [ gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk)) ] + (N − k − 1)ε   (by IH)
≤ Jk(xk) + ε + (N − k − 1)ε   (by equation (2.4.2))
= Jk(xk) + (N − k)ε.

We also have

J εk(xk) = E_{wk} [ gk(xk, µεk(xk), wk) + J εk+1(fk(xk, µεk(xk), wk)) ]   (by definition of J εk(xk))
≥ E_{wk} [ gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk)) ]   (by IH)
≥ min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ]   (by min over all admissible controls)
= Jk(xk).

Combining the preceding two relations, we see that equation (2.4.3) holds.

In addition, for every policy π = {µ0, µ1, . . . , µN−1}, we have

J εk(xk) = E_{wk} [ gk(xk, µεk(xk), wk) + J εk+1(fk(xk, µεk(xk), wk)) ]   (by definition of J εk(xk))
≤ E_{wk} [ gk(xk, µεk(xk), wk) + Jk+1(fk(xk, µεk(xk), wk)) ] + (N − k − 1)ε   (by IH)
≤ min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ] + (N − k)ε   (by (2.4.2))
≤ E_{wk} [ gk(xk, µk(xk), wk) + Jk+1(fk(xk, µk(xk), wk)) ] + (N − k)ε
   (since µk(xk) is an admissible control for period k; note that by IH, Jk+1 is the optimal cost starting from period k + 1)
≤ E_{wk} [ gk(xk, µk(xk), wk) + Jπk+1(fk(xk, µk(xk), wk)) ] + (N − k)ε
   (where πk+1 is an admissible policy starting from period k + 1)
= Jπk(xk) + (N − k)ε   (for πk = (µk, πk+1)),

Since πk is any admissible policy, taking the minimum over πk in the preceding relation, we obtain, for all xk,

J εk(xk) ≤ J∗k (xk) + (N − k)ε.

We also have by the definition of the optimal cost J∗k , for all xk,

J∗k (xk) ≤ J εk(xk).


Combining the preceding two relations, we see that equation (2.4.4) holds for period k. Finally, equation (2.4.5) follows from equations (2.4.3) and (2.4.4) by taking ε → 0, and the induction is complete.

Example 2.4.1 (Linear-quadratic example)

A certain material is passed through a sequence of two ovens (see Figure 2.4.4). Let

• x0: Initial temperature of the material,

• xk, k = 1, 2: Temperature of the material at the exit of Oven k,

• u0: Prevailing temperature in Oven 1,

• u1: Prevailing temperature in Oven 2,


Figure 2.4.4: System dynamics of Example 5.

Consider a system equation

xk+1 = (1− a)xk + auk, k = 0, 1,

where a ∈ (0, 1) is a given scalar. Note that the system equation is linear in the control and the state.

The objective is to get a final temperature x2 close to a given target T , while expending relatively little energy. This is expressed by a total cost function of the form

r(x2 − T)² + u0² + u1²,

where r is a scalar. In this way, we are penalizing quadratically a deviation from the target T . Note that the cost is quadratic in both controls and states.

We can cast this problem into a DP framework by setting N = 2, and a terminal cost g2(x2) = r(x2 − T)², so that the border condition for the algorithm is

J2(x2) = r(x2 − T)².

Proceeding backwards, we have

J1(x1) = min_{u1 ≥ 0} { u1² + J2(x2) }
       = min_{u1 ≥ 0} { u1² + J2((1 − a)x1 + a u1) }
       = min_{u1 ≥ 0} { u1² + r((1 − a)x1 + a u1 − T)² }.    (2.4.6)


This is a quadratic function in u1 that we can solve by setting to zero the derivative with respect to u1. We will get an expression u1 = µ∗1(x1) depending linearly on x1. By substituting back this expression for u1 into (2.4.6), we obtain a closed form expression for J1(x1), which is quadratic in x1.

Proceeding backwards further,

J0(x0) = min_{u0 ≥ 0} { u0² + J1((1 − a)x0 + a u0) }.

Since J1(·) is quadratic in its argument, the expression being minimized is quadratic in u0. We minimize with respect to u0 by setting the corresponding derivative to zero; the result will depend on x0. The optimal temperature of the first oven will be a function µ∗0(x0). The optimal cost is obtained by substituting this expression in the formula for J0.
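The two backward steps of this example can be checked symbolically. The sketch below is not part of the original notes; it treats the minimizations as unconstrained (matching the derivative-equal-to-zero argument above), and the numeric values substituted at the end for a, r, T, and x0 are arbitrary.

```python
import sympy as sp

a, r, T, x0, x1, u0, u1 = sp.symbols("a r T x0 x1 u0 u1", real=True)

# Stage 1: J1(x1) = min_{u1} u1^2 + r((1-a)x1 + a u1 - T)^2
expr1 = u1**2 + r * ((1 - a) * x1 + a * u1 - T) ** 2
u1_star = sp.solve(sp.diff(expr1, u1), u1)[0]          # mu_1*(x1), linear in x1
J1 = sp.simplify(expr1.subs(u1, u1_star))              # quadratic in x1

# Stage 0: J0(x0) = min_{u0} u0^2 + J1((1-a)x0 + a u0)
expr0 = u0**2 + J1.subs(x1, (1 - a) * x0 + a * u0)
u0_star = sp.solve(sp.diff(expr0, u0), u0)[0]          # mu_0*(x0), linear in x0
J0 = sp.simplify(expr0.subs(u0, u0_star))

print(sp.simplify(u1_star))                            # optimal second-oven temperature
print(sp.simplify(J0.subs({a: sp.Rational(1, 2), r: 1, T: 100, x0: 20})))
```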

2.4.1 Exercises

Exercise 2.4.1 Consider the system

xk+1 = xk + uk + wk, k = 0, 1, 2, 3,

with initial state x0 = 5, and cost function

∑_{k=0}^{3} (xk² + uk²).

Apply the DP algorithm for the following three cases:

(a) The control constraint set is Uk(xk) = {u | 0 ≤ xk + u ≤ 5, u integer}, for all xk and k, and the disturbance verifies wk = 0 for all k.

(b) The control constraint and the disturbance wk are as in part (a), but there is in addition a constraint x4 = 5 on the final state.

Hint: For this problem you need to define a state space for x4 that consists of just the value x4 = 5, and to redefine U3(x3). Alternatively, you may use a terminal cost g4(x4) equal to a very large number for x4 ≠ 5.

(c) The control constraint is as in part (a) and the disturbance wk takes the values −1 and 1 with probability 1/2 each, for all xk and uk, except if xk + uk is equal to 0 or 5, in which case wk = 0 with probability 1.

Note: In this exercise (and in the exercises below), when the output of the DP algorithm is requested, submit the tables describing state xk, optimal cost Jk(xk), and optimal control µk(xk), for periods k = 0, 1, . . . , N − 1 (e.g., in this case, N = 4).

Exercise 2.4.2 Suppose that we have a machine that is either running or is broken down. If it runs throughout one week, it makes a gross profit of $100. If it fails during the week, gross profit is zero. If it is running at the start of the week and we perform preventive maintenance, the probability that it will fail during the week is 0.4. If we do not perform such maintenance, the probability of failure is 0.7. However, maintenance will cost $20.


When the machine is broken down at the start of the week, it may either be repaired at a cost of $40, in which case it will fail during the week with a probability of 0.4, or it may be replaced at a cost of $150 by a new machine that is guaranteed to run its first week of operation.

Find the optimal repair, replacement, and maintenance policy that maximizes total profit over four weeks, assuming a new machine at the start of the first week.

Exercise 2.4.3 In the framework of the basic problem, consider the case where the cost is of the form

E_{wk, k=0,1,...,N−1} [ α^N gN(xN) + ∑_{k=0}^{N−1} α^k gk(xk, uk, wk) ],

where α is a discount factor with 0 < α < 1. Show that an alternate form of the DP algorithm is given by

VN(xN) = gN(xN),
Vk(xk) = min_{uk ∈ Uk(xk)} E_{wk} [ gk(xk, uk, wk) + α Vk+1(fk(xk, uk, wk)) ].

Exercise 2.4.4 In the framework of the basic problem, consider the case where the system evolution terminates at time i when a given value w of the disturbance at time i occurs, or when a termination decision ui is made by the controller. If termination occurs at time i, the resulting cost is

T + ∑_{k=0}^{i} gk(xk, uk, wk),

where T is a termination cost. If the process has not terminated up to the final time N , the resulting cost is gN(xN) + ∑_{k=0}^{N−1} gk(xk, uk, wk). Reformulate the problem into the framework of the basic problem.

Hint: Augment the state space with a special termination state.

Exercise 2.4.5 For the Stock Option problem discussed in class, where the time index runs backwards (i.e., period 0 is the terminal stage), prove the following statements:

1. The optimal cost function Jn(s) is increasing and continuous in s.

2. The optimal cost function Jn(s) is increasing in n.

3. If µF ≥ 0 and we do not exercise the option if expected profit is zero, then the option is never exercised before maturity.

Exercise 2.4.6 Consider a device consisting of N stages connected in series, where each stage consists of a particular component. The components are subject to failure, and to increase the reliability of the device duplicate components are provided. For j = 1, 2, . . . , N , let (1 + mj) be the number of components for the jth stage (one mandatory component, and mj backup ones), let pj(mj) be the probability of successful operation when (1 + mj) components are used, and let cj denote the cost of a single backup component at the jth stage. Formulate in terms of DP the problem of finding the number of components at each stage that maximizes the reliability of the device expressed by the product

p1(m1) · p2(m2) · · · pN (mN ),

subject to the cost constraint ∑_{j=1}^{N} cj mj ≤ A, where A > 0 is given.


Exercise 2.4.7 (Monotonicity Property of DP) An evident, yet very important property of the DP algorithm is that if the terminal cost gN is changed to a uniformly larger cost ḡN (i.e., gN(xN) ≤ ḡN(xN), ∀xN), then clearly the last stage cost-to-go will be uniformly increased (i.e., JN−1(xN−1) ≤ J̄N−1(xN−1), where J̄N−1 is computed from ḡN).

More generally, given two functions Jk+1 and J̄k+1 with Jk+1(xk+1) ≤ J̄k+1(xk+1) for all xk+1, we have, for all xk and uk ∈ Uk(xk),

E_{wk} [ gk(xk, uk, wk) + Jk+1(fk(xk, uk, wk)) ] ≤ E_{wk} [ gk(xk, uk, wk) + J̄k+1(fk(xk, uk, wk)) ].

Suppose now that in the basic problem the system and cost are time invariant; that is, Sk ≡ S, Ck ≡ C, Dk ≡ D, fk ≡ f, Uk ≡ U, Pk ≡ P, and gk ≡ g. Show that if in the DP algorithm we have JN−1(x) ≤ JN(x) for all x ∈ S, then

Jk(x) ≤ Jk+1(x), for all x ∈ S and k.

Similarly, if we have JN−1(x) ≥ JN (x) for all x ∈ S, then

Jk(x) ≥ Jk+1(x), for all x ∈ S and k.

2.5 Linear-Quadratic Regulator

2.5.1 Preliminaries: Review of linear algebra and quadratic forms

We will be using some results of linear algebra. Here is a summary of them:

1. Given a matrix A, we let A′ be its transpose. It holds that (AB)′ = B′A′, and (An)′ = (A′)n.

2. The rank of a matrix A ∈ Rm×n is equal to the maximum number of linearly independent row (column) vectors. The matrix is said to be full rank if rank(A) = min{m, n}. A square matrix is of full rank if and only if it is nonsingular.

3. rank(A) = rank(A′).

4. Given a matrix A ∈ Rn×n, the determinant of the matrix γI − A, where I is the n × n identity matrix and γ is a scalar, is an nth degree polynomial. The n roots of this polynomial are called the eigenvalues of A. Thus, γ is an eigenvalue of A if and only if the matrix γI − A is singular (i.e., it does not have an inverse), or equivalently, if there exists a vector v ≠ 0 such that Av = γv. Such a vector v is called an eigenvector corresponding to γ.

The eigenvalues and eigenvectors of A can be complex even if A is real.

A matrix A is singular if and only if it has an eigenvalue that is equal to zero.

If A is nonsingular, then the eigenvalues of A−1 are the reciprocals of the eigenvalues of A.

The eigenvalues of A and A′ coincide.

5. A square symmetric n × n matrix A is said to be positive semidefinite if x′Ax ≥ 0 for all x ∈ Rn. It is said to be positive definite if x′Ax > 0 for all x ∈ Rn, x ≠ 0. We will write A ≥ 0 and A > 0 to denote positive semidefiniteness and definiteness, respectively.


6. If A is an n × n positive semidef. symmetric matrix and C is an m × n matrix, then the matrix CAC′ is positive semidef. symmetric. If A is positive def. symmetric and C has rank m (equivalently, m ≤ n and C has full rank), then CAC′ is positive def. symmetric.

7. An n × n positive def. matrix A can be written as CC′, where C is a square invertible matrix. If A is positive semidef. symmetric and its rank is m, then it can be written as CC′, where C is an n × m matrix of full rank.

8. The expected value of a random vector x = (x1, . . . , xn) is the vector

E[x] = (E[x1], . . . , E[xn]).

The covariance matrix of a random vector x with expected value x̄ = E[x] is defined to be the n × n positive semidefinite matrix whose (i, j) entry is E[(xi − x̄i)(xj − x̄j)]; in particular, its diagonal entries are E[(xi − x̄i)²].

9. Let f : Rn → R be a quadratic form

f(x) = ½ x′Qx + b′x,

where Q is a symmetric n × n matrix and b ∈ Rn. Its gradient is given by

∇f(x) = Qx + b.

The function f is convex if and only if Q is positive semidefinite. If Q is positive definite, then f is convex and Q is invertible, so a vector x∗ minimizes f if and only if

∇f(x∗) = Qx∗ + b = 0,

or equivalently, x∗ = −Q−1 b.

2.5.2 Problem setup

System equation: xk+1 = Akxk +Bkuk + wk [Linear in both state and control.]

Quadratic cost:

E_{w0,...,wN−1} [ x′N QN xN + ∑_{k=0}^{N−1} (x′k Qk xk + u′k Rk uk) ],

where:

• Qk ≥ 0 are square, symmetric, positive semidefinite matrices with appropriate dimension,

• Rk > 0 are square, symmetric, positive definite matrices with appropriate dimension,

• Disturbances wk are independent with E[wk] = 0, and do not depend on xk nor on uk (the case E[wk] ≠ 0 will be discussed later, in Section 2.5.7),

• Controls uk are unconstrained.


DP Algorithm:

JN(xN) = x′N QN xN    (2.5.1)

Jk(xk) = min_{uk} E_{wk} [ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) ]    (2.5.2)

Intuition: The purpose of this DP is to bring the state closer to xk = 0 as soon as possible. Any deviation from zero is penalized quadratically.

2.5.3 Properties

• Jk(xk) is quadratic in xk

• Optimal policy {µ∗0, µ∗1, . . . , µ∗N−1} is linear, i.e., µ∗k(xk) = Lk xk

• Several variants of the problem admit a similar treatment, as follows:

Variant 1: Nonzero mean wk.

Variant 2: Shifted problem, i.e., set the target at a given trajectory x̄k rather than at zero:

E[ (xN − x̄N)′ QN (xN − x̄N) + ∑_{k=0}^{N−1} ( (xk − x̄k)′ Qk (xk − x̄k) + u′k Rk uk ) ].

2.5.4 Derivation

By induction, we want to verify that µ∗k(xk) = Lk xk and Jk(xk) = x′k Kk xk + constant, where the Lk are gain matrices⁵ given by

Lk = −(B′kKk+1Bk +Rk)−1B′kKk+1Ak,

and where Kk are symmetric positive semidefinite matrices given by

KN = QN

Kk = A′k(Kk+1 −Kk+1Bk(B′kKk+1Bk +Rk)−1B′kKk+1)Ak +Qk. (2.5.3)

The above equation is called the discrete time Riccati equation. Just like DP, it starts at the terminal time N and proceeds backwards.

We will show that the optimal policy (but not the optimal cost) is the same as when wk is replaced by E[wk] = 0 (a property known as certainty equivalence).

The induction argument proceeds as follows. Write equation (2.5.2) for N − 1:

JN−1(xN−1) = min_{uN−1} E_{wN−1} [ x′N−1 QN−1 xN−1 + u′N−1 RN−1 uN−1
   + (AN−1 xN−1 + BN−1 uN−1 + wN−1)′ QN (AN−1 xN−1 + BN−1 uN−1 + wN−1) ]

= x′N−1 QN−1 xN−1 + min_{uN−1} { u′N−1 RN−1 uN−1 + u′N−1 B′N−1 QN BN−1 uN−1
   + 2 x′N−1 A′N−1 QN BN−1 uN−1 + x′N−1 A′N−1 QN AN−1 xN−1 + E[w′N−1 QN wN−1] },    (2.5.4)

⁵The idea is that Lk represents how much we gain in our path towards the target zero.


where in the expansion of the last term in the second line we are using the fact that, since E[wN−1] = 0, then E[w′N−1 QN (AN−1 xN−1 + BN−1 uN−1)] = 0.

By differentiating the equation w.r.t. uN−1, and setting the derivative equal to 0, we get

(RN−1 + B′N−1 QN BN−1) uN−1 = −B′N−1 QN AN−1 xN−1

(RN−1 is positive definite and B′N−1 QN BN−1 is positive semidefinite, so their sum is positive definite and hence invertible), leading to

u∗N−1 = −(RN−1 + B′N−1 QN BN−1)−1 B′N−1 QN AN−1 xN−1 = LN−1 xN−1.

By substitution into (2.5.4), we get

JN−1(xN−1) = x′N−1 KN−1 xN−1 + E[w′N−1 QN wN−1],    (2.5.5)

where the last term is a nonnegative constant and KN−1 = A′N−1 (QN − QN BN−1 (B′N−1 QN BN−1 + RN−1)−1 B′N−1 QN) AN−1 + QN−1.

Facts:

• The matrix KN−1 is symmetric, since KN−1 = K ′N−1.

• Claim: KN−1 ≥ 0 (we need this result to prove that the matrix inverted in the expression for LN−2 exists). Proof: From (2.5.5) we have

x′N−1 KN−1 xN−1 = JN−1(xN−1) − E[w′N−1 QN wN−1].    (2.5.6)

So,

x′ KN−1 x = x′ QN−1 x + min_u { u′ RN−1 u + (AN−1 x + BN−1 u)′ QN (AN−1 x + BN−1 u) },

where QN−1 ≥ 0, RN−1 > 0 and QN ≥ 0. Note that the E[w′N−1 QN wN−1] in JN−1(xN−1) cancels out with the one in (2.5.6). Thus, the expression in brackets is nonnegative for every u. The minimization over u preserves nonnegativity, showing that KN−1 is also positive semidefinite.

In conclusion, since

JN−1(xN−1) = x′N−1 KN−1 xN−1 + constant

is a positive semidefinite quadratic function (plus an inconsequential constant term), we may proceed backward and obtain from the DP equation (2.5.2) the optimal law for stage N − 2. As earlier, we show that JN−2 is positive semidef., and by proceeding sequentially we obtain

u∗k(xk) = Lk xk,

where

Lk = −(B′k Kk+1 Bk + Rk)−1 B′k Kk+1 Ak,

KN = QN ,


Kk = (A′k(Kk+1 −Kk+1Bk(B′kKk+1Bk +Rk)−1B′kKk+1)Ak +Qk).

Just like DP, this algorithm starts at the terminal time N and proceeds backwards. The optimal cost is given by

J0(x0) = x′0 K0 x0 + ∑_{k=0}^{N−1} E[ w′k Kk+1 wk ].

The control u∗k and the system equation lead to the linear feedback structure represented in Figure 2.5.1.


Figure 2.5.1: Linear feedback structure of the optimal controller for the linear-quadratic problem.
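The backward recursion for Kk and Lk translates directly into code. The sketch below is not from the original notes: the system matrices are arbitrary placeholder data, and the closed-loop simulation at the end simply illustrates the feedback structure of Figure 2.5.1.

```python
import numpy as np

def lqr_backward(A, B, Q, R, QN, N):
    """Finite-horizon LQR: gain matrices L_k and cost matrices K_k
    for the time-invariant system x_{k+1} = A x_k + B u_k + w_k."""
    K = [None] * (N + 1)
    L = [None] * N
    K[N] = QN
    for k in reversed(range(N)):
        S = B.T @ K[k + 1] @ B + R                    # positive definite, hence invertible
        L[k] = -np.linalg.solve(S, B.T @ K[k + 1] @ A)
        K[k] = A.T @ (K[k + 1] - K[k + 1] @ B @ np.linalg.solve(S, B.T @ K[k + 1])) @ A + Q
    return K, L

# Placeholder data (two states, one control).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.array([[0.01]]); N = 50
K, L = lqr_backward(A, B, Q, R, QN=Q, N=N)

# Closed-loop simulation: apply u_k = L_k x_k, then the zero-mean noise w_k is realized.
x = np.array([5.0, 0.0])
for k in range(N):
    u = L[k] @ x
    x = A @ x + B @ u + 0.01 * rng.standard_normal(2)
print(np.round(x, 3))          # the state is driven towards zero
print(np.round(K[0], 2))       # x0' K[0] x0 plus the noise terms gives the optimal cost
```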

2.5.5 Asymptotic behavior of the Riccati equation

The objective of this section is to prove the following result: If the matrices Ak, Bk, Qk and Rk are constant and equal to A, B, Q, R, respectively, then Kk −→ K as k → −∞ (i.e., when we have many periods ahead), where K satisfies the algebraic Riccati equation:

K = A′(K − K B (B′ K B + R)−1 B′ K) A + Q,

where K ≥ 0 is the unique solution (within the class of positive semidefinite matrices). This property indicates that for the system

xk+1 = A xk + B uk + wk, k = 0, 1, 2, . . . , N − 1,

and a large N , one can approximate the control µ∗k(xk) = Lk xk by the steady state control

µ∗(x) = L x,

where

L = −(B′ K B + R)−1 B′ K A.

Before proving the above result, we need to introduce three notions: controllability, observability, and stability.

Definition 2.5.1 A pair of matrices (A,B), where A ∈ Rn×n and B ∈ Rn×m, is said to be controllable if the n × (nm) matrix [B, AB, A²B, . . . , An−1B] has full rank.

Definition 2.5.2 A pair (A,C), A ∈ Rn×n, C ∈ Rm×n, is said to be observable if the pair (A′, C′) is controllable.


The next two claims provide intuition for the previous definitions:

Claim: If the pair (A,B) is controllable, then for any initial state x0 there exists a sequence of control vectors u0, u1, . . . , un−1 that forces the state xn of the system xk+1 = A xk + B uk to be equal to zero at time n.

Proof: By successively applying the equation xk+1 = A xk + B uk, for k = n − 1, n − 2, . . . , 0, we obtain

xn = An x0 + B un−1 + A B un−2 + · · · + An−1 B u0,

or equivalently,

xn − An x0 = (B, AB, . . . , An−1B)(un−1, un−2, . . . , u1, u0)′.    (2.5.7)

Since (A,B) is controllable, (B, AB, . . . , An−1B) has full rank and its columns span the whole space Rn. Hence, we can find controls (un−1, un−2, . . . , u1, u0) such that

(B, AB, . . . , An−1B)(un−1, un−2, . . . , u1, u0)′ = v,

for any vector v ∈ Rn. In particular, by setting v = −An x0, we obtain xn = 0 in equation (2.5.7).

In words: under controllable matrices (A,B), the system xk+1 = A xk + B uk in Rn can be driven to the zero vector in exactly n steps.
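The argument is constructive, and it is easy to test numerically. The sketch below (not part of the original notes) builds the controllability matrix for a placeholder pair (A, B), solves the linear system (2.5.7) with v = −A^n x0, and verifies that applying the resulting controls drives the state to zero at time n.

```python
import numpy as np

def controllability_matrix(A, B):
    n = A.shape[0]
    # [B, AB, A^2 B, ..., A^{n-1} B], an n x (n*m) matrix.
    return np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])

# Placeholder system (n = 3 states, m = 1 control).
A = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.1, -0.2, 0.5]])
B = np.array([[0.0], [0.0], [1.0]])
n, m = A.shape[0], B.shape[1]

C = controllability_matrix(A, B)
assert np.linalg.matrix_rank(C) == n, "(A, B) is not controllable"

# Solve C @ (u_{n-1}, ..., u_0)' = -A^n x0, then roll the system forward.
x0 = np.array([1.0, -2.0, 0.5])
u_stacked = np.linalg.solve(C, -np.linalg.matrix_power(A, n) @ x0)   # square system since m = 1
u_seq = u_stacked.reshape(n, m)[::-1]          # reorder to u_0, u_1, ..., u_{n-1}
x = x0.copy()
for u in u_seq:
    x = A @ x + B @ u
print(np.round(x, 10))                          # [0. 0. 0.]
```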

Claim: Suppose that (A,C) is observable (i.e., (A′, C′) is controllable). In the context of estimation problems, given measurements z0, z1, . . . , zn−1 of the form

zk = C xk,  with zk ∈ Rm, C ∈ Rm×n, xk ∈ Rn,

it is possible to uniquely infer the initial state x0 of the system xk+1 = Axk.

Proof: In view of the relation

z0 = C x0
x1 = A x0
z1 = C x1 = C A x0
x2 = A x1 = A² x0
z2 = C x2 = C A² x0
...
zn−1 = C xn−1 = C An−1 x0,

or in matrix form, in view of

(z0, z1, . . . , zn−1)′ = (C,CA, . . . , CAn−1)′x0, (2.5.8)

where (C,CA, . . . , CAn−1) has full rank n, there is a unique x0 that satisfies (2.5.8).

To get the previous result we are using the following: If (A,C) is observable, then (A′, C′) is controllable. So, if we denote

α := (C′, A′C′, (A′)²C′, . . . , (A′)n−1C′) = (C′, (CA)′, (CA²)′, . . . , (CAn−1)′),


then α has full rank, and therefore α′ has full rank, where α′ is the matrix obtained by stacking C, CA, CA², . . . , CAn−1 on top of each other, which completes the argument.

In words: under observable matrices (A,C), the system xk+1 = A xk allows us to infer the initial state from a sequence of observations z0, z1, . . . , zn−1 given by zk = C xk.

Stability: The concept of stability refers to the fact that, in the absence of random disturbance, the dynamics of the system driven by the control µ(x) = L x,

xk+1 = A xk + B uk = (A + B L) xk, k = 0, 1, . . . ,

bring the state towards zero as k → ∞. For any x0, since xk = (A + B L)^k x0, it follows that the closed-loop system is stable if and only if (A + B L)^k → 0, or equivalently, if and only if the eigenvalues of the matrix (A + B L) are strictly within the unit circle of the complex plane.

Assume a time-independent system and cost per stage, and some technical assumptions: controllability of (A,B) and observability of (A,C), where Q = C′C. Then the Riccati equation (2.5.3) converges, lim_{k→−∞} Kk = K, where K is positive definite and is the unique (within the class of positive semidef. matrices) solution of the algebraic Riccati equation

K = A′(K −KB(B′KB +R)−1B′K)A+Q.

The following proposition formalizes this result. To simplify notation, we reverse the time indexing of the Riccati equation. Thus, Pk corresponds to KN−k in (2.5.3).

Proposition 2.5.1 Let A be an n × n matrix, B be an n × m matrix, Q be an n × n positive semidef. symmetric matrix, and R be an m × m positive definite symmetric matrix. Consider the discrete-time Riccati equation

Pk+1 = A′(Pk − Pk B (B′ Pk B + R)−1 B′ Pk) A + Q, k = 0, 1, . . . ,    (2.5.9)

where the initial matrix P0 is an arbitrary positive semidef. symmetric matrix. Assume that the pair (A,B) is controllable. Assume also that Q may be written as Q = C′C, where the pair (A,C) is observable. Then,

(a) There exists a positive def. symmetric matrix P such that for every positive semidef. symmetric initial matrix P0 we have limk→∞ Pk = P . Furthermore, P is the unique solution of the algebraic matrix equation

P = A′(P − P B (B′ P B + R)−1 B′ P) A + Q

within the class of positive semidef. symmetric matrices.


(b) The corresponding closed-loop system is stable; that is, the eigenvalues of the matrix

D = A+BL,

where L = −(B′ P B + R)−1 B′ P A,

are strictly within the unit circle of the complex plane.

Observations:

• The implication of the observability assumption in the proposition is that, in the absence of control, if the state cost per stage x′k Q xk → 0 as k → ∞, or equivalently C xk → 0, then also xk → 0.

• We could replace the statement in Proposition 2.5.1, part (b), by

(A + B L)^k → 0 as k → ∞.

Since xk = (A + B L)^k x0, it then follows that xk → 0 as k → ∞.

Graphical proof of Proposition 2.5.1 for the scalar case

We provide here a proof for a limited version of the statement in the proposition, where we assume a one-dimensional state and control. For A ≠ 0, B ≠ 0, Q > 0, and R > 0, the Riccati equation in (2.5.9) is given by

Pk+1 = A² ( Pk − B² Pk² / (B² Pk + R) ) + Q,

which can be equivalently written as

Pk+1 = F(Pk), where F(P) = A² R P / (B² P + R) + Q.    (2.5.10)

Figure 2.5.2 illustrates this recursion.

Facts about Figure 2.5.2:

• F is concave and monotonically increasing in the range (−R/B², ∞).

• The equation P = F(P) has one solution P∗ > 0 and one solution P̃ < 0.

• The Riccati iteration Pk+1 = F(Pk) converges to P∗ > 0 starting anywhere in (P̃, ∞).
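These facts are easy to verify numerically. The sketch below (not in the original notes) iterates Pk+1 = F(Pk) for arbitrary scalar values of A, B, Q, R and several starting points, and compares the limit with the positive root of P = F(P).

```python
import numpy as np

A, B, Q, R = 1.2, 1.0, 1.0, 2.0     # arbitrary scalars with Q > 0, R > 0

def F(P):
    # Scalar Riccati map (2.5.10): F(P) = A^2 R P / (B^2 P + R) + Q.
    return A**2 * R * P / (B**2 * P + R) + Q

# P = F(P) is equivalent to the quadratic  B^2 P^2 + (R - A^2 R - Q B^2) P - Q R = 0;
# keep its positive root P*.
coeffs = [B**2, R - A**2 * R - Q * B**2, -Q * R]
P_star = max(np.roots(coeffs).real)
print("P* =", round(P_star, 6))

for P0 in (0.0, 1.0, 50.0):          # any nonnegative starting point
    P = P0
    for _ in range(200):
        P = F(P)
    print(P0, "->", round(P, 6))     # all iterates converge to P*
```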

Technical note: Going back to the matrix case: If controllability of (A,B) and observability of (A,C) are replaced by two weaker assumptions:

• Stabilizability, i.e., there exists a feedback gain matrix G ∈ Rm×n such that the closed-loop system xk+1 = (A + BG) xk is stable.

• Detectability, i.e., A is such that if uk → 0 and C xk → 0, then it follows that xk → 0, and that xk+1 = (A + BL) xk is stable.

Then, the conclusions of the proposition hold with the exception of positive definiteness of the limit matrix P , which can now only be guaranteed to be positive semidefinite.



Figure 2.5.2: Graphical illustration of the recursion in equation (2.5.10). Note that F(0) = Q, lim_{P→∞} F(P) = A²R/B² + Q (horizontal asymptote), and lim_{P→−R/B²} F(P) = −∞ (vertical asymptote).

2.5.6 Random system matrices

Setting:

• Suppose that A0, B0, . . . , AN−1, BN−1 are not known but rather are independent random matrices that are also independent of w0, . . . , wN−1.

• Assume that their probability distributions are given and have finite variance.

• To cast this problem into the basic DP framework, define the disturbance at stage k to be the triple (Ak, Bk, wk).

The DP algorithm is:

JN(xN) = x′N QN xN,

Jk(xk) = min_{uk} E_{Ak,Bk,wk} [ x′k Qk xk + u′k Rk uk + Jk+1(Ak xk + Bk uk + wk) ].

In this case, similar calculations to those for the deterministic matrices give:

µ∗k(xk) = Lkxk,

where the gain matrices are given by

Lk = −(Rk + E[B′kKk+1Bk])−1E[B′kKk+1Ak],

and where the matrices Kk are given by the generalized Riccati equation

KN = QN ,

Kk = E[A′kKk+1Ak]− E[A′kKk+1Bk](Rk + E[B′kKk+1Bk])−1E[B′kKk+1Ak] +Qk. (2.5.11)

In the case of a stationary system and constant matrices Qk and Rk, it is not necessarily true that the above equation converges to a steady-state solution. This is illustrated in Figure 2.5.3.



Figure 2.5.3: Graphical illustration of the asymptotic behavior of the generalized Riccati equation (2.5.11) in the case of a scalar stationary system (one-dimensional state and control).

In the case of a scalar stationary system (one-dimensional state and control), using Pk in place of KN−k, this equation is written as

Pk+1 = F(Pk),

where the function F is given by

F(P) = E[A²] R P / (E[B²] P + R) + Q + T P² / (E[B²] P + R),

and where

T = E[A²] E[B²] − (E[A])² (E[B])².

If T = 0, as in the case where A and B are not random, the Riccati equation becomes identical to the one of Figure 2.5.2 and converges to a steady state. Convergence also occurs when T has a small positive value. However, as illustrated in the figure, for T large enough, the graph of the function F and the 45-degree line that passes through the origin do not intersect at a positive value of P , and the Riccati equation diverges to infinity.

Interpretation: T is a measure of the uncertainty in the system. If there is a lot of uncertainty, optimization over a long horizon is meaningless. This phenomenon has been called the uncertainty threshold principle.
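The threshold effect can be reproduced by iterating the scalar map. The sketch below is not part of the original notes; the scalar values of E[A²], E[B²], Q, R, and the chosen values of T are arbitrary, but they show the iterates settling down for small T and blowing up once T is large enough.

```python
EA2, EB2, Q, R = 1.5, 1.0, 1.0, 1.0      # arbitrary E[A^2], E[B^2], Q, R

def F(P, T):
    # Generalized scalar Riccati map (equation (2.5.11) in scalar form).
    return EA2 * R * P / (EB2 * P + R) + Q + T * P**2 / (EB2 * P + R)

for T in (0.0, 0.5, 1.5):
    P = Q                                 # start the iteration at P0 = Q
    for _ in range(100):
        P = F(P, T)
        if P > 1e12:                      # treat as divergence
            break
    print(f"T = {T}: P after at most 100 iterations = {P:.4g}")
```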

2.5.7 On certainty equivalence

Consider the optimization problem

min_u E_w[ (a x + b u + w)² ],

where a, b are scalars, x is known, and w is random. We have

Ew[(ax+ bu+ w)2] = E[(ax+ bu)2 + w2 + 2(ax+ bu)w]

= (ax+ bu)2 + 2(ax+ bu)E[w] + E[w2]


Taking the derivative with respect to u gives

2(a x + b u) b + 2 b E[w] = 0,

and hence the minimizer is

u∗ = −(a/b) x − (1/b) E[w].

Observe that u∗ depends on w only through the mean E[w]. In particular, the result of the optimization problem is the same as for the corresponding deterministic problem where w is replaced by E[w]. This property is called the certainty equivalence principle.

In particular,

• For example, when Ak and Bk are known, certainty equivalence holds (the optimal control is still linear in the state xk).

• When Ak and Bk are random, certainty equivalence does not hold.
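As a quick sanity check of the scalar computation above (not part of the original notes), the sketch below minimizes a sampled version of E[(ax + bu + w)²] over a grid of controls for an arbitrary nonzero-mean noise distribution, and compares the result with the formula u∗ = −(a/b)x − (1/b)E[w].

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, x = 0.8, 2.0, 3.0
w = rng.exponential(scale=1.5, size=20_000) - 0.5        # arbitrary noise with nonzero mean

def expected_cost(u):
    # Monte Carlo estimate of E[(a x + b u + w)^2] using the fixed sample of w's.
    return np.mean((a * x + b * u + w) ** 2)

grid = np.linspace(-5.0, 5.0, 401)
u_numeric = grid[np.argmin([expected_cost(u) for u in grid])]
u_formula = -(a / b) * x - w.mean() / b                  # certainty-equivalent control
print(round(float(u_numeric), 3), round(float(u_formula), 3))   # should agree closely
```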

2.5.8 Exercises

Exercise 2.5.1 Consider a linear-quadratic problem where Ak, Bk are known, for the case where at the beginning of period k we have a forecast yk ∈ {1, 2, . . . , n} consisting of an accurate prediction that wk will be selected in accordance with a particular probability distribution Pk|yk . The vectors wk need not have zero mean under the distribution Pk|yk . Show that the optimal control law is of the form

uk(xk, yk) = −(B′k Kk+1 Bk + Rk)−1 B′k Kk+1 (Ak xk + E[wk | yk]) + αk,

where the matrices Kk are given by the discrete time Riccati equation, and αk are appropriate vectors.

Hint: System equation: xk+1 = Ak xk + Bk uk + wk, k = 0, 1, . . . , N − 1,

Cost = E_{w0,...,wN−1} [ x′N QN xN + ∑_{k=0}^{N−1} (x′k Qk xk + u′k Rk uk) ].

Let

yk = Forecast available at the beginning of period k

Pk|yk = p.d.f. of wk given yk

pkyk = a priori p.d.f. of yk at stage k

We have the following DP algorithm:

JN(xN, yN) = x′N QN xN,

Jk(xk, yk) = min_{uk} E_{wk} [ x′k Qk xk + u′k Rk uk + ∑_{i=1}^{n} p^{k+1}_i Jk+1(xk+1, i) | yk ],

where the noise wk is distributed according to Pk|yk .


Prove the following result by induction. The control u∗k(xk, yk) should be derived on the way.

Proposition: Under the conditions of the problem:

Jk(xk, yk) = x′kKkxk + x′kbk(yk) + ck(yk), k = 0, 1, . . . , N,

where bk(yk) is an n-dimensional vector, ck(yk) is a scalar, and Kk is generated by the discrete time Riccati equation.

Exercise 2.5.2 Consider a scalar linear system

xk+1 = akxk + bkuk + wk, k = 0, 1, . . . , N − 1,

where ak, bk ∈ R, and each wk is a Gaussian random variable with zero mean and variance σ². We assume no control constraints and independent disturbances.

1. Show that the control law {µ∗0, µ∗1, . . . , µ∗N−1} that minimizes the cost function

E[ exp( x²N + ∑_{k=0}^{N−1} (x²k + r u²k) ) ],  r > 0,

is linear in the state variable, assuming that the optimal cost is finite for every x0.

2. Show by example that the Gaussian assumption is essential for the result to hold.

Hint 1: Note from integral tables that

∫_{−∞}^{+∞} e^{−(a x² + b x + c)} dx = √(π/a) e^{(b² − 4ac)/(4a)}, for a > 0.

Let w be a normal random variable with zero mean and variance σ² < 1/(2β). Using this definite integral, prove that

E[ e^{β(a+w)²} ] = (1/√(1 − 2βσ²)) exp( βa² / (1 − 2βσ²) ).

Then, prove that if the DP algorithm has a finite minimizing value at each step, then

JN(xN) = e^{x²N},
Jk(xk) = αk e^{βk x²k}, for constants αk, βk > 0, k = 0, 1, . . . , N − 1.

Hint 2: In particular for wN−1, consider the discrete distribution

P(wN−1 = ξ) = 1/4 if |ξ| = 1, and 1/2 if ξ = 0.

Find a functional form for JN−1(xN−1), and check that u∗N−1 ≠ γN−1 xN−1, for a constant γN−1.

2.6 Modular functions and monotone policies

Now we go back to the basic DP setting on problems with perfect state information. We will identify conditions on a parameter θ (e.g., θ could be related to the state of a system) under which the optimal action D∗(θ) varies monotonically with it. We start with some technical definitions and relevant properties.


2.6.1 Lattices

Definition 2.6.1 Given two points x = (x1, . . . , xn) and y = (y1, . . . , yn) in Rn, we define

• Meet of x and y: x ∧ y = (min{x1, y1}, . . . , min{xn, yn}),

• Join of x and y: x ∨ y = (max{x1, y1}, . . . , max{xn, yn}).

Then x ∧ y ≤ x ≤ x ∨ y.

Definition 2.6.2 A set X ⊂ Rn is said to be a sublattice of Rn if ∀x, y ∈ X, x ∧ y ∈ X and x ∨ y ∈ X.

Examples:

• I = {(x, y) ∈ R² | 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} is a sublattice.

• H = {(x, y) ∈ R² | x + y = 1} is not a sublattice, because, for example, (1, 0) and (0, 1) are in H, but neither (0, 0) nor (1, 1) is in H.

Definition 2.6.3 A point x∗ ∈ X is said to be a greatest element of a sublattice X if x∗ ≥ x, ∀x ∈ X. A point x∗ ∈ X is said to be a least element of a sublattice X if x∗ ≤ x, ∀x ∈ X.

Theorem 2.6.1 Suppose X ≠ ∅ and X is a compact (i.e., closed and bounded) sublattice of Rn. Then X has a least and a greatest element.

2.6.2 Supermodularity and increasing differences

Let S ⊂ Rn, Θ ⊂ Rl. Suppose that both S and Θ are sublattices.

Definition 2.6.4 A function f : S × Θ → R is said to be supermodular in (x, θ) if for all z = (x, θ) and z′ = (x′, θ′) in S × Θ:

f(z) + f(z′) ≤ f(z ∨ z′) + f(z ∧ z′).

Similarly, f is submodular if

f(z) + f(z′) ≥ f(z ∨ z′) + f(z ∧ z′).

Example: Let S = Θ = R+, and let f : S × Θ → R be given by f(x, θ) = xθ. We will show that f is supermodular in (x, θ).

Pick any (x, θ) and (x′, θ′) in S ×Θ, and assume w.l.o.g. x ≥ x′. There are two cases to consider:

1. θ ≥ θ′ ⇒ (x, θ) ∨ (x′, θ′) = (x, θ), and (x, θ) ∧ (x′, θ′) = (x′, θ′). Then,

f(x, θ) + f(x′, θ′) = xθ + x′θ′ = f((x, θ) ∨ (x′, θ′)) + f((x, θ) ∧ (x′, θ′)),

so the supermodularity inequality holds (with equality).


2. θ < θ′ ⇒ (x, θ) ∨ (x′, θ′) = (x, θ′), and (x, θ) ∧ (x′, θ′) = (x′, θ). Then,

f((x, θ) ∨ (x′, θ′)) + f((x, θ) ∧ (x′, θ′)) = xθ′ + x′θ,

and we would have

f(x, θ) + f(x′, θ′) ≤ f((x, θ) ∨ (x′, θ′)) + f((x, θ) ∧ (x′, θ′))

if and only if

xθ + x′θ′ ≤ xθ′ + x′θ
⟺ x(θ′ − θ) − x′(θ′ − θ) ≥ 0
⟺ (x − x′)(θ′ − θ) ≥ 0,

which is indeed the case, since x ≥ x′ and θ′ > θ.

Therefore, f(x, θ) = xθ is supermodular in S ×Θ.
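The case analysis above can also be verified by brute force. The sketch below is not part of the original notes: it checks the supermodularity inequality over all pairs of points on a small grid, for f(x, θ) = xθ and, as a contrast, for −xθ (which is submodular, so the check fails).

```python
import itertools
import numpy as np

def is_supermodular(f, grid):
    """Check f(z) + f(z') <= f(z join z') + f(z meet z') on all pairs of grid points."""
    points = list(itertools.product(grid, grid))
    for (x, t), (xp, tp) in itertools.product(points, repeat=2):
        join = (max(x, xp), max(t, tp))
        meet = (min(x, xp), min(t, tp))
        if f(x, t) + f(xp, tp) > f(*join) + f(*meet) + 1e-12:
            return False
    return True

grid = np.linspace(0.0, 2.0, 9)                      # points in S = Theta = R_+
print(is_supermodular(lambda x, t: x * t, grid))     # True:  f(x,theta) = x*theta
print(is_supermodular(lambda x, t: -x * t, grid))    # False: -x*theta is submodular instead
```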

Definition 2.6.5 For S, Θ ⊂ R, a function f : S × Θ → R is said to satisfy increasing differences in (x, θ) if for all pairs (x, θ) and (x′, θ′) in S × Θ with x ≥ x′ and θ ≥ θ′,

f(x, θ) − f(x′, θ) ≥ f(x, θ′) − f(x′, θ′).

If the inequality is strict whenever x > x′ and θ > θ′, then f is said to satisfy strictly increasing differences.

In other words, f has increasing differences in (x, θ) if the difference

f(x, θ)− f(x′, θ), for x ≥ x′,

is increasing in θ.

Theorem 2.6.2 Let S,Θ ⊂ R, and suppose f : S ×Θ→ R is supermodular in (x, θ). Then

1. f is supermodular in x, for each fixed θ ∈ Θ (i.e., for any fixed θ ∈ Θ, and for any x, x′ ∈ S, we have f(x, θ) + f(x′, θ) ≤ f(x ∨ x′, θ) + f(x ∧ x′, θ)).

2. f satisfies increasing differences in (x, θ).

Proof: For part (1), fix θ. Let z = (x, θ), z′ = (x′, θ). Since f is supermodular in (x, θ):

f(x, θ) + f(x′, θ) ≤ f(x ∨ x′, θ) + f(x ∧ x′, θ),

or equivalently, f(z) + f(z′) ≤ f(z ∨ z′) + f(z ∧ z′),

and the result holds.


For part (2), pick any z = (x, θ) and z′ = (x′, θ′) that satisfy x ≥ x′ and θ ≥ θ′. Let w = (x, θ′) and w′ = (x′, θ). Then, w ∨ w′ = z and w ∧ w′ = z′. Since f is supermodular on S × Θ,

f(x, θ′) + f(x′, θ) = f(w) + f(w′) ≤ f(w ∨ w′) + f(w ∧ w′) = f(x, θ) + f(x′, θ′).

Rearranging terms,

f(x, θ) − f(x′, θ) ≥ f(x, θ′) − f(x′, θ′),

and so f also satisfies increasing differences, as claimed.

Remark: We will prove later on that the reverse of part (2) in the theorem also holds.

Recall: A function f : S ⊂ Rn → Rn is said to be of class Ck if the derivatives f (1), f (2), ..., f (k) existand are continuous (the continuity is automatic for all the derivatives except the last one, f (k)).Moreover, if f is Ck, then the cross-partial derivatives satisfy

∂2

∂zi∂zjf(z) =

∂2

∂zj∂zif(z).

Theorem 2.6.3 Let Z be an open sublattice of Rn. A C2 function h : Z → R is supermodularon Z if and only if for all z ∈ Z, we have

∂2

∂zi∂zjh(z) ≥ 0, i, j = 1, . . . , n, i 6= j.

Similarly, h is submodular if and only if for all z ∈ Z, we have

∂2

∂zi∂zjh(z) ≤ 0, i, j = 1, . . . , n, i 6= j.

Proof: We prove here the result for supermodularity for the case n = 2.

⇐) If∂2

∂zi∂zjh(z) ≥ 0, i, j = 1, . . . , n, i 6= j,

then for x1 > x2 and y1 > y2, ∫ y1

y2

∫ x1

x2

∂2

∂x∂yh(x, y)dxdy ≥ 0

So, ∫ y1

y2

∂y(h(x1, y)− h(x2, y)) dy ≥ 0,

and thus,h(x1, y1)− h(x2, y1)− (h(x1, y2)− h(x2, y2)) ≥ 0,

or equivalently,h(x1, y1)− h(x2, y1) ≥ h(x1, y2)− h(x2, y2),

which shows that h satisfies increasing differences and hence is supermodular.

85

Page 86: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

⇒) Suppose h is supermodular. Then, it satisfies increasing differences and so, for x1 > x2, y1 > y,

h(x1, y1)− h(x1, y)y1 − y

≥ h(x2, y1)− h(x2, y)y1 − y

.

Letting y1 → y, we have

∂yh(x1, y) ≥ ∂

∂yh(x2, y), when x1 ≥ x2,

implying that∂2

∂x∂yh(x, y) ≥ 0.

Note that the limit above defines a left derivative, but since f is differentiable, it is also the rightderivative.

2.6.3 Parametric monotonicity

Suppose S,Θ ⊂ R, f : S ×Θ→ R, and consider the optimization problem

maxx∈S

f(x, θ).

Here, by parametric monotonicity we mean that the higher the value of θ, the higher the maxi-mizer x∗(θ).6

Let’s give some intuition for strictly increasing differences implying parametric monotonicity. Weargue by contradiction. Suppose that in this maximization problem a solution exists for all θ ∈ Θ(e.g., suppose that f(·, θ) is continuous on S for each fixed θ, and that S is compact). Pick any twovalues θ1, θ2 with θ1 > θ2. Let x1, x2 be values that are optimal at θ1 and θ2, respectively. Thus,

f(x1, θ1)− f(x2, θ1) ≥ 0 ≥ f(x1, θ2)− f(x2, θ2). (2.6.1)

Suppose f satisfies strictly increasing differences, and that θ1 > θ2, but parametric monotonicityfails. Furthermore, assume x1 < x2. So, the vectors (x2, θ1) and (x1, θ2) satisfy x2 > x1 and θ1 > θ2.By strictly increasing differences:

f(x2, θ1)− f(x1, θ1) > f(x2, θ2)− f(x1, θ2),

contradicting (4.2.3). So, we must have x1 ≥ x2, where x1, x2 were arbitrary selections from thesets of optimal actions at θ1 and θ2, respectively.

In summary, if S,Θ ⊂ R, strictly increasing differences imply monotonicity of optimal actions inthe parameter θ ∈ Θ.

6Note that this concept is different from what is stated in the Envelope Theorem, which studies the marginal

change in the value of the maximized function, and not of the optimizer of that function:

Envelope Theorem: Consider a maximization problem: M(θ) = maxx f(x, θ). Let x∗(θ) be the argmax value

of x that solves the problem in terms of θ, i.e., M(θ) = f(x∗(θ), θ). Assume that f is continuously differentiable

in (x, θ), and that x∗ is continuously differentiable in θ. Then,

∂θM(θ) =

∂θf(y, θ)

˛y=x∗(θ)

.

86

Page 87: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey

Note: This result also holds for S ⊂ Rn, n ≥ 2, but the proof is different and requires additionalassumptions. The problem of the extension of the previous argument to higher dimensional settingsis that we cannot say anymore that x1 6≥ x2 implies x1 < x2.

The following theorem relaxes the “strict” condition of the increasing differences to guarantee para-metric monotonicity.

Theorem 2.6.4 Let S be a compact sublattice of Rn, Θ be a sublattice of Rl, and f : S×Θ→ R bea continuous function on S for each fixed θ. Suppose that f satisfies increasing differences in (x, θ),and is supermodular in x for each fixed θ. Let the correspondence D∗ : Θ→ S be defined by

D∗(θ) = arg maxf(x, θ)|x ∈ S.

Then,

1. For each θ ∈ Θ, D∗(θ) is a nonempty compact sublattice of Rn, and admits a greatest element,denoted x∗(θ).

2. x∗(θ1) ≥ x∗(θ2) whenever θ1 > θ2.

3. If f satisfies strictly increasing differences in (x, θ), then x1 ≥ x2 for any x1 ∈ D(θ1) andx2 ∈ D(θ2), whenever θ1 > θ2.

Proof: For part (1): Since f is continuous on S for each fixed θ, and since S is compact, D∗(θ) 6= ∅for each θ. Fix θ and take a sequence xp in D∗(θ) converging to x ∈ S. Then, for any y ∈ S,since xp is optimal, we have

f(xp, θ) ≥ f(y, θ).

Taking limit as p→∞, and using the continuity of f(·, θ), we obtain

f(x, θ) ≥ f(y, θ),

so x ∈ D∗(θ). Therefore, D∗(θ) is closed, and as a closed subset of a compact set S, it is also compact.Now, we argue by contradiction: Let x and x′ be distinct elements of D∗(θ). If x ∧ x′ 6∈ D∗(θ), wemust have

f(x ∧ x′, θ) < f(x, θ) = f(x′, θ).

However, supermodularity in x means

f(x, θ) + f(x′, θ)︸ ︷︷ ︸2f(x,θ)

≤ f(x′ ∨ x, θ) + f(x′ ∧ x, θ) < f(x′ ∨ x, θ) + f(x, θ),

which impliesf(x′ ∨ x, θ) > f(x, θ) = f(x′, θ),

which in turn contradicts the presumed optimality of x and x′ at θ. A similar argument alsoestablishes that x ∧ x′ ∈ D∗(θ). Thus, D∗(θ) is a sublattice of Rn, and as a nonempty compactsublattice of Rn, admits a greatest element x∗(θ).

87

Page 88: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

For part (2): Let θ1 and θ2 be given with θ1 > θ2. Let x1 ∈ D∗(θ1), and x2 ∈ D∗(θ2). Then, wehave

0 ≤ f(x1, θ1)− f(x1 ∨ x2, θ1) (by optimality of x1 at θ1)

≤ f(x1 ∨ x2, θ1)− f(x2, θ1) (by supermodularity in x)

≤ f(x1 ∨ x2, θ2)− f(x2, θ2) (by increasing differences in (x, θ))

≤ 0 (by optimality of x2 at θ2),

so equality holds at every point in this string. Now, suppose x1 = x∗(θ1) and x2 = x∗(θ2). Sinceequality holds at all points in the string, using the first equality we have

f(x1 ∨ x2, θ1) = f(x1, θ1),

and so x1 ∨ x2 is also an optimal action at θ1. If x1 6≥ x2, then we would have x1 ∨ x2 > x1, andthis contradicts the definition of x1 as the greatest element of D∗(θ1). Thus, we must have x1 ≥ x2.

For part (3): Suppose that x1 ∈ D∗(θ1), x2 ∈ D∗(θ2). Suppose that x1 6≥ x2. Then, x2 > x1 ∧ x2.If f satisfies strictly increasing differences, then since θ1 > θ2, we have

f(x2, θ1)− f(x1 ∧ x2, θ1) > f(x2, θ2)− f(x1 ∧ x2, θ2),

so the third inequality in the string above becomes strict, contradicting the equality.

Remark: For the case where S,Θ ⊂ R, from Theorem 2.6.2 it can be seen that if f is supermodular,it automatically verifies the hypotheses of Theorem 2.6.4, and therefore in principle supermodular-ity in R2 constitutes a sufficient condition for parametric monotonicity. For a more general casein Rn, n > 2, a related result follows.

For this general case, the definition of increasing differences is: For all z ∈ Z, for all distincti, j ∈ 1, . . . , n, and for all z′i, z

′j such that

z′i ≥ zi, and z′j ≥ zj ;

it is the case that

f(z−ij , z′i, z′j)− f(z−ij , zi, z′j) ≥ f(z−ij , z′i, zj)− f(z−ij , zi, zj).

In words, f has increasing differences on Z if it has increasing differences in each pair (zi, zj) whenall other coordinates are held fixed at some value.

Theorem 2.6.5 A function f : Z ⊂ Rn → R is supermodular on Z if and only if f has increasingdifferences on Z.

Proof: The implication “⇒” can be proved by a slight modification of part (2) in Theorem 2.6.2.

To prove “⇐”, pick any z and z′ in Z. We are required to show that

f(z) + f(z′) ≤ f(z ∨ z′) + f(z ∧ z′).

88

Page 89: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey

If z ≥ z′ or z ≤ z′, the inequality trivially holds. So, suppose z and z′ are not comparable under ≥.For notational convenience, arrange the coordinates of z and z′ so that

z ∨ z′ = (z′1, . . . , z′k, zk+1, . . . , zn),

andz ∧ z′ = (z1, . . . , zk, z

′k+1, . . . , z

′n).

Note that since z and z′ are not comparable under ≥, we must have 0 < k < n.

Now, for 0 ≤ i ≤ j ≤ n, define

zi,j = (z′1, . . . , z′i, zi+1, . . . , zj , z

′j+1, . . . , z

′n).

Then, we havez0,k = z ∧ z′, zk,n = z ∨ z′, z0,n = z, zk,k = z′. (2.6.2)

Since f has increasing differences on Z, it is the case that for all 0 ≤ i < k ≤ j < n,

f(zi+1,j+1)− f(zi,j+1) ≥ f(zi+1,j)− f(zi,j).

Therefore, we have for k ≤ j < n,

f(zk,j+1)− f(z0,j+1) =k−1∑i=0

[f(zi+1,j+1)− f(zi,j+1)]

≥k−1∑i=0

[f(zi+1,j)− f(zi,j)]

= f(zk,j)− f(z0,j).

Since this inequality holds for all j satisfying k ≤ j < n, it follows that the LHS is at its highestvalue at j = n− 1, while the RHS is at its lowest value when j = k. Therefore,

f(zk,n)− f(z0,n) ≥ f(zk,k)− f(z0,k).

From (2.6.2), this is precisely the statement that

f(z ∨ z′)− f(z) ≥ f(z′)− f(z ∧ z′).

Since z and z′ were chosen arbitrarily, f is shown to be supermodular on Z.

Remark: From Theorem 2.6.5, it is sufficient to prove supermodularity (or increasing differences)to prove parametric monotonicity.

2.6.4 Applications to DP

We include here a couple of examples that show how useful the concept of parametric monotonicitycould be to characterize monotonicity properties of the optimal policy.

89

Page 90: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

Example 2.6.1 (A gambling model with changing win probabilities)

• Consider a gambler who is allowed to bet any amount up to his present fortune at each play.

• He will win or lose that amount according to a given probability p.

• Before each gamble, the value of p changes (p ∼ F ).

• Control: On each play, the gambler must decide, after the win probability is announced, how much

to bet.

• Consider a sequence of N gambles.

• Objective: Maximize the expected value of a given utility function G of his final fortune x,

where G(x) is continuously differentiable and nondecreasing in x.

• State: (x, p), where x is his current fortune, and p is the current win probability.

• Assume indices run backward in time.

DP formulation:

Define the value function Vk(x, p) as the maximal expected final utility for state (x, p) when there are k

games left.

The DP algorithm is:

V0(x, p) = G(x),

and for k = N,N − 1, . . . , 1,

Vk(x, p) = max0≤u≤x

p

∫ 1

0Vk−1(x+ u, α)dF (α) + (1− p)

∫ 1

0Vk−1(x− u, α)dF (α)

.

Let uk(x, p) be the largest u that maximizes this equation. Let gk(u, p) be the expression to maximize

above, i.e.,

gk(u, p) = p

∫ 1

0Vk−1(x+ u, α)dF (α) + (1− p)

∫ 1

0Vk−1(x− u, α)dF (α).

Intuitively, for given k and x, the optimal amount uk(x, p) to bet should be increasing in p. So, we

would like to prove parametric monotonicity of uk(x, p) in p. To this end, it would be enough to prove

increasing differences of gk(u, p) in (u, p), or equivalently, it would be enough to prove supermodularity

of gk(u, p) in (u, p). Or it would be enough to prove

∂2

∂u∂pgk(u, p) ≥ 0.

The derivation proceeds as follows:

∂pgk(u, p) =

∫ 1

0Vk−1(x+ u, α)dF (α)−

∫ 1

0Vk−1(x− u, α)dF (α).

90

Page 91: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey

Then, by the Leibniz rule7

∂2

∂u∂pgk(u, p) =

∫ 1

0

∂uVk−1(x+ u, α)dF (α)−

∫ 1

0

∂uVk−1(x− u, α)dF (α).

Then,∂2

∂u∂pgk(u, p) ≥ 0

if for all α,∂

∂u[Vk−1(x+ u, α)− Vk−1(x− u, α)] ≥ 0,

or equivalently, if for all α,

Vk−1(x+ u, α)− Vk−1(x− u, α)

increases in u, which follows if Vk−1(z, α) is increasing in z, which immediately holds because for z′ > z,

Vk−1(z′, α) ≥ Vk−1(z, α),

since in the former we are maximizing over a bigger domain.

For V0(·, α), it holds because G(z′) ≥ G(z).

Example 2.6.2 (An optimal allocation problem subject to penalty costs)

• There are N stages to construct I components sequentially.

• At each stage, we allocate u dollars for the construction of one component.

• If we allocate $u, then the component constructed will be a success w.p. P (u) (continuous,

nondecreasing, with P (0) = 0).

• After each component is constructed, we are informed as to whether or not it is successful.

• If at the end of N stages we are j components short, we incur a penalty cost C(j) (increasing,

with C(j) = 0 for j ≤ 0).

• Control: How much money to allocate in each stage to minimize the total expected cost (con-

struction + penalty).

• State: Number of successful components still needed.

• Indices run backward in time.

DP formulation:

Define the value function Vk(i) as the minimal expected remaining cost when state is i and k stages

remain.

The DP algorithm is:

V0(i) = C(i),

7We would need to prove that Vk(x, p) and ∂∂xVk(x, p) are continuous in x. A sufficient condition or that is

that µ∗k(x, p) is continuously differentiable in x.

91

Page 92: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

and for k = N,N − 1, . . . , 1, and i > 0,

Vk(i) = minu≥0u+ P (u)Vk−1(i− 1) + (1− P (u))Vk−1(i) . (2.6.3)

We set Vk(i) = 0, ∀i ≤ 0, and for all k.

It follows immediately from the definition of Vk(i) and the monotonicity of C(i) that Vk(i) increases

in i and decreases in k.

Let uk(i) be the minimizer of (2.6.3). Two intuitive results should follow:

1. “The more we need, the more we should invest” (i.e., uk(i) is increasing in i).

2. “The more time we have, the less we need to invest at each stage” (i.e., uk(i) is decreasing in k).

Let’s determine conditions on C(·) that make the previous two intuitions valid.

Define

gk(i, u) = u+ P (u)Vk−1(i− 1) + (1− P (u))Vk−1(i).

Minimizing gk(i, u) is equivalent to maximizing (−gk(i, u)). Then, in order to prove uk(i) increasing

in i, it is enough to prove (−gk(i, u)) supermodular in (i, u), or gk(i, u) submodular in (i, u). Note that

here we are treating i as a continuous quantity.

So, uk(i) increases in i if

∂2

∂i∂ugk(i, u) ≤ 0.

We compute this cross-partial derivative. First, we calculate

∂ugk(i, u) = 1 + P ′(u)[Vk−1(i− 1)− Vk−1(i)],

and then∂2

∂i∂ugk(i, u) = P ′(u)︸ ︷︷ ︸

≥0

∂i[Vk−1(i− 1)− Vk−1(i)] ≤ 0,

so that uk(i) increases in i if [Vk−1(i − 1) − Vk−1(i)] decreases in i. Similarly, uk(i) decreases in k if

[Vk−1(i− 1)− Vk−1(i)] increases in k. Therefore, submodularity gives a sufficient condition on gk(i, u),

which ensures the desired monotonicity of the optimal policy. For this example, we show below that

if C(i) is convex in i, then [Vk−1(i−1)−Vk−1(i)] decreases in i and increases in k, ensuring the desired

structure of the optimal policy.

Two results are easy to verify:

• Vk(i) is increasing in i, for a given k.

• Vk(i) is decreasing in k, for a given i.

Proposition 2.6.1 If C(i+ 2)−C(i+ 1) ≥ C(i+ 1)−C(i), ∀i (i.e., C(·) convex), then uk(i) increases

in i and decreases in k.

92

Page 93: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey

Proof: Define

Ai,k : Vk+1(i+ 1)− Vk+1(i) ≤ Vk(i+ 1)− Vk(i), k ≥ 0

Bi,k : Vk+1(i)− Vk(i) ≤ Vk+2(i)− Vk+1(i), k ≥ 0

Ci,k : Vk(i+ 1)− Vk(i) ≤ Vk(i+ 2)− Vk(i+ 1), k ≥ 0

We proceed by induction on n = k + i. For n = 0 (i.e., k = i = 0):

A0,0 : V1(1)︸ ︷︷ ︸=0 from (2.6.3)

−V1(0)︸ ︷︷ ︸0

≤ V0(1)︸ ︷︷ ︸0

−V0(0)︸ ︷︷ ︸0

,

B0,0 : V1(0)︸ ︷︷ ︸0

−V0(0)︸ ︷︷ ︸0

≤ V2(0)︸ ︷︷ ︸0

−V1(0)︸ ︷︷ ︸0

,

C0,0 : V0(1)︸ ︷︷ ︸C(1)

− V0(0)︸ ︷︷ ︸C(0)=0

≤ V0(2)︸ ︷︷ ︸C(2)

−V0(1)︸ ︷︷ ︸C(1)

,

where the last inequality holds because C(·) is convex.

IH: The 3 inequalities above are true for k + i < n.

Suppose now that k + i = n. We proceed by proving one inequality at a time.

1. For Ai,k:

If i = 0 ⇒ A0,k : Vk+1(1) − Vk+1(0)︸ ︷︷ ︸0

≤ Vk(1) − Vk(0)︸ ︷︷ ︸0

, which holds because Vk(i) is decreasing

in k.

If i > 0, then there is u such that

Vk+1(i) = u+ P (u)Vk(i− 1) + (1− P (u))Vk(i).

Thus,

Vk+1(i)− Vk(i) = u+ P (u)[Vk(i− 1)− Vk(i)] (2.6.4)

Also, since u is the minimizer just for Vk+1(i),

Vk+1(i+ 1) ≤ u+ P (u)Vk(i) + (1− P (u))Vk(i+ 1).

Then,

Vk+1(i+ 1)− Vk(i+ 1) ≤ u+ P (u)[Vk(i)− Vk(i+ 1)] (2.6.5)

Note that from Ci−1,k (which holds by IH because i− 1 + k = n− 1), we get

Vk(i)− Vk(i+ 1) ≤ Vk(i− 1)− Vk(i)

Then, using the RHS of (4.2.5) and (5.3.2), we have

Vk+1(i+ 1)− Vk(i+ 1) ≤ Vk+1(i)− Vk(i),

or equivalently,

Vk+1(i+ 1)− Vk+1(i) ≤ Vk(i+ 1)− Vk(i),

which is exactly Ai,k.

93

Page 94: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

2. For Bi,k:

Note that for some u,

Vk+2(i) = u+ P (u)Vk+1(i− 1) + (1− P (u))Vk+1(i),

or equivalently,

Vk+2(i)− Vk+1(i) = u+ P (u)[Vk+1(i− 1)− Vk+1(i)]. (2.6.6)

Also, since u is the minimizer for Vk+2(i),

Vk+1(i) ≤ u+ P (u)Vk(i− 1) + (1− P (u))Vk(i),

so that

Vk+1(i)− Vk(i) ≤ u+ P (u)[Vk(i− 1)− Vk(i)]. (2.6.7)

By IH, Ai−1,k, for i− 1 + k = n− 1, holds. So,

Vk+1(i)− Vk+1(i− 1) ≤ Vk(i)− Vk(i− 1),

or equivalently,

Vk(i− 1)− Vk(i) ≤ Vk+1(i− 1)− Vk+1(i).

Plugging it in (3.3.4), and using the RHS of (4.2.7), we obtain

Vk+1(i)− Vk(i) ≤ Vk+2(i)− Vk+1(i),

which is exactly Bi,k.

3. For Ci,k, we first note that Bi+1,k−1 (already proved since i+ 1 + k − 1 = n) states that

Vk(i+ 1)− Vk−1(i+ 1) ≤ Vk+1(i+ 1)− Vk(i+ 1),

or equivalently,

2Vk(i+ 1) ≤ Vk+1(i+ 1) + Vk−1(i+ 1). (2.6.8)

Hence, if we can show that,

Vk−1(i+ 1) + Vk+1(i+ 1) ≤ Vk(i) + Vk(i+ 2), (2.6.9)

then from (2.6.8) and (2.6.9) we would have

2Vk(i+ 1) ≤ Vk(i) + Vk(i+ 2),

or equivalently,

Vk(i+ 1)− Vk(i) ≤ Vk(i+ 2)− Vk(i+ 1),

which is exactly Ci,k.

Now, for some u,

Vk(i+ 2) = u+ P (u)Vk−1(i+ 1) + (1− P (u))Vk−1(i+ 2),

which implies

Vk(i+ 2)− Vk−1(i+ 1) = u+ P (u)Vk−1(i+ 1) + (1− P (u))Vk−1(i+ 2)− Vk−1(i+ 1)

= u+ (1− P (u))[Vk−1(i+ 2)− Vk−1(i+ 1)]. (2.6.10)

94

Page 95: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey

Moreover, since u is the minimizer of Vk(i+ 2):

Vk+1(i+ 1) ≤ u+ P (u)Vk(i) + (1− P (u))Vk(i+ 1).

Subtracting Vk(i) from both sides:

Vk+1(i+ 1)− Vk(i) ≤ u+ (1− P (u))[Vk(i+ 1)− Vk(i)].

Then, equation (2.6.9) will follow if we can prove that

Vk(i+ 1)− Vk(i) ≤ Vk−1(i+ 2)− Vk−1(i+ 1), (2.6.11)

because then

Vk+1(i+ 1)− Vk(i) ≤ u+ (1− P (u))[Vk−1(i+ 2)− Vk−1(i+ 1)]

= Vk(i+ 2)− Vk−1(i+ 1).

Now, from Ai,k−1 (which holds by IH), it follows that

Vk(i+ 1)− Vk(i) ≤ Vk−1(i+ 1)− Vk−1(i). (2.6.12)

Also, from Ci,k−1 (which holds by IH), it follows that

Vk−1(i+ 1)− Vk−1(i) ≤ Vk−1(i+ 2)− Vk−1(i+ 1). (2.6.13)

Finally, (2.6.12) and (2.6.13) ⇒ (2.6.11) ⇒ (2.6.9), and we close this case.

In the end, the three inequalities hold, and the proof is completed.

2.7 Extensions

2.7.1 The Value of Information

The value of information is the reduction in cost between optimal closed-loop and open-loop policies.To illustrate its computation, we revisit the two-game chess match example.

Example 2.7.1 (Two-game chess match)

• Closed-Loop: Recall that the optimal policy when pd > pw is to play timid if and only if one is

ahead in the score. Figure 2.7.1 illustrates this. The optimal payoff under the closed-loop policy

is the sum of the payoffs in the leaves of Figure 2.7.1. This is because the four payoffs correspond

to four mutually exclusive outcomes. The total payoff is

P(win) = pwpd + p2w((1− pd) + (1− pw))

= p2w(2− pw) + pw(1− pw)pd. (2.7.1)

For example, if pw = 0.45 and pd = 0.9, we know that P(win) = 0.53.

95

Page 96: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

0-0

1-0

0-1

1.5-0.5

1-1

1-1

0-2

bold

timid

bold

pw

1-pw

pd

1-pd

1-pw

pw

pw pd

pw2 (1-pd)

pw2 (1-pw)

0 (already lost)

Figure 2.7.1: Optimal closed-loop policy for the two-game chess match. The payoffs included next to the leaves

represent the total cumulative payoff for following that particular branch from the root.

• Open-Loop: There are four possible policies:

Note that the latter two policies lead to the same payoff, and that this payoff dominates the first

policy (i.e., playing (timid-timid)) because

pwpd︸ ︷︷ ︸≥p2dpw

+ p2w(1− pd)︸ ︷︷ ︸≥0

≥ p2dpw.

Therefore, the maximum open-loop probability of winning the match is:

maxp2w(3− 2pw)︸ ︷︷ ︸

Play (bold,bold)

, pwpd + p2w(1− pd)︸ ︷︷ ︸

Play (bold, timid) or (timid, bold)

= p2w + pw(1− pw) max2pw, pd (2.7.2)

So,

– if pd > 2pw, then the optimal policy is to play either (timid,bold) or (bold, timid);

– if pd ≤ 2pw, then the optimal policy is to play (bold,bold).

Again, if pw = 0.45 and pd = 0.9, then P(win) = 0.425

• For the aforementioned probabilities pw = 0.45 and pd = 0.9, the value of information is the

difference between both optimal payoffs: 0.53− 0.425 = 0.105.

More generally, by subtracting (2.7.1)-(2.7.2):

Value of Information = p2w(2− pw) + pw(1− pw)pd − p2

w − pw(1− pw) max2pw, pd= pw(1− pw) minpw, pd − pw.

2.7.2 State Augmentation

In the basic DP formulation, the random noise is independent across all periods, and the controldepends just on the current state. In this regard, the system is of the Markovian type. In order todeal with a more general situation, we enlarge the state definition so that the current state capturesinformation of the past. In some applications, this past information could be helpful for the future.

96

Page 97: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey

0-0

0.5-0.5

0-1

1-1

0-2

timid

timidpd

1-pd

pd

1-pd

1-pd

pd

pd2 pw

0

0

0

0.5-1.5

0.5-1.5

timid

0-0

0-0

0-1

2-0

0-2

bold

boldpw

1-pw

pw

1-pw

1-pw

pw

pw2

Pw2 (1-pw)

0

1-1

1-1

bold Pw2 (1-pw)

Play (timid-timid) Play (bold-bold)

P(win) = pd2 pw P(win) = pw

2+2 pw2 (1-pw) = pw

2 (3 2 pw)

0-0

1-0

0-1

0-2

bold

timidpw

1-pw

pd

1-pd

1-pd

pd

pw pd

pw2(1-pd)

0

1-1

timid

0

0-0

0-1

0-2

timid

boldpd

1-pd

pw

1-pw

1-pw

pw

0

0

1-1

bold

pw2(1-pd)

1.5-0.5

0.5-1.5

0.5-0.5

1.5-0.5

0.5-1.5

pd pw

Play (bold-timid)Play (timid-bold)

P(win) = pw pd + pw2 (1 pd) P(win) = pw pd + pw

2 (1 pd)

Figure 2.7.2: Open-loop policies for the two-game chess match. The payoffs included next to the leaves represent

the total cumulative payoff for following that particular branch from the root.

Time Lags

Suppose that the next state xk+1 depends on the last two states xk and xk−1, and on the last twocontrols uk and uk−1. For instance,

xk+1 = fk(xk, xk−1, uk, uk−1, wk), k = 1, . . . , N − 1,

x1 = f0(x0, u0, w0).

We redefine the system equation as follows: xk+1

yk+1

sk+1

︸ ︷︷ ︸exk+1

=

fk(xk, yk, uk, sk, wk)xkuk

︸ ︷︷ ︸efk(exk,uk,wk)

,

where xk = (xk, yk, sk) = (xk, xk−1, uk−1).

DP algorithm

97

Page 98: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

When the DP algorithm for the reformulated problem is translated in terms of the variables of theoriginal problem, it takes the form:

JN (xN ) = gN (xN ),

JN−1(xN−1, xN−2, uN−2) = minuN−1∈UN−1(xN−1)

EwN−1 gN−1(xN−1, uN−1, wN−1)

+ JN (fN−1(xN−1, xN−2, uN−1, uN−2, wN−1)︸ ︷︷ ︸xN

) ,

Jk(xk, xk−1 . . . , uk−1) = minuk∈Uk(xk)

Ewk gk(xk, uk, wk) + Jk+1(fk(xk, xk−1, uk, uk−1, wk), xk, uk) ,

k = 1, . . . , N − 2,

J0(x0) = minu0∈U0(x0)

Ew0 g0(x0, u0, w0) + J1(f0(x0, u0, w0), x0, u0) .

Correlated Disturbances

Assume that w0, w1, . . . , wN−1 can be represented as the output of a linear system driven by inde-pendent r.v. For example, suppose that disturbances can be modeled as:

wk = Ckyk+1, where yk+1 = Akyk + ξk, k = 0, 1, . . . , N − 1,

where Ck, Ak are matrices of appropriate dimension, and ξk are independent random vectors. Byviewing yk as an additional state variable; we obtain the new system equation:(

xk+1

yk+1

)=

(fk(xk, uk, Ck(Akyk + ξk))

Akyk + ξk

),

for some initial y0. In period k, this correlated disturbance can be represented as the output of alinear system driven by independent random vectors:

ξk - yk+1 = Akyk + ξkyk+1- Ck

wk -

Observation: In order to have perfect state information, the controller must be able to observe yk.This occurs for instance when Ck is the identity matrix, and therefore wk = yk+1. Since wk isrealized at the end of period k, its known value is carried over the next period k + 1 through thestate component yk+1.

DP algorithm

JN (xN , yN ) = gN (xN )

Jk(xk, yk) = minuk∈Uk(xk)

Eξk

gk(xk, uk, Ck(Akyk + ξk)) + Jk+1(fk(xk, uk, Ck(Akyk + ξk)︸ ︷︷ ︸xk+1

, Akyk + ξk︸ ︷︷ ︸yk+1

)

98

Page 99: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING Prof. R. Caldentey

2.7.3 Forecasts

Suppose that at time k, the controller has access to a forecast yk that results in a reassessment ofthe probability distribution of wk.

In particular, suppose that at the beginning of period k, the controller receives an accurate predictionthat the next disturbance wk will be selected according to a particular prob. distribution from acollection Q1, Q2, . . . , Qm. For example if a forecast is i, then wk is selected according to aprobability distribution Qi. The a priory probability that the forecast will be i is denoted by piand is given.

System equation: (xk+1

yk+1

)=

(fk(xk, uk, wk)

ξk

),

where ξk is the r.v. taking value i w.p. pi. So, when ξk takes the value i, then wk+1 will occuraccording to distribution Qi. Note that there are two sources of randomness now: wk = (wk, ξk),where wk stands for the outcome of the previous forecast in the current period, and ξk passes thenew forecast to the next period.

DP algorithm

JN (xN , yN ) = gN (xN )

Jk(xk, yk) = minuk∈Uk(xk)

Ewk

gk(xk, uk, wk) +

m∑i=1

piJk+1(fk(xk, uk, wk), i)/yk

So, in current period k, the forecast yk is known (given), and it determines the distribution for thecurrent noise wk. For the future, the forecast is i w.p. pi.

2.7.4 Multiplicative Cost Functional

The basic formulation of the DP problem assumes that the cost functional is additive over time.That is, every period (depending on states, actions and uncertainty) the system generates a cost andit is the sum of these single-period costs that we are interested to minimize. It should be relativelyclear by now why this additivity assumption is crucial for the DP method to work. However, whatwe really need is an appropriate form of separability of the cost functional into its single-periodcomponents and additivity is one convenient (and most natural form most practical applications)form to ensure this separability, but is is not the only one. The following exercises clarify this point.

Exercise 2.7.1 In the framework of the basic problem, consider the case where the cost is of theform

Ew[exp

(gN (xN ) + sumN−1

k=1 gk(xk, uk, wk))].

a) Show that the optimal cost and optimal policy can be obtained from the DP-like algorithm

JN (xN ) = exp(gN (xN )), Jk(xk) = minuk∈Uk

E [Jk+1(fk(xk, uk, wk)) exp(gk(xk, uk, wk))] .

99

Page 100: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 2. DISCRETE DYNAMIC PROGRAMMING

b) Define the functions Vk(xk) = ln(Jk(xk)). Assume also that gk(x, u, w) = gk(x, u), that is, thegk are independent of wk. Show that the above algorithm can be rewritten as follows:

VN (xn) = gN (xN ),

Vk(xk) = minuk∈Uk

gk(xk, uk) + ln (E [exp (Vk+1(fk(xk, uk, wk)))]) .

Exercise 2.7.2 Consider the case where the cost has the following multiplicative form

Ew

[gN (xN )

N1∏k=1

gk(xk, uk, wk)

].

Develop a DP-like algorithm for this problem assuming gk(xk, uk, wk) ≥ 0 for all xk, uk and wk.

100

Page 101: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Chapter 3

Applications

3.1 Inventory Control

In this section, we study the inventory control problem discussed in Example 2.1.1.

3.1.1 Problem setup

We assume the following:

• Excess demand in each period is backlogged and is filled when additional inventory becomesavailable, i.e.,

xk+1 = xk + uk − wk, k = 0, 1, . . . , N − 1.

• Demands wk take values within a bounded interval and are independent.

• Cost of state x:r(x) = pmax0,−x+ hmax0, x,

where p ≥ 0 is the per-unit backlog cost, and h ≥ 0 is the per-unit holding cost.

• Per-unit purchasing cost c.

• Total expected cost to be minimized:

E

[N−1∑k=0

(cuk + pmax0, wk − xk − uk+ hmax0, xk + uk − wk

],

where the costs are incurred based on the inventory (potentially, negative) available at theend of each period k.

• Suppose that p > c (otherwise, if c ≥ p, it would never be optimal to buy stock in the lastperiod N − 1 and possibly in the earlier periods).

• Most of the subsequent analysis generalizes to the case where r(·) is a convex function thatgrows to infinity with asymptotic slopes p and h as its argument tends to −∞ and∞, respec-tively.

101

Page 102: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

Stock ordered at beginning of Period k

Stock at beginningof Period k

Demand duringPeriod k

Figure 3.1.1: System dynamics for the inventory control problem.

Figure 3.1.1 illustrates the problem setup and the dynamics of the system.

By applying the DP algorithm, we have

JN (xN ) = 0,

Jk(xk) = minuk≥0

cuk +H(xk + uk) + Ewk [Jk+1(xk + uk − wk)] , (3.1.1)

whereH(y) = E[r(y − wk)] = pE[(wk − y)+] + hE[(y − wk)+].

If the probability distribution of wk is time-varying, then H depends on k. To simplify notation inwhat follows, we will assume that all demands are identically distributed.

By defining yk = xk + uk︸︷︷︸≥0

(i.e., yk is the inventory level right after getting the new units, and

before demand for the period is realized), we could write

Jk(xk) = minyk≥xk

cyk +H(yk) + Ewk [Jk+1(yk − wk)] − cxk. (3.1.2)

3.1.2 Structure of the cost function

• Note that H(y) is convex, since for a given wk, both terms in its definition are convex (Fig-ure 3.1.2 illustrates this)⇒ the sum is convex⇒ taking expectation on wk preserves convexity.

• Assume Jk+1(·) is convex (to be proved later), then the function Gk(y) minimized in (3.1.2) isconvex. Suppose for now that there is an unconstrained minimum Sk (existence to be verified);that is, for each k, the scalar Sk minimizes the function

Gk(y) = cy +H(y) + Ew[Jk+1(y − w)].

In addition, if Gk(y) has the shape shown in Figure 3.1.3, then the minimizer of Gk(y), foryk ≥ xk, is

y∗k =

Sk if xk < Skxk if xk ≥ Sk.

102

Page 103: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

Max0,wk-y

ywk

Max0,y-wk

ywk

Figure 3.1.2: Graphical illustration of the two terms in the H function.

yk

Gk(y)

Skxk1 xk2

Figure 3.1.3: The function Gk(y) has a “bowl shape”. The minimum for yk ≥ xk1 is Sk; the minimum for yk ≥ xk2is xk2 .

• Using the reverse transformation uk = yk − xk (recall that uk is the amount ordered), thenan optimal policy is determined by a sequence of scalars S0, S1, . . . , SN−1 and has the form

µ∗k(xk) =

Sk − xk if xk < Sk0 if xk ≥ Sk.

(3.1.3)

This control is known as basestock policy, with basestock level Sk.

To complete the proof of the optimality of the control policy (3.1.3), we need to prove the nextresult:

Proposition 3.1.1 The following three facts hold:

1. The value function Jk+1(y) (and hence, Gk(y)) is convex in y, k = 0, 1, . . . , N − 1.

2. lim|y|→∞Gk(y) =∞, ∀k.

3. lim|y|→∞ Jk(y) =∞, ∀k.

103

Page 104: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

Proof: By induction. For k = N − 1,

GN−1(y) = cy +H(y) + Ew[JN (y − w)︸ ︷︷ ︸0

],

and since H(·) is convex, GN−1(y) is convex. For y “very negative”, ∂∂yH(y) = −p, and so

∂∂yGN−1(y) = c − p < 0. For y “very positive”, ∂

∂yH(y) = h, and so ∂∂yGN−1(y) = c + h > 0.

So, lim|y|→∞GN−1(y) =∞.1 Hence, the optimal control for the last period turns out to be

µ∗N−1(xN−1) =

SN−1 − xN−1 if xN−1 < SN−1

0 if xN−1 ≥ SN−1,(3.1.4)

and from the DP algorithm in (3.1.1), we get

JN−1(xN−1) =

c(SN−1 − xN−1) +H(SN−1) if xN−1 < SN−1

H(xN−1) if xN−1 ≥ SN−1.

Before continuing, we need the following auxiliary result:

Claim: JN−1(xN−1) is convex in xN−1.

Proof: Note that we can write

JN−1(x) =

−cx+ cSN−1 +H(SN−1) if x < SN−1

H(x) if x ≥ SN−1.(3.1.5)

Figure 3.1.4 illustrates the convexity of the function GN−1(y) = cy + H(y). Recall that we haddenoted SN−1 the unconstrained minimizer of GN−1(y). The unconstrained minimizer H∗ of the

H*

Slope= cH(SN-1)

GN-1(y)=cy+H(y)

Figure 3.1.4: The function GN−1(y) is convex with unconstrained minimizer SN−1.

function H(y) occurs to the right of SN−1. To verify this, compute

∂yGN−1(y) = c+

∂yH(y).

1Note that GN−1(y) is shifted one index back in the argument to show convexity, since given the convexity of JN (·),it turns out to be convex. However, we still need to prove the convexity of JN−1(·).

104

Page 105: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

2 Evaluating the derivative at SN−1, we get

∂yGN−1(SN−1) = c+

∂yH(SN−1) = 0,

and therefore, ∂∂yH(SN−1) = −c < 0; that is, H(·) is decreasing at SN−1, and thus its minimum H∗

occurs to its right.

Figure 3.1.5 plots JN−1(xN−1). Note that according to (3.1.5), the function is linear to the left

c xN-1

H*

H(SN-1)

cSN-1+H(SN-1)c SN-1

J(xN-1) is linear in xN-1 for xN-1< SN-1

Slope= c

Figure 3.1.5: The function JN−1(xN−1) is convex with unconstrained minimizer H∗.

of SN−1, and tracks H(xN−1) to the right of SN−1. The minimum value of JN−1(·) occurs atxN−1 = H∗, but we should be cautious on how to interpret this fact: This is the “best possiblestate” that the controller can reach, however, the purpose of DP is to prescribe the best courseof action for any initial state xN−1 at period N − 1, which is given by the optimal control (3.1.4)above.

Continuing with the proof of Proposition 3.1.1, so far we have that given the convexity of JN (x),we prove the convexity of GN−1(x), and then the convexity of JN−1(x). Furthermore, Figure 3.1.5also shows that

lim|y|→∞

JN−1(y) =∞.

The argument can be repeated to show that for all k = N − 2, . . . , 0, if Jk+1(x) is convex,lim|y|→∞ Jk+1(y) =∞, and lim|y|→∞Gk(y) =∞, then we have

Jk(xk) =

c(Sk − xk) +H(Sk) + E[Jk+1(Sk − wk)] if xk < SkH(xk) + E[Jk+1(xk − wk)] if xk ≥ Sk,

where Sk minimizesGk(y) = cy+H(y)+E[Jk+1(y−w)]. Furthermore, Jk(y) is convex, lim|y|→∞ Jk(y) =∞, Gk−1(y) is convex, and lim|y|→∞Gk−1(y) =∞.

2Note that the function H(y), on a sample path basis, is not differentiable everywhere (see Figure 3.1.2). However,

the probability of the r.v. hitting the value y is zero if w has a continuous density, and so we can assert that H(·) is

differentiable w.p.1.

105

Page 106: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

Technical note: To formally complete the proof above, when taking derivative of Gk(y), thatwill involve taking derivative of a expected value. Under relatively mild technical conditions, wecan safely interchange differentiation and expectation. For example, it is safe to do that when thedensity fw(w) of the r.v. does not depend on y. More formally, if Rw is the support of the r.v. w,

∂xE[g(x,w)] =

∂x

∫w∈Rw

g(x,w)fw(w)dw.

Using Leibniz’s rule, if the function fw(w) does not depend on x, the set Rw does not depend on xeither, and the derivative ∂

∂xg(x,w) is well defined and bounded, we can interchange derivative andintegral:

∂x

∫w∈Rw

g(x,w)fw(w)dw =∫w∈Rw

(∂

∂xg(x,w)

)fw(w)dw,

and so∂

∂xE[g(x,w)] = E

[∂

∂xg(x,w)

].

3.1.3 Positive fixed cost and (s, S) policies

Suppose that there is a fixed cost K > 0 associated with a positive inventory order, i.e., the cost ofordering u ≥ 0 units is:

C(u) =

K + cu if u > 00 if u = 0.

The DP algorithm takes the formJN (xN ) = 0

Jk(xk) = minuk≥0

C(uk) +H(xk + uk) + Ewk [Jk+1(xk + uk − wk)] ,

where againH(y) = pE[(w − y)+] + hE[(y − w)+].

Consider againGk(y) = cy +H(y) + E[Jk+1(y − w)].

Then,

Jk(xk) = min

Gk(xk)︸ ︷︷ ︸Do not order

, minuk>0

K +Gk(xk + uk)︸ ︷︷ ︸Order uk

− cxk.By changing variable yk = xk + uk like in the zero fixed-cost case, we get

Jk(xk) = minGk(xk), min

yk>xkK +Gk(yk)

− cxk. (3.1.6)

When K > 0, Gk is not necessarily convex3, opening the possibility of very complicated optimalpolicies (see Figure 3.1.6). Under this kind of function Gk(y), for the cost function (3.1.6), theoptimal policy would be:

3Note that Gk involves K through Jk+1.

106

Page 107: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

Gk(s)

y’

IncreasingIncreasing

Figure 3.1.6: Potential form of the function Gk(y) when the fixed cost is nonzero.

1. If xk ∈ Zone I⇒ Gk(xk) > Gk(s), ∀xk < s ⇒ Order u∗k = S−xk, such that y∗k = S. Clearly,Gk(S) +K < Gk(xk),∀xk < s. So, if xk ∈ Zone I, u∗k = S − xk.

2. If xk ∈ Zone II ⇒

• If s < xk < S and u > 0 (i.e., yk > xk) ⇒ K + Gk(yk) > Gk(xk), and it is suboptimalto order.

• If S < xk < y′ and u > 0 (i.e., yk > xk) ⇒ K + Gk(yk) > Gk(xk), and it is alsosuboptimal to order.

So, if xk ∈ Zone II, u∗k = 0.

3. If xk ∈ Zone III ⇒ Order u∗k = S − xk, so that y∗k = S, and Gk(xk) > K + G(S), for ally′ < xk < s.

4. If xk ∈ Zone IV⇒ Do not order (i.e., u∗k = 0), since otherwiseK+Gk(yk) > Gk(xk), ∀yk > xk.

In summary, the optimal policy would be to order u∗k = (S−x) in zone I, u∗k = 0 in zones II and IV,and u∗k = (S − x) in zone III.

We will show below that even though the functions Gk may not be convex, they do have somestructure: they are K-convex.

Definition 3.1.1 A real function g(y) is K-convex if and only if it verifies the property:

K + g(z + y) ≥ g(y) + z

(g(y)− g(y − b)

b

),

for all z ≥ 0, b > 0, y ∈ R.

The definition is illustrated in Figure 3.1.7. Observation: Note that the situation described inFigure 3.1.6 is impossible under K-convexity: Since y0 is a local maximum in zone III, we musthave for b > 0 small enough,

Gk(y0)−Gk(y0 − b) ≥ 0 ⇒ Gk(y0)−Gk(y0 − b)b

≥ 0,

107

Page 108: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

Figure 3.1.7: Graph of a k-convex function.

and from the definition of K-convexity, we should have for S = y0 + z, and y = y0,

K +Gk(S) ≥ Gk(y0) + z︸︷︷︸≥0

Gk(y0)−Gk(y0 − b)b︸ ︷︷ ︸≥0

≥ Gk(y0),

which does not hold in our case.

Intuition: A K-convex function is a function that is “almost convex”, and for which K representsthe size of the “almost”. Scarf(1960) invented the notion of K-convex functions for the explicitpurpose of analyzing this inventory model.

For a function f to be K-convex, it must lie below the line segment connecting (x, f(x)) and(y,K + f(y)), for all real numbers x and y such that x ≤ y. Figure 3.1.8 below shows that aK-convex function, namely f1, need not be continuous. However, it can be shown that a K-convexfunction cannot have a positive jump at a discontinuity, as illustrated by f2. Moreover, a negativejump cannot be too large, as illustrated by f3.

Next, we compile some results on K-convex functions:

Lemma 3.1.1 Properties of K-convex functions:

(a) A real-valued convex function g is 0-convex and hence also K-convex for all K > 0.

(b) If g1(y) and g2(y) are K-convex and L-convex respectively, then αg1(y) +βg2(y) is (αK +βL)-convex, for all α, β > 0.

(c) If g(y) is K-convex and w is a random variable, then Ew[g(y − w)] is also K-convex, providedEw[|g(y − w)|] <∞, for all y.

(d) If g is a continuous K-convex function and g(y) → ∞ as |y| → ∞, then there exist scalars sand S, with s ≤ S, such that

(i) g(S) ≤ g(y),∀y (i.e., S is a global minimum).

108

Page 109: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

f1

f2

f3

yx

K

K

K

Figure 3.1.8: Function f1 is K-convex; f2 and f3 are not.

(ii) g(S) +K = g(s) < g(y),∀y < s.

(iii) g(y) is decreasing on (−∞, s).

(iv) g(y) ≤ g(z) +K,∀y, z, with s ≤ y ≤ z.

Using part (d) of Lemma 3.1.1, we will show that the optimal policy is of the form

µ∗k(xk) =

Sk − xk if xk < sk0 if xk ≥ sk,

where Sk is the value of y that minimizes Gk(y), and sk is the smallest value of y for whichGk(y) = K +Gk(Sk). This control policy is called the (s, S) multiperiod policy.

109

Page 110: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

Proof of the optimality of the (s, S) multiperiod policy

For stage N − 1,GN−1(y) = cy +H(y) + Ew[JN (y − w)︸ ︷︷ ︸

0

]

Therefore, GN−1(y) is clearly convex ⇒ It is K-convex. Then, we have

JN−1(x) = minGN−1(x),min

y>xK + cy +GN−1(y)

− cx,

where by defining SN−1 as the minimizer of GN−1(y) and sN−1 = miny : GN−1(y) = K +GN−1(SN−1) (see Figure 3.1.9), we have the optimal control

µ∗N−1(xN−1) =

SN−1 − xN−1 if xN−1 < sN−1

0 if xN−1 ≥ sN−1,

which leads to the optimal value function

JN−1(x) =

K +GN−1(SN−1)− cx for x < sN−1

GN−1(x)− cx for x ≥ sN−1.(3.1.7)

Observations:

• sN−1 6= SN−1, because K > 0

• ∂∂yGN−1(sN−1) ≤ 0

It turns out that the left derivative of JN−1(·) at sN−1 is greater than the right derivative⇒ JN−1(·)is not convex (again, see Figure 3.1.9). Here, as we saw for the zero fixed ordering cost, the minimum

GN-1(x)=c x + H(x)

H(x) = GN-1(x) c x

sN-1 SN-1-cxH*

cxJx N )(1

cxGx

cxJx NN

0

11 )()(

Slope= c

Figure 3.1.9: Structure of the cost-to-go function when fixed cost is nonzero.

110

Page 111: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

H∗ occurs to the right of SN−1 (recall that SN−1 is the unconstrained minimizer of GN−1(x)). Tosee this, note that

∂yGN−1(y) = c+

∂yH(y) ⇒

∂yGN−1(SN−1) = c+

∂yH(SN−1) = 0 ⇒

∂yH(SN−1) = −c < 0,

meaning that H is decreasing at SN−1, and so its minimum H∗ occurs to the right of SN−1.

Claim: JN−1(x) is K-convex.

Proof: We must verify for all z ≥ 0, b > 0, and y, that

K + JN−1(y + z) ≥ JN−1(y) + z

(JN−1(y)− JN−1(y − b)

b

)(3.1.8)

There are three cases according to the relative position of y, y + z, and sN−1.

Case 1: y ≥ sN−1 (i.e., y + z ≥ y ≥ sN−1).

• If y − b ≥ sN−1 ⇒ JN−1(x) = GN−1(x)︸ ︷︷ ︸convex⇒ K-convex

− cx︸︷︷︸linear

, so by part (b) of

Lemma 3.1.1, it is K-convex.

• If y − b < sN−1 ⇒ in view of equation (3.1.7) we can write (3.1.8) as

K + JN−1(y + z) = K +GN−1(y + z)− c(y + z)

≥ GN−1(y)− cy︸ ︷︷ ︸JN−1(y)

+z

JN−1(y)︷ ︸︸ ︷GN−1(y)− cy−

JN−1(y−b)=

GN−1(sN−1)︷ ︸︸ ︷K +GN−1(SN−1)−c(y−b)︷ ︸︸ ︷

GN−1(sN−1) + c(y − b)b

,

or equivalently,

K +GN−1(y + z) ≥ GN−1(y) + z

(GN−1(y)−GN−1(sN−1)

b

)(3.1.9)

There are three subcases:

(i) If y is such that GN−1(y) ≥ GN−1(sN−1), y 6= sN−1 ⇒ by the K-convexity ofGN−1, and taking y − sN−1 as the constant b > 0,

K +GN−1(y + z) ≥ GN−1(y) + z

GN−1(y)−GN−1(y−(y−sN−1)︷ ︸︸ ︷sN−1 )

y − sN−1

.

Thus, K-convexity hold.

111

Page 112: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

(ii) If y is such that GN−1(y) < GN−1(sN−1) ⇒ From part (d-i) in Lemma 3.1.1, for ascalar y + z,

K +GN−1(y + z) ≥ K +GN−1(SN−1)

= GN−1(sN−1) (by definition of sN−1)

> GN−1(y) (by hypothesis of this case).

≥ GN−1(y) + z

<0︷ ︸︸ ︷

GN−1(y)−GN−1(sN−1)b

,

and equation (3.1.9) holds.

(iii) If y = sN−1, then by K-convexity of GN−1, note that (3.1.9) becomes

K +GN−1(y + z) ≥ GN−1(y) + z

0︷ ︸︸ ︷

GN−1(y)−GN−1(sN−1)b

= GN−1(y).

From Lemma 3.1.1, part (d-iv), taking y = sN−1 there,

K +GN−1(sN−1 + z) ≥ GN−1(sN−1)

is verified, for all z ≥ 0.

Case 2: y ≤ y + z ≤ sN−1.

By equation (3.1.7), the function JN−1(y) is linear ⇒ It is K-convex.

Case 3: y < sN−1 < y + z.

Here, we can write (3.1.8) as

K + JN−1(y + z) = K +GN−1(y + z)− c(y + z)

≥ JN−1(y) + z

JN−1(y)− JN−1(

<y︷ ︸︸ ︷y − b)

b

= K +GN−1(sN−1)− cy + z

(K +GN−1(sN−1)− cy − (K +GN−1(sN−1)− c(y − b))

b

)= K +GN−1(sN−1)− cy − czb

b= K +GN−1(sN−1)− c(y + z)

Thus, the previous sequence of relations holds if and only if

K +GN−1(y + z)− c(y + z) ≥ K +GN−1(sN−1)− c(y + z),

or equivalently, if and only if GN−1(y + z) ≥ GN−1(sN−1), which holds from Lemma 3.1.1,part (d-i), since GN−1(·) is K-convex.

112

Page 113: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

This completes the proof of the claim.

We have thus proved thatK-convexity and continuity ofGN−1, together with the fact thatGN−1(y)→∞ as |y| → ∞, imply K-convexity of JN−1. In addition, JN−1(x) can be seen to be continuous in x.

Using the following facts:

• From the definition of Gk(y):

GN−2(y) = cy +H(y) + Ew[ JN−1(y − w)︸ ︷︷ ︸K−convex fromLemma 3.1.1-(c)

]

︸ ︷︷ ︸K-convex from Lemma 3.1.1-(b)

• GN−2(y) is continuous (because of boundedness of wN−2).

• GN−2(y)→∞ as |y| → ∞.

and repeating the preceding argument, we obtain that JN−2 is K-convex, and proceeding similarly,we prove K-convexity and continuity of the functions Gk for all k, as well as that Gk(y) → ∞ as|y| → ∞. At the same time, by using Lemma 3.1.1-(d), we prove optimality of the multiperiod (s, S)policy.

Finally, it is worth noting that it is not necessary that Gk(·) be K-convex for an (s, S) policy to beoptimal; it is just a sufficient condition.

3.1.4 Exercises

Exercise 3.1.1 Consider an inventory problem similar to the one discussed in class, with zero fixedcost. The only difference is that at the beginning of each period k the decision maker, in additionto knowing the current inventory level xk, receives an accurate forecast that the demand wk will beselected in accordance with one out of two possible probability distributions Pl, Ps (large demand,small demand). The a priori probability of a large demand forecast is known.

(a) Obtain the optimal ordering policy for the case of a single-period problem

(b) Extend the result to the N -period case

Exercise 3.1.2 Consider the inventory problem with nonzero fixed cost, but with the differencethat demand is deterministic and must be met at each time period (i.e., the shortage cost per unitis ∞). Show that it is optimal to order a positive amount at period k if and only if the stock xkis insufficient to meet the demand wk. Furthermore, when a positive amount is ordered, it shouldbring up stock to a level that will satisfy demand for an integral number of periods.

Exercise 3.1.3 Consider a problem of expanding over N time periods the capacity of a productionfacility. Let us denote by xk the production capacity at the beginning of period k, and by uk ≥ 0the addition to capacity during the kth period. Thus, capacity evolves according to

xk+1 = xk + uk, k = 0, 1, . . . , N − 1.

113

Page 114: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

The demand at the kth period is denoted wk and has a known probability distribution that does notdepend on either xk or uk. Also, successive demands are assumed to be independent and bounded.We denote:

Ck(uk): Expansion cost associated with adding capacity uk.

Pk(xk + uk − wk): Penalty associated with capacity xk + uk and demand wk.

S(xN ): Salvage value of final capacity xN .

Thus, the cost function has the form

Ew0,...,wN−1

[−S(xN ) +

N−1∑k=0

(Ck(uk) + Pk(xk + uk − wk))

].

(a) Derive the DP algorithm for this problem.

(b) Assume that S is a concave function with limx→∞ dS(x)/dx = 0, Pk are convex functions, andthe expansion cost Ck is of the form

Ck(u) =

K + cku if u > 0,0 if u = 0,

where K ≥ 0, ck > 0 for all k. Show that the optimal policy is of the (s, S) type assuming

cky + E [Pk(y − wk)]→∞ as |y| → ∞.

3.2 Single-Leg Revenue Management

Revenue Management (RM) is an OR subfield that deals with business related problems wherethere are finite, perishable capacities that must be depleted by a due time. Applications spanfrom airlines, hospitality industry, and car rental, to more recent practices in retailing and mediaadvertising. The problem sketched below is the basic RM problem that consists of rationing thecapacity of a single resource through imposing limits on the quantities to be sold at different prices,for a given set of prices.

Setting:

• Initial capacity C; remaining capacity denoted by x.

• There are n costumers classes labeled such that p1 > p2 > · · · > pn.

• Time indices run backwards in time.

• Class n arrives first, followed by classes n− 1, n− 2, · · · , 1.

• Demands are r.v: Dn, Dn−1, . . . , D1.

• At the beginning of stage j, demands Dj , Dj−1, . . . , D1.

• Within stage j the model assumes the following sequence of events:

114

Page 115: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

1. The realization of the demand Dj occurs, and we observe the value.4

2. We decide on a quantity u to accept: u ≤ minDj , x. The optimal control then is afunction of both current demand and remaining capacity: u∗(Dj , x). This is done foranalytical convenience. In practice, the control decision has to be made before observ-ing Dj . We will see that the calculation of the optimal control does not use informationabout Dj , so this assumption vanishes ex-post.

3. Revenue pju is collected, and we proceed to stage j − 1 (since indices run backwards).

For the single-leg RM problem, the DP formulation becomes

Vj(x) = EDj

[max

0≤u≤minDj ,xpju+ Vj−1(x− u)

], (3.2.1)

with boundary conditions: V0(x) = 0, x = 0, 1, . . . , C. Note that in this formulation we haveinverted the usual order between max· and E[·]. We prove below that this is w.l.o.g. for the kindof setting that we are dealing with here.

3.2.1 System with observable disturbances

Departure from standard DP: We can base our control ut on perfect knowledge of the random noiseof the current period, wt. For this section, assume as in the basic DP setting that indices runforward.

Claim: Assume a discrete finite horizon t = 1, 2, . . . , T . The formulation

Vt(x) = maxu(x,wt)∈Ut(x,wt)

Ewt [gt(x, u(x,wt), wt) + Vt+1(ft(x, u(x,wt), wt))]

is equivalent to

Vt(x) = Ewj

[max

u(x)∈Ut(x)gt(x, u, wt) + Vt+1(ft(x, u, wt))

], (3.2.2)

which is more convenient to handle.

Proof: State space augmentation argument:

1. Reindex disturbances by defining wt = wt+1, t = 1, . . . , T − 1.

2. Augment state to include the new disturbance, and define the system equation:(xt+1

yt+1

)=

(ft(xt, ut, wt)

wt

).

3. Starting from (x0, y0) = (x,w1), the standard DP recursion is:

Vt(x, y) = maxu(x)∈Ut(x)

E ewt [gt(x, u, y) + Vt+1(ft(x, u, y), wt)]

4Note that this is a departure from the basic DP formulation where typically we make a decision in period k before

the random noise wk is realized.

115

Page 116: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

Prof. R. Caldentey CHAPTER 3. APPLICATIONS

4. Define Gt(x) = E ewt [Vt(x, wt)]5. Note that:

Vt(x, y) = maxu(x)∈Ut(x)

E ewt [gt(x, u, y) + Vt+1(ft(x, u, y), wt)]

= maxu(x)∈Ut(x)

gt(x, u, y) + E ewt [Vt+1(ft(x, u, y), wt)] (because gt(·) does not depend on wt)

= maxu(x)∈Ut(x)

gt(x, u, y) +Gt+1(ft(x, u, y)) (by definition of Gt+1(·))

6. Replace y by wt and take expectation with respect to wt on both sides above to obtain:

Ewt [Vt(x,wt)]︸ ︷︷ ︸Gt(x)

= Ewt

[max

u(x)∈Ut(x)gt(x, u, wt) +Gt+1(ft(x, u, wt))

]

Observe that the LHS is indeed Gt(x) modulus a small issue with the name of the randomvariable, which is justified by noting that Gt(x) M= E ewt [Vt(x, wt)] = Ew[Vt(x,w)]. Finally,there is another minor “name issue”, because the final DP is expressed in terms of the valuefunction G. It remains to replace G by V to recover formulation (3.2.2).

In words, what we are doing is anticipating and solving today the problem that we will face tomor-row, given the disturbances of today. The implicit sequence of actions of this alternative formulationis the following:

1. We observe current state x.

2. The value of the disturbance wt is realized.

3. We make the optimal decision u∗(x,wt).

4. We collect the current period reward gt(x, u(x,wt), wt).

5. We move to the next state t+ 1.

3.2.2 Structure of the value function

Now, we turn to our original RM problem. First, we define the marginal value of capacity,

∆Vj(x) = Vj(x)− Vj(x− 1), (3.2.3)

and proceed to characterize the structure of the value function.

Proposition 3.2.1 The marginal value of capacity ∆Vj(x) satisfies:

(i) For a fixed j, ∆Vj(x+ 1) ≤ ∆Vj(x), x = 0, . . . , C.

(ii) For a fixed x, ∆Vj+1(x) ≥ ∆Vj(x), j = 1, . . . , n.

The proposition states two intuitive economic properties:

116

Page 117: Dynamic Programming with Applicationspeople.stern.nyu.edu/rcaldent/courses/DP-ClassNotes-HEC.pdf · Prof. R. Caldentey Preface These lecture notes are based on the material that my

CHAPTER 3. APPLICATIONS Prof. R. Caldentey

(i) For a given period, the marginal value of capacity is decreasing in the number of units left.

(ii) The marginal value of capacity x at stage j is smaller than its value at stage j+ 1 (recall thatindices are running backwards). Intuitively, this is because there are less periods remaining,and hence less opportunities to sell the xth unit.

Before going over the proof of this proposition, we need the following auxiliary lemma:

Lemma 3.2.1 Suppose g : Z_+ → R is concave. Let f : Z_+ → R be defined by

    f(x) = max_{a = 0, 1, ..., m} { a p + g(x − a) }

for a given p ≥ 0 and a nonnegative integer m ≤ x. Then f(x) is concave in x as well.

Proof: We proceed in three steps:

1. Change of variable: Define y = x − a, so that a = x − y and the inner term can be written as (x − y)p + g(y), where the range of the argument is 0 ≤ x − y ≤ m, or x − m ≤ y ≤ x. The new function is

    f(x) = max_{x−m ≤ y ≤ x} { (x − y)p + g(y) }.

Thus, f(x) = f̄(x) + px, where

    f̄(x) = max_{x−m ≤ y ≤ x} { −yp + g(y) }.

Note that since x ≥ m, we have y ≥ 0.

2. Closed form for f̄(x): Let h(y) = −yp + g(y) for y ≥ 0, and let y* be the unconstrained maximizer of h(y), i.e., y* = argmax_{y ≥ 0} h(y). Because of the shape of h(y), this maximizer is always well defined. Moreover, since g(y) is concave, h(y) is also concave, nondecreasing for y ≤ y*, and nonincreasing for y > y*. Therefore, for given m and p,

    f̄(x) = −xp + g(x)                 if x ≤ y*,
    f̄(x) = −y*p + g(y*)               if y* ≤ x ≤ y* + m,
    f̄(x) = −(x − m)p + g(x − m)       if x ≥ y* + m.

The first part holds because h(y) is nondecreasing for 0 ≤ y ≤ y*. The second part holds because f̄(x) = −y*p + g(y*) = h(y*) whenever y* lies in the range {x − m, . . . , x}, or equivalently, for y* ≤ x ≤ y* + m. Finally, since h(y) is nonincreasing for y > y*, the maximum is attained at the border of the range, i.e., at x − m.

3. Concavity of f̄(x):


• Take x < y*. We have

    f̄(x + 1) − f̄(x) = [−(x + 1)p + g(x + 1)] − [−xp + g(x)]
                     = −p + g(x + 1) − g(x)
                     ≤ −p + g(x) − g(x − 1)        (because g is concave)
                     = f̄(x) − f̄(x − 1).

So, for x < y*, f̄(x) is concave.

• Take y* ≤ x < y* + m. Here, f̄(x + 1) − f̄(x) = 0, and so f̄(x) is trivially concave.

• Take x ≥ y* + m. We have

    f̄(x + 1) − f̄(x) = [−(x + 1 − m)p + g(x + 1 − m)] − [−(x − m)p + g(x − m)]
                     = −p + g(x + 1 − m) − g(x − m)
                     ≤ −p + g(x − m) − g(x − m − 1)    (because g is concave)
                     = f̄(x) − f̄(x − 1).

So, for x ≥ y* + m, f̄(x) is concave.

Therefore f̄(x) is concave for all x ≥ 0, and since f(x) = f̄(x) + px, f(x) is concave in x ≥ 0 as well.
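As a quick sanity check of the lemma (not part of the notes), the following sketch evaluates f(x) = max_{0 ≤ a ≤ m}{ap + g(x − a)} for one illustrative concave g and verifies numerically that its first differences are nonincreasing.

```python
import math

def g(y: int) -> float:
    return math.sqrt(y)            # a concave function on the nonnegative integers

p, m = 0.3, 5                      # illustrative price and cap on a

def f(x: int) -> float:
    # min(m, x) simply handles the small-x boundary; the lemma assumes m <= x
    return max(a * p + g(x - a) for a in range(0, min(m, x) + 1))

values = [f(x) for x in range(0, 60)]
diffs = [values[x + 1] - values[x] for x in range(len(values) - 1)]
assert all(d2 <= d1 + 1e-12 for d1, d2 in zip(diffs, diffs[1:])), "f is not concave?"
print("first differences are nonincreasing, as the lemma predicts")
```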

Proof of Proposition 3.2.1

Part (i): ∆Vj(x+ 1) ≤ ∆Vj(x), ∀x.

By induction:

• In terminal stage: V0(x) = 0, ∀x, so it holds.

• IH: Assume that Vj−1(x) is concave in x.

• Consider V_j(x). Note that

    V_j(x) = E_{D_j}[ max_{0 ≤ u ≤ min{D_j, x}} { p_j u + V_{j−1}(x − u) } ].

For any realization of D_j, the function

    H(x, D_j) = max_{0 ≤ u ≤ min{D_j, x}} { p_j u + V_{j−1}(x − u) }

has exactly the same structure as the function in the Lemma above, with m = min{D_j, x}, and therefore it is concave in x. Since E_{D_j}[H(x, D_j)] is a weighted average of concave functions, it is also concave.

Going back to the original formulation for the single-leg RM problem in (3.2.1), we can express it as follows:

    V_j(x) = E_{D_j}[ max_{0 ≤ u ≤ min{D_j, x}} { p_j u + V_{j−1}(x − u) } ]
           = V_{j−1}(x) + E_{D_j}[ max_{0 ≤ u ≤ min{D_j, x}} Σ_{z=1}^{u} ( p_j − ∆V_{j−1}(x + 1 − z) ) ],        (3.2.4)


where we are using (3.2.3) to write V_{j−1}(x − u) as a sum of increments:

    V_{j−1}(x − u) = V_{j−1}(x) − Σ_{z=1}^{u} ∆V_{j−1}(x + 1 − z)
                   = V_{j−1}(x) − [ ∆V_{j−1}(x) + ∆V_{j−1}(x − 1) + · · · + ∆V_{j−1}(x + 2 − u) + ∆V_{j−1}(x + 1 − u) ]
                   = V_{j−1}(x) − [ V_{j−1}(x) − V_{j−1}(x − 1) + V_{j−1}(x − 1) − V_{j−1}(x − 2) + · · ·
                                    + V_{j−1}(x + 2 − u) − V_{j−1}(x + 1 − u) + V_{j−1}(x + 1 − u) − V_{j−1}(x − u) ].

Note that all terms on the RHS except for the last one cancel out. The inner sum in (3.2.4) is defined to be zero when u = 0.

Part (ii): ∆V_{j+1}(x) ≥ ∆V_j(x), ∀j.

From (3.2.4) we can write

    V_{j+1}(x) = V_j(x) + E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x + 1 − z) ) ].

Similarly, we can write

    V_{j+1}(x − 1) = V_j(x − 1) + E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x−1}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x − z) ) ].

Subtracting both equalities, we get

    ∆V_{j+1}(x) = ∆V_j(x) + E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x + 1 − z) ) ]
                          − E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x−1}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x − z) ) ]
                ≥ ∆V_j(x) + E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x − z) ) ]
                          − E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x−1}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x − z) ) ]
                ≥ ∆V_j(x) + E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x−1}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x − z) ) ]
                          − E_{D_{j+1}}[ max_{0 ≤ u ≤ min{D_{j+1}, x−1}} Σ_{z=1}^{u} ( p_{j+1} − ∆V_j(x − z) ) ]
                = ∆V_j(x),

where the first inequality holds by part (i) of Proposition 3.2.1 (since ∆V_j(x + 1 − z) ≤ ∆V_j(x − z)), and the second one holds because in the last line the domain of u in the first maximization is smaller (min{D_{j+1}, x − 1} instead of min{D_{j+1}, x}), and hence it is a more constrained optimization problem.


3.2.3 Structure of the optimal policy

The good feature of formulation (3.2.4) is that it is very insightful about the structure of the optimal policy. In particular, from part (i) of Proposition 3.2.1, since ∆V_j(x) is decreasing in x, p_j − ∆V_{j−1}(x + 1 − z) is decreasing in z. So, it is optimal to keep adding terms to the sum (i.e., to increase u) as long as

    p_j − ∆V_{j−1}(x + 1 − u) ≥ 0,

or until the upper bound min{D_j, x} is reached, whichever comes first. In words, we compare the instantaneous revenue p_j with the marginal value of capacity (i.e., the value of a unit if we keep it for the next period). If the former dominates, then we accept the price p_j for the unit.

The resulting optimal controls can be expressed in terms of optimal protection levels y*_j for classes j, j − 1, . . . , 1 (i.e., class j and higher in the revenue order). Specifically, we define

    y*_j = max{ x : 0 ≤ x ≤ C, p_{j+1} < ∆V_j(x) },    j = 1, 2, . . . , n − 1,        (3.2.5)

and we assume y*_0 = 0 and y*_n = C. Figure 3.2.1 illustrates the determination of y*_j. For x ≤ y*_j, p_{j+1} ≤ ∆V_j(x), and therefore it is worth waiting for the demand to come rather than selling now.

Figure 3.2.1: Calculation of the optimal protection level y*_j. The decreasing curve ∆V_j(x) crosses the level p_{j+1} at x = y*_j; class j + 1 is rejected for x ≤ y*_j and accepted for x > y*_j.

The optimal control at stage j + 1 is then

    µ*_{j+1}(x, D_{j+1}) = min{ (x − y*_j)^+, D_{j+1} }.

The key observation here is that the computation of y∗j does not depend on Dj+1, because theknowledge of Dj+1 does not affect the future value of capacity. Therefore, going back to theassumption we made at the beginning, assuming that we know demand Dj+1 to compute y∗j doesnot really matter, because we do not make real use of that information.

Part (ii) in Proposition 3.2.1 implies the nested protection structure

y∗1 ≤ y∗2 ≤ · · · ≤ y∗n−1 ≤ y∗n = C.

This is illustrated in Figure 3.2.2. The reason is that the curve ∆V_{j−1}(x) lies below the curve ∆V_j(x) pointwise and, by definition, p_j > p_{j+1}; hence y*_{j−1} ≤ y*_j.


Figure 3.2.2: Nesting feature of the optimal protection levels. The curves ∆V_{j−1}(x) and ∆V_j(x) are plotted against x, with ∆V_{j−1} below ∆V_j; together with the levels p_j and p_{j+1}, they determine y*_{j−1} ≤ y*_j.

3.2.4 Computational complexity

Using the optimal control, the single-leg RM problem (3.2.1) can be reformulated as

    V_j(x) = E_{D_j}[ p_j min{(x − y*_{j−1})^+, D_j} + V_{j−1}( x − min{(x − y*_{j−1})^+, D_j} ) ].        (3.2.6)

This procedure is repeated starting from j = 1 and proceeding to j = n (i.e., backward in time).

• For discrete demand distributions, computing the expectation in (3.2.6) for each state x requires evaluating at most O(C) terms, since min{(x − y*_{j−1})^+, D_j} ≤ C. Since there are C states (capacity levels), the complexity at each stage is O(C^2).

• The critical values y*_j can then be identified from (3.2.5) in O(log C) time by binary search, as ∆V_j(x) is nonincreasing. In fact, since we know y*_j ≥ y*_{j−1}, the binary search can be further constrained to values in the interval [y*_{j−1}, C]. Therefore, computing y*_j does not add to the complexity at stage j.

• These steps must be repeated for each of the n − 1 stages, giving a total complexity of O(nC^2). A short implementation sketch follows below.
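The following sketch (mine, not from the notes; the data are placeholders) computes V_j(x) and the protection levels y*_j by backward induction for an arbitrary discrete demand distribution, following the recursion and definition (3.2.5) above. With the data of Exercise 3.2.1 it would produce the required y*_1, . . . , y*_9 and V_10(100).

```python
import numpy as np

def single_leg_rm(prices, demand_pmfs, C):
    """Backward induction for the single-leg RM problem.

    prices[j-1] is p_j (classes 1..n, stage j sells class j);
    demand_pmfs[j-1][d] is P(D_j = d); C is the total capacity.
    Returns the list of value functions [V_0, ..., V_n] and levels [y*_1, ..., y*_{n-1}].
    """
    n = len(prices)
    V = np.zeros(C + 1)                        # V_0(x) = 0
    values, y_star = [V.copy()], []
    for j in range(1, n + 1):
        p, pmf = prices[j - 1], demand_pmfs[j - 1]
        V_new = np.zeros(C + 1)
        for x in range(C + 1):
            exp_val = 0.0
            for d, prob in enumerate(pmf):
                u_max = min(d, x)
                exp_val += prob * max(p * u + V[x - u] for u in range(u_max + 1))
            V_new[x] = exp_val
        if j < n:                               # y*_j = max{x : p_{j+1} < dV_j(x)}, 0 if empty
            dV = np.diff(np.concatenate(([V_new[0]], V_new)))   # dV[x] = V_j(x) - V_j(x-1)
            mask = prices[j] < dV               # prices[j] is p_{j+1}
            y_star.append(int(np.nonzero(mask)[0].max()) if mask.any() else 0)
        V = V_new
        values.append(V.copy())
    return values, y_star
```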

3.2.5 Airlines: Practical implementation

Airlines that use capacity control as their RM strategy (as opposed to dynamic pricing) post protection levels y*_j in their own reservation systems, and accept requests for product j + 1 until y*_j is reached or stage j + 1 ends (whichever comes first). Figure 3.2.3 is a snapshot from Expedia.com showing this practice at American Airlines.

3.2.6 Exercises

Exercise 3.2.1 Single-leg Revenue Management problem: For a single leg RM problemassume that:

• There are n = 10 classes.


Figure 3.2.3: Optimal protection levels at American Airlines.

• Demand D_j is obtained by discretizing a truncated normal with mean µ = 10 and standard deviation σ = 2 on the support [0, 20]. Specifically, take

    P(D_j = k) = [ Φ((k + 0.5 − 10)/2) − Φ((k − 0.5 − 10)/2) ] / [ Φ((20.5 − 10)/2) − Φ((−0.5 − 10)/2) ],    k = 0, . . . , 20.

Note that this discretization and re-scaling verifies Σ_{k=0}^{20} P(D_j = k) = 1.

• Total capacity available is C = 100.

• Prices are p1 = 500, p2 = 480, p3 = 465, p4 = 420, p5 = 400, p6 = 350, p7 = 320, p8 = 270, p9 =250, and p10 = 200.

Write a MATLAB or C code to compute optimal protection levels y∗1, . . . , y∗9; and find the total

expected revenue V10(100). Note that you can take advantage of the structure of the optimal policyto simplify its computation. Submit your results, and a copy of the code.

Exercise 3.2.2 Heuristic for the single-leg RM problem: In the airline industry, the single-leg RM problem is typically solved using a heuristic, the so-called EMSR-b (expected marginal seat revenue, version b). There is not much reason for this other than the tradition of its usage, and the fact that it provides consistently good results. Here is a description:

Consider stage j + 1, in which we want to determine the protection level y_j. Define the aggregated future demand for classes j, j − 1, . . . , 1 by S_j = Σ_{k=1}^{j} D_k, and let the weighted-average revenue


from classes 1, . . . , j, denoted p̄_j, be defined by

    p̄_j = Σ_{k=1}^{j} p_k E[D_k]  /  Σ_{k=1}^{j} E[D_k].

Then the EMSR-b protection level for class j and higher, y_j, is chosen so that

    P(S_j > y_j) = p_{j+1} / p̄_j.

It is common when using EMSR-b to assume that demand for each class j is independent and normally distributed with mean µ_j and variance σ_j^2, in which case

    y_j = µ + z_α σ,

where µ = Σ_{k=1}^{j} µ_k is the mean and σ^2 = Σ_{k=1}^{j} σ_k^2 is the variance of the aggregated demand to come at stage j + 1, and z_α = Φ^{−1}(1 − p_{j+1}/p̄_j).

Apply this heuristic to compute protection levels y1, . . . , y9 using the data of the previous exerciseand assuming that demand is normal (no truncation, no discretization), and compare the outcomewith the optimal protection levels computed before.
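A compact sketch of the EMSR-b computation (mine, not part of the exercise solution); it uses Python's statistics.NormalDist for the normal quantile and assumes the per-class means and standard deviations are given.

```python
from statistics import NormalDist

def emsr_b(prices, means, stds):
    """EMSR-b protection levels y_1, ..., y_{n-1}.

    prices, means, stds are listed for classes 1..n (class 1 has the highest fare).
    """
    n = len(prices)
    y = []
    for j in range(1, n):                      # protection level for classes 1..j
        mu = sum(means[:j])
        sigma = sum(s ** 2 for s in stds[:j]) ** 0.5
        p_bar = sum(p * m for p, m in zip(prices[:j], means[:j])) / sum(means[:j])
        z = NormalDist().inv_cdf(1 - prices[j] / p_bar)    # prices[j] is p_{j+1}
        y.append(mu + z * sigma)
    return y

# Illustrative call with the data of Exercise 3.2.1 (all classes N(10, 2^2)):
print(emsr_b([500, 480, 465, 420, 400, 350, 320, 270, 250, 200], [10] * 10, [2] * 10))
```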

3.3 Optimal Stopping and Scheduling Problems

In this section, we focus on two other types of problems with perfect state information: optimal stopping problems (mainly), and we discuss a few ideas on scheduling problems.

3.3.1 Optimal stopping problems

We assume the following:

• At each state, there is a control available that stops the system.

• At each stage, you observe the current state and decide either to stop or continue.

• Each policy consists of a partition of the set of states xk into two regions: the stop region andthe continue region. Figure 3.3.1 illustrates this.

• Domain of states remains the same throughout the process.

Application: Asset selling problem

• Consider a person owning an asset for which she is offered an amount of money from periodto period, across N periods.

• Offers are random and independent, denoted w_0, w_1, . . . , w_{N−1}, with w_i ∈ [0, w̄].


Figure 3.3.1: Each policy consists of a partition of the state space into the "continue" region and the "stop" region, with the initial and terminal states marked.

• If the seller accepts an offer, she can invest the money at a fixed rate r > 0.

Otherwise, she waits until next period to consider the next offer.

• Assume that the last offer wN−1 must be accepted if all prior offers are rejected.

• Objective: Find a policy for maximizing reward at the Nth period.

Let’s solve this problem.

• Control:

    µ_k(x_k) ∈ { u^1 : Sell,  u^2 : Wait }.

• State: x_k ∈ R_+ ∪ {T}, where T denotes the state in which the asset has already been sold.

• System equation:

    x_{k+1} = T      if x_k = T, or x_k ≠ T and µ_k = u^1,
    x_{k+1} = w_k    otherwise.

• Reward function:

    E_{w_0, ..., w_{N−1}}[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, µ_k, w_k) ],

where

    g_N(x_N) = x_N if x_N ≠ T, and g_N(x_N) = 0 if x_N = T    (i.e., the seller must accept the offer by time N),

and for k = 0, 1, . . . , N − 1,

    g_k(x_k, µ_k, w_k) = (1 + r)^{N−k} x_k   if x_k ≠ T and µ_k = u^1,
    g_k(x_k, µ_k, w_k) = 0                   otherwise.

Note that here a critical issue is how to account for the reward, being careful with double counting. In this formulation, once the seller accepts an offer, she collects the compounded interest for the rest of the horizon all at once, and from then onwards she receives zero reward.


• DP formulation:

    J_N(x_N) = x_N if x_N ≠ T,  and J_N(x_N) = 0 if x_N = T.        (3.3.1)

For k = 0, 1, . . . , N − 1,

    J_k(x_k) = max{ (1 + r)^{N−k} x_k  [Sell],  E[J_{k+1}(w_k)]  [Wait] }   if x_k ≠ T,
    J_k(x_k) = 0                                                            if x_k = T.        (3.3.2)

• Optimal policy: Accept the offer only when

    x_k > α_k ≜ E[J_{k+1}(w_k)] / (1 + r)^{N−k}.

Note that α_k represents the net present value of the expected reward. This comparison is a fair one, because it is conducted between the instantaneous payoff x_k and the expected reward discounted back to the present time k. Thus, the optimal policy is of the threshold type, described by the scalar sequence {α_k : k = 0, . . . , N − 1}. Figure 3.3.2 represents this threshold structure.

Figure 3.3.2: Optimal policy of accepting/rejecting offers in the asset selling problem: the thresholds α_1, α_2, . . . , α_{N−1} are plotted against k, with offers above the threshold accepted and offers below it rejected.

Proposition 3.3.1 Assume that offers wk are i.i.d., with w ∼ F (·). Then, αk ≥ αk+1, k =1, . . . , N − 1, with αN = 0.

Proof: For now, let's disregard the terminal condition, and define

    V_k(x_k) ≜ J_k(x_k) / (1 + r)^{N−k},    x_k ≠ T.

We can rewrite equations (3.3.1) and (3.3.2) as follows:

    V_N(x_N) = x_N,
    V_k(x_k) = max{ x_k, (1 + r)^{−1} E_w[V_{k+1}(w)] },    k = 0, . . . , N − 1.        (3.3.3)


Hence, defining α_N = 0 (since we have to accept no matter what in the last period), we get

    α_k = E_w[V_{k+1}(w)] / (1 + r),    k = 0, 1, . . . , N − 1.

Next, we compare the value function at periods N − 1 and N. For k = N and k = N − 1, we have

    V_N(x) = x,
    V_{N−1}(x) = max{ x, (1 + r)^{−1} E_w[V_N(w)] } = max{ x, α_{N−1} } ≥ V_N(x).

Given that we have a stationary system, from the monotonicity of DP (see Homework #2), we know that

    V_1(x) ≥ V_2(x) ≥ · · · ≥ V_N(x),    ∀x.

Since α_k = E_w[V_{k+1}(w)] / (1 + r) and α_{k+1} = E_w[V_{k+2}(w)] / (1 + r), we have α_k ≥ α_{k+1}.

Compute the limiting α

Next, we explore the question: what if the selling horizon is very long? Note that equation (3.3.3) can be written as V_k(x_k) = max{x_k, α_k}, where

    α_k = (1 + r)^{−1} E_w[V_{k+1}(w)]
        = (1/(1 + r)) ∫_0^{α_{k+1}} α_{k+1} dF(w) + (1/(1 + r)) ∫_{α_{k+1}}^∞ w dF(w)
        = (α_{k+1}/(1 + r)) F(α_{k+1}) + (1/(1 + r)) ∫_{α_{k+1}}^∞ w dF(w).        (3.3.4)

We will see that the sequence αk converges as k → −∞ (i.e., as the selling horizon becomes verylong).

Observations:

1.  0 ≤ F(α)/(1 + r) ≤ 1/(1 + r).

2. For k = 0, 1, . . . , N − 1,

    0 ≤ (1/(1 + r)) ∫_{α_{k+1}}^∞ w dF(w) ≤ (1/(1 + r)) ∫_0^∞ w dF(w) = E[w]/(1 + r).

3. From equation (3.3.4) and Proposition 3.3.1,

    α_k ≤ α_{k+1}/(1 + r) + E[w]/(1 + r) ≤ α_k/(1 + r) + E[w]/(1 + r)   ⇒   α_k < E[w]/r.

Using α_k ≥ α_{k+1} and knowing that the sequence is bounded from above, we conclude that as k → −∞, α_k → α, where α satisfies

    (1 + r)α = F(α)α + ∫_α^∞ w dF(w).


When N is "big", an approximate method is to use the constant policy: accept the offer x_k if and only if x_k > α. More formally, if we define G(α) as

    G(α) ≜ (1/(1 + r)) ( F(α)α + ∫_α^∞ w dF(w) ),

then, by the Contraction Mapping Theorem (due to Banach, 1922), G(α) is a contraction mapping, and hence the iterative procedure α_{n+1} = G(α_n) finds the unique fixed point in [0, E[w]/r], starting from any arbitrary α_0 ∈ [0, E[w]/r].

Recall: G is a contraction mapping if for all x, y ∈ R^n, ||G(x) − G(y)|| < K||x − y||, for a constant 0 ≤ K < 1, with K independent of x and y.
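The fixed-point iteration is easy to carry out once F is specified. The sketch below (mine; it assumes offers are Uniform[0, w̄] so that F and the partial expectation are in closed form — adapting it to another F is a one-line change) iterates α_{n+1} = G(α_n):

```python
def limiting_threshold(w_bar: float, r: float, tol: float = 1e-10) -> float:
    """Fixed point of G(a) = (F(a)*a + E[w; w > a]) / (1 + r) for w ~ Uniform[0, w_bar]."""
    F = lambda a: min(max(a / w_bar, 0.0), 1.0)
    tail_mean = lambda a: (w_bar ** 2 - min(a, w_bar) ** 2) / (2 * w_bar)   # integral of w dF(w) over (a, inf)
    G = lambda a: (F(a) * a + tail_mean(a)) / (1 + r)

    alpha = 0.0                               # any starting point in [0, E[w]/r] works
    while abs(G(alpha) - alpha) > tol:
        alpha = G(alpha)
    return alpha

print(limiting_threshold(2000.0, 0.05))       # e.g., offers Uniform[0, 2000], r = 0.05
```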

Application: Purchasing with a deadline

• Assume that a certain quantity of raw material is needed at a certain time.

• The price of the raw material fluctuates.

Decision: Purchase now or wait?

Objective: Minimize the expected purchase price.

• Assume that successive prices wk are i.i.d. and have c.d.f. F (·).

• Purchase must be made within N time periods.

• Controls:

    µ_k(x_k) ∈ { u^1 : Purchase,  u^2 : Wait }.

• State: x_k ∈ R_+ ∪ {T}, where T denotes the state in which the purchase has already been made.

• System equation:

    x_{k+1} = T      if x_k = T, or x_k ≠ T and µ_k = u^1,
    x_{k+1} = w_k    otherwise.

• DP formulation:

    J_N(x_N) = x_N if x_N ≠ T,  and J_N(x_N) = 0 otherwise.

For k = 0, . . . , N − 1,

    J_k(x_k) = min{ x_k  [Purchase],  E[J_{k+1}(w_k)]  [Wait] }   if x_k ≠ T,
    J_k(x_k) = 0                                                   if x_k = T.

• Optimal policy: Purchase if and only if

    x_k < α_k ≜ E_w[J_{k+1}(w)],


where

    α_k ≜ E_w[J_{k+1}(w)] = E_w[min{w, α_{k+1}}] = ∫_0^{α_{k+1}} w dF(w) + ∫_{α_{k+1}}^∞ α_{k+1} dF(w),

with terminal condition

    α_{N−1} = ∫_0^∞ w dF(w) = E[w].

Analogously to the asset selling problem, it must hold that

α1 ≤ α2 ≤ · · · ≤ αN−1 = E[w].

Intuitively, we are less stringent and willing to accept a higher price as time goes by.
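The thresholds α_k follow from a one-dimensional backward recursion. Here is a small sketch (mine; again assuming Uniform[0, w̄] prices so the integrals have closed forms):

```python
def purchase_thresholds(N: int, w_bar: float) -> list[float]:
    """alpha_k = E[min(w, alpha_{k+1})] for w ~ Uniform[0, w_bar], with alpha_{N-1} = E[w]."""
    def e_min(a: float) -> float:
        a = min(a, w_bar)
        # integral_0^a w dF + a * (1 - F(a)) for the uniform density 1/w_bar
        return a ** 2 / (2 * w_bar) + a * (1 - a / w_bar)

    alphas = [w_bar / 2]                     # alpha_{N-1} = E[w]
    for _ in range(N - 2, -1, -1):           # fill alpha_{N-2}, ..., alpha_0
        alphas.append(e_min(alphas[-1]))
    return list(reversed(alphas))            # [alpha_0, ..., alpha_{N-1}]

print(purchase_thresholds(10, 100.0))        # thresholds increase toward E[w] = 50
```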

The case of correlated prices

Suppose that prices evolve according to the system equation

    x_{k+1} = λ x_k + ξ_k,    where 0 < λ < 1,

and where ξ_0, ξ_1, . . . , ξ_{N−1} are i.i.d. with E[ξ_k] = ξ̄ > 0.

DP algorithm:

    J_N(x_N) = x_N,
    J_k(x_k) = min{ x_k, E[J_{k+1}(λ x_k + ξ_k)] },    k = 0, . . . , N − 1.

In particular, for k = N − 1, we have

    J_{N−1}(x_{N−1}) = min{ x_{N−1}, E[J_N(λ x_{N−1} + ξ_{N−1})] } = min{ x_{N−1}, λ x_{N−1} + ξ̄ }.

Optimal policy at time N − 1: purchase only when x_{N−1} < α_{N−1}, where α_{N−1} comes from

    x_{N−1} < λ x_{N−1} + ξ̄   ⇐⇒   x_{N−1} < α_{N−1} ≜ ξ̄ / (1 − λ).

In addition, we can see that

    J_{N−1}(x) = min{ x, λ x + ξ̄ } ≤ x = J_N(x).

Using the stationarity of the system and the monotonicity property of DP, we have that for any x,Jk(x) ≤ Jk+1(x), k = 0, . . . , N − 1.

Moreover, J_{N−1}(x) is concave and increasing in x (see Figure 3.3.3). By a backward induction argument, we can prove for k = 0, 1, . . . , N − 2 that J_k(x) is concave and increasing in x (see Figure 3.3.4). These facts imply that the optimal policy for every period k is of the form: purchase if and only if x_k < α_k, where the scalar α_k is the unique positive solution of the equation

    x = E[J_{k+1}(λ x + ξ_k)].

Notice that the relation J_k(x) ≤ J_{k+1}(x) for all x and k implies that

    α_k ≤ α_{k+1},    k = 0, . . . , N − 2,

and again (as one would expect) the threshold price to purchase increases as the deadline gets closer. In other words, one is more willing to accept a higher price as one approaches the end of the horizon. This is illustrated in Figure 3.3.5.
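Because J_{k+1} no longer has a closed form, the thresholds can be computed on a grid. The sketch below (mine; it assumes ξ takes finitely many values and approximates J_{k+1} by linear interpolation on a price grid) recovers α_k as the point where x crosses E[J_{k+1}(λx + ξ)]:

```python
import numpy as np

def correlated_price_thresholds(N, lam, xi_vals, xi_probs, x_max=200.0, n_grid=2001):
    """Grid-based DP for J_k(x) = min(x, E[J_{k+1}(lam*x + xi)]) and the thresholds alpha_k."""
    grid = np.linspace(0.0, x_max, n_grid)
    J = grid.copy()                                     # J_N(x) = x
    alphas = []
    for _ in range(N - 1, -1, -1):                      # k = N-1, ..., 0
        cont = np.zeros_like(grid)                      # E[J_{k+1}(lam*x + xi)]
        for xi, p in zip(xi_vals, xi_probs):
            cont += p * np.interp(lam * grid + xi, grid, J)   # interp clamps at the grid ends
        cross = np.argmax(grid >= cont)                 # first grid point with x >= cont(x)
        alphas.append(grid[cross])                      # approximate alpha_k
        J = np.minimum(grid, cont)                      # J_k
    return list(reversed(alphas))                       # [alpha_0, ..., alpha_{N-1}]

# illustrative data: lambda = 0.7, xi in {10, 30} with equal probability
print(correlated_price_thresholds(5, 0.7, [10.0, 30.0], [0.5, 0.5]))
```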


Figure 3.3.3: Structure of the value function J_{N−1}(x) when prices are correlated: J_{N−1}(x) = min{x, λx + ξ̄} is concave and increasing, with the threshold α_{N−1} at the crossing point.

Figure 3.3.4: Structure of the value function J_k(x_k) when prices are correlated: J_k(x) = min{x, E[J_{k+1}(λx + ξ_k)]} is concave and increasing, with the threshold α_k at the crossing point.

3.3.2 General stopping problems and the one-step look ahead policy

• Consider a stationary problem

• At time k, we may stop at cost t(xk) or choose a control µk(xk) ∈ U(xk) and continue.

• The DP algorithm is given by

    J_N(x_N) = t(x_N),

and for k = 0, 1, . . . , N − 1,

    J_k(x_k) = min{ t(x_k), min_{u_k ∈ U(x_k)} E[ g(x_k, u_k, w) + J_{k+1}(f(x_k, u_k, w)) ] },

and it is optimal to stop at time k for states x in the set

    T_k = { x : t(x) ≤ min_{u ∈ U(x)} E[ g(x, u, w) + J_{k+1}(f(x, u, w)) ] }.        (3.3.5)


Figure 3.3.5: Structure of the value functions J_k(x) and J_{k+1}(x) when prices are correlated: J_k lies below J_{k+1}, so the threshold α_k lies to the left of α_{k+1}.

• Note that J_{N−1}(x) ≤ J_N(x), ∀x. This holds because

    J_{N−1}(x) = min{ t(x), min_{u_{N−1} ∈ U(x)} E[ g(x, u_{N−1}, w) + J_N(f(x, u_{N−1}, w)) ] } ≤ t(x) = J_N(x).

Using the monotonicity of DP, we have J_k(x) ≤ J_{k+1}(x), k = 0, 1, . . . , N − 1. Since

    T_{k+1} = { x : t(x) ≤ min_{u ∈ U(x)} E[ g(x, u, w) + J_{k+2}(f(x, u, w)) ] },        (3.3.6)

and the RHS inside (3.3.5) is less than or equal to the RHS inside (3.3.6), we have

    T_0 ⊂ T_1 ⊂ · · · ⊂ T_k ⊂ T_{k+1} ⊂ · · · ⊂ T_{N−1}.        (3.3.7)

• Question: When are all stopping sets T_k equal?

Answer: Suppose that the set T_{N−1} is absorbing, in the sense that if a state belongs to T_{N−1} and termination is not selected, the next state will also be in T_{N−1}; that is,

    f(x, u, w) ∈ T_{N−1},    ∀x ∈ T_{N−1}, u ∈ U(x), and w.        (3.3.8)

By definition of T_{N−1} we have

    J_{N−1}(x) = t(x),    for all x ∈ T_{N−1}.

We obtain, for x ∈ T_{N−1},

    min_{u ∈ U(x)} E[ g(x, u, w) + J_{N−1}(f(x, u, w)) ] = min_{u ∈ U(x)} E[ g(x, u, w) + t(f(x, u, w)) ]
                                                         ≥ t(x)    (because of (3.3.5) applied to k = N − 1).

Since

    J_{N−2}(x) = min{ t(x), min_{u ∈ U(x)} E[ g(x, u, w) + J_{N−1}(f(x, u, w)) ] },

then x ∈ T_{N−2}, or equivalently, T_{N−1} ⊂ T_{N−2}. This, together with (3.3.7), implies T_{N−1} = T_{N−2}. Proceeding similarly, we obtain T_k = T_{N−1}, ∀k.


Conclusion: If condition (3.3.8) holds (i.e., the one-step stopping set TN−1 is absorbing),then the stopping sets Tk are all equal to the set of states for which it is better to stop ratherthan continue for one more stage and then stop. A policy of this type is known as a one-step-look-ahead policy. Such a policy turns out to be optimal in several types of applications.

Example 3.3.1 (Asset selling with past offers retained)

Take the previous asset selling problem in Section 3.3.1, and suppose now that rejected offers can be accepted at a later time. Then, if the asset is not sold at time k, the state evolves according to

    x_{k+1} = max{x_k, w_k},

instead of just x_{k+1} = w_k. Note that this system equation retains the best offer received so far, from period 0 to k.

The DP algorithm becomes:

    V_N(x_N) = x_N,

and for k = 0, 1, . . . , N − 1,

    V_k(x_k) = max{ x_k, (1 + r)^{−1} E_{w_k}[ V_{k+1}(max{x_k, w_k}) ] }.

The one-step stopping set is

    T_{N−1} = { x : x ≥ (1 + r)^{−1} E_w[ max{x, w} ] }.

Define α as the x that satisfies the equation

    x = E_w[ max{x, w} ] / (1 + r),

so that T_{N−1} = {x : x ≥ α}. Thus,

    α = (1/(1 + r)) E_w[ max{α, w} ]
      = (1/(1 + r)) ( ∫_0^α α dF(w) + ∫_α^∞ w dF(w) )
      = (1/(1 + r)) ( α F(α) + ∫_α^∞ w dF(w) ),

or equivalently,

    (1 + r)α = α F(α) + ∫_α^∞ w dF(w).

Since past offers can be accepted at a later date, the effective offer available cannot decrease with time, and it follows that the one-step stopping set

    T_{N−1} = { x : x ≥ α }

is absorbing in the sense of (3.3.8). In symbols, for x ∈ T_{N−1}, f(x, u, w) = max{x, w} ≥ x ≥ α, and so f(x, u, w) ∈ T_{N−1}. Therefore, the one-step-look-ahead stopping rule that accepts the first offer that equals or exceeds α is optimal.


3.3.3 Scheduling problem

• Consider a given set of tasks to perform, with the ordering subject to optimal choice.

• Costs depend on the order.

• There might be uncertainty, and precedence and resource availability constraints.

• Some problems can be solved efficiently by an interchange argument.

Example: Quiz problem

• Given a list of N questions: if question i is answered correctly (which occurs with probability p_i), we receive reward R_i; if not, the quiz terminates.

• Let i and j be the kth and (k + 1)st questions in an optimally ordered list

    L ≜ (i_0, i_1, . . . , i_{k−1}, i, j, i_{k+2}, . . . , i_{N−1}).

We have

    E[Reward(L)] = E[Reward(i_0, . . . , i_{k−1})] + p_{i_0} p_{i_1} · · · p_{i_{k−1}} ( p_i R_i + p_i p_j R_j ) + E[Reward(i_{k+2}, . . . , i_{N−1})],

where the last term carries the factor p_{i_0} · · · p_{i_{k−1}} p_i p_j, which is unaffected by interchanging i and j.

Consider the list L, now with i and j interchanged:

    L′ ≜ (i_0, . . . , i_{k−1}, j, i, i_{k+2}, . . . , i_{N−1}).

Since L is optimal, E[Reward(L)] ≥ E[Reward(L′)], and then

    p_i R_i + p_i p_j R_j ≥ p_j R_j + p_i p_j R_i,

or

    p_i R_i / (1 − p_i) ≥ p_j R_j / (1 − p_j).

Therefore, to maximize the total expected reward, questions should be ordered in decreasing orderof piRi/(1− pi).
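A quick brute-force check of the index rule (mine, on a made-up instance): it compares the ordering by decreasing p_i R_i/(1 − p_i) against all permutations.

```python
import itertools

def expected_reward(order, p, R):
    """Expected total reward: the quiz stops at the first incorrect answer."""
    total, reach = 0.0, 1.0
    for i in order:
        total += reach * p[i] * R[i]     # reward collected if question i is reached and answered correctly
        reach *= p[i]
    return total

p = [0.9, 0.5, 0.7, 0.2]                 # illustrative success probabilities
R = [1.0, 10.0, 3.0, 8.0]                # illustrative rewards
index_order = sorted(range(len(p)), key=lambda i: p[i] * R[i] / (1 - p[i]), reverse=True)

best = max(itertools.permutations(range(len(p))), key=lambda o: expected_reward(o, p, R))
assert abs(expected_reward(index_order, p, R) - expected_reward(best, p, R)) < 1e-12
print("index ordering", index_order, "achieves the maximum expected reward")
```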

3.3.4 Exercises

Exercise 3.3.1 Consider the optimal stopping, asset selling problem discussed in class. Supposethat the offers wk are i.i.d. random variables, Unif[500, 2000]. For N = 10, compute the thresholdsαk, k = 0, 1, ..., 9, for r = 0.05 and r = 0.1. Recall that αN = 0. Also compute the expected valueJ0(0) for both interest rates.

Exercise 3.3.2 Consider again the optimal stopping, asset selling problem discussed in class.


(a) For the stationary, limiting policy defined by α, where α is the solution to the equation

    (1 + r)α = F(α)α + ∫_α^∞ w dF(w),

prove that G(α), defined as

    G(α) = (1/(1 + r)) ( F(α)α + ∫_α^∞ w dF(w) ),

is a contraction mapping, and hence that the iterative procedure α_{n+1} = G(α_n) finds the unique fixed point in [0, E[w]/r], starting from any arbitrary α_0 ∈ [0, E[w]/r].

Recall: G is a contraction mapping if for all x and y, ||G(x)−G(y)|| < θ||x− y||, for a constant0 ≤ θ < 1, θ independent of x, y.

(b) Apply the iterative procedure to compute α over the scenarios described in Exercise 1.

(c) Compute the expected value J0(0) for Problem 1 when the controller applies control α in everystage. Compare the results and comment on them.

Exercise 3.3.3 (The job/secretary/partner selection problem) A collection of N ≥ 2 ob-jects is observed randomly and sequentially one at a time. The observer may either select thecurrent object observed, in which case the selection process is terminated, or reject the object andproceed to observe the next. The observer can rank each object relative to those already observed,and the objective is to maximize the probability of selecting the “best” object according to somecriterion. It is assumed that no two objects can be judged to be equal. Let r∗ be the smallestpositive integer r such that

    1/(N − 1) + 1/(N − 2) + · · · + 1/r ≤ 1.

Show that an optimal policy requires that the first r∗ objects be observed. If the r∗th object hasrank 1 relative to the others already observed, it should be selected; otherwise, the observationprocess should be continued until an object of rank 1 relative to those already observed is found.

Hint: Assume a uniform distribution of the objects, i.e., if the rth object has rank 1 relative to the previous (r − 1) objects, then the probability that it is the best is r/N. Define the state of the system as

    x_k = T    if the selection has already terminated,
    x_k = 1    if the kth object observed has rank 1 among the first k objects,
    x_k = 0    if the kth object observed has rank > 1 among the first k objects.

For k ≥ r*, let J_k(0) be the maximal probability of finding the best object assuming k objects have been observed and the kth object is not best relative to the previous (k − 1) objects. Show that

    J_k(0) = (k/N) ( 1/(N − 1) + · · · + 1/k ).

Analogously, let J_k(1) be the maximal probability of finding the best object assuming k objects have been observed and the kth object is indeed the best relative to the previous (k − 1) objects. Show that

    J_k(1) = k/N.

Then, analyze the case k < r∗.


Exercise 3.3.4 A driver is looking for parking on the way to his destination. Each parking placeis free with probability p independently of whether other parking places are free or not. The drivercannot observe whether a parking place is free until he reaches it. If he parks k places from hisdestination, he incurs a cost k. If he reaches the destination without having parked, the cost is C.

(a) Let F_k be the minimal expected cost if he is k parking places from his destination, where F_0 = C. Show that

    F_k = p min{k, F_{k−1}} + q F_{k−1},    k = 1, 2, . . . ,

where q = 1 − p.

(b) Show that an optimal policy is of the form: "Never park if k ≥ k*, but take the first free place if k < k*", where k is the number of parking places from the destination, and

    k* = min{ i : i integer, q^{i−1} < (pC + q)^{−1} }.

Exercise 3.3.5 (Hardy's Theorem) Let {a_1, . . . , a_n} and {b_1, . . . , b_n} be monotonically nondecreasing sequences of numbers. Let us associate with each i = 1, . . . , n a distinct index j_i, and consider the expression Σ_{i=1}^{n} a_i b_{j_i}. Use an interchange argument to show that this expression is maximized when j_i = i for all i, and is minimized when j_i = n − i + 1 for all i.


Chapter 4

DP with Imperfect State Information.

So far we have studied problems in which the controller has access to the exact value of the current state, but this assumption is sometimes unrealistic. In this chapter, we study problems with imperfect state information. In this setting, we suppose that the controller receives noisy observations about the value of the current state instead of observing the actual underlying state.

4.1 Reduction to the perfect information case

Basic problem with imperfect state information

• Suppose that the controller has access to observations z_k of the form

    z_0 = h_0(x_0, v_0),    z_k = h_k(x_k, u_{k−1}, v_k),    k = 1, 2, . . . , N − 1,

where z_k ∈ Z_k (observation space) and v_k ∈ V_k (random observation disturbances). The random observation disturbance v_k is characterized by a probability distribution

    P_{v_k}( · | x_k, . . . , x_0, u_{k−1}, . . . , u_0, w_{k−1}, . . . , w_0, v_{k−1}, . . . , v_0).

• Initial state x0

• Control µk ∈ Uk ⊆ Ck

• Define I_k as the information available to the controller at time k and call it the information vector:

    I_k = (z_0, z_1, . . . , z_k, u_0, u_1, . . . , u_{k−1}).

Consider the class of policies consisting of sequences of functions π = {µ_0, µ_1, . . . , µ_{N−1}}, where µ_k(I_k) ∈ U_k for all I_k, k = 0, 1, . . . , N − 1.


Objective: Find an admissible policy π = {µ_0, µ_1, . . . , µ_{N−1}} that minimizes the cost function

    J_π = E_{x_0, w_k, v_k, k=0,...,N−1}[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, µ_k(I_k), w_k) ]

subject to the system equation

    x_{k+1} = f_k(x_k, µ_k(I_k), w_k),    k = 0, 1, . . . , N − 1,

and the measurement equation

    z_0 = h_0(x_0, v_0),    z_k = h_k(x_k, µ_{k−1}(I_{k−1}), v_k),    k = 1, 2, . . . , N − 1.

Note the difference from the perfect state information case. In the perfect state information case, we tried to find a rule that would specify the control u_k to be applied for each state x_k at time k. Now, however, we are looking for a rule that gives the control to be applied for every possible information vector I_k, i.e., for every sequence of observations received and controls applied up to time k.

Example: Multiaccess Communication

• Consider a group of transmitting stations sharing a common channel

• Stations are synchronized to transmit packets of data at integer times

• Each packet requires one slot (time unit) for transmission

Let ak = Number of packet arrivals during slot k (with a given probability distribution)

x_k = Number of packets waiting to be transmitted at the beginning of slot k (the backlog)

• Packet transmissions are scheduled using a strategy called slotted Aloha protocol :

– Each packet in the system at the beginning of slot k is transmitted during that slot withprobability uk (common for all packets)

– If two or more packets are transmitted simultaneously, they collide and have to rejointhe backlog for retransmission at a later slot

– Stations can observe the channel and determine whether in any one slot:

1. there was a collision


2. a success in the slot

3. nothing happened (i.e., idle slot)

• Control: transmission probability uk

• Objective: keep backlog small, so we assume a cost per stage gk(xk), with gk(.) a monotonicallyincreasing function of xk

• State of system: size of the backlog xk (unobservable)

• System equation:

    x_{k+1} = x_k + a_k − t_k,

where a_k is the number of new arrivals and t_k is the number of packets successfully transmitted during slot k. The distribution of t_k is given by

    t_k = 1 (success)   w.p. x_k u_k (1 − u_k)^{x_k − 1}   (i.e., P(one station transmits, the other x_k − 1 do not)),
    t_k = 0 (failure)   w.p. 1 − x_k u_k (1 − u_k)^{x_k − 1}.

• Measurement equation:

    z_{k+1} = v_{k+1} = "idle"        w.p. (1 − u_k)^{x_k},
    z_{k+1} = v_{k+1} = "success"     w.p. x_k u_k (1 − u_k)^{x_k − 1},
    z_{k+1} = v_{k+1} = "collision"   w.p. 1 − (1 − u_k)^{x_k} − x_k u_k (1 − u_k)^{x_k − 1},

where z_{k+1} is the observation obtained at the end of the kth slot.

Reformulation as a perfect information problem. A candidate for the state is the information vector I_k:

    I_{k+1} = (I_k, z_{k+1}, u_k),    k = 0, 1, . . . , N − 2,    I_0 = z_0.

The state of the new system is I_k, the control is u_k, and z_{k+1} can be viewed as a random disturbance. Furthermore, we have

    P(z_{k+1} | I_k, u_k) = P(z_{k+1} | I_k, u_k, z_0, z_1, . . . , z_k).

Note that the prior disturbances z_0, z_1, . . . , z_k are already included in the information vector I_k. So, on the LHS we now have the system in the framework of basic DP, where the probability distribution of z_{k+1} depends explicitly only on the state I_k and control u_k of the new system and not on the prior disturbances (although implicitly it does, through I_k).

• The cost per stage:

    g_k(I_k, u_k) = E_{x_k, w_k}[ g_k(x_k, u_k, w_k) | I_k, u_k ].

Note that the new formulation is focused on past information and controls rather than on the original system disturbances w_k.

• DP algorithm:

    J_{N−1}(I_{N−1}) = min_{u_{N−1} ∈ U_{N−1}} E_{x_{N−1}, w_{N−1}}[ g_N(f_{N−1}(x_{N−1}, u_{N−1}, w_{N−1})) + g_{N−1}(x_{N−1}, u_{N−1}, w_{N−1}) | I_{N−1}, u_{N−1} ],


and for k = 0, 1, . . . , N − 2,

    J_k(I_k) = min_{u_k ∈ U_k} E_{x_k, w_k, z_{k+1}}[ g_k(x_k, u_k, w_k) + J_{k+1}( I_k, z_{k+1}, u_k ) | I_k, u_k ],

where (I_k, z_{k+1}, u_k) = I_{k+1}. We minimize the RHS for every possible value of the vector I_k to obtain µ*_k(I_k). The optimal cost is given by J* = E_{z_0}[J_0(z_0)].

Example 4.1.1 (Machine repair)

• A machine can be in one of two unobservable states: P (good state) and P̄ (bad state).

• State space: {P, P̄}

• Number of periods: N = 2

• At the end of each period, the machine is inspected, with two possible inspection outcomes: G (probably good state), B (probably bad state).

• Control space: actions after each inspection, which could be either

– C : continue operation of the machine; or

– S : stop, diagnose its state and, if it is in the bad state P̄, repair it.

• Cost per stage: g(P, C) = 0; g(P, S) = 1; g(P̄, C) = 2; g(P̄, S) = 1

• Total cost: g(x_0, u_0) + g(x_1, u_1) (assume zero terminal cost)

• Let x_0, x_1 be the state of the machine at the end of each period

• Distribution of the initial state: P(x_0 = P) = 2/3, P(x_0 = P̄) = 1/3

• Assume that we start with a machine in good state, i.e., x_{−1} = P

• System equation:

    x_{k+1} = w_k,    k = 0, 1,

where the transition probabilities are given by


• Note that we do not have perfect state information, since the inspection does not reveal the state of the machine with certainty. Rather, the result of each inspection may be viewed as a noisy measure of the system state.

Result of inspections: z_k = v_k, k = 0, 1; v_k ∈ {B, G}.

• Information vector:

    I_0 = (z_0),    I_1 = (z_0, z_1, u_0),

and we seek functions µ_0(I_0), µ_1(I_1) that minimize

    E_{x_0, w_0, w_1, v_0, v_1}[ g(x_0, µ_0(z_0)) + g(x_1, µ_1(z_0, z_1, µ_0(z_0))) ],

where z_0 = I_0 and (z_0, z_1, µ_0(z_0)) = I_1.

DP algorithm. Terminal condition: J_2(I_2) = 0 for all I_2.

For k = 0, 1:

    J_k(I_k) = min{ P(x_k = P | I_k, C) g(P, C) + P(x_k = P̄ | I_k, C) g(P̄, C) + E_{z_{k+1}}[ J_{k+1}(I_k, z_{k+1}, C) | I_k, C ]    (cost if u_k = C),
                    P(x_k = P | I_k, S) g(P, S) + P(x_k = P̄ | I_k, S) g(P̄, S) + E_{z_{k+1}}[ J_{k+1}(I_k, z_{k+1}, S) | I_k, S ]    (cost if u_k = S) },

with g(P, C) = 0, g(P̄, C) = 2, g(P, S) = g(P̄, S) = 1.

Last stage (k = 1): compute J_1(I_1) for each possible I_1 = (z_0, z_1, u_0). Recalling that J_2(I) = 0 for all I, we have

    cost of C = 2 P(x_1 = P̄ | I_1),    cost of S = 1,

and therefore J_1(I_1) = min{ 2 P(x_1 = P̄ | I_1), 1 }. We compute the probability P(x_1 = P̄ | I_1) for all possible realizations of I_1 = (z_0, z_1, u_0) by using the conditional probability formula

    P(X | A, B) = P(X, A | B) / P(A | B).

There are 8 cases to consider. We describe 3 of them here.

(1) For I_1 = (G, G, S):

    P(x_1 = P̄ | G, G, S) = P(x_1 = P̄, G, G | S) / P(G, G | S) = 1/7.


Numerator:

    P(x_1 = P̄, G, G | S) = (2/3 × 3/4 × 1/3 × 1/4) + (1/3 × 1/4 × 1/3 × 1/4) = 7/144.

Denominator:

    P(G, G | S) = (2/3 × 3/4 × 2/3 × 3/4) + (2/3 × 3/4 × 1/3 × 1/4) + (1/3 × 1/4 × 2/3 × 3/4) + (1/3 × 1/4 × 1/3 × 1/4) = 49/144.

Hence,

    J_1(G, G, S) = 2 P(x_1 = P̄ | G, G, S) = 2/7 < 1,    µ*_1(G, G, S) = C.

(2) For I_1 = (B, G, S):

    P(x_1 = P̄ | B, G, S) = 1/7.

Numerator:

    P(x_1 = P̄, B, G | S) = 1/4 × 1/3 × (1/4 × 2/3 + 3/4 × 1/3) = 5/144.


Denominator:

    P(B, G | S) = 2/3 × 1/4 × (2/3 × 3/4 + 1/3 × 1/4) + 1/3 × 3/4 × (2/3 × 3/4 + 1/3 × 1/4) = 35/144.

Hence,

    J_1(B, G, S) = 2/7,    µ*_1(B, G, S) = C.

(3) For I_1 = (G, B, S):

    P(x_1 = P̄ | G, B, S) = P(x_1 = P̄, G, B | S) / P(G, B | S) = 3/5.

Numerator:

    P(x_1 = P̄, G, B | S) = 3/4 × 1/3 × (3/4 × 2/3 + 1/4 × 1/3) = 7/48.


Denominator:

    P(G, B | S) = 1/4 × 2/3 × (3/4 × 2/3 + 1/4 × 1/3) + 3/4 × 1/3 × (3/4 × 2/3 + 1/4 × 1/3) = 35/144.

Hence,

    J_1(G, B, S) = 1,    µ*_1(G, B, S) = S.

Summary: For all other 5 cases of I1, we compute J1(I1) and µ∗1(I1). The optimal policy is to continue

(u1 = C) if the result of last inspection was G and to stop (u1 = S) if the result of the last inspection

was B.

First stage (k = 0): Compute J0(I0) for each of the two possible information vectors I0 = (G), I0 = (B).

We have

    cost of C = 2 P(x_0 = P̄ | I_0, C) + E_{z_1}[ J_1(I_0, z_1, C) | I_0, C ]
              = 2 P(x_0 = P̄ | I_0, C) + P(z_1 = G | I_0, C) J_1(I_0, G, C) + P(z_1 = B | I_0, C) J_1(I_0, B, C),

    cost of S = 1 + E_{z_1}[ J_1(I_0, z_1, S) | I_0, S ]
              = 1 + P(z_1 = G | I_0, S) J_1(I_0, G, S) + P(z_1 = B | I_0, S) J_1(I_0, B, S),

using the values of J_1 from the previous stage. Thus, we have

    J_0(I_0) = min{ cost of C, cost of S }.

The optimal cost is

    J* = P(G) J_0(G) + P(B) J_0(B).

For illustration, we compute one of the values. For example, for I_0 = G,

    P(z_1 = G | G, C) = P(z_1 = G, G | C) / P(G | C) = P(z_1 = G, G | C) / P(G) = (15/48) / (7/12) = 15/28,

where the first G in each joint probability is z_0 and C is u_0. Note that P(G | C) = P(G) follows since z_0 = G is independent of the control u_0 = C or u_0 = S.


Numerator:

    P(z_1 = G, G | C) = 2/3 × 3/4 × (2/3 × 3/4 + 1/3 × 1/4) + 1/3 × 1/4 × 1 × 1/4 = 15/48.

Denominator:

    P(G) = 2/3 × 3/4 + 1/3 × 1/4 = 7/12.

Similarly, we can compute P(B) = 2/3 × 1/4 + 1/3 × 3/4 = 5/12.


Note: The optimal policy for both stages is to continue (C) if the result of the latest inspection is G and to stop and repair (S) otherwise. The optimal cost can be shown to be J* = 176/144.
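The last-stage computations above are easy to mechanize. The following sketch (mine) reproduces P(x_1 = P̄ | I_1) and J_1(I_1) for the eight information vectors; the transition and inspection probabilities are the ones used in the worked computations above (a good machine stays good w.p. 2/3; a bad machine stays bad under C and is repaired under S; P(G | P) = 3/4, P(G | P̄) = 1/4).

```python
from itertools import product

P0 = {"P": 2/3, "Pbar": 1/3}                                       # distribution of x0
OBS = {"P": {"G": 3/4, "B": 1/4}, "Pbar": {"G": 1/4, "B": 3/4}}    # P(z | state)

def trans(x0: str, u0: str) -> dict:
    """P(x1 | x0, u0): under S a bad machine is repaired, so it transitions as a good one."""
    start = "P" if u0 == "S" else x0
    return {"P": 2/3, "Pbar": 1/3} if start == "P" else {"P": 0.0, "Pbar": 1.0}

def posterior_bad(z0: str, z1: str, u0: str) -> float:
    """P(x1 = Pbar | z0, z1, u0) by summing over the unobserved states x0, x1."""
    joint = {x1: sum(P0[x0] * OBS[x0][z0] * trans(x0, u0)[x1] * OBS[x1][z1]
                     for x0 in P0) for x1 in ("P", "Pbar")}
    return joint["Pbar"] / (joint["P"] + joint["Pbar"])

for z0, z1, u0 in product("GB", "GB", "CS"):
    p_bad = posterior_bad(z0, z1, u0)
    print(f"I1=({z0},{z1},{u0}):  P(x1=Pbar|I1)={p_bad:.4f}  J1={min(2 * p_bad, 1.0):.4f}")
```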

Problem: The DP can be computationally prohibitive if the number of information vectors Ik is large or

infinite.

4.2 Linear-Quadratic Systems and Sufficient Statistics

In this section, we consider again the problem studied in Section 2.5, but now under the assumptionthat the controller does not observe the real state of the system xk, but just a noisy representationof it, zk. Then, we investigate how we can reduce the quantity of information needed to solveproblems under imperfect state information.

4.2.1 Linear-Quadratic systems

Problem setup

System equation: xk+1 = Akxk +Bkuk + wk [Linear in both state and control.]

Quadratic cost:

    E_{x_0, w_0, ..., w_{N−1}}[ x_N' Q_N x_N + Σ_{k=0}^{N−1} ( x_k' Q_k x_k + u_k' R_k u_k ) ],

where:

• Q_k are square, symmetric, positive semidefinite matrices with appropriate dimensions,

• R_k are square, symmetric, positive definite matrices with appropriate dimensions,

• Disturbances w_k are independent with E[w_k] = 0 and finite variance, and are independent of x_k and u_k.

• Controls u_k are unconstrained, i.e., u_k ∈ R^n.

Observations: Driven by a linear measurement equation,

    z_k = C_k x_k + v_k,    k = 0, 1, . . . , N − 1,

with z_k ∈ R^s, C_k ∈ R^{s×n}, v_k ∈ R^s, where the v_k are mutually independent, and also independent of w_k and x_0.

Key fact to show: Given an information vector Ik = (z0, . . . , zk, u0, . . . , uk−1), the optimal policyµ∗0, . . . , µ∗N−1 is of the form

µ∗k(Ik) = LkE[xk|Ik],

where

• Lk is the same as for the perfect state info case, and solves the “control problem”.

• E[xk|Ik] solves the “estimation problem”.

This means that the control and estimation problems can be solved separately.


DP algorithm

The DP algorithm becomes:

• At stage N − 1,

    J_{N−1}(I_{N−1}) = min_{u_{N−1}} E_{x_{N−1}, w_{N−1}}[ x_{N−1}' Q_{N−1} x_{N−1} + u_{N−1}' R_{N−1} u_{N−1}
                       + (A_{N−1} x_{N−1} + B_{N−1} u_{N−1} + w_{N−1})' Q_N (A_{N−1} x_{N−1} + B_{N−1} u_{N−1} + w_{N−1}) | I_{N−1}, u_{N−1} ].        (4.2.1)

– Recall that w_{N−1} is independent of x_{N−1}, and that both are random at stage N − 1; that is why we take the expectation over both of them.

– Since the w_k are mutually independent and do not depend on x_k and u_k either, we have

    E[w_{N−1} | I_{N−1}, u_{N−1}] = E[w_{N−1} | I_{N−1}] = E[w_{N−1}] = 0,

so the minimization just involves

    min_{u_{N−1}} { u_{N−1}' ( B_{N−1}' Q_N B_{N−1} + R_{N−1} ) u_{N−1} + 2 E[x_{N−1} | I_{N−1}]' A_{N−1}' Q_N B_{N−1} u_{N−1} },

where B_{N−1}' Q_N B_{N−1} ≥ 0 and R_{N−1} > 0. Taking the derivative of the argument with respect to u_{N−1}, we obtain the first-order condition

    2 ( B_{N−1}' Q_N B_{N−1} + R_{N−1} ) u_{N−1} + 2 B_{N−1}' Q_N A_{N−1} E[x_{N−1} | I_{N−1}] = 0.

This yields the optimal u*_{N−1}:

    u*_{N−1} = µ*_{N−1}(I_{N−1}) = L_{N−1} E[x_{N−1} | I_{N−1}],

where

    L_{N−1} = −( B_{N−1}' Q_N B_{N−1} + R_{N−1} )^{−1} B_{N−1}' Q_N A_{N−1}.

Note that this is very similar to the perfect state info counterpart, except that now x_{N−1} is replaced by E[x_{N−1} | I_{N−1}].

– Substituting back into (4.2.1), we get

    J_{N−1}(I_{N−1}) = E_{x_{N−1}}[ x_{N−1}' K_{N−1} x_{N−1} | I_{N−1} ]    (quadratic in x_{N−1})
                     + E_{x_{N−1}}[ (x_{N−1} − E[x_{N−1}|I_{N−1}])' P_{N−1} (x_{N−1} − E[x_{N−1}|I_{N−1}]) | I_{N−1} ]    (quadratic in the estimation error x_{N−1} − E[x_{N−1}|I_{N−1}])
                     + E_{w_{N−1}}[ w_{N−1}' Q_N w_{N−1} ]    (constant term),

where the matrices K_{N−1} and P_{N−1} are given by

    P_{N−1} = A_{N−1}' Q_N B_{N−1} ( R_{N−1} + B_{N−1}' Q_N B_{N−1} )^{−1} B_{N−1}' Q_N A_{N−1},

and

    K_{N−1} = A_{N−1}' Q_N A_{N−1} − P_{N−1} + Q_{N−1}.


– Note the structure of JN−1: In addition to the quadratic and constant terms (which areidentical to the perfect state info case for a given state xN−1), it involves a quadraticterm in the estimation error

xN−1 − E[xN−1|IN−1].

In words, the estimation error is penalized quadratically in the value function.

• At stage N − 2,

    J_{N−2}(I_{N−2}) = min_{u_{N−2}} E_{x_{N−2}, w_{N−2}, z_{N−1}}[ x_{N−2}' Q_{N−2} x_{N−2} + u_{N−2}' R_{N−2} u_{N−2} + J_{N−1}(I_{N−1}) | I_{N−2}, u_{N−2} ]
                     = E[ x_{N−2}' Q_{N−2} x_{N−2} | I_{N−2} ] + min_{u_{N−2}} { u_{N−2}' R_{N−2} u_{N−2} + E[ x_{N−1}' K_{N−1} x_{N−1} | I_{N−2}, u_{N−2} ]
                       + E[ (x_{N−1} − E[x_{N−1}|I_{N−1}])' P_{N−1} (x_{N−1} − E[x_{N−1}|I_{N−1}]) | I_{N−2}, u_{N−2} ]        (4.2.2)
                       + E_{w_{N−1}}[ w_{N−1}' Q_N w_{N−1} ] }.

Key point (to be proved): the term (4.2.2) turns out to be independent of u_{N−2}, and so we can exclude it from the minimization with respect to u_{N−2}.

This says that the quality of estimation, as expressed by the statistics of the error x_k − E[x_k | I_k], cannot be influenced by the choice of control, which is not very intuitive!

For the next result, we need the linearity of both the system and the measurement equations.

Lemma 4.2.1 (Quality of Estimation) For every stage k, there is a function M_k(·) such that

    M_k(x_0, w_0, . . . , w_{k−1}, v_0, . . . , v_k) = x_k − E[x_k | I_k],

independently of the policy being used.

Proof: Fix a policy, and consider the following two systems:

1. There is a control u_k being implemented, and the system evolves according to

    x_{k+1} = A_k x_k + B_k u_k + w_k,    z_k = C_k x_k + v_k.

2. There is no control being applied, and the system evolves according to

    x̄_{k+1} = A_k x̄_k + w̄_k,    z̄_k = C_k x̄_k + v̄_k.        (4.2.3)

Consider the evolution of the two systems from identical initial conditions, x̄_0 = x_0, and with identical system disturbances and observation noise vectors:

    w̄_k = w_k,    v̄_k = v_k,    k = 0, 1, . . . , N − 1.

Consider the vectors

    Z^k = (z_0, . . . , z_k)',  Z̄^k = (z̄_0, . . . , z̄_k)',  W^k = (w_0, . . . , w_k)',  V^k = (v_0, . . . , v_k)',  and  U^k = (u_0, . . . , u_k)'.


Applying the system equations above for stages 0, 1, . . . , k, their linearity implies the existence of matrices F_k, G_k and H_k such that

    x_k = F_k x_0 + G_k U^{k−1} + H_k W^{k−1},        (4.2.4)
    x̄_k = F_k x_0 + H_k W^{k−1}.        (4.2.5)

Note that the vector U^{k−1} = (u_0, . . . , u_{k−1})' is part of the information vector I_k, as verified below:

    I_k = (z_0, . . . , z_k, u_0, . . . , u_{k−1}),    k = 1, . . . , N − 1,    I_0 = z_0.

Then, U^{k−1} = E[U^{k−1} | I_k], and conditioning with respect to I_k in (4.2.4) and (4.2.5),

    E[x_k | I_k] = F_k E[x_0 | I_k] + G_k U^{k−1} + H_k E[W^{k−1} | I_k],        (4.2.6)
    E[x̄_k | I_k] = F_k E[x_0 | I_k] + H_k E[W^{k−1} | I_k].        (4.2.7)

Then, subtracting (4.2.6) from (4.2.4) and (4.2.7) from (4.2.5),

    x_k − E[x_k | I_k] = x̄_k − E[x̄_k | I_k],

where the term G_k U^{k−1} cancels. The intuition for this is that the linearity of the system equation affects "equally" the true state x_k and our estimate of it, E[x_k | I_k].

Applying now the measurement equations above for stages 0, 1, . . . , k, their linearity implies the existence of a matrix R_k such that

    Z^k − Z̄^k = R_k U^{k−1}.

Note that Z^k involves the term B_{k−1} u_{k−1} from the system equation for x_k, and recursively we can build such a matrix R_k. In addition, from (4.2.3) above and the sample-path identity for the disturbances, Z̄^k depends only on the original w_k, v_k and x_0:

    Z^k − Z̄^k = R_k U^{k−1}   ⇒   Z̄^k = Z^k − R_k U^{k−1} = S_k W^{k−1} + T_k V^k + D_k x_0,

where S_k, T_k, and D_k are matrices of appropriate dimensions. Thus, the information provided by I_k = (Z^k, U^{k−1}) regarding x̄_k is summarized in Z̄^k, and we have

    E[x̄_k | I_k] = E[x̄_k | Z̄^k],

so that

    x_k − E[x_k | I_k] = x̄_k − E[x̄_k | I_k] = x̄_k − E[x̄_k | Z̄^k].

Therefore, the function M_k to use is

    M_k(x_0, w_0, . . . , w_{k−1}, v_0, . . . , v_k) = x̄_k − E[x̄_k | Z̄^k],

which does not depend on the controls u_0, . . . , u_{k−1}.

Going back to the DP equation for J_{N−2}(I_{N−2}), and using the Quality of Estimation Lemma, we define

    ξ_{N−1} ≜ M_{N−1}(x_0, w_0, . . . , w_{N−2}, v_0, . . . , v_{N−1}) = x_{N−1} − E[x_{N−1} | I_{N−1}].


Since ξ_{N−1} is independent of u_{N−2}, we have

    E[ ξ_{N−1}' P_{N−1} ξ_{N−1} | I_{N−2}, u_{N−2} ] = E[ ξ_{N−1}' P_{N−1} ξ_{N−1} | I_{N−2} ].

So, going back to the DP equation for J_{N−2}(I_{N−2}), we can drop the term (4.2.2) from the minimization over u_{N−2}, and, similarly to stage N − 1, the minimization yields

    u*_{N−2} = µ*_{N−2}(I_{N−2}) = L_{N−2} E[x_{N−2} | I_{N−2}].

Continuing similarly (using also the Quality of Estimation Lemma), we have

    µ*_k(I_k) = L_k E[x_k | I_k],

where L_k is the same as for perfect state info:

    L_k = −(R_k + B_k' K_{k+1} B_k)^{−1} B_k' K_{k+1} A_k,

with K_k generated from K_N = Q_N, using

    K_k = A_k' K_{k+1} A_k − P_k + Q_k,
    P_k = A_k' K_{k+1} B_k (R_k + B_k' K_{k+1} B_k)^{−1} B_k' K_{k+1} A_k.
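The gain matrices L_k depend only on the problem data, so they can be precomputed offline. A minimal sketch of the backward Riccati recursion (mine, using NumPy and time-invariant matrices for brevity):

```python
import numpy as np

def lq_gains(A, B, Q, R, Q_N, N):
    """Backward Riccati recursion: returns the gains [L_0, ..., L_{N-1}]."""
    K = Q_N.copy()                        # K_N = Q_N
    gains = []
    for _ in range(N):                    # k = N-1, ..., 0
        M = R + B.T @ K @ B               # R_k + B_k' K_{k+1} B_k
        L = -np.linalg.solve(M, B.T @ K @ A)
        P = A.T @ K @ B @ np.linalg.solve(M, B.T @ K @ A)
        K = A.T @ K @ A - P + Q           # K_k
        gains.append(L)
    return list(reversed(gains))          # [L_0, ..., L_{N-1}]

# Illustrative 2-state, 1-input system (all numbers are placeholders):
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R, Q_N = np.eye(2), np.array([[0.1]]), np.eye(2)
L = lq_gains(A, B, Q, R, Q_N, N=5)
print(L[0])                               # the control applied is u_k = L_k E[x_k | I_k]
```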

The optimal controller is represented in Figure 4.2.1.

Figure 4.2.1: Structure of the optimal controller for the L-Q problem. The system x_{k+1} = A_k x_k + B_k u_k + w_k produces the measurement z_k = C_k x_k + v_k; an estimator updates I_k := (I_{k−1}, u_{k−1}, z_k) and computes E[x_k | I_k], and an actuator applies u_k = L_k E[x_k | I_k].

Separation interpretation:

1. The optimal controller can be decomposed into:

• An estimator, which uses the data to generate the conditional expectation E[x_k | I_k].

• An actuator, which multiplies E[x_k | I_k] by the gain matrix L_k and applies the control u_k = L_k E[x_k | I_k].


2. Observation: Consider the problem of finding the estimate x̂ of a random vector x, given some information (random vector) I, which minimizes the mean squared error

    E_x[ ||x − x̂||^2 | I ] = E[ ||x||^2 | I ] − 2 E[x | I]' x̂ + ||x̂||^2.

Taking the derivative with respect to x̂ and setting it equal to zero,

    2 x̂ − 2 E[x | I] = 0   ⇒   x̂ = E[x | I],

which is exactly our estimator.

3. The estimator portion of the optimal controller is optimal for the problem of estimating thestate xk assuming the control is not subject to choice.

4. The actuator portion is optimal for the control problem assuming perfect state information.

4.2.2 Implementation aspects – Steady-state controller

• In the imperfect info case, we need to compute the estimate x̂_k = E[x_k | I_k], which is indeed the one that minimizes the mean squared error E_x[ ||x − x̂||^2 | I ].

• However, this is computationally hard in general.

• Fortunately, if the disturbances wk and vk, and the initial state x0 are Gaussian randomvectors, a convenient implementation of the estimator is possible by means of the Kalmanfilter algorithm.

This algorithm produces x̂_{k+1} at time k + 1 depending only on z_{k+1}, u_k and x̂_k.

Kalman filter recursion: For all k = 0, 1, . . . , N − 1, compute

    x̂_{k+1} = A_k x̂_k + B_k u_k + Σ_{k+1|k+1} C_{k+1}' N_{k+1}^{−1} ( z_{k+1} − C_{k+1}(A_k x̂_k + B_k u_k) ),

and

    x̂_0 = E[x_0] + Σ_{0|0} C_0' N_0^{−1} ( z_0 − C_0 E[x_0] ),

where the matrices Σ_{k|k} are precomputable and are given recursively by

    Σ_{k+1|k+1} = Σ_{k+1|k} − Σ_{k+1|k} C_{k+1}' ( C_{k+1} Σ_{k+1|k} C_{k+1}' + N_{k+1} )^{−1} C_{k+1} Σ_{k+1|k},
    Σ_{k+1|k} = A_k Σ_{k|k} A_k' + M_k,    k = 0, 1, . . . , N − 1,

with

    Σ_{0|0} = S − S C_0' ( C_0 S C_0' + N_0 )^{−1} C_0 S.

In these equations, M_k, N_k, and S are the covariance matrices^1 of w_k, v_k and x_0, respectively, and we assume that w_k and v_k have zero mean; that is,

    E[w_k] = E[v_k] = 0,
    M_k = E[w_k w_k'],    N_k = E[v_k v_k'],    S = E[(x_0 − E[x_0])(x_0 − E[x_0])'].

Moreover, we are assuming that the matrices N_k are positive definite (and hence invertible).

^1 Recall that for a random vector X, its covariance matrix Σ is given by E[(X − E[X])(X − E[X])']. Its entry (i, j) is given by Σ_{ij} = Cov(X_i, X_j) = E[(X_i − E[X_i])(X_j − E[X_j])]. The covariance matrix Σ is always positive semi-definite.
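A direct transcription of the recursion into code (a sketch of mine, with time-invariant matrices and NumPy); only the measurements and applied controls are needed at run time, and the Σ matrices could equally be precomputed offline, as the text notes.

```python
import numpy as np

def kalman_filter(A, B, C, M, Nv, S, x0_mean, us, zs):
    """Kalman filter recursion above; us = [u_0, ..., u_{N-1}], zs with len(zs) == len(us) + 1."""
    Sigma = S - S @ C.T @ np.linalg.solve(C @ S @ C.T + Nv, C @ S)        # Sigma_{0|0}
    x_hat = x0_mean + Sigma @ C.T @ np.linalg.solve(Nv, zs[0] - C @ x0_mean)
    estimates = [x_hat]
    for k, u in enumerate(us):
        Sigma_pred = A @ Sigma @ A.T + M                                   # Sigma_{k+1|k}
        Sigma = Sigma_pred - Sigma_pred @ C.T @ np.linalg.solve(
            C @ Sigma_pred @ C.T + Nv, C @ Sigma_pred)                     # Sigma_{k+1|k+1}
        pred = A @ x_hat + B @ u
        x_hat = pred + Sigma @ C.T @ np.linalg.solve(Nv, zs[k + 1] - C @ pred)
        estimates.append(x_hat)
    return estimates                                                       # [x_hat_0, ..., x_hat_N]
```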


Stationary case

• Assume that the system and measurement equations are stationary (i.e., same distributionacross time; Nk = N , Mk = M).

• Suppose that (A,B) is controllable and that matrix Q can be written as Q = F ′F , where Fis a matrix such that (A,F ) is observable.2

• By the theory of LQ systems under perfect info, when N → ∞ (i.e., as the horizon length becomes large), the optimal controller tends to the steady-state policy

    µ*_k(I_k) = L x̂_k,

where

    L = −(R + B'KB)^{−1} B'KA,

and where K is the unique positive semidefinite symmetric solution of the algebraic Riccati equation

    K = A'( K − KB(R + B'KB)^{−1}B'K )A + Q.

• It can also be shown that, in the limit as N → ∞,

    x̂_{k+1} = (A + BL) x̂_k + Σ̄ C' N^{−1} ( z_{k+1} − C(A + BL) x̂_k ),

where Σ̄ is given by

    Σ̄ = Σ − Σ C'(CΣC' + N)^{−1} CΣ,

and Σ is the unique positive semidefinite symmetric solution of the Riccati equation

    Σ = A( Σ − Σ C'(CΣC' + N)^{−1} CΣ )A' + M.

The assumptions required for this are:

1. (A,C) is observable.

2. The matrix M can be written as M = DD′, where D is a matrix such that the pair(A,D) is controllable.

Non-Gaussian uncertainty

When the uncertainty of the system is non-Gaussian, computing E[xk|Ik] may be very difficult froma computational viewpoint. So, a suboptimal solution is typically used.

A common suboptimal controller is to replace E[xk|Ik] by the estimate produced by the Kalmanfilter (i.e., act as if x0, wk and vk are Gaussian).

A nice property of this approximation is that it can be proved to be optimal within the class ofcontrollers that are linear functions of Ik.

^2 Recall the definitions: A pair of matrices (A, B), where A ∈ R^{n×n} and B ∈ R^{n×m}, is said to be controllable if the n × nm matrix [B, AB, A^2 B, . . . , A^{n−1} B] has full rank. A pair (A, C), A ∈ R^{n×n}, C ∈ R^{m×n}, is said to be observable if the pair (A', C') is controllable.


4.2.3 Sufficient statistics

• Problem of DP algorithm under imperfect state info: Growing dimension of the reformulatedstate space Ik.

• Objective: Find sufficient statistics (ideally, of smaller dimension) for Ik that summarize allthe essential contents of Ik as far as control is concerned.

• Recall the DP formulation for the imperfect state info case:

    J_{N−1}(I_{N−1}) = min_{u_{N−1} ∈ U_{N−1}} E_{x_{N−1}, w_{N−1}}[ g_N(f_{N−1}(x_{N−1}, u_{N−1}, w_{N−1})) + g_{N−1}(x_{N−1}, u_{N−1}, w_{N−1}) | I_{N−1}, u_{N−1} ],        (4.2.8)

    J_k(I_k) = min_{u_k ∈ U_k} E_{x_k, w_k, z_{k+1}}[ g_k(x_k, u_k, w_k) + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k ].        (4.2.9)

• Suppose that we can find a function S_k(I_k) such that the RHS of (4.2.8) and (4.2.9) can be written in terms of some function H_k as

    min_{u_k ∈ U_k} H_k(S_k(I_k), u_k),

so that

    J_k(I_k) = min_{u_k ∈ U_k} H_k(S_k(I_k), u_k).

• Such a function S_k is called a sufficient statistic.

• An optimal policy obtained by the preceding minimization can be written as

    µ*_k(I_k) = µ̄*_k(S_k(I_k)),

where µ̄*_k is an appropriate function.

• An example of a sufficient statistic: S_k(I_k) = I_k.

• Another important sufficient statistic is the conditional probability distribution of the state x_k given the information vector I_k, i.e.,

    S_k(I_k) = P_{x_k|I_k}.

For this case, we need an extra assumption: the probability distribution of the observation disturbance v_{k+1} depends explicitly only on the immediately preceding x_k, u_k and w_k, and not on earlier ones.

• It turns out that P_{x_k|I_k} is generated recursively by a dynamic system (estimator) of the form

    P_{x_{k+1}|I_{k+1}} = Φ_k( P_{x_k|I_k}, u_k, z_{k+1} ),        (4.2.10)

for a suitable function Φ_k determined from the data of the problem. (We will verify this later.)


• Claim: Suppose for now that the function Φk in equation (4.2.10) exists. We will argue now that this is enough to solve the DP algorithm.

Proof: By induction. For k = N − 1 (i.e., to solve (4.2.8)), given the Markovian nature of the system, it is sufficient to know the distribution P_{xN−1|IN−1} together with the distribution P_{wN−1|xN−1,uN−1}, so that

JN−1(IN−1) = min_{uN−1 ∈ UN−1} HN−1(P_{xN−1|IN−1}, uN−1) = J̄N−1(P_{xN−1|IN−1})

for appropriate functions HN−1 and J̄N−1.

IH: Assume

Jk+1(Ik+1) = min_{uk+1 ∈ Uk+1} Hk+1(P_{xk+1|Ik+1}, uk+1) = J̄k+1(P_{xk+1|Ik+1}),   (4.2.11)

for appropriate functions Hk+1 and J̄k+1.

We want to show that there exist functions Hk and J̄k such that

Jk(Ik) = min_{uk ∈ Uk} Hk(P_{xk|Ik}, uk) = J̄k(P_{xk|Ik}).

Using equations (4.2.10) and (4.2.11), the DP recursion (4.2.9) can be written as

Jk(Ik) = min_{uk ∈ Uk} E_{xk, wk, zk+1} [ gk(xk, uk, wk) + J̄k+1(Φk(P_{xk|Ik}, uk, zk+1)) | Ik, uk ].   (4.2.12)

To solve this problem, we also need the joint distribution P(xk, wk, zk+1 | Ik, uk), or equivalently, given that from the primitives of the system

zk+1 = hk+1(xk+1, uk, vk+1),   and   xk+1 = fk(xk, uk, wk),

we need

P(xk, wk, hk+1(fk(xk, uk, wk), uk, vk+1) | Ik, uk).

This distribution can be expressed in terms of P_{xk|Ik}, the given distributions

P(wk | xk, uk),   P(vk+1 | fk(xk, uk, wk), uk, wk),

and the system equation xk+1 = fk(xk, uk, wk).

Therefore, the expression minimized over uk in (4.2.12) can be written as a function of P_{xk|Ik} and uk, and the DP equation (4.2.12) can be written as

Jk(Ik) = min_{uk ∈ Uk} Hk(P_{xk|Ik}, uk)

for a suitable function Hk. Thus, P_{xk|Ik} is a sufficient statistic.

• If the conditional distribution P_{xk|Ik} is uniquely determined by another expression Sk(Ik), i.e., there exists a function Gk such that

P_{xk|Ik} = Gk(Sk(Ik)),

then Sk(Ik) is also a sufficient statistic.

For example, if we can show that P_{xk|Ik} is a Gaussian distribution, then the mean and the covariance matrix corresponding to P_{xk|Ik} form a sufficient statistic.


• The representation of the optimal policy as a sequence of functions of P_{xk|Ik}, i.e.,

µk(Ik) = µ̄k(P_{xk|Ik}),   k = 0, 1, . . . , N − 1,

is conceptually very useful. It provides a decomposition of the optimal controller in two parts:

1. An estimator, which uses at time k the measurement zk and the control uk−1 to generate the probability distribution P_{xk|Ik}.

2. An actuator, which generates a control input to the system as a function of the probability distribution P_{xk|Ik}.

This is illustrated in Figure 4.2.2.

[Block diagram: the system xk+1 = fk(xk, uk, wk) produces the state xk; the measurement equation zk = hk(xk, uk−1, vk) produces the observation zk; an estimator Φk−1(P_{xk−1|Ik−1}, uk−1, zk) generates P_{xk|Ik}; an actuator uk = µ̄k(P_{xk|Ik}) generates the control, which is fed back to the system and, with a one-period delay, to the estimator.]

Figure 4.2.2: Conceptual separation of the optimal controller into an estimator and an actuator.

• This separation is the basis for various suboptimal control schemes that split the controller a priori into an estimator and an actuator.

• The controller µ̄k(P_{xk|Ik}) can be viewed as controlling the "probabilistic state" P_{xk|Ik}, so as to minimize the expected cost-to-go conditioned on the information Ik available.

4.2.4 The conditional state distribution recursion

We still need to justify the recursion

Pxk+1|Ik+1= Φk(Pxk|Ik , uk, zk+1) (4.2.13)

For the case where the state, control, observation, and disturbance spaces are the real line, and all the random variables involved possess p.d.f.'s, the conditional density p(xk+1|Ik+1) is generated from p(xk|Ik), uk, and zk+1 by means of the equation

p(xk+1|Ik+1) = p(xk+1|Ik, uk, zk+1)

= p(xk+1, zk+1|Ik, uk) / p(zk+1|Ik, uk)

= p(xk+1|Ik, uk) p(zk+1|Ik, uk, xk+1) / ∫_{−∞}^{∞} p(xk+1|Ik, uk) p(zk+1|Ik, uk, xk+1) dxk+1.


In this expression, all the probability densities appearing in the RHS may be expressed in terms of p(xk|Ik), uk, and zk+1.

In particular:

• The density p(xk+1|Ik, uk) may be expressed through p(xk|Ik), uk, and the system equation xk+1 = fk(xk, uk, wk), using the given density p(wk|xk, uk) and the relation

p(wk|Ik, uk) = ∫_{−∞}^{∞} p(xk|Ik) p(wk|xk, uk) dxk.

• The density p(zk+1|Ik, uk, xk+1) is expressed through the measurement equation zk+1 = hk+1(xk+1, uk, vk+1) using the densities

p(xk|Ik),   p(wk|xk, uk),   p(vk+1|xk, uk, wk).

Now, we give an example for the finite state space case.

Example 4.2.1 (A search problem)

• At each period, decide to search or not search a site that may contain a treasure.

• If we search and treasure is present, we find it w.p. β and remove it from the site.

• State xk (unobservable at the beginning of period k): Treasure is present or not.

• Control uk: search or not search.

• If the site is searched in period k, the observation zk+1 takes two values: treasure found or not.

If site is not searched, the value of zk+1 is irrelevant.

• Denote pk: probability a treasure is present at the beginning of period k.

The probability evolves according to the recursion:

pk+1 =

pk,   if the site is not searched at time k,

0,   if the site is searched and a treasure is found (and removed),

pk(1 − β) / (pk(1 − β) + 1 − pk),   if the site is searched but no treasure is found.

For the third case:

– Numerator pk(1 − β): It is the kth period probability that the treasure is present and the

search is unsuccessful.

– Denominator pk(1− β) + 1− pk: Probability of an unsuccessful search, when the treasure is

either there or not.

• The recursion for pk+1 is a special case of (4.2.13).

4.3 Sufficient Statistics

In this section, we continue investigating the conditional state distribution as a sufficient statistic for problems with imperfect state information.


4.3.1 Conditional state distribution: Review of basics

• Recall the important sufficient statistic given by the conditional probability distribution of the state xk given the information vector Ik, i.e.,

Sk(Ik) = P_{xk|Ik}.

For this case, we need an extra assumption: The probability distribution of the observation disturbance vk+1 depends explicitly only on the immediately preceding xk, uk and wk, and not on earlier ones, which gives the system a Markovian flavor.

• It turns out that Pxk|Ik is generated recursively by a dynamic system (estimator) of the form

Pxk+1|Ik+1= Φk(Pxk|Ik , uk, zk+1), (4.3.1)

for a suitable function Φk determined from the data of the problem.

• We have already proven that if function Φk in equation (4.3.1) exists, then we can solve theDP algorithm.

• The representation of the optimal policy as a sequence of functions of P_{xk|Ik}, i.e.,

µk(Ik) = µ̄k(P_{xk|Ik}),   k = 0, 1, . . . , N − 1,

is conceptually very useful. It provides a decomposition of the optimal controller in two parts:

1. An estimator, which uses at time k the measurement zk and the control uk−1 to generate the probability distribution P_{xk|Ik}.

2. An actuator, which generates a control input to the system as a function of the probability distribution P_{xk|Ik}.

• The DP algorithm can be written as:

JN−1(P_{xN−1|IN−1}) = min_{uN−1 ∈ UN−1} E_{xN−1, wN−1} [ gN(fN−1(xN−1, uN−1, wN−1)) + gN−1(xN−1, uN−1, wN−1) | IN−1, uN−1 ],   (4.3.2)

and for k = 0, 1, . . . , N − 2,

Jk(P_{xk|Ik}) = min_{uk ∈ Uk} E_{xk, wk, zk+1} [ gk(xk, uk, wk) + Jk+1(Φk(P_{xk|Ik}, uk, zk+1)) | Ik, uk ],   (4.3.3)

where P_{xk|Ik} plays the role of the state, and

P_{xk+1|Ik+1} = Φk(P_{xk|Ik}, uk, zk+1)   (4.3.4)

is the system equation. Here, the role of the control is played by uk, and the role of the disturbance is played by zk+1.

Example 4.3.1 (A search problem Revisited)

• At each period, decide to search or not search a site that may contain a treasure.


• If we search and the treasure is present, we find it w.p. β and remove it from the site.

• State xk (unobservable at the beginning of period k): Treasure is present or not.

• Control uk: search or not search.

• Basic costs: Treasure’s worth is V , and search cost is C.

• If the site is searched in period k, the observation zk+1 takes one of two values: treasure found or

not.

If site is not searched, the value of zk+1 is irrelevant.

• Denote pk: probability that the treasure is present at the beginning of period k.

The probability evolves according to the recursion:

pk+1 =

pk,   if the site is not searched at time k,

0,   if the site is searched and a treasure is found (and removed),

pk(1 − β) / (pk(1 − β) + 1 − pk),   if the site is searched but no treasure is found.

For the third case:

– Numerator pk(1 − β): It is the kth period probability that the treasure is present and the

search is unsuccessful.

– Denominator pk(1− β) + 1− pk: Probability of an unsuccessful search, when the treasure is

either there or not.

• The recursion for pk+1 is a special case of (4.3.4).

• Assume that once we decide not to search in a period, we cannot search at future times.

• The DP algorithm is

JN (pN ) = 0,

and for k = 0, 1, . . . , N − 1,

Jk(pk) = max{ 0,  −C + pk β V + (1 − pk β) Jk+1( pk(1 − β) / (pk(1 − β) + 1 − pk) ) },

where −C + pk β V is the expected net reward of searching (and possibly finding the treasure), and 1 − pk β is the probability of searching and not finding the treasure.

• It can be shown by induction that the functions Jk(pk) satisfy

Jk(pk) = 0,   for all pk ≤ C / (βV).

Furthermore, it is optimal to search at period k if and only if the expected reward from the search is at least the cost of the search, i.e.,

pk β V ≥ C.

(A small numerical sketch of this recursion and threshold rule follows.)
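The following Python sketch evaluates the recursion above on a few beliefs and checks the threshold rule numerically; the values of β, V, C and the horizon N are illustrative placeholders, not taken from the text.

    beta, V, C, N = 0.6, 10.0, 2.0, 10      # assumed illustrative data

    def next_p(p):
        # belief after an unsuccessful search (third case of the recursion for p_{k+1})
        return p * (1 - beta) / (p * (1 - beta) + 1 - p)

    def J(k, p):
        # reward-to-go J_k(p); the direct recursion is fine for a small horizon N
        if k == N:
            return 0.0
        return max(0.0, -C + p * beta * V + (1 - p * beta) * J(k + 1, next_p(p)))

    for p in (0.1, 1.0 / 3.0, 0.5, 0.9):
        print(f"p = {p:.3f}   J_0(p) = {J(0, p):.4f}   search now: {p * beta * V >= C}")
    # J_0(p) = 0 exactly when p <= C/(beta*V) (= 1/3 with these numbers).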


4.3.2 Finite-state systems

• Suppose the system is a finite-state Markov chain with states 1, . . . , n.

• Then, the conditional probability distribution Pxk|Ik is an n-dimensional vector

(P(xk = 1|Ik),P(xk = 2|Ik), . . . ,P(xk = n|Ik)).

• When a control u ∈ U is applied (U finite), the system moves from state i to state j w.p. pij(u). Note that the real system state transition is only driven by the control u applied at each stage.

• There is a finite number of possible observation outcomes z¹, z², . . . , z^q. The probability of occurrence of z^θ, given that the current state is xk = j and the preceding control was uk−1, is denoted by P(zk = z^θ | uk−1, xk = j) ≜ rj(uk−1, z^θ), θ = 1, . . . , q.

• The information available to the controller at stage k is

Ik = (z1, . . . , zk, u0, . . . , uk−1).

• Following the observation zk, a control uk is applied, and a cost g(xk, uk) is incurred.

• The terminal cost at stage N for being in state x is G(x).

• Objective: Minimize the expected cumulative cost incurred over N stages.

We can reformulate the problem as one with imperfect state information. The objective is to control the column vector of conditional probabilities

pk = (p_k^1, . . . , p_k^n)′,   where p_k^i = P(xk = i | Ik), i = 1, 2, . . . , n.

We refer to pk as the belief state. It evolves according to

pk+1 = Φk(pk, uk, zk+1),

where the function Φk is an estimator that, given the sufficient statistic pk, provides the new sufficient statistic pk+1. The initial belief p0 is given.

The conditional probabilities can be updated according to the Bayesian updating rule

p_{k+1}^j = P(xk+1 = j | Ik+1)

= P(xk+1 = j | z0, . . . , zk+1, u0, . . . , uk)

= P(xk+1 = j, zk+1 | Ik, uk) / P(zk+1 | Ik, uk)     (because P(A|B,C) = P(A,B|C)/P(B|C))

= [ Σ_{i=1}^n P(xk = i|Ik) P(xk+1 = j|xk = i, uk) P(zk+1|uk, xk+1 = j) ] / [ Σ_{s=1}^n Σ_{i=1}^n P(xk = i|Ik) P(xk+1 = s|xk = i, uk) P(zk+1|uk, xk+1 = s) ]

= [ Σ_{i=1}^n p_k^i pij(uk) rj(uk, zk+1) ] / [ Σ_{s=1}^n Σ_{i=1}^n p_k^i pis(uk) rs(uk, zk+1) ].


In vector form, we have

p_{k+1}^j = rj(uk, zk+1) [P(uk)′ pk]_j / Σ_{s=1}^n rs(uk, zk+1) [P(uk)′ pk]_s,   j = 1, . . . , n,   (4.3.5)

where P(uk) is the n × n transition probability matrix with entries pij(uk), and [P(uk)′ pk]_j is the jth component of the vector P(uk)′ pk.
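A minimal Python sketch of the update (4.3.5) is given below; the transition matrix, observation probabilities, and belief used in the example are arbitrary illustrative numbers, not from the text.

    import numpy as np

    def belief_update(p, P_u, r_uz):
        """One step of (4.3.5): p is the current belief, P_u the transition matrix P(u),
        and r_uz the vector with components r_j(u, z_{k+1}) for the observed z_{k+1}."""
        unnormalized = r_uz * (P_u.T @ p)        # numerator of (4.3.5), component by component
        return unnormalized / unnormalized.sum() # the denominator is P(z_{k+1} | I_k, u_k)

    # Illustrative data with n = 3 states:
    P_u = np.array([[0.9, 0.1, 0.0],
                    [0.2, 0.7, 0.1],
                    [0.0, 0.3, 0.7]])
    r_uz = np.array([0.8, 0.5, 0.1])             # probability of the observed z in each next state
    p = np.array([0.5, 0.3, 0.2])
    print(belief_update(p, P_u, r_uz))           # the new belief p_{k+1}, which sums to one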

The corresponding DP algorithm (4.3.2)–(4.3.3) has the specific form

Jk(pk) = min_{uk ∈ U} { pk′ g(uk) + E_{zk+1} [ Jk+1(Φ(pk, uk, zk+1)) | pk, uk ] },   k = 0, . . . , N − 1,   (4.3.6)

where g(uk) is the column vector with components g(1, uk), . . . , g(n, uk), and pk′ g(uk) is the expected stage cost.

The algorithm starts at stage N with

JN(pN) = pN′ G,

where G is the column vector with components the terminal costs G(i), i = 1, . . . , n, and proceeds backwards.

It turns out that the cost-to-go functions Jk in the DP algorithm are piecewise linear and concave. A consequence of this fact is that Jk can be characterized by a finite set of scalars. Still, however, for a fixed k, the number of these scalars can increase fast with N, and there may be no computationally efficient way to solve the problem.

Example 4.3.2 (Machine repair revisited)

Consider again the machine repair problem, whose setting is included below:

• A machine can be in one of two unobservable states (i.e., n = 2): P̄ (bad state) and P (good state).

• State space: {P̄, P}, where for the indexing: state 1 is P̄, and state 2 is P.

• Number of periods: N = 2

• At the end of each period, the machine is inspected with two possible inspection outcomes: G (probably good state), B (probably bad state)

• Control space: actions after each inspection, which could be either

– C : continue operation of the machine; or

– S : stop, diagnose its state and, if it is in the bad state P̄, repair.

• Cost per stage: g(P̄, C) = 2; g(P, C) = 0; g(P̄, S) = 1; g(P, S) = 1, or in vector form:

g(C) = (2, 0)′,   g(S) = (1, 1)′.


• Total cost: g(x0, u0) + g(x1, u1) (assume zero terminal cost)

• Let x0, x1 be the state of the machine at the end of each period

• Distribution of initial state: P(x0 = P̄) = 1/3, P(x0 = P) = 2/3

• Assume that we start with a machine in good state, i.e., x−1 = P

• System equation:

xk+1 = wk, k = 0, 1

where the distribution of wk is determined by the transition probabilities of the machine. In matrix form, following the aforementioned indexing of the states, the transition probabilities can be expressed as

P(C) = ( 1    0
         1/3  2/3 ),      P(S) = ( 1/3  2/3
                                   1/3  2/3 ).

• Note that we do not have perfect state information, since the inspection does not reveal the state of the machine with certainty. Rather, the result of each inspection may be viewed as a noisy assessment of the system state.

Result of inspections: zk = vk, k = 0, 1; vk ∈ {B, G}


The inspection results can be described by the following definitions:

r1(S, G) ≜ P(zk+1 = G | uk = S, xk+1 = P̄) = 1/4 = r1(C, G),

r1(S, B) ≜ P(zk+1 = B | uk = S, xk+1 = P̄) = 3/4 = r1(C, B),

r2(S, G) ≜ P(zk+1 = G | uk = S, xk+1 = P) = 3/4 = r2(C, G),

r2(S, B) ≜ P(zk+1 = B | uk = S, xk+1 = P) = 1/4 = r2(C, B).

Note that in this case, the observation zk+1 does not depend on the control uk, but just on the state xk+1.

Define the belief state p0 as the 2-dimensional vector with components:

p_0^1 ≜ P(x0 = P̄ | I0),   p_0^2 ≜ P(x0 = P | I0) = 1 − p_0^1.

Similarly, define the belief state p1 with coordinates

p_1^1 ≜ P(x1 = P̄ | I1),   p_1^2 ≜ P(x1 = P | I1) = 1 − p_1^1,

where the evolution of the beliefs is driven by the estimator

p1 = Φ0(p0, u0, z1).

We will use equation (4.3.5) to compute p1 given p0, but first we calculate the matrix products P(u0)′p0, for u0 ∈ {S, C}:

P(S)′p0 = ( (1/3)(p_0^1 + p_0^2), (2/3)(p_0^1 + p_0^2) )′ = ( 1/3, 2/3 )′,   (4.3.7)

and

P(C)′p0 = ( p_0^1 + (1/3)p_0^2, (2/3)p_0^2 )′ = ( 1/3 + (2/3)p_0^1, 2/3 − (2/3)p_0^1 )′.   (4.3.8)

Now, using equation (4.3.5) for state j = 1 (i.e., for state P̄), we get

• For u0 = S, z1 = G:

p_1^1 = r1(S,G)[P(S)′p0]_1 / ( r1(S,G)[P(S)′p0]_1 + r2(S,G)[P(S)′p0]_2 ) = (1/4 × 1/3) / (1/4 × 1/3 + 3/4 × 2/3) = 1/7.

• For u0 = S, z1 = B:

p_1^1 = r1(S,B)[P(S)′p0]_1 / ( r1(S,B)[P(S)′p0]_1 + r2(S,B)[P(S)′p0]_2 ) = (3/4 × 1/3) / (3/4 × 1/3 + 1/4 × 2/3) = 3/5.

• For u0 = C, z1 = G:

p_1^1 = r1(C,G)[P(C)′p0]_1 / ( r1(C,G)[P(C)′p0]_1 + r2(C,G)[P(C)′p0]_2 ) = (1/4)(1/3 + (2/3)p_0^1) / [ (1/4)(1/3 + (2/3)p_0^1) + (3/4)(2/3 − (2/3)p_0^1) ] = (1 + 2p_0^1) / (7 − 4p_0^1).


• For u0 = C, z1 = B:

p_1^1 = r1(C,B)[P(C)′p0]_1 / ( r1(C,B)[P(C)′p0]_1 + r2(C,B)[P(C)′p0]_2 ) = (3/4)(1/3 + (2/3)p_0^1) / [ (3/4)(1/3 + (2/3)p_0^1) + (1/4)(2/3 − (2/3)p_0^1) ] = (3 + 6p_0^1) / (5 + 4p_0^1).

In summary, we get

p_1^1 = [Φ0(p0, u0, z1)]_1 =

1/7   if u0 = S, z1 = G,

3/5   if u0 = S, z1 = B,

(1 + 2p_0^1)/(7 − 4p_0^1)   if u0 = C, z1 = G,

(3 + 6p_0^1)/(5 + 4p_0^1)   if u0 = C, z1 = B,

where p_0^2 = 1 − p_0^1 and p_1^2 = 1 − p_1^1.

The DP algorithm (4.3.6) may be written as:

J2(p2) = 0 (i.e., zero terminal cost),

and

J1(p1) = min_{u1 ∈ {S,C}} p1′ g(u1)

= min{ (p_1^1, p_1^2)(1, 1)′ ,  (p_1^1, p_1^2)(2, 0)′ }     (first term: u1 = S; second term: u1 = C)

= min{ p_1^1 + p_1^2 ,  2p_1^1 }

= min{ 1 ,  2p_1^1 }.

This minimization yields

µ*_1(p1) = C if p_1^1 ≤ 1/2,   S if p_1^1 > 1/2.

For stage k = 0, we have

J0(p0) = min_{u0 ∈ {C,S}} { p0′ g(u0) + E_{z1} [ J1(Φ0(p0, u0, z1)) | p0, u0 ] }

= min{ 2p_0^1 + P(z1 = G|I0, C) J1(Φ0(p0, C, G)) + P(z1 = B|I0, C) J1(Φ0(p0, C, B)),     (u0 = C)

(p_0^1 + p_0^2) + P(z1 = G|I0, S) J1(Φ0(p0, S, G)) + P(z1 = B|I0, S) J1(Φ0(p0, S, B)) },     (u0 = S)

where p_0^1 + p_0^2 = 1.

The probabilities here may be expressed in terms of p0 by using the expression in the denominator

of (4.3.5); that is:

P(zk+1|Ik, uk) = Σ_{s=1}^n Σ_{i=1}^n P(xk = i|Ik) P(xk+1 = s|xk = i, uk) P(zk+1|uk, xk+1 = s)

= Σ_{s=1}^n Σ_{i=1}^n p_k^i pis(uk) rs(uk, zk+1)

= Σ_{s=1}^n rs(uk, zk+1) [P(uk)′pk]_s.


In our case:

P(z1 = G|I0, u0 = C) = r1(C, G)[P(C)′p0]_1 + r2(C, G)[P(C)′p0]_2

= (1/4)(1/3 + (2/3)p_0^1) + (3/4)(2/3 − (2/3)p_0^1) = (7 − 4p_0^1)/12.

Similarly, we obtain:

P(z1 = B|I0, C) = (5 + 4p_0^1)/12,   P(z1 = G|I0, S) = 7/12,   P(z1 = B|I0, S) = 5/12.

Using these values we have

J0(p0) = min{ 2p_0^1 + ((7 − 4p_0^1)/12) J1( (1 + 2p_0^1)/(7 − 4p_0^1), 1 − p_1^1 ) + ((5 + 4p_0^1)/12) J1( (3 + 6p_0^1)/(5 + 4p_0^1), 1 − p_1^1 ),

1 + (7/12) J1(1/7, 6/7) + (5/12) J1(3/5, 2/5) },

where, in each J1 term of the first (u0 = C) expression, the first argument is the corresponding value of p_1^1.

By substitution of J1(p1) and after some algebra we obtain

J0(p0) = 19/12   if 3/8 ≤ p_0^1 ≤ 1,

J0(p0) = (7 + 32p_0^1)/12   if 0 ≤ p_0^1 ≤ 3/8,

and an optimal control for the first stage

µ*_0(p0) = C if p_0^1 ≤ 3/8,   S if p_0^1 > 3/8.

Also, we know that P(z0 = G) = 7/12 and P(z0 = B) = 5/12. In addition, we can establish the initial value of p_0^1 according to the value of I0 (i.e., z0):

P(x0 = P̄ | z0 = G) = P(x0 = P̄, z0 = G) / P(z0 = G) = (1/3 × 1/4) / (7/12) = 1/7,

and

P(x0 = P̄ | z0 = B) = P(x0 = P̄, z0 = B) / P(z0 = B) = (1/3 × 3/4) / (5/12) = 3/5,

so that the formula

J* = E_{z0} [ J0(P_{x0|z0}) ] = (7/12) J0(1/7, 6/7) + (5/12) J0(3/5, 2/5) = 176/144

yields the same optimal cost as the one obtained above by means of the general DP algorithm for

problems with imperfect state information.

Observe also that the functions Jk are piecewise linear (and concave) in this case, in agreement with what we said above for the general finite-state case.
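As a numerical check of the example, the short Python script below carries out the two-stage belief-state DP directly from the data of the problem and reproduces the optimal cost 176/144 ≈ 1.222; all numbers come from the example itself.

    import numpy as np

    # State 1 = bad (P-bar), state 2 = good (P); data taken from the example above.
    P = {'C': np.array([[1.0, 0.0], [1/3, 2/3]]),
         'S': np.array([[1/3, 2/3], [1/3, 2/3]])}               # transition matrices P(u)
    g = {'C': np.array([2.0, 0.0]), 'S': np.array([1.0, 1.0])}  # stage costs g(u)
    r = {'G': np.array([1/4, 3/4]), 'B': np.array([3/4, 1/4])}  # r_j(u, z), independent of u here

    def belief_update(p, u, z):
        """Bayesian update (4.3.5); returns the new belief and P(z | p, u)."""
        w = r[z] * (P[u].T @ p)
        return w / w.sum(), w.sum()

    def J1(p):
        return min(p @ g['S'], p @ g['C'])                      # = min(1, 2 p^1)

    def J0(p):
        best = np.inf
        for u in ('C', 'S'):
            cost = p @ g[u]
            for z in ('G', 'B'):
                p_next, prob_z = belief_update(p, u, z)
                cost += prob_z * J1(p_next)
            best = min(best, cost)
        return best

    prior = np.array([1/3, 2/3])                                 # distribution of x0
    J_star = 0.0
    for z0 in ('G', 'B'):
        w = r[z0] * prior
        J_star += w.sum() * J0(w / w.sum())                      # beliefs (1/7, 6/7) and (3/5, 2/5)
    print(J_star, 176/144)                                       # both print ~1.2222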


4.4 Exercises

Exercise 4.4.1 Take the linear system and measurement equation for the LQ-system with imperfect state information. Consider the problem of finding a policy {µ*_0(I0), . . . , µ*_{N−1}(IN−1)} that minimizes the quadratic cost

E[ xN′ Q xN + Σ_{k=0}^{N−1} uk′ Rk uk ].

Assume, however, that the random vectors x0, w0, . . . , wN−1, v0, . . . , vN−1 are correlated and have a given joint probability distribution, and finite first and second moments. Show that the optimal policy is given by

µ∗k(Ik) = Lk E[yk|Ik],

where the gain matrices Lk are obtained from the algorithm

Lk = −(B′kKk+1Bk +Rk)−1B′kKk+1Ak,

KN = Q,

Kk = A′k(Kk+1 −Kk+1Bk(B′kKk+1Bk +Rk)−1B′kKk+1)Ak,

and the vectors yk are given by yN = xN, and

yk = xk + Ak⁻¹ wk + Ak⁻¹ Ak+1⁻¹ wk+1 + · · · + Ak⁻¹ · · · AN−1⁻¹ wN−1,   for k = 0, . . . , N − 1

(assuming the matrices A0, A1, . . . , AN−1 are invertible).

Hint: Show that the cost can be written as

E[ y0′ K0 y0 + Σ_{k=0}^{N−1} (uk − Lk yk)′ Pk (uk − Lk yk) ],

where Pk = Bk′ Kk+1 Bk + Rk.

Exercise 4.4.2 Consider the scalar, imperfect state information system

xk+1 = xk + uk + wk,

zk = xk + vk,

where we assume that the initial state x0, and the disturbances wk and vk, are all independent. Let the cost be

E[ xN² + Σ_{k=0}^{N−1} (xk² + uk²) ],

and let the given probability distributions be

P(x0 = 2) = 1/2, P(wk = 1) = 1/2, P(vk = 1/4) = 1/2,

P(x0 = −2) = 1/2, P(wk = −1) = 1/2, P(vk = −1/4) = 1/2.

(a) Show that this problem can be transformed into a perfect information problem, where we first infer the value of x0, and then sequentially compute the values x1, . . . , xN. Determine the optimal policy. Hint: For this problem, xk can be determined from xk−1, uk−1, and zk.


(b) Determine the policy that is identical to the optimal one except that it uses a linear least squares estimator of xk given Ik in place of E[xk|Ik].

Exercise 4.4.3 A linear system with Gaussian disturbances and Gaussian initial state

xk+1 = Axk + Buk + wk,

is to be controlled so as to minimize a quadratic cost similar to that discussed above. The difference is that the controller has the option of choosing at each time k one of two types of measurement equations for the next stage k + 1:

First type: zk+1 = C1xk+1 + v1k+1,

Second type: zk+1 = C2xk+1 + v2k+1.

Here, C1 and C2 are given matrices of appropriate dimension, and v_k^1 and v_k^2 are zero-mean, independent random sequences with given finite covariances that do not depend on x0 and wk. There is a cost g1 (or g2) each time a measurement of type 1 (or type 2) is taken. The problem is to find the optimal control and measurement selection policy that minimizes the expected value of the sum of the quadratic cost

xN′ Q xN + Σ_{k=0}^{N−1} (xk′ Q xk + uk′ R uk)

and the total measurement cost. Assume for convenience that N = 2 and that the first measurement z0 is of type 1. Show that the optimal measurement selection at times k = 0 and k = 1 does not depend on the value of the information vectors I0 and I1, and can be determined a priori. Describe the nature of the optimal policy.

Exercise 4.4.4 Consider a machine that can be in one of two states, good or bad. Suppose that the machine produces an item at the end of each period. The item produced is either good or bad depending on whether the machine is in good or bad state at the beginning of the corresponding period, respectively. We suppose that once the machine is in a bad state it remains in that state until it is replaced. If the machine is in a good state at the beginning of a certain period, then with probability t it will be in the bad state at the end of the period. Once an item is produced, we may inspect the item at a cost I, or not inspect. If an inspected item is found to be bad, the machine is replaced with a machine in good state at a cost R. The cost for producing a bad item is C > 0. Write a DP algorithm for obtaining an optimal inspection policy assuming a machine is initially in good state and a horizon of N periods. Then, solve the problem for t = 0.2, I = 1, R = 3, C = 2, and N = 8.

Hint: Define

xk = state at the beginning of the kth stage ∈ {Good, Bad}

wk = state at the end of the kth stage before an action is taken

uk ∈ {Inspect, No inspect}

Take as information vector the stage at which the last inspection was made.


Exercise 4.4.5 A person is offered 2 to 1 odds in a coin-tossing game where he wins whenever a tail occurs. However, he suspects that the coin is biased and has an a priori probability distribution F(p) for the probability p that a head occurs at each toss. The problem is to find an optimal policy for deciding whether to continue or stop participating in the game, given the outcomes of the game so far. A maximum of N tosses is allowed. Indicate how such a policy can be found by means of DP. Specify the update rule for the belief about p.

Hint: Define the state as nk, the number of heads observed in the first k flips.


Chapter 5

Infinite Horizon Problems

5.1 Types of infinite horizon problems

• Setting similar to the basic finite horizon problem, but:

– The number of stages is infinite.

– The system is stationary.

• Simpler version: Assume finite number of states. (We will keep this assumption)

• Total cost problems: Minimize over all admissible policies π,

Jπ(x0) = lim_{N→∞} E_{w0,w1,...} [ Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) ].

The value function Jπ(x0) should be finite for at least some admissible policies π and some initial states x0.

Variants of total cost problems:

(a) Stochastic shortest path problems (α = 1): These require a cost-free terminal state t that is reached in finite time w.p.1.

(b) Discounted problems (α < 1) with bounded cost per stage, i.e., |g(x, u, w)| < M.

Here, the kth-stage cost is dominated by the decreasing geometric progression α^k M, so Jπ(x0) is well defined and finite.

(c) Discounted and non-discounted problems with unbounded cost per stage.

Here, α ≤ 1, but |g(x, u, w)| could be unbounded. Technically more challenging!

• Average cost problems (type (d)): Minimize over all admissible policies π,

Jπ(x0) = lim_{N→∞} (1/N) E_{w0,w1,...} [ Σ_{k=0}^{N−1} g(xk, µk(xk), wk) ].

The approach works even if the total cost is infinite for every policy π and initial state x0.


5.1.1 Preview of infinite horizon results

• Key issue: The relation between the infinite and finite horizon optimal cost-to-go functions.

• Illustration: Let α = 1 and let JN(x) denote the optimal cost of the N-stage problem, generated after N iterations of the DP algorithm, starting from J0(x) = 0 and proceeding with

Jk+1(x) = min_{u ∈ U(x)} E_w [ g(x, u, w) + Jk(f(x, u, w)) ],   ∀x.   (5.1.1)

Typical results for total cost problems:

– Relation valuable from a computational viewpoint:

J*(x) = lim_{N→∞} JN(x),   ∀x.   (5.1.2)

It holds for problems (a) and (b); there are some unusual exceptions for problems (c).

– The limiting form of the DP algorithm should hold for all states x,

J*(x) = min_{u ∈ U(x)} E_w [ g(x, u, w) + J*(f(x, u, w)) ],   ∀x.   (Bellman's equation)

– If µ(x) minimizes the RHS in Bellman's equation for each x, the policy π = {µ, µ, . . .} is optimal. This is true for most infinite horizon problems of interest (and in particular, for problems (a) and (b)).

5.1.2 Total cost problem formulation

• We assume an underlying system equation

xk+1 = wk.

• At state i, the use of a control u specifies the transition probability pij(u) to the next state j.

• The control u is constrained to take values in a given finite constraint set U(i), where i is the current state.

• We will assume a kth stage cost g(xk, uk) for using control uk at state xk. If g(i, u, j) is the cost of using u at state i and moving to state j, we use as cost-per-stage the expected cost g(i, u) given by

g(i, u) = Σ_j pij(u) g(i, u, j).

• The total expected cost associated with an initial state i and a policy π = {µ0, µ1, . . .} is

Jπ(i) = lim_{N→∞} E [ Σ_{k=0}^{N−1} α^k g(xk, µk(xk)) | x0 = i ],

where α is a discount factor, with 0 < α ≤ 1.

• Optimal cost from state i is J∗(i) = minπ Jπ(i).


• Stationary policy: Admissible policy (i.e., µk(xk) ∈ U(xk)) of the form

π = {µ, µ, . . .},

with corresponding cost function Jµ(i).

The stationary policy µ is optimal if

Jµ(i) = J*(i) = min_π Jπ(i),   ∀i.

5.2 Stochastic shortest path problems

• Assume there is no discounting (i.e., α = 1).

• Set of “normal states” 1, 2, . . . , n.

• There is a special, cost-free, absorbing, terminal state t. That is, ptt = 1, and g(t, u) = 0 for all u ∈ U(t).

• Objective: Reach the terminal state with minimum expected cost.

• Assumption 5.2.1 There exists an integer m such that for every policy and initial state, there is a positive probability that the termination state t will be reached in at most m stages. Then, for all π, we have

ρπ = P{xm ≠ t | x0 ≠ t, π} < 1.

That is, P{xm = t | x0 ≠ t, π} > 0.

• In terms of discrete-time Markov chains, Assumption 5.2.1 is claiming that t is accessible from any state i.

• Remark: Assumption 5.2.1 is requiring that all policies are proper. A stationary policy is proper if, when using it, there is a positive probability that the destination will be reached after at most n stages. Otherwise, it is improper.

However, the results to be presented can be proved under the following weaker conditions:

1. There exists at least one proper policy.

2. For every improper policy π, the corresponding cost Jπ(i) is ∞ for at least one state i.

• Note that the assumption implies that

P{xm ≠ t | x0 = i, π} ≤ P{xm ≠ t | x0 ≠ t, π} = ρπ < 1,   ∀i = 1, . . . , n.

• Let ρ = max_π ρπ.

Since the number of controls available at each state is finite, the number of distinct m-stage policies is also finite. So, there can be only a finite number of values of ρπ, so that the max above is well defined (we do not need sup). Then,

P{xm ≠ t | x0 ≠ t, π} ≤ ρ < 1.


• For any π and any initial state i,

P{x2m ≠ t | x0 = i, π} = P{x2m ≠ t | xm ≠ t, x0 = i, π} × P{xm ≠ t | x0 = i, π}
 ≤ P{x2m ≠ t | xm ≠ t, π} × P{xm ≠ t | x0 ≠ t, π} ≤ ρ²,

and similarly, P{xkm ≠ t | x0 = i, π} ≤ ρ^k, i = 1, . . . , n.

• So,

| E[ cost between times km and (k + 1)m − 1 ] | ≤ m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)|,   (5.2.1)

where m is the number of stages in such a cycle and max_{i,u} |g(i, u)| bounds each stage's instantaneous cost, and hence,

|Jπ(i)| ≤ Σ_{k=0}^{∞} m ρ^k max_{i=1,...,n, u∈U(i)} |g(i, u)| = ( m / (1 − ρ) ) max_{i=1,...,n, u∈U(i)} |g(i, u)|.

• The key idea for the main result (to be presented below) is that the tail of the cost series vanishes, i.e.,

lim_{K→∞} Σ_{k=mK}^{∞} E[ g(xk, µk(xk)) ] = 0.

The reason is that lim_{K→∞} P{xmK ≠ t | x0 = i, π} = 0.

Proposition 5.2.1 Under Assumption 5.2.1, the following hold for the stochastic shortest path problem:

(a) Given any initial conditions J0(1), . . . , J0(n), the sequence Jk(i) generated by the DP iteration

Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],   ∀i,   (5.2.2)

converges to the optimal cost J*(i).

(b) The optimal costs J*(1), . . . , J*(n) satisfy Bellman's equation,

J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) J*(j) ],   i = 1, . . . , n,   (5.2.3)

and in fact they are the unique solution of this equation.

(c) For any stationary policy µ, the costs Jµ(1), . . . , Jµ(n) are the unique solution of the equation

Jµ(i) = g(i, µ(i)) + Σ_{j=1}^n pij(µ(i)) Jµ(j),   i = 1, . . . , n.


Furthermore, given any initial conditions J0(1), . . . , J0(n), the sequence Jk(i) generated by the DP iteration

Jk+1(i) = g(i, µ(i)) + Σ_{j=1}^n pij(µ(i)) Jk(j),   i = 1, . . . , n,

converges to the cost Jµ(i) for each i.

(d) A stationary policy µ is optimal if and only if for every state i, µ(i) attains the minimum in Bellman's equation (5.2.3).

Proof: Following the labeling of the proposition:

(a) For every positive integer K, initial state x0, and policy π = {µ0, µ1, . . .}, we break down the cost Jπ(x0) as follows:

Jπ(x0) = lim_{N→∞} E [ Σ_{k=0}^{N−1} g(xk, µk(xk)) ]

= E [ Σ_{k=0}^{mK−1} g(xk, µk(xk)) ] + lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(xk, µk(xk)) ].   (5.2.4)

Let M be an upper bound on the cost of an m-stage cycle, assuming t is not reached during the cycle, i.e.,

M = m max_{i=1,...,n, u∈U(i)} |g(i, u)|.

Recall from (5.2.1) that

| E[ cost during the Kth cycle, between stages Km and (K + 1)m − 1 ] | ≤ M ρ^K,   (5.2.5)

so that

| lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(xk, µk(xk)) ] | ≤ M Σ_{k=K}^{∞} ρ^k = ρ^K M / (1 − ρ).   (5.2.6)

Also, setting J0(t) = 0, let us view J0 as a terminal cost function. We will provide a bound for its expected value under the policy π applied over mK stages. Starting from x0 ≠ t, J0(xmK) is the terminal cost assessed at the state xmK reached after mK steps. So,

| E[ J0(xmK) ] | = | Σ_{i=1}^n P{xmK = i | x0 ≠ t, π} J0(i) + P{xmK = t | x0 ≠ t, π} J0(t) |

= | Σ_{i=1}^n P{xmK = i | x0 ≠ t, π} J0(i) |

≤ ( Σ_{i=1}^n P{xmK = i | x0 ≠ t, π} ) max_{i=1,...,n} |J0(i)|

≤ ρ^K max_{i=1,...,n} |J0(i)|,   (5.2.7)

where the second equality uses J0(t) = 0, and the last inequality uses Σ_{i=1}^n P{xmK = i | x0 ≠ t, π} = P{xmK ≠ t | x0 ≠ t, π} ≤ ρ^K.


Now, combining equations (5.2.6) and (5.2.7), we set the following bound:

| E[J0(xmK)] − lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(xk, µk(xk)) ] |
 ≤ | E[J0(xmK)] | + | lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(xk, µk(xk)) ] |
 ≤ ρ^K max_{i=1,...,n} |J0(i)| + M Σ_{k=K}^{∞} ρ^k
 = ρ^K max_{i=1,...,n} |J0(i)| + ρ^K M / (1 − ρ).

Taking the LHS above and using (5.2.4), we have

| E[J0(xmK)] − lim_{N→∞} E [ Σ_{k=mK}^{N−1} g(xk, µk(xk)) ] | = | E[J0(xmK)] + E [ Σ_{k=0}^{mK−1} g(xk, µk(xk)) ] − Jπ(x0) |.

Then, we get the bounds

−ρ^K max_{i=1,...,n} |J0(i)| + Jπ(x0) − ρ^K M/(1 − ρ) ≤ E[J0(xmK)] + E [ Σ_{k=0}^{mK−1} g(xk, µk(xk)) ] ≤ ρ^K max_{i=1,...,n} |J0(i)| + Jπ(x0) + ρ^K M/(1 − ρ).   (5.2.8)

Note that

• The middle term above is the mK-stage cost of policy π, starting from state x0, with terminal cost J0(xmK).

• The minimum of this mK-stage cost over all π is equal to the value JmK(x0), which is generated by the DP recursion (5.2.2) after mK iterations.

Thus, taking the minimum over π in equation (5.2.8), we obtain for all x0 and K,

−ρ^K max_{i=1,...,n} |J0(i)| + J*(x0) − ρ^K M/(1 − ρ) ≤ JmK(x0) ≤ ρ^K max_{i=1,...,n} |J0(i)| + J*(x0) + ρ^K M/(1 − ρ).   (5.2.9)

Taking the limit as K → ∞, the terms in the LHS and RHS involving ρ^K go to 0, leading to

lim_{K→∞} JmK(x0) = J*(x0),   ∀x0.

Since from (5.2.5)

| JmK+q(x0) − JmK(x0) | ≤ ρ^K M,   q = 0, . . . , m − 1,

we see that for q = 0, . . . , m − 1,

−ρ^K M + JmK(x0) ≤ JmK+q(x0) ≤ JmK(x0) + ρ^K M.

Taking the limit as K → ∞, we get

lim_{K→∞} ( ρ^K M + JmK(x0) ) = J*(x0).

Thus, for any q = 0, . . . , m − 1,

lim_{K→∞} JmK+q(x0) = J*(x0),

and hence lim_{k→∞} Jk(x0) = J*(x0).


(b) Existence: By taking the limit as k → ∞ in the DP iteration (5.2.2), and using the convergence result of part (a), we obtain that J*(1), . . . , J*(n) satisfy Bellman's equation.

Uniqueness: If J(1), . . . , J(n) satisfy Bellman's equation, then the DP iteration (5.2.2) starting from J(1), . . . , J(n) just replicates J(1), . . . , J(n). Then, from the convergence result of part (a), J(i) = J*(i), i = 1, . . . , n.

(c) Given a stationary policy µ, redefine the control constraint sets to be Ũ(i) = {µ(i)} instead of U(i). From part (b), we then obtain that Jµ(1), . . . , Jµ(n) solve uniquely Bellman's equation for this redefined problem; i.e.,

Jµ(i) = g(i, µ(i)) + Σ_{j=1}^n pij(µ(i)) Jµ(j),   i = 1, . . . , n,

and from part (a) it follows that the corresponding DP iteration converges to Jµ(i).

(d) We have that µ(i) attains the minimum in equation (5.2.3) if and only if

J*(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) J*(j) ] = g(i, µ(i)) + Σ_{j=1}^n pij(µ(i)) J*(j),   i = 1, . . . , n.

This equation and part (c) imply that Jµ(i) = J*(i) for all i. Conversely, if Jµ(i) = J*(i) for all i, parts (b) and (c) imply the above equation.

This completes the proof of the four parts of the proposition.

Observation: Part (c) provides a way to compute Jµ(i), i = 1, . . . , n, for a given stationary policy µ, but the computation is substantial for large n (of order O(n³)).

Example 5.2.1 (Minimizing Expected Time to Termination)

• Let g(i, u) = 1, ∀i = 1, . . . , n, u ∈ U(i).

• Under our assumptions, the costs J*(i) uniquely solve Bellman's equation, which has the form

J*(i) = min_{u ∈ U(i)} [ 1 + Σ_{j=1}^n pij(u) J*(j) ],   i = 1, . . . , n.

• In the special case where there is only one control at each state, J*(i) is the mean first passage time from i to t. These times, denoted mi, are the unique solution of the equations

mi = 1 + Σ_{j=1}^n pij mj,   i = 1, . . . , n.

Recall that in a discrete-time Markov chain, if there is only one recurrent class and t is a state

of that class (in our case, the only recurrent class is given by t), the mean first passage times

from i to t are the unique solution to the previous system of linear equations.
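For a concrete instance of this linear system, the short sketch below solves mi = 1 + Σj pij mj with numpy; the transition probabilities are illustrative placeholders (the probability mass missing from each row is the probability of jumping directly to t).

    import numpy as np

    # Illustrative chain with three non-terminal states; rows/columns are states 1, 2, 3,
    # and the mass missing from each row goes to the terminal state t.
    P = np.array([[0.2, 0.5, 0.1],
                  [0.3, 0.0, 0.4],
                  [0.0, 0.2, 0.3]])

    # Solve (I - P) m = 1, i.e., m_i = 1 + sum_j p_ij m_j.
    m = np.linalg.solve(np.eye(3) - P, np.ones(3))
    print(m)        # mean first passage times from states 1, 2, 3 to t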


Example 5.2.2 (Spider and a fly)

• A spider and a fly move along a straight line.

• At the beginning of each period, the spider knows the position of the fly.

• The fly moves one unit to the left w.p. p, one unit to the right w.p. p, and stays where it is

w.p. 1− 2p.

• The spider moves one unit towards the fly if its distance from the fly is more than one unit.

• If the spider is one unit away from the fly, it will either move one unit towards the fly or stay where

it is.

• If the spider and the fly land in the same position, the spider captures the fly.

• The spider’s objective is to capture the fly in minimum expected time.

• The initial distance between the spider and the fly is n.

• This is a stochastic shortest path problem with state i =distance between spider and fly, with

i = 1, . . . , n, and t = 0 the termination state.

• There is control choice only at state 1. Otherwise, the spider simply moves towards the fly.

• Assume that the controls (in state 1) are M = move and M̄ = don't move.

• The transition probabilities from state 1 when using control M are described in Figure 5.2.1.

[Diagram: with the fly (F) one unit away from the spider (S), p11(M) = 2p, corresponding to the two situations in which the fly moves one unit to the left or to the right; p10(M) = 1 − 2p, when the fly does not move.]

Figure 5.2.1: Transition probabilities for control M from state 1.

Other probabilities are:

p12(M̄) = p,   p11(M̄) = 1 − 2p,   p10(M̄) = p,

and for i ≥ 2,

pii = p, pi(i−1) = 1− 2p, pi(i−2) = p.

All other transition probabilities are zero.


Bellman’s equation:

J∗(i) = 1 + pJ∗(i) + (1− 2p)J∗(i− 1) + pJ∗(i− 2), i ≥ 2.

J∗(1) = 1 + min2pJ∗(1)︸ ︷︷ ︸M

, pJ∗(2) + (1− 2p)J∗(1)︸ ︷︷ ︸M

,

with J∗(0) = 0.

In order to solve the Bellman’s equation, we proceed as follows: First, note that

J∗(2) = 1 + pJ∗(2) + (1− 2p)J∗(1).

Then, substitute J∗(2) in the equation for J∗(1), getting:

J∗(1) = 1 + min

2pJ∗(1),p

1− p+

(1− 2p)J∗(1)1− p

.

Next, we work from here to find that when one unit away from the fly, it is optimal to use M if and

only if p ≥ 1/3. Moreover, it can be verified that

J∗(1) =

1/(1− 2p) if p ≤ 1/3,1/p if p ≥ 1/3.

Given J∗(1), we can compute J∗(2), and then J∗(i), for all i ≥ 3.
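A small Python check of these formulas: it computes J*(1) from the expressions above and then J*(i) for i ≥ 2 by forward substitution in Bellman's equation. The probe values of p are arbitrary illustrative choices.

    def spider_fly_costs(p, n=6):
        """Expected capture times J*(0), ..., J*(n) for a given fly-move probability p."""
        J = [0.0] * (n + 1)
        J[1] = 1.0 / (1.0 - 2.0 * p) if p <= 1.0 / 3.0 else 1.0 / p
        for i in range(2, n + 1):
            # J*(i) = 1 + p J*(i) + (1-2p) J*(i-1) + p J*(i-2), solved for J*(i)
            J[i] = (1.0 + (1.0 - 2.0 * p) * J[i - 1] + p * J[i - 2]) / (1.0 - p)
        return J

    for p in (0.2, 1.0 / 3.0, 0.45):
        print(p, [round(v, 3) for v in spider_fly_costs(p)])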

5.2.1 Computational approaches

There are three main computational approaches used in practice for calculating the optimal cost function J*.

Value iteration

The DP iteration

Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) Jk(j) ],   i = 1, . . . , n,

is called value iteration.

From equation (5.2.9), we know that the error satisfies |JmK(i) − J*(i)| ≤ Dρ^K, for some constant D.

The value iteration algorithm can sometimes be strengthened with the use of error bounds (they provide a useful guideline for stopping the value iteration algorithm while being assured that Jk approximates J* with sufficient accuracy). In particular, it can be shown that for all k and j, we have

Jk+1(j) + (N*(j) − 1) c_k ≤ J*(j) ≤ Jµk(j) ≤ Jk+1(j) + (Nk(j) − 1) c̄_k,

where


• µk is such that µk(i) attains the minimum in the kth iteration for all i,

• N*(j) = average number of stages to reach t starting from j and using some optimal stationary policy,

• Nk(j) = average number of stages to reach t starting from j and using the stationary policy µk,

• c_k = min_{i=1,...,n} [ Jk+1(i) − Jk(i) ],

• c̄_k = max_{i=1,...,n} [ Jk+1(i) − Jk(i) ].

Unfortunately, the values N∗(j) and Nk(j) are easily computed or approximated only in some cases.
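To make value iteration concrete, here is a minimal Python sketch for an SSP with two controls; the matrices p[u] collect the transition probabilities among the non-terminal states (the mass missing from each row is the probability of reaching t), and all numbers are illustrative placeholders.

    import numpy as np

    n = 3
    p = {0: np.array([[0.0, 0.5, 0.3],     # control 0
                      [0.2, 0.0, 0.4],
                      [0.1, 0.1, 0.0]]),
         1: np.array([[0.1, 0.2, 0.2],     # control 1
                      [0.0, 0.3, 0.3],
                      [0.2, 0.0, 0.1]])}
    g = {0: np.array([2.0, 1.0, 3.0]),
         1: np.array([1.0, 2.0, 1.0])}     # expected stage costs g(i, u)

    J = np.zeros(n)                         # any initial conditions J_0(1), ..., J_0(n)
    for _ in range(1000):
        J_new = np.min([g[u] + p[u] @ J for u in sorted(p)], axis=0)   # one step of (5.2.2)
        if np.max(np.abs(J_new - J)) < 1e-10:
            break
        J = J_new
    policy = np.argmin([g[u] + p[u] @ J for u in sorted(p)], axis=0)
    print(J, policy)                        # approximate J* and a minimizing stationary policy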

Policy iteration

• It generates a sequence µ1, µ2, . . . of stationary policies, starting with any stationary policy µ0.

• At a typical iteration, given a policy µk, we perform two steps:

(i) Policy evaluation step: Computes Jµk(i) as the solution of the linear system of equations

J(i) = g(i, µk(i)) + Σ_{j=1}^n pij(µk(i)) J(j),   i = 1, . . . , n,   (5.2.10)

in the unknowns J(1), . . . , J(n).

(ii) Policy improvement step: Computes a new policy µk+1 as

µk+1(i) = arg min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) Jµk(j) ],   i = 1, . . . , n.   (5.2.11)

• The algorithm stops when Jµk(i) = Jµk+1(i) for all i.
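Below is a sketch of this two-step scheme in Python, reusing the kind of data (p[u], g[u]) introduced in the value-iteration sketch above; the evaluation step solves the linear system (5.2.10) directly. The helper name policy_iteration is ours, not from the text.

    import numpy as np

    def policy_iteration(p, g, n):
        """Assumes controls labeled 0, 1, ... and that every stationary policy is proper,
        so that (I - P_mu) is invertible in the evaluation step."""
        mu = np.zeros(n, dtype=int)                  # start with an arbitrary stationary policy
        while True:
            # Policy evaluation step (5.2.10): solve (I - P_mu) J = g_mu.
            P_mu = np.array([p[mu[i]][i] for i in range(n)])
            g_mu = np.array([g[mu[i]][i] for i in range(n)])
            J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
            # Policy improvement step (5.2.11).
            mu_new = np.argmin([g[u] + p[u] @ J for u in sorted(p)], axis=0)
            if np.array_equal(mu_new, mu):
                return J, mu                         # stop: J_mu(i) = J_mu'(i) for all i
            mu = mu_new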

Proposition 5.2.2 Under Assumption 5.2.1, the policy iteration algorithm for the stochastic shortest path problem generates an improving sequence of policies (i.e., Jµk+1(i) ≤ Jµk(i), ∀i, k) and terminates with an optimal policy.

Proof: For any k, consider the sequence generated by the recursion

JN+1(i) = g(i, µk+1(i)) + Σ_{j=1}^n pij(µk+1(i)) JN(j),   i = 1, . . . , n,   (5.2.12)

where N = 0, 1, . . ., starting from the solution to equation (5.2.10):

J0(i) = Jµk(i),   i = 1, . . . , n.


From equation (5.2.10), we have

J0(i) = g(i, µk(i)) + Σ_{j=1}^n pij(µk(i)) J0(j)

≥ g(i, µk+1(i)) + Σ_{j=1}^n pij(µk+1(i)) J0(j)     (from (5.2.11))

= J1(i),   ∀i     (from iteration (5.2.12)).

By using the above inequality we obtain

J1(i) = g(i, µk+1(i)) + Σ_{j=1}^n pij(µk+1(i)) J0(j)

≥ g(i, µk+1(i)) + Σ_{j=1}^n pij(µk+1(i)) J1(j)     (because J0(i) ≥ J1(i))

= J2(i),   ∀i     (from iteration (5.2.12)).   (5.2.13)

Continuing similarly we get

J0(i) ≥ J1(i) ≥ · · · ≥ JN(i) ≥ JN+1(i) ≥ · · · ,   i = 1, . . . , n.   (5.2.14)

Since by Proposition 5.2.1(c), JN(i) → Jµk+1(i), we obtain

J0(i) ≥ Jµk+1(i)   ⇒   Jµk(i) ≥ Jµk+1(i),   i = 1, . . . , n,   k = 0, 1, . . .   (5.2.15)

Thus, the sequence of generated policies is improving, and since the number of stationary policies is finite, we must, after a finite number of iterations (say, k + 1), obtain Jµk(i) = Jµk+1(i) for all i.

Then, we will have equality holding throughout equation (5.2.15), which in particular means, from (5.2.12),

J0(i) = Jµk(i) = J1(i) = g(i, µk+1(i)) + Σ_{j=1}^n pij(µk+1(i)) Jµk(j),

and in particular,

Jµk(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) Jµk(j) ],   i = 1, . . . , n.

Thus, the costs Jµk(1), . . . , Jµk(n) solve Bellman's equation and, by Proposition 5.2.1(b), it follows that Jµk(i) = J*(i), and that µk(i) is optimal.

Linear programming

Claim: J∗ is the “largest” J that satisfies the constraints

J(i) ≤ g(i, u) + Σ_{j=1}^n pij(u) J(j),   (5.2.16)

for all i = 1, . . . , n, and u ∈ U(i).


Proof: Assume that J0(i) ≤ J1(i), where J1(i) is generated through value iteration; i.e.,

J0(i) ≤ min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n pij(u) J0(j) ],   i = 1, . . . , n.

Then, because of the stationarity of the problem and the monotonicity property of DP, we will have Jk(i) ≤ Jk+1(i), for all k and i. From Proposition 5.2.1(a), the value iteration sequence converges to J*(i), so that J0(i) ≤ J*(i), for all i.

Hence, J* = (J*(1), . . . , J*(n)) is the solution of the linear program

max Σ_{i=1}^n J(i),

subject to the constraints (5.2.16).

Figure 5.2.2 illustrates the linear program associated with a two-state stochastic shortest path problem. The decision variables in this case are J(1) and J(2).

[Figure: in the (J(1), J(2)) plane, the four lines J(1) = g(1, u) + p11(u)J(1) + p12(u)J(2) and J(2) = g(2, u) + p21(u)J(1) + p22(u)J(2), for u ∈ {u1, u2}, bound the feasible region of the LP; the optimal J* = (J*(1), J*(2)) is the extreme point at which the objective is maximized.]

Figure 5.2.2: Illustration of the LP solution method for infinite horizon DP.

Drawback: For large n, the dimension of this program is very large. Furthermore, the number of constraints is equal to the number of state-control pairs.
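For small problems, however, the linear program is easy to set up; the following sketch uses SciPy's linprog on the same kind of illustrative SSP data (p[u], g[u]) introduced in the value-iteration sketch. The helper name solve_ssp_lp is ours, not from the text.

    import numpy as np
    from scipy.optimize import linprog

    def solve_ssp_lp(p, g, n):
        """Maximize sum_i J(i) subject to J(i) <= g(i,u) + sum_j p_ij(u) J(j) for all (i, u)."""
        A_ub, b_ub = [], []
        for u in p:
            A_ub.append(np.eye(n) - p[u])      # J(i) - sum_j p_ij(u) J(j) <= g(i, u)
            b_ub.append(g[u])
        res = linprog(c=-np.ones(n),           # maximizing sum J(i) = minimizing -sum J(i)
                      A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                      bounds=[(None, None)] * n)
        return res.x                           # J*(1), ..., J*(n)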

5.3 Discounted problems

• Go back to the total cost problem, but now assume a discount factor α < 1 (i.e., future costs matter less than current costs).

• Can be converted to a stochastic shortest path (SSP) problem, for which the analysis of the preceding section holds.

• The transformation mechanism relies on adjusting the probabilities using the discount factor α. The instantaneous costs g(i, u) are preserved. Figure 5.3.1 illustrates this transformation.


• Justification: Take a policy µ, and apply it over both formulations. Note that:

– Given that the terminal state has not been reached in the SSP, the state evolution in the two problems is governed by the same transition probabilities.

– The expected cost of the kth stage of the associated SSP is g(xk, µk(xk)), multiplied by the probability that state t has not been reached, which is α^k. This is also the expected cost of the kth stage of the discounted problem.

– Note that value iteration produces identical iterates for the two problems:

Discounted:   Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^n pij(u) Jk(j) ],   i = 1, . . . , n.

Corresponding SSP:   Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + Σ_{j=1}^n (α pij(u)) Jk(j) ],   i = 1, . . . , n.

[Figure: in the associated SSP, each transition probability pij(u) of the original discounted problem is replaced by α pij(u), and an additional transition to an artificial termination state t occurs with probability 1 − α.]

Figure 5.3.1: Illustration of the transformation from α-discounted to stochastic shortest path.

• The results for SSP problems, summarized in Proposition 5.2.1, extend to this case. In particular:

(i) Value iteration converges to J* for all initial J0:

Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^n pij(u) Jk(j) ],   i = 1, . . . , n.

(ii) J* is the unique solution of Bellman's equation:

J*(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^n pij(u) J*(j) ],   i = 1, . . . , n.

(iii) Policy iteration converges finitely to an optimal policy.

(iv) Linear programming also works.

For completeness, we compile these results in the following proposition.

Proposition 5.3.1 The following hold for the discounted problem:


(a) Given any initial conditions J0(1), . . . , J0(n), the value iteration algorithm

Jk+1(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^n pij(u) Jk(j) ],   ∀i,   (5.3.1)

converges to the optimal cost J*(i).

(b) The optimal costs J*(1), . . . , J*(n) satisfy Bellman's equation,

J*(i) = min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^n pij(u) J*(j) ],   i = 1, . . . , n,

and in fact they are the unique solution of this equation.

(c) For any stationary policy µ, the costs Jµ(1), . . . , Jµ(n) are the unique solution of the equation

Jµ(i) = g(i, µ(i)) + α Σ_{j=1}^n pij(µ(i)) Jµ(j),   i = 1, . . . , n.

Furthermore, given any initial conditions J0(1), . . . , J0(n), the sequence Jk(i) generated by the DP iteration

Jk+1(i) = g(i, µ(i)) + α Σ_{j=1}^n pij(µ(i)) Jk(j),   i = 1, . . . , n,

converges to the cost Jµ(i) for each i.

(d) A stationary policy µ is optimal if and only if for every state i, µ(i) attains the minimum in Bellman's equation of part (b).

(e) The policy iteration algorithm given by

µk+1(i) = arg min_{u ∈ U(i)} [ g(i, u) + α Σ_{j=1}^n pij(u) Jµk(j) ],   i = 1, . . . , n,

generates an improving sequence of policies and terminates with an optimal policy.

As in the case of stochastic shortest path problems (see equation (5.2.9)), we can show that

• |Jk(i)− J∗(i)| ≤ Dαk, for some constant D.

• The error bounds become

Jk+1(j) + (α/(1 − α)) c_k ≤ J*(j) ≤ Jµk(j) ≤ Jk+1(j) + (α/(1 − α)) c̄_k,

where µk(i) attains the minimum in the kth value iteration (5.3.1) for all i, and

c_k = min_{i=1,...,n} [ Jk+1(i) − Jk(i) ],   and   c̄_k = max_{i=1,...,n} [ Jk+1(i) − Jk(i) ].

Example 5.3.1 (Asset selling problem)


• Assume system evolves according to xk+1 = wk.

• If the offer xk of period k is accepted, it is invested at an interest rate r.

• By discounting the sale amount to period-0 dollars, we view (1 + r)^{−k} xk as the reward for selling the asset in period k at a price xk, where r > 0 is the interest rate.

Idea: We discount the reward by the interest we forgo during the first k periods.

• The discount factor is therefore: α = 1/(1 + r).

• J∗ is the unique solution of Bellman’s equation

J*(x) = max{ x, E[J*(w)] / (1 + r) }.

• An optimal policy is to sell if and only if the current offer xk is greater than or equal to ᾱ, where

ᾱ = E[J*(w)] / (1 + r).
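Since the Bellman equation above gives J*(x) = max{x, ᾱ}, the threshold satisfies the scalar fixed-point equation ᾱ = E[max(w, ᾱ)]/(1 + r), which is a contraction and can be computed by simple iteration. The sketch below does this for an assumed discrete offer distribution and an assumed interest rate (illustrative numbers only).

    import numpy as np

    offers = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # possible offers w (assumed)
    probs = np.full(5, 0.2)                          # their probabilities (assumed)
    r = 0.10                                         # interest rate (assumed)

    alpha_bar = 0.0
    for _ in range(1000):
        # fixed-point map T(a) = E[max(w, a)] / (1 + r), contraction with modulus 1/(1+r)
        new = probs @ np.maximum(offers, alpha_bar) / (1.0 + r)
        if abs(new - alpha_bar) < 1e-12:
            break
        alpha_bar = new
    print(alpha_bar)          # accept the first offer x_k >= alpha_bar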

Example 5.3.2 (Manufacturer’s production plan)

• A manufacturer at each time period receives an order for her product with probability p and receives

no order with probability 1− p.

• At any period she has a choice of processing all unfilled orders in a batch, or process no order at

all.

• The cost per unfilled order at each time period is c > 0, and the setup cost to process the unfilled

orders is K > 0. The manufacturer wants to find a processing policy that minimizes the total

expected cost, assuming the discount factor is α < 1 and the maximum number of orders that

can remain unfilled is n. When the maximum n of unfilled orders is reached, the orders must

necessarily be processed.

• Define the state as the number of unfilled orders at the beginning of each period. Bellman's equation for this problem is

J*(i) = min{ K + α(1 − p)J*(0) + αpJ*(1),   ci + α(1 − p)J*(i) + αpJ*(i + 1) }

(the first term corresponds to processing the remaining orders, the second to doing nothing) for the states i = 0, 1, . . . , n − 1, and takes the form

J*(n) = K + α(1 − p)J*(0) + αpJ*(1)

for state n.

• Consider the value iteration method applied to this problem. We now prove, using the (finite horizon) DP algorithm, that the k-stage optimal cost functions Jk(i) are monotonically nondecreasing in i for all k, and therefore argue that the optimal infinite horizon cost function J*(i) is also monotonically nondecreasing in i, since

J*(i) = lim_{k→∞} Jk(i).


Given that J∗(i) is monotonically nondecreasing in i, we have that if processing a batch of m

orders is optimal, that is,

K + α(1− p)J∗(0) + αpJ∗(1) ≤ cm+ α(1− p)J∗(m) + αpJ∗(m+ 1),

then processing a batch of m+1 orders is also optimal. Therefore, a threshold policy (i.e., a policy

that processes the orders if their number exceeds some threshold integer m∗) is optimal.

Claim: The k-stage optimal cost functions Jk(i) are monotonically nondecreasing in i for all k.

Proof: We proceed by induction. Start from J0(i) = 0, for all i, and suppose that Jk(i + 1) ≥ Jk(i) for all i. We will see that Jk+1(i + 1) ≥ Jk+1(i) for all i. Consider first the case i + 1 < n. Then, by the induction hypothesis, we have

c(i + 1) + α(1 − p)Jk(i + 1) + αpJk(i + 2) ≥ ci + α(1 − p)Jk(i) + αpJk(i + 1).   (5.3.2)

Define, for any scalar γ,

Fk(γ) = min{ K + α(1 − p)Jk(0) + αpJk(1), γ }.

Since Fk(γ) is monotonically increasing in γ, we have from equation (5.3.2),

Jk+1(i + 1) = Fk( c(i + 1) + α(1 − p)Jk(i + 1) + αpJk(i + 2) )

≥ Fk( ci + α(1 − p)Jk(i) + αpJk(i + 1) )

= Jk+1(i).

Finally, consider the case i+ 1 = n. Then, we have

Jk+1(n) = K + α(1− p)Jk(0) + αpJk(1)

≥ Fk(ci+ α(1− p)Jk(i) + αpJk(i+ 1))

= Jk+1(n− 1).

The induction is complete.
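The value iteration sketch below (with assumed illustrative values of p, c, K, α and n) computes J* for this problem and extracts the threshold m*; running it confirms numerically that J*(i) is nondecreasing and that the optimal policy is of threshold type.

    import numpy as np

    p, c, K, alpha, n = 0.5, 1.0, 5.0, 0.9, 10      # assumed illustrative data

    J = np.zeros(n + 1)                              # J(i), i = 0, ..., n unfilled orders
    for _ in range(5000):
        process = K + alpha * ((1 - p) * J[0] + p * J[1])
        J_new = np.empty_like(J)
        for i in range(n):
            wait = c * i + alpha * ((1 - p) * J[i] + p * J[i + 1])
            J_new[i] = min(process, wait)
        J_new[n] = process                           # at i = n the orders must be processed
        if np.max(np.abs(J_new - J)) < 1e-10:
            break
        J = J_new

    process = K + alpha * ((1 - p) * J[0] + p * J[1])
    threshold = n                                    # default: process only when forced, at i = n
    for i in range(n):
        if process <= c * i + alpha * ((1 - p) * J[i] + p * J[i + 1]):
            threshold = i
            break
    print(J)            # nondecreasing in i
    print(threshold)    # process the unfilled orders iff their number is >= this value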

5.4 Average cost-per-stage problems

5.4.1 General setting

• Stationary system with finite number of states and controls

• Minimize over admissible policies π = {µ0, µ1, . . .},

Jπ(i) = lim_{N→∞} (1/N) E [ Σ_{k=0}^{N−1} g(xk, µk(xk)) | x0 = i ].

• Assume 0 ≤ g(xk, µk(xk)) <∞.


• Fact: For most problems of interest, the average cost per stage of a policy and the optimal average cost per stage are independent of the initial state.

Intuition: Costs incurred in the early stages do not matter in the long run. More formally, suppose that all states communicate under a given stationary policy µ.¹ Let

Kij(µ) = first passage time from i to j under µ,

i.e., Kij(µ) is the first index k such that xk = j starting from x0 = i. Then,

Jµ(i) = lim_{N→∞} (1/N) E[ Σ_{k=0}^{K_ij(µ)−1} g(xk, µ(xk)) | x0 = i ]  +  lim_{N→∞} (1/N) E[ Σ_{k=K_ij(µ)}^{N−1} g(xk, µ(xk)) | x0 = i ],

where the first limit equals 0.

Therefore, Jµ(i) = Jµ(j) for all i, j with E[Kij(µ)] < ∞ (or, equivalently, with P(Kij(µ) = ∞) = 0).

• Because communication issues are so important, the methodology relies heavily on Markov chain theory.

5.4.2 Associated stochastic shortest path (SSP) problem

Assumption 5.4.1 State n is such that for some integer m > 0, and for all initial states and all policies, n is visited with positive probability at least once within the first m stages.

In other words, state n is recurrent in the Markov chain corresponding to each stationary policy.

Consider a sequence of generated states, and divide it into cycles that go through n, as shown in Figure 5.4.1.

[Figure 5.4.1: Each cycle can be viewed as a state trajectory of a corresponding SSP problem with termination state n.]

The SSP is obtained via the transformation described in Figure 5.4.2.

Let the cost at i of the SSP be g(i, u)− λ∗. We will show that

Average cost problem ≡ Min cost cycle problem ≡ SSP problem

¹We are assuming that there is a single recurrent class. Recall that a state is recurrent if the probability of reentering it is one. Positive recurrent means that the expected time of returning to it is finite. Also, recall that in a finite-state Markov chain, all recurrent states are positive recurrent.


[Figure 5.4.2: LHS: original average cost per stage problem. RHS: associated SSP problem. The original transition probabilities are adjusted as follows: the probabilities of transition from each state i ≠ t to the artificial state t are set equal to pin(u), the probabilities of transition from all states to state n are set to zero, and all other probabilities are left unchanged.]

5.4.3 Heuristic argument

• Under all stationary policies in the original average cost problem, there will be an infinite number of cycles marked by successive visits to state n ⇒ we want to find a stationary policy µ that minimizes the average cycle stage cost.

• Consider a minimum cycle cost problem: Find a stationary policy µ that minimizes

Expected cost per transition within a cycle = E[cost from n up to the first return to n] / E[time from n up to the first return to n] = Cnn(µ) / Nnn(µ).

• Intuitively, the optimal average cost λ∗ should be equal to the optimal average cycle cost, i.e.,

λ∗ = Cnn(µ∗) / Nnn(µ∗),   or equivalently   Cnn(µ∗) − Nnn(µ∗)λ∗ = 0.

So, for any stationary policy µ,

λ∗ ≤ Cnn(µ) / Nnn(µ),   or equivalently   Cnn(µ) − Nnn(µ)λ∗ ≥ 0.

• Thus, to obtain an optimal policy µ, we must solve

min_µ { Cnn(µ) − Nnn(µ)λ∗ }.

Note that Cnn(µ) − Nnn(µ)λ∗ is the expected cost of µ starting from n in the associated SSP with stage cost g(i, u) − λ∗, justified by

E[ Σ_{k=0}^{K_nt(µ)−1} ( g(xk, µ(xk)) − λ∗ ) | x0 = n ] = Cnn(µ) − Nnn(µ) λ∗,

where Nnn(µ) = E[K_nt(µ)].


• Let h∗(i) be the optimal cost of the SSP (i.e., of the path from x0 = i to t) when starting at state i = 1, . . . , n. Then, by Proposition 1(b), h∗(1), . . . , h∗(n) uniquely solve the corresponding Bellman's equation:

h∗(i) = min_{u∈U(i)} [ g(i, u) − λ∗ + Σ_{j=1}^n p_ij(u) h∗(j) ]
      = min_{u∈U(i)} [ g(i, u) − λ∗ + Σ_{j=1}^{n−1} p_ij(u) h∗(j) + p_in(u) h∗(n) ],   (5.4.1)

where p_in(u) = 0 by construction.

• If µ∗ is a stationary policy that minimizes the cycle cost, then µ∗ must satisfy

h∗(n) = Cnn(µ∗)−Nnn(µ∗)λ∗ = 0

See Figure 5.4.3.

[Figure 5.4.3: h∗(n) in the SSP is the expected cost of the path from n to t (i.e., of the cycle from n to n in the original problem) based on the original g(i, u), minus Nnn(µ∗)λ∗.]

• We can then rewrite (5.4.1) as

h∗(n) = 0,
λ∗ + h∗(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) h∗(j) ],   ∀ i = 1, . . . , n.

From the results on SSP, we know that this equation has a unique solution (as long as we impose the constraint h∗(n) = 0). Moreover, minimization of the RHS should give an optimal stationary policy.

• Interpretation: h∗(i) is a relative or differential cost:

h∗(i) = min { E[cost to go from i to n for the first time] − E[cost if the stage cost were constant at λ∗ instead of g(j, u), ∀j] }.

In words, h∗(i) is a measure of how far from the average cost we are when starting from node i.


5.4.4 Bellman’s equation

The following proposition provides the main results regarding Bellman’s equation:

Proposition 5.4.1 Under Assumption 5.4.1, the following hold for the average cost per stage problem:

(a) The optimal average cost λ∗ is the same for all initial states and, together with some vector h∗ = (h∗(1), . . . , h∗(n)), satisfies Bellman's equation

λ∗ + h∗(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) h∗(j) ],   i = 1, . . . , n.   (5.4.2)

Furthermore, if µ(i) attains the minimum in the above equation for all i, the stationary policy µ is optimal. In addition, out of all vectors h∗ satisfying this equation, there is a unique vector for which h∗(n) = 0.

(b) If a scalar λ and a vector h = (h(1), . . . , h(n)) satisfy Bellman's equation, then λ is the optimal average cost per stage for each initial state.

(c) Given a stationary policy µ with corresponding average cost per stage λµ, there is a unique vector hµ = (hµ(1), . . . , hµ(n)) such that hµ(n) = 0 and

λµ + hµ(i) = g(i, µ(i)) + Σ_{j=1}^n p_ij(µ(i)) hµ(j),   i = 1, . . . , n.

Proof: We proceed item by item:

(a) Let λ = min_µ Cnn(µ)/Nnn(µ). Then, for all µ,

Cnn(µ) − Nnn(µ)λ ≥ 0,

with Cnn(µ∗) − Nnn(µ∗)λ = 0; the left-hand side is h∗(n) in the associated SSP, so h∗(n) = 0.

Consider the associated SSP with stage cost g(i, u) − λ. Then, by Proposition 1(b), and using the fact that p_in(u) = 0, the costs h∗(1), . . . , h∗(n) uniquely solve the corresponding Bellman's equation:

h∗(i) = min_{u∈U(i)} [ g(i, u) − λ + Σ_{j=1}^{n−1} p_ij(u) h∗(j) ],   i = 1, . . . , n.   (5.4.3)

Thus, we can rewrite (5.4.3) as

λ + h∗(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) h∗(j) ],   i = 1, . . . , n.   (5.4.4)

We will show that this implies λ = λ∗.


Let π = {µ0, µ1, . . .} be any admissible policy, let N be a positive integer, and for all k = 0, . . . , N − 1, define Jk(i) using the recursion:

J0(i) = h∗(i),   i = 1, . . . , n,
J_{k+1}(i) = g(i, µ_{N−k−1}(i)) + Σ_{j=1}^n p_ij(µ_{N−k−1}(i)) Jk(j),   i = 1, . . . , n.   (5.4.5)

In words, JN(i) is the N-stage cost of π when the starting state is i and the terminal cost is h∗.

From (5.4.4), since µ_{N−1}(·) is just one admissible choice of control, we have

λ + h∗(i) ≤ g(i, µ_{N−1}(i)) + Σ_{j=1}^n p_ij(µ_{N−1}(i)) h∗(j),   i = 1, . . . , n,

where the left-hand side is λ + J0(i) and the right-hand side is J1(i) (from (5.4.5), with k = 0). Thus,

λ + J0(i) ≤ J1(i),   i = 1, . . . , n.

Then,

J2(i) = g(i, µ_{N−2}(i)) + Σ_{j=1}^n p_ij(µ_{N−2}(i)) J1(j)
      ≥ g(i, µ_{N−2}(i)) + λ + Σ_{j=1}^n p_ij(µ_{N−2}(i)) J0(j)   (since J1(j) ≥ λ + J0(j))
      ≥ λ + λ + h∗(i)   (by equation (5.4.4))
      = 2λ + h∗(i),   i = 1, . . . , n.

By repeating this argument,

kλ+ h∗(i) ≤ Jk(i), k = 0, . . . , N, i = 1, . . . , n.

In particular, for k = N,

Nλ + h∗(i) ≤ JN(i)   ⇒   λ + h∗(i)/N ≤ JN(i)/N,   i = 1, . . . , n.   (5.4.6)

Equality holds in (5.4.6) if µk(i) attains the minimum in (5.4.4) for all i, k. Now,

λ + h∗(i)/N ≤ JN(i)/N,

where the left-hand side tends to λ and the right-hand side tends to Jπ(i) as N → ∞,

where Jπ(i) is the average cost per stage of π, starting at i. Then, we get

λ ≤ Jπ(i), i = 1, . . . , n,

for all admissible π.

If π = {µ, µ, . . .}, where µ(i) attains the minimum in (5.4.4) for all i, we get

λ = min_π Jπ(i) = λ∗,   i = 1, . . . , n.

Replacing λ by λ∗ in equation (5.4.4), we obtain (5.4.2). Finally, “h∗(n) = 0” jointly with (5.4.4) are equivalent to (5.4.3) for the associated SSP. But the solution to (5.4.3) is unique (due to Proposition 1(b)), so there must be a unique solution for the equations “h∗(n) = 0” and (5.4.4).


(b) The proof follows from the proof of part (a), starting from equation (5.4.4).

(c) The proof follows from part (a), constraining the control set to U(i) = {µ(i)}.

Remarks:

• Proposition 5.4.1 can be shown under weaker conditions. In particular, it can be shown assuming that all stationary policies have a single recurrent class, even if their corresponding recurrent classes do not have state n in common.

• It can also be shown assuming that for every pair of states i, j, there is a stationary policy µ under which there is a positive probability of reaching j starting from i.

Example: A manufacturer, at each time:

1. May process all unfilled orders at cost K > 0, or process no order at all. The cost per unfilled order at each time is c > 0.

2. Receives an order w.p. p, and no order w.p. 1− p.

• The maximum number of orders that can remain unfilled is n. When there are n pending orders, he has to process them.

• Objective: Find a processing policy that minimizes the total expected cost per stage.

• State: Number of unfilled orders. We take state 0 as the special state for the SSP formulation.

• Bellman’s equation: For states i = 0, 1, . . . , n− 1,

λ∗ + h∗(i) = min{ K + (1−p)h∗(0) + p h∗(1) ,  c i + (1−p)h∗(i) + p h∗(i+1) }

(process the unfilled orders vs. do nothing), and for state n,

λ∗ + h∗(n) = K + (1−p)h∗(0) + p h∗(1).

• Optimal policy: Process i unfilled orders if

K + (1− p)h∗(0) + ph∗(1) ≤ ci+ (1− p)h∗(i) + ph∗(i+ 1).

• If we view h∗(i) as the differential cost associated with an optimal policy (or by interpreting h∗(i) as the optimal cost-to-go for the associated SSP), then h∗(i) should be monotonically nondecreasing with i. This monotonicity implies that a threshold policy is optimal: “Process the orders if their number exceeds some threshold integer m”.


5.4.5 Computational approaches

Value iteration

Procedure: Generate optimal k-stage costs by the DP algorithm starting from any J0:

J_{k+1}(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) Jk(j) ],   ∀i.   (5.4.7)

Claim: lim_{k→∞} Jk(i)/k = λ∗, ∀i.

Proof: Let h∗ be a solution vector of Bellman’s equation:

λ∗ + h∗(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) h∗(j) ],   i = 1, . . . , n.   (5.4.8)

From here, define the recursion

J∗_0(i) = h∗(i),
J∗_{k+1}(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) J∗_k(j) ],   i = 1, . . . , n.

Like in the proof of Proposition 5.4.1(a), it can be shown that

J∗k (i) = kλ∗ + h∗(i), i = 1, . . . , n.

On the other hand, it can be seen that

|Jk(i) − J∗_k(i)| ≤ max_{j=1,...,n} |J0(j) − h∗(j)|,   i = 1, . . . , n,

because Jk(i) and J∗_k(i) are optimal costs for two k-stage problems that differ only in the corresponding terminal cost functions, which are J0 and h∗ respectively.

From the preceding two equations, we see that for all k,

|Jk(i) − (kλ∗ + h∗(i))| ≤ max_{j=1,...,n} |J0(j) − h∗(j)|.

Therefore,

− max_{j=1,...,n} |J0(j) − h∗(j)| + h∗(i)  ≤  Jk(i) − kλ∗  ≤  max_{j=1,...,n} |J0(j) − h∗(j)| + h∗(i),

and since |h∗(i)| ≤ max_{j=1,...,n} |h∗(j)|, it follows that

|Jk(i) − kλ∗| ≤ max_{j=1,...,n} |J0(j) − h∗(j)| + max_{j=1,...,n} |h∗(j)|,

which implies

| Jk(i)/k − λ∗ | ≤ constant / k.

Taking the limit as k → ∞ on both sides gives

lim_{k→∞} Jk(i)/k = λ∗.

The only condition required is that Bellman’s equation (5.4.8) holds for some vector h∗.

Remarks:


• Pros: Very simple to implement

• Cons:

– Since typically some of the components of Jk diverge to ∞ or −∞, direct calculation of lim_{k→∞} Jk(i)/k is numerically cumbersome.

– Method does not provide a corresponding differential cost vector h∗.

Fixing the difficulties:

• Subtract the same constant from all components of the vector Jk:

Jk(i) := Jk(i)− C; i = 1, . . . , n.

• Consider the algorithm

hk(i) = Jk(i) − Jk(s),

for some fixed state s, and for all i = 1, . . . , n. By using equation (5.4.7), for i = 1, . . . , n,

h_{k+1}(i) = J_{k+1}(i) − J_{k+1}(s)
  = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) Jk(j) ] − min_{u∈U(s)} [ g(s, u) + Σ_{j=1}^n p_sj(u) Jk(j) ]
  = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) (hk(j) + Jk(s)) ] − min_{u∈U(s)} [ g(s, u) + Σ_{j=1}^n p_sj(u) (hk(j) + Jk(s)) ]
  = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) hk(j) ] + Jk(s) − min_{u∈U(s)} [ g(s, u) + Σ_{j=1}^n p_sj(u) hk(j) ] − Jk(s)
  = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) hk(j) ] − min_{u∈U(s)} [ g(s, u) + Σ_{j=1}^n p_sj(u) hk(j) ].

• The above algorithm is called relative value iteration.

– Mathematically equivalent to the value iteration method (5.4.7) that generates Jk(i).

– Iterates generated by the two methods differ by a constant (namely Jk(s), since Jk(i) = hk(i) + Jk(s), ∀i).

• Big advantage of the new method: under Assumption 5.4.1 it can be shown that the iterates hk(i) are bounded, while this is typically not true for the “plain vanilla” method.

• It can be seen that if the relative value iteration converges to some vector h, then we have h(s) = 0 and

λ + h(i) = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) h(j) ],

where λ := min_{u∈U(s)} [ g(s, u) + Σ_{j=1}^n p_sj(u) h(j) ]. By Proposition 5.4.1(b), this implies that λ is indeed the optimal average cost per stage, and h is the associated differential cost vector.

• Disadvantage: Under Assumption 5.4.1, convergence is not guaranteed. However, convergence can be guaranteed for a simple variant:

h_{k+1}(i) = (1−τ) hk(i) + min_{u∈U(i)} [ g(i, u) + τ Σ_{j=1}^n p_ij(u) hk(j) ] − min_{u∈U(s)} [ g(s, u) + τ Σ_{j=1}^n p_sj(u) hk(j) ],

for i = 1, . . . , n, and τ a constant satisfying 0 < τ < 1.
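The following Python sketch (not from the notes) implements the damped variant above on a small, randomly generated average-cost MDP; all problem data is illustrative. At convergence the iterate h has h(s) = 0, and the bracketed minimum evaluated at the reference state s gives an estimate of the optimal average cost per stage λ∗.

```python
import numpy as np

# Sketch of the damped relative value iteration (tau-variant) above, applied to a
# randomly generated average-cost MDP; P[u][i, j] = p_ij(u), g[u][i] = g(i, u).
# All problem data here is illustrative.
n_states, n_controls, s, tau = 4, 2, 0, 0.5       # s is the fixed reference state
rng = np.random.default_rng(1)
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_controls)]
g = [rng.uniform(0.0, 2.0, size=n_states) for _ in range(n_controls)]

h = np.zeros(n_states)
for _ in range(100_000):
    Th = np.min([g[u] + tau * (P[u] @ h) for u in range(n_controls)], axis=0)
    h_new = (1 - tau) * h + Th - Th[s]            # the damped relative VI recursion
    if np.max(np.abs(h_new - h)) < 1e-12:
        break
    h = h_new

lam = np.min([g[u][s] + tau * (P[u] @ h)[s] for u in range(n_controls)])
print("estimated optimal average cost lambda* =", lam)
print("bounded iterates h (h[s] -> 0):", np.round(h, 4))
```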


Policy iteration

• Start from an arbitrary stationary policy µ0.

• At a typical iteration, we have a stationary policy µk. We perform two steps per iteration:

– Policy evaluation: Compute λk and hk(i) of µk, using the n + 1 equations hk(n) = 0 and, for i = 1, . . . , n,

λk + hk(i) = g(i, µk(i)) + Σ_{j=1}^n p_ij(µk(i)) hk(j).

If λk+1 = λk and hk+1(i) = hk(i), ∀i, stop. Otherwise, continue with the next step.

– Policy improvement: Find a stationary policy µ_{k+1} where, for all i, µ_{k+1}(i) is such that

µ_{k+1}(i) = arg min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) hk(j) ],

and repeat.
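The two steps above translate directly into code. The sketch below (not from the notes; the randomly generated transition probabilities and stage costs are illustrative) performs the policy evaluation step by solving the n linear equations with the normalization h_k(n) = 0, then performs the policy improvement step, and stops when the policy repeats.

```python
import numpy as np

# Sketch of the policy iteration algorithm just described, on a randomly generated
# average-cost MDP with data P[u][i, j] = p_ij(u) and g[u][i] = g(i, u).
# States are 0, ..., n-1; the last state plays the role of the reference state n
# with h(n) = 0. All problem data is illustrative.
n_states, n_controls = 3, 2
rng = np.random.default_rng(0)
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_controls)]
g = [rng.uniform(1.0, 5.0, size=n_states) for _ in range(n_controls)]

def evaluate(mu):
    """Policy evaluation: solve lam + h(i) = g(i,mu(i)) + sum_j p_ij(mu(i)) h(j), h(n-1)=0."""
    A = np.zeros((n_states, n_states))
    b = np.zeros(n_states)
    for i in range(n_states):
        A[i, 0] = 1.0                                   # coefficient of lam
        for j in range(n_states - 1):                   # unknowns h(0), ..., h(n-2)
            A[i, 1 + j] = (1.0 if i == j else 0.0) - P[mu[i]][i, j]
        b[i] = g[mu[i]][i]
    x = np.linalg.solve(A, b)
    return x[0], np.append(x[1:], 0.0)                  # (lam, h)

mu = np.zeros(n_states, dtype=int)                      # arbitrary initial policy
for _ in range(100):
    lam, h = evaluate(mu)
    # Policy improvement: minimize g(i,u) + sum_j p_ij(u) h(j) over u, for each i.
    q = np.array([[g[u][i] + P[u][i] @ h for u in range(n_controls)]
                  for i in range(n_states)])
    mu_new = q.argmin(axis=1)
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new

print("optimal policy mu* =", mu, " average cost lambda* =", round(lam, 4))
```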

• The next proposition shows that each iteration of the algorithm makes some irreversible progress towards optimality.

Proposition 5.4.2 Under Assumption 5.4.1, in the policy iteration algorithm, for each k we either have λ_{k+1} < λk, or else we have

λ_{k+1} = λk and h_{k+1}(i) = hk(i), ∀i.

Furthermore, the algorithm terminates, and the policies µk and µ_{k+1} obtained upon termination are optimal.

Proof: Denote µk := µ, µ_{k+1} := µ̄, λk := λ, λ_{k+1} := λ̄, hk(i) := h(i), h_{k+1}(i) := h̄(i). Define, for N = 1, 2, . . .,

hN(i) = g(i, µ̄(i)) + Σ_{j=1}^n p_ij(µ̄(i)) h_{N−1}(j),   i = 1, . . . , n,

where h0(i) = h(i).

Thus, we have

λ̄ = J_µ̄(i) = lim_{N→∞} (1/N) hN(i),   i = 1, 2, . . . , n.   (5.4.9)

By the definition of µ̄ we have, for all i = 1, . . . , n:

h1(i) = g(i, µ̄(i)) + Σ_{j=1}^n p_ij(µ̄(i)) h0(j)   (from the iteration above)
      ≤ g(i, µ(i)) + Σ_{j=1}^n p_ij(µ(i)) h0(j)   (because µ̄(i) attains the minimum of this expression over u)
      = λ + h0(i)   (by Proposition 5.4.1(c) applied to µ).


From the inequality above, we also obtain

h2(i) = g(i, µ̄(i)) + Σ_{j=1}^n p_ij(µ̄(i)) h1(j)
      ≤ g(i, µ̄(i)) + Σ_{j=1}^n p_ij(µ̄(i)) (λ + h0(j))
      = λ + g(i, µ̄(i)) + Σ_{j=1}^n p_ij(µ̄(i)) h0(j)
      ≤ λ + g(i, µ(i)) + Σ_{j=1}^n p_ij(µ(i)) h0(j)
      = λ + (λ + h0(i)) = 2λ + h0(i),

and by proceeding similarly, we see that for all i, N,

hN(i) ≤ Nλ + h0(i).

Thus,

hN(i)/N ≤ λ + h0(i)/N.

Taking the limit as N → ∞, the LHS converges to λ̄ (from equation (5.4.9)) and the second term on the RHS goes to zero, implying that λ̄ ≤ λ.

• If λ̄ = λ, the iteration that produces µ_{k+1} is a policy improvement step for the associated SSP with stage cost g(i, u) − λ. Moreover, h(i) and h̄(i) are the costs starting from i and corresponding to µ and µ̄, respectively, in this associated SSP. Thus, h̄(i) ≤ h(i), ∀i.

• Since there are only a finite number of stationary policies, there are also a finite number of values λ (each one being the average cost per stage of one of the stationary policies). For each λ there is only a finite number of possible vectors h (see Proposition 5.4.1(c), where we can vary the reference hµ(n) = 0).

• In view of the improvement properties already shown, no pair (λ, h) can be repeated without termination of the algorithm, implying that the algorithm must terminate with λ̄ = λ and h̄(i) = h(i), ∀i.

Claim: When the algorithm terminates, the policies µ and µ̄ are optimal.

Proof: Upon termination, we have for all i,

λ + h(i) = λ̄ + h̄(i)
         = g(i, µ̄(i)) + Σ_{j=1}^n p_ij(µ̄(i)) h̄(j)   (by the policy evaluation step for µ̄)
         = g(i, µ̄(i)) + Σ_{j=1}^n p_ij(µ̄(i)) h(j)   (because h̄(j) = h(j), ∀j)
         = min_{u∈U(i)} [ g(i, u) + Σ_{j=1}^n p_ij(u) h(j) ]   (by the policy improvement step).


Therefore, (λ, h) satisfy Bellman’s equation, and by Proposition 5.4.1(b), λ must be equal to theoptimal average cost per stage. Furthermore, µ(i) attains the minimum in the RHS of Bellman’sequation (see the last two equalities above), and hence by Proposition 5.4.1(a), µ is optimal. Sincewe also have for all i (due to the self-consistency of the policy evaluation step),

λ + h(i) = g(i, µ(i)) + Σ_{j=1}^n p_ij(µ(i)) h(j),

the same is true for µ.

5.5 Semi-Markov Decision Problems

5.5.1 General setting

• Stationary system with finite number of states and controls.

• State transitions occur at discrete times.

• Control applied at these discrete times and stays constant between transitions.

• The time between transitions is random and may depend on the current state and the choice of control.

• Cost accumulates in continuous time, or may be incurred at the time of a transition.

• Example: Admission control in a system with restricted capacity (e.g., a communication link)

– Customer arrivals: Poisson process.

– Customers entering the system, depart after an exponentially distributed time.

– Upon arrival we must decide whether to admit or block a customer.

– There is a cost for blocking a customer.

– For each customer that is in the system, there is a customer-dependent reward per unit of time.

– Objective: Minimize time-discounted or average cost.

• Note that at the transition times tk, the future of the system statistically depends only on the current state. This is guaranteed by not allowing the control to change in between transitions. Otherwise, we should include the time elapsed from the last transition as part of the system state.

5.5.2 Problem formulation

• x(t) and u(t): State and control at time t. Stay constant between transitions.

• tk: Time of the kth transition (t0 = 0).

• xk = x(tk): We have x(t) = xk for tk ≤ t < tk+1.


• uk = u(tk): We have u(t) = uk for tk ≤ t < tk+1.

• In place of transition probabilities, we have transition distributions. For any pair (state i, control u), specify the joint distribution of the transition interval and the next state:

Qij(τ, u) = P{ t_{k+1} − tk ≤ τ, x_{k+1} = j | xk = i, uk = u }.

• Two important observations:

1. Transition distributions specify the ordinary transition probabilities via

p_ij(u) = P{ x_{k+1} = j | xk = i, uk = u } = lim_{τ→∞} Qij(τ, u).

We assume that for all states i and controls u ∈ U(i), the average transition time,

τi(u) = Σ_{j=1}^n ∫_0^∞ τ Qij(dτ, u),

is nonzero and finite, 0 < τi(u) < ∞.

2. The conditional cumulative distribution function (c.d.f.) of τ given i, j, and u is (assuming p_ij(u) > 0)

P{ t_{k+1} − tk ≤ τ | x_{k+1} = j, xk = i, uk = u } = Qij(τ, u) / p_ij(u).   (5.5.1)

Thus, Qij(τ, u) can be seen as a scaled c.d.f., i.e.,

Qij(τ, u) = P{ t_{k+1} − tk ≤ τ | x_{k+1} = j, xk = i, uk = u } × p_ij(u).

Important case: Exponential transition distributions

• Important example of transition distributions:

Qij(τ, u) = pij(u)(1− e−νi(u)τ ),

where pij(u) are transition probabilities, and νi(u) > 0 is called the transition rate at state i.

• Interpretation: If the system is in state i and control u is applied,

– The next state will be j w.p. pij(u).

– The time between the transition to state i and the transition to the next state j is Exp(νi(u)) (independently of j):

P{ transition time interval > τ | i, u } = e^{−νi(u) τ}.

• The exponential distribution is memoryless. This implies that for a given policy, the system is a continuous-time Markov chain (the future depends on the past only through the present). Without the memoryless property, the Markov property holds only at the times of transition.


Cost structures

• There is a cost g(i, u) per unit time, i.e.,

g(i, u)dt = cost incurred during small time period dt

• There may be an extra instantaneous cost ĝ(i, u) at the time of a transition (let us ignore this for the moment).

• Total discounted cost of π = {µ0, µ1, . . .}, starting from state i (with discount factor β > 0):

lim_{N→∞} E[ Σ_{k=0}^{N−1} ∫_{tk}^{t_{k+1}} e^{−βt} g(xk, µk(xk)) dt | x0 = i ]

• Average cost per unit time of π = {µ0, µ1, . . .}, starting from state i:

lim_{N→∞} (1 / E[tN | x0 = i, π]) E[ Σ_{k=0}^{N−1} ∫_{tk}^{t_{k+1}} g(xk, µk(xk)) dt | x0 = i ]

• We will see that both problems have equivalent discrete time versions.

A note on notation

• The scaled c.d.f. Qij(τ, u) can be used to model discrete, continuous, and mixed distributions for the transition time τ.

• Generally, expected values of functions of τ can be written as integrals involving dQij(τ, u). For example, from (5.5.1) (note that the denominator p_ij(u) does not depend on τ), the conditional expected value of τ given i, j, and u is written as

E[τ | i, j, u] = ∫_0^∞ τ dQij(τ, u) / p_ij(u).

• If Qij(τ, u) is discontinuous and “staircase-like”, expected values can be written as summations.

5.5.3 Discounted cost problems

• For a policy π = {µ0, µ1, . . .}, write

Jπ(i) = E[cost of 1st transition] + E[ e^{−βτ} J_{π1}(j) | i, µ0(i) ],   (5.5.2)

where J_{π1}(j) is the cost-to-go of the policy π1 = {µ1, µ2, . . .}.

• We calculate the two costs on the RHS. The expected cost of a single transition if u is applied at state i is

G(i, u) = E_j[ E_τ[transition cost | j] ]
        = Σ_{j=1}^n p_ij(u) ∫_0^∞ ( ∫_0^τ e^{−βt} g(i, u) dt ) dQij(τ, u) / p_ij(u)
        = Σ_{j=1}^n ∫_0^∞ (1 − e^{−βτ})/β  g(i, u) dQij(τ, u),   (5.5.3)


where the 2nd equality follows from computing Eτ[transition cost | j] via integrating the tail of the nonnegative r.v. τ, and the 3rd one because ∫_0^τ e^{−βt} dt = (1 − e^{−βτ})/β.

Thus, E[cost of 1st transition] is

G(i, µ0(i)) = g(i, µ0(i)) Σ_{j=1}^n ∫_0^∞ (1 − e^{−βτ})/β  dQij(τ, µ0(i)).

• Regarding the 2nd term in (5.5.2),

E[ e^{−βτ} J_{π1}(j) | i, µ0(i) ] = E_j[ E[ e^{−βτ} | j, i, µ0(i) ] J_{π1}(j) ]
  = Σ_{j=1}^n p_ij(µ0(i)) ( ∫_0^∞ e^{−βτ} dQij(τ, µ0(i)) / p_ij(µ0(i)) ) J_{π1}(j)
  = Σ_{j=1}^n m_ij(µ0(i)) J_{π1}(j),

where m_ij(u) is given by

m_ij(u) = ∫_0^∞ e^{−βτ} dQij(τ, u).

Note that mij(u) satisfies

m_ij(u) < ∫_0^∞ dQij(τ, u) = lim_{τ→∞} Qij(τ, u) = p_ij(u).

So, mij(u) can be viewed as the effective discount factor (the analog of αpij(u) in the discrete-time case).

• So, going back to (5.5.2), Jπ(i) can be written as

Jπ(i) = G(i, µ0(i)) + Σ_{j=1}^n m_ij(µ0(i)) J_{π1}(j).

Equivalence to an SSP

• Similar to the discrete-time case, introduce a stochastic shortest path problem with an artificial termination state t.

• Under control u, from state i the system moves to state j w.p. m_ij(u), and to the terminal state t w.p. 1 − Σ_{j=1}^n m_ij(u).

• Bellman’s equation: For i = 1, . . . , n,

J∗(i) = min_{u∈U(i)} [ G(i, u) + Σ_{j=1}^n m_ij(u) J∗(j) ]

• Analogs of value iteration, policy iteration, and linear programming.


• If, in addition to the cost per unit of time g, there is an extra (instantaneous) one-stage cost ĝ(i, u), Bellman's equation becomes

J∗(i) = min_{u∈U(i)} [ ĝ(i, u) + G(i, u) + Σ_{j=1}^n m_ij(u) J∗(j) ]

Example 5.5.1 (Manufacturer’s production plan)

• A manufacturer receives orders with interarrival times uniformly distributed in [0, τmax].

• He may process all unfilled orders at cost K > 0, or process none. The cost per unit of time of

an unfilled order is c. Maximum number of unfilled orders is n.

• Objective: Find a processing policy that minimizes the total expected discounted cost, assuming the discount factor is β > 0.

• The nonzero transition distributions are

Q_{i1}(τ, Fill) = Q_{i,i+1}(τ, Not Fill) = min{ 1, τ/τmax }.

• The one-stage expected cost G (see equation (5.5.3)) is

G(i,Fill) = 0, G(i,Not Fill) = γci,

where

γ = Σ_{j=1}^n ∫_0^∞ (1 − e^{−βτ})/β  dQij(τ, u) = ∫_0^{τmax} (1 − e^{−βτ}) / (β τmax) dτ.

• There is an instantaneous cost

ĝ(i, Fill) = K,   ĝ(i, Not Fill) = 0.

• The effective discount factors mij(u) in Bellman’s equation are

mi1(Fill) = mi,i+1(Not Fill) = α,

where

α = ∫_0^∞ e^{−βτ} dQij(τ, u) = ∫_0^{τmax} e^{−βτ} / τmax dτ = (1 − e^{−βτmax}) / (β τmax).

• Bellman’s equation has the form

J∗(i) = min{ K + α J∗(1) ,  γ c i + α J∗(i+1) },   i = 1, 2, . . .

As in the discrete-time case, it can be proved that J∗(i) is monotonically nondecreasing in i. Therefore, there must exist an optimal threshold i∗ such that the manufacturer must fill the orders if and only if their number i exceeds i∗.
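A small numerical sketch (not part of the notes; the values of K, c, β, τmax, and n below are made up) computes α and γ from the formulas above and solves Bellman's equation by value iteration, forcing the Fill decision at the cap i = n.

```python
import numpy as np

# Sketch for Example 5.5.1 with illustrative parameters: compute alpha and gamma,
# then solve J(i) = min{ K + alpha*J(1), gamma*c*i + alpha*J(i+1) } by value iteration.
K, c, beta, tau_max, n = 5.0, 1.0, 0.05, 2.0, 20

alpha = (1 - np.exp(-beta * tau_max)) / (beta * tau_max)   # effective discount factor
gamma = (1 - alpha) / beta   # = int_0^{tau_max} (1 - e^{-beta*tau}) / (beta*tau_max) dtau

J = np.zeros(n + 1)                                         # states 1..n (index 0 unused)
for _ in range(100_000):
    J_new = J.copy()
    for i in range(1, n):
        J_new[i] = min(K + alpha * J[1], gamma * c * i + alpha * J[i + 1])
    J_new[n] = K + alpha * J[1]                             # must fill at the cap i = n
    if np.max(np.abs(J_new - J)) < 1e-12:
        break
    J = J_new

i_star = n
for i in range(1, n):
    if K + alpha * J[1] <= gamma * c * i + alpha * J[i + 1]:
        i_star = i
        break
print("alpha =", round(alpha, 4), " gamma =", round(gamma, 4), " threshold i* =", i_star)
```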


5.5.4 Average cost problems

• The cost function for the continuous-time average cost per unit time problem (assuming that there is a special state that is recurrent under all policies) would be

lim_{T→∞} (1/T) E[ ∫_0^T g(x(t), u(t)) dt ].

However, we will use instead the cost function

lim_{N→∞} (1 / E[tN]) E[ ∫_0^{tN} g(x(t), u(t)) dt ],

where tN is the completion time of the Nth transition. This cost function is equivalent to the previous one under the conditions of the subsequent analysis.

• We now apply the SSP argument used for the discrete-time case. Divide the trajectory into cycles marked by successive visits to n. The cost at (i, u) is G(i, u) − λ∗ τi(u), where λ∗ is the optimal expected cost per unit of time. Each cycle is viewed as a state trajectory of a corresponding SSP problem with the termination state being essentially n.

• Bellman’s equation for the average cost problem is

h∗(i) = min_{u∈U(i)} [ G(i, u) − λ∗ τi(u) + Σ_{j=1}^n p_ij(u) h∗(j) ].

• For the manufacturer's problem of Example 5.5.1, the expected transition times are

τi(Fill) = τi(Not Fill) = τmax / 2.

• The expected transition cost is

G(i, Fill) = 0,   G(i, Not Fill) = c i τmax / 2,

and the instantaneous cost is

ĝ(i, Fill) = K,   ĝ(i, Not Fill) = 0.

• Bellman’s equation is

h∗(i) = min{ K − λ∗ τmax/2 + h∗(1) ,  c i τmax/2 − λ∗ τmax/2 + h∗(i+1) }.

• Again, it can be shown that a threshold policy is optimal.

5.6 Application: Multi-Armed Bandits

5.7 Exercises

Exercise 5.7.1 A computer manufacturer can be in one of two states. In state 1 his product sells well, while in state 2 his product sells poorly. While in state 1 he can advertise his product


in which case the one-stage reward is 4 units and the transition probabilities are p11 = 0.8 and p12 = 0.2. If in state 1 he does not advertise, the reward is 6 units and the transition probabilities are p11 = p12 = 0.5. While in state 2, he can do research to improve his product, in which case the one-stage reward is −5 units and the transition probabilities are p21 = 0.7 and p22 = 0.3. If in state 2 he does not do the research, the reward is −3 and the transition probabilities are p21 = 0.4 and p22 = 0.6. Consider the infinite horizon, discounted version of this problem.

(a) Show that when the discount factor α is sufficiently small, the computer manufacturer should follow the “shortsighted” policy of not advertising (not doing research) while in state 1 (state 2). By contrast, when α is sufficiently close to 1, he should follow the “farsighted” policy of advertising (doing research) while in state 1 (state 2).

(b) For α = 0.9, calculate the optimal policy using policy iteration.

(c) For α = 0.99, use a computer to solve the problem by value iteration.

Exercise 5.7.2 An energetic salesman works every day of the week. He can work in only one of two towns A and B on each day. For each day he works in town A (or B) his expected reward is rA (or rB, respectively). The cost of changing towns is c. Assume that c > rA > rB, and that there is a discount factor α < 1.

(a) Show that for α sufficiently small, the optimal policy is to stay in the town he starts in, and that for α sufficiently close to 1, the optimal policy is to move to town A (if not starting there) and stay in A for all subsequent times.

(b) Solve the problem for c = 3, rA = 2, rB = 1 and α = 0.9 using policy iteration.

(c) Use a computer to solve the problem of part (b) by value iteration.

Exercise 5.7.3 A person has an umbrella that she takes from home to office and vice versa. There is a probability p of rain each time she leaves home or office, independently of earlier weather. If the umbrella is in the place where she is and it rains, she takes the umbrella to go to the other place (this involves no cost). If there is no umbrella and it rains, there is a cost W for getting wet. If the umbrella is in the place where she is but it does not rain, she may take the umbrella to go to the other place (this involves an inconvenience cost V) or she may leave the umbrella behind (this involves no cost). Costs are discounted at a factor α < 1.

(a) Formulate this as an infinite horizon total cost discounted problem. Try to reduce the numberof states of the model. Two or three states should be enough for this problem!

(b) Characterize the optimal policy as best as you can.

Exercise 5.7.4 An unemployed worker receives a job offer at each time period, which she may accept or reject. The offered salary takes one of n possible values w1, . . . , wn, with given probabilities, independently of preceding offers. If she accepts the offer, she must keep the job for the rest of her life at the same salary level. If she rejects the offer, she receives unemployment compensation c for the current period and is eligible to accept future offers. Assume that income is discounted by a factor α < 1.


Hint: Define the states si, i = 1, . . . , n, corresponding to the worker being unemployed and being offered a salary wi, and the states s̄i, i = 1, . . . , n, corresponding to the worker being employed at a salary level wi.

(a) Show that there is a threshold w such that it is optimal to accept an offer if and only if its salary is larger than w, and characterize w.

(b) Consider the variant of the problem where there is a given probability pi that the worker will be fired from her job at any one period if her salary is wi. Show that the result of part (a) holds in the case where pi is the same for all i. Argue what would happen in the case where pi depends on i.

Exercise 5.7.5 An unemployed worker receives a job offer at each time period, which she may accept or reject. The offered salary takes one of n possible values w1, . . . , wn, with given probabilities, independently of preceding offers. If she accepts the offer, she must keep the job for the rest of her life at the same salary level. If she rejects the offer, she receives unemployment compensation c for the current period and is eligible to accept future offers.

Suppose that there is a probability p that the worker will be fired from her job at any one period, and further assume that w1 < w2 < · · · < wn.

Show that when the worker maximizes her average income per period, there is a threshold value w such that it is optimal to accept an offer if and only if her salary is larger than w, and characterize w.

Hint: Define the states si, i = 1, . . . , n, corresponding to the worker being unemployed and being offered a salary wi, and the states s̄i, i = 1, . . . , n, corresponding to the worker being employed at a salary level wi.


Chapter 6

Point Process Control

This chapter is based on Chapters I, II and VII of Brémaud's book Point Processes and Queues (1981).

6.1 Basic Definitions

Consider some probability space (Ω, F, P). A real-valued mapping X : Ω → R is a random variable if for every C ∈ B(R) the pre-image X−1(C) ∈ F.

A filtration (or history) of a measurable space (Ω, F) is a collection (Ft)_{t≥0} of sub-σ-fields of F such that for all 0 ≤ s ≤ t,

Fs ⊆ Ft.

We denote F∞ = ∨_{t≥0} Ft := σ( ∪_{t≥0} Ft ).

A family (Xt)_{t≥0} of real-valued random variables is called a stochastic process. The filtration generated by Xt is

F^X_t := σ( Xs : s ∈ [0, t] ).

For a fixed ω ∈ Ω, the function Xt(ω) is called a path of the stochastic process.

We say that the stochastic process is adapted to the filtration Ft if F^X_t ⊆ Ft for all t ≥ 0. We say that Xt is Ft-progressive if for all t ≥ 0 the mapping (s, ω) → Xs(ω) from [0, t] × Ω → R is B([0, t]) ⊗ Ft-measurable.

Let Ft be a filtration. We define the Ft-predictable σ-field P(Ft) as follows:

P(Ft) := σ ((s, t]×A : s ∈ [0, t] and A ∈ Fs) .

A stochastic process Xt is Ft-predictable if Xt is P(Ft)-measurable.

Proposition 6.1.1 A real-valued stochastic process Xt that is adapted to Ft and left-continuous is Ft-predictable.

Given a filtration Ft, a process Xt is called an Ft-martingale over [0, c] if the following three conditions are satisfied:


1. Xt is adapted to Ft.

2. E[|Xt|] <∞ for all t ∈ [0, c].

3. E[Xt|Fs] = Xs a.s., for all 0 ≤ s ≤ t ≤ c.

If the equality in (3) is replaced by ≥ (≤) then Xt is called a submartingale (supermartingale).

Exercise 6.1.1 Let Xt be a real-valued process with independent increments, that is, for all 0 ≤ s ≤ t, Xt − Xs is independent of F^X_s. Suppose that Xt is integrable and E[Xt] = 0. Show that Xt is an F^X_t-martingale.

If in addition X²_t is integrable, then X²_t is an F^X_t-submartingale and X²_t − E[X²_t] is an F^X_t-martingale.

6.2 Counting Processes

Definition 6.2.1 A sequence of random variables {Tn : n ≥ 0} is called a point process if for each n ≥ 0, Tn is F-measurable and

∀ω ∈ Ω:  T0(ω) = 0 and Tn(ω) < T_{n+1}(ω) whenever Tn(ω) < ∞.

We will only consider nonexplosive point processes, that is, processes for which lim_{n→∞} Tn = ∞ P-a.s.

Associated to a nonexplosive point process Tn, we define the corresponding counting process {Nt : t ≥ 0} as follows:

Nt = n if t ∈ [Tn, T_{n+1}).

Thus, Nt is a right-continuous step function starting at 0 (see the figure below).

[Figure: a sample path of the counting process Nt, which starts at 0 and increases by one at each of the points T1, T2, T3, . . .]

Since we are considering nonexplosive point processes, Nt < ∞ for all t ≥ 0 P-a.s. In addition, if E[Nt] is finite for all t, then the point process is said to be integrable.

Exercise 6.2.1 Simple Renewal Process:

Consider a sequence {Xn : n ≥ 1} of iid nonnegative random variables. We define the point process recursively as follows: T0 = 0 and Tn = T_{n−1} + Xn, n ≥ 1. A sufficient condition for the process Tn to be nonexplosive is E[X1] > 0.


One of the most famous renewal processes is the Poisson process. In this case the inter-arrival interval X has an exponential distribution with time-homogeneous rate λ.
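As a simple illustration (a sketch, not from the notes; the rate λ and horizon T are arbitrary), a Poisson process and its counting process Nt can be simulated directly from the renewal construction Tn = T_{n−1} + Xn with Xn ~ Exp(λ).

```python
import numpy as np

# Sketch: build the counting process N_t of a homogeneous Poisson process by
# summing iid exponential inter-arrival times (rate lam and horizon T are illustrative).
rng = np.random.default_rng(0)
lam, T = 2.0, 100.0

event_times = []
t = 0.0
while True:
    t += rng.exponential(1.0 / lam)     # X_n ~ Exp(lam); T_n = T_{n-1} + X_n
    if t > T:
        break
    event_times.append(t)

def N(t):
    """Counting process: number of points T_n <= t."""
    return sum(1 for s in event_times if s <= t)

print("N_T =", N(T), "  lam*T =", lam * T)   # N_T / T approximates lam
```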

Exercise 6.2.2 Queueing Process:

A simple queueing process Qt is a nonnegative integer-valued process of the form

Qt = Q0 +At −Dt,

where At (arrival process) and Dt (departure process) are two nonexplosive point processes without common jumps. Note that by definition Dt ≤ Q0 + At.

Exercise 6.2.3 Let Nt(1) and Nt(2) be two nonexplosive point processes without common jumps and let Q0 be a nonnegative integer-valued random variable. Define Xt = Q0 + Nt(1) − Nt(2) and mt = min_{0≤s≤t} X_s^−. Show that Qt = Xt − mt is a simple queueing process with arrival process At = Nt(1), departure process Dt = ∫_0^t 1(Q_{s−} > 0) dNs(2), and mt = −∫_0^t 1(Q_{s−} = 0) dNs(2).

Poisson processes are commonly used in practice to represent point processes. One possible explanation for this popularity is their inherent mathematical tractability. Moreover, a well-known result in renewal theory says that the sum of a large number of independent renewal processes converges, as the number of summands goes to infinity, to a Poisson process. So the Poisson process can in fact be a good approximation in many practical applications. To make the Poisson process even more realistic, we would like to have more flexibility in the arrival rate λ. We can achieve this generalization as follows.

Definition 6.2.2 (Doubly Stochastic or Conditional Poisson Process) Let Nt be a point process adapted to a filtration Ft, and let λt be a nonnegative measurable process. Suppose that

λt is F0-measurable for all t ≥ 0,

and that

∫_0^t λs ds < ∞   P-a.s., for all t ≥ 0.

If for all 0 ≤ s ≤ t the increment Nt − Ns is independent of Fs given F0 and

P[ Nt − Ns = k | Fs ] = (1/k!) exp( −∫_s^t λu du ) ( ∫_s^t λu du )^k,

then Nt is called a (P, Ft)-doubly stochastic Poisson process with stochastic intensity λt.

The process λt is referred to as the intensity of the process. A special case of a doubly stochastic Poisson process occurs when λt = λ ∈ F0. Another example is the case where λt = f(t, Yt) for a measurable nonnegative function f and a process Yt such that F^Y_∞ ⊆ F0.
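A doubly stochastic Poisson process with a bounded intensity path can be simulated by thinning: first draw the F0-measurable intensity path, then generate a homogeneous Poisson process at an upper-bound rate and keep each candidate point with probability λt/λmax. The sketch below is illustrative and not part of the notes; the particular intensity λt = Y(1 + 0.5 sin t) and the horizon are assumptions.

```python
import numpy as np

# Sketch: simulate a conditional Poisson process on [0, T] by thinning.
# First draw the F_0-measurable intensity path (a random level Y times a
# deterministic profile), then thin a homogeneous Poisson process of rate lam_max.
rng = np.random.default_rng(2)
T = 10.0
Y = rng.uniform(0.5, 2.0)                        # F_0-measurable randomization

def intensity(t):
    """lambda_t = f(t, Y): random level Y times a deterministic profile."""
    return Y * (1.0 + 0.5 * np.sin(t))

lam_max = Y * 1.5                                # upper bound on the intensity path

t, arrivals = 0.0, []
while True:
    t += rng.exponential(1.0 / lam_max)          # candidate point at rate lam_max
    if t > T:
        break
    if rng.uniform() <= intensity(t) / lam_max:  # keep with prob lambda_t / lam_max
        arrivals.append(t)

N_T = len(arrivals)
print("N_T =", N_T, " E[N_T | Y] =", Y * (T + 0.5 * (1 - np.cos(T))))
```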

Exercise 6.2.4 Show that if a doubly stochastic Poisson process Nt is such that E[ ∫_0^t λs ds ] < ∞ for all t ≥ 0, then

Mt = Nt − ∫_0^t λs ds

is an Ft-martingale.


Based on this observation we have the following important result.

Proposition 6.2.1 If Nt is an integrable doubly stochastic Poisson process with Ft-intensity λt, then for all nonnegative Ft-predictable processes Ct,

E[ ∫_0^∞ Cs dNs ] = E[ ∫_0^∞ Cs λs ds ],

where ∫_0^t Cs dNs := Σ_{n≥1} C_{Tn} 1(Tn ≤ t).

It turns out that the converse is also true. This was first proved by Watanabe in a less general setting.

Proposition 6.2.2 (Watanabe (1964)) Let Nt be a point process adapted to the filtration Ft, and let λ(t) be a locally integrable nonnegative function. Suppose that Nt − ∫_0^t λ(s) ds is an Ft-martingale. Then Nt is a Poisson process with intensity λ(t), that is, for all 0 ≤ s ≤ t, Nt − Ns is a Poisson random variable with parameter ∫_s^t λ(u) du, independent of Fs.

Motivated by this result, we define the notion of stochastic intensity for an arbitrary point process as follows.

Definition 6.2.3 (Stochastic Intensity)

Let Nt be a point process adapted to some filtration Ft, and let λt be a nonnegative Ft-progressive process such that for all t ≥ 0,

∫_0^t λs ds < ∞   P-a.s.

If for all nonnegative Ft-predictable processes Ct the equality

E[ ∫_0^∞ Cs dNs ] = E[ ∫_0^∞ Cs λs ds ]

is verified, then we say that Nt admits the Ft-intensity λt.
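The defining identity can be checked by simulation in simple cases. The sketch below (not from the notes) uses a homogeneous Poisson process, which is a doubly stochastic Poisson process with constant intensity, and the deterministic (hence predictable) integrand Cs = s restricted to [0, T]; for this choice the right-hand side equals λT²/2.

```python
import numpy as np

# Sketch (illustrative): Monte Carlo check of E[ int C_s dN_s ] = E[ int C_s lambda_s ds ]
# for a homogeneous Poisson process on [0, T] with the predictable integrand C_s = s.
rng = np.random.default_rng(3)
lam, T, n_paths = 1.5, 5.0, 100_000

lhs = 0.0
for _ in range(n_paths):
    t, total = 0.0, 0.0
    while True:
        t += rng.exponential(1.0 / lam)
        if t > T:
            break
        total += t                      # sum_n C_{T_n} with C_s = s
    lhs += total
lhs /= n_paths

rhs = lam * T**2 / 2                    # int_0^T s * lam ds
print("Monte Carlo E[int C dN] =", round(lhs, 3), " analytic E[int C lam ds] =", round(rhs, 3))
```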

Exercise 6.2.5 Let Nt be a point process with the Ft-intensity λt. Show that if λt is Gt-progressive for some filtration Gt such that F^N_t ⊆ Gt ⊆ Ft for all t ≥ 0, then λt is also the Gt-intensity of Nt.

Similarly to the Poisson process, we can connect point processes with stochastic intensities to martingales.

Proposition 6.2.3 (Integration Theorem)

If Nt admits the Ft-intensity λt (where ∫_0^t λs ds < ∞ a.s.), then Nt is nonexplosive and

1. Mt = Nt − ∫_0^t λs ds is an Ft-local martingale.

2. If Xt is an Ft-predictable process such that E[ ∫_0^t |Xs| λs ds ] < ∞, then ∫_0^t Xs dMs is an Ft-martingale.

3. If Xt is an Ft-predictable process such that ∫_0^t |Xs| λs ds < ∞ a.s., then ∫_0^t Xs dMs is an Ft-local martingale.


6.3 Optimal Intensity Control

In this section we study the problem of controlling a point process. In particular, we focus on the case where the controller can affect the intensity of the point process. This type of control differs from impulsive control, where the controller has the ability to add or erase some of the points in the sequence.

We consider a point process Nt that we wish to control. The control u belongs to a set U of admissible controls. We will assume that U consists of real-valued processes defined on (Ω, F) and adapted to F^N_t; in addition, for each t ∈ [0, T] we assume that ut ∈ Ut. Moreover, for each u ∈ U the point process Nt admits a (Pu, Ft)-intensity λt(u). Here, Ft is some filtration associated to Nt.

The performance measure that we will consider is given by

J(u) = Eu[ ∫_0^T Cs(u) ds + φT(u) ].   (6.3.1)

The expectation in J(u) above is taken with respect to Pu. The function Ct(u) is an Ft-progressive process and φT(u) is an FT-measurable random variable.

We will consider a problem with complete information, so that Ft ≡ F^N_t. In addition, we assume local dynamics:

ut = u(t, Nt) is F^N_t-predictable,
λt(u) = λ(t, Nt, ut),
Ct(u) = C(t, Nt, ut),
φT = φ(T, NT).

Exercise 6.3.1 Consider the cost function

J(u) = E[ Σ_{0<Tn≤T} k_{Tn}(u) ],

where kt(u) is a nonnegative Ft-measurable process. Show that this cost function can be written in the form given by equation (6.3.1).

6.3.1 Dynamic Programming for Intensity Control

Theorem 6.3.1 (Hamilton-Jacobi Sufficient Conditions)

Suppose there exists for each n ∈ N+ a differentiable, bounded, Ft-progressive mapping V(t, ω, n) such that for all ω ∈ Ω and all n ∈ N+,

∂/∂t V(t, ω, n) + inf_{v∈Ut} { λ(t, ω, n, v) [ V(t, ω, n+1) − V(t, ω, n) ] + C(t, ω, n, v) } = 0,   (6.3.2)
V(T, ω, n) = φ(T, ω, n),   (6.3.3)

and suppose there exists for each n ∈ N+ an F^N_t-predictable process u∗(t, ω, n) such that u∗(t, ω, n) achieves the minimum in equation (6.3.2). Then u∗ is the optimal control.


Exercise 6.3.2 Prove the theorem.

This theorem lacks practical applicability because the value function is, in general, path dependent. The analysis can be greatly simplified if we assume that the problem is Markovian.

Corollary 6.3.1 (Markovian Control)

Suppose that λ(t, ω, n, v), C(t, ω, n, v), and φ(t, ω, n) do not depend on ω and that there is a function V(t, n) such that

∂/∂t V(t, n) + inf_{v∈Ut} { λ(t, n, v) [ V(t, n+1) − V(t, n) ] + C(t, n, v) } = 0,   (6.3.4)
V(T, n) = φ(T, n),   (6.3.5)

and suppose that the minimum is achieved by a measurable function u∗(t, n). Then u∗ is the optimal control.

6.4 Applications to Revenue Management

In this section we present an application of point process optimal control based on the work by Gallego and van Ryzin (1994).¹

¹Optimal Dynamic Pricing of Inventories with Stochastic Demand over Finite Horizons, Management Science 40, 999-1020.

6.4.1 Model Description and HJB Equation

Consider a seller who owns I units of a product that she wants to sell over a fixed time period T. Demand for the product is characterized by a point process Nt. Given a price policy p, Nt admits the intensity λ(pt). The seller's problem is to select a price strategy {pt : t ∈ [0, T]} (a predictable process) that maximizes the expected revenue over the selling horizon. That is,

max  Jp(t, I) := Ep[ ∫_0^T ps dNs ]   subject to   ∫_0^T dNs ≤ I   Pp-a.s.

In order to ensure that the problem is feasible, we will assume that there exists a price p∞ such that λ(p∞) = 0 a.s. In this case, we define the set of admissible pricing policies A as the set of predictable policies p(t, I − Nt) such that p(t, 0) = p∞. The seller's problem becomes to maximize the expected revenue Jp(t, I) over p ∈ A.

Using Corollary 6.3.1, we can write the optimality condition for this problem as follows:

∂t V(t, n) + sup_p { p λ(p) − λ(p) [ V(t, n) − V(t, n−1) ] } = 0,
V(T, n) = 0.

Let us make the following transformation of time, t ← T − t, so that t measures the remaining selling time. Also, instead of looking at the price p as the decision variable, we use λ as the control and p(λ) as the inverse demand function. The revenue rate r(λ) := λ p(λ) is assumed to be regular, i.e., continuous, bounded, concave, with a bounded maximizer λ∗ = min{λ : λ ∈ argmax r(λ)} and such that lim_{λ→0} r(λ) = 0. Under these new definitions, the HJB equation becomes

∂t V(t, n) = sup_λ { r(λ) − λ [ V(t, n) − V(t, n−1) ] },
V(0, n) = 0.

Proposition 6.4.1 If λ(p) is a regular demand function, then there exists a unique solution to the HJB equation. Further, the optimal intensities satisfy λ∗(t, n) ≤ λ∗ for all n and all 0 ≤ t ≤ T.

Closed-form solutions to the HJB equation are generally not available; however, the optimality condition can be exploited to obtain some qualitative results about the optimal solution.

Theorem 6.4.1 The optimal value function V∗(t, n) is strictly increasing and strictly concave in both n and t. Furthermore, there exists an optimal intensity λ∗(n, t) that is strictly increasing in n

and strictly decreasing in t.

The following figure plots a sample path of price and inventory under an optimal policy.

[Figure 6.4.1: Path of an optimal price policy and its inventory level. Demand is a time-homogeneous Poisson process with intensity λ(p) = exp(−0.1p), the initial inventory is C0 = 20, and the selling horizon is H = 100. The dashed line corresponds to the minimum price pmin = 10.]
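The HJB equation above can be integrated numerically. The sketch below (not from the notes or the paper) uses a crude forward Euler scheme in the remaining-time variable for the exponential demand λ(p) = exp(−0.1p) of the figure; for this demand the supremum is attained at p∗(t, n) = 10 + V(t, n) − V(t, n−1), which follows by differentiating (p − ΔV) e^{−0.1p}. The horizon and inventory are the figure's values; the discretization step is an assumption.

```python
import numpy as np

# Sketch: forward Euler integration of
#   dV/dt(t, n) = sup_p { lambda(p) * (p - [V(t,n) - V(t,n-1)]) },  V(0, n) = 0,
# in remaining time t, for lambda(p) = exp(-0.1 p), n_max = 20, horizon T = 100.
n_max, T, dt = 20, 100.0, 0.01
steps = int(T / dt)

V = np.zeros(n_max + 1)          # V[n] ~ V(t, n); V[0] = 0 for all t (no stock left)
for _ in range(steps):
    dV = V[1:] - V[:-1]          # V(t,n) - V(t,n-1) for n = 1..n_max
    p_star = 10.0 + dV           # maximizer of (p - dV) * exp(-0.1 p)
    V[1:] = V[1:] + dt * np.exp(-0.1 * p_star) * (p_star - dV)

prices = 10.0 + (V[1:] - V[:-1])
print("V*(T, n) for n = 1, 5, 10, 20:", np.round(V[[1, 5, 10, 20]], 2))
print("optimal prices at remaining time T:", np.round(prices[[0, 4, 9, 19]], 2))
```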

6.4.2 Bounds and Heuristics

The fact that the HJB equation is intractable in most cases creates the need for alternative solution methods. One possibility, which we consider here, is the use of the certainty-equivalent version of the problem.


That is, the deterministic control problem resulting from replacing all uncertainty by its expected value. In this case, the deterministic version of the problem is given by

V^D(T, n) = max_λ ∫_0^T r(λs) ds   subject to   ∫_0^T λs ds ≤ n.

The solution to this time-homogeneous problem can be found easily. Let λ0(T, n) = n/T, that is, the run-out rate. Then it is straightforward to show that the optimal deterministic rate is λ^D = min{λ∗, λ0(T, n)} and the optimal expected revenue is V^D(T, n) = T min{r(λ∗), r(λ0(T, n))}.

At least two things make the deterministic solution interesting.

Theorem 6.4.2 If λ(p) is a regular demand function then for all n ≥ 0 and t ≥ 0

V ∗(t, n) ≤ V D(t, n).

Thus, the deterministic value function provides an upper bound on the optimal expected revenue.

In addition, if we fix the price at p^D = p(λ^D) and denote by V^FP(t, n) the expected revenue collected from this fixed-price strategy, then we have the following important result.

Theorem 6.4.3

V^FP(t, n) / V∗(t, n) ≥ 1 − 1 / ( 2 √( min{n, λ∗ t} ) ).

Therefore, the fixed-price policy is asymptotically optimal as the number of units n or the selling horizon t becomes large.
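For the exponential-demand example of Figure 6.4.1, the deterministic upper bound and the fixed-price guarantee are simple to evaluate; the sketch below (not from the notes) computes them for n = 20 and T = 100.

```python
import numpy as np

# Sketch: deterministic upper bound and fixed-price guarantee (Theorems 6.4.2-6.4.3)
# for lambda(p) = exp(-0.1 p), so p(lam) = -10 ln(lam), r(lam) = lam * p(lam), lam* = e^{-1}.
n, T = 20, 100.0
lam_star = np.exp(-1.0)
r = lambda lam: -10.0 * lam * np.log(lam)       # revenue rate

lam0 = n / T                                    # run-out rate
lamD = min(lam_star, lam0)                      # deterministic optimal rate
V_D = T * min(r(lam_star), r(lam0))             # upper bound on V*(T, n)
guarantee = 1.0 - 1.0 / (2.0 * np.sqrt(min(n, lam_star * T)))

print("deterministic rate lam^D  =", round(lamD, 4))
print("upper bound V^D(T, n)     =", round(V_D, 2))
print("fixed-price guarantee: V^FP / V* >=", round(guarantee, 3))
```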


Chapter 7

Papers and Additional Readings
