
Reinforcement Learning and Optimal Control

and

Rollout, Policy Iteration, and

Distributed Reinforcement Learning

by

Dimitri P. Bertsekas

Class Notes for

Reinforcement Learning Course

ASU CSE 691; Spring 2021

These class notes are an extended version of Chapter 1, and Sections 2.1 and 2.2 of the book “Rollout, Policy Iteration, and Distributed Reinforcement Learning,” Athena Scientific, 2020. They can also serve as an extended version of Chapter 1, and Sections 2.1 and 2.2 of the book “Reinforcement Learning and Optimal Control,” Athena Scientific, 2019. They are made available as instructional material for the CSE 691 ASU class on Reinforcement Learning, 2021. The contents of the notes will be periodically updated. Your comments and suggestions to the author at [email protected] are very welcome.

Version of February 3, 2021


1  Exact and Approximate Dynamic Programming Principles

Contents

1.1. Introduction - AlphaZero and Value Space Approximation
1.2. Deterministic Dynamic Programming
  1.2.1. Finite Horizon Problem Formulation
  1.2.2. The Dynamic Programming Algorithm
  1.2.3. Approximation in Value Space
1.3. Stochastic Dynamic Programming
  1.3.1. Finite Horizon Problems
  1.3.2. Approximation in Value Space for Stochastic DP
  1.3.3. Infinite Horizon Problems - An Overview
  1.3.4. Infinite Horizon - Policy Iteration and Rollout
1.4. Examples, Variations, and Simplifications
  1.4.1. A Few Words About Modeling
  1.4.2. Problems with a Termination State
  1.4.3. State Augmentation, Time Delays, Forecasts, and Uncontrollable State Components
  1.4.4. Partial State Information and Belief States
  1.4.5. Multiagent Problems and Multiagent Rollout
  1.4.6. Problems with Unknown Parameters - Adaptive Control
1.5. Reinforcement Learning and Optimal Control - Some Terminology
1.6. Notes and Sources



In this chapter, we provide some background on exact dynamic programming (DP), with a view towards the suboptimal solution methods, which are based on approximation in value space and are the main subject of this book. We first discuss finite horizon problems, which involve a finite sequence of successive decisions, and are thus conceptually and analytically simpler. We then consider somewhat briefly the more intricate infinite horizon problems, but defer a more detailed treatment for Chapter 5.

We will discuss separately deterministic and stochastic problems (Sections 1.2 and 1.3, respectively). The reason is that deterministic problems are simpler and have some favorable characteristics, which allow the application of a broader variety of methods. Significantly, they include challenging discrete and combinatorial optimization problems, which can be fruitfully addressed with some of the rollout and approximate policy iteration methods that are the main focus of this book.

In subsequent chapters, we will discuss selectively some major algorithmic topics in approximate DP and reinforcement learning (RL), including rollout and policy iteration, multiagent problems, and distributed algorithms. A broader discussion of DP/RL may be found in the author's RL textbook [Ber19a], and the DP textbooks [Ber12], [Ber17], [Ber18a], the neuro-dynamic programming monograph [BeT96], as well as the textbook literature described in the last section of this chapter.

The DP/RL methods that are the principal subjects of this book, rollout and policy iteration, have a strong connection with the famous AlphaZero, AlphaGo and other related programs. As an introduction to our technical development, we take a look at this relation in the next section.

1.1 INTRODUCTION - ALPHAZERO AND VALUE SPACE APPROXIMATION

One of the most exciting recent success stories in RL is the development of the AlphaGo and AlphaZero programs by DeepMind Inc; see [SHM16], [SHS17], [SSS17]. AlphaZero plays Chess, Go, and other games, and is an improvement in terms of performance and generality over AlphaGo, which plays the game of Go only. Both programs play better than all competitor computer programs available in 2020, and much better than all humans. These programs are remarkable in several other ways. In particular, they have learned how to play without human instruction, just data generated by playing against themselves. Moreover, they learned how to play very quickly. In fact, AlphaZero learned how to play chess better than all humans and computer programs within hours (with the help of awesome parallel computation power, it must be said).

Perhaps the most impressive aspect of AlphaZero/chess is that its play is not just better, but it is also very different than human play in terms of long term strategic vision. Remarkably, AlphaZero has discovered new ways to play a game that has been studied intensively by humans for hundreds of years. Still, for all of its impressive success and brilliant implementation, AlphaZero is couched on well established theory and methodology, which is the subject of the present book, and is portable to far broader realms of engineering, economics, and other fields. This is the methodology of DP, policy iteration, limited lookahead, rollout, and approximation in value space.†

To understand the overall structure of AlphaZero, and its connection to our DP/RL methodology, it is useful to divide its design into two parts: off-line training, which is an algorithm that learns how to evaluate chess positions, and how to steer itself towards good positions with a default/base chess player, and on-line play, which is an algorithm that generates good moves in real time against a human or computer opponent, using the training it went through off-line. We will next briefly describe these algorithms, and relate them to DP concepts and principles.

Off-Line Training and Policy Iteration

This is the part of the program that learns how to play through off-line self-training, and is illustrated in Fig. 1.1.1. The algorithm generates a sequence of chess players and position evaluators. A chess player assigns "probabilities" to all possible moves at any given chess position (these are the probabilities with which the player selects the possible moves at the given position). A position evaluator assigns a numerical score to any given chess position (akin to a "probability" of winning the game from that position), and thus predicts quantitatively the performance of a player starting from any position. The chess player and the position evaluator are represented by two neural networks, a policy network and a value network, which accept a chess position and generate a set of move probabilities and a position evaluation, respectively.‡

† It is also worth noting that the principles of the AlphaZero design have much in common with the work of Tesauro [Tes94], [Tes95], [TeG96] on computer backgammon. Tesauro's programs stimulated much interest in RL in the middle 1990s, and exhibit similarly different and better play than human backgammon players. A related impressive program for the (one-player) game of Tetris, also based on the method of policy iteration, is described by Scherrer et al. [SGG15]. For a better understanding of the connections of AlphaZero and AlphaGo Zero with Tesauro's programs and the concepts developed here, the "Methods" section of the paper [SSS17] is recommended.

‡ Here the neural networks play the role of function approximators. By viewing a player as a function that assigns move probabilities to a position, and a position evaluator as a function that assigns a numerical score to a position, the policy and value networks provide approximations to these functions based on training with data (training algorithms for neural networks and other approximation architectures will be discussed in Chapter 4).



Figure 1.1.1 Illustration of the AlphaZero training algorithm. It generates a sequence of position evaluators and chess players. The position evaluator and the chess player are represented by two neural networks, a value network and a policy network, which accept a chess position and generate a position evaluation and a set of move probabilities, respectively.

In the more conventional DP-oriented terms of this book, a position is the state of the game, a position evaluator is a cost function that gives the cost-to-go at a given state, and the chess player is a randomized policy for selecting actions/controls at a given state.†

The overall training algorithm is a form of policy iteration, a DP algorithm that will be of primary interest to us in this book. Starting from a given player, it repeatedly generates (approximately) improved players, and settles on a final player that is judged empirically to be "best" out of all the players generated.‡ Policy iteration may be separated conceptually in two stages (see Fig. 1.1.1).

(a) Policy evaluation: Given the current player and a chess position, the outcome of a game played out from the position provides a single data point. Many data points are thus collected, and are used to train a value network, whose output serves as the position evaluator for that player.


† One more complication is that chess and Go are two-player games, while most of our development will involve single-player optimization. However, DP theory extends to two-player games, although we will not discuss this extension, except briefly in Chapter 3.

‡ Quoting from the paper [SSS17]: "The AlphaGo Zero selfplay algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy's recommendations, and then projects the (much stronger) search policy back into the function space of the neural network. Policy evaluation is applied to the (much stronger) search policy: the outcomes of selfplay games are also projected back into the function space of the neural network. These projection steps are achieved by training the neural network parameters to match the search probabilities and selfplay game outcome respectively."



(b) Policy improvement: Given the current player and its position evaluator, trial move sequences are selected and evaluated for the remainder of the game starting from many positions. An improved player is then generated by adjusting the move probabilities of the current player towards the trial moves that have yielded the best results. In AlphaZero this is done with a complicated algorithm called Monte Carlo Tree Search, which will be described in Chapter 2. However, policy improvement can be done more simply. For example one could try all possible move sequences from a given position, extending forward to a given number of moves, and then evaluate the terminal position with the player's position evaluator. The move evaluations obtained in this way are used to nudge the move probabilities of the current player towards more successful moves, thereby obtaining data that is used to train a policy network that represents the new player.

On-Line Play and Approximation in Value Space - Rollout

Suppose that a "final" player has been obtained through the AlphaZero off-line training process just described. It could then be used in principle to play chess against any human or computer opponent, since it is capable of generating move probabilities at each given chess position using its policy network. In particular, during on-line play, at a given position the player can simply choose the move of highest probability supplied by the off-line trained policy network. This player would play very fast on-line, but it would not play good enough chess to beat strong human opponents. The extraordinary strength of AlphaZero is attained only after the player and its position evaluator obtained from off-line training have been embedded into another algorithm, which we refer to as the "on-line player." This algorithm plays as follows, given the policy network/player obtained off-line and its value network/position evaluator (see Fig. 1.1.2).

At a given position, it generates a lookahead tree of all possible multiple move and countermove sequences, up to a given depth. It then runs the off-line obtained player for some more moves, and then evaluates the effect of the remaining moves by using the position evaluator of the off-line obtained value network. Actually the middle portion, called "truncated rollout," is not used in the published version of AlphaZero/chess [SHS17]; the first portion (multistep lookahead) is quite long and implemented efficiently, so that the rollout portion is not essential. Rollout is used in AlphaGo [SHM16], and plays a very important role in Tesauro's backgammon program [TeG96]. The reason is that in backgammon, long multistep lookahead is not possible because of rapid expansion of the lookahead tree with every move.†


Figure 1.1.2 Illustration of the AlphaGo/AlphaZero on-line player. At a given position, it generates a lookahead tree of multiple moves up to a given depth, then runs the off-line obtained player for some more moves, and then evaluates the effect of the remaining moves by using the position evaluator of the off-line obtained player.


We should note that the preceding description of AlphaZero and related games is oversimplified. We will be adding refinements and details as the book progresses. However, DP ideas with cost function approximations, similar to the on-line player illustrated in Fig. 1.1.2, will be central for our purposes. They will be generically referred to as approximation in value space. Moreover, the algorithmic division between off-line training and on-line policy implementation will be conceptually very important for our purposes.

Note that these two processes may be decoupled and may be designed independently. For example the off-line training portion may be very simple, such as using a simple known policy for rollout without truncation, or without terminal cost approximation. Conversely, a sophisticated process may be used for off-line training of a terminal cost function approximation, which is used immediately following one-step or multistep lookahead in a value space approximation scheme.

† Tesauro’s rollout-based backgammon program [TeG96] does not use a pol-

icy network. It involves only a value network trained by his earlier TD-Gammon

algorithm [Tes94] (an approximate policy iteration algorithm), which is used to

generate moves for the truncated rollout via a one-step or two-step lookahead

minimization. The terminal cost approximation used at the end of the truncated

rollout is also provided by the value network.



1.2 DETERMINISTIC DYNAMIC PROGRAMMING

In all DP problems, the central object is a discrete-time dynamic system that generates a sequence of states under the influence of control. The system may evolve deterministically or randomly (under the additional influence of a random disturbance).

1.2.1 Finite Horizon Problem Formulation

In finite horizon problems the system evolves over a finite number N of time steps (also called stages). The state and control at time k of the system will be generally denoted by xk and uk, respectively. In deterministic systems, xk+1 is generated nonrandomly, i.e., it is determined solely by xk and uk. Thus, a deterministic DP problem involves a system of the form

\[
x_{k+1} = f_k(x_k, u_k), \qquad k = 0, 1, \ldots, N-1, \tag{1.1}
\]

where

k is the time index,

xk is the state of the system, an element of some space,

uk is the control or decision variable, to be selected at time k from some given set Uk(xk) that depends on xk,

fk is a function of (xk, uk) that describes the mechanism by which the state is updated from time k to time k + 1,

N is the horizon, i.e., the number of times control is applied.

The set of all possible xk is called the state space at time k. It can be any set and may depend on k. Similarly, the set of all possible uk is called the control space at time k. Again it can be any set and may depend on k. Similarly the system function fk can be arbitrary and may depend on k.†

† This generality is one of the great strengths of the DP methodology and guides the exposition style of this book, as well as other DP works of the author. By allowing general state and control spaces (discrete, continuous, or mixtures thereof), and a k-dependent choice of these spaces, we can focus attention on the truly essential algorithmic aspects of the DP solution approach, exclude extraneous assumptions and constraints from our model, and avoid duplication of analysis.

The generality of our DP model is also partly responsible for our choice of notation. In the artificial intelligence and operations research communities, finite state models, often referred to as Markovian Decision Problems (MDP), are common and use a transition probability notation (see Chapter 5). Unfortunately, this notation is not well suited for continuous spaces models, which are important for the purposes of this book. It involves transition probability distributions over continuous spaces, and leads to mathematics that are far more complex as well as less intuitive than those based on the use of the system function (1.1).


Figure 1.2.2 Transition graph for a deterministic finite-state system. Nodes correspond to states xk. Arcs correspond to state-control pairs (xk, uk). An arc (xk, uk) has start and end nodes xk and xk+1 = fk(xk, uk), respectively. We view the cost gk(xk, uk) of the transition as the length of this arc. The problem is equivalent to finding a shortest path from initial nodes of stage 0 to terminal node t.

The problem also involves a cost function that is additive in the sense that the cost incurred at time k, denoted by gk(xk, uk), accumulates over time. Formally, gk is a function of (xk, uk) that takes real number values, and may depend on k. For a given initial state x0, the total cost of a control sequence {u0, . . . , uN−1} is
\[
J(x_0; u_0, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k), \tag{1.2}
\]
where gN(xN) is a terminal cost incurred at the end of the process. This is a well-defined number, since the control sequence {u0, . . . , uN−1} together with x0 determines exactly the state sequence {x1, . . . , xN} via the system equation (1.1); see Figure 1.2.1. We want to minimize the cost (1.2) over all sequences {u0, . . . , uN−1} that satisfy the control constraints, thereby obtaining the optimal value as a function of x0:†
\[
J^*(x_0) = \min_{\substack{u_k \in U_k(x_k) \\ k=0,\ldots,N-1}} J(x_0; u_0, \ldots, u_{N-1}). \tag{1.3}
\]

† Here and later we write "min" (rather than "inf") even if we are not sure that the minimum is attained; similarly we write "max" (rather than "sup") even if we are not sure that the maximum is attained.
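To make the abstract model concrete, the following is a minimal Python sketch (illustrative code, not from the text; the arguments f, g, gN, and U stand for fk, gk, gN, and Uk) that simulates the system equation (1.1), evaluates the cost (1.2) of a given control sequence, and computes the optimal value J*(x0) of Eq. (1.3) by exhaustive enumeration, which is viable only for very small problems.

```python
def trajectory_cost(x0, controls, f, g, gN):
    """Cost (1.2) of the control sequence `controls` applied from x0,
    simulating the deterministic system x_{k+1} = f_k(x_k, u_k) of Eq. (1.1)."""
    x, total = x0, 0.0
    for k, u in enumerate(controls):
        total += g(k, x, u)          # stage cost g_k(x_k, u_k)
        x = f(k, x, u)               # next state via the system equation
    return total + gN(x)             # add the terminal cost g_N(x_N)


def optimal_cost(x0, N, f, g, gN, U):
    """J*(x0) of Eq. (1.3), computed by exhaustive enumeration of all feasible
    control sequences; viable only for small horizons and control sets."""
    def best_from(k, x):
        if k == N:
            return gN(x)
        return min(g(k, x, u) + best_from(k + 1, f(k, x, u)) for u in U(k, x))
    return best_from(0, x0)


# Toy usage: scalar system x_{k+1} = x_k + u_k with quadratic costs.
f = lambda k, x, u: x + u
g = lambda k, x, u: x**2 + u**2
gN = lambda x: x**2
U = lambda k, x: (-1, 0, 1)                          # control constraint set U_k(x_k)
print(trajectory_cost(5, [-1, -1, -1], f, g, gN))    # cost of one control sequence
print(optimal_cost(5, 3, f, g, gN, U))               # optimal cost J*(5) for N = 3
```

The exhaustive enumeration is included only to fix ideas; the DP algorithm of Section 1.2.2 organizes the same minimization far more efficiently.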

Discrete Optimal Control Problems

There are many situations where the state and control spaces are naturally discrete and consist of a finite number of elements. Such problems are often conveniently described with an acyclic graph specifying for each state xk the possible transitions to next states xk+1. The nodes of the graph correspond to states xk and the arcs of the graph correspond to state-control pairs (xk, uk). Each arc with start node xk corresponds to a choice of a single control uk ∈ Uk(xk) and has as end node the next state fk(xk, uk). The cost of an arc (xk, uk) is defined as gk(xk, uk); see Fig. 1.2.2. To handle the final stage, an artificial terminal node t is added. Each state xN at stage N is connected to the terminal node t with an arc having cost gN(xN).

Note that control sequences {u0, . . . , uN−1} correspond to paths originating at the initial state (a node at stage 0) and terminating at one of the nodes corresponding to the final stage N. If we view the cost of an arc as its length, we see that a deterministic finite-state finite-horizon problem is equivalent to finding a minimum-length (or shortest) path from the initial nodes of the graph (stage 0) to the terminal node t. Here, by the length of a path we mean the sum of the lengths of its arcs.†

Generally, combinatorial optimization problems can be formulated as deterministic finite-state finite-horizon optimal control problems, as we will see in Section 1.4. The idea is to break down the solution into components, which can be computed sequentially. The following is an illustrative example.

Example 1.2.1 (A Deterministic Scheduling Problem)

Suppose that to produce a certain product, four operations must be performed on a certain machine. The operations are denoted by A, B, C, and D. We assume that operation B can be performed only after operation A has been performed, and operation D can be performed only after operation C has been performed. (Thus the sequence CDAB is allowable but the sequence CDBA is not.) The setup cost Cmn for passing from any operation m to any other operation n is given. There is also an initial startup cost SA or SC for starting with operation A or C, respectively (cf. Fig. 1.2.3). The cost of a sequence is the sum of the setup costs associated with it; for example, the operation sequence ACDB has cost SA + CAC + CCD + CDB.

We can view this problem as a sequence of three decisions, namely the choice of the first three operations to be performed (the last operation is determined from the preceding three). It is appropriate to consider as state the set of operations already performed, the initial state being an artificial state corresponding to the beginning of the decision process. The possible state transitions corresponding to the possible states and decisions for this problem are shown in Fig. 1.2.3. Here the problem is deterministic, i.e., at a given state, each choice of control leads to a uniquely determined state. For example, at state AC the decision to perform operation D leads to state ACD with certainty, and has cost CCD. Thus the problem can be conveniently represented with the transition graph of Fig. 1.2.3. The optimal solution corresponds to the path that starts at the initial state and ends at some state at the terminal time and has minimum sum of arc costs plus the terminal cost.

† It turns out also that any shortest path problem (with a possibly nonacyclic graph) can be reformulated as a finite-state deterministic optimal control problem. See [Ber17], Section 2.1, and [Ber91], [Ber98] for extensive accounts of shortest path methods, which connect with our discussion here.


Figure 1.2.3 The transition graph of the deterministic scheduling problem of Example 1.2.1. Each arc of the graph corresponds to a decision leading from some state (the start node of the arc) to some other state (the end node of the arc). The corresponding cost is shown next to the arc. The cost of the last operation is shown as a terminal cost next to the terminal nodes of the graph.


1.2.2 The Dynamic Programming Algorithm

In this section we will state the DP algorithm and formally justify it. The algorithm rests on a simple idea, the principle of optimality, which roughly states the following; see Fig. 1.2.4.

Principle of Optimality

Let {u*0, . . . , u*N−1} be an optimal control sequence, which together with x0 determines the corresponding state sequence {x*1, . . . , x*N} via the system equation (1.1). Consider the subproblem whereby we start at x*k at time k and wish to minimize the "cost-to-go" from time k to time N,
\[
g_k(x_k^*, u_k) + \sum_{m=k+1}^{N-1} g_m(x_m, u_m) + g_N(x_N),
\]
over {uk, . . . , uN−1} with um ∈ Um(xm), m = k, . . . , N − 1. Then the truncated optimal control sequence {u*k, . . . , u*N−1} is optimal for this subproblem.

Figure 1.2.4 Schematic illustration of the principle of optimality. The tail {u*k, . . . , u*N−1} of an optimal sequence {u*0, . . . , u*N−1} is optimal for the tail subproblem that starts at the state x*k of the optimal state trajectory.

The subproblem referred to above is called the tail subproblem that starts at x*k. Stated succinctly, the principle of optimality says that the tail of an optimal sequence is optimal for the tail subproblem. Its intuitive justification is simple. If the truncated control sequence {u*k, . . . , u*N−1} were not optimal as stated, we would be able to reduce the cost further by switching to an optimal sequence for the subproblem once we reach x*k (since the preceding choices of controls, u*0, . . . , u*k−1, do not restrict our future choices).

For an auto travel analogy, suppose that the fastest route from Phoenix to Boston passes through St Louis. The principle of optimality translates to the obvious fact that the St Louis to Boston portion of the route is also the fastest route for a trip that starts from St Louis and ends in Boston.†

The principle of optimality suggests that the optimal cost function can be constructed in piecemeal fashion going backwards: first compute the optimal cost function for the "tail subproblem" involving the last stage, then solve the "tail subproblem" involving the last two stages, and continue in this manner until the optimal cost function for the entire problem is constructed.

The DP algorithm is based on this idea: it proceeds sequentially, by solving all the tail subproblems of a given time length, using the solution of the tail subproblems of shorter time length. We illustrate the algorithm with the scheduling problem of Example 1.2.1. The calculations are simple but tedious, and may be skipped without loss of continuity. However, they may be worth going over by a reader that has no prior experience in the use of DP.

† In the words of Bellman [Bel57]: "An optimal trajectory has the property that at an intermediate point, no matter how it was reached, the rest of the trajectory must coincide with an optimal trajectory as computed from this intermediate point as the starting point."



Example 1.2.1 (Scheduling Problem - Continued)

Let us consider the scheduling Example 1.2.1, and let us apply the principle of optimality to calculate the optimal schedule. We have to schedule optimally the four operations A, B, C, and D. There is a cost for a transition between two operations, and the numerical values of the transition costs are shown in Fig. 1.2.5 next to the corresponding arcs.

Figure 1.2.5 Transition graph of the deterministic scheduling problem, with the cost of each decision shown next to the corresponding arc. Next to each node/state we show the cost to optimally complete the schedule starting from that state. This is the optimal cost of the corresponding tail subproblem (cf. the principle of optimality). The optimal cost for the original problem is equal to 10, as shown next to the initial state. The optimal schedule corresponds to the thick-line arcs.

According to the principle of optimality, the "tail" portion of an optimal schedule must be optimal. For example, suppose that the optimal schedule is CABD. Then, having scheduled first C and then A, it must be optimal to complete the schedule with BD rather than with DB. With this in mind, we solve all possible tail subproblems of length two, then all tail subproblems of length three, and finally the original problem that has length four (the subproblems of length one are of course trivial because there is only one operation that is as yet unscheduled). As we will see shortly, the tail subproblems of length k + 1 are easily solved once we have solved the tail subproblems of length k, and this is the essence of the DP technique.

Tail Subproblems of Length 2: These subproblems are the ones that involve two unscheduled operations and correspond to the states AB, AC, CA, and CD (see Fig. 1.2.5).

State AB: Here it is only possible to schedule operation C as the next operation, so the optimal cost of this subproblem is 9 (the cost of scheduling C after B, which is 3, plus the cost of scheduling D after C, which is 6).

State AC: Here the possibilities are to (a) schedule operation B and then D, which has cost 5, or (b) schedule operation D and then B, which has cost 9. The first possibility is optimal, and the corresponding cost of the tail subproblem is 5, as shown next to node AC in Fig. 1.2.5.

State CA: Here the possibilities are to (a) schedule operation B and then D, which has cost 3, or (b) schedule operation D and then B, which has cost 7. The first possibility is optimal, and the corresponding cost of the tail subproblem is 3, as shown next to node CA in Fig. 1.2.5.

State CD: Here it is only possible to schedule operation A as the next operation, so the optimal cost of this subproblem is 5.

Tail Subproblems of Length 3: These subproblems can now be solved using the optimal costs of the subproblems of length 2.

State A: Here the possibilities are to (a) schedule next operation B (cost 2) and then solve optimally the corresponding subproblem of length 2 (cost 9, as computed earlier), a total cost of 11, or (b) schedule next operation C (cost 3) and then solve optimally the corresponding subproblem of length 2 (cost 5, as computed earlier), a total cost of 8. The second possibility is optimal, and the corresponding cost of the tail subproblem is 8, as shown next to node A in Fig. 1.2.5.

State C: Here the possibilities are to (a) schedule next operation A (cost 4) and then solve optimally the corresponding subproblem of length 2 (cost 3, as computed earlier), a total cost of 7, or (b) schedule next operation D (cost 6) and then solve optimally the corresponding subproblem of length 2 (cost 5, as computed earlier), a total cost of 11. The first possibility is optimal, and the corresponding cost of the tail subproblem is 7, as shown next to node C in Fig. 1.2.5.

Original Problem of Length 4: The possibilities here are (a) start with operation A (cost 5) and then solve optimally the corresponding subproblem of length 3 (cost 8, as computed earlier), a total cost of 13, or (b) start with operation C (cost 3) and then solve optimally the corresponding subproblem of length 3 (cost 7, as computed earlier), a total cost of 10. The second possibility is optimal, and the corresponding optimal cost is 10, as shown next to the initial state node in Fig. 1.2.5.

Note that having computed the optimal cost of the original problem through the solution of all the tail subproblems, we can construct the optimal schedule: we begin at the initial node and proceed forward, each time choosing the optimal operation, i.e., the one that starts the optimal schedule for the corresponding tail subproblem. In this way, by inspection of the graph and the computational results of Fig. 1.2.5, we determine that CABD is the optimal schedule.


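The hand calculation above is easy to reproduce programmatically. The sketch below is illustrative code, not from the text; the startup and setup costs it uses are the ones that can be read off from the calculations above (e.g., SA = 5, SC = 3, CAB = 2, CAC = 3, CCD = 6), and it solves every tail subproblem by recursion over the sequence of operations already scheduled.

```python
# Startup costs S and setup costs C_mn, as read off from the calculations above.
S = {"A": 5, "C": 3}
C = {("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4,
     ("B", "C"): 3, ("B", "D"): 1,
     ("C", "A"): 4, ("C", "B"): 4, ("C", "D"): 6,
     ("D", "A"): 3, ("D", "B"): 3}


def feasible_next(done):
    """Operations that may be scheduled next: B requires A, D requires C."""
    ops = []
    for o in "ABCD":
        if o in done:
            continue
        if o == "B" and "A" not in done:
            continue
        if o == "D" and "C" not in done:
            continue
        ops.append(o)
    return ops


def tail_cost(done):
    """Optimal cost and completion of the tail subproblem at state `done`,
    the sequence of operations scheduled so far (cf. the nodes of Fig. 1.2.5)."""
    if len(done) == 4:
        return 0, ""
    best, best_seq = float("inf"), None
    for o in feasible_next(done):
        step = S[o] if not done else C[(done[-1], o)]   # arc cost of scheduling o next
        cost, seq = tail_cost(done + o)
        if step + cost < best:
            best, best_seq = step + cost, o + seq
    return best, best_seq


print(tail_cost(""))     # (10, 'CABD'): optimal cost and optimal schedule
print(tail_cost("AC"))   # (5, 'BD'): tail subproblem of length 2 at state AC
```

Running it reproduces the optimal cost of 10 and the optimal schedule CABD, as well as the tail subproblem costs computed above (for example, cost 5 at state AC).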

Finding an Optimal Control Sequence by DP

We now state the DP algorithm for deterministic finite horizon problems by translating into mathematical terms the heuristic argument underlying the principle of optimality. The algorithm constructs functions
\[
J_N^*(x_N),\ J_{N-1}^*(x_{N-1}),\ \ldots,\ J_0^*(x_0),
\]
sequentially, starting from J*N, and proceeding backwards to J*N−1, J*N−2, etc. The value J*k(xk) represents the optimal cost of the tail subproblem that starts at state xk at time k.

DP Algorithm for Deterministic Finite Horizon Problems

Start with
\[
J_N^*(x_N) = g_N(x_N), \qquad \text{for all } x_N, \tag{1.4}
\]
and for k = 0, . . . , N − 1, let
\[
J_k^*(x_k) = \min_{u_k \in U_k(x_k)} \Bigl[ g_k(x_k, u_k) + J_{k+1}^*\bigl(f_k(x_k, u_k)\bigr) \Bigr], \qquad \text{for all } x_k. \tag{1.5}
\]

Note that at stage k, the calculation in Eq. (1.5) must be done for all states xk before proceeding to stage k − 1. The key fact about the DP algorithm is that for every initial state x0, the number J*0(x0) obtained at the last step, is equal to the optimal cost J*(x0). Indeed, a more general fact can be shown, namely that for all k = 0, 1, . . . , N − 1, and all states xk at time k, we have
\[
J_k^*(x_k) = \min_{\substack{u_m \in U_m(x_m) \\ m=k,\ldots,N-1}} J(x_k; u_k, \ldots, u_{N-1}), \tag{1.6}
\]
where J(xk; uk, . . . , uN−1) is the cost generated by starting at xk and using subsequent controls uk, . . . , uN−1:
\[
J(x_k; u_k, \ldots, u_{N-1}) = g_N(x_N) + \sum_{m=k}^{N-1} g_m(x_m, u_m). \tag{1.7}
\]


Thus, J*k(xk) is the optimal cost for an (N − k)-stage tail subproblem that starts at state xk and time k, and ends at time N.† Based on the interpretation (1.6) of J*k(xk), we call it the optimal cost-to-go from state xk at stage k, and refer to J*k as the optimal cost-to-go function or optimal cost function at time k. In maximization problems the DP algorithm (1.5) is written with maximization in place of minimization, and then J*k is referred to as the optimal value function at time k.

Once the functions J*0, . . . , J*N have been obtained, we can use a forward algorithm to construct an optimal control sequence {u*0, . . . , u*N−1} and corresponding state trajectory {x*1, . . . , x*N} for the given initial state x0.

Construction of Optimal Control Sequence {u*0, . . . , u*N−1}

Set
\[
u_0^* \in \arg\min_{u_0 \in U_0(x_0)} \Bigl[ g_0(x_0, u_0) + J_1^*\bigl(f_0(x_0, u_0)\bigr) \Bigr],
\]
and
\[
x_1^* = f_0(x_0, u_0^*).
\]
Sequentially, going forward, for k = 1, 2, . . . , N − 1, set
\[
u_k^* \in \arg\min_{u_k \in U_k(x_k^*)} \Bigl[ g_k(x_k^*, u_k) + J_{k+1}^*\bigl(f_k(x_k^*, u_k)\bigr) \Bigr], \tag{1.8}
\]
and
\[
x_{k+1}^* = f_k(x_k^*, u_k^*).
\]

† We can prove this by induction. The assertion holds for k = N in view of the initial condition
\[
J_N^*(x_N) = g_N(x_N).
\]
To show that it holds for all k, we use Eqs. (1.6) and (1.7) to write
\[
\begin{aligned}
J_k^*(x_k) &= \min_{\substack{u_m \in U_m(x_m) \\ m=k,\ldots,N-1}} \left[ g_N(x_N) + \sum_{m=k}^{N-1} g_m(x_m, u_m) \right]\\
&= \min_{u_k \in U_k(x_k)} \left[ g_k(x_k, u_k) + \min_{\substack{u_m \in U_m(x_m) \\ m=k+1,\ldots,N-1}} \left[ g_N(x_N) + \sum_{m=k+1}^{N-1} g_m(x_m, u_m) \right] \right]\\
&= \min_{u_k \in U_k(x_k)} \Bigl[ g_k(x_k, u_k) + J_{k+1}^*\bigl(f_k(x_k, u_k)\bigr) \Bigr],
\end{aligned}
\]
where for the last equality we use the induction hypothesis. A subtle mathematical point here is that, through the minimization operation, the cost-to-go functions J*k may take the value −∞ for some xk. Still the preceding induction argument is valid even if this is so.



The same algorithm can be used to find an optimal control sequence for any tail subproblem. Figure 1.2.5 traces the calculations of the DP algorithm for the scheduling Example 1.2.1. The numbers next to the nodes give the corresponding cost-to-go values, and the thick-line arcs give the construction of the optimal control sequence using the preceding algorithm.
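For reference, here is a generic sketch of the two boxed procedures for a finite-state problem (illustrative code, not from the text): a backward pass implementing Eqs. (1.4)-(1.5), which also records a minimizing control at each state, and a forward pass that recovers an optimal sequence as in Eq. (1.8). It assumes that states[k] lists the states of interest at stage k and that f, g, gN, and U are supplied as Python callables.

```python
def backward_dp(states, U, f, g, gN, N):
    """Backward pass, Eqs. (1.4)-(1.5): J[k][x] = J*_k(x) for every state x in
    states[k], together with a minimizing control mu[k][x]. It is assumed that
    states[k+1] contains f(k, x, u) for every x in states[k] and u in U(k, x)."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states[N]:
        J[N][x] = gN(x)                                   # Eq. (1.4)
    for k in range(N - 1, -1, -1):
        for x in states[k]:
            best_u, best = None, float("inf")
            for u in U(k, x):
                q = g(k, x, u) + J[k + 1][f(k, x, u)]     # Eq. (1.5)
                if q < best:
                    best_u, best = u, q
            J[k][x], mu[k][x] = best, best_u
    return J, mu


def forward_construction(x0, mu, f, N):
    """Forward pass, cf. Eq. (1.8): recover an optimal control sequence and the
    corresponding state trajectory from the tables built by backward_dp."""
    controls, trajectory, x = [], [x0], x0
    for k in range(N):
        u = mu[k][x]          # a control attaining the minimum in Eq. (1.8)
        controls.append(u)
        x = f(k, x, u)
        trajectory.append(x)
    return controls, trajectory
```

Storing the minimizer mu[k][x] during the backward pass is equivalent to redoing the minimization of Eq. (1.8) with the computed J*k+1, since the same expression is minimized in both places.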

DP Algorithm for General Discrete Optimization Problems

We have noted earlier that discrete deterministic optimization problems, including challenging combinatorial problems, can be typically formulated as DP problems by breaking down each feasible solution into a sequence of decisions/controls, as illustrated with the scheduling Example 1.2.1. This formulation often leads to an intractable DP computation because of an exponential explosion of the number of states as time progresses. However, a DP formulation brings to bear approximate DP methods, such as rollout and others, to be discussed shortly. We illustrate the reformulation by an example and then generalize.

Example 1.2.2 (The Traveling Salesman Problem)

An important model for scheduling a sequence of operations is the classical traveling salesman problem. Here we are given N cities and the travel time between each pair of cities. We wish to find a minimum time travel that visits each of the cities exactly once and returns to the start city. To convert this problem to a DP problem, we form a graph whose nodes are the sequences of k distinct cities, where k = 1, . . . , N. The k-city sequences correspond to the states of the kth stage. The initial state x0 consists of some city, taken as the start (city A in the example of Fig. 1.2.6). A k-city node/state leads to a (k + 1)-city node/state by adding a new city at a cost equal to the travel time between the last two of the k + 1 cities; see Fig. 1.2.6. Each sequence of N cities is connected to an artificial terminal node t with an arc of cost equal to the travel time from the last city of the sequence to the starting city, thus completing the transformation to a DP problem.

The optimal costs-to-go from each node to the terminal state can be obtained by the DP algorithm and are shown next to the nodes. Note, however, that the number of nodes grows exponentially with the number of cities N. This makes the DP solution intractable for large N. As a result, large traveling salesman and related scheduling problems are typically addressed with approximation methods, some of which are based on DP, and will be discussed in future chapters.


Figure 1.2.6 Example of a DP formulation of the traveling salesman problem. The travel times between the four cities A, B, C, and D are shown in the matrix at the bottom. We form a graph whose nodes are the k-city sequences and correspond to the states of the kth stage, assuming that A is the starting city. The transition costs/travel times are shown next to the arcs. The optimal costs-to-go are generated by DP starting from the terminal state and going backwards towards the initial state, and are shown next to the nodes. There is a unique optimal sequence here (ABDCA), and it is marked with thick lines. The optimal sequence can be obtained by forward minimization [cf. Eq. (1.8)], starting from the initial state x0.

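A direct implementation of this DP formulation is short, although it enumerates all k-city sequences and is therefore practical only for small N. The sketch below is illustrative only; in particular, the travel-time matrix is a made-up placeholder, since the matrix of Fig. 1.2.6 is not reproduced here.

```python
import itertools

# Hypothetical travel-time matrix for four cities; the actual matrix of
# Fig. 1.2.6 is not reproduced here, so these numbers are for illustration only.
cities = "ABCD"
T = {"A": {"B": 5, "C": 1, "D": 15},
     "B": {"A": 5, "C": 20, "D": 4},
     "C": {"A": 1, "B": 20, "D": 3},
     "D": {"A": 15, "B": 4, "C": 3}}


def tsp_dp(start="A"):
    """DP for the traveling salesman formulation of Example 1.2.2: a state at
    stage k is a sequence of k distinct cities beginning with `start`, and
    J[seq] is its optimal cost-to-go to the artificial terminal node t."""
    N = len(cities)
    others = [c for c in cities if c != start]
    J = {}
    # Terminal arcs: from each complete N-city sequence back to the start city.
    for perm in itertools.permutations(others):
        seq = start + "".join(perm)
        J[seq] = T[seq[-1]][start]
    # Backward through the stages: sequences of N-1, N-2, ..., 1 cities.
    for k in range(N - 1, 0, -1):
        for perm in itertools.permutations(others, k - 1):
            seq = start + "".join(perm)
            J[seq] = min(T[seq[-1]][c] + J[seq + c]
                         for c in cities if c not in seq)
    return J


J = tsp_dp()
print(J["A"])   # optimal tour cost from the initial state x0 = A
```

The dictionary J holds the optimal cost-to-go of every partial tour, from which an optimal tour can be recovered by the forward minimization of Eq. (1.8).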

Let us now extend the ideas of the preceding example to the general discrete optimization problem:
\[
\begin{aligned}
&\text{minimize} && G(u)\\
&\text{subject to} && u \in U,
\end{aligned}
\]


Figure 1.2.7 Formulation of a discrete optimization problem as a DP problem with N stages. There is a cost G(u) only at the terminal stage, on the arc connecting an N-solution u = (u0, . . . , uN−1) to the artificial terminal state. Alternative formulations may use fewer states by taking advantage of the problem's structure.

where U is a finite set of feasible solutions and G(u) is a cost function. We assume that each solution u has N components; i.e., it has the form u = (u0, . . . , uN−1), where N is a positive integer. We can then view the problem as a sequential decision problem, where the components u0, . . . , uN−1 are selected one-at-a-time. A k-tuple (u0, . . . , uk−1) consisting of the first k components of a solution is called a k-solution. We associate k-solutions with the kth stage of the finite horizon DP problem shown in Fig. 1.2.7. In particular, for k = 1, . . . , N, we view as the states of the kth stage all the k-tuples (u0, . . . , uk−1). For stage k = 0, . . . , N − 1, we view uk as the control. The initial state is an artificial state denoted s. From this state, by applying u0, we may move to any "state" (u0), with u0 belonging to the set
\[
U_0 = \bigl\{ u_0 \mid \text{there exists a solution of the form } (u_0, u_1, \ldots, u_{N-1}) \in U \bigr\}.
\]

Thus U0 is the set of choices of u0 that are consistent with feasibility. More generally, from a state (u0, . . . , uk−1), we may move to any state of the form (u0, . . . , uk−1, uk), upon choosing a control uk that belongs to the set
\[
U_k(u_0, \ldots, u_{k-1}) = \bigl\{ u_k \mid \text{for some } u_{k+1}, \ldots, u_{N-1} \text{ we have } (u_0, \ldots, u_{k-1}, u_k, u_{k+1}, \ldots, u_{N-1}) \in U \bigr\}.
\]

These are the choices of uk that are consistent with the preceding choices u0, . . . , uk−1, and are also consistent with feasibility. The last stage corresponds to the N-solutions u = (u0, . . . , uN−1), and the terminal cost is G(u); see Fig. 1.2.7. All other transitions in this DP problem formulation have cost 0.



Let
\[
J_k^*(u_0, \ldots, u_{k-1})
\]
denote the optimal cost starting from the k-solution (u0, . . . , uk−1), i.e., the optimal cost of the problem over solutions whose first k components are constrained to be equal to u0, . . . , uk−1. The DP algorithm is described by the equation
\[
J_k^*(u_0, \ldots, u_{k-1}) = \min_{u_k \in U_k(u_0,\ldots,u_{k-1})} J_{k+1}^*(u_0, \ldots, u_{k-1}, u_k), \tag{1.9}
\]
with the terminal condition
\[
J_N^*(u_0, \ldots, u_{N-1}) = G(u_0, \ldots, u_{N-1}).
\]
This algorithm executes backwards in time: starting with the known function J*N = G, we compute J*N−1, then J*N−2, and so on up to computing J*0.

An optimal solution (u*0, . . . , u*N−1) is then constructed by going forward through the algorithm
\[
u_k^* \in \arg\min_{u_k \in U_k(u_0^*,\ldots,u_{k-1}^*)} J_{k+1}^*(u_0^*, \ldots, u_{k-1}^*, u_k), \qquad k = 0, \ldots, N-1, \tag{1.10}
\]
first compute u*0, then u*1, and so on up to u*N−1; cf. Eq. (1.8).

Of course here the number of states typically grows exponentially with N, but we can use the DP minimization (1.10) as a starting point for the use of approximation methods. For example we may try to use approximation in value space, whereby we replace J*k+1 with some suboptimal Jk+1 in Eq. (1.10). One possibility is to use as
\[
J_{k+1}(u_0^*, \ldots, u_{k-1}^*, u_k)
\]
the cost generated by a heuristic method that solves the problem suboptimally with the values of the first k + 1 decision components fixed at u*0, . . . , u*k−1, uk. This is the rollout algorithm, which is a very simple and effective approach for approximate combinatorial optimization. It will be discussed in the next section, and in Chapters 2 and 3. It will be related to the method of policy iteration and self-learning ideas in Chapter 5.
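In code, the rollout idea amounts to a single forward pass that scores each candidate component uk by the cost of a heuristic completion. The following is a schematic sketch (not from the text); U, G, and heuristic_completion are placeholder callables describing the feasible next components, the cost of a complete solution, and a base heuristic that extends a partial solution to a full feasible one.

```python
def rollout_discrete(N, U, G, heuristic_completion):
    """Rollout for the discrete optimization problem of this section: the DP
    minimization (1.10) is carried out with J*_{k+1} replaced by the cost of a
    heuristic completion of the (k+1)-solution. `U(partial)` gives the feasible
    next components, `G(u)` is the cost of a complete solution, and
    `heuristic_completion(partial)` extends a partial solution to a full
    feasible one (and is assumed to leave a complete solution unchanged)."""
    partial = ()
    for k in range(N):
        best_u, best_cost = None, float("inf")
        for u in U(partial):
            # Approximate cost-to-go of (u_0, ..., u_{k-1}, u): complete it
            # with the base heuristic and evaluate the resulting solution.
            cost = G(heuristic_completion(partial + (u,)))
            if cost < best_cost:
                best_u, best_cost = u, cost
        partial = partial + (best_u,)
    return partial, G(partial)
```

Concrete base heuristics for combinatorial problems, and the conditions under which this scheme improves on the heuristic, are discussed in Chapters 2 and 3.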

Let us finally note that while we have used a general cost function G and constraint set U in our discrete optimization model of this section, in many problems G and/or U may have a special structure, which is consistent with a sequential decision making process. The traveling salesman Example 1.2.2 is a case in point, where G consists of N components (the intercity travel costs), one per stage.


1.2.3 Approximation in Value Space

The forward optimal control sequence construction of Eq. (1.8) is possible only after we have computed J*k(xk) by DP for all xk and k. Unfortunately, in practice this is often prohibitively time-consuming, because the number of possible xk and k can be very large. However, a similar forward algorithmic process can be used if the optimal cost-to-go functions J*k are replaced by some approximations Jk. This is the basis for an idea that is central in RL: approximation in value space.† It constructs a suboptimal solution {u0, . . . , uN−1} in place of the optimal {u*0, . . . , u*N−1}, based on using Jk in place of J*k in the DP procedure (1.8).

Approximation in Value Space - Use of Jk in Place of J*k

Start with
\[
u_0 \in \arg\min_{u_0 \in U_0(x_0)} \Bigl[ g_0(x_0, u_0) + J_1\bigl(f_0(x_0, u_0)\bigr) \Bigr],
\]
and set
\[
x_1 = f_0(x_0, u_0).
\]
Sequentially, going forward, for k = 1, 2, . . . , N − 1, set
\[
u_k \in \arg\min_{u_k \in U_k(x_k)} \Bigl[ g_k(x_k, u_k) + J_{k+1}\bigl(f_k(x_k, u_k)\bigr) \Bigr], \tag{1.11}
\]
and
\[
x_{k+1} = f_k(x_k, u_k).
\]

In approximation in value space the calculation of the suboptimal sequence {u0, . . . , uN−1} is done by going forward (no backward calculation is needed once the approximate cost-to-go functions Jk are available). This is similar to the calculation of the optimal sequence {u*0, . . . , u*N−1}, and is independent of how the functions Jk are computed. The algorithm (1.11) is said to involve a one-step lookahead minimization, since it solves a one-stage DP problem for each k. In the next chapter we will also discuss the possibility of multistep lookahead, which involves the solution of an ℓ-step DP problem, where ℓ is an integer, 1 < ℓ < N − k, with a terminal cost function approximation Jk+ℓ. Multistep lookahead typically (but not always) provides better performance over one-step lookahead in RL approximation schemes, and will be discussed in Chapter 2. For example in AlphaZero/chess, long multistep lookahead is critical for good on-line performance. The intuitive reason is that with ℓ > 1 stages being treated "exactly" (by optimization), the effect of the approximation error Jk+ℓ − J*k+ℓ tends to become less significant. However, the solution of the multistep lookahead optimization problem, instead of the one-step lookahead counterpart of Eq. (1.11), becomes more time consuming.

† Approximation in value space is a simple idea that has been used quite extensively for deterministic problems, well before the development of the modern RL methodology. For example it underlies the widely used A* method for computing approximate solutions to large scale shortest path problems.



Rollout, Cost Improvement, and On-Line Replanning

A major issue in value space approximation is the construction of suitable approximate cost-to-go functions Jk. This can be done in many different ways, giving rise to some of the principal RL methods. For example, Jk may be constructed with a sophisticated off-line training method, as discussed in Section 1.1. Alternatively, Jk may be obtained on-line with rollout, which will be discussed in detail in this book, starting with the next chapter. In rollout, the approximate values Jk(xk) are obtained when needed by running a heuristic control scheme, called base heuristic or base policy, for a suitably large number of steps, starting from the state xk.
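As an illustration of this scheme, here is a minimal sketch of one-step lookahead rollout for the deterministic finite-horizon problem of Section 1.2.1 (illustrative code, not from the text; f, g, gN, U, and base_policy are placeholder callables, and truncation and terminal cost approximation are omitted).

```python
def base_heuristic_cost(k, x, base_policy, f, g, gN, N):
    """Approximate cost-to-go at (k, x): run the base policy to the end of the
    horizon and accumulate the cost (no truncation in this sketch)."""
    total = 0.0
    for m in range(k, N):
        u = base_policy(m, x)
        total += g(m, x, u)
        x = f(m, x, u)
    return total + gN(x)


def rollout_control(x0, base_policy, f, g, gN, U, N):
    """One-step lookahead rollout, cf. Eq. (1.11): at each state choose the
    control minimizing the first-stage cost plus the base heuristic's cost
    from the resulting next state."""
    x, controls, total = x0, [], 0.0
    for k in range(N):
        u = min(U(k, x),
                key=lambda v: g(k, x, v)
                + base_heuristic_cost(k + 1, f(k, x, v), base_policy, f, g, gN, N))
        controls.append(u)
        total += g(k, x, u)
        x = f(k, x, u)
    return controls, total + gN(x)
```

Each on-line decision thus requires simulating the base policy once from every candidate next state, which is the main computational cost of rollout.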

The major theoretical property of rollout is cost improvement: the cost obtained by rollout using some base heuristic is less than or equal to the corresponding cost of the base heuristic. This is true for any starting state, provided the base heuristic satisfies some simple conditions, which will be discussed in Chapter 2.†

There are also several variants of rollout, including versions involving multiple heuristics, combinations with other forms of approximation in value space methods, and multistep lookahead, which will be discussed in subsequent chapters. An important methodology that is closely related to deterministic rollout is model predictive control, which is used widely in control system design, and will be discussed in some detail in this book, starting with Section 3.1; see also Section 1.4.6. The following example is typical of the combinatorial applications that we will discuss in Chapter 3 in connection with rollout.

Example 1.2.3 (Multi-Vehicle Routing)

† For an intuitive justification of the cost improvement mechanism, note that the rollout control ũk is calculated from Eq. (1.11) to attain the minimum over uk of the sum of two terms: the first stage cost gk(xk, uk) plus the cost of the remaining stages (k+1 to N) using the heuristic controls. Thus rollout involves a first stage optimization (rather than just using the base heuristic), which accounts for the cost improvement (under relatively mild conditions). This reasoning also explains why multistep lookahead tends to provide better performance than one-step lookahead in rollout schemes.


Figure 1.2.8 An instance of the vehicle routing problem of Example 1.2.3. The two vehicles aim to collectively perform the two tasks as fast as possible, by each moving to a neighboring node at each step. The optimal routes are shown and require a total of 5 vehicle moves.

Consider n vehicles that move along the arcs of a given graph. Some of the nodes of the graph include a task to be performed by the vehicles. Each task will be performed only once, immediately after a vehicle reaches the corresponding node for the first time. We assume a horizon that is large enough to allow every task to be performed. The problem is to find a route for each vehicle so that, roughly speaking, the tasks are collectively performed by the vehicles in minimum time. To express this objective, we assume that for each move by a vehicle there is a cost of one unit. These costs are summed up to the point where all the tasks have been performed.

For a large number of vehicles and a complicated graph, this is a nontrivial combinatorial problem. It can be approached by DP, like any discrete deterministic optimization problem, as we discussed earlier, in Section 1.2.1. In particular, we can view as state at a given stage the n-tuple of current positions of the vehicles together with the list of pending tasks. Unfortunately, however, the number of these states can be enormous (it increases exponentially with the number of tasks and the number of vehicles), so an exact DP solution is intractable.

This motivates an optimization in value space approach based on rollout. For this we need an easily implementable heuristic algorithm that will solve suboptimally the problem starting from any state xk+1, and will provide the cost approximation Jk+1(xk+1) in Eq. (1.11). One possibility is based on the vehicles choosing their actions selfishly, along shortest paths to their nearest pending task, one-at-a-time in a fixed order. To illustrate, consider the two-vehicle problem of Fig. 1.2.8. We introduce a vehicle order (say, first vehicle 1 then vehicle 2), and move each vehicle in turn one step to the next node towards its nearest pending task, until all tasks have been performed.

The rollout algorithm will work as follows. At a given state xk [involving for example vehicle positions at the node pair (1, 2) and tasks at nodes 7 and 9, as in Fig. 1.2.8], we consider all possible joint vehicle moves (the controls uk at the state) resulting in the node pairs (3,5), (4,5), (3,4), (4,4), corresponding to the next states xk+1 [thus, as an example, (3,5) corresponds to vehicle 1 moving from 1 to 3, and vehicle 2 moving from 2 to 5]. We then run the heuristic starting from each of these node pairs, and accumulate the incurred costs up to the time when both tasks are completed. For example starting from the vehicle positions/next state (3,5), the heuristic will produce the following sequence of moves:

• Vehicle 1 moves from 3 to 6.

• Vehicle 2 moves from 5 to 2.

• Vehicle 1 moves from 6 to 9 and performs the task at 9.

• Vehicle 2 moves from 2 to 4.

• Vehicle 1 moves from 9 to 12.

• Vehicle 2 moves from 4 to 7 and performs the task at 7.

The two tasks are thus performed in 6 moves once the move to (3,5) has been made.

The process of running the heuristic is repeated from the other three vehicle position pairs/next states (4,5), (3,4), (4,4), and the number of moves/heuristic cost is recorded. We then choose the next state that corresponds to minimum cost. In our case the joint move to state xk+1 that involves the pair (3,4) produces the sequence

• Vehicle 1 moves from 3 to 6,

• Vehicle 2 moves from 4 to 7 and performs the task at 7,

• Vehicle 1 moves from 6 to 9 and performs the task at 9,

and performs the two tasks in 3 moves. It can be seen that it yields minimum first stage cost plus heuristic cost from the next state, as per Eq. (1.11). Thus, the rollout algorithm will choose it at the state (1,2), and move the vehicles to state (3,4). At that state the rollout process will be repeated, i.e., consider the possible next joint moves to the node pairs (6,7), (6,2), (6,1), (3,7), (3,2), (3,1), perform a heuristic calculation from each of them, compare, etc.

It can be verified that the rollout algorithm starting from the state (1,2) shown in Fig. 1.2.8 will attain the optimal cost (a total of 5 vehicle moves). It will perform much better than the heuristic, which, starting from state (1,2), will move the two vehicles to state (4,4) and then to (7,1), etc. (a total of 9 vehicle moves). This is an instance of the cost improvement property of the rollout algorithm: it performs better than its base heuristic.
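The single rollout step described above can be sketched in code as follows. This is only a schematic outline: the helpers joint_moves and base_heuristic_cost are hypothetical, with joint_moves(state) expected to enumerate the feasible joint vehicle moves together with the resulting next states and the number of vehicle moves involved, and base_heuristic_cost(state) expected to simulate the nearest-pending-task heuristic and return the number of moves it accumulates until all tasks are done.

```python
def rollout_step(state, joint_moves, base_heuristic_cost):
    """Choose the joint move minimizing (first stage cost) + (heuristic cost-to-go),
    as in Eq. (1.11) with the base heuristic supplying the cost approximation."""
    best_move, best_total = None, float("inf")
    for move, next_state, n_vehicle_moves in joint_moves(state):
        # each vehicle move costs one unit, so the first stage cost is the
        # number of vehicle moves in the joint move
        total = n_vehicle_moves + base_heuristic_cost(next_state)
        if total < best_total:
            best_move, best_total = move, total
    return best_move, best_total
```

In the instance of Fig. 1.2.8, the joint move to (3,4) would be selected, since its total of 2 + 3 = 5 is smaller than, e.g., the total of 2 + 6 = 8 for the move to (3,5).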

A related context where rollout can be very useful is when an optimal solution has been derived for a given mathematical or simulation-based model of the problem, and the model changes as the system is operating. Then the solution at hand is not optimal anymore, but it can be used as a base heuristic for rollout using the new model, thereby restoring much of the optimality loss due to the change in model. This is known as on-line replanning with rollout, and will be discussed further in Section 1.4.6. For example, in the preceding multi-vehicle example, on-line replanning would perform well if the number and location of tasks may change unpredictably, as new tasks appear or old tasks disappear, or if some of the vehicles break down and get repaired randomly over time.

Lookahead Simplification and Multiagent Problems

Regardless of the method used to select the approximate cost-to-go functions Jk, another important issue is to perform efficiently the minimization over uk ∈ Uk(xk) in Eq. (1.11). This minimization can be very time-consuming or even impossible, so it must be simplified for practical use.

An important example is when the control consists of multiple components, $u_k = (u^1_k, \ldots, u^m_k)$, with each component taking values in a finite set. Then the size of the control space grows exponentially with m, and may make the minimization intractable even for relatively small values of m. This situation arises in several types of problems, including the case of multiagent problems, where each control component is chosen by a separate decision maker, who will be referred to as an "agent." For instance in the multi-vehicle routing Example 1.2.3, each vehicle may be viewed as an agent, and the number of joint move choices by the vehicles increases exponentially with their number. This motivates another rollout approach, called multiagent or agent-by-agent rollout, which we will discuss in Section 1.4.5.

Another case where the lookahead minimization in Eq. (1.11) may become very difficult is when uk takes a continuum of values. Then, it is necessary to either approximate the control constraint set Uk(xk) by discretization, or to use a continuous optimization algorithm such as a gradient or Newton-like method. A coordinate descent-type method may also be used in the multiagent case where the control consists of multiple components, $u_k = (u^1_k, \ldots, u^m_k)$. These possibilities will be considered further in subsequent chapters. An important case in point is the model predictive control methodology, which will be discussed extensively later, starting with Section 3.1.

Q-Factors and Q-Learning

An alternative (and equivalent) form of the DP algorithm (1.5) uses the optimal cost-to-go functions J∗k indirectly. In particular, it generates the optimal Q-factors, defined for all pairs (xk, uk) and k by

$$Q^*_k(x_k, u_k) = g_k(x_k, u_k) + J^*_{k+1}\big(f_k(x_k, u_k)\big). \tag{1.12}$$

Thus the optimal Q-factors are simply the expressions that are minimized in the right-hand side of the DP equation (1.5).

Note that the optimal cost function J∗k can be recovered from the optimal Q-factor Q∗k by means of the minimization

$$J^*_k(x_k) = \min_{u_k \in U_k(x_k)} Q^*_k(x_k, u_k). \tag{1.13}$$


Moreover, the DP algorithm (1.5) can be written in an essentially equivalent form that involves Q-factors only [cf. Eqs. (1.12)-(1.13)]:

$$Q^*_k(x_k, u_k) = g_k(x_k, u_k) + \min_{u_{k+1} \in U_{k+1}(f_k(x_k, u_k))} Q^*_{k+1}\big(f_k(x_k, u_k), u_{k+1}\big).$$

Exact and approximate forms of this and other related algorithms, including counterparts for stochastic optimal control problems, comprise an important class of RL methods known as Q-learning.

The expression

$$Q_k(x_k, u_k) = g_k(x_k, u_k) + J_{k+1}\big(f_k(x_k, u_k)\big),$$

which is minimized in approximation in value space [cf. Eq. (1.11)], is known as the (approximate) Q-factor of (xk, uk).† Note that the computation of the suboptimal control (1.11) can be done through the Q-factor minimization

$$\tilde u_k \in \arg\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k). \tag{1.14}$$

This suggests the possibility of using Q-factors in place of cost functions in approximation in value space schemes. We will discuss such schemes later.
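For a problem with finitely many states and controls, the Q-factor form of the DP algorithm can be sketched as below. This is a minimal illustration under assumed interfaces: states(k), controls(x, k), f, g, and g_N are hypothetical problem data, not anything defined in the text.

```python
def q_factor_dp(N, states, controls, f, g, g_N):
    """Backward recursion over optimal Q-factors for a deterministic problem.
    Returns the Q-factor tables Q[k][(x, u)] and the optimal costs J0[x]."""
    J = {x: g_N(x) for x in states(N)}          # J*_N = terminal cost
    Q = [None] * N
    for k in reversed(range(N)):
        Qk = {}
        for x in states(k):
            for u in controls(x, k):
                Qk[(x, u)] = g(x, u, k) + J[f(x, u, k)]      # cf. Eq. (1.12)
        J = {x: min(Qk[(x, u)] for u in controls(x, k))      # cf. Eq. (1.13)
             for x in states(k)}
        Q[k] = Qk
    return Q, J
```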

1.3 STOCHASTIC DYNAMIC PROGRAMMING

We will now extend the DP algorithm and our discussion of approximation in value space to problems that involve stochastic uncertainty in their system equation and cost function. We will first discuss the finite horizon case, and the extension of the ideas underlying the principle of optimality and approximation in value space schemes. We will then consider the infinite horizon version of the problem, and provide an overview of the underlying theory and algorithmic methodology, in sufficient detail to allow us to speculate about infinite horizon extensions of finite horizon RL algorithms to be developed in Chapters 2-4. A more detailed discussion of the infinite horizon RL methodology will be given in Chapter 5.

† The term "Q-factor" has been used in the books [BeT96], [Ber19a], and is adopted here as well. Another term used is "action value" (at a given state). The terms "state-action value" and "Q-value" are also common in the literature. The name "Q-factor" originated in reference to the notation used in an influential Ph.D. thesis [Wat89] that proposed the use of Q-factors in RL.


1.3.1 Finite Horizon Problems

The stochastic optimal control problem differs from the deterministic version primarily in the nature of the discrete-time dynamic system that governs the evolution of the state xk. This system includes a random "disturbance" wk with a probability distribution Pk(· | xk, uk) that may depend explicitly on xk and uk, but not on values of prior disturbances wk−1, . . . , w0. The system has the form

$$x_{k+1} = f_k(x_k, u_k, w_k), \qquad k = 0, 1, \ldots, N-1,$$

where, as earlier, xk is an element of some state space, and the control uk is an element of some control space. The cost per stage is denoted by gk(xk, uk, wk) and also depends on the random disturbance wk; see Fig. 1.3.1. The control uk is constrained to take values in a given subset Uk(xk), which depends on the current state xk.

Given an initial state x0 and a policy π = {µ0, . . . , µN−1}, the future states xk and disturbances wk are random variables with distributions defined through the system equation

$$x_{k+1} = f_k\big(x_k, \mu_k(x_k), w_k\big), \qquad k = 0, 1, \ldots, N-1,$$

and the given distributions Pk(· | xk, uk). Thus, for given functions gk, k = 0, 1, . . . , N, the expected cost of π starting at x0 is

$$J_\pi(x_0) = \mathop{E}_{\substack{w_k \\ k=0,\ldots,N-1}}\left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \right\},$$

where the expected value operation E{·} is taken with respect to the joint distribution of all the random variables wk and xk.† An optimal policy π∗ is one that minimizes this cost; i.e.,

$$J_{\pi^*}(x_0) = \min_{\pi \in \Pi} J_\pi(x_0),$$

where Π is the set of all policies.

An important difference from the deterministic case is that we optimize not over control sequences {u0, . . . , uN−1} [cf. Eq. (1.3)], but rather over policies (also called closed-loop control laws, or feedback policies) that consist of a sequence of functions

$$\pi = \{\mu_0, \ldots, \mu_{N-1}\},$$

† We assume an introductory probability background on the part of the reader. For an account that is consistent with our use of probability in this book, see the text by Bertsekas and Tsitsiklis [BeT08].


Figure 1.3.1 Illustration of an N-stage stochastic optimal control problem. Starting from state xk, the next state under control uk is generated randomly, according to xk+1 = fk(xk, uk, wk), where wk is the random disturbance, and a random stage cost gk(xk, uk, wk) is incurred.

where µk maps states xk into controls uk = µk(xk), and satisfies the control constraints, i.e., is such that µk(xk) ∈ Uk(xk) for all xk. Policies are more general objects than control sequences, and in the presence of stochastic uncertainty, they can result in improved cost, since they allow choices of controls uk that incorporate knowledge of the state xk. Without this knowledge, the controller cannot adapt appropriately to unexpected values of the state, and as a result the cost can be adversely affected. This is a fundamental distinction between deterministic and stochastic optimal control problems.

The optimal cost depends on x0 and is denoted by J*(x0); i.e.,

$$J^*(x_0) = \min_{\pi \in \Pi} J_\pi(x_0).$$

We view J* as a function that assigns to each initial state x0 the optimal cost J*(x0), and call it the optimal cost function or optimal value function.

Finite Horizon Stochastic Dynamic Programming

The DP algorithm for the stochastic finite horizon optimal control problem has a similar form to its deterministic version, and shares several of its major characteristics:

(a) Using tail subproblems to break down the minimization over multiple stages to single stage minimizations.

(b) Generating backwards for all k and xk the values J∗k(xk), which give the optimal cost-to-go starting from state xk at stage k.

(c) Obtaining an optimal policy by minimization in the DP equations.

(d) A structure that is suitable for approximation in value space, whereby we replace J∗k by approximations Jk, and obtain a suboptimal policy by the corresponding minimization.


DP Algorithm for Stochastic Finite Horizon Problems

Start with

$$J^*_N(x_N) = g_N(x_N),$$

and for k = 0, . . . , N − 1, let

$$J^*_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + J^*_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}. \tag{1.15}$$

For each xk and k, define µ∗k(xk) = u∗k, where u∗k attains the minimum in the right side of this equation. Then, the policy π∗ = {µ∗0, . . . , µ∗N−1} is optimal.
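For problems with finitely many states, controls, and disturbance values, the above algorithm can be carried out with straightforward nested loops, as in the following Python sketch. The problem data (states, controls, the disturbance distribution P_w, the dynamics f, the stage cost g, and the terminal cost g_N) are hypothetical user-supplied callables; the sketch simply transcribes Eq. (1.15).

```python
def stochastic_dp(N, states, controls, P_w, f, g, g_N):
    """Return the optimal cost-to-go tables J[k][x] and an optimal policy mu[k][x]."""
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states(N):
        J[N][x] = g_N(x)                         # terminal condition J*_N = g_N
    for k in reversed(range(N)):
        for x in states(k):
            best_u, best_val = None, float("inf")
            for u in controls(x, k):
                # E_{w_k}{ g_k(x,u,w) + J*_{k+1}(f_k(x,u,w)) }
                val = sum(p * (g(x, u, w, k) + J[k + 1][f(x, u, w, k)])
                          for w, p in P_w(x, u, k))
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```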

The key fact is that starting from any initial state x0, the optimal cost is equal to the number J∗0(x0), obtained at the last step of the above DP algorithm. This can be proved by induction similar to the deterministic case; we will omit the proof (which incidentally involves some mathematical fine points; see the discussion of Section 1.3 in the textbook [Ber17]).

Simultaneously with the off-line computation of the optimal cost-to-go functions J∗0, . . . , J∗N, we can compute and store an optimal policy π∗ = {µ∗0, . . . , µ∗N−1} by minimization in Eq. (1.15). We can then use this policy on-line to retrieve from memory and apply the control µ∗k(xk) once we reach state xk. The alternative is to forego the storage of the policy π∗ and to calculate the control µ∗k(xk) by executing the minimization (1.15) on-line.

Let us now illustrate some of the analytical and computational aspects of the stochastic DP algorithm by means of some examples. There are a few favorable cases where the optimal cost-to-go functions J∗k and the optimal policies µ∗k can be computed analytically. A prominent such case involves a linear system and a quadratic cost function, which is a fundamental problem in control theory. We illustrate the scalar version of this problem next. The analysis can be generalized to multidimensional systems (see optimal control textbooks such as [Ber17]).

Example 1.3.1 (Linear Quadratic Optimal Control)

Consider a vehicle that moves on a straight-line road under the influence of a force uk and without friction. Our objective is to maintain the vehicle's velocity at a constant level v (as in an oversimplified cruise control system). The velocity vk at time k, after time discretization of its Newtonian dynamics and addition of stochastic noise, evolves according to

$$v_{k+1} = v_k + b u_k + w_k, \tag{1.16}$$

where wk is a stochastic disturbance with zero mean and given variance σ². By introducing xk = vk − v, the deviation between the vehicle's velocity vk at time k and the desired level v, we obtain the system equation

$$x_{k+1} = x_k + b u_k + w_k.$$

Here the coefficient b relates to a number of problem characteristics, including the weight of the vehicle and the road conditions. Moreover, if there is friction, it introduces a coefficient a < 1 multiplying vk in Eq. (1.16). For our present purposes, we assume that a = 1, and that b is constant and known to the controller (problems involving a system with unknown and/or time-varying parameters will be discussed later). To express our desire to keep xk near zero with relatively little force, we introduce a quadratic cost over N stages:

$$x_N^2 + \sum_{k=0}^{N-1} (x_k^2 + r u_k^2),$$

where r is a known nonnegative weighting parameter. We assume no constraints on xk and uk (in reality such problems include constraints, but it is common to neglect the constraints initially, and check whether they are seriously violated later).

We will apply the DP algorithm, and derive the optimal cost-to-go functions J∗k and optimal policy. We have

$$J^*_N(x_N) = x_N^2,$$

and by applying Eq. (1.15), we obtain

$$\begin{aligned}
J^*_{N-1}(x_{N-1}) &= \min_{u_{N-1}} E\big\{ x_{N-1}^2 + r u_{N-1}^2 + J^*_N(x_{N-1} + b u_{N-1} + w_{N-1}) \big\} \\
&= \min_{u_{N-1}} E\big\{ x_{N-1}^2 + r u_{N-1}^2 + (x_{N-1} + b u_{N-1} + w_{N-1})^2 \big\} \\
&= \min_{u_{N-1}} \big[ x_{N-1}^2 + r u_{N-1}^2 + (x_{N-1} + b u_{N-1})^2 + 2 E\{w_{N-1}\} (x_{N-1} + b u_{N-1}) + E\{w_{N-1}^2\} \big],
\end{aligned}$$

and finally, using the assumption E{wN−1} = 0,

$$J^*_{N-1}(x_{N-1}) = x_{N-1}^2 + \min_{u_{N-1}} \big[ r u_{N-1}^2 + (x_{N-1} + b u_{N-1})^2 \big] + \sigma^2. \tag{1.17}$$

The expression minimized over uN−1 in the preceding equation is convex quadratic in uN−1, so by setting to zero its derivative with respect to uN−1,

$$0 = 2 r u_{N-1} + 2 b (x_{N-1} + b u_{N-1}),$$

we obtain the optimal policy for the last stage:

$$\mu^*_{N-1}(x_{N-1}) = -\frac{b}{r + b^2}\, x_{N-1}.$$


Substituting this expression into Eq. (1.17), we obtain with a straightforward calculation

$$J^*_{N-1}(x_{N-1}) = P_{N-1} x_{N-1}^2 + \sigma^2,$$

where

$$P_{N-1} = \frac{r}{r + b^2} + 1.$$

We can now continue the DP algorithm to obtain J∗N−2 from J∗N−1. An important observation is that J∗N−1 is quadratic (plus an inconsequential constant term), so with a similar calculation we can derive µ∗N−2 and J∗N−2 in closed form, as a linear and a quadratic function of xN−2, respectively. This process can be continued going backwards, and it can be verified by induction that for all k, we have the optimal policy and optimal cost-to-go function in the form

$$\mu^*_k(x_k) = L_k x_k, \qquad k = 0, 1, \ldots, N-1,$$

$$J^*_k(x_k) = P_k x_k^2 + \sigma^2 \sum_{m=k}^{N-1} P_{m+1}, \qquad k = 0, 1, \ldots, N-1,$$

where

$$L_k = -\frac{b P_{k+1}}{r + b^2 P_{k+1}}, \qquad k = 0, 1, \ldots, N-1,$$

and the sequence {Pk} is generated backwards by the equation

$$P_k = \frac{r P_{k+1}}{r + b^2 P_{k+1}} + 1, \qquad k = 0, 1, \ldots, N-1,$$

starting from the terminal condition PN = 1.
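The backward recursion for Pk and the gains Lk is easy to carry out numerically, as in the following brief sketch; the values of r, b, and N used in the usage line are arbitrary illustrative choices, not data from the example.

```python
def lq_backward(r, b, N):
    """Return lists P[0..N] and L[0..N-1] for the scalar linear quadratic problem."""
    P = [0.0] * (N + 1)
    L = [0.0] * N
    P[N] = 1.0                                                 # terminal condition P_N = 1
    for k in reversed(range(N)):
        P[k] = r * P[k + 1] / (r + b**2 * P[k + 1]) + 1.0      # Riccati recursion
        L[k] = -b * P[k + 1] / (r + b**2 * P[k + 1])           # optimal gain: mu*_k(x) = L_k x
    return P, L

P, L = lq_backward(r=0.5, b=1.0, N=10)   # e.g., P[N-1] equals r/(r + b^2) + 1
```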

The process by which we obtained an analytical solution in this example is noteworthy. A little thought while tracing the steps of the algorithm will convince the reader that what simplifies the solution is the quadratic nature of the cost and the linearity of the system equation. Indeed, it can be shown in generality that when the system is linear and the cost is quadratic, the optimal policy and cost-to-go function are given by closed-form expressions, even for multi-dimensional linear systems (see [Ber17], Section 3.1). The optimal policy is a linear function of the state, and the optimal cost function is a quadratic in the state plus a constant.

Another remarkable feature of this example, which can also be extended to multi-dimensional systems, is that the optimal policy does not depend on the variance of wk, and remains unaffected when wk is replaced by its mean (which is zero in our example). This is known as certainty equivalence, and occurs in several types of problems involving a linear system and a quadratic cost; see [Ber17], Sections 3.1 and 4.2. For example it holds even when wk has nonzero mean. For other problems, certainty equivalence can be used as a basis for problem approximation, e.g., assume that certainty equivalence holds (i.e., replace stochastic quantities by some typical values, such as their expected values) and apply exact DP to the resulting deterministic optimal control problem (see the discussion in Chapter 2).


The linear quadratic type of problem illustrated in the preceding example is exceptional in that it admits an elegant analytical solution. Most DP problems encountered in practice require a computational solution, as illustrated by the next example.

Example 1.3.2 (Inventory Control - Computational Illustration of Stochastic DP)

Consider a problem of ordering a quantity of a certain item at each of N stages so as to (roughly) meet a stochastic demand, while minimizing the incurred expected cost. We denote

xk stock available at the beginning of the kth stage,

uk stock ordered (and immediately delivered) at the beginning of the kth stage,

wk demand during the kth stage with given probability distribution.

We assume that there is an upper bound B on the amount of stock that can be stored, and that any excess demand (wk − xk − uk) is turned away and is lost. As a result, the stock evolves according to the system equation

$$x_{k+1} = \max(0,\, x_k + u_k - w_k),$$

with the states xk taking values in the interval [0, B]. This also implies the control constraint

$$0 \le u_k \le B - x_k.$$

The demands w0, . . . , wN−1 are assumed to be independent nonnegative random variables with given probability distributions. Moreover we assume that xk, uk, and wk take nonnegative integer values, so the state space is finite for all k, and consists of the integers in the interval [0, B].

The cost incurred in stage k consists of two components:

(a) A cost r(xk) representing a penalty for either positive stock xk (holding cost for excess inventory) or negative stock xk (shortage cost for unfilled demand).

(b) The purchasing cost cuk, where c is the cost per unit ordered.

See Fig. 1.3.2. There is also a terminal cost gN(xN) for being left with inventory xN at the end of N stages. Thus, the total cost is

$$E\left\{ g_N(x_N) + \sum_{k=0}^{N-1} \big( r(x_k) + c u_k \big) \right\}.$$

We want to minimize this cost by proper choice of the orders. Consistent with our closed-loop policies framework, we assume that the order uk is chosen with knowledge of the state xk. We are thus optimizing over ordering policies µ0, . . . , µN−1 that satisfy the constraint 0 ≤ µk(xk) ≤ B − xk for all xk and k.


Figure 1.3.2 Inventory control example. At stage k, the current stock (state) xk, the stock ordered (control) uk, and the demand (random disturbance) wk determine the cost r(xk) + cuk of stage k and the stock xk+1 = max(0, xk + uk − wk) at the next stage.

The DP algorithm starts with

J∗N (xN) = gN (xN).

At time N − 1, the algorithm computes the optimal cost-to-go

$$J^*_{N-1}(x_{N-1}) = r(x_{N-1}) + \min_{0 \le u_{N-1} \le B - x_{N-1}} \Big[ c u_{N-1} + E_{w_{N-1}}\big\{ g_N\big(\max(0,\, x_{N-1} + u_{N-1} - w_{N-1})\big) \big\} \Big],$$

for all values of xN−1 in the interval [0, B].

Similarly, at stage k, the DP algorithm computes the optimal cost-to-go,

$$J^*_k(x_k) = r(x_k) + \min_{0 \le u_k \le B - x_k} \Big[ c u_k + E_{w_k}\big\{ J^*_{k+1}\big(\max(0,\, x_k + u_k - w_k)\big) \big\} \Big], \tag{1.18}$$

for all values of xk in the interval [0, B].

The value J∗0(x0) is the optimal expected cost when the initial stock at time 0 is x0. The optimal policy µ∗k(xk) is computed simultaneously with J∗k(xk) from the minimization in the right-hand side of Eq. (1.18).

We will now go through the numerical calculations of the DP algorithm for a special case of the problem. In particular, we assume that there is an upper bound of B = 2 units on the stock that can be stored, i.e., there is a constraint 0 ≤ xk + uk ≤ 2. The unit purchase cost of inventory is c = 1, and the holding/shortage cost is

$$r(x_k) = (x_k + u_k - w_k)^2,$$

implying a penalty both for excess inventory and for unmet demand at the end of the kth stage. Thus the cost per stage is

$$g_k(x_k, u_k, w_k) = (x_k + u_k - w_k)^2 + u_k.$$


The terminal cost is assumed to be 0 (unused stock at the end of the planning horizon is discarded):

$$g_N(x_N) = 0. \tag{1.19}$$

The horizon N is 3 stages, and the initial stock x0 is 0. The demand wk has the same probability distribution for all stages, given by

p(wk = 0) = 0.1, p(wk = 1) = 0.7, p(wk = 2) = 0.2.

Figure 1.3.3 System and DP results for Example 1.3.2. The transition probability diagrams for the different values of stock purchased (control) are shown. The numbers next to the arcs are the transition probabilities. The control u = 1 is not available at state 2 because of the limitation xk + uk ≤ 2. Similarly, the control u = 2 is not available at states 1 and 2. The results of the DP algorithm are given in the table.

           Stage 0                   Stage 1                   Stage 2
  Stock    Cost-to-go  Optimal       Cost-to-go  Optimal       Cost-to-go  Optimal
                       stock to                  stock to                  stock to
                       purchase                  purchase                  purchase
    0        3.7          1             2.5         1             1.3         1
    1        2.7          0             1.5         0             0.3         0
    2        2.818        0             1.68        0             1.1         0


The starting equation for the DP algorithm is

J3(x3) = 0,

since the terminal cost is 0 [cf. Eq. (1.19)]. The algorithm takes the form [cf. Eq. (1.15)]

$$J_k(x_k) = \min_{\substack{0 \le u_k \le 2 - x_k \\ u_k = 0,1,2}} E_{w_k}\Big\{ u_k + (x_k + u_k - w_k)^2 + J_{k+1}\big(\max(0,\, x_k + u_k - w_k)\big) \Big\},$$

where k = 0, 1, 2, and xk, uk, wk can take the values 0, 1, and 2.

Stage 2: We compute J2(x2) for each of the three possible states x2 = 0, 1, 2. We have

$$J_2(0) = \min_{u_2 = 0,1,2} E_{w_2}\big\{ u_2 + (u_2 - w_2)^2 \big\} = \min_{u_2 = 0,1,2} \big[ u_2 + 0.1 (u_2)^2 + 0.7 (u_2 - 1)^2 + 0.2 (u_2 - 2)^2 \big].$$

We calculate the expectation of the right side for each of the three possible values of u2:

u2 = 0 : E{·} = 0.7 · 1 + 0.2 · 4 = 1.5,

u2 = 1 : E{·} = 1 + 0.1 · 1 + 0.2 · 1 = 1.3,

u2 = 2 : E{·} = 2 + 0.1 · 4 + 0.7 · 1 = 3.1.

Hence we have, by selecting the minimizing u2,

J2(0) = 1.3, µ∗2(0) = 1.

For x2 = 1, we have

$$J_2(1) = \min_{u_2 = 0,1} E_{w_2}\big\{ u_2 + (1 + u_2 - w_2)^2 \big\} = \min_{u_2 = 0,1} \big[ u_2 + 0.1 (1 + u_2)^2 + 0.7 (u_2)^2 + 0.2 (u_2 - 1)^2 \big].$$

The expected value in the right side is

u2 = 0 : E{·} = 0.1 · 1 + 0.2 · 1 = 0.3,

u2 = 1 : E{·} = 1 + 0.1 · 4 + 0.7 · 1 = 2.1.

Hence J2(1) = 0.3, µ∗2(1) = 0.

For x2 = 2, the only admissible control is u2 = 0, so we have

$$J_2(2) = E_{w_2}\big\{ (2 - w_2)^2 \big\} = 0.1 \cdot 4 + 0.7 \cdot 1 = 1.1,$$

J2(2) = 1.1, µ∗2(2) = 0.

Stage 1: Again we compute J1(x1) for each of the three possible states x1 = 0, 1, 2, using the values J2(0), J2(1), J2(2) obtained in the previous stage. For x1 = 0, we have

$$J_1(0) = \min_{u_1 = 0,1,2} E_{w_1}\Big\{ u_1 + (u_1 - w_1)^2 + J_2\big(\max(0,\, u_1 - w_1)\big) \Big\},$$

u1 = 0 : E{·} = 0.1 · J2(0) + 0.7(1 + J2(0)) + 0.2(4 + J2(0)) = 2.8,

u1 = 1 : E{·} = 1 + 0.1(1 + J2(1)) + 0.7 · J2(0) + 0.2(1 + J2(0)) = 2.5,

u1 = 2 : E{·} = 2 + 0.1(4 + J2(2)) + 0.7(1 + J2(1)) + 0.2 · J2(0) = 3.68,

J1(0) = 2.5, µ∗1(0) = 1.

For x1 = 1, we have

$$J_1(1) = \min_{u_1 = 0,1} E_{w_1}\Big\{ u_1 + (1 + u_1 - w_1)^2 + J_2\big(\max(0,\, 1 + u_1 - w_1)\big) \Big\},$$

u1 = 0 : E{·} = 0.1(1 + J2(1)) + 0.7 · J2(0) + 0.2(1 + J2(0)) = 1.5,

u1 = 1 : E{·} = 1 + 0.1(4 + J2(2)) + 0.7(1 + J2(1)) + 0.2 · J2(0) = 2.68,

J1(1) = 1.5, µ∗1(1) = 0.

For x1 = 2, the only admissible control is u1 = 0, so we have

$$J_1(2) = E_{w_1}\Big\{ (2 - w_1)^2 + J_2\big(\max(0,\, 2 - w_1)\big) \Big\} = 0.1\big(4 + J_2(2)\big) + 0.7\big(1 + J_2(1)\big) + 0.2 \cdot J_2(0) = 1.68,$$

J1(2) = 1.68, µ∗1(2) = 0.

Stage 0: Here we need to compute only J0(0) since the initial state is known to be 0. We have

$$J_0(0) = \min_{u_0 = 0,1,2} E_{w_0}\Big\{ u_0 + (u_0 - w_0)^2 + J_1\big(\max(0,\, u_0 - w_0)\big) \Big\},$$

u0 = 0 : E{·} = 0.1 · J1(0) + 0.7(1 + J1(0)) + 0.2(4 + J1(0)) = 4.0,

u0 = 1 : E{·} = 1 + 0.1(1 + J1(1)) + 0.7 · J1(0) + 0.2(1 + J1(0)) = 3.7,

u0 = 2 : E{·} = 2 + 0.1(4 + J1(2)) + 0.7(1 + J1(1)) + 0.2 · J1(0) = 4.818,

J0(0) = 3.7, µ∗0(0) = 1.

If the initial state were not known a priori, we would have to compute in a similar manner J0(1) and J0(2), as well as the minimizing u0. The reader may verify (Exercise 1.1) that these calculations yield

J0(1) = 2.7, µ∗0(1) = 0,

J0(2) = 2.818, µ∗0(2) = 0.

Thus the optimal ordering policy for each stage is to order one unit if the current stock is zero and order nothing otherwise. The results of the DP algorithm are given in lookup table form in Fig. 1.3.3.
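The numbers above are easy to reproduce with a short script that transcribes Eqs. (1.18)-(1.19) for this instance (B = 2, c = 1, N = 3, and the given demand distribution); up to floating point roundoff it returns the lookup table of Fig. 1.3.3.

```python
B, N = 2, 3
p_w = {0: 0.1, 1: 0.7, 2: 0.2}           # demand distribution, same for all stages

J = [dict() for _ in range(N + 1)]
mu = [dict() for _ in range(N)]
for x in range(B + 1):
    J[N][x] = 0.0                        # terminal cost g_N(x_N) = 0

for k in reversed(range(N)):
    for x in range(B + 1):
        best_u, best_val = None, float("inf")
        for u in range(B - x + 1):       # control constraint 0 <= u <= B - x
            val = sum(p * (u + (x + u - w)**2 + J[k + 1][max(0, x + u - w)])
                      for w, p in p_w.items())
            if val < best_val:
                best_u, best_val = u, val
        J[k][x], mu[k][x] = best_val, best_u

print(J[0], mu[0])   # approximately {0: 3.7, 1: 2.7, 2: 2.818} and {0: 1, 1: 0, 2: 0}
```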


1.3.2 Approximation in Value Space for Stochastic DP

Generally the computation of the optimal cost-to-go functions J*k can be very time-consuming or impossible. One of the principal RL methods to deal with this difficulty is approximation in value space. Here approximations Jk are used in place of J*k, similar to the deterministic case; cf. Eqs. (1.8) and (1.11).

Approximation in Value Space - Use of Jk in Place of J*k

At any state xk encountered at stage k, set

$$\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}. \tag{1.20}$$

The motivation for approximation in value space for stochastic DP problems is similar to the one for deterministic problems. The one-step lookahead minimization (1.20) needs to be performed only for the N states x0, . . . , xN−1 that are encountered during the on-line control of the system, and not for every state within the potentially enormous state space.

Our discussion of rollout (cf. Section 1.1.3) also applies to stochastic problems: we select Jk to be the cost function of a suitable base policy (perhaps with some approximation). Note that any policy can be used on-line as base policy, including policies obtained by a sophisticated off-line procedure, using for example neural networks and training data.† The rollout algorithm has the cost improvement property, whereby it yields an improved cost relative to its underlying base policy.

A major variant of rollout is truncated rollout, which combines the use of one-step optimization, simulation of the base policy for a certain number of steps m, and then adds an approximate cost Jk+m+1(xk+m+1) to the cost of the simulation, which depends on the state xk+m+1 obtained at the end of the rollout (see Chapter 2). Note that if one foregoes the use of a base policy (i.e., m = 0), one recovers as a special case the general approximation in value space scheme (1.20); see Fig. 1.3.4. Note also that multistep lookahead versions of truncated rollout are possible. They will be discussed in subsequent chapters.

† The principal role of neural networks within the context of this book is to provide the means for approximating various target functions from input-output data. This includes cost functions and Q-factors of given policies, and optimal cost-to-go functions and Q-factors; in this case the neural network is referred to as a value network (sometimes the alternative term critic network is also used). In other cases the neural network represents a policy viewed as a function from state to control, in which case it is called a policy network (the alternative term actor network is also used). The training methods for constructing the cost function, Q-factor, and policy approximations themselves from data are mostly based on optimization and regression, and will be discussed in Chapter 4.


Figure 1.3.4 Schematic illustration of truncated rollout. One-step lookahead is followed by simulation of the base policy for m steps, and an approximate cost Jk+m+1(xk+m+1) is added to the cost of the simulation, which depends on the state xk+m+1 obtained at the end of the rollout. If the base policy simulation is omitted (i.e., m = 0), one recovers the general approximation in value space scheme (1.20). There are also multistep lookahead versions of truncated rollout (see Chapter 2).

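As a rough sketch, truncated rollout at a state xk can be organized as below. The callables controls, f, g, base_policy, J_terminal, and sample_w are hypothetical interfaces (sample_w draws a disturbance value), and the Q-factor of each control is estimated by averaging n_sim Monte Carlo trajectories; this is only one of many possible ways to arrange the computation.

```python
def truncated_rollout_control(x, k, m, controls, f, g, base_policy,
                              J_terminal, sample_w, n_sim=100):
    """One-step lookahead with m-step base-policy simulation and terminal cost."""
    best_u, best_q = None, float("inf")
    for u in controls(x, k):
        total = 0.0
        for _ in range(n_sim):                      # Monte Carlo trajectories
            w = sample_w(x, k)
            cost = g(x, u, w, k)                    # first stage cost
            xs = f(x, u, w, k)
            for j in range(k + 1, k + 1 + m):       # simulate the base policy
                uj, wj = base_policy(xs, j), sample_w(xs, j)
                cost += g(xs, uj, wj, j)
                xs = f(xs, uj, wj, j)
            cost += J_terminal(xs, k + 1 + m)       # cost approximation at the end
            total += cost
        q = total / n_sim                           # Monte Carlo Q-factor estimate
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```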

We may also simplify the lookahead minimization over uk ∈ Uk(xk) [cf. Eq. (1.15)]. In particular, in the multiagent case where the control consists of multiple components, $u_k = (u^1_k, \ldots, u^m_k)$, a sequence of m single component minimizations can be used instead, with potentially enormous computational savings resulting.

There is one additional issue in approximation in value space for stochastic problems: the computation of the expected value in Eq. (1.20) may be very time-consuming. Then one may consider approximations in the computation of this expected value, based for example on Monte Carlo simulation or other schemes. Some of the possibilities along this line will be discussed in the next chapter.

Figure 1.3.5 illustrates the three approximations involved in approximation in value space for stochastic problems: cost-to-go approximation, expected value approximation, and simplified minimization. They may be designed largely independently of each other, and with a variety of methods. Most of the discussion in this book will revolve around different ways to organize these three approximations, in the context of finite and infinite horizon problems, and some of their special cases.

Figure 1.3.5 Schematic illustration of approximation in value space for stochastic problems, and the three approximations involved in its design. Typically the approximations can be designed independently of each other. There are also multistep lookahead versions of approximation in value space (see Chapter 2).

Q-Factors for Stochastic Problems

We can define optimal Q-factors for a stochastic problem, similar to the case of deterministic problems [cf. Eq. (1.12)], as the expressions that are minimized in the right-hand side of the stochastic DP equation (1.15). They are given by

$$Q^*_k(x_k, u_k) = E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + J^*_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\}. \tag{1.21}$$

The optimal cost-to-go functions J*k can be recovered from the optimal Q-factors Q*k by means of

$$J^*_k(x_k) = \min_{u_k \in U_k(x_k)} Q^*_k(x_k, u_k),$$

and the DP algorithm can be written in terms of Q-factors as

$$Q^*_k(x_k, u_k) = E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + \min_{u_{k+1} \in U_{k+1}(f_k(x_k, u_k, w_k))} Q^*_{k+1}\big(f_k(x_k, u_k, w_k),\, u_{k+1}\big) \Big\}.$$

Similar to the deterministic case, Q-learning involves the calculation of either the optimal Q-factors (1.21) or approximations Qk(xk, uk). The approximate Q-factors may be obtained using approximation in value space schemes, and can be used to obtain approximately optimal policies through the Q-factor minimization

$$\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k). \tag{1.22}$$

In Chapter 4, we will discuss the use of neural networks in such approximations.


Cost Versus Q-Factor Approximations - Robustness and On-Line Replanning

We have seen that it is possible to implement approximation in value space by using cost function approximations [cf. Eq. (1.20)] or by using Q-factor approximations [cf. Eq. (1.22)], so the question arises which one to use in a given practical situation. One important consideration is the facility of obtaining suitable cost or Q-factor approximations. This depends largely on the problem and also on the availability of data on which the approximations can be based. However, there are some other major considerations.

In particular, the cost function approximation scheme

$$\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},$$

has an important disadvantage: the expected value above needs to be computed on-line for all uk ∈ Uk(xk), and this may involve substantial computation. On the other hand it also has an important advantage in situations where the system function fk, the cost per stage gk, or the control constraint set Uk(xk) can change as the system is operating. Assuming that the new fk, gk, or Uk(xk) become known to the controller at time k, on-line replanning may be used, and this may improve substantially the robustness of the approximation in value space scheme, as discussed earlier for deterministic problems.

By comparison, the Q-factor function approximation scheme (1.22) does not allow for on-line replanning. On the other hand, for problems where there is no need for on-line replanning, the Q-factor approximation scheme does not require the on-line computation of expected values and may allow a much faster on-line computation of the minimizing control µk(xk) via Eq. (1.22).

1.3.3 Infinite Horizon Problems - An Overview

We will now provide an outline of infinite horizon stochastic DP with an emphasis on its aspects that relate to our RL/approximation methods. In Chapter 5 we will deal with two types of infinite horizon stochastic problems, where we aim to minimize the total cost over an infinite number of stages, given by

$$J_\pi(x_0) = \lim_{N \to \infty} \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}}\left\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \right\}; \tag{1.23}$$

see Fig. 1.3.6. Here, Jπ(x0) denotes the cost associated with an initial state x0 and a policy π = {µ0, µ1, . . .}, and α is a scalar in the interval (0, 1].

When α is strictly less than 1, it has the meaning of a discount factor, and its effect is that future costs matter to us less than the same costs incurred at the present time. Among others, a discount factor guarantees that the limit defining Jπ(x0) exists and is finite (assuming that the range of values of the stage cost g is bounded). This is a nice mathematical property that makes discounted problems analytically and algorithmically tractable.


Figure 1.3.6 Illustration of an infinite horizon problem. The system and cost per stage are stationary, except for the use of a discount factor α. If α = 1, there is typically a special cost-free termination state that we aim to reach.


Thus, by definition, the infinite horizon cost of a policy is the limit of its finite horizon costs as the horizon tends to infinity. The two types of problems that we will focus on in Chapter 5 are:

(a) Stochastic shortest path problems (SSP for short). Here, α = 1 but there is a special cost-free termination state; once the system reaches that state it remains there at no further cost. In some types of problems, the termination state may represent a goal state that we are trying to reach at minimum cost, while in others it may be a state that we are trying to avoid for as long as possible. We will mostly assume a problem structure such that termination is inevitable under all policies. Thus the horizon is in effect finite, but its length is random and may be affected by the policy being used. A significantly more complicated type of SSP problems, which we will discuss selectively, arises when termination can be guaranteed only for a subset of policies, which includes all optimal policies. Some common types of SSP belong to this category, including deterministic shortest path problems that involve graphs with cycles.

(b) Discounted problems. Here, α < 1 and there need not be a termination state. However, we will see that a discounted problem with a finite number of states can be readily converted to an SSP problem. This can be done by introducing an artificial termination state to which the system moves with probability 1 − α at every state and stage, thus making termination inevitable. As a result, algorithms and analysis for SSP problems can be easily adapted to discounted problems. Moreover, a common line of analysis and algorithmic development may often be used for both SSP and discounted problems, as we will see in Chapter 5.

In our theoretical and algorithmic analysis of Chapter 5, we will focus on finite state problems. Much of our infinite horizon methodology applies in principle to problems with continuous spaces as well, but there are significant exceptions and mathematical complications (the abstract DP monograph [Ber18a] takes a closer look at such problems). Still we will use the infinite horizon methodology in continuous spaces settings with somewhat limited justification, as an extension of the finite-state methods that we will justify more rigorously.

Infinite Horizon Theory - Value Iteration

There are several analytical and computational issues regarding our infinite horizon problems. Many of them revolve around the relation between the optimal cost function J* of the infinite horizon problem and the optimal cost functions of the corresponding N-stage problems.

In particular, consider the undiscounted case (α = 1) and let JN(x) denote the optimal cost of the problem involving N stages, initial state x, cost per stage g(x, u, w), and zero terminal cost. This cost is generated after N iterations of the DP algorithm

$$J_{k+1}(x) = \min_{u \in U(x)} E_w\Big\{ g(x, u, w) + J_k\big(f(x, u, w)\big) \Big\}, \qquad k = 0, 1, \ldots, \tag{1.24}$$

starting from J0(x) ≡ 0.† The algorithm (1.24) is known as the value iteration algorithm (VI for short). Since the infinite horizon cost of a given policy is, by definition, the limit of the corresponding N-stage costs as N → ∞, it is natural to speculate that:

(1) The optimal infinite horizon cost is the limit of the corresponding N-stage optimal costs as N → ∞; i.e.,

$$J^*(x) = \lim_{N \to \infty} J_N(x) \tag{1.25}$$

for all states x.

(2) The following equation should hold for all states x,

$$J^*(x) = \min_{u \in U(x)} E_w\Big\{ g(x, u, w) + J^*\big(f(x, u, w)\big) \Big\}. \tag{1.26}$$

† This is just the finite horizon DP algorithm, except that we have reversed the time indexing to suit our infinite horizon context. In particular, consider the N-stages problem and let VN−k(x) be the optimal cost-to-go starting at x with k stages to go, and with terminal cost equal to 0. Applying DP, we have for all x,

$$V_{N-k}(x) = \min_{u \in U(x)} E_w\Big\{ \alpha^{N-k} g(x, u, w) + V_{N-k+1}\big(f(x, u, w)\big) \Big\}, \qquad V_N(x) = 0.$$

By defining Jk(x) = VN−k(x)/αN−k, we obtain the VI algorithm (1.24).


This is obtained by taking the limit as N → ∞ in the VI algorithm (1.24) using Eq. (1.25). The preceding equation, called Bellman's equation, is really a system of equations (one equation per state x), which has as solution the optimal costs-to-go of all the states.

(3) If µ(x) attains the minimum in the right-hand side of the Bellman equation (1.26) for each x, then the policy {µ, µ, . . .} should be optimal. This type of policy is called stationary. Intuitively, optimal policies can be found within this class of policies, since optimization of the future costs (the tail subproblem) when starting at a given state looks the same regardless of the time when we start.

All three of the preceding results hold for finite-state SSP problems under reasonable assumptions. They also hold for discounted problems in suitably modified form that incorporates the discount factor, provided the cost per stage function g is bounded over the set of possible values of (x, u, w) (see [Ber12], Chapter 1). In particular, the discounted version of the VI algorithm of Eq. (1.24) is

$$J_{k+1}(x) = \min_{u \in U(x)} E_w\Big\{ g(x, u, w) + \alpha J_k\big(f(x, u, w)\big) \Big\}, \qquad k = 0, 1, \ldots, \tag{1.27}$$

while the discounted version of Bellman's equation [cf. Eq. (1.26)] is

$$J^*(x) = \min_{u \in U(x)} E_w\Big\{ g(x, u, w) + \alpha J^*\big(f(x, u, w)\big) \Big\}. \tag{1.28}$$

The VI algorithm of Eq. (1.27) simply expresses the fact that the optimal cost for k + 1 stages is obtained by minimizing the sum of the first stage cost and the optimal cost for the next k stages starting from the next state f(x, u, w). The latter cost, however, should be discounted by α in view of the cost definition (1.23). The intuitive interpretation of the Bellman equation (1.28) is that it is the limit as k → ∞ of the VI algorithm (1.27) assuming that Jk → J*.

The VI algorithm is also often valid, in the sense that Jk → J*, even if the initial function J0 is nonzero. The motivation for a different choice of J0 is faster convergence to J*; generally the convergence is faster as J0 is chosen closer to J*.
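For a finite-state discounted problem described in terms of transition probabilities pxy(u) and expected stage costs, the VI algorithm (1.27) can be sketched as below; p, gbar, states, and controls are hypothetical data structures (not from the text), and the iteration is stopped when successive iterates are within a small tolerance.

```python
def value_iteration(states, controls, p, gbar, alpha, tol=1e-8, max_iter=10_000):
    """VI for a finite-state discounted problem.
    p[x][u][y]: transition probability, gbar[x][u]: expected stage cost."""
    J = {x: 0.0 for x in states}                 # start from J_0 = 0
    for _ in range(max_iter):
        J_new = {}
        for x in states:
            J_new[x] = min(gbar[x][u] + alpha * sum(p[x][u][y] * J[y] for y in states)
                           for u in controls[x])
        if max(abs(J_new[x] - J[x]) for x in states) < tol:
            return J_new                         # approximate fixed point of Bellman's equation
        J = J_new
    return J
```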

We will now provide two examples that illustrate how the infinite horizon DP theory can be used for an analytical solution of the problem. The first example involves continuous state and control spaces and is based on the linear quadratic formulation (cf. Example 1.3.1). The second example involves discrete state and control spaces and is of the SSP type. These examples involve a favorable special structure, which allows an analytical solution. Such problems are rare. Most practical DP problems require a computational solution.


Example 1.3.3 (Linear Quadratic Optimal Control)

Consider the infinite horizon version of the linear quadratic problem of Example 1.3.1, assuming a discount factor α < 1. Thus the cost function is given by

$$\lim_{N \to \infty} \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}}\left\{ \sum_{k=0}^{N-1} \alpha^k (x_k^2 + r u_k^2) \right\}, \tag{1.29}$$

and the VI algorithm (1.27) has the form

$$J_{k+1}(x) = \min_u E_w\big\{ x^2 + r u^2 + \alpha J_k(x + bu + w) \big\}, \qquad k = 0, 1, \ldots. \tag{1.30}$$

Since the VI iterates are cost functions of linear quadratic finite horizon problems, it follows from the analysis of the finite horizon case of Example 1.3.1 that Jk is a convex quadratic in the state plus a constant. A simple modification of the calculations of that example (to incorporate the effect of the discount factor) yields the VI iterates as

$$J_{k+1}(x) = K_k x^2 + \sigma^2 \sum_{m=0}^{k-1} \alpha^{k-m} K_m, \qquad k = 0, 1, \ldots, \tag{1.31}$$

starting from the initial condition J0(x) = 0, where the sequence {Kk} is generated by the equation

$$K_{k+1} = \frac{\alpha r K_k}{r + \alpha b^2 K_k} + 1, \qquad k = 0, 1, \ldots, \tag{1.32}$$

starting from the initial condition K0 = 1 (Kk corresponds to PN−k in Example 1.3.1 because of the reversal of time indexing).

It can be verified that the sequence {Kk} is convergent to the scalar K∗, which is the unique positive solution of the equation

$$K = \frac{\alpha r K}{r + \alpha b^2 K} + 1, \tag{1.33}$$

the limit as k → ∞ of Eq. (1.32) (a graphical explanation is given in Fig. 1.3.7). Equation (1.33) is a one-dimensional special case of the famous discrete-time Riccati equation from control theory. Moreover, the VI algorithm (1.31) converges to the optimal cost function, which is the quadratic function

$$J^*(x) = K^* x^2 + \frac{\alpha}{1 - \alpha} K^* \sigma^2; \tag{1.34}$$

cf. Eq. (1.31). It can be verified by straightforward calculation that J∗ satisfies Bellman's equation,

$$J^*(x) = \min_u E_w\big\{ x^2 + r u^2 + \alpha J^*(x + bu + w) \big\}.$$


Figure 1.3.7 Graphical construction of the two solutions of Bellman's equation, and proof of the convergence of the VI algorithm (1.31)-(1.32) for the linear quadratic Example 1.3.3, starting from a quadratic initial function. The sequence {Kk} generated by the VI algorithm is given by Kk+1 = F(Kk), where

$$F(K) = \frac{\alpha r K}{r + \alpha b^2 K} + 1.$$

Because F is concave and monotonically increasing in the interval (−r/αb², ∞), as shown in the figure, the equation K = F(K) has one positive solution K∗ and one negative solution K̄. The iteration Kk+1 = F(Kk) [cf. Eq. (1.32)] converges to K∗ starting from anywhere in the interval (K̄, ∞) as shown in the figure. This iteration is essentially equivalent to the VI algorithm with a quadratic starting function.

The optimal policy is obtained by the preceding minimization:

$$\mu^*(x) \in \arg\min_u E_w\big\{ x^2 + r u^2 + \alpha K^* (x + bu + w)^2 \big\} + \frac{\alpha^2}{1 - \alpha} K^* \sigma^2.$$

By setting the derivative of the minimized expression to 0, it can be seen that µ∗ is linear and stationary of the form

$$\mu^*(x) = Lx, \tag{1.35}$$

where

$$L = -\frac{\alpha b K^*}{r + \alpha b^2 K^*}. \tag{1.36}$$

This is the limit of the nonstationary optimal policy derived in Example 1.3.1 for the finite horizon version of the problem.
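Numerically, K∗ and the gain L are obtained by simply iterating Eq. (1.32) to convergence and then applying Eq. (1.36), e.g., as in the sketch below; the numerical values of r, b, and α in the usage line are arbitrary illustrative choices.

```python
def lq_infinite_horizon(r, b, alpha, tol=1e-12):
    """Iterate the Riccati recursion (1.32) to its positive fixed point K*
    and return K* together with the optimal gain L of Eq. (1.36)."""
    K = 1.0                                                      # K_0 = 1
    while True:
        K_next = alpha * r * K / (r + alpha * b**2 * K) + 1.0    # Eq. (1.32)
        if abs(K_next - K) < tol:
            break
        K = K_next
    L = -alpha * b * K / (r + alpha * b**2 * K)                  # Eq. (1.36)
    return K, L

K_star, L = lq_infinite_horizon(r=0.5, b=1.0, alpha=0.9)
```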

The preceding derivation can be extended to general multidimensional linear quadratic infinite horizon discounted problems, and can be established by a rigorous analysis (see [Ber12], Chapter 4). The linearity and simplicity of the optimal policy (1.35) is noteworthy, and a major reason for the popularity of the linear quadratic formulation.


Figure 1.3.8 Illustration of transition probabilities of the spider and fly problem. The state is the distance between spider and fly, and cannot increase, except when the spider and fly are one unit apart. There are n + 1 states, with state 0 being the termination state.

The undiscounted version (α = 1) of the infinite horizon linear quadratic problem is also of interest. A mathematical complication here is that the optimal cost function (1.34) becomes infinite as α → 1, when σ² > 0. On the other hand, for a deterministic problem (wk ≡ 0 or equivalently σ² = 0) the informal derivation given above goes through, and the optimal controller is still linear of the form (1.35), where now L is the limit as α → 1 of the discounted version (1.36):

$$L = -\frac{b K^*}{r + b^2 K^*}, \tag{1.37}$$

where K∗ is the positive solution of the Riccati equation

$$K = \frac{r K}{r + b^2 K} + 1. \tag{1.38}$$

(This quadratic equation in K has two solutions, one positive and one negative, when r > 0; see Fig. 1.3.7.)

The linear policy µ∗(x) = Lx, with the coefficient L given by Eq. (1.37), is also optimal for a stochastic linear quadratic optimal control problem with the average cost criterion, which is not discussed in this book, but is covered elsewhere in the literature (see e.g., [Ber12], Section 5.6.5, and the references quoted there).

Example 1.3.4 (Spider and Fly)

A spider and a fly move along a straight line at times k = 0, 1, . . . The initial positions of the fly and the spider are integer. At each stage, the fly moves one unit to the left with probability p, one unit to the right with probability p, and stays where it is with probability 1 − 2p. The spider knows the position of the fly at the beginning of each stage, and will always move one unit towards the fly if its distance from the fly is more than one unit. If the spider is one unit away from the fly, it will either move one unit towards the fly or stay where it is. If the spider and the fly land in the same position at the end of a stage, then the spider captures the fly and the process terminates. The spider's objective is to capture the fly in minimum expected time; see Fig. 1.3.8.

We view as state the distance between spider and fly. Then the problem can be formulated as an SSP problem with states 0, 1, . . . , n, where n is the initial distance. State 0 is a termination state where the spider captures the fly. Let us denote by p1y(m) and p1y(m̄) the transition probabilities from state 1 to state y under the two spider choices/controls of moving and not moving, respectively, and let us denote by pxy the transition probabilities from a state x ≥ 2.† We have

pxx = p,   px(x−1) = 1 − 2p,   px(x−2) = p,   x ≥ 2,

p11(m) = 2p,   p10(m) = 1 − 2p,

p12(m̄) = p,   p11(m̄) = 1 − 2p,   p10(m̄) = p,

with all other transition probabilities being 0.

For states x ≥ 2, Bellman's equation is written as

J∗(x) = 1 + pJ∗(x) + (1− 2p)J∗(x− 1) + pJ∗(x− 2), x ≥ 2, (1.39)

where for the termination state 0, we have

J∗(0) = 0.

The only state where the spider has a choice is when it is one unit away from the fly, and for that state Bellman's equation is given by

J∗(1) = 1 + min[ 2pJ∗(1), pJ∗(2) + (1 − 2p)J∗(1) ],     (1.40)

where the first and the second expression within the bracket above are associated with the spider moving and not moving, respectively. By writing Eq. (1.39) for x = 2, we obtain

J∗(2) = 1 + pJ∗(2) + (1− 2p)J∗(1),

from which

J∗(2) = 1/(1 − p) + (1 − 2p)J∗(1)/(1 − p).     (1.41)

Substituting this expression in Eq. (1.40), we obtain

J∗(1) = 1 + min[ 2pJ∗(1), p/(1 − p) + p(1 − 2p)J∗(1)/(1 − p) + (1 − 2p)J∗(1) ],

† Transition probabilities are a common way to describe stochastic finite-state systems, and will be used in Chapter 5. In particular, given the transition probabilities pxy(u), i.e., the conditional probabilities of moving from state x to state y under control u, an equivalent system equation description is xk+1 = wk, where the disturbance wk is generated according to the conditional distribution

p(wk = y | xk = x, uk = u) = pxy(u),

for all possible values of (x, y, u).


or equivalently,

J∗(1) = 1 + min[ 2pJ∗(1), p/(1 − p) + (1 − 2p)J∗(1)/(1 − p) ].     (1.42)

Thus Bellman's equation at state 1 can be reduced to the one-dimensional Eq. (1.42), and once this equation is solved it can be used to solve the Bellman Eq. (1.39) for all x. To solve Eq. (1.42), we consider the two cases where the first expression within the bracket is smaller than the second, and where it is larger. Thus we solve for J∗(1) in the two cases where

J∗(1) = 1 + 2pJ∗(1), (1.43)

2pJ∗(1) ≤ p/(1 − p) + (1 − 2p)J∗(1)/(1 − p),     (1.44)

and

J∗(1) = 1 + p/(1 − p) + (1 − 2p)J∗(1)/(1 − p),     (1.45)

2pJ∗(1) ≥ p/(1 − p) + (1 − 2p)J∗(1)/(1 − p).     (1.46)

The solution of Eq. (1.43) is seen to be

J∗(1) = 1/(1 − 2p),

and by substitution in Eq. (1.44), we find that this solution is valid when

2p/(1 − 2p) ≤ p/(1 − p) + 1/(1 − p),

or equivalently (after some calculation), p ≤ 1/3. Thus for p ≤ 1/3, it is optimal for the spider to move when it is one unit away from the fly.

Similarly, the solution of Eq. (1.45) is seen to be

J∗(1) = 1/p,

and by substitution in Eq. (1.46), we find that this solution is valid when

2 ≥ p/(1 − p) + (1 − 2p)/(p(1 − p)),

or equivalently (after some calculation), p ≥ 1/3. Thus, for p ≥ 1/3 it is optimal for the spider not to move when it is one unit away from the fly.

The minimal expected number of steps for capture when the spider is one unit away from the fly is therefore

J∗(1) = 1/(1 − 2p) if p ≤ 1/3,   and   J∗(1) = 1/p if p ≥ 1/3.


Given the value of J∗(1), we can calculate from Eq. (1.41) the minimal expected number of steps for capture when two units away, J∗(2), and we can then obtain the remaining values J∗(i), i = 3, . . . , n, from Eq. (1.39).

The problem can also be solved by using the VI algorithm, which takes the form

Jk+1(0) = 0,   Jk+1(1) = 1 + min[ 2pJk(1), pJk(2) + (1 − 2p)Jk(1) ],

Jk+1(x) = 1 + pJk(x) + (1 − 2p)Jk(x − 1) + pJk(x − 2),   x ≥ 2,

where the initial conditions are J0(0) = 0 and J0(x) are arbitrary scalars for x ≥ 1. The algorithm is iterative and can be shown to converge to the optimal costs J∗(x) as k → ∞.
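The VI recursion just described is easy to implement. The following is a minimal sketch (the values of n and p are illustrative assumptions) that iterates the VI algorithm above and reproduces the threshold behavior at state 1: for p ≤ 1/3 moving is optimal, and for p ≥ 1/3 staying is optimal.

# Minimal sketch of the VI algorithm for the spider-and-fly problem.
# States are distances 0, 1, ..., n; state 0 is the termination state.
# The values of n and p below are illustrative assumptions.

def spider_fly_vi(n=10, p=0.25, num_iter=2000):
    J = [0.0] * (n + 1)          # arbitrary start, with J[0] = 0
    for _ in range(num_iter):
        J_new = [0.0] * (n + 1)
        # State 1: min over moving (2p*J(1)) and not moving (p*J(2) + (1-2p)*J(1))
        J_new[1] = 1.0 + min(2 * p * J[1], p * J[2] + (1 - 2 * p) * J[1])
        # States x >= 2: the spider always moves one unit toward the fly
        for x in range(2, n + 1):
            J_new[x] = 1.0 + p * J[x] + (1 - 2 * p) * J[x - 1] + p * J[x - 2]
        J = J_new
    return J

if __name__ == "__main__":
    print(spider_fly_vi(n=10, p=0.25)[1])   # approx. 1/(1 - 2p) = 2.0, since p <= 1/3
    print(spider_fly_vi(n=10, p=0.4)[1])    # approx. 1/p = 2.5, since p >= 1/3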

Despite the simplicity of its context, the spider-and-fly problem can become quite difficult even with seemingly minor modifications. For example, if the fly is "faster" than the spider, and can move up to two units to the left or right with positive probabilities, the distance between spider and fly may become arbitrarily large with positive probability, and the number of states becomes infinite. Then, not only can the Bellman equation not be solved analytically, but the standard theory of SSP problems also does not apply (a more complex extension of the theory that applies to infinite state space problems is needed; see [Ber18a], Chapter 4). As another example, suppose that the spider has "blurry" vision and cannot see the exact location of the fly. Then we may formulate the problem as one where the spider's choice depends on all of its past observations. We will consider problems of this type later, and we will see that they are far more difficult, both analytically and computationally, than the problems we have discussed so far.

1.3.4 Infinite Horizon - Policy Iteration and Rollout

Another major class of infinite horizon algorithms is based on policy iteration (PI for short), which involves the repeated use of policy improvement and rollout (starting from some policy and generating an improved policy). Figure 1.3.9 describes the method and indicates that each of its iterations consists of two phases:

(a) Policy evaluation, which computes the cost function Jµ. One possibility is to solve the corresponding Bellman equation

Jµ(x) = E_w { g(x, µ(x), w) + αJµ( f(x, µ(x), w) ) }.     (1.47)

However, the value Jµ(x) for any x can also be computed by Monte Carlo simulation, by averaging over many randomly generated trajectories the cost of the policy starting from x. Other, more sophisticated possibilities include the use of specialized simulation-based methods, such as temporal difference methods, for which there is extensive literature (see e.g., the books [BeT96], [SuB98], [Ber12]).


Figure 1.3.9 Schematic illustration of PI as repeated rollout. It generates a sequence of policies, with each policy µ in the sequence being the base policy that generates the next policy µ̃ in the sequence as the corresponding rollout policy; cf. Eq. (1.48).

(b) Policy improvement, which computes the improved rollout policy µ̃ using the one-step lookahead minimization

µ̃(x) ∈ arg min_{u∈U(x)} E_w { g(x, u, w) + αJµ( f(x, u, w) ) }.     (1.48)

We will discuss several different forms of PI in Chapter 5. We will argue there that PI is the DP concept that forms the foundation for self-learning in RL, i.e., learning from data that is self-generated (from the system itself as it operates) rather than from data supplied from an external source.
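As a small illustration of the two phases, here is a sketch (not from the text) for a finite-state, finite-control discounted problem described by transition probabilities pxy(u) and expected stage costs g(x, u), in the spirit of the footnote following the spider-and-fly example. Policy evaluation solves the linear Bellman equation of the base policy by simple iteration, and policy improvement performs a one-step lookahead minimization in the spirit of Eq. (1.48); repeating the two phases gives PI. All model data below are illustrative assumptions.

# Minimal sketch of policy iteration for a finite-state discounted MDP given
# by transition probabilities P[u][x][y] and expected stage costs g[x][u].
# The model arrays at the bottom are illustrative assumptions.

def evaluate_policy(mu, P, g, alpha, sweeps=2000):
    """Approximate J_mu by iterating the linear equation J = T_mu J."""
    n = len(g)
    J = [0.0] * n
    for _ in range(sweeps):
        J = [g[x][mu[x]] + alpha * sum(P[mu[x]][x][y] * J[y] for y in range(n))
             for x in range(n)]
    return J

def improve_policy(J, P, g, alpha):
    """One-step lookahead minimization (policy improvement / rollout step)."""
    n, m = len(g), len(g[0])
    return [min(range(m),
                key=lambda u: g[x][u] + alpha * sum(P[u][x][y] * J[y] for y in range(n)))
            for x in range(n)]

def policy_iteration(P, g, alpha, mu0, max_iters=100):
    mu = mu0
    for _ in range(max_iters):
        J_mu = evaluate_policy(mu, P, g, alpha)       # policy evaluation
        mu_new = improve_policy(J_mu, P, g, alpha)    # rollout policy for base policy mu
        if mu_new == mu:
            break
        mu = mu_new
    return mu, J_mu

# Illustrative 2-state, 2-control example (assumed data, not from the text)
P = [  # P[u][x][y]
    [[0.9, 0.1], [0.2, 0.8]],   # control u = 0
    [[0.5, 0.5], [0.7, 0.3]],   # control u = 1
]
g = [[2.0, 0.5], [1.0, 3.0]]    # g[x][u]
mu_star, J_star = policy_iteration(P, g, alpha=0.9, mu0=[0, 0])
print(mu_star, J_star)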

Policy Iteration and Newton’s Method - Bellman Operators

The convergence rate of PI (in its exact form) is very fast in both theory and practice. In particular, it converges in a finite number of iterations for finite-state discounted problems, as we will discuss in Chapter 5. It also converges quadratically (its iteration error is proportional to the square of the preceding iteration error) for linear quadratic problems like the one of the following example, and its multidimensional extensions (this latter fact has been shown by Kleinman [Kle68]).

An intuitive explanation is that PI can be viewed as a form of Newton's method for solving the Bellman equation

J(x) = min_{u∈U(x)} E_w { g(x, u, w) + αJ( f(x, u, w) ) }     (1.49)

[cf. Eq. (1.28)], in the function space of cost functions J(x). This requires some analysis, and we refer to the papers by Puterman and Brumelle [PuB78], [PuB79], who provide fast (quadratic) rate of convergence results.† By comparison, the convergence rate of VI is much slower: the distance of the iterates Jk from J∗ diminishes to zero at the rate of the geometric progression {const · α^k} (see Chapter 5), and can be quite slow when the discount factor α is near 1.

These convergence rate properties and the relation to Newton's method can also be inferred from geometric interpretations of VI, PI, and rollout, which are given in Fig. 1.3.10. This figure is intuitive, but uses abstract notation, which requires some explanation (we will use this notation extensively in Chapter 5, where we will discuss it further). In particular, we denote by TJ the function in the right-hand side of Bellman's Eq. (1.49). It has a value at state x given by

(TJ)(x) = min_{u∈U(x)} E_w { g(x, u, w) + αJ( f(x, u, w) ) }.     (1.50)

Also for each policy µ, we introduce the corresponding function TµJ, which has value at x given by

(TµJ)(x) = E_w { g(x, µ(x), w) + αJ( f(x, µ(x), w) ) }.     (1.51)

Thus T and Tµ can be viewed as operators (referred to as the Bellman operators), which map cost functions J of x to other cost functions (TJ or TµJ, respectively) of x.

† Newton’s method for solving a system of n nonlinear equations with nunknowns, compactly written as F (y) = 0, has the form

yk+1 = yk −(

∂F (yk)

∂y

)−1

F (yk),

where ∂F (yk)/∂y is the n×n Jacobian matrix of F evaluated at the n-dimensionalvector yk. The quadratic convergence property states that near the solution y∗,we have

‖yk+1 − y∗‖ = O(

‖yk − y∗‖2)

,

where ‖·‖ is the Euclidean norm. The Newton iteration can equivalently be statedas follows: given yk, we linearize F at yk (using a first order Taylor expansion),and set yk+1 to the solution of the linearized equation

F (yk) +∂F (yk)

∂y(y − yk) = 0,

which is a system of n linear equation with n unknowns, the components of the

vector y. This statement (suitably modified to account for lack of differentiability

of F ) applies also to problems where F has piecewise linear and either concave

or convex components, and is consistent with the geometric interpretation given

in Fig. 1.3.10.


Note that the Bellman equation J = TµJ is linear, while the function TJ can be written as min_µ TµJ and its components are concave. For example, suppose that there are two states 1 and 2, and two controls u and v. Consider the policy that applies control u at state 1 and control v at state 2. Then we have

(TµJ)(1) = Σ_{j=1}^{2} p1j(u)( g(1, u, j) + αJ(j) ),

(TµJ)(2) = Σ_{j=1}^{2} p2j(v)( g(2, v, j) + αJ(j) ),

where pij(u) and pij(v) are the probabilities that the next state will be j, when the current state is i, and the control is u or v, respectively. Clearly, (TµJ)(1) and (TµJ)(2) are linear functions of J.

Also the Bellman equation J = TJ takes the form

(TJ)(1) = min[ Σ_{j=1}^{2} p1j(u)( g(1, u, j) + αJ(j) ), Σ_{j=1}^{2} p1j(v)( g(1, v, j) + αJ(j) ) ],

(TJ)(2) = min[ Σ_{j=1}^{2} p2j(u)( g(2, u, j) + αJ(j) ), Σ_{j=1}^{2} p2j(v)( g(2, v, j) + αJ(j) ) ].

Clearly, (TJ)(1) and (TJ)(2) are concave and piecewise linear (with as many linear pieces as the number of controls).

Using the operators T and Tµ, we can write the VI algorithm in the compact form

Jk+1 = TJk,

and the PI algorithm in the compact form

Tµk+1 Jµk = T Jµk,

where {µk} is the generated policy sequence. The operator notation simplifies derivations and proofs related to the DP methodology, and facilitates its extensions to infinite horizon DP models beyond the ones that we have discussed in this section (see Chapter 5, the DP textbook [Ber12], and the abstract DP monograph [Ber18a]).
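To make the operator notation concrete, the following sketch (with assumed numerical data, not from the text) implements T and Tµ for a two-state, two-control example like the one above, and runs VI in the compact form Jk+1 = TJk.

# Minimal sketch of the Bellman operators T and T_mu for a two-state,
# two-control discounted problem, using assumed transition probabilities
# p[u][i][j] and stage costs gcost[u][i][j] (illustrative data).

alpha = 0.9
p = {  # p[u][i][j]: probability of next state j at state i under control u
    'u': [[0.75, 0.25], [0.25, 0.75]],
    'v': [[0.50, 0.50], [0.10, 0.90]],
}
gcost = {  # gcost[u][i][j]: stage cost at state i, control u, next state j
    'u': [[1.0, 2.0], [3.0, 0.5]],
    'v': [[0.5, 4.0], [2.0, 1.0]],
}

def T_mu(J, mu):
    """(T_mu J)(i), cf. Eq. (1.51): linear in J for a fixed policy mu."""
    return [sum(p[mu[i]][i][j] * (gcost[mu[i]][i][j] + alpha * J[j])
                for j in range(2)) for i in range(2)]

def T(J):
    """(T J)(i) = min over controls, cf. Eq. (1.50): concave and piecewise linear in J."""
    return [min(sum(p[u][i][j] * (gcost[u][i][j] + alpha * J[j]) for j in range(2))
                for u in ('u', 'v')) for i in range(2)]

# VI in the compact form J_{k+1} = T J_k, starting from J_0 = 0
J = [0.0, 0.0]
for _ in range(200):
    J = T(J)
print(J)                          # approximately the optimal cost J*
print(T_mu(J, mu=['u', 'v']))     # value of T_mu at J for one particular policy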

The fast convergence rate of PI explains to some extent the large cost improvements that are typically observed in practice with the rollout algorithm. In particular, rollout can be viewed as a single iteration of Newton's method applied to the solution of the Bellman equation. Thus, given the base policy µ and its cost function Jµ, the rollout algorithm first constructs a linearized version of Bellman's equation at Jµ (its linear approximation at Jµ), and then solves it to obtain Jµ̃. The linearized Bellman equation also yields the rollout policy µ̃ through the equation Tµ̃Jµ = TJµ. If the function TJ is nearly linear (has small "curvature"), the rollout policy performance Jµ̃(x) is very close to the optimal J∗(x), even if the base policy µ is far from optimal. This can be seen from Fig. 1.3.10, which illustrates the large cost reduction obtained by the rollout algorithm.

Example 1.3.5 (Policy Iteration and Rollout for Linear Quadratic Problems)

Consider the infinite horizon version of the linear quadratic problem discussed in Example 1.3.3, assuming a discount factor α < 1. We will derive the form of PI starting from a linear base policy of the form

µ0(x) = L0x,

where L0 is a scalar such that the closed loop system

xk+1 = (1 + bL0)xk + wk, (1.52)

is stable, i.e., |1 + bL0| < 1, or equivalently −2/b < L0 < 0. This is necessary for the policy µ0 to keep the state bounded and the corresponding costs Jµ0(x) finite. We will see that the PI algorithm generates a sequence of linear stable policies.

We will now describe the policy evaluation and policy improvement phases for the starting policy µ0. We can calculate Jµ0 by noting that it involves the uncontrolled closed loop system (1.52) and a quadratic cost function. Thus

Jµ0(x) = K0 x^2 + constant,     (1.53)

and to calculate K0 we will simply ignore the constant above. Equivalently, we will focus on the deterministic case (wk ≡ 0), so that under the policy µ0, we have

xk+1 = (1 + bL0)xk = (1 + bL0)^{k+1} x0,

and

Jµ0(x) = K0 x^2 = lim_{N→∞} Σ_{k=0}^{N−1} α^k (1 + rL0^2)(1 + bL0)^{2k} x^2.

Thus, by cancelling x^2 from both sides above, we can calculate K0 as

K0 = (1 + rL0^2)/(1 − α(1 + bL0)^2).     (1.54)


Figure 1.3.10 Geometric interpretation of PI and VI. The functions J, TJ, TµJ, etc., are multidimensional (they have as many scalar components as there are states), but they are shown projected onto one dimension. Each policy µ defines the linear function TµJ of J, given by Eq. (1.51), and TJ is the function given by Eq. (1.50), which can also be written as TJ = min_µ TµJ. Note that if there are finitely many policies, the components of TJ are piecewise linear. The optimal cost J∗ satisfies J∗ = TJ∗, so it is obtained from the intersection of the graph of TJ and the 45 degree line shown.

The VI sequence is indicated in the top figure by the staircase construction, which asymptotically leads to J∗. The bottom figure shows a policy iteration starting from a base policy µ. It computes Jµ by policy evaluation (by solving the linear equation J = TµJ as shown). It then performs a policy improvement using µ as the base policy to produce the rollout policy µ̃ as shown: the cost function of the rollout policy, Jµ̃, is obtained by solving the version of Bellman's equation that is linearized at the point Jµ, as in Newton's method. This process will be discussed further in Chapter 5.


In conclusion, the policy evaluation phase of PI for the starting linear policy µ0(x) = L0x yields Jµ0 in the form (1.53)-(1.54). The policy improvement phase involves the quadratic minimization

µ1(x) ∈ arg min_u [ x^2 + ru^2 + αK0(x + bu)^2 ],

and after a straightforward calculation (setting to 0 the derivative with respect to u) yields µ1 as the linear policy

µ1(x) = L1x,

where

L1 = − αbK0/(r + αb^2 K0).     (1.55)

It can also be verified that µ1 is a stable policy. An intuitive way to get a sense of this is via the cost improvement property of PI: we have Jµ1(x) ≤ Jµ0(x) for all x, so Jµ1(x) must be finite, which implies stability of µ1.

The preceding calculation can be continued, so the PI algorithm yields

the sequence of linear policies

µk(x) = Lkx, k = 0, 1, . . . , (1.56)

where Lk is generated by the iteration

Lk+1 = − αbKk/(r + αb^2 Kk),     (1.57)

[cf. Eq. (1.55)], with Kk given by

Kk = (1 + rLk^2)/(1 − α(1 + bLk)^2),     (1.58)

[cf. Eq. (1.54)]. The corresponding cost function sequence has the form Jµk(x) = Kk x^2 for the deterministic problem where wk ≡ 0, and

Jµk(x) = Kk x^2 + constant,

for the more general stochastic problem where wk has zero mean but nonzero variance. The cost function sequence can be shown to converge to the optimal cost function J∗, while the generated sequence of linear policies {µk}, where µk(x) = Lkx, converges to the optimal policy. The convergence rate of the sequence {Kk} has been shown to be quadratic [Kle68], consistently with the interpretation of PI as a form of Newton's method, which was noted earlier.
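The scalar recursions (1.57)-(1.58) can be run directly. Here is a minimal sketch (the values of b, r, α, and the stable starting gain L0 are illustrative assumptions) that carries out a few policy iterations and also prints the rollout-to-base cost ratio K1/K0 discussed below.

# Minimal sketch of PI for the scalar discounted linear quadratic problem,
# using the recursions (1.57)-(1.58). The values of b, r, alpha, L0 are
# illustrative assumptions; L0 must make the closed loop system stable.

def lq_policy_iteration(L0=-0.1, b=1.0, r=1.0, alpha=0.9, iterations=6):
    L = L0
    Ks = []
    for _ in range(iterations):
        K = (1 + r * L ** 2) / (1 - alpha * (1 + b * L) ** 2)   # Eq. (1.58)
        Ks.append(K)
        L = -alpha * b * K / (r + alpha * b ** 2 * K)           # Eq. (1.57)
    return Ks, L

Ks, L_final = lq_policy_iteration()
print(Ks)                          # K0, K1, ... converging (quadratically) to K*
print("K1/K0 =", Ks[1] / Ks[0])    # rollout-to-base policy cost ratio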

In the deterministic case where wk ≡ 0, PI is well defined even in the undiscounted case where α = 1. The generated policies are given by Eq. (1.56) with Lk given by

Lk+1 = − bKk/(r + b^2 Kk),

with

Kk = (1 + rLk^2)/(1 − (1 + bLk)^2),

[cf. Eqs. (1.57) and (1.58) for α = 1]. Their cost functions are quadratic of the form Jµk(x) = Kk x^2.

Note that rollout consists of a single iteration of PI, so given the base

policy µ0(x) = L0x, the rollout policy is µ1(x) = L1x. The cost improvement of the rollout policy over the base policy is quite substantial, particularly as r → 0. Figure 1.3.11 provides a plot of the cost improvement ratio

Jµ1(x)/Jµ0(x) = K1/K0

as a function of L0 when b = 1 and r = 1, and as a function of r when b = 1 and L0 = −0.1.

Our mathematical justification of the preceding analysis has been abbreviated, but it serves as a prototype for the analysis of more general multidimensional linear quadratic problems, and illustrates the PI process. Further discussion of the ideas underlying PI, including the policy improvement property, will be given in Chapter 5.

Approximate Policy Iteration

Both policy evaluation and policy improvement can be approximated, possibly using neural networks, and training data collected through simulation, as we will discuss in Chapters 4 and 5; see Fig. 1.3.12. Other approximations include truncated rollout for policy evaluation. Moreover, multistep lookahead may be used in place of one-step lookahead, and simplified minimization, based for example on multiagent rollout, may also be used; see Chapter 5.

Note that the methods for policy evaluation and for policy improvement can be designed largely independently of each other. Moreover, there is a large variety of exact and approximate methods for policy evaluation. In this book we will focus primarily on direct methods in the terminology of [Ber12] (Section 6.1.3); these are methods that evaluate Jµ using samples of pairs (x, Jµ(x)). We will not discuss indirect methods, which are based on the solution of projected equations that approximate Bellman's equation for Jµ. Such methods include temporal difference and aggregation methods; see the survey [Ber11b], and the textbook [Ber12], Chapters 6 and 7. These methods are popular and are supported by extensive theoretical analysis, but they are not central to our principal themes of rollout and distributed computation.

Figure 1.3.11 Illustration of the cost improvement property of rollout in the linear quadratic Example 1.3.5. The top figure shows the rollout-to-base policy cost improvement ratio

Jµ1(x)/Jµ0(x) = K1/K0

as a function of the coefficient L0 of the linear policy µ0(x) = L0x, with the values of b and r held fixed at b = 1 and r = 1. The bottom figure shows the cost improvement ratio K1/K0 as a function of r when b = 1 and L0 is fixed at the value L0 = −0.1.

Finally, it is important to keep in mind the practical implementation links of off-line and on-line PI-type algorithms. In particular, the final policy and policy evaluation obtained by off-line (exact or approximate) PI can be used as base policy for implementing on-line a truncated rollout policy that provides both on-line policy improvement and also allows for on-line replanning. This type of scheme is largely responsible for the success of the AlphaGo/AlphaZero programs, as well as Tesauro's 1996 backgammon program, which will be discussed in the next chapter. The base policies in both cases have been obtained via sophisticated neural network training, but play at levels below top humans. However, the one-step or multistep lookahead (in the case of AlphaZero) policies that are used for on-line play perform at a level much above the best humans. We refer to subsequent chapters for further discussion.


Figure 1.3.12 Schematic illustration of approximate PI. The policy evaluation and policy improvement phases are approximated with a value and (optionally) a policy network, respectively. These could be neural networks, which are trained with (state, cost function value) data that is generated using the current base policy µ, and with (state, rollout policy control) data that is generated using the rollout policy µ̃; see Chapters 4 and 5.

1.4 EXAMPLES, VARIATIONS, AND SIMPLIFICATIONS

In this section we provide a few examples that illustrate problem formulation techniques, exact and approximate solution methods, and adaptations of the basic DP algorithm to various contexts. We refer to DP textbooks for extensive additional discussions of modeling and problem formulation techniques (see e.g., the many examples that can be found in the author's DP and RL textbooks [Ber12], [Ber17], [Ber19a], as well as in the neuro-dynamic programming monograph [BeT96]).

An important fact to keep in mind is that there are many ways to model a given practical problem in terms of DP, and that there is no unique choice for state and control variables. This will be brought out by the examples in this section, and is facilitated by the generality of DP: its basic algorithmic principles apply for arbitrary state, control, and disturbance spaces, and system and cost functions.

1.4.1 A Few Words About Modeling

In practice, optimization problems seldom come neatly packaged as mathematical problems that can be solved by DP/RL or some other methodology. Generally, a practical problem is a prime candidate for a DP formulation if it involves multiple sequential decisions separated by the collection of information that can enhance the effectiveness of future decisions.

However, there are other types of problems that can be fruitfully formulated by DP. These include the entire class of deterministic problems, where there is no information to be collected: all the information needed in a deterministic problem is either known or can be predicted from the problem data that is available at time 0. Moreover, for deterministic problems there is a plethora of non-DP methods, such as linear, nonlinear, and integer programming, random and nonrandom search, discrete optimization heuristics, etc. Still, the use of RL methods for deterministic optimization is a major subject in this book, which will be discussed in Chapters 2 and 3. We will argue there that rollout, when suitably applied, can improve substantially on the performance of other heuristic or suboptimal methods, however derived. Moreover, we will see that often for discrete optimization problems the DP sequential structure is introduced artificially, with the aim to facilitate the use of approximate DP/RL methods.

There are also problems that fit quite well into the sequential structure of DP, but can be fruitfully addressed by RL methods that do not have a fundamental connection with DP. An important case in point is policy gradient and policy search methods, which will be considered only peripherally in this book. Here the policy is parametrized by a set of parameters, so that the cost of the policy becomes a function of these parameters, and can be optimized by non-DP methods such as gradient or random search-based suboptimal approaches. This generally relates to the approximation in policy space approach, which will be discussed further in Chapter 2. We also refer to Section 5.7 of the RL book [Ber19a] and to the end-of-chapter references.

As a guide for formulating optimal control problems in a manner that is suitable for a DP solution, the following two-stage process is suggested:

(a) Identify the controls/decisions uk and the times k at which these controls are applied. Usually this step is fairly straightforward. However, in some cases there may be some choices to make. For example, in deterministic problems, where the objective is to select an optimal sequence of controls {u0, . . . , uN−1}, one may lump multiple controls to be chosen together, e.g., view the pair (u0, u1) as a single choice. This is usually not possible in stochastic problems, where distinct decisions are differentiated by the information/feedback available when making them.

(b) Select the states xk. The basic guideline here is that xk should encompass all the information that is relevant for future optimization, i.e., the information that is known to the controller at time k and can be used with advantage in choosing uk. In effect, at time k the state xk should separate the past from the future, in the sense that anything that has happened in the past (states, controls, and disturbances from stages prior to stage k) is irrelevant to the choices of future controls as long as we know xk. Sometimes this is described by saying that the state should have a "Markov property" to express an analogy with states of Markov chains, where (by definition) the conditional probability distribution of future states depends on the past history of the chain only through the present state.


The control and state selection may also have to be refined or specialized in order to enhance the application of known results and algorithms. This includes the choice of a finite or an infinite horizon, and the availability of good base policies or heuristics in the context of rollout.

Note that there may be multiple possibilities for selecting the states, because information may be packaged in several different ways that are equally useful from the point of view of control. It may thus be worth considering alternative ways to choose the states; for example, try to use states that minimize the dimensionality of the state space. For a trivial example that illustrates the point, if a quantity xk qualifies as state, then (xk−1, xk) also qualifies as state, since (xk−1, xk) contains all the information contained within xk that can be useful to the controller when selecting uk. However, using (xk−1, xk) in place of xk gains nothing in terms of optimal cost while complicating the DP algorithm that would have to be executed over a larger space.

The concept of a sufficient statistic, which refers to a quantity that summarizes all the essential content of the information available to the controller, may be useful in providing alternative descriptions of the state space. An important paradigm is problems involving partial or imperfect state information, where xk evolves over time but is not fully accessible for measurement (for example, xk may be the position/velocity vector of a moving vehicle, but we may obtain measurements of just the position). If Ik is the collection of all measurements and controls up to time k (the information vector), it is correct to use Ik as state in a reformulated DP problem that involves perfect state observation. However, a better alternative may be to use as state the conditional probability distribution

Pk(xk | Ik),

called the belief state, which (as it turns out) subsumes all the information that is useful for the purposes of choosing a control. On the other hand, the belief state Pk(xk | Ik) is an infinite-dimensional object, whereas Ik may be finite dimensional, so the best choice may be problem-dependent. Still, in either case, the stochastic DP algorithm applies, with the sufficient statistic [whether Ik or Pk(xk | Ik)] playing the role of the state; see Section 1.4.4.

1.4.2 Problems with a Termination State

Many DP problems of interest involve a termination state, i.e., a state t that is cost-free and absorbing in the sense that for all k,

gk(t, uk, wk) = 0,   fk(t, uk, wk) = t,   for all wk and uk ∈ Uk(t).

Thus the control process essentially terminates upon reaching t, even if this happens before the end of the horizon. One may reach t by choice if a special stopping decision is available, or by means of a random transition from another state. Problems involving games, such as chess, Go, backgammon, and others involve a termination state (the end of the game) and have played an important role in the development of the RL methodology.†

Generally, when it is known that an optimal policy will reach the termination state with certainty within at most some given number of stages N, the DP problem can be formulated as an N-stage horizon problem, with a very large termination cost for the nontermination states.‡ The reason is that even if the termination state t is reached at a time k < N, we can extend our stay at t for an additional N − k stages at no additional cost, so the optimal policy will still be optimal, since it will not incur the large termination cost at the end of the horizon. An example is the multivehicle problem of Example 1.2.3: we reach the termination state once all tasks have been performed, but the number of stages for this to occur is unknown and depends on the policy. However, a bound on the required time for termination by an optimal policy can be easily calculated.

Example 1.4.1 (Parking)

A driver is looking for inexpensive parking on the way to his destination. The parking area contains N spaces, numbered 0, . . . , N − 1, and a garage following space N − 1. The driver starts at space 0 and traverses the parking spaces sequentially, i.e., from space k he goes next to space k + 1, etc. Each parking space k costs c(k) and is free with probability p(k) independently of whether other parking spaces are free or not. If the driver reaches the last parking space N − 1 and does not park there, he must park at the garage, which costs C. The driver can observe whether a parking space is free only when he reaches it, and then, if it is free, he makes a decision to park in that space or not to park and check the next space. The problem is to find the minimum expected cost parking policy.

We formulate the problem as a DP problem with N stages, corresponding to the parking spaces, and an artificial termination state t that corresponds to having parked; see Fig. 1.4.1. At each stage k = 1, . . . , N − 1, we have three states: the artificial termination state t, and the two states F and F̄, corresponding to space k being free or taken, respectively. At stage 0, we have only two states, F and F̄, and at the final stage there is only one state, the termination state t. The decision/control is to park or continue at state F [there is no choice at state F̄ and state t]. From location k, the termination

† Games often involve two players/decision makers, in which case they can be addressed by suitably modified exact or approximate DP algorithms. The DP algorithm that we have discussed in this chapter involves a single decision maker, but can be used to find an optimal policy for one player against a fixed and known policy of the other player.

‡ When an upper bound on the number of stages to termination is not known, the problem may be formulated as an infinite horizon problem of the stochastic shortest path type, which we will discuss in Chapter 5.


Figure 1.4.1 Cost structure of the parking problem. The driver may park at space k = 0, 1, . . . , N − 1 at cost c(k), if the space is free, or continue to the next space k + 1 at no cost. At space N (the garage) the driver must park at cost C.

state t is reached at cost c(k) when a parking decision is made (assuming location k is free). Otherwise, the driver continues to the next state at no cost. At stage N, the driver must park at cost C.

Let us now derive the form of the DP algorithm, denoting:

J∗k (F ): The optimal cost-to-go upon arrival at a space k that is free.

J∗k(F̄): The optimal cost-to-go upon arrival at a space k that is taken.

J∗k (t): The cost-to-go of the “parked”/termination state t.

The DP algorithm for k = 0, . . . , N − 1 takes the form

J∗k(F) = min[ c(k), p(k + 1)J∗k+1(F) + (1 − p(k + 1))J∗k+1(F̄) ]   if k < N − 1,

J∗k(F) = min[ c(N − 1), C ]   if k = N − 1,

J∗k(F̄) = p(k + 1)J∗k+1(F) + (1 − p(k + 1))J∗k+1(F̄)   if k < N − 1,

J∗k(F̄) = C   if k = N − 1,

for the states other than the termination state t, while for t we have

J∗k (t) = 0, k = 1, . . . , N.

The minimization above corresponds to the two choices (park or not park) at the states F that correspond to a free parking space.

While this algorithm is easily executed, it can be written in a simpler and equivalent form. This can be done by introducing the scalars

Jk = p(k)J∗k(F) + (1 − p(k))J∗k(F̄),   k = 0, . . . , N − 1,

which can be viewed as the optimal expected cost-to-go upon arriving at space k but before verifying its free or taken status. Indeed, from the preceding DP algorithm, we have

JN−1 = p(N − 1) min[ c(N − 1), C ] + (1 − p(N − 1)) C,

Jk = p(k) min[ c(k), Jk+1 ] + (1 − p(k)) Jk+1,   k = 0, . . . , N − 2.

From this algorithm we can also obtain the optimal parking policy:

Park at space k = 0, . . . , N − 1 if it is free and we have c(k) ≤ Jk+1.
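The simplified recursion and the threshold rule above translate directly into code. The following is a minimal sketch (the cost and probability data are illustrative assumptions) that computes the scalars Jk backwards and the resulting park-or-continue policy.

# Minimal sketch of the simplified DP recursion for the parking problem.
# c[k] is the cost of space k, p[k] the probability it is free, C the garage
# cost; the numerical data below are illustrative assumptions.

def parking_policy(c, p, C):
    N = len(c)
    J = [0.0] * N                 # J[k]: expected cost-to-go upon arriving at space k
    J[N - 1] = p[N - 1] * min(c[N - 1], C) + (1 - p[N - 1]) * C
    for k in range(N - 2, -1, -1):
        J[k] = p[k] * min(c[k], J[k + 1]) + (1 - p[k]) * J[k + 1]
    # Park at a free space k if c(k) <= J_{k+1} (with J_N taken to be C)
    thresholds = [J[k + 1] if k + 1 < N else C for k in range(N)]
    policy = [c[k] <= thresholds[k] for k in range(N)]
    return J, policy

c = [10, 8, 6, 4, 2, 1]           # cheaper spaces closer to the destination (assumed)
p = [0.9, 0.8, 0.6, 0.5, 0.3, 0.2]
J, policy = parking_policy(c, p, C=15)
print(J[0])                       # optimal expected cost starting at space 0
print(policy)                     # True where parking at a free space is optimal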


1.4.3 State Augmentation, Time Delays, Forecasts, and Uncontrollable State Components

In practice, we are often faced with situations where some of the assumptions of our stochastic optimal control problem are violated. For example, the disturbances may involve a complex probabilistic description that may create correlations that extend across stages, or the system equation may include dependences on controls applied in earlier stages, which affect the state with some delay.

Generally, in such cases the problem can be reformulated into our DP problem format through a technique, which is called state augmentation because it typically involves the enlargement of the state space. The general guideline in state augmentation is to include in the enlarged state at time k all the information that is known to the controller at time k and can be used with advantage in selecting uk. State augmentation allows the treatment of time delays in the effects of control on future states, correlated disturbances, forecasts of probability distributions of future disturbances, and many other complications. We note, however, that state augmentation often comes at a price: the reformulated problem may have a very complex state space. We provide some examples.

Time Delays

In some applications the system state xk+1 depends not only on the preceding state xk and control uk, but also on earlier states and controls. Such situations can be handled by expanding the state to include an appropriate number of earlier states and controls.

As an example, assume that there is at most a single stage delay in the state and control; i.e., the system equation has the form

xk+1 = fk(xk, xk−1, uk, uk−1, wk), k = 1, . . . , N − 1, (1.59)

x1 = f0(x0, u0, w0).

If we introduce additional state variables yk and sk, and we make the identifications yk = xk−1, sk = uk−1, the system equation (1.59) yields

(xk+1, yk+1, sk+1) = ( fk(xk, yk, uk, sk, wk), xk, uk ).     (1.60)

By defining x̃k = (xk, yk, sk) as the new state, we have

x̃k+1 = f̃k(x̃k, uk, wk),

where the system function f̃k is defined from Eq. (1.60).
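In code, the augmentation amounts to carrying the pair (previous state, previous control) along with the current state. The sketch below is a generic wrapper, in which the delayed dynamics f is an assumed placeholder function (not from the text); it shows the delay-free augmented system of Eq. (1.60).

# Minimal sketch: wrapping a system with a single-stage delay,
# x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k), as a delay-free system on the
# augmented state (x_k, y_k, s_k) = (x_k, x_{k-1}, u_{k-1}), cf. Eq. (1.60).
# The function f below is an illustrative placeholder for the delayed dynamics.

def f(x, x_prev, u, u_prev, w):
    # assumed scalar example of delayed dynamics (not from the text)
    return 0.8 * x + 0.1 * x_prev + u + 0.5 * u_prev + w

def augmented_step(aug_state, u, w):
    """One step of the augmented system: returns (x_{k+1}, x_k, u_k)."""
    x, y, s = aug_state            # y = previous state, s = previous control
    x_next = f(x, y, u, s, w)
    return (x_next, x, u)

# usage: simulate a few stages from an assumed initial augmented state
state = (1.0, 0.0, 0.0)            # (current state, previous state, previous control)
for k, (u, w) in enumerate([(0.5, 0.0), (-0.2, 0.1), (0.0, -0.05)]):
    state = augmented_step(state, u, w)
    print(k, state)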


By using the preceding equation as the system equation and by expressing the cost function in terms of the new state, the problem is reduced to a problem without time delays. Naturally, the control uk should now depend on the new state x̃k, or equivalently a policy should consist of functions µk of the current state xk, as well as the preceding state xk−1 and the preceding control uk−1.

When the DP algorithm for the reformulated problem is translated in terms of the variables of the original problem, it takes the form

J∗N(xN) = gN(xN),

J∗N−1(xN−1, xN−2, uN−2) = min_{uN−1∈UN−1(xN−1)} E_{wN−1} { gN−1(xN−1, uN−1, wN−1) + J∗N( fN−1(xN−1, xN−2, uN−1, uN−2, wN−1) ) },

J∗k(xk, xk−1, uk−1) = min_{uk∈Uk(xk)} E_{wk} { gk(xk, uk, wk) + J∗k+1( fk(xk, xk−1, uk, uk−1, wk), xk, uk ) },   k = 1, . . . , N − 2,

J∗0(x0) = min_{u0∈U0(x0)} E_{w0} { g0(x0, u0, w0) + J∗1( f0(x0, u0, w0), x0, u0 ) }.

Similar reformulations are possible when time delays appear in the cost or the control constraints; for example, in the case where the cost has the form

E { gN(xN, xN−1) + g0(x0, u0, w0) + Σ_{k=1}^{N−1} gk(xk, xk−1, uk, wk) }.

The extreme case of time delays in the cost arises in the nonadditive form

E { gN(xN, xN−1, . . . , x0, uN−1, . . . , u0, wN−1, . . . , w0) }.

Then, the problem can be reduced to the standard problem format, by using as augmented state

x̃k = ( xk, xk−1, . . . , x0, uk−1, . . . , u0, wk−1, . . . , w0 )

and E{ gN(x̃N) } as reformulated cost. Policies consist of functions µk of the present and past states xk, . . . , x0, the past controls uk−1, . . . , u0, and the past disturbances wk−1, . . . , w0. Naturally, we must assume that the past disturbances are known to the controller. Otherwise, we are faced with a problem where the state is imprecisely known to the controller, which will be discussed in the next section.


Forecasts

Consider a situation where at time k the controller has access to a forecast yk that results in a reassessment of the probability distribution of the subsequent disturbance wk and, possibly, future disturbances. For example, yk may be an exact prediction of wk or an exact prediction that the probability distribution of wk is a specific one out of a finite collection of distributions. Forecasts of interest in practice are, for example, probabilistic predictions on the state of the weather, the interest rate for money, and the demand for inventory. Generally, forecasts can be handled by introducing additional state variables corresponding to the information that the forecasts provide. We will illustrate the process with a simple example.

Assume that at the beginning of each stage k, the controller receives an accurate prediction that the next disturbance wk will be selected according to a particular probability distribution out of a given collection of distributions {P1, . . . , Pm}; i.e., if the forecast is i, then wk is selected according to Pi. The a priori probability that the forecast will be i is denoted by pi and is given.

The forecasting process can be represented by means of the equation

yk+1 = ξk,

where yk+1 can take the values 1, . . . , m, corresponding to the m possible forecasts, and ξk is a random variable taking the value i with probability pi. The interpretation here is that when ξk takes the value i, then wk+1 will occur according to the distribution Pi.

By combining the system equation with the forecast equation yk+1 = ξk, we obtain an augmented system given by

(xk+1, yk+1) = ( fk(xk, uk, wk), ξk ).

The new state is x̃k = (xk, yk). The new disturbance is w̃k = (wk, ξk), and its probability distribution is determined by the distributions Pi and the probabilities pi, and depends explicitly on x̃k (via yk) but not on the prior disturbances.

Thus, by suitable reformulation of the cost, the problem can be cast as a stochastic DP problem. Note that the control applied depends on both the current state and the current forecast. The DP algorithm takes the form

J∗N(xN, yN) = gN(xN),

J∗k(xk, yk) = min_{uk∈Uk(xk)} E_{wk} { gk(xk, uk, wk) + Σ_{i=1}^{m} pi J∗k+1( fk(xk, uk, wk), i ) | yk },     (1.61)

where yk may take the values 1, . . . , m, and the expectation over wk is taken with respect to the distribution Pyk.

Note that the preceding formulation admits several extensions. One example is the case where forecasts can be influenced by the control action (e.g., pay extra for a more accurate forecast), and may involve several future disturbances. However, the price for these extensions is increased complexity of the corresponding DP algorithm.
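A small sketch of the forecast augmentation (with an assumed scalar system and assumed distributions P1, . . . , Pm, not from the text) is given below: the state carries the current forecast yk, the disturbance wk is drawn from the distribution P_{yk}, and the next forecast is drawn with the a priori probabilities pi.

import random

# Minimal sketch of the forecast-augmented system (x_{k+1}, y_{k+1}) =
# (f_k(x_k, u_k, w_k), xi_k). The dynamics f and all distributions below are
# illustrative assumptions, not from the text.

P = {1: [(-1.0, 0.5), (1.0, 0.5)],     # forecast 1: w is +/-1 with prob 1/2
     2: [(0.0, 0.9), (2.0, 0.1)]}      # forecast 2: w is 0 w.p. 0.9, 2 w.p. 0.1
p_forecast = {1: 0.7, 2: 0.3}          # a priori forecast probabilities p_i

def f(x, u, w):
    return x + u + w                   # assumed simple dynamics

def sample(dist):
    values, probs = zip(*dist)
    return random.choices(values, weights=probs)[0]

def augmented_step(x, y, u):
    """One transition of the augmented state (x_k, y_k) under control u."""
    w = sample(P[y])                                      # w_k ~ P_{y_k}
    xi = sample([(i, q) for i, q in p_forecast.items()])  # next forecast
    return f(x, u, w), xi

x, y = 0.0, 1
for k in range(3):
    x, y = augmented_step(x, y, u=0.0)
    print(k, x, y)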

Problems with Uncontrollable State Components

In many problems of interest the natural state of the problem consists of several components, some of which cannot be affected by the choice of control. In such cases the DP algorithm can be simplified considerably, and be executed over the controllable components of the state.

As an example, let the state of the system be a composite (xk, yk) of two components xk and yk. The evolution of the main component, xk, is affected by the control uk according to the equation

xk+1 = fk(xk, yk, uk, wk),

where the distribution Pk(wk | xk, yk, uk) is given. The evolution of the other component, yk, is governed by a given conditional distribution Pk(yk | xk) and cannot be affected by the control, except indirectly through xk. One is tempted to view yk as a disturbance, but there is a difference: yk is observed by the controller before applying uk, while wk occurs after uk is applied, and indeed wk may probabilistically depend on uk.

It turns out that we can formulate a DP algorithm that is executed over the controllable component of the state, with the dependence on the uncontrollable component being "averaged out" as in the preceding example (see also the parking Example 1.4.1). In particular, let J∗k(xk, yk) denote the optimal cost-to-go at stage k and state (xk, yk), and define

Jk(xk) = E_{yk} { J∗k(xk, yk) | xk }.

Note that the preceding expression can be interpreted as an "average cost-to-go" at xk (averaged over the values of the uncontrollable component yk). Then, similar to the preceding parking example, a DP algorithm that generates Jk(xk) can be obtained, and has the following form:†

Jk(xk) = E_{yk} { min_{uk∈Uk(xk,yk)} E_{wk} { gk(xk, yk, uk, wk) + Jk+1( fk(xk, yk, uk, wk) ) | xk, yk, uk } | xk }.     (1.62)

Note that the minimization in the right-hand side of the preceding

equation must still be performed for all values of the full state (xk, yk) in order to yield an optimal control law as a function of (xk, yk). Nonetheless, the equivalent DP algorithm (1.62) has the advantage that it is executed over a significantly reduced state space. Later, when we consider approximation in value space, we will find that it is often more convenient to approximate Jk(xk) than to approximate J∗k(xk, yk); see the following discussions of forecasts and of the game of tetris.

As an example, consider the augmented state resulting from the incorporation of forecasts, as described earlier. Then, the forecast yk represents an uncontrolled state component, so that the DP algorithm can be simplified as in Eq. (1.62). In particular, assume that the forecast yk can take values i = 1, . . . , m with probability pi. Then, by defining

Jk(xk) = Σ_{i=1}^{m} pi J∗k(xk, i),   k = 0, 1, . . . , N − 1,

and JN (xN ) = gN (xN ), we have, using Eq. (1.61),

Jk(xk) = Σ_{i=1}^{m} pi min_{uk∈Uk(xk)} E_{wk} { gk(xk, uk, wk) + Jk+1( fk(xk, uk, wk) ) | yk = i },

† This is a consequence of the calculation

Jk(xk) = E_{yk} { J∗k(xk, yk) | xk }

= E_{yk} { min_{uk∈Uk(xk,yk)} E_{wk,xk+1,yk+1} { gk(xk, yk, uk, wk) + J∗k+1(xk+1, yk+1) | xk, yk, uk } | xk }

= E_{yk} { min_{uk∈Uk(xk,yk)} E_{wk,xk+1} { gk(xk, yk, uk, wk) + E_{yk+1} { J∗k+1(xk+1, yk+1) | xk+1 } | xk, yk, uk } | xk }.


Figure 1.4.2 Illustration of a tetris board.

which is executed over the space of xk rather than xk and yk. Note that this is a simpler algorithm to approximate than the one of Eq. (1.61).
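For a finite-state model, the reduced recursion over the controllable component can be written compactly: at each stage one averages over the uncontrollable component (here, the forecast values i = 1, . . . , m with probabilities pi) outside the minimization, as in the last displayed formula. The following sketch uses assumed finite sets and assumed functions f, g (illustrative only).

# Minimal sketch of the reduced DP recursion with an uncontrollable forecast
# component: Jhat_k(x) = sum_i p_i * min_u E_{w ~ P_i}[ g(x,u,w) + Jhat_{k+1}(f(x,u,w)) ].
# The finite sets and the functions f, g, gN below are illustrative assumptions.

states = [0, 1, 2]
controls = [0, 1]
W = {1: [(-1, 0.5), (1, 0.5)], 2: [(0, 1.0)]}   # disturbance distributions P_i
p_forecast = {1: 0.6, 2: 0.4}                   # forecast probabilities p_i
N = 5                                           # horizon (assumed)

def f(x, u, w):
    return max(0, min(2, x + u + w))            # keep the state in {0, 1, 2}

def g(x, u, w):
    return x * x + u                            # assumed stage cost

def gN(x):
    return x * x                                # assumed terminal cost

Jhat = {x: gN(x) for x in states}               # stage N
for k in range(N - 1, -1, -1):
    Jnew = {}
    for x in states:
        total = 0.0
        for i, pi in p_forecast.items():
            best = min(sum(pw * (g(x, u, w) + Jhat[f(x, u, w)]) for w, pw in W[i])
                       for u in controls)
            total += pi * best
        Jnew[x] = total
    Jhat = Jnew
print(Jhat)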

Uncontrollable state components often occur in arrival systems, such as queueing, where action must be taken in response to a random event (such as a customer arrival) that cannot be influenced by the choice of control. Then the state of the arrival system must be augmented to include the random event, but the DP algorithm can be executed over a smaller space, as per Eq. (1.62). Here is an example of this type.

Example 1.4.2 (Tetris)

Tetris is a popular video game played on a two-dimensional grid. Each square in the grid can be full or empty, making up a "wall of bricks" with "holes" and a "jagged top" (see Fig. 1.4.2). The squares fill up as blocks of different shapes fall from the top of the grid and are added to the top of the wall. As a given block falls, the player can move horizontally and rotate the block in all possible ways, subject to the constraints imposed by the sides of the grid and the top of the wall. The falling blocks are generated independently according to some probability distribution, defined over a finite set of standard shapes. The game starts with an empty grid and ends when a square in the top row becomes full and the top of the wall reaches the top of the grid. When a row of full squares is created, this row is removed, the bricks lying above this row move one row downward, and the player scores a point. The player's objective is to maximize the score attained (total number of rows removed) up to the end of the horizon or up to termination of the game, whichever occurs first.

We can model the problem of finding an optimal tetris playing strategy as a finite horizon stochastic DP problem, with a very long horizon. The state consists of two components:

(1) The board position, i.e., a binary description of the full/empty status of each square, denoted by x.


(2) The shape of the current falling block, denoted by y.

The control, denoted by u, is the horizontal positioning and rotation applied to the falling block. There is also an additional termination state which is cost-free. Once the state reaches the termination state, it stays there with no change in score. Moreover, there is a very large amount added to the score when the end of the horizon is reached without the game having terminated.

The shape y is generated according to a probability distribution p(y), independently of the control, so it can be viewed as an uncontrollable state component. The DP algorithm (1.62) is executed over the space of board positions x and has the intuitive form

Jk(x) = Σ_y p(y) max_u [ g(x, y, u) + Jk+1( f(x, y, u) ) ],   for all x,     (1.63)

where

g(x, y, u) is the number of points scored (rows removed),

f(x, y, u) is the next board position (or termination state),

when the state is (x, y) and control u is applied, respectively. The DP algorithm (1.63) assumes a finite horizon formulation of the problem.

Alternatively, we may consider an undiscounted infinite horizon formulation, involving a termination state (i.e., a stochastic shortest path problem). The "reduced" form of Bellman's equation, which corresponds to the DP algorithm (1.63), has the form

J(x) = Σ_y p(y) max_u [ g(x, y, u) + J( f(x, y, u) ) ],   for all x.

The value J(x) can be interpreted as an "average score" at x (averaged over the values of the uncontrollable block shapes y).

Finally, let us note that despite the simplification achieved by eliminating the uncontrollable portion of the state, the number of states x is still enormous, and the problem can only be addressed by suboptimal methods, which will be discussed later in this book.†

1.4.4 Partial State Information and Belief States

We have assumed so far that the controller has access to the exact value of the current state xk, so a policy consists of a sequence of functions µk(·), k = 0, . . . , N − 1. However, in many practical settings this assumption is unrealistic, because some components of the state may be inaccessible for observation, the sensors used for measuring them may be inaccurate, or the cost of measuring them more accurately may be prohibitive.

† Tetris has received a lot of attention as a challenging testbed for RL algorithms over a period spanning 20 years (1995-2015), starting with the papers [TsV96] and [BeI96], and ending with the papers [GGS13], [SGG15], which contain many references to related works in the intervening years.

Figure 1.4.3 Cost structure and transitions of the bidirectional parking problem. The driver may park at space k = 0, 1, . . . , N − 1 at cost c(k), if the space is free, can move to k − 1 at cost t−k, or can move to k + 1 at cost t+k. At space N (the garage) the driver must park at cost C.

Often in such situations the controller has access to only some ofthe components of the current state, and the corresponding observationsmay also be corrupted by stochastic uncertainty. For example in three-dimensional motion problems, the state may consist of the six-tuple of po-sition and velocity components, but the observations may consist of noise-corrupted radar measurements of the three position components. Thisgives rise to problems of partial or imperfect state information, which havereceived a lot of attention in the optimization and artificial intelligenceliterature (see e.g., [Ber17], [RuN16]; these problems are also popularlyreferred to with the acronym POMDP for partially observed MarkovianDecision problem).

Solving a POMDP exactly is typically intractable in practice, even though there are DP algorithms for doing so. The most common approach is to replace the state xk with a belief state, which we will denote by bk. It is the probability distribution of xk given all the observations that have been obtained by the controller up to time k, and it can serve as “state” in an appropriate DP algorithm. The belief state can in principle be computed and updated by a variety of methods that are based on Bayes’ rule, such as Kalman filtering (see e.g., [AnM79], [KuV86], [Kri16], [ChC17]) and particle filtering (see e.g., [GSS93], [DoJ09], [Can16], [Kri16]).

In problems where the state xk can take a finite but large number of values, say n, the belief states comprise an n-dimensional simplex, so discretization becomes problematic. As a result, alternative suboptimal solution methods are often used in partial state information problems. Some of these methods will be described in future chapters.

Example 1.4.3 (Bidirectional Parking)

Let us consider a more complex version of the parking problem of Example 1.4.1. As in that example, a driver is looking for inexpensive parking on the way to his destination, along a line of N parking spaces with a garage at the end. The difference is that the driver can move in either direction, rather than just forward towards the garage. In particular, at space i, the driver can park at cost c(i) if i is free, can move to i − 1 at a cost t_i^-, or can move to i + 1 at a cost t_i^+. Moreover, the driver records and remembers the free/taken status of the spaces previously visited and may return to any of these spaces; see Fig. 1.4.3.

Let us assume that the probability p(i) of a space i being free changes over time, i.e., a space found free (or taken) at a given visit may get taken (or become free, respectively) by the time of the next visit. The initial probabilities p(i), before visiting any spaces, are known, and the mechanism by which these probabilities change over time is also known to the driver. As an example, we may assume that at each time stage, p(i) increases by a certain known factor with some probability ξ and decreases by another known factor with the complementary probability 1 − ξ.

Here the belief state is the vector of current probabilities (p(0), . . . , p(N − 1)), and it can be updated with a simple algorithm at each time based on the new observation: the free/taken status of the space visited at that time.
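As an illustration, the following Python sketch updates this belief vector at the end of a stage. The multiplicative drift factors, the probability ξ, and the choice of applying the observation before the drift are assumptions made for the sketch, in the spirit of the example above; they are not prescribed by the text.

```python
import numpy as np

def update_parking_belief(p, i_visited, observed_free, xi=0.3, c_up=1.1, c_dn=0.9):
    """One-stage update of the belief vector (p(0), ..., p(N-1))."""
    p = np.asarray(p, dtype=float).copy()
    # The observation resolves the status of the visited space at the current time.
    p[i_visited] = 1.0 if observed_free else 0.0
    # All probabilities then drift: p(i) is multiplied by c_up with probability xi and
    # by c_dn with probability 1 - xi, so its expected next value is the mixture below.
    # (A richer drift model could also let a space found taken become free again.)
    return np.clip((xi * c_up + (1.0 - xi) * c_dn) * p, 0.0, 1.0)
```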

Despite their inherent computational difficulty, it turns out that conceptually, partial state information problems are no different than the perfect state information problems we have been addressing so far. In fact by various reformulations, we can reduce a partial state information problem to one with perfect state information. Once this is done, it is possible to state an exact DP algorithm that is defined over the set of belief states. This algorithm has the form

J*_k(b_k) = \min_{u_k ∈ U_k} [ g_k(b_k, u_k) + E_{z_{k+1}}{ J*_{k+1}(F_k(b_k, u_k, z_{k+1})) } ],   (1.64)

where:

J*k(bk) denotes the optimal cost-to-go starting from belief state bk at stage k.

Uk is the control constraint set at time k (since the state xk is unknown at stage k, Uk must be independent of xk).

gk(bk, uk) denotes the expected stage cost of stage k. It is calculated as the expected value of the stage cost gk(xk, uk, wk), with the joint distribution of (xk, wk) determined by the belief state bk and the distribution of wk.

Fk(bk, uk, zk+1) denotes the belief state at the next stage, given that the current belief state is bk, control uk is applied, and observation zk+1 is received following the application of uk:

bk+1 = Fk(bk, uk, zk+1).   (1.65)


Figure 1.4.4 Schematic illustration of the view of an imperfect state information problem as one of perfect state information, whose state is the belief state bk, i.e., the conditional probability distribution of xk given all the observations up to time k. The observation zk+1 plays the role of the stochastic disturbance. The function Fk is a sequential estimator that updates the current belief state bk.

This is the system equation for a perfect state information problem with state bk, control uk, “disturbance” zk+1, and cost per stage gk(bk, uk). The function Fk is viewed as a sequential belief estimator, which updates the current belief state bk based on the new observation zk+1. It is given by either an explicit formula or an algorithm (such as Kalman filtering or particle filtering) that is based on the probability distribution of zk and the use of Bayes’ rule.

The expected value E_{z_{k+1}}{·} is taken with respect to the distribution of zk+1, given bk and uk. Note that zk+1 is random, and its distribution depends on xk and uk, so the expected value

E_{z_{k+1}}{ J*_{k+1}(F_k(b_k, u_k, z_{k+1})) }

in Eq. (1.64) is a function of bk and uk.

The algorithm (1.64) is just the ordinary DP algorithm for the perfect state information problem shown in Fig. 1.4.4. It involves the system/belief estimator (1.65) and the cost per stage gk(bk, uk). Note that since bk takes values in a continuous space, the algorithm (1.64) can only be executed approximately, using approximation in value space methods.
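For intuition on what the sequential estimator Fk computes, here is a minimal Python sketch of a Bayes’ rule update for the special case of a finite state space, assuming known transition probabilities P[u][x, x'] = p(x' | x, u) and observation likelihoods O[x', z] = p(z | x'); these arrays and their names are assumptions of the illustration, not part of the text.

```python
import numpy as np

def belief_update(b, u, z, P, O):
    """One application of Bayes' rule: b_{k+1} = F_k(b_k, u_k, z_{k+1}).

    b : belief vector over the n states at time k
    u : control applied at time k (indexes the transition matrix P[u])
    z : observation received at time k+1 (indexes a column of O)
    """
    predicted = b @ P[u]                  # propagate the belief through the dynamics
    unnormalized = predicted * O[:, z]    # weight by the likelihood of the observation
    return unnormalized / unnormalized.sum()
```

For linear Gaussian models the analogous update is carried out in closed form by the Kalman filter, and sampling-based approximations of it lead to particle filtering.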

We refer to the textbook [Ber17], Chapter 4, for a detailed derivation of the DP algorithm (1.64), and to the monograph [BeS78] for a mathematical treatment that applies to infinite-dimensional state and disturbance spaces as well.

1.4.5 Multiagent Problems and Multiagent Rollout

Figure 1.4.5 Schematic illustration of a multiagent problem. There are multiple “agents,” and each agent ℓ = 1, . . . , m controls its own decision variable uℓ. At each stage, agents exchange new information and also exchange information with the “environment,” and then select their decision variables for the stage.

Let us consider the discounted infinite horizon problem and a special structure of the control space, whereby the control u consists of m components, u = (u1, . . . , um), with a separable control constraint structure uℓ ∈ Uℓ(x), ℓ = 1, . . . , m. Thus the control constraint set is the Cartesian product

U(x) = U1(x) × · · · × Um(x),   (1.66)

where the sets Uℓ(x) are given. This structure is inspired by applications involving distributed decision making by multiple agents with communication and coordination between the agents; see Fig. 1.4.5.

In particular, we will view each component uℓ, ℓ = 1, . . . , m, as being chosen from within Uℓ(x) by a separate “agent” (a decision making entity). For the sake of the following discussion, we assume that each set Uℓ(x) is finite. Then the one-step lookahead minimization of the standard rollout scheme with base policy µ is given by

ũ ∈ arg \min_{u ∈ U(x)} E_w{ g(x, u, w) + α J_µ(f(x, u, w)) },   (1.67)

and involves as many as n^m Q-factors, where n is the maximum number of elements of the sets Uℓ(x) [so that n^m is an upper bound to the number of controls in U(x), in view of its Cartesian product structure (1.66)]. Thus the standard rollout algorithm requires an exponential [order O(n^m)] number of Q-factor computations per stage, which can be overwhelming even for moderate values of m.

This potentially large computational overhead motivates a far more computationally efficient rollout algorithm, whereby the one-step lookahead


minimization (1.67) is replaced by a sequence of m successive minimizations, one-agent-at-a-time, with the results incorporated into the subsequent minimizations. In particular, at state x we perform the sequence of minimizations

µ̃1(x) ∈ arg \min_{u1 ∈ U1(x)} E_w{ g(x, u1, µ2(x), . . . , µm(x), w) + α J_µ(f(x, u1, µ2(x), . . . , µm(x), w)) },

µ̃2(x) ∈ arg \min_{u2 ∈ U2(x)} E_w{ g(x, µ̃1(x), u2, µ3(x), . . . , µm(x), w) + α J_µ(f(x, µ̃1(x), u2, µ3(x), . . . , µm(x), w)) },

. . . . . . . . . . . .

µ̃m(x) ∈ arg \min_{um ∈ Um(x)} E_w{ g(x, µ̃1(x), µ̃2(x), . . . , µ̃m−1(x), um, w) + α J_µ(f(x, µ̃1(x), µ̃2(x), . . . , µ̃m−1(x), um, w)) }.

Thus each agent component uℓ is obtained by a minimization with the preceding agent components u1, . . . , uℓ−1 fixed at the previously computed values of the rollout policy, and the following agent components uℓ+1, . . . , um fixed at the values given by the base policy. This algorithm requires order O(nm) Q-factor computations per stage (n computations for each of the m agents), a potentially huge computational saving over the order O(n^m) computations required by standard rollout.
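The following Python sketch implements this one-agent-at-a-time scheme at a single state. It assumes, as an illustration and not as part of the text, that a routine q_factor(x, u) estimates the Q-factor E_w{g(x, u, w) + αJ_µ(f(x, u, w))}, e.g., by averaging simulated trajectories of the base policy, and that base_policy(x) returns the tuple (µ1(x), . . . , µm(x)).

```python
def multiagent_rollout_controls(x, base_policy, q_factor, control_sets):
    """One-agent-at-a-time lookahead: returns (mu~_1(x), ..., mu~_m(x))."""
    u = list(base_policy(x))      # components not yet optimized follow the base policy
    for ell, U_ell in enumerate(control_sets(x)):
        # Minimize over agent ell's control, with agents 1,...,ell-1 fixed at the already
        # computed rollout values and agents ell+1,...,m at their base policy values.
        u[ell] = min(U_ell,
                     key=lambda c: q_factor(x, tuple(u[:ell]) + (c,) + tuple(u[ell + 1:])))
    return tuple(u)
```

The total number of Q-factor evaluations is the sum of the cardinalities of the sets Uℓ(x), i.e., of order nm, rather than the size of their Cartesian product.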

A key idea here is that the computational requirements of the rollout one-step minimization (1.67) are proportional to the number of controls in the set Uk(xk) and are independent of the size of the state space. This motivates a reformulation of the problem, first suggested in the book [BeT96], Section 6.1.4, whereby control space complexity is traded off with state space complexity, by “unfolding” the control uk into its m components, which are applied one agent-at-a-time rather than all-agents-at-once.

In particular, we can reformulate the problem by breaking down the collective decision uk into m individual component decisions, thereby reducing the complexity of the control space while increasing the complexity of the state space. The potential advantage is that the extra state space complexity does not affect the computational requirements of some RL algorithms, including rollout.

To this end, we introduce a modified but equivalent problem, involving one-at-a-time agent control selection. At the generic state x, we break down the control u into the sequence of the m controls u1, u2, . . . , um, and between x and the next state x̄ = f(x, u, w), we introduce artificial intermediate “states” (x, u1), (x, u1, u2), . . . , (x, u1, . . . , um−1), and corresponding transitions. The choice of the last control component um at “state” (x, u1, . . . , um−1) marks the transition to the next state x̄ = f(x, u, w) according to the system equation, while incurring cost g(x, u, w); see Fig. 1.4.6.

Figure 1.4.6 Equivalent formulation of the N-stage stochastic optimal control problem for the case where the control u consists of m components u1, u2, . . . , um: u = (u1, . . . , um) ∈ U1(x) × · · · × Um(x). The figure depicts the kth stage transitions. Starting from state x, we generate the intermediate states (x, u1), (x, u1, u2), . . . , (x, u1, . . . , um−1), using the respective controls u1, . . . , um−1. The final control um leads from (x, u1, . . . , um−1) to x̄ = f(x, u, w), and the random cost g(x, u, w) is incurred.

It is evident that this reformulated problem is equivalent to the original, since any control choice that is possible in one problem is also possible in the other problem, while the cost structure of the two problems is the same. In particular, every policy (µ1(x), . . . , µm(x)) of the original problem, including a base policy in the context of rollout, is admissible for the reformulated problem, and has the same cost function for the original as well as the reformulated problem.

The motivation for the reformulated problem is that the control space is simplified at the expense of introducing m − 1 additional layers of states, and the corresponding m − 1 cost-to-go functions

J1(x, u1), J2(x, u1, u2), . . . , Jm−1(x, u1, . . . , um−1).

The increase in size of the state space does not adversely affect the operation of rollout, since the Q-factor minimization (1.67) is performed for just one state at each stage.

The major fact that can be proved about multiagent rollout (to be shown in Chapters 3 and 5) is that it still achieves cost improvement:

J_µ̃(x) ≤ J_µ(x),   for all x,

where J_µ(x) is the cost function of the base policy µ, and J_µ̃(x) is the cost function of the rollout policy µ̃ = (µ̃1, . . . , µ̃m), starting from state x. Furthermore, this cost improvement property can be extended to multiagent PI schemes that involve one-agent-at-a-time policy improvement operations, and have sound convergence properties. Moreover, multiagent rollout will become the starting point for various related PI schemes that are well suited for distributed operation in important practical contexts involving multiple autonomous decision makers.

Figure 1.4.7 Illustration of a 2-dimensional spiders-and-flies problem with 20 spiders and 5 flies (cf. Example 1.4.4). The flies move randomly, regardless of the position of the spiders. During a stage, each spider moves to a neighboring location or stays where it is, so there are 5 moves per spider (except for spiders at the edges of the grid). The total number of possible joint spider moves is a little less than 5^20.

The following two-dimensional version of the spider-and-fly Example 1.3.4 illustrates the multiagent rollout scheme; see Fig. 1.4.7.

Example 1.4.4 (Spiders and Flies)

This example is representative of a broad range of practical problems such as multirobot service systems involving delivery, maintenance and repair, search and rescue, firefighting, etc. Here there are m spiders and several flies moving on a 2-dimensional grid; cf. Fig. 1.4.7. The objective is for the spiders to catch all the flies as fast as possible.

During a stage, each fly moves to some other position according to a given state-dependent probability distribution. Each spider learns the current state (the vector of spider and fly locations) at the beginning of each stage, and either moves to a neighboring location or stays where it is. Thus each spider has as many as 5 choices at each stage. The control is u = (u1, . . . , um),


where uℓ is the choice of the ℓth spider, so there are about 5^m possible values of u.

To apply multiagent rollout, we need a base policy. A simple possibility is to use the policy that directs each spider to move on the path of minimum distance to the closest fly position. According to the multiagent rollout formalism, the spiders choose their moves one-at-a-time in the order from 1 to m, taking into account the current positions of the flies and the earlier moves of other spiders, and assuming that future moves will be chosen according to the base policy, which is a tractable computation.

In particular, at the beginning of the typical stage, spider 1 selects its best move (out of the no more than 5 possible moves), assuming the other spiders 2, . . . , m will move towards their closest surviving fly during the current stage, and all spiders will move towards their closest surviving fly during the following stages, up to the time where no surviving flies remain. Spider 1 then broadcasts its selected move to all other spiders. Then spider 2 selects its move taking into account the move already chosen by spider 1, and assuming that spiders 3, . . . , m will move towards their closest surviving fly during the current stage, and all spiders will move towards their closest surviving fly during the following stages, up to the time where no surviving flies remain. Spider 2 then broadcasts its choice to all other spiders. This process of one-spider-at-a-time move selection is repeated for the remaining spiders 3, . . . , m, marking the end of the stage.

Note that while standard rollout computes and compares 5^m Q-factors (actually a little less, to take into account edge effects), multiagent rollout computes and compares ≤ 5 moves per spider, for a total of less than 5m. Despite this tremendous computational economy, experiments with this type of spiders and flies problems have shown that multiagent rollout achieves performance comparable to that of standard rollout.
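As an illustration of how simple such a base policy can be, here is a Python sketch of the “move toward the closest surviving fly” heuristic on a grid with Manhattan distances; the tie-breaking rule (close the row gap first) is an arbitrary assumption made only to keep the sketch deterministic.

```python
def base_policy_move(spider, flies):
    """Return a unit move (drow, dcol) toward the closest surviving fly, or (0, 0)."""
    if not flies:
        return (0, 0)
    r, c = spider
    fr, fc = min(flies, key=lambda f: abs(f[0] - r) + abs(f[1] - c))
    if (fr, fc) == (r, c):
        return (0, 0)                    # already on a fly: stay and catch it
    if abs(fr - r) >= abs(fc - c):       # assumed tie-break: close the row gap first
        return ((fr > r) - (fr < r), 0)
    return (0, (fc > c) - (fc < c))
```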

1.4.6 Problems with Unknown Parameters - Adaptive Control

Our discussion so far dealt with problems with a known mathematical model, i.e., one where the system equation, cost function, control constraints, and probability distributions of disturbances are perfectly known. The mathematical model may be available through explicit mathematical formulas and assumptions, or through a computer program that can emulate all of the mathematical operations involved in the model, including Monte Carlo simulation for the calculation of expected values.

It is important to note here that from our point of view, it makes no difference whether the mathematical model is available through closed form mathematical expressions or through a computer simulator: the methods that we discuss are valid either way, only their suitability for a given problem may be affected by the availability of mathematical formulas. Moreover, problems with a known mathematical model are the only type that we will formally address in this book with DP and approximate DP methods.


Of course in practice, it is common that the system parameters are either not known exactly or may change over time.† As an example consider our oversimplified cruise control system of Example 1.3.1 or its infinite horizon version of Examples 1.3.3 and 1.3.5. The state evolves according to

xk+1 = xk + buk + wk, (1.68)

where xk is the deviation vk − v of the vehicle’s velocity vk from the nominal v, uk is the force that propels the car forward, and wk is the disturbance that has nonzero mean. However, the coefficient b and the distribution of wk change frequently, and cannot be modeled with any precision because they depend on unpredictable time-varying conditions, such as the slope and condition of the road, and the weight of the car (which is affected by the number of passengers). Moreover, the nominal velocity v is set by the driver, and when it changes it may affect the parameter b in the system equation, and other parameters.‡

† The difficulties of decision and control within a changing environment are often underestimated. Among others, they complicate the balance between off-line training and on-line play, which we discussed in Section 1.1 in connection with AlphaZero. As much as learning to play high quality chess is a great challenge, the rules of play are stable and do not change unpredictably in the middle of a game!

‡ Adaptive cruise control, which can also adapt the car’s velocity based on its proximity to other cars, has been studied extensively and has been incorporated in several commercially sold car models.

In this section, we will briefly review some of the most commonly used approaches for dealing with unknown parameters in optimal control theory and practice. We should note also that unknown problem environments are an integral part of the artificial intelligence view of RL. In particular, to quote from the popular book by Sutton and Barto [SuB18], RL is viewed as “a computational approach to learning from interaction,” and “learning from interaction with the environment is a foundational idea underlying nearly all theories of learning and intelligence.” The idea of learning from interaction with the environment is often connected with the idea of exploring the environment to identify its characteristics. In control theory this is often viewed as part of the system identification methodology, which aims to construct mathematical models of dynamic systems. The system identification process is often combined with the control process to deal with unknown or changing problem parameters. This is one of the most challenging areas of stochastic optimal and suboptimal control, and has been studied since the early 1960s.

Robust and Adaptive Control

Given a controller design that has been obtained assuming a nominal DP problem model, one possibility is to simply ignore changes in problem parameters. We may then try to investigate the performance of the current design for a suitable range of problem parameter values, and ensure that it is adequate for the entire range. This is sometimes called a robust controller design. For example, consider the oversimplified cruise control system of Eq. (1.68) with a linear controller of the form µ(x) = Lx for some scalar L. Then we check the range of parameters b for which the current controller is stable (this is the interval of values b for which |1 + bL| < 1), and ensure that b remains within that range during the system’s operation.
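As a tiny numerical check of this robustness criterion, the following Python snippet verifies the stability condition |1 + bL| < 1 over candidate values of b for an illustrative gain L; the numbers are assumptions, not taken from the text.

```python
L = -0.5                                  # an illustrative controller gain
# Closed loop: x_{k+1} = (1 + b*L) x_k + w_k, stable exactly when |1 + b*L| < 1,
# which for L < 0 is the interval 0 < b < -2/L (here 0 < b < 4).
for b in [0.5, 1.0, 3.0, 4.5]:
    print(f"b = {b:3.1f}   |1 + bL| = {abs(1 + b * L):.2f}   stable: {abs(1 + b * L) < 1}")
```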

The more general class of methods where the controller is modified in response to problem parameter changes is part of a broad field known as adaptive control, i.e., control that adapts to changing parameters. This is a rich methodology with many and diverse applications. We will not discuss adaptive control in this book. We just mention a simple time-honored adaptive control approach for continuous-state problems called PID (Proportional-Integral-Derivative) control, for which we refer to the control literature, including the books by Astrom and Hagglund [AsH95], [AsH06] (also the discussion in Section 5.7 of the RL textbook [Ber19a]).

In particular, PID control aims to maintain the output of a single-input single-output dynamic system around a set point or to follow a given trajectory, as the system parameters change within a relatively broad range. In its simplest form, the PID controller is parametrized by three scalar parameters, which may be determined by a variety of methods, some of them manual/heuristic. PID control is used widely and with success, although its range of application is mainly restricted to single-input, single-output continuous-state control systems.

Dealing with Unknown Parameters by System Identification

In PID control, no attempt is made to maintain a mathematical model and to track unknown model parameters as they change. An alternative and apparently reasonable form of suboptimal control is to separate the control process into two phases, a system identification phase and a control phase. In the first phase the unknown parameters are estimated, while the control takes no account of the interim results of estimation. The final parameter estimates from the first phase are then used to implement an optimal or suboptimal policy in the second phase. This alternation of estimation and control phases may be repeated several times during any system run in order to take into account subsequent changes of the parameters. Moreover, it is not necessary to introduce a hard separation between the identification and the control phases. They may be going on simultaneously, with new parameter estimates being introduced into the control process, whenever this is thought to be desirable; see Fig. 1.4.8.

Figure 1.4.8 Schematic illustration of concurrent parameter estimation and system control. The system parameters are estimated on-line and the estimates are periodically passed on to the controller.

One drawback of this approach is that it is not always easy to determine when to terminate one phase and start the other. A second difficulty, of a more fundamental nature, is that the control process may make some of the unknown parameters invisible to the estimation process. This is known as the problem of parameter identifiability, which is discussed in the context of optimal control in several sources, including [BoV79] and [Kum83]; see also [Ber17], Section 6.7. For a simple example, consider the scalar system

xk+1 = a xk + b uk,   k = 0, . . . , N − 1,

and the quadratic cost

\sum_{k=1}^{N} (xk)^2.

Assuming perfect state information, if the parameters a and b are known, it can be seen that the optimal control law is

µ*_k(xk) = −(a/b) xk,

which sets all future states to 0. Assume now that the parameters a and b are unknown, and consider the two-phase method. During the first phase the control law

µk(xk) = γ xk   (1.69)

is used (γ is some scalar; for example, γ = −â/b̂, where â and b̂ are some a priori estimates of a and b, respectively). At the end of the first phase, the control law is changed to

µk(xk) = −(â/b̂) xk,


where â and b̂ are the estimates obtained from the estimation process. However, with the control law (1.69), the closed-loop system is

xk+1 = (a + bγ) xk,

so the estimation process can at best yield the value of (a + bγ) but not the values of both a and b. In other words, the estimation process cannot discriminate between pairs of values (a1, b1) and (a2, b2) such that

a1 + b1γ = a2 + b2γ.

Therefore, a and b are not identifiable when feedback control of the form (1.69) is applied.
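The following short Python experiment illustrates the point numerically: data generated in closed loop under the law (1.69) pin down only the combination a + bγ. The specific numbers and the use of a least squares fit are illustrative assumptions.

```python
import numpy as np

a_true, b_true, gamma = 0.8, 0.5, -1.2
x = [1.0]
for _ in range(8):                         # closed-loop run under u_k = gamma * x_k
    x.append(a_true * x[-1] + b_true * gamma * x[-1])

X = np.array(x[:-1]); U = gamma * X; Y = np.array(x[1:])
A = np.column_stack([X, U])                # regressors for the model x_{k+1} = a x_k + b u_k
theta, *_ = np.linalg.lstsq(A, Y, rcond=None)
# The two regressor columns are proportional (U = gamma * X), so the fit determines only
# a + b*gamma; the individual estimates depend on the (arbitrary) minimum-norm choice.
print("estimated (a, b):      ", theta)
print("estimated a + b*gamma: ", theta[0] + gamma * theta[1])
print("true      a + b*gamma: ", a_true + gamma * b_true)
```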

On-line parameter estimation algorithms, which address among others the issue of identifiability, have been discussed extensively in the control theory literature, but the corresponding methodology is complex and beyond our scope in this book. However, assuming that we can make the estimation phase work somehow, we are free to revise the controller using the newly estimated parameters in a variety of ways. Unfortunately, there is still another difficulty: it may be hard to recompute an optimal or near-optimal policy on-line, using a newly identified system model. In particular, it may be impossible to use time-consuming methods that involve for example the training of a neural network. A simpler possibility is to use rollout for on-line replanning, which we discuss next.

Adaptive Control by Rollout and On-Line Replanning

We will now consider an approach for dealing with unknown or changing parameters, which is based on on-line replanning. We have discussed this approach in the context of rollout and multiagent rollout, where we stressed the importance of fast on-line policy improvement (cf. Sections 1.1 and 1.2).

Let us assume that some problem parameters change and the current controller becomes aware of the change “instantly” (i.e., very quickly before the next stage begins). The method by which the problem parameters are recalculated or become known is immaterial for the purposes of the following discussion. It may involve a limited form of parameter estimation, whereby the unknown parameters are “tracked” by data collection over a few time stages, with due attention paid to issues of parameter identifiability; or it may involve new features of the control environment, such as a changing number of servers and/or tasks in a service system (think of new spiders and/or flies appearing or disappearing unexpectedly in the spiders-and-flies Example 1.4.4).

We thus assume away/ignore issues of parameter estimation, and focus on revising the controller by on-line replanning based on the newly obtained parameters. This revision may be based on any suboptimal method, but rollout with the current policy used as the base policy is particularly attractive. The advantage of rollout within this context is that it is simple and reliable. In particular, it does not require a complicated training procedure to revise the current policy, based for example on the use of neural networks or other approximation architectures, so no new policy is explicitly computed in response to the parameter changes. Instead the current policy is used as the base policy for rollout, and the available controls at the current state are compared by a one-step or multistep minimization, with cost function approximation provided by the base policy (cf. Fig. 1.4.9). Note also that over time the base policy may also be revised (on the basis of an unspecified rationale), in which case the rollout policy will be revised both in response to the changed current policy and in response to the changing parameters.

Figure 1.4.9 Schematic illustration of adaptive control by rollout. One-step lookahead is followed by simulation with the base policy, which stays fixed. The system, cost, and constraint parameters are changing over time, and the most recent values are incorporated into the lookahead minimization and rollout operations. For the discussion in this section, we may assume that all the changing parameter information is provided by some computation and sensor “cloud” that is beyond our control. The base policy may also be revised based on various criteria.

The principal requirement for using rollout in an adaptive control context is that the rollout control computation should be fast enough to be performed between stages. Note, however, that accelerated/truncated versions of rollout, as well as parallel computation, can be used to meet this time constraint.

The following example explores on-line replanning with the use of rollout in the context of the simple one-dimensional linear quadratic problem that we discussed earlier in this chapter.

Example 1.4.5 (On-Line Replanning for Linear Quadratic Problems Based on Rollout)

Consider the deterministic undiscounted infinite horizon linear quadratic problem of Example 1.3.3. It involves the linear system

xk+1 = xk + buk,

and the quadratic cost function

\lim_{N→∞} \sum_{k=0}^{N−1} (x_k^2 + r u_k^2).

The optimal cost function is given by

J*(x) = K* x^2,

where K* is the unique positive solution of the Riccati equation

K = \frac{rK}{r + b^2 K} + 1.   (1.70)

The optimal policy has the form

µ*(x) = L* x,   (1.71)

where

L* = − \frac{b K*}{r + b^2 K*}.   (1.72)

As an example, consider the optimal policy that corresponds to the nominal problem parameters b = 2 and r = 0.5: this is the policy (1.71)-(1.72), with K obtained as the positive solution of the quadratic Riccati Eq. (1.70) for b = 2 and r = 0.5. We thus obtain

K = \frac{2 + \sqrt{6}}{4}.

From Eq. (1.72) we then have

L = − \frac{2 + \sqrt{6}}{5 + 2\sqrt{6}}.   (1.73)

We will now consider changes of the values of b and r while keeping L constant, and we will compare the quadratic cost coefficient of the following three cost functions as b and r vary:


(a) The optimal cost function K* x^2, where K* is given by the positive solution of the Riccati Eq. (1.70).

(b) The cost function K_L x^2 that corresponds to the base policy

µ_L(x) = Lx,

where L is given by Eq. (1.73). Here, from Example 1.3.5, we have

K_L = \frac{1 + r L^2}{1 − (1 + bL)^2}.

(c) The cost function K̃_L x^2 that corresponds to the rollout policy

µ̃_L(x) = L̃x,

obtained by using the policy µ_L as base policy. Using the formulas of Example 1.3.5, we have

L̃ = − \frac{b K_L}{r + b^2 K_L},

and

K̃_L = \frac{1 + r L̃^2}{1 − (1 + b L̃)^2}.

Figure 1.4.10 shows the coefficients K*, K_L, and K̃_L for a range of values of r and b. We have

K* ≤ K̃_L ≤ K_L.

The difference K_L − K* is indicative of the robustness of the policy µ_L, i.e., the performance loss incurred by ignoring the values of b and r, and continuing to use the policy µ_L, which is optimal for the nominal values b = 2 and r = 0.5, but suboptimal for other values of b and r. The difference K̃_L − K* is indicative of the performance loss due to using on-line replanning by rollout rather than using optimal replanning. Finally, the difference K_L − K̃_L is indicative of the performance improvement due to on-line replanning using rollout rather than keeping the policy µ_L unchanged.
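The three coefficients can be computed directly from the formulas of this example. The following Python sketch does so for the nominal-optimal base gain and a range of values of b (the particular grid of b values is an arbitrary choice made only for illustration); the printout exhibits the ordering K* ≤ K̃_L ≤ K_L discussed above.

```python
import numpy as np

def riccati_K(b, r):
    # Positive root of K = r*K/(r + b^2*K) + 1, i.e. of b^2*K^2 - b^2*K - r = 0 [Eq. (1.70)].
    return (b**2 + np.sqrt(b**4 + 4.0 * b**2 * r)) / (2.0 * b**2)

def policy_cost_coeff(L, b, r):
    # K_L = (1 + r*L^2) / (1 - (1 + b*L)^2), valid when |1 + b*L| < 1 (stable closed loop).
    return (1.0 + r * L**2) / (1.0 - (1.0 + b * L)**2)

def rollout_gain(L, b, r):
    # One-step lookahead using the base policy's cost function K_L x^2.
    KL = policy_cost_coeff(L, b, r)
    return -b * KL / (r + b**2 * KL)

# Base policy: optimal for the nominal parameters b = 2, r = 0.5 [cf. Eq. (1.73)].
b_nom, r_nom = 2.0, 0.5
K_nom = riccati_K(b_nom, r_nom)
L = -b_nom * K_nom / (r_nom + b_nom**2 * K_nom)

for b in [1.0, 1.5, 2.0, 2.5, 3.0]:           # b has drifted away from its nominal value
    K_star = riccati_K(b, r_nom)              # optimal coefficient for the true b
    K_base = policy_cost_coeff(L, b, r_nom)   # keep using the nominal-optimal policy
    K_roll = policy_cost_coeff(rollout_gain(L, b, r_nom), b, r_nom)   # rollout replanning
    print(f"b = {b:3.1f}   K* = {K_star:.4f}   rollout = {K_roll:.4f}   base = {K_base:.4f}")
```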

The next example summarizes how rollout and on-line replanning connect with model predictive control (MPC). A detailed discussion will be given in Chapter 3; see also the author’s papers [Ber05a], [Ber05b], where the relations between rollout and MPC were first explored.


Figure 1.4.10 Illustration of control by rollout under changing problem parameters. The quadratic cost coefficients K* (optimal, denoted by solid line), K_L (base policy, denoted by circles), and K̃_L (rollout policy, denoted by asterisks) for the two cases where r = 0.5 and b varies, and b = 2 and r varies. The value of L is fixed at the value that is optimal for b = 2 and r = 0.5 [cf. Eq. (1.73)]. The rollout policy is very close to optimal, even when the base policy is far from optimal. This is related to the size of the second derivative of the Riccati Eq. (1.70); see Exercise 1.3.

Example 1.4.6 (Model Predictive Control, Rollout, and On-Line Replanning)

Let us briefly discuss the MPC methodology, with a view towards its connection with the rollout algorithm. Consider an undiscounted infinite horizon deterministic problem, involving the system

xk+1 = f(xk, uk),

whose state xk and control uk are finite-dimensional vectors. The cost per stage is assumed nonnegative

g(xk, uk) ≥ 0,   for all (xk, uk),

(e.g., a quadratic cost). There are control constraints uk ∈ U(xk), and to simplify the following discussion, we will assume that there are no state constraints. We assume that the system can be kept at the origin at zero cost, i.e.,

f(0, uk) = 0,   g(0, uk) = 0   for some control uk ∈ U(0).

For a given initial state x0, we want to obtain a sequence {u0, u1, . . .} that satisfies the control constraints, while minimizing the total cost. This is a classical problem in control system design, where the aim is to keep the state of the system near the origin (or more generally some desired set point), in the face of disturbances and/or parameter changes.

Figure 1.4.11 Illustration of the problem solved by MPC at state xk. We minimize the cost function over the next ℓ stages while imposing the requirement that xk+ℓ = 0. We then apply the first control of the optimizing sequence. In the context of rollout, the minimization over uk is the one-step lookahead, while the minimization over uk+1, . . . , uk+ℓ−1 that drives xk+ℓ to 0 is the base heuristic.

The MPC algorithm at each encountered state xk applies a control that is computed as follows; see Fig. 1.4.11:

(a) It solves an ℓ-stage optimal control problem involving the same cost function and the requirement xk+ℓ = 0. This is the problem

\min_{u_i, i=k,...,k+ℓ−1} \sum_{i=k}^{k+ℓ−1} g(x_i, u_i),   (1.74)

subject to the system equation constraints

x_{i+1} = f(x_i, u_i),   i = k, . . . , k + ℓ − 1,


the control constraints

u_i ∈ U(x_i),   i = k, . . . , k + ℓ − 1,

and the terminal state constraint xk+ℓ = 0. Here ℓ is an integer with ℓ > 1, which is chosen in some largely empirical way (see Section 3.1 for further discussion).

(b) If {uk, . . . , uk+ℓ−1} is the optimal control sequence of this problem, MPC applies uk and discards the other controls uk+1, . . . , uk+ℓ−1.

(c) At the next stage, MPC repeats this process, once the next state xk+1 is revealed.
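As a concrete illustration of steps (a)-(c), the following Python sketch runs MPC on a scalar linear system with quadratic stage cost; the system, cost, horizon ℓ, and the use of scipy’s SLSQP solver to handle the terminal equality constraint are all assumptions made only for this sketch, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative model: x_{k+1} = x_k + b*u_k with stage cost g(x, u) = x^2 + r*u^2.
b, r, ell = 2.0, 0.5, 5

def rollout_states(x0, u_seq):
    xs = [x0]
    for u in u_seq:
        xs.append(xs[-1] + b * u)
    return xs

def ell_stage_cost(u_seq, x0):
    xs = rollout_states(x0, u_seq)
    return sum(x**2 + r * u**2 for x, u in zip(xs[:-1], u_seq))

def mpc_control(x0):
    # Step (a): solve the ell-stage problem (1.74) with the terminal constraint x_{k+ell} = 0.
    res = minimize(ell_stage_cost, np.zeros(ell), args=(x0,), method="SLSQP",
                   constraints=[{"type": "eq",
                                 "fun": lambda u, x0=x0: rollout_states(x0, u)[-1]}])
    # Step (b): apply only the first control of the optimizing sequence.
    return res.x[0]

# Step (c): at each encountered state, repeat the computation (closed-loop simulation).
x = 5.0
for k in range(8):
    u = mpc_control(x)
    x = x + b * u
    print(f"k = {k}   u = {u:+.3f}   x = {x:+.4f}")
```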

To make the connection of MPC with rollout, we note that the one-step lookahead function J̃ implicitly used by MPC [cf. Eq. (1.74)] is the cost-to-go function of a certain base heuristic. This is the heuristic that drives to 0 the state after ℓ − 1 stages (not ℓ stages) and keeps the state at 0 thereafter, while observing the state and control constraints, and minimizing the associated (ℓ − 1)-stages cost. This rollout view of MPC, first discussed in the author’s paper [Ber05a], is useful for making a connection with the approximate DP/RL techniques that are the main subject of this book, such as truncated rollout, cost function approximations, PI, etc.

Returning to the issue of dealing with changing problem parameters, it is natural to consider on-line replanning as per our earlier discussion. In particular, once new estimates of system and/or cost function parameters become available, MPC can adapt accordingly by introducing the new parameter estimates into the ℓ-stage optimization problem in (a) above.

Finally, let us note a common variant of MPC, where the requirement of driving the system state to 0 in ℓ steps in the ℓ-stage MPC problem (1.74) is replaced by a terminal cost Gk+ℓ(xk+ℓ). Thus at state xk, we solve the problem

\min_{u_i, i=k,...,k+ℓ−1} \Bigl[ G_{k+ℓ}(x_{k+ℓ}) + \sum_{i=k}^{k+ℓ−1} g(x_i, u_i) \Bigr],

instead of problem (1.74) where we require that xk+ℓ = 0. This variant can also be viewed as approximation in value space/truncated rollout with either one-step or multistep lookahead minimization.

The on-line replanning scheme suffers from a potential computational bottleneck, associated with fast estimation of new system parameter values and the associated policy recalculation. A variation of on-line replanning aims to deal with this difficulty by precomputation. In particular, assume that the problem parameters may take values in a known finite set.† Then we may precompute a separate controller for each of these values. Once the control scheme detects a change in problem parameters,

† For example each set of parameter values may correspond to a distinct maneuver of a vehicle, motion of a robotic arm, flying regime of an aircraft, etc.


it switches to the corresponding predesigned controller. This is sometimes called a multiple model control design or gain scheduling, and has been applied with success in various settings over the years.
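A minimal Python sketch of this idea: one controller is precomputed per anticipated parameter regime, and the detected regime simply selects which one is applied. The regime names and gains below are placeholders, not part of the text.

```python
# Precomputed controllers, one per anticipated set of parameter values (placeholders).
precomputed_policies = {
    "regime_A": lambda x: -0.40 * x,
    "regime_B": lambda x: -0.25 * x,
}

def gain_scheduled_control(x, detected_regime):
    # On detecting a parameter change, switch to the corresponding predesigned controller.
    return precomputed_policies[detected_regime](x)
```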

1.5 REINFORCEMENT LEARNING AND OPTIMAL CONTROL - SOME TERMINOLOGY

The current state of RL has greatly benefited from the cross-fertilization of ideas from optimal control and from artificial intelligence. The strong connections between these two fields are now widely recognized. Still, however, substantial differences in language and emphasis remain between RL-based discussions (where artificial intelligence-related terminology is used) and DP-based discussions (where optimal control-related terminology is used).

The terminology used in this book is standard in DP and optimal control, and in an effort to forestall confusion of readers that are accustomed to either the artificial intelligence or the optimal control terminology, we provide a list of terms commonly used in RL, and their optimal control counterparts.

(a) Environment = System.

(b) Agent = Decision maker or controller.

(c) Action = Decision or control.

(d) Reward of a stage = (Opposite of) Cost of a stage.

(e) State value = (Opposite of) Cost starting from a state.

(f) Value (or reward) function = (Opposite of) Cost function.

(g) Maximizing the value function = Minimizing the cost function.

(h) Action (or state-action) value = Q-factor (or Q-value) of a state-control pair. (Q-value is also used often in RL.)

(i) Planning = Solving a DP problem with a known mathematical model.

(j) Learning = Solving a DP problem without using an explicit mathematical model. (This is the principal meaning of the term “learning” in RL. Other meanings are also common.)

(k) Self-learning (or self-play in the context of games) = Solving a DP problem using some form of policy iteration.

(l) Deep reinforcement learning = Approximate DP using value and/or policy approximation with deep neural networks.

(m) Prediction = Policy evaluation.

(n) Generalized policy iteration = Optimistic policy iteration.


(o) State abstraction = State aggregation.

(p) Temporal abstraction = Time aggregation.

(q) Learning a model = System identification.

(r) Episodic task or episode = Finite-step system trajectory.

(s) Continuing task = Infinite-step system trajectory.

(t) Experience replay = Reuse of samples in a simulation process.

(u) Bellman operator = DP mapping or operator.

(v) Backup = Applying the DP operator at some state.

(w) Sweep = Applying the DP operator at all states.

(x) Greedy policy with respect to a cost function J = Minimizing policy in the DP expression defined by J.

(y) Afterstate = Post-decision state.

(z) Ground truth = Empirical evidence or information provided by direct observation.

Some of the preceding terms will be introduced in future chapters; see also the RL textbook [Ber19a]. The reader may then wish to return to this section as an aid in connecting with the relevant RL literature.

Notation

Unfortunately the confusion arising from different terminology has been exacerbated by the use of different notations. The present book roughly follows the “standard” notation of the Bellman/Pontryagin optimal control era; see e.g., the classical books by Athans and Falb [AtF66], Bellman [Bel67], and Bryson and Ho [BrH75]. This notation is consistent with the author’s other DP books, and is the most appropriate for a unified treatment of the subject, which simultaneously addresses discrete and continuous spaces problems.

A summary of our most prominently used symbols is as follows:

(a) x: state.

(b) u: control.

(c) J : cost function.

(d) g: cost per stage.

(e) f : system function.

(f) i: discrete state.

(g) pij(u): transition probability from state i to state j under control u.

(h) α: discount factor in discounted problems.


The x-u-J notation is standard in optimal control textbooks (e.g., the books by Athans and Falb [AtF66], and Bryson and Ho [BrH75], as well as the more recent book by Liberzon [Lib11]). The notations f and g are also used most commonly in the literature of the early optimal control period as well as later (unfortunately the more natural symbol “c” has not been used much in place of “g” for the cost per stage). The discrete system notations i and pij(u) are very common in the discrete-state Markov decision problem and operations research literature, where discrete-state problems have been treated extensively [sometimes the alternative notation p(j | i, u) is used for the transition probabilities].

The artificial intelligence literature addresses for the most part finite-state Markov decision problems, most frequently the discounted and stochastic shortest path infinite horizon problems that are discussed in Chapter 5. The most commonly used notation is s for state, a for action, r(s, a, s′) for reward per stage, p(s′ | s, a) or p(s, a, s′) for transition probability from s to s′ under action a, and γ for discount factor. However, this type of notation is not well suited for continuous spaces models, which are of major interest for this book. The reason is that it requires the use of transition probability distributions defined over continuous spaces, and it leads to more complex and less intuitive mathematics. Moreover the transition probability notation makes no sense for deterministic problems, which involve no probabilistic structure at all.

1.6 NOTES AND SOURCES

Sections 1.1-1.3: Our discussion of exact DP in this chapter has been brief since our focus in this book will be on approximate DP and RL. The author’s DP textbook [Ber17] provides an extensive discussion of finite horizon exact DP, and its applications to discrete and continuous spaces problems, using a notation and style that is consistent with the one used here. The books by Puterman [Put94] and by the author [Ber12] provide detailed treatments of infinite horizon finite-state stochastic DP problems. The book [Ber12] also covers continuous/infinite state and control spaces problems, including the linear quadratic problems that we have discussed in this chapter through examples. Continuous spaces problems present special analytical and computational challenges, which are at the forefront of research of the RL methodology.

Some of the more complex mathematical aspects of exact DP are discussed in the monograph by Bertsekas and Shreve [BeS78], particularly the probabilistic/measure-theoretic issues associated with stochastic optimal control, including partial state information problems. The book [Ber12] provides in an appendix an accessible summary introduction of the measure-theoretic framework of the book [BeS78], while a long paper by Yu and Bertsekas [YuB15] contains recent supplementary research with a


particular focus on the analysis of policy iteration methods, which were not treated in [BeS78].

The author’s abstract DP monograph [Ber18a] aims at a unified development of the core theory and algorithms of total cost sequential decision problems, and addresses simultaneously stochastic, minimax, game, risk-sensitive, and other DP problems, through the use of the abstract DP operator (or Bellman operator as it is often called in RL). The idea here is to gain insight through abstraction. In particular, the structure of a DP model is encoded in its abstract Bellman operator, which serves as the “mathematical signature” of the model. Thus, characteristics of this operator (such as monotonicity and contraction) largely determine the analytical results and computational algorithms that can be applied to that model. This abstract viewpoint has been adopted in Sections 5.3-5.6 of the present book.

The approximate DP and RL literature has expanded tremendously since the connections between DP and RL became apparent in the late 80s and early 90s. We restrict ourselves to mentioning textbooks and research monographs, which supplement our discussion, express related viewpoints, and collectively provide a guide to the literature. Two books were written in the 1990s, setting the tone for subsequent developments in the field. One in 1996 by Bertsekas and Tsitsiklis [BeT96], which reflects a decision, control, and optimization viewpoint, and another in 1998 by Sutton and Barto, which reflects an artificial intelligence viewpoint (a 2nd edition, [SuB18], was published in 2018). We refer to the former book and also to the author’s DP textbooks [Ber12], [Ber17] for a broader discussion of some of the topics of this book, including algorithmic convergence issues and additional DP models, such as those based on average cost and semi-Markov problem optimization. Note that both of these books deal with finite-state Markovian decision models and use a transition probability notation, as they do not address continuous spaces problems, which are of major interest in this book.

More recent books are by Gosavi [Gos15] (a much expanded 2nd edition of his 2003 monograph), which emphasizes simulation-based optimization and RL algorithms, Cao [Cao07], which focuses on a sensitivity approach to simulation-based methods, Chang, Fu, Hu, and Marcus [CFH13] (a 2nd edition of their 2007 monograph), which emphasizes finite-horizon/multistep lookahead schemes and adaptive sampling, Busoniu, Babuska, De Schutter, and Ernst [BBD10a], which focuses on function approximation methods for continuous space systems and includes a discussion of random search methods, Szepesvari [Sze10], which is a short monograph that selectively treats some of the major RL algorithms such as temporal differences, armed bandit methods, and Q-learning, Powell [Pow11], which emphasizes resource allocation and operations research applications, Powell and Ryzhov [PoR12], which focuses on specialized topics in learning and Bayesian optimization, Vrabie, Vamvoudakis, and Lewis


[VVL13], which discusses neural network-based methods and on-line adaptive control, Kochenderfer et al. [KAC15], which selectively discusses applications and approximations in DP and the treatment of uncertainty, Jiang and Jiang [JiJ17], which addresses adaptive control and robustness issues within an approximate DP framework, Liu, Wei, Wang, Yang, and Li [LWW17], which deals with forms of adaptive dynamic programming, and topics in both RL and optimal control, and Zoppoli, Sanguineti, Gnecco, and Parisini [ZSG20], which addresses neural network approximations in optimal control as well as multiagent/team problems with nonclassical information patterns.

There are also several books that, while not exclusively focused on DP and/or RL, touch upon several of the topics of this book. The book by Borkar [Bor08] is an advanced monograph that addresses rigorously many of the convergence issues of iterative stochastic algorithms in approximate DP, mainly using the so called ODE approach. The book by Meyn [Mey07] is broader in its coverage, but discusses some of the popular approximate DP/RL algorithms. The book by Haykin [Hay08] discusses approximate DP in the broader context of neural network-related subjects. The book by Krishnamurthy [Kri16] focuses on partial state information problems, with discussion of both exact DP, and approximate DP/RL methods. The textbooks by Kouvaritakis and Cannon [KoC16], Borrelli, Bemporad, and Morari [BBM17], and Rawlings, Mayne, and Diehl [RMD17] collectively provide a comprehensive view of the MPC methodology. The book by Brandimarte [Bra21] is a tutorial introduction to DP/RL that emphasizes operations research applications and includes MATLAB codes.

The present book is similar in style, terminology, and notation to the author’s recent RL textbook [Ber19a], which provides a more comprehensive account of the subject. In particular, the 2019 RL textbook includes a broader coverage of approximation in value space methods, including certainty equivalent control and aggregation methods. It also covers substantially policy gradient methods for approximation in policy space, which we will not address here.

In addition to textbooks, there are many surveys and short research monographs relating to our subject, which are rapidly multiplying in number. We refer to the author’s RL textbook [Ber19a] for a fairly comprehensive list up to 2019.

Approximation in value space, rollout, and policy iteration are the principal subjects of this book.† These are very powerful and general techniques: they can be applied to deterministic and stochastic problems, finite and infinite horizon problems, discrete and continuous spaces problems, and mixtures thereof. Rollout is reliable, easy to implement, and can be used in conjunction with on-line replanning.

† The name “rollout” (also called “policy rollout”) was introduced by Tesauro and Galperin [TeG96] in the context of rolling the dice in the game of backgammon. In Tesauro’s proposal, a given backgammon position is evaluated by “rolling out” many games starting from that position to the end of the game. To quote from the paper [TeG96]: “In backgammon parlance, the expected value of a position is known as the “equity” of the position, and estimating the equity by Monte-Carlo sampling is known as performing a “rollout.” This involves playing the position out to completion many times with different random dice sequences, using a fixed policy P to make move decisions for both sides.”

As we have noted, rollout with a given base policy is simply the first iteration of the policy iteration algorithm starting from the base policy. Truncated rollout will be interpreted in Chapter 5 as an “optimistic” form of policy iteration, whereby a policy is evaluated inexactly, by using a limited number of value iterations.‡

‡ Truncated rollout was also proposed in [TeG96]. To quote from this paper: “Using large multi-layer networks to do full rollouts is not feasible for real-time move decisions, since the large networks are at least a factor of 100 slower than the linear evaluators described previously. We have therefore investigated an alternative Monte-Carlo algorithm, using so-called “truncated rollouts.” In this technique trials are not played out to completion, but instead only a few steps in the simulation are taken, and the neural net’s equity estimate of the final position reached is used instead of the actual outcome. The truncated rollout algorithm requires much less CPU time, due to two factors: First, there are potentially many fewer steps per trial. Second, there is much less variance per trial, since only a few random steps are taken and a real-valued estimate is recorded, rather than many random steps and an integer final outcome. These two factors combine to give at least an order of magnitude speed-up compared to full rollouts, while still giving a large error reduction relative to the base player.” Analysis and computational experience with truncated rollout since 1996 are consistent with the preceding assessment.

Policy iteration, which will be viewed here as the repeated use of rollout, is more ambitious and challenging than rollout. It requires off-line training, possibly in conjunction with the use of neural networks. Together with its neural network and distributed implementations, it will be discussed in Chapters 4 and 5. References for rollout and policy iteration will be given at the end of Chapters 2-5.

There is a vast literature on linear quadratic problems. The connection of policy iteration with Newton’s method within this context and its quadratic convergence rate was first derived by Kleinman [Kle68] for continuous-time problems (the corresponding discrete-time result was given by Hewer [Hew71]). For followup work, which relates to policy iteration with approximations, see Feitzinger, Hylla, and Sachs [FHS09], and Hylla [Hyl11]. The idea of using policy iteration for adaptive control (Section 1.3.7) was applied to linear quadratic problems by Vrabie, Pastravanu, Abu-Khalaf, and Lewis [VPA09], and was subsequently discussed by several other authors. It can be used in conjunction with simulation-based model-free implementations of policy iteration for linear quadratic problems introduced by Bradtke, Ydstie, and Barto [BYB94].

Section 1.4: Applications of rollout in multiagent problems and MPC will be discussed in Chapter 3. Rollout will also be applied to deterministic discrete spaces problems, including a vast array of combinatorial optimization problems, in Chapter 3. Rollout, multiagent problems, MPC, and discrete optimization problems, together with related distributed algorithms, are among the major research subjects discussed in this book.

Multiagent problem research has a long history (Marschak [Mar55], Radner [Rad62], Witsenhausen [Wit68], [Wit71a], [Wit71b]), and was researched extensively in the 70s; see the review paper by Ho [Ho80] and the references cited there. The names used for the field at that time were team theory and decentralized control. For a sampling of subsequent works in team theory and multiagent optimization, we refer to the papers by Krainak, Speyer, and Marcus [KLM82a], [KLM82b], and de Waal and van Schuppen [WaS00]. For more recent works, see Nayyar, Mahajan, and Teneketzis [NMT13], Nayyar and Teneketzis [NaT19], Li et al. [LTZ19], Qu and Li [QuL19], Gupta [Gup20], the book by Zoppoli, Sanguineti, Gnecco, and Parisini [ZSG20], and the references quoted there. In addition to the aforementioned works, surveys of multiagent sequential decision making from an RL perspective were given by Busoniu, Babuska, and De Schutter [BBD08], [BBD10b].

We note that the term "multiagent" has been used with several different meanings in the literature. For example, some authors place emphasis on the case where the agents do not have common information when selecting their decisions. This gives rise to sequential decision problems with "nonclassical information patterns," which can be very complex, partly because they cannot be addressed by exact DP. Other authors adopt as their starting point a problem where the agents are "weakly" coupled through the system equation, the cost function, or the constraints, and consider methods whereby the weak coupling is exploited to address the problem through (suboptimal) decoupled computations.

Agent-by-agent minimization in multiagent approximation in value space and rollout was proposed in the author's paper [Ber19c], which also discusses extensions to infinite horizon policy iteration algorithms, and explores connections with the concept of person-by-person optimality from team theory; see also the survey [Ber20a], and the papers [Ber19d], [Ber20b]. A computational study where several of the multiagent algorithmic ideas were tested and validated is the paper by Bhattacharya et al. [BKB20]. This paper considers a large-scale multi-robot routing and repair problem, involving partial state information, and explores some of the attendant implementation issues, including autonomous multiagent rollout, through the use of policy neural networks and other precomputed signaling policies.


The subject of adaptive control (cf. Section 1.4.6) has a long history and its literature is very extensive; see the books by Astrom and Wittenmark [AsW94], Astrom and Hagglund [AsH95], [AsH06], Goodwin and Sin [GoS84], Ioannou and Sun [IoS96], Jiang and Jiang [JiJ17], Krstic, Kanellakopoulos, and Kokotovic [KKK95], Kumar and Varaiya [KuV86], Liu, et al. [LWW17], Lavretsky and Wise [LaW13], Narendra and Annaswamy [NaA12], Sastry and Bodson [SaB11], Slotine and Li [SlL91], and Vrabie, Vamvoudakis, and Lewis [VVL13]. These books describe a vast array of methods spanning 60 years, and ranging from adaptive and PID model-free approaches, to simultaneous or sequential control and identification, to ARMAX/time series models, to extremum-seeking methods, to simulation-based RL techniques, etc.†

† The ideas of PID control originated even earlier. According to Wikipedia, "a formal control law for what we now call PID or three-term control was first developed using theoretical analysis, by Russian American engineer Nicolas Minorsky" [Min22].

The research on problems involving unknown models and using data for model identification prior to or simultaneously with control was rekindled with the advent of the artificial intelligence side of RL and its focus on the active exploration of the environment. Here there is emphasis on "learning from interaction with the environment" [SuB18] through the use of (possibly hidden) Markov decision models, machine learning, and neural networks, in a wide array of methods that are under active development at present. In this book we will not deal with control of unknown models, except tangentially, and in the context of on-line replanning for problems with changing model parameters, as in the discussion of Section 1.4.6.

The literature on MPC is voluminous. Some early widely cited papers are Clarke, Mohtadi, and Tuffs [CMT87a], [CMT87b], and Keerthi and Gilbert [KeG88]. For surveys, which give many of the early references, see Morari and Lee [MoL99], Mayne et al. [MRR00], and Findeisen et al. [FIA03], and for a more recent review, see Mayne [May14]. The connections between MPC and rollout were discussed in the author's survey [Ber05a]. Textbooks on MPC include Maciejowski [Mac02], Goodwin, Seron, and De Dona [GSD06], Camacho and Bordons [CaB07], Kouvaritakis and Cannon [KoC16], Borrelli, Bemporad, and Morari [BBM17], and Rawlings, Mayne, and Diehl [RMD17].


E X E R C I S E S

1.1 (Computational Exercise - Multi-Vehicle Routing)

In the multi-vehicle routing problem of Example 1.2.3 view the two vehicles as agents.

(a) Apply the multiagent rollout algorithm of Section 1.4.5, starting from the pair location (1,2). Compare the multiagent rollout performance with the performance of the ordinary rollout algorithm, where both vehicles move at once, and with the base heuristic that moves each vehicle one step along the shortest path to the nearest pending task.

(b) Repeat for the case where, in addition to vehicles 1 and 2 that start at nodes 1 and 2, respectively, there is a third vehicle that starts at node 10, and there is a third task to be performed at node 8.

1.2 (Computational Exercise - Spiders and Flies)

Consider the spiders and flies problem of Example 1.2.3 with two differences: the five flies are stationary (rather than moving randomly), and there are only two spiders that start at the fourth square from the right at the top row of the grid of Fig. 1.4.7. The base heuristic is to move each spider one square towards its nearest fly, with distance measured by the Manhattan metric, and with preference given to a horizontal direction over a vertical direction in case of a tie. Apply the multiagent rollout algorithm of Section 1.4.5, and compare its performance with the one of the ordinary rollout algorithm, and with the one of the base heuristic.

1.3 (Computational Exercise - Linear Quadratic Problem)

In a more realistic version of the cruise control system of Example 1.3.1, the system has the form
$$x_{k+1} = a x_k + b u_k + w_k,$$
where the coefficient $a$ satisfies $0 < a \le 1$, and the disturbance $w_k$ has zero mean and variance $\sigma^2$. The cost function has the form
$$(x_N - \bar x_N)^2 + \sum_{k=0}^{N-1} \big( (x_k - \bar x_k)^2 + r u_k^2 \big),$$
where $\bar x_0, \ldots, \bar x_N$ are given nonpositive target values (a velocity profile) that serve to adjust the vehicle's velocity, in order to maintain a safe distance from the vehicle ahead, etc. In a practical setting, the velocity profile is recalculated by using on-line radar measurements.


(a) Use exact DP to derive the optimal policy for given values of $a$ and $b$, and a given velocity profile. Hint: Hypothesize that $J^*_k(x_k)$, the optimal cost-to-go starting from state $x_k$ at time $k$, has the form
$$J^*_k(x_k) = P_k x_k^2 - 2 p_k x_k + \text{constant},$$
and assuming your hypothesis is correct, verify that the optimal policy is linear in the state plus a constant, and has the form
$$\mu^*_k(x_k) = L_k x_k + \ell_k,$$
where
$$L_k = -\frac{a b P_{k+1}}{r + b^2 P_{k+1}}, \qquad \ell_k = \frac{b\, p_{k+1}}{r + b^2 P_{k+1}}.$$
Use the DP algorithm to verify that your hypothesis is correct, and to derive recursive formulas that express the scalar coefficients $P_k$ and $p_k$ in terms of $P_{k+1}$, $p_{k+1}$, $L_k$, and $\ell_k$, starting with
$$P_N = 1, \qquad p_N = \bar x_N.$$

(b) Design an experiment to compare the performance of a fixed linear policy $\mu$, derived for a fixed nominal velocity profile as in part (a), with the performance of the algorithm that uses on-line replanning, whereby the optimal policy $\mu^*$ is recalculated each time the velocity profile changes. Compare also with the performance of the rollout policy $\tilde\mu$ that uses $\mu$ as the base policy together with on-line replanning.

1.4 (Computational Exercise - Parking Problem)

In reference to Example 1.4.3, a driver aims to park at an inexpensive space on the way to his destination. There are $L$ parking spaces available and a garage at the end. The driver can move in either direction. For example, if he is in space $i$ he can either move to $i-1$ with a cost $t^-_i$, or to $i+1$ with a cost $t^+_i$, or he can park at a cost $c(i)$ (if the parking space $i$ is free). The only exception is when he arrives at the garage (indicated by index $N$), where he has to park at a cost $C$. Moreover, after the driver visits a parking space he remembers its free/taken status, and he has the option to return to any parking space he has already visited. However, the driver must park within a given number of stages $N$, so that the problem has a finite horizon. The initial probability of space $i$ being free is given, and the driver can observe the free/taken status of a parking space only after he/she visits the space. Moreover, the free/taken status of a parking space visited so far does not change over time.

Write a program to calculate the optimal solution using exact dynamic programming over a state space that is as small as possible. Try to experiment with different problem data, and try to visualize the optimal cost/policy with suitable graphical plots. Comment on the run-time as you increase the number of parking spots $L$.


1.5 (PI and Newton’s Method for Linear Quadratic Problems)

The purpose of this exercise is to demonstrate the fast convergence of the PI algorithm. Consider the one-dimensional linear quadratic Examples 1.3.5 and 1.4.6 for $\alpha = 1$.

(a) Verify that the Bellman equation,
$$K x^2 = \min_u \big[ x^2 + r u^2 + K (x + b u)^2 \big],$$
can be written as the equation $F(K) = 0$, where
$$F(K) = K - \frac{r K}{r + b^2 K} - 1.$$

(b) Consider the PI algorithm
$$K_k = \frac{1 + r L_k^2}{1 - (1 + b L_k)^2}, \qquad L_{k+1} = -\frac{b K_k}{r + b^2 K_k},$$
where $\mu(x) = L_0 x$ is the starting policy. Verify that this iteration is equivalent to Newton's method
$$K_{k+1} = K_k - \left( \frac{\partial F(K_k)}{\partial K} \right)^{-1} F(K_k),$$
for solving the Bellman equation $F(K) = 0$. This relation can be shown in greater generality, for multidimensional linear quadratic problems.

(c) Rollout cost improvement over base policy in adaptive control: Consider a range of values of $b$, $r$, and $L_0$, and study computationally the effect of the second derivative of $F$ on the ratio $K_1/K_0$ of rollout to base policy costs, and on the ratio $K_1/K^*$ of rollout to optimal policy costs.

Solution of part (b): We have
$$L = -\frac{bK}{r + b^2 K} \iff K = -\frac{rL}{b(1 + bL)},$$
so that
$$F(K) = K - \frac{rK}{r + b^2 K} - 1 = \frac{b^2 K^2}{r + b^2 K} - 1 = -bLK - 1 = \frac{rL^2 - (bL + 1)}{bL + 1}.$$


We next calculate the inverse partial derivative $\partial F/\partial K$ as
$$\begin{aligned}
\left( \frac{\partial F}{\partial K} \right)^{-1} &= \frac{(r + b^2 K)^2}{b^2 K\,(2r + b^2 K)} = \frac{1}{L^2} \cdot \frac{K}{2r + b^2 K} \\
&= \frac{1}{L^2} \cdot \frac{1 + bL}{2 + bL} \cdot \frac{K}{r} = -\frac{1}{L^2} \cdot \frac{1 + bL}{2 + bL} \cdot \frac{L}{b(1 + bL)} \\
&= -\frac{1}{bL(2 + bL)}.
\end{aligned}$$
Therefore, the Newton method iterate is given by
$$\begin{aligned}
K - \left( \frac{\partial F}{\partial K} \right)^{-1} F(K)
&= -\frac{rL}{b(1 + bL)} + \frac{1}{bL(2 + bL)} \cdot \frac{rL^2 - (bL + 1)}{1 + bL} \\
&= \frac{-rL^2(2 + bL) + rL^2 - (1 + bL)}{bL(2 + bL)(1 + bL)} \\
&= \frac{-rL^2(1 + bL) - (1 + bL)}{bL(2 + bL)(1 + bL)} \\
&= \frac{-rL^2 - 1}{bL(2 + bL)} \\
&= \frac{1 + rL^2}{1 - (1 + bL)^2},
\end{aligned}$$
and is identical to the PI iterate.
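As a quick numerical check of this equivalence, the following Python sketch (with arbitrary illustrative values of b, r, and L0; the function names are ours, not from the text) runs the PI recursion of part (b) and the Newton iteration for F(K) = 0 side by side, starting from the same K0; the two sequences of iterates should agree to within roundoff.

```python
# Numerical check that the PI iterates coincide with the Newton iterates for F(K) = 0,
# for the one-dimensional linear quadratic problem of this exercise (illustrative data).

def policy_evaluation(L, b, r):
    # Cost coefficient K of the linear policy u = Lx (requires |1 + bL| < 1)
    return (1 + r * L**2) / (1 - (1 + b * L)**2)

def policy_improvement(K, b, r):
    # Gain of the one-step lookahead (rollout) policy based on terminal cost K x^2
    return -b * K / (r + b**2 * K)

def F(K, b, r):
    # Bellman equation residual F(K) = K - rK/(r + b^2 K) - 1
    return K - r * K / (r + b**2 * K) - 1

def F_prime(K, b, r):
    # Derivative of F with respect to K
    return 1 - r**2 / (r + b**2 * K)**2

b, r, L = 2.0, 0.5, -0.4                     # illustrative values of b, r, and L0
K = policy_evaluation(L, b, r)               # K0: cost coefficient of the starting policy
for i in range(5):
    L = policy_improvement(K, b, r)          # policy improvement: L_{k+1} from K_k
    K_pi = policy_evaluation(L, b, r)        # policy evaluation: K_{k+1} (PI iterate)
    K_newton = K - F(K, b, r) / F_prime(K, b, r)   # Newton iterate from the same K_k
    print(i, K_pi, K_newton, abs(K_pi - K_newton))
    K = K_pi
```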

1.6 (Post-Decision States)

The purpose of this exercise is to demonstrate a type of DP simplification that arises often (see [Ber12], Section 6.1.5 for further discussion). Consider the finite horizon stochastic DP problem and assume that the system equation has a special structure whereby from state $x_k$ after applying $u_k$ we move to an intermediate "post-decision state"
$$y_k = p_k(x_k, u_k)$$
at cost $g_k(x_k, u_k)$. Then from $y_k$ we move at no cost to the new state $x_{k+1}$ according to
$$x_{k+1} = h_k(y_k, w_k) = h_k\big(p_k(x_k, u_k), w_k\big), \tag{1.75}$$
where the distribution of the disturbance $w_k$ depends only on $y_k$, and not on prior disturbances, states, and controls. Denote by $J_k(x_k)$ the optimal cost-to-go starting at time $k$ from state $x_k$, and by $V_k(y_k)$ the optimal cost-to-go starting at time $k$ from post-decision state $y_k$.


(a) Use Eq. (1.75) to verify that a DP algorithm that generates only $J_k$ is given by
$$J_k(x_k) = \min_{u_k \in U_k(x_k)} \Big[ g_k(x_k, u_k) + E_{w_k}\big\{ J_{k+1}\big( h_k(p_k(x_k, u_k), w_k) \big) \big\} \Big].$$

(b) Show that a DP algorithm that generates both $J_k$ and $V_k$ is given by
$$J_k(x_k) = \min_{u_k \in U_k(x_k)} \Big[ g_k(x_k, u_k) + V_k\big( p_k(x_k, u_k) \big) \Big],$$
$$V_k(y_k) = E_{w_k}\big\{ J_{k+1}\big( h_k(y_k, w_k) \big) \big\}.$$

(c) Show that a DP algorithm that generates only $V_k$ for all $k$ is given by
$$V_k(y_k) = E_{w_k}\Big\{ \min_{u_{k+1} \in U_{k+1}(h_k(y_k, w_k))} \big[ g_{k+1}\big( h_k(y_k, w_k), u_{k+1} \big) + V_{k+1}\big( p_{k+1}(h_k(y_k, w_k), u_{k+1}) \big) \big] \Big\}.$$
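To make the simplification of part (b) concrete, here is a minimal tabular sketch that computes $J_k$ and $V_k$ backward in time for a small made-up problem with finite state, control, and disturbance spaces; all of the problem data below (spaces, $p_k$, $g_k$, $h_k$, horizon) is hypothetical and chosen only for illustration.

```python
# Minimal tabular sketch of the post-decision state recursions of part (b),
# for a small hypothetical problem (all problem data below is made up).

N = 3                                   # horizon
X = Y = [0, 1, 2]                       # state and post-decision state spaces
U = [0, 1]                              # controls
W = [(-1, 0.3), (0, 0.4), (1, 0.3)]     # disturbance values and their probabilities

def p(x, u):                            # post-decision state y = p_k(x, u)
    return min(max(x + u, 0), 2)

def g(x, u):                            # stage cost g_k(x, u)
    return (x - 1)**2 + 0.5 * u

def h(y, w):                            # next state x_{k+1} = h_k(y, w)
    return min(max(y + w, 0), 2)

J = {x: 0.0 for x in X}                 # terminal condition J_N = 0
for k in reversed(range(N)):
    # V_k(y) = E_w{ J_{k+1}(h_k(y, w)) }
    V = {y: sum(prob * J[h(y, w)] for w, prob in W) for y in Y}
    # J_k(x) = min_u [ g_k(x, u) + V_k(p_k(x, u)) ]
    J = {x: min(g(x, u) + V[p(x, u)] for u in U) for x in X}

print(J)                                # optimal costs-to-go J_0(x)
```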


2

General Principles of Approximation in Value Space

Contents

2.1. Approximation in Value and Policy Space . . . . . . . . . p. 102
     2.1.1. Approximation in Value Space - One-Step and Multistep Lookahead . . . p. 103
     2.1.2. Approximation in Policy Space . . . . . . . . . . . p. 106
     2.1.3. Combined Approximation in Value and Policy Space . . . p. 108
2.2. Approaches for Value Space Approximation . . . . . . . . p. 112
     2.2.1. Model-Based and Model-Free Implementations . . . . p. 113
     2.2.2. Off-Line and On-Line Implementations . . . . . . . p. 114
     2.2.3. Methods for Cost-to-Go Approximation . . . . . . . p. 116
     2.2.4. Methods for Expediting the Lookahead Minimization . . . p. 117
     2.2.5. Simplification of the Lookahead Minimization by Q-Factor Approximation . . . p. 120


As we noted in Chapter 1, the exact solution of optimal control problems by DP is often impossible in practice. To a great extent, the reason lies in what Bellman has called the "curse of dimensionality." This refers to a rapid increase of the required computation and memory storage as the size of the problem increases. Moreover, there are many circumstances where the structure of the given problem is known well in advance, but some of the problem data, such as various system parameters, may be unknown until shortly before control is needed, thus seriously constraining the amount of time available for the DP computation. The same is true when the system parameters change while we are in the process of applying control, in which case on-line replanning is needed. These difficulties motivate suboptimal control schemes that strike a reasonable balance between convenient implementation and adequate performance.

In this chapter, we first elaborate, in Sections 2.1 and 2.2, on the principal approximation approaches that we discussed in Chapter 1. We then focus on the key ideas of approximation in value space, one-step and multistep lookahead, and policy improvement by rollout, as well as various possibilities for their implementation, including off-line training. We focus primarily on finite horizon methods, but much of our methodology will be adapted to infinite horizon DP later, together with some additional methods that are specific to the infinite horizon context.

2.1 APPROXIMATION IN VALUE AND POLICY SPACE

There are two general types of approximation in DP-based suboptimal control. The first is approximation in value space, where we aim to approximate the optimal cost function or the cost function of a given policy, often using some process based on data collection. The second is approximation in policy space, where we select a policy from a suitable class of policies based on some criterion; the selection process often uses data, optimization, and neural network approximations. In some settings the value space and policy space approximation approaches may be combined.

In this section we provide a broad overview of the main ideas for these two types of approximation. In this connection, it is important to keep in mind the two types of algorithms underlying AlphaZero and related game programs that we discussed in Section 1.1. The first type is off-line training, which computes cost functions and policies before the actual control process begins; for example by using data and training algorithms for neural networks. The second type is on-line play, which selects and applies controls during the actual control process of the system. In this connection, we note that approximation in policy space is strictly an off-line training-type of algorithm. On the other hand, approximation in value space is primarily an on-line play-type of algorithm, which, however, may use cost functions and policies obtained by extensive off-line training.


Figure 2.1.1 Schematic illustration of one-step lookahead and the three principal types of approximations. At each state $x_k$, it uses the control obtained from the minimization
$$\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big( f_k(x_k, u_k, w_k) \big) \big\},$$
or equivalently in terms of Q-factors,
$$\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k).$$
This defines the suboptimal policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$, referred to as the one-step lookahead policy based on $J_{k+1}$, $k = 0, \ldots, N-1$. There are three potential areas of approximation here, which can be considered independently of each other: cost-to-go approximation, expected value approximation, and minimization approximation.

2.1.1 Approximation in Value Space - One-Step and Multistep Lookahead

Let us consider the finite horizon stochastic DP problem of Section 1.3. In the principal form of approximation in value space discussed in this book, we replace the optimal cost-to-go function $J^*_{k+1}$ in the DP equation with an approximation $J_{k+1}$. In particular, at state $x_k$, we use the control obtained from the minimization

$$\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big( f_k(x_k, u_k, w_k) \big) \big\}. \tag{2.1}$$

This process defines a suboptimal policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$ and is called one-step lookahead; see Fig. 2.1.1, and compare with the discussion of Section 1.3.1. There are several possibilities for selecting or computing the functions $J_{k+1}$, many of which are discussed in this book. In some schemes the expected value and minimization operations may also be carried out approximately; see Fig. 2.1.1. For example, we discussed the possibility of approximate minimization for multiagent problems in Section 1.4.5.
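As a concrete illustration of the minimization (2.1), here is a minimal sketch for the case of finite control and disturbance spaces, where the expected value is computed exactly; the functions f, g, U, w_dist, and J_next are assumed to be supplied by the user, and all names are ours, chosen only for illustration.

```python
# Minimal sketch of one-step lookahead, Eq. (2.1), for finite control/disturbance spaces.
# f, g, U, w_dist, and J_next are hypothetical user-supplied objects.

def one_step_lookahead_control(x, k, f, g, U, w_dist, J_next):
    """Return the control minimizing E{ g_k(x,u,w) + J_{k+1}(f_k(x,u,w)) } over u in U_k(x).

    f(k, x, u, w)  : system function f_k
    g(k, x, u, w)  : stage cost g_k
    U(k, x)        : iterable of feasible controls U_k(x)
    w_dist(k, x, u): iterable of (w, probability) pairs
    J_next(x_next) : cost-to-go approximation J_{k+1}
    """
    best_u, best_q = None, float("inf")
    for u in U(k, x):
        # approximate Q-factor Q_k(x, u), with the expectation computed exactly here
        q = sum(prob * (g(k, x, u, w) + J_next(f(k, x, u, w)))
                for w, prob in w_dist(k, x, u))
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```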


Note that the expected value expression appearing in the right-hand side of Eq. (2.1) can be viewed as an approximate Q-factor,
$$Q_k(x_k, u_k) = E\big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big( f_k(x_k, u_k, w_k) \big) \big\},$$
and the minimization in Eq. (2.1) can be written as
$$\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k).$$
As noted in Section 1.3.2, this suggests a variant of approximation in value space, which is based on using Q-factor approximations that may be obtained directly, i.e., without the intermediate step of obtaining the cost function approximations $J_k$. We will focus primarily on cost function approximation, but we will occasionally digress to discuss direct Q-factor approximation.

Multistep Lookahead

An important extension of one-step lookahead is multistep lookahead (also referred to as $\ell$-step lookahead), whereby at state $x_k$ we minimize the cost of the first $\ell > 1$ stages with the future costs approximated by a function $J_{k+\ell}$. For example, in two-step lookahead the function $J_{k+1}$ is given by
$$J_{k+1}(x_{k+1}) = \min_{u_{k+1} \in U_{k+1}(x_{k+1})} E\big\{ g_{k+1}(x_{k+1}, u_{k+1}, w_{k+1}) + J_{k+2}\big( f_{k+1}(x_{k+1}, u_{k+1}, w_{k+1}) \big) \big\},$$
where $J_{k+2}$ is some approximation of the optimal cost-to-go function $J^*_{k+2}$.

More generally, at state $x_k$ we solve the $\ell$-stage problem
$$\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Big\{ g_k(x_k, u_k, w_k) + \sum_{i=k+1}^{k+\ell-1} g_i\big( x_i, \mu_i(x_i), w_i \big) + J_{k+\ell}(x_{k+\ell}) \Big\}$$
to obtain a corresponding optimal sequence $\{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}\}$. We then use the first control $u_k$ in this sequence, discard $\mu_{k+1}, \ldots, \mu_{k+\ell-1}$, obtain the next state
$$x_{k+1} = f_k(x_k, u_k, w_k),$$

and repeat the process with $x_k$ replaced by $x_{k+1}$; see Fig. 2.1.2.

Figure 2.1.2 Schematic illustration of $\ell$-step lookahead with approximation in value space. At each state $x_k$, we solve an $\ell$-stage DP problem to obtain a sequence $\{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}\}$, and then use the first control $u_k$ in this sequence. This involves the same three approximations as one-step lookahead: cost-to-go approximation, expected value approximation, and minimization approximation. The minimization of the expected value is more time consuming, but the cost-to-go approximation after $\ell$ steps need not be chosen as accurately/carefully as in one-step lookahead.

Actually, one may view $\ell$-step lookahead as the special case of one-step lookahead where the lookahead function is the optimal cost function of an $(\ell-1)$-stage DP problem with a terminal cost $J_{k+\ell}(x_{k+\ell})$ on the state $x_{k+\ell}$ obtained after $\ell - 1$ stages. However, it is often important to view $\ell$-step lookahead separately, in order to address special implementation issues that do not arise in the context of one-step lookahead.

The motivation for $\ell$-step lookahead is that by increasing the value of $\ell$, one may require a less accurate approximation $J_{k+\ell}$ to obtain good performance. Otherwise expressed, for the same quality of cost function approximation, better performance may be obtained as $\ell$ becomes larger. This makes intuitive sense, since with multistep lookahead the cost of more stages is treated exactly through optimization, and it is also supported by error bounds to be given in Section 5.2.6. Moreover, after many stages, due to randomness, discounting, or other factors, the cost of the remaining stages may become negligible or may not depend much on the choice of the control $u_k$ at the current stage $k$. Indeed, in practice, longer lookahead results in better performance, although one can construct artificial examples where this is not so (see [Ber19a], Section 2.2.1).

Note that in a deterministic setting, the lookahead problems are also deterministic, and may be addressed by efficient shortest path methods. This makes deterministic problems particularly good candidates for the use of multistep lookahead. Generally, the implementation of $\ell$-step lookahead can be prohibitively time-consuming for stochastic problems, because it requires at each step the solution of a stochastic DP problem with an $\ell$-step horizon. As a practical guideline, one should at least try to use the largest value of $\ell$ for which the computational overhead for solving the $\ell$-step lookahead minimization problem on-line is acceptable.
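For a deterministic problem with finite control sets, $\ell$-step lookahead can be sketched by plain enumeration of the control sequences of length $\ell$, as follows (in practice one would use a shortest path or other more structured method; all names are illustrative assumptions).

```python
# Minimal sketch of ℓ-step lookahead for a deterministic problem with finite control sets,
# by brute-force enumeration of the feasible ℓ-long control sequences (illustrative names).

def multistep_lookahead_control(x, k, ell, f, g, U, J_tail):
    """Return the first control of a sequence minimizing the ℓ-stage cost plus J_{k+ℓ}(x_{k+ℓ}).

    f(i, x, u) : deterministic system function, g(i, x, u) : stage cost,
    U(i, x)    : iterable of feasible controls, J_tail(x) : terminal approximation J_{k+ℓ}.
    """
    best_u0, best_cost = None, float("inf")

    def search(i, state, cost_so_far, u0):
        nonlocal best_u0, best_cost
        if i == k + ell:                      # end of the lookahead horizon
            total = cost_so_far + J_tail(state)
            if total < best_cost:
                best_u0, best_cost = u0, total
            return
        for u in U(i, state):
            first = u if u0 is None else u0   # remember the first control of the sequence
            search(i + 1, f(i, state, u), cost_so_far + g(i, state, u), first)

    search(k, x, 0.0, None)
    return best_u0
```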

In our discussion of approximation in value space of the present section, we will focus primarily on one-step lookahead. Usually, there are straightforward extensions of the main ideas to the multistep context.

Infinite Horizon Problems

In this chapter we will focus on finite horizon problems. However, approximation in value space, with both one-step and multistep lookahead, is conceptually very similar in infinite horizon problems. This is convenient, as it will allow us to develop the approximation methodology within the conceptually simpler finite horizon context, and postpone to Chapter 5 a more detailed discussion of infinite horizon algorithms.

Figure 2.1.3 Schematic illustration of approximation in value space with one-step and multistep lookahead for infinite horizon problems, and the associated three approximations. One-step lookahead at state $x$ is based on the minimization
$$\min_{u \in U(x)} E_w\big\{ g(x, u, w) + \alpha J\big( f(x, u, w) \big) \big\},$$
while $\ell$-step lookahead at state $x_k$ is based on the minimization
$$\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Big\{ g(x_k, u_k, w_k) + \sum_{i=k+1}^{k+\ell-1} \alpha^{i-k} g\big( x_i, \mu_i(x_i), w_i \big) + \alpha^\ell J(x_{k+\ell}) \Big\}.$$

There are three potential areas of approximation for infinite horizon problems: cost-to-go approximation, expected value approximation, and minimization approximation; cf. Fig. 2.1.3. This is similar to the finite horizon case; cf. Fig. 2.1.1.

A major advantage of the infinite horizon context is that only one approximate cost function $J$ is needed, rather than the $N$ functions $J_1, \ldots, J_N$ of the $N$-step horizon case. Moreover, for infinite horizon problems, there are additional important algorithms that are amenable to approximation in value space. Approximate policy iteration, Q-learning, temporal difference methods, and their variants are some of these (we will only discuss approximate policy iteration in this book). For this reason, in the infinite horizon case, there is a richer set of algorithmic options for approximation in value space, despite the fact that the associated mathematical theory is more complex.

2.1.2 Approximation in Policy Space

The major alternative to approximation in value space is approximation in policy space, whereby we select the policy from a suitably restricted class of policies, usually a parametric class of some form. In particular, we can introduce a parametric family of policies (or approximation architecture, as we will call it in Chapter 4),
$$\mu_k(x_k, r_k), \qquad k = 0, \ldots, N-1,$$
where $r_k$ is a parameter, and then estimate the parameters $r_k$ using some type of training process or optimization; cf. Fig. 2.1.4.

Figure 2.1.4 Schematic illustration of parametric approximation in policy space. A policy $\mu_k(x_k, r_k)$, $k = 0, 1, \ldots, N-1$, from a parametric class is computed off-line based on data, and it is used to generate the control $u_k = \mu_k(x_k, r_k)$ on-line, when at state $x_k$.

Note an important conceptual difference between approximation in value space and approximation in policy space. The former is primarily an on-line method (with off-line training used optionally to construct cost function approximations for one-step or multistep lookahead). The latter is primarily an off-line training method (which may be used optionally to provide a policy for on-line rollout).

Neural networks, described in Chapter 4, are often used to generate the parametric class of policies, in which case $r_k$ is the vector of weights/parameters of the neural network. In Chapter 4, we will also discuss methods for generating the training data required to obtain the parameters $r_k$, and we will consider several other classes of approximation architectures. An important advantage of approximation in policy space is that once the parametrized policy is obtained, the computation of controls
$$u_k = \mu_k(x_k, r_k), \qquad k = 0, \ldots, N-1,$$
during on-line operation of the system is often much easier compared with the lookahead minimization (2.1). For this reason, one of the major uses of approximation in policy space is to provide an approximate implementation of a known policy (no matter how obtained) for the purpose of convenient on-line use.


In Section 4.4, we will discuss in some detail the approximation of policies for the case where the number of controls available at $x_k$ is finite. In this case, we will see that $\mu_k(x_k, r_k)$ is computed as a randomized policy, i.e., a set of probabilities of applying each of the available controls at $x_k$ (a parametrized policy may be computed in randomized form for reasons of algorithmic/training convenience; in practice it is typically implemented by applying at state $x_k$ the control of maximum probability; for example the training algorithm of AlphaZero uses randomized policies). The methods to obtain the parameter $r_k$ are similar to classification methods used in pattern recognition. This is not surprising because when the control space is finite, it is possible to view different controls as distinct categories of objects, and to view any policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$ as a classifier that assigns a state $x_k$ at time $k$ to category $\mu_k(x_k)$.

There are also alternative optimization-based approaches, where the main idea is that once we use a vector $(r_0, r_1, \ldots, r_{N-1})$ to parametrize the policies $\pi$, the expected cost $J_\pi(x_0)$ is parametrized as well, and can be viewed as a function of $(r_0, r_1, \ldots, r_{N-1})$. We can then optimize this cost by using a gradient-like or random search method. This is a widely used approach for optimization in policy space, which, however, will not be discussed in this book (for details and many references to the literature, see the RL book [Ber19a], Section 5.7). Actually, this type of approach is used most often in an infinite horizon context, where the policies of interest are stationary, so a single function $\mu$ needs to be approximated rather than the $N$ functions $\mu_0, \ldots, \mu_{N-1}$. In this chapter, we focus on finite horizon problems, postponing the discussion of infinite horizon problems to Chapter 5. Many of the finite horizon methods of Chapters 2-4, however, also apply with small modifications to infinite horizon problems.

2.1.3 Combined Approximation in Value and Policy Space

In this section, we discuss various ways to combine approximation in value and in policy space. In particular, we first describe how approximation in policy space can be built starting from approximation in value space. We then describe a reverse process, namely how we can start from some policy, and construct an approximation in value or Q-factor space, which in turn can be used to construct a new policy through one-step or multistep lookahead. This is the rollout approach, which we will discuss at length in this chapter and the next one. Finally, we show how to combine the two types of approximation in a perpetual cycle of repeated approximations in value and policy space. This involves the use of approximation architectures, such as neural networks, which we will consider in Chapter 4.

From Values to Policies

A general scheme for parametric approximation in policy space is to obtain a large number of sample state-control pairs $(x_k^s, u_k^s)$, $s = 1, \ldots, q$, such that for each $s$, $u_k^s$ is a "good" control at state $x_k^s$. We can then choose the parameter $r_k$ by solving the least squares/regression problem
$$\min_{r_k} \sum_{s=1}^{q} \big\| u_k^s - \mu_k(x_k^s, r_k) \big\|^2 \tag{2.2}$$
(possibly modified to add regularization).† In particular, we may determine $u_k^s$ using a human or a software "expert" that can choose "near-optimal" controls at given states, so $\mu_k$ is trained to match the behavior of the expert. Methods of this type are commonly referred to as supervised learning in artificial intelligence.

A special case of the above procedure, which connects with approximation in value space, is to generate the sample state-control pairs $(x_k^s, u_k^s)$ through a one-step lookahead minimization of the form
$$u_k^s \in \arg\min_{u \in U_k(x_k^s)} E\big\{ g_k(x_k^s, u, w_k) + J_{k+1}\big( f_k(x_k^s, u, w_k) \big) \big\}, \tag{2.3}$$
where $J_{k+1}$ is a suitable (separately obtained) approximation in value space, or an approximate Q-factor based minimization
$$u_k^s \in \arg\min_{u_k \in U_k(x_k^s)} Q_k(x_k^s, u_k), \tag{2.4}$$
where $Q_k$ is a separately obtained Q-factor approximation. We may view this as approximation in policy space built on top of approximation in value space.

† Throughout this book $\|\cdot\|$ denotes the standard quadratic Euclidean norm. It is implicitly assumed here (and in similar situations later) that the controls are members of a Euclidean space (i.e., the space of finite dimensional vectors with real-valued components) so that the distance between two controls can be measured by their normed difference (randomized controls, i.e., probabilities that a particular action will be used, fall in this category). Regression problems of this type arise in the training of parametric classifiers based on data, including the use of neural networks (see Section 4.4). Assuming a finite control space, the classifier is trained using the data $(x_k^s, u_k^s)$, $s = 1, \ldots, q$, which are viewed as state-category pairs, and then a state $x_k$ is classified as being of "category" $\mu_k(x_k, r_k)$. Parametric approximation architectures, and their training through the use of classification and regression techniques, are described in Chapter 4. An important modification is to use regularized regression, where a quadratic regularization term is added to the least squares objective. This term is a positive multiple of the squared deviation $\|r - \bar r\|^2$ of $r$ from some initial guess $\bar r$.
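As an illustration of the regression (2.2), here is a minimal sketch for scalar controls with a simple linear-in-features parametrization µ_k(x, r) = r'φ(x), including the regularization term mentioned in the footnote; the feature map, the data, and all names are hypothetical, and in practice a neural network architecture would typically be used instead (cf. Chapter 4).

```python
# Minimal sketch of the least squares training problem (2.2) for a single stage k, with a
# hypothetical linear-in-features policy mu_k(x, r) = r . phi(x) and scalar controls u.
import numpy as np

def phi(x):
    # hypothetical feature vector of the (scalar) state x
    return np.array([1.0, x, x**2])

def fit_policy_parameters(samples, reg=1e-3, r_guess=None):
    """samples: list of (x, u) pairs, e.g., generated by the minimization (2.3) or by an expert.
    Returns r minimizing sum_s (u^s - r . phi(x^s))^2 + reg * ||r - r_guess||^2."""
    Phi = np.array([phi(x) for x, _ in samples])
    u = np.array([u_s for _, u_s in samples])
    r0 = np.zeros(Phi.shape[1]) if r_guess is None else r_guess
    # regularized least squares: (Phi'Phi + reg I) r = Phi'u + reg r0
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    rhs = Phi.T @ u + reg * r0
    return np.linalg.solve(A, rhs)

# usage sketch with made-up sample pairs (x^s, u^s)
data = [(0.0, 0.1), (1.0, -0.2), (2.0, -0.6), (3.0, -1.1)]
r = fit_policy_parameters(data)
mu_k = lambda x: r @ phi(x)        # trained policy approximation for stage k
```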


From Policies to Values to New Policies - Rollout

An important approach for approximation in value space is to use one-step or $\ell$-step lookahead with cost function approximation $J_{k+\ell}(x_{k+\ell})$ equal to the tail problem cost $J_{k+\ell,\pi}(x_{k+\ell})$ starting from $x_{k+\ell}$ and using some known policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$. Thus, in the case of one-step lookahead, we use the control
$$\tilde\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + J_{k+1,\pi}\big( f_k(x_k, u_k, w_k) \big) \big\}.$$
Equivalently, we can use one-step lookahead with the Q-factors $Q_{k,\pi}(x_k, u_k)$ of the policy, i.e., use the control
$$\tilde\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} Q_{k,\pi}(x_k, u_k).$$
This has the advantage of simplifying the one-step lookahead minimization. In practice, for computational expediency, an approximation to $J_{k,\pi}(x_k)$ or $Q_{k,\pi}(x_k, u_k)$ is often used instead, as we will discuss shortly.

Using the cost function or the Q-factors of a policy as a basis for approximation in value space, and obtaining a new policy by lookahead minimization, constitutes the rollout algorithm. This is one of the principal subjects of this book. When the values $J_{k+1,\pi}(x_{k+1})$ are not available analytically (which is the typical case), it is necessary to compute them as needed by some form of simulation. In particular, for a deterministic finite horizon problem, we may compute $J_{k+1,\pi}(x_{k+1})$ by accumulating the stage costs along the (unique) trajectory that starts at $x_{k+1}$ and uses $\pi$ to the end of the horizon. For a stochastic problem it is necessary to obtain $J_{k+1,\pi}(x_{k+1})$ by Monte Carlo simulation, i.e., generate a number of random trajectories starting from $x_{k+1}$ and using $\pi$ up to the end of the horizon, and then average the corresponding random trajectory costs.
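For a stochastic problem, the Monte Carlo evaluation just described can be sketched as follows; the simulator pieces (f, g, sample_w), the base policy, and the horizon N are assumed given, and the estimate returned would then play the role of J_{k+1,π} in the one-step lookahead minimization (all names are illustrative).

```python
# Minimal sketch of rollout cost estimation: evaluate the base policy pi from a given state
# by averaging the costs of simulated trajectories (illustrative names throughout).

def rollout_cost_estimate(x, k, N, f, g, base_policy, sample_w, num_trajectories=100):
    """Monte Carlo estimate of J_{k,pi}(x), the cost of the base policy from state x at stage k."""
    total = 0.0
    for _ in range(num_trajectories):
        state, cost = x, 0.0
        for i in range(k, N):
            u = base_policy(i, state)         # base policy control at stage i
            w = sample_w(i, state, u)         # simulated disturbance
            cost += g(i, state, u, w)
            state = f(i, state, u, w)
        total += cost                         # (any terminal cost g_N is omitted here)
    return total / num_trajectories
```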

Thus, starting with a policy $\pi$, which we will call the base policy, the process of one-step or multistep lookahead with cost approximations $J_{k,\pi}$ defines a new policy $\tilde\pi = \{\tilde\mu_0, \ldots, \tilde\mu_{N-1}\}$, which we will call the rollout policy; see Fig. 2.1.5. One of the fundamental facts in DP is that the rollout policy has a policy improvement property: $\tilde\pi$ has no worse cost than $\pi$, i.e.,
$$J_{k,\tilde\pi}(x_k) \le J_{k,\pi}(x_k), \tag{2.5}$$
for all $x_k$ and $k$ (see Sections 2.3 and 2.4).

Since we generally cannot expect to be able to compute the base policy cost function values $J_{k,\pi}(x_k)$ at all states $x_k$, we may use instead cost function approximations that are constructed from data. One possibility is to use Monte Carlo simulation to collect many pairs of state and base policy costs $\big( x_k^s, J_{k,\pi}(x_k^s) \big)$, from which to obtain cost function approximations $J_k$, for each of the stages $k = 1, \ldots, N$, through some form of training process.

Figure 2.1.5 Schematic illustration of rollout, which involves approximation in value space using a base policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$. We use one-step or $\ell$-step lookahead where $J_{k+\ell}(x_{k+\ell})$ is equal to the tail problem cost $J_{k+\ell,\pi}(x_{k+\ell})$ starting from $x_{k+\ell}$ and using policy $\pi$. To economize in computation cost, we may use truncated rollout, whereby we perform the simulation with $\pi$ for a limited number of stages starting from each possible next state $x_{k+\ell}$, and either neglect the costs of the remaining stages or add some heuristic cost approximation at the end to compensate for these costs. The figure illustrates the case $\ell = 1$.

The functions $J_k$ approximate the base policy cost functions $J_{k,\pi}$, thus yielding an approximate rollout policy $\tilde\pi = \{\tilde\mu_0, \ldots, \tilde\mu_{N-1}\}$ through one-step or multistep lookahead. This policy satisfies the cost improvement property (2.5) in an approximate sense (within some error bound, which is small if $J_k$ is close to $J_{k,\pi}$; see Section 5.2.6 for more precise statements).

Let us also note the computationally expedient possibility of truncated rollout, which is particularly useful for problems with a long horizon; see Fig. 2.1.5. This is to perform the simulation with $\pi$ for a limited number of stages, and either neglect the costs of the remaining stages or use some heuristic cost approximation at the end to compensate for these costs. The terminal cost approximation could itself be an approximation of the cost function of the base policy. We will discuss truncated rollout later, starting in Section 2.3.
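A corresponding sketch of the truncated rollout estimate is given below: the base policy is simulated for at most m stages, and a heuristic terminal cost approximation is added at the state where the simulation stops (again, all names are illustrative; compare with the full-horizon estimate sketched earlier).

```python
# Minimal sketch of a truncated rollout estimate: simulate the base policy for at most
# m stages, then add a heuristic terminal cost approximation J_hat at the stopping state.

def truncated_rollout_estimate(x, k, N, m, f, g, base_policy, sample_w, J_hat,
                               num_trajectories=100):
    """J_hat(i, x): heuristic cost approximation for the stages beyond the truncation point."""
    total = 0.0
    stop = min(k + m, N)                       # truncate after m stages (or at the horizon end)
    for _ in range(num_trajectories):
        state, cost = x, 0.0
        for i in range(k, stop):
            u = base_policy(i, state)
            w = sample_w(i, state, u)
            cost += g(i, state, u, w)
            state = f(i, state, u, w)
        total += cost + J_hat(stop, state)     # terminal approximation for the neglected stages
    return total / num_trajectories
```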

Perpetual Rollout and Approximate Policy Iteration

The rollout process starts with a base policy, which, through approximation in value space, can generate state-control samples of the rollout policy. Once this is done, the rollout policy may be implemented by approximation in policy space and training using state-control pairs generated by the rollout policy, as discussed earlier. Thus the rollout process can be repeated in perpetuity, so we can obtain a sequence of policies (through approximation in policy space) and a corresponding sequence of cost approximations (through approximation in value space); see Fig. 2.1.6.

Figure 2.1.6 Schematic illustration of sequential approximation in value and policy space (or perpetual rollout). It produces a sequence of policies and their cost function approximations. Each generated policy is viewed as the rollout policy with the preceding policy viewed as the base policy. The rollout policy is then approximated in policy space, and viewed as the base policy for the next iteration. The approximation in value space may involve either costs or Q-factors of the base policy.

When neural networks are used, the approximations in value and policy space are commonly referred to as the value network and the policy network, respectively. For example, the AlphaGo and AlphaZero programs ([SHM16], [SHS17], [SHS17]) use both value and policy networks for approximation in value space and policy space. Note that the value and policy networks must be constructed off-line (before the control process begins), since their training involves a lot of data collection and computation.

The process just described also applies and indeed becomes simpler for infinite horizon problems. It underlies the class of approximate policy iteration methods, which we discussed briefly in Section 1.3, and we will revisit later in Chapter 5. An important type of such a method is known as optimistic approximate policy iteration, where the approximations in value and policy space are done using limited amounts of data (e.g., use just a few samples of state-control pairs to perform one or more gradient-type iterations to update the parameters of a value network and/or a policy network). Methods of this type include algorithms such as Q-learning and policy evaluation by temporal differences. Together with their variations, which depend on the details of the data collection, and the amount of data used for the value and policy space approximations, these algorithms underlie a large part of the RL methodology.

2.2 APPROACHES FOR VALUE SPACE APPROXIMATION

In this section we discuss some of the generic implementation aspects of approximation in value space. In particular, we address how to overcome the lack of a mathematical model through the use of a computer simulator, we discuss issues relating to the mixture of on-line and off-line computation that is appropriate for given types of problems, and we also consider methods to simplify the one-step lookahead minimization.

2.2.1 Model-Based and Model-Free Implementations

Generally, a finite horizon DP problem is defined by the state, control, and disturbance spaces, the functions $f_k$ and $g_k$, the control constraint sets $U_k(x_k)$, and the probability distributions of the disturbances. We refer to these as the mathematical model of the problem.

As we noted in Section 1.4, in this book we will address formally only problems for which a mathematical model is known, thereby bypassing difficult issues of simultaneous system identification and control. The mathematical model may be available through closed mathematical form or through a computer simulator. From our point of view, it makes no difference whether the model is based on mathematical expressions or a computer simulator: the methods that we discuss are valid either way, only their suitability for a given problem may be affected by the availability of mathematical formulas.

A lot of the RL literature makes a distinction between model-free and model-based methods: often the "model-free" term is reserved for methods that apply to problems where no mathematical model is available. In this book, we will adopt a somewhat different technical definition of the term "model-free" (also used in the RL book [Ber19a]). In particular, we will refer to a method as being model-free if the calculation of all the expected values in Eq. (2.1), and other related expressions, is done with Monte Carlo simulation, and the conditional probability distribution of $w_k$, given $(x_k, u_k)$, is either not analytically available, or is not used for reasons of computational expediency.†

Note that for deterministic problems there is no expected value to compute, so methods for these problems technically come under the model-based category, even if values of the functions $g_k$ and $f_k$ become available through complicated computer calculations (so the mathematical model is hidden inside the computer model). Still, however, Monte Carlo simulation may enter the solution process of a deterministic problem for computational expediency (i.e., computing sums of a large number of terms) or other reasons. For another example, the games of chess and Go are perfectly deterministic, but the AlphaGo and AlphaZero programs are trained using randomized policies and rely on Monte Carlo tree search techniques, which use sampling and will be discussed in Sections 2.4.2 and 2.4.3.

† An example is when Monte Carlo integration is used to estimate integrals that are analytically available but are hard to compute.
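In the sense of this definition, a model-free implementation replaces the expected value in the approximate Q-factor of Eq. (2.1) by an average of simulator samples, so that the distribution of w_k is never used explicitly; a minimal sketch (with hypothetical simulator and approximation objects) follows.

```python
# Minimal sketch of a "model-free" computation of the approximate Q-factor of Eq. (2.1):
# the expectation over w_k is replaced by an average of simulator samples, so the
# distribution of w_k is never used explicitly (simulator and J_next are hypothetical).

def q_factor_estimate(x, u, k, simulator, J_next, num_samples=100):
    """simulator(k, x, u) returns a sampled pair (stage cost g_k(x,u,w), next state f_k(x,u,w))."""
    total = 0.0
    for _ in range(num_samples):
        stage_cost, next_state = simulator(k, x, u)
        total += stage_cost + J_next(next_state)
    return total / num_samples

# The one-step lookahead control is then obtained by minimizing this estimate over u in U_k(x).
```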


2.2.2 Off-Line and On-Line Implementations

In value space approximation, an important consideration is the computation that is done off-line (i.e., before the control process begins), and the computation that is done on-line as needed (i.e., after the control process begins, and for just the states $x_k$ to be encountered). In practice, there is always a fair amount of computation to be done both off-line and on-line: it is more an issue of balance between the two, and of how this balance facilitates various operating aspects of the approximation method at hand.

Usually, for challenging problems, the controls $\mu_k(x_k)$ are computed on-line, since their storage may be difficult when the state space is large. However, an important design choice is whether we are approximating Q-factors or cost functions, and in the latter case, whether the approximate cost function values $J_{k+1}(x_{k+1})$ are computed on-line by simulation or not. We may distinguish between:

(i) Methods where the entire approximate Q-factor function $Q_k$ or cost function $J_{k+1}$ is computed for every $k$, before the control process begins. The advantage of this is that a lot of the computation is done off-line, before the first control is applied at time 0. Once the control process starts, no extra computation is needed to obtain $J_{k+1}(x_{k+1})$ for implementing the corresponding suboptimal policy using approximation in value space.

(ii) Methods where most of the computation is performed just after the current state $x_k$ becomes known. An important example of such a method is when the values $J_{k+1}(x_{k+1})$ are computed only at the relevant next states $x_{k+1}$, and are used to compute the control to be applied via Eq. (2.1). These methods are the ones best suited for on-line replanning, where the problem data may change over time. Depending on the problem, such changes may include unanticipated demands for services, emergency situations, forecasts relating to future values of the disturbances, etc. A similar situation where on-line methods may have an advantage is when the initial state and other problem data may become known just before the control process begins.

Generally, it may be said that there is a tradeoff between off-line computation and on-line replanning: as we increase the amount of off-line computation, we lose flexibility in applying on-line replanning. Examples of schemes involving a lot of off-line computation are the neural network and other parametric approximations that we will discuss in Chapter 4. Examples of schemes involving a lot of on-line computation are rollout (Sections 2.3-2.5) and model predictive control (Section 3.1). Of course there are also hybrid methods, where significant computation is done off-line to expedite the on-line computation of needed values of $J_{k+1}$. An example is the truncated rollout schemes to be discussed in Section 2.3.4.


On-Line Methods and Exploration

The balance between on-line and off-line computation is also intimately connected with the timing of availability of data. If significant data is not available until just before or even after the control process begins, an on-line replanning approach is typically necessary, which affects the nature of the required off-line computation.

A related issue within this context is whether the available data is sufficient for the purposes of control, or whether new supplementary data should be actively collected as control is being applied. The latter data collection process is known as exploration, and its design and organization may be a significant part of the overall decision making scheme. The area of multiarmed bandit optimization embodies some of the fundamental ideas of balancing the competing desires of exploitation and exploration (generate and evaluate controls that seem most promising in terms of performance versus applying inadequately explored controls and visiting unexplored parts of the state space); see Section 2.4.2.

We note that the exploration versus exploitation dilemma has also received considerable early attention in stochastic optimal control theory. In particular, there was a lot of discussion in the 1960s about the dual purposes of stochastic sequential decision making: system identification (i.e., taking actions that aim towards efficient estimation of model parameters), and system control (i.e., taking actions that aim towards guiding the system towards low cost trajectories). The interplay between these two objectives has been known as dual control; see Feldbaum [Fel60], Astrom [Ast83]. One of its characteristics is the need for balance between "caution" (the need for conservatism in applying control, while the system is not fully known), and "probing" (the need for aggressiveness in applying control, in order to excite the system enough to be able to identify it). Just like exploration and exploitation, these notions cannot be easily quantified, but often manifest themselves in specific control schemes.

When we discuss infinite horizon problems and the policy iteration method in Chapter 5, we will encounter still another context where exploration plays an important role. This is an off-line implementation context whereby it is necessary to evaluate approximately the entire cost function of a given policy $\mu$ by simulation. To do this, we may need to generate cost samples using $\mu$, but this may bias the simulation by underrepresenting states that are unlikely to occur under $\mu$. As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate, causing potentially serious errors in the calculation of the next policy via the policy improvement operation. This is a major difficulty in simulation-based approximate policy iteration, particularly when the system is deterministic, or when the randomness embodied in the transition probabilities of the current policy is "relatively small," since then few states may be reached from a given initial state when the current policy is simulated.


There are several approaches to enhancing exploration within the policy iteration method, but no foolproof remedy. One possibility is to break down the simulation into multiple short trajectories and to ensure that the initial states employed form a rich and representative subset. Parallel computation may play an important role within this context. Another possibility is to artificially introduce some extra randomization in the simulation of the current policy µ, by occasionally using randomly generated transitions rather than the ones dictated by µ.
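As a rough illustration of these two remedies, the following Python sketch (not from the text) evaluates a policy by simulation using many short trajectories started from a representative set of initial states, and occasionally substitutes a randomly chosen control for the one dictated by the policy; the simulator interface step(x, u), the discounted-cost setting, and the parameter names are assumptions made for the example.

```python
import random

def evaluate_policy_with_exploration(mu, step, initial_states, controls,
                                     gamma=0.95, num_traj=200, horizon=30,
                                     eps=0.1, seed=0):
    """Monte Carlo evaluation of a policy mu with two exploration enhancements:
    (i) many short trajectories, each started from a state in a representative
    set of initial states, and (ii) occasional random controls (probability eps)
    in place of the controls dictated by mu.

    step(x, u) -> (next_state, stage_cost) is a simulator of the system.
    Returns a dictionary of discounted cost estimates, indexed by initial state.
    """
    rng = random.Random(seed)
    estimates = {}
    for x0 in initial_states:
        total = 0.0
        for _ in range(num_traj):
            x, cost, discount = x0, 0.0, 1.0
            for _ in range(horizon):          # short, truncated trajectories
                u = rng.choice(controls) if rng.random() < eps else mu(x)
                x, g = step(x, u)
                cost += discount * g
                discount *= gamma
            total += cost
        estimates[x0] = total / num_traj
    return estimates
```

Note that the extra randomization introduces some bias in the cost estimates; it is accepted here as the price for improved exploration.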

Finally, let us note that the issue of exploration does not arise within the context of rollout algorithms (at least for “standard” rollout schemes). This contributes significantly to the reliability and the convenient implementation of these methods.

2.2.3 Methods for Cost-to-Go Approximation

There are two major issues that arise in the implementation of one-step lookahead; each of the two can be considered separately from the other:

(a) Obtaining J_{k+1}, i.e., the method to compute the lookahead functions J_{k+1} that are involved in the lookahead minimization (2.1). There are quite a few approaches here (see Fig. 2.2.1). Some are discussed in this chapter, and more will be discussed in subsequent chapters.

(b) Control selection, i.e., the method to perform the minimization (2.1) and implement the suboptimal policy µ_k. Again there are several exact and approximate methods for control selection, some of which will be discussed in this chapter (see Fig. 2.2.1). In particular, two possibilities are:

(1) Approximate the expected value appearing in the minimized expression.

(2) Perform the minimization approximately.

We will provide a high-level discussion of these issues, focusing for simplicity on just the case of one-step lookahead.

Regarding the computation of J_{k+1}, we may distinguish between three types of methods:

(a) Problem approximation: Here the functions J_{k+1} in Eq. (2.1) are obtained as the optimal or nearly optimal cost functions of a simplified optimization problem, which is more convenient for computation. Simplifications may include exploiting decomposable structure, reducing the size of the state space, and ignoring various types of uncertainties; for example we may consider using as J_{k+1} the cost function of a related deterministic problem, thus allowing computation of J_{k+1}(x_{k+1}) by shortest path-type methods. A major type of problem approximation method is aggregation, which is described and analyzed in the books [Ber12] and [Ber19a], and the surveys [Ber18b], [Ber18c].


[Figure 2.2.1 Schematic illustration of various approaches for approximation in value space with one-step lookahead. At the current state x_k, the first step involves the lookahead minimization min_{u_k} E{ g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1}) }, which may be simplified by approximating the minimization (e.g., multiagent simplification, discretization) or by approximating the expected value E{·} (e.g., certainty equivalence, adaptive simulation, Monte Carlo tree search), while the cost-to-go approximation J_{k+1} for the remaining “future” stages may be obtained by simple choices, problem approximation, aggregation, parametric approximation and neural nets, or rollout and model predictive control.]

In this book, our discussion of problem approximation will be peripheral to our development, even though it can be combined with the approximation in value space methods that are our main focus; see the RL book [Ber19a] for a more extensive coverage.

(b) On-line approximate optimization, such as rollout algorithms and model predictive control, which will be discussed in this chapter and the next one. These methods often involve the use of a suboptimal policy or heuristic, which is applied on-line when needed to approximate the true optimal cost-to-go values. The suboptimal policy may be obtained by any other method, e.g., one based on heuristic reasoning, or on a more principled approach, based on approximation in policy space.

(c) Parametric cost approximation, which is discussed in Chapter 4. Here the functions J_{k+1} in Eq. (2.1) are obtained from a given parametric class of functions J_{k+1}(x_{k+1}, r_{k+1}), where r_{k+1} is a parameter vector, selected by a suitable algorithm. The parametric class typically involves prominent characteristics of x_{k+1} called features, which can be obtained either through insight into the problem at hand, or by using training data and some form of neural network.

Additional variants of these methods are obtained when used in combination with approximate minimization over u_k in Eq. (2.1), and also when the expected value over w_k is computed approximately via some type of approximation or adaptive simulation (cf. Section 2.4.2). We will next discuss a variety of ideas relating to approximate minimization in the one-step and multistep lookahead.

2.2.4 Methods for Expediting the Lookahead Minimization

A major concern in approximation in value space is to facilitate the calculation of the suboptimal control µ_k(x_k) at state x_k via the minimization of the one-step lookahead expression

$$
E\bigl\{ g_k(x_k, u_k, w_k) + J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}, \tag{2.6}
$$

once the cost-to-go approximating functions J_{k+1} or approximate Q-factors Q_k have been selected; cf. Fig. 2.2.1. In this section, we will assume that we have a mathematical model, i.e., that the functions g_k and f_k are available in essentially closed form, and that the conditional probability distribution of w_k, given (x_k, u_k), is also available.

Important issues here are the computation of the expected value (if the problem is stochastic) and the minimization over u_k ∈ U_k(x_k) in Eq. (2.6). Both of these operations may involve substantial work, which is of particular concern when the minimization is to be performed on-line.

Probabilistic Approximation

One possibility to simplify the probabilistic calculations is to eliminate the expected value from the expression (2.6) using (assumed) certainty equivalence. Here we choose a typical value w̄_k of the disturbance w_k, and obtain the control µ_k(x_k) from the deterministic optimization

$$
\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} \Bigl[ g_k(x_k, u_k, \bar w_k) + J_{k+1}\bigl( f_k(x_k, u_k, \bar w_k) \bigr) \Bigr]. \tag{2.7}
$$

The approach of turning a stochastic problem into a deterministic one by replacing uncertain quantities with single typical values highlights the possibility that J_{k+1} may itself be obtained by using deterministic methods. This approach and its variations are discussed in detail in the RL book [Ber19a] (Section 2.3). For example, an extension of certainty equivalence is a simplification of the probability distribution of w_k (such as concentrating its probability mass on a few representative values), so that the computation of the expected value in the one-step lookahead expression (2.6) is simplified. These possibilities also admit straightforward extension to the case of multistep lookahead; see Section 2.5.2.
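As a concrete illustration of the certainty equivalence scheme of Eq. (2.7), here is a minimal Python sketch that selects a control by deterministic one-step lookahead with the disturbance fixed at a single typical value; the callable interfaces U, f, g, J_next are assumptions of the sketch, and the control set is assumed finite so that the minimization can be carried out by enumeration.

```python
def ce_one_step_lookahead(x, U, f, g, J_next, w_typical):
    """Certainty-equivalent one-step lookahead, cf. Eq. (2.7): the disturbance is
    fixed at the typical value w_typical, so the lookahead problem is deterministic.
    U(x) yields the admissible controls, f and g are the system and stage-cost
    functions, and J_next is the cost-to-go approximation J_{k+1}."""
    best_u, best_val = None, float('inf')
    for u in U(x):
        val = g(x, u, w_typical) + J_next(f(x, u, w_typical))
        if val < best_val:
            best_u, best_val = u, val
    return best_u
```

For a continuous control set, the enumeration would be replaced by a nonlinear programming method, as discussed later in this section.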

A useful variation of the certainty equivalence approach is to fix at typical values only some of the uncertain quantities. For example, a partial state information problem may be treated as one of perfect state information, using an estimate x̂_k of x_k as if it were exact, while fully taking into account the stochastic nature of the disturbances. Thus, if {µ^p_0(x_0), . . . , µ^p_{N-1}(x_{N-1})} is an optimal policy obtained from the DP algorithm for the stochastic perfect state information problem

$$
\begin{aligned}
&\text{minimize} \quad E\Biggl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k\bigl( x_k, \mu_k(x_k), w_k \bigr) \Biggr\} \\
&\text{subject to} \quad x_{k+1} = f_k\bigl( x_k, \mu_k(x_k), w_k \bigr), \quad \mu_k(x_k) \in U_k(x_k), \quad k = 0, \dots, N-1,
\end{aligned}
$$


then the control input applied by this variant of CEC is µ^p_k(x̂_k), where x̂_k is the estimate of x_k given the information available up to time k.

One may also consider an approximation to ℓ-step lookahead, which is based on deterministic computations. This is a hybrid, partially deterministic approach, whereby at state x_k, we allow for a stochastic disturbance w_k at the current stage, but fix the future disturbances w_{k+1}, . . . , w_{k+ℓ-1}, up to the end of the lookahead horizon, to some typical values. This allows us to bring to bear deterministic shortest path methods in the computation of approximate costs-to-go beyond the first stage.

In particular, with this approach, when at state x_k, the needed values J_{k+1}(x_{k+1}) are the ones corresponding to x_{k+1} = f_k(x_k, u_k, w_k), for all u_k ∈ U_k(x_k) and w_k. Each of these values will be computed by solving an (ℓ-1)-step deterministic shortest path problem that starts at x_{k+1} and involves the typical values of the disturbances w_{k+1}, . . . , w_{k+ℓ-1}. Once the values J_{k+1}(x_{k+1}) are obtained, they will be used to compute the approximate Q-factors of pairs (x_k, u_k) using the formula

$$
Q_k(x_k, u_k) = E\bigl\{ g_k(x_k, u_k, w_k) + J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\},
$$

which incorporates the first stage uncertainty. Finally, the control chosen by such a scheme at time k will be

$$
\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k).
$$

Lookahead Minimization - Simplifications

Let us now consider the issue of algorithmic minimization over U_k(x_k) in Eqs. (2.6) and (2.7). If U_k(x_k) is a finite set, the minimization can be done by brute force, through exhaustive computation and comparison of the corresponding Q-factors. This of course can be very time consuming, particularly for multistep lookahead, but parallel computation can be used with great effect for this purpose [as well as for the calculation of the expected value in the expression (2.6)]. In particular, the Q-factors of different control choices u_k ∈ U_k(x_k) are uncoupled and their computation may be done in parallel. Note, however, that when Monte Carlo simulation is used to evaluate the expectation over w_k, it may be important to use the same samples of w_k for all controls for purposes of variance reduction. This issue is discussed in Section 2.4.5.
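A small sketch of this brute-force Q-factor comparison with common random numbers follows; the interfaces sample_w, f, g, J_next and the parameter names are assumptions. Reusing the same disturbance samples for every control does not change the expected values, but it typically reduces the variance of the differences between the estimated Q-factors, which is what matters for selecting the minimizing control.

```python
import random
import statistics

def select_control_crn(x, controls, f, g, J_next, sample_w,
                       num_samples=100, seed=0):
    """Monte Carlo comparison of one-step lookahead Q-factors for all controls,
    using the same disturbance samples for every control (common random numbers).
    sample_w(rng) draws one disturbance sample; returns the minimizing control
    together with the dictionary of estimated Q-factors."""
    rng = random.Random(seed)
    ws = [sample_w(rng) for _ in range(num_samples)]   # one shared sample set
    q = {u: statistics.fmean([g(x, u, w) + J_next(f(x, u, w)) for w in ws])
         for u in controls}
    return min(q, key=q.get), q
```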

For some problems, integer programming techniques may also be used in place of brute force minimization over U_k(x_k). Moreover, for deterministic problems with multistep lookahead, sophisticated exact or approximate shortest path methods may be considered; several methods of this type are available, such as label correcting methods, A* methods, and their variants (see the author’s textbooks [Ber98] and [Ber17] for detailed accounts, which are consistent with the context of this chapter).


When the control constraint set is infinite, it may be replaced by a finite set through discretization. However, a more efficient alternative may be to use continuous space nonlinear programming techniques (see Section 2.5). This possibility can be attractive for deterministic problems, which lend themselves better to continuous space optimization; an example is the model predictive control context (see Section 3.1).

For stochastic problems with continuous control spaces and either one-step or multistep lookahead, the methodology of stochastic programming may be useful. This methodology bears a close connection with linear and nonlinear programming, and will be explained in Section 2.5.2 in the context of some specialized rollout methods.

A major type of simplification of the lookahead minimization is possible for problems where the control u_k consists of multiple components, i.e., u_k = (u^1_k, . . . , u^m_k). We refer to these as multiagent problems, in reference to the situation where each component is chosen by a separate autonomous agent. We have discussed briefly such problems in Section 1.4.5, and we will consider them again in Section 3.2, where we will introduce “one-agent-at-a-time” minimization schemes that reduce dramatically the amount of computation needed for lookahead minimization.
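To convey the flavor of such schemes before their proper development in Section 3.2, here is a rough one-pass sketch in which the components of u_k = (u^1_k, . . . , u^m_k) are optimized one at a time, with the other components held at their current values; the Q-factor interface and the initialization are assumptions, and this is a coordinate-descent-style stand-in rather than the book's exact algorithm. With m components that each take n values, one pass requires on the order of m·n Q-factor evaluations, rather than the n^m evaluations of the joint minimization.

```python
def one_agent_at_a_time(x, component_sets, q_factor, u_init=None):
    """One pass of component-by-component lookahead minimization for a control
    u = (u^1, ..., u^m): each component is optimized in turn while the others
    are held fixed. q_factor(x, u_tuple) is any (approximate) one-step
    lookahead Q-factor; component_sets[i] lists the choices for component i."""
    u = list(u_init) if u_init is not None else [s[0] for s in component_sets]
    for i, U_i in enumerate(component_sets):
        best_val, best_ui = float('inf'), u[i]
        for ui in U_i:
            u[i] = ui
            val = q_factor(x, tuple(u))
            if val < best_val:
                best_val, best_ui = val, ui
        u[i] = best_ui
    return tuple(u)
```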

Finally, one more possibility to simplify or bypass the one-step lookahead minimization (2.6) is to construct a Q-factor approximation, which is based on approximate Q-factor samples. This is also suitable for model-free policy implementation and is described next.

2.2.5 Simplification of the Lookahead Minimization by Q-Factor Approximation

We now discuss how to simplify the minimization over U_k(x_k) of the approximate Q-factors

$$
E\bigl\{ g_k(x_k, u_k, w_k) + J_{k+1}\bigl( f_k(x_k, u_k, w_k) \bigr) \bigr\}, \tag{2.8}
$$

in the computation of the one-step lookahead control. We assume that:

(a) Either the system equation

$$
x_{k+1} = f_k(x_k, u_k, w_k)
$$

is known, or there is a computer program/simulator that for any given state x_k and control u_k ∈ U_k(x_k), simulates sample probabilistic transitions to a successor state x_{k+1}, and generates the corresponding transition costs.

(b) The cost function approximation J_{k+1} is available in some way. In particular, J_{k+1} may be obtained by solving a simpler problem for which a model is available, or it may be obtained by some other method, based for example on simulation with a given policy and/or neural network training.

Given a state x_k, we may compute the Q-factors (2.8) for all the pairs (x_k, u_k), u_k ∈ U_k(x_k), and then select the minimizing control. However, in many cases this can be very time-consuming for on-line computation purposes. To deal with this difficulty, a common approach is to introduce a parametric family of Q-factor functions (see Section 4.3.2),

$$
Q_k(x_k, u_k, r_k),
$$

where r_k is the parameter vector, and use a least squares fit/regression to approximate the expected value that is minimized in Eq. (2.6). The steps are as follows:

Summary of Q-Factor Approximation Based on Cost Function Approximation

Assume that the value of J_{k+1}(x_{k+1}) is available for any given x_{k+1}:

(a) Use the system equation or a simulator to collect a large number of “representative” quadruplets (x^s_k, u^s_k, x^s_{k+1}, g^s_k), and corresponding Q-factors

$$
\beta^s_k = g^s_k + J_{k+1}(x^s_{k+1}), \qquad s = 1, \dots, q. \tag{2.9}
$$

Here x^s_{k+1} is the next state

$$
x^s_{k+1} = f_k(x^s_k, u^s_k, w^s_k)
$$

that corresponds to some sample disturbance w^s_k. This disturbance also determines the one-stage-cost sample

$$
g^s_k = g_k(x^s_k, u^s_k, w^s_k)
$$

(see Fig. 2.2.2).

(b) Compute the parameter r_k by the least-squares regression

$$
r_k \in \arg\min_{r} \sum_{s=1}^{q} \bigl( Q_k(x^s_k, u^s_k, r) - \beta^s_k \bigr)^2. \tag{2.10}
$$

(c) Use the policy

$$
\mu_k(x_k) \in \arg\min_{u_k \in U_k(x_k)} Q_k(x_k, u_k, r_k). \tag{2.11}
$$


[Figure 2.2.2 Schematic illustration of Q-factor approximation, assuming approximate cost functions J_{k+1} are known. The inputs to the system model or simulator are sample state-control pairs (x^s_k, u^s_k), and the outputs are a next state sample x^s_{k+1} and a cost sample g^s_k. These correspond to a disturbance w^s_k according to x^s_{k+1} = f_k(x^s_k, u^s_k, w^s_k) and g^s_k = g_k(x^s_k, u^s_k, w^s_k). The sample Q-factors β^s_k are generated according to Eq. (2.9), and are used in the least squares regression (2.10) to yield a parametric Q-factor approximation Q_k and the policy implementation (2.11).]

Note some important points about the preceding procedure:

(1) It is model-free in the sense that it is based on Monte Carlo simulation. Moreover, it does not need the functions f_k and g_k, and the probability distribution of w_k to generate the policy µ_k through the least squares regression (2.10) and the Q-factor minimization (2.11). Using the simulator to collect the samples (2.9) and the cost function approximation J_{k+1} suffices.

(2) Two approximations are potentially required: one to compute J_{k+1}, which is needed for the samples β^s_k [cf. Eq. (2.9)], and another to compute Q_k through the regression (2.10). The approximation methods to obtain J_{k+1} and Q_k may be unrelated.

(3) The policy µ_k obtained through the minimization (2.11) is not the same as the one obtained through the minimization (2.6). There are two reasons for this. One is the approximation error introduced by the Q-factor architecture Q_k, and the other is the simulation error introduced by the finite-sample regression (2.10). We have to accept these sources of error as the price to pay for the convenience of not requiring a mathematical model for policy implementation.

(4) By committing to the off-line computation of approximate Q-factors through the regression (2.10) and the control computation via Eq. (2.11), we preclude the possibility of on-line replanning, i.e., adapting the control computation to on-line changes in the cost per stage g_k, the system function f_k, or the probability distribution of w_k.

Let us also mention a regularized variant of the least squares minimization in Eq. (2.10), where a positive multiple of the squared deviation ‖r - r̂‖² of r from some initial guess r̂ is added to the least squares objective. In some cases, a nonquadratic minimization may be used in place of Eq. (2.10) to determine r_k, but in this book we will focus on least squares exclusively. We refer to Chapter 4 for further discussion.
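For illustration, a regularized fit of this kind can be computed in closed form through the normal equations; the sketch below assumes the linear-in-features architecture of the previous sketch, with Phi the feature matrix, beta the vector of sample Q-factors, r_prior the initial guess, and weight the positive multiple (all names are assumptions).

```python
import numpy as np

def fit_q_factor_regularized(Phi, beta, r_prior, weight):
    """Regularized least squares variant of (2.10): minimize
    ||Phi r - beta||^2 + weight * ||r - r_prior||^2, whose unique minimizer
    solves (Phi^T Phi + weight I) r = Phi^T beta + weight r_prior."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + weight * np.eye(d)
    b = Phi.T @ beta + weight * r_prior
    return np.linalg.solve(A, b)
```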


References

[ABB19] Agrawal, A., Barratt, S., Boyd, S., and Stellato, B., 2019. “Learning Convex Optimization Control Policies,” arXiv preprint arXiv:1912.09529.

[ACF02] Auer, P., Cesa-Bianchi, N., and Fischer, P., 2002. “Finite Time Analysis of the Multiarmed Bandit Problem,” Machine Learning, Vol. 47, pp. 235-256.

[ADH19] Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R., 2019. “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks,” arXiv preprint arXiv:1901.08584.

[AHZ19] Arcari, E., Hewing, L., and Zeilinger, M. N., 2019. “An Approximate Dynamic Programming Approach for Dual Stochastic Model Predictive Control,” arXiv preprint arXiv:1911.03728.

[ALZ08] Asmuth, J., Littman, M. L., and Zinkov, R., 2008. “Potential-Based Shaping in Model-Based Reinforcement Learning,” Proc. of 23rd AAAI Conference, pp. 604-609.

[AMS09] Audibert, J. Y., Munos, R., and Szepesvari, C., 2009. “Exploration-Exploitation Tradeoff Using Variance Estimates in Multi-Armed Bandits,” Theoretical Computer Science, Vol. 410, pp. 1876-1902.

[ASR20] Andersen, A. R., Stidsen, T. J. R., and Reinhardt, L. B., 2020. “Simulation-Based Rolling Horizon Scheduling for Operating Theatres,” in SN Operations Research Forum, Vol. 1, pp. 1-26.

[AXG16] Ames, A. D., Xu, X., Grizzle, J. W., and Tabuada, P., 2016. “Control Barrier Function Based Quadratic Programs for Safety Critical Systems,” IEEE Transactions on Automatic Control, Vol. 62, pp. 3861-3876.

[Abr90] Abramson, B., 1990. “Expected-Outcome: A General Model of Static Evaluation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 12, pp. 182-193.

[Agr95] Agrawal, R., 1995. “Sample Mean Based Index Policies with O(log n) Regret for the Multiarmed Bandit Problem,” Advances in Applied Probability, Vol. 27, pp. 1054-1078.

[AnH14] Antunes, D., and Heemels, W. P. M. H., 2014. “Rollout Event-Triggered Control: Beyond Periodic Control Performance,” IEEE Transactions on Automatic Control, Vol. 59, pp. 3296-3311.

[AnM79] Anderson, B. D. O., and Moore, J. B., 1979. Optimal Filtering, Prentice-Hall, Englewood Cliffs, N. J.

[Ast83] Astrom, K. J., 1983. “Theory and Applications of Adaptive Control - A Survey,” Automatica, Vol. 19, pp. 471-486.


[AtF66] Athans, M., and Falb, P., 1966. Optimal Control, McGraw-Hill, N. Y.

[AvB20] Avrachenkov, K., and Borkar, V. S., 2020. “Whittle Index Based Q-Learning for Restless Bandits with Average Reward,” arXiv preprint arXiv:2004.14427.

[BBD08] Busoniu, L., Babuska, R., and De Schutter, B., 2008. “A Comprehensive Survey of Multiagent Reinforcement Learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C, Vol. 38, pp. 156-172.

[BBD10a] Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D., 2010. Reinforcement Learning and Dynamic Programming Using Function Approximators, CRC Press, N. Y.

[BBD10b] Busoniu, L., Babuska, R., and De Schutter, B., 2010. “Multi-Agent Reinforcement Learning: An Overview,” in Innovations in Multi-Agent Systems and Applications, Springer, pp. 183-221.

[BBG13] Bertazzi, L., Bosco, A., Guerriero, F., and Lagana, D., 2013. “A Stochastic Inventory Routing Problem with Stock-Out,” Transportation Research, Part C, Vol. 27, pp. 89-107.

[BBM17] Borrelli, F., Bemporad, A., and Morari, M., 2017. Predictive Control for Linear and Hybrid Systems, Cambridge Univ. Press, Cambridge, UK.

[BBP13] Bhatnagar, S., Borkar, V. S., and Prashanth, L. A., 2013. “Adaptive Feature Pursuit: Online Adaptation of Features in Reinforcement Learning,” in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press, Piscataway, N. J., pp. 517-534.

[BBW20] Bhattacharya, S., Badyal, S., Wheeler, T., Gil, S., and Bertsekas, D. P., 2020. “Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration with Application to Autonomous Sequential Repair Problems,” to appear in IEEE Robotics and Automation Letters, 2020; arXiv preprint arXiv:2002.04175.

[BCD10] Brochu, E., Cora, V. M., and De Freitas, N., 2010. “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning,” arXiv preprint arXiv:1012.2599.

[BCN18] Bottou, L., Curtis, F. E., and Nocedal, J., 2018. “Optimization Methods for Large-Scale Machine Learning,” SIAM Review, Vol. 60, pp. 223-311.

[BKB20] Bhattacharya, S., Kailas, S., Badyal, S., Gil, S., and Bertsekas, D. P., 2020. “Multiagent Rollout and Policy Iteration for POMDP with Application to Multi-Robot Repair Problems,” in Proc. of Conference on Robot Learning (CoRL); also arXiv preprint arXiv:2011.04222.

[BLL19] Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A., 2019. “Benign Overfitting in Linear Regression,” arXiv preprint arXiv:1906.11300.

[BMM18] Belkin, M., Ma, S., and Mandal, S., 2018. “To Understand Deep Learning we Need to Understand Kernel Learning,” arXiv preprint arXiv:1802.01396.

[BPW12] Browne, C., Powley, E., Whitehouse, D., Lucas, L., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S., 2012. “A Survey of Monte Carlo Tree Search Methods,” IEEE Trans. on Computational Intelligence and AI in Games, Vol. 4, pp. 1-43.

[BRT18] Belkin, M., Rakhlin, A., and Tsybakov, A. B., 2018. “Does Data Interpolation Contradict Statistical Optimality?” arXiv preprint arXiv:1806.09471.

[BTW97] Bertsekas, D. P., Tsitsiklis, J. N., and Wu, C., 1997. “Rollout Algorithms for Combinatorial Optimization,” Heuristics, Vol. 3, pp. 245-262.


[BWL19] Beuchat, P. N., Warrington, J., and Lygeros, J., 2019. “Accelerated Point-Wise Maximum Approach to Approximate Dynamic Programming,” arXiv preprint arXiv:1901.03619.

[BYB94] Bradtke, S. J., Ydstie, B. E., and Barto, A. G., 1994. “Adaptive Linear Quadratic Control Using Policy Iteration,” Proc. IEEE American Control Conference, Vol. 3, pp. 3475-3479.

[BaF88] Bar-Shalom, Y., and Fortman, T. E., 1988. Tracking and Data Association, Academic Press, N. Y.

[BaL19] Banjac, G., and Lygeros, J., 2019. “A Data-Driven Policy Iteration Scheme Based on Linear Programming,” Proc. 2019 IEEE CDC, pp. 816-821.

[BaP12] Bauso, D., and Pesenti, R., 2012. “Team Theory and Person-by-Person Optimization with Binary Decisions,” SIAM Journal on Control and Optimization, Vol. 50, pp. 3011-3028.

[Bai93] Baird, L. C., 1993. “Advantage Updating,” Report WL-TR-93-1146, Wright Patterson AFB, OH.

[Bai94] Baird, L. C., 1994. “Reinforcement Learning in Continuous Time: Advantage Updating,” International Conf. on Neural Networks, Orlando, Fla.

[Bar90] Bar-Shalom, Y., 1990. Multitarget-Multisensor Tracking: Advanced Applications, Artech House, Norwood, MA.

[BeC89] Bertsekas, D. P., and Castanon, D. A., 1989. “The Auction Algorithm for Transportation Problems,” Annals of Operations Research, Vol. 20, pp. 67-96.

[BeC99] Bertsekas, D. P., and Castanon, D. A., 1999. “Rollout Algorithms for Stochastic Scheduling Problems,” Heuristics, Vol. 5, pp. 89-108.

[BeC02] Ben-Gal, I., and Caramanis, M., 2002. “Sequential DOE via Dynamic Programming,” IIE Transactions, Vol. 34, pp. 1087-1100.

[BeC08] Besse, C., and Chaib-draa, B., 2008. “Parallel Rollout for Online Solution of DEC-POMDPs,” Proc. of 21st International FLAIRS Conference, pp. 619-624.

[BeL14] Beyme, S., and Leung, C., 2014. “Rollout Algorithm for Target Search in a Wireless Sensor Network,” 80th Vehicular Technology Conference (VTC2014), IEEE, pp. 1-5.

[BeI96] Bertsekas, D. P., and Ioffe, S., 1996. “Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming,” Lab. for Info. and Decision Systems Report LIDS-P-2349, Massachusetts Institute of Technology.

[BeP03] Bertsimas, D., and Popescu, I., 2003. “Revenue Management in a Dynamic Network Environment,” Transportation Science, Vol. 37, pp. 257-277.

[BeR71a] Bertsekas, D. P., and Rhodes, I. B., 1971. “On the Minimax Reachability of Target Sets and Target Tubes,” Automatica, Vol. 7, pp. 233-247.

[BeR71b] Bertsekas, D. P., and Rhodes, I. B., 1971. “Recursive State Estimation for a Set-Membership Description of the Uncertainty,” IEEE Trans. Automatic Control, Vol. AC-16, pp. 117-128.

[BeR73] Bertsekas, D. P., and Rhodes, I. B., 1973. “Sufficiently Informative Functions and the Minimax Feedback Control of Uncertain Dynamic Systems,” IEEE Trans. Automatic Control, Vol. AC-18, pp. 117-124.

[BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control: The Discrete Time Case, Academic Press, N. Y.; republished by Athena Scientific, Belmont, MA, 1996 (can be downloaded from the author’s website).

[BeS18] Bertazzi, L., and Secomandi, N., 2018. “Faster Rollout Search for the Vehicle Routing Problem with Stochastic Demands and Restocking,” European J. of Operational Research, Vol. 270, pp. 487-497.

[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliffs, N. J.; republished by Athena Scientific, Belmont, MA, 1997 (can be downloaded from the author’s website).

[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. “An Analysis of Stochastic Shortest Path Problems,” Math. Operations Res., Vol. 16, pp. 580-595.

[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.

[BeT97] Bertsimas, D., and Tsitsiklis, J. N., 1997. Introduction to Linear Optimization, Athena Scientific, Belmont, MA.

[BeT00] Bertsekas, D. P., and Tsitsiklis, J. N., 2000. “Gradient Convergence of Gradient Methods with Errors,” SIAM J. on Optimization, Vol. 36, pp. 627-642.

[BeT08] Bertsekas, D. P., and Tsitsiklis, J. N., 2008. Introduction to Probability, 2nd Edition, Athena Scientific, Belmont, MA.

[BeY09] Bertsekas, D. P., and Yu, H., 2009. “Projected Equation Methods for Approximate Solution of Large Linear Systems,” J. of Computational and Applied Math., Vol. 227, pp. 27-50.

[BeY10] Bertsekas, D. P., and Yu, H., 2010. “Asynchronous Distributed Policy Iteration in Dynamic Programming,” Proc. of Allerton Conf. on Communication, Control and Computing, Allerton Park, Ill, pp. 1368-1374.

[BeY12] Bertsekas, D. P., and Yu, H., 2012. “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming,” Math. of Operations Research, Vol. 37, pp. 66-94.

[BeY16] Bertsekas, D. P., and Yu, H., 2016. “Stochastic Shortest Path Problems Under Weak Conditions,” Lab. for Information and Decision Systems Report LIDS-2909, MIT.

[Bel56] Bellman, R., 1956. “A Problem in the Sequential Design of Experiments,” Sankhya: The Indian Journal of Statistics, Vol. 16, pp. 221-229.

[Bel57] Bellman, R., 1957. Dynamic Programming, Princeton University Press, Princeton, N. J.

[Bel67] Bellman, R., 1967. Introduction to the Mathematical Theory of Control Processes, Academic Press, Vols. I and II, New York, N. Y.

[Bel84] Bellman, R., 1984. Eye of the Hurricane, World Scientific Publishing, Singapore.

[Ben09] Bengio, Y., 2009. “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, Vol. 2, pp. 1-127.

[Ber71] Bertsekas, D. P., 1971. “Control of Uncertain Systems With a Set-Membership Description of the Uncertainty,” Ph.D. Dissertation, Massachusetts Institute of Technology, Cambridge, MA (can be downloaded from the author’s website).

[Ber72] Bertsekas, D. P., 1972. “Infinite Time Reachability of State Space Regions by Using Feedback Control,” IEEE Trans. Automatic Control, Vol. AC-17, pp. 604-613.

[Ber73] Bertsekas, D. P., 1973. “Linear Convex Stochastic Control Problems over an Infinite Horizon,” IEEE Trans. Automatic Control, Vol. AC-18, pp. 314-315.


[Ber77] Bertsekas, D. P., 1977. “Monotone Mappings with Application in Dynamic Programming,” SIAM J. on Control and Opt., Vol. 15, pp. 438-464.

[Ber79] Bertsekas, D. P., 1979. “A Distributed Algorithm for the Assignment Problem,” Lab. for Information and Decision Systems Report, MIT, May 1979.

[Ber82] Bertsekas, D. P., 1982. “Distributed Dynamic Programming,” IEEE Trans. Automatic Control, Vol. AC-27, pp. 610-616.

[Ber83] Bertsekas, D. P., 1983. “Asynchronous Distributed Computation of Fixed Points,” Math. Programming, Vol. 27, pp. 107-120.

[Ber91] Bertsekas, D. P., 1991. Linear Network Optimization: Algorithms and Codes, MIT Press, Cambridge, MA (can be downloaded from the author’s website).

[Ber96] Bertsekas, D. P., 1996. “Incremental Least Squares Methods and the Extended Kalman Filter,” SIAM J. on Optimization, Vol. 6, pp. 807-822.

[Ber97a] Bertsekas, D. P., 1997. “A New Class of Incremental Gradient Methods for Least Squares Problems,” SIAM J. on Optimization, Vol. 7, pp. 913-926.

[Ber97b] Bertsekas, D. P., 1997. “Differential Training of Rollout Policies,” Proc. of the 35th Allerton Conference on Communication, Control, and Computing, Allerton Park, Ill.

[Ber98] Bertsekas, D. P., 1998. Network Optimization: Continuous and Discrete Models, Athena Scientific, Belmont, MA (can be downloaded from the author’s website).

[Ber05a] Bertsekas, D. P., 2005. “Dynamic Programming and Suboptimal Control: A Survey from ADP to MPC,” European J. of Control, Vol. 11, pp. 310-334.

[Ber05b] Bertsekas, D. P., 2005. “Rollout Algorithms for Constrained Dynamic Programming,” Lab. for Information and Decision Systems Report LIDS-P-2646, MIT.

[Ber07] Bertsekas, D. P., 2007. “Separable Dynamic Programming and Approximate Decomposition Methods,” IEEE Trans. on Aut. Control, Vol. 52, pp. 911-916.

[Ber10a] Bertsekas, D. P., 2010. “Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey,” Lab. for Information and Decision Systems Report LIDS-P-2848, MIT; a condensed version with the same title appears in Optimization for Machine Learning, by S. Sra, S. Nowozin, and S. J. Wright, (eds.), MIT Press, Cambridge, MA, 2012, pp. 85-119.

[Ber10b] Bertsekas, D. P., 2010. “Williams-Baird Counterexample for Q-Factor Asynchronous Policy Iteration,” http://web.mit.edu/dimitrib/www/Williams-Baird Counterexample.pdf.

[Ber11a] Bertsekas, D. P., 2011. “Incremental Proximal Methods for Large Scale Convex Optimization,” Math. Programming, Vol. 129, pp. 163-195.

[Ber11b] Bertsekas, D. P., 2011. “Approximate Policy Iteration: A Survey and Some New Methods,” J. of Control Theory and Applications, Vol. 9, pp. 310-335; a somewhat expanded version appears as Lab. for Info. and Decision Systems Report LIDS-2833, MIT, 2011.

[Ber12] Bertsekas, D. P., 2012. Dynamic Programming and Optimal Control, Vol. II, 4th Edition, Athena Scientific, Belmont, MA.

[Ber13a] Bertsekas, D. P., 2013. “Rollout Algorithms for Discrete Optimization: A Survey,” Handbook of Combinatorial Optimization, Springer.

[Ber13b] Bertsekas, D. P., 2013. “λ-Policy Iteration: A Review and a New Implementation,” in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press, Piscataway, N. J., pp. 381-409.

[Ber15a] Bertsekas, D. P., 2015. Convex Optimization Algorithms, Athena Scientific, Belmont, MA.

[Ber15b] Bertsekas, D. P., 2015. “Incremental Aggregated Proximal and Augmented Lagrangian Algorithms,” Lab. for Information and Decision Systems Report LIDS-P-3176, MIT; arXiv preprint arXiv:1507.1365936.

[Ber16] Bertsekas, D. P., 2016. Nonlinear Programming, 3rd Edition, Athena Scientific, Belmont, MA.

[Ber17] Bertsekas, D. P., 2017. Dynamic Programming and Optimal Control, Vol. I, 4th Edition, Athena Scientific, Belmont, MA.

[Ber18a] Bertsekas, D. P., 2018. Abstract Dynamic Programming, 2nd Edition, Athena Scientific, Belmont, MA (can be downloaded from the author’s website).

[Ber18b] Bertsekas, D. P., 2018. “Feature-Based Aggregation and Deep Reinforcement Learning: A Survey and Some New Implementations,” Lab. for Information and Decision Systems Report, MIT; arXiv preprint arXiv:1804.04577; IEEE/CAA Journal of Automatica Sinica, Vol. 6, 2019, pp. 1-31.

[Ber18c] Bertsekas, D. P., 2018. “Biased Aggregation, Rollout, and Enhanced Policy Improvement for Reinforcement Learning,” Lab. for Information and Decision Systems Report, MIT; arXiv preprint arXiv:1910.02426.

[Ber18d] Bertsekas, D. P., 2018. “Proximal Algorithms and Temporal Difference Methods for Solving Fixed Point Problems,” Computational Optim. Appl., Vol. 70, pp. 709-736.

[Ber19a] Bertsekas, D. P., 2019. Reinforcement Learning and Optimal Control, Athena Scientific, Belmont, MA.

[Ber19b] Bertsekas, D. P., 2019. “Robust Shortest Path Planning and Semicontractive Dynamic Programming,” Naval Research Logistics, Vol. 66, pp. 15-37.

[Ber19c] Bertsekas, D. P., 2019. “Multiagent Rollout Algorithms and Reinforcement Learning,” arXiv preprint arXiv:1910.00120.

[Ber19d] Bertsekas, D. P., 2019. “Constrained Multiagent Rollout and Multidimensional Assignment with the Auction Algorithm,” arXiv preprint, arxiv.org/abs/2002.07407.

[Ber20a] Bertsekas, D. P., 2020. “Multiagent Reinforcement Learning: Rollout and Policy Iteration,” ASU Report; appears in IEEE/CAA Journal of Automatica Sinica.

[Ber20b] Bertsekas, D. P., 2020. “Multiagent Value Iteration Algorithms in Dynamic Programming and Reinforcement Learning,” arXiv preprint, arxiv.org/abs/2005.01627; appears in Results in Control and Optimization Journal.

[Bet10] Bethke, B. M., 2010. Kernel-Based Approximate Dynamic Programming Using Bellman Residual Elimination, Ph.D. Thesis, MIT.

[BiL97] Birge, J. R., and Louveaux, 1997. Introduction to Stochastic Programming, Springer, New York, N. Y.

[Bia16] Bianchi, P., 2016. “Ergodic Convergence of a Stochastic Proximal Point Algorithm,” SIAM J. on Optimization, Vol. 26, pp. 2235-2260.

[Bis95] Bishop, C. M., 1995. Neural Networks for Pattern Recognition, Oxford University Press, N. Y.

[Bis06] Bishop, C. M., 2006. Pattern Recognition and Machine Learning, Springer, N. Y.


[BlM08] Blanchini, F., and Miani, S., 2008. Set-Theoretic Methods in Control, Birkhauser, Boston.

[Bla86] Blackman, S. S., 1986. Multi-Target Tracking with Radar Applications, Artech House, Dedham, MA.

[Bla99] Blanchini, F., 1999. “Set Invariance in Control – A Survey,” Automatica, Vol. 35, pp. 1747-1768.

[Bor08] Borkar, V. S., 2008. Stochastic Approximation: A Dynamical Systems Viewpoint, Cambridge Univ. Press.

[BrH75] Bryson, A., and Ho, Y. C., 1975. Applied Optimal Control: Optimization, Estimation, and Control (revised edition), Taylor and Francis, Levittown, Penn.

[Bra21] Brandimarte, P., 2021. From Shortest Paths to Reinforcement Learning: A MATLAB-Based Tutorial on Dynamic Programming, Springer.

[BuK97] Burnetas, A. N., and Katehakis, M. N., 1997. “Optimal Adaptive Policies for Markov Decision Processes,” Math. of Operations Research, Vol. 22, pp. 222-255.

[CBH09] Choi, H. L., Brunet, L., and How, J. P., 2009. “Consensus-Based Decentralized Auctions for Robust Task Allocation,” IEEE Transactions on Robotics, Vol. 25, pp. 912-926.

[CFH05] Chang, H. S., Hu, J., Fu, M. C., and Marcus, S. I., 2005. “An Adaptive Sampling Algorithm for Solving Markov Decision Processes,” Operations Research, Vol. 53, pp. 126-139.

[CFH13] Chang, H. S., Hu, J., Fu, M. C., and Marcus, S. I., 2013. Simulation-Based Algorithms for Markov Decision Processes, 2nd Edition, Springer, N. Y.

[CGC04] Chang, H. S., Givan, R. L., and Chong, E. K. P., 2004. “Parallel Rollout for Online Solution of Partially Observable Markov Decision Processes,” Discrete Event Dynamic Systems, Vol. 14, pp. 309-341.

[CLT19] Chapman, M. P., Lacotte, J., Tamar, A., Lee, D., Smith, K. M., Cheng, V., Fisac, J. F., Jha, S., Pavone, M., and Tomlin, C. J., 2019. “A Risk-Sensitive Finite-Time Reachability Approach for Safety of Stochastic Dynamic Systems,” arXiv preprint arXiv:1902.11277.

[CMT87a] Clarke, D. W., Mohtadi, C., and Tuffs, P. S., 1987. “Generalized Predictive Control - Part I. The Basic Algorithm,” Automatica, Vol. 23, pp. 137-148.

[CMT87b] Clarke, D. W., Mohtadi, C., and Tuffs, P. S., 1987. “Generalized Predictive Control - Part II,” Automatica, Vol. 23, pp. 149-160.

[CRV06] Cogill, R., Rotkowitz, M., Van Roy, B., and Lall, S., 2006. “An Approximate Dynamic Programming Approach to Decentralized Control of Stochastic Systems,” in Control of Uncertain Systems: Modelling, Approximation, and Design, Springer, Berlin, pp. 243-256.

[CXL19] Chu, Z., Xu, Z., and Li, H., 2019. “New Heuristics for the RCPSP with Multiple Overlapping Modes,” Computers and Industrial Engineering, Vol. 131, pp. 146-156.

[CaB07] Camacho, E. F., and Bordons, C., 2007. Model Predictive Control, 2nd Edition, Springer, New York, N. Y.

[Can16] Candy, J. V., 2016. Bayesian Signal Processing: Classical, Modern, and Particle Filtering Methods, Wiley-IEEE Press.

[Cao07] Cao, X. R., 2007. Stochastic Learning and Optimization: A Sensitivity-Based Approach, Springer, N. Y.


[ChC17] Chui, C. K., and Chen, G., 2017. Kalman Filtering, Springer International Publishing.

[ChS00] Christianini, N., and Shawe-Taylor, J., 2000. Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge Univ. Press.

[Che59] Chernoff, H., 1959. “Sequential Design of Experiments,” The Annals of Mathematical Statistics, Vol. 30, pp. 755-770.

[Chr97] Christodouleas, J. D., 1997. “Solution Methods for Multiprocessor Network Scheduling Problems with Application to Railroad Operations,” Ph.D. Thesis, Operations Research Center, Massachusetts Institute of Technology.

[Cou06] Coulom, R., 2006. “Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search,” International Conference on Computers and Games, Springer, pp. 72-83.

[CrS00] Cristianini, N., and Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge Univ. Press.

[Cyb89] Cybenko, G., 1989. “Approximation by Superpositions of a Sigmoidal Function,” Math. of Control, Signals, and Systems, Vol. 2, pp. 303-314.

[DDF19] Daubechies, I., DeVore, R., Foucart, S., Hanin, B., and Petrova, G., 2019. “Nonlinear Approximation and (Deep) ReLU Networks,” arXiv preprint arXiv:1905.02199.

[DFM12] Desai, V. V., Farias, V. F., and Moallemi, C. C., 2012. “Approximate Dynamic Programming via a Smoothed Approximate Linear Program,” Operations Research, Vol. 60, pp. 655-674.

[DFM13] Desai, V. V., Farias, V. F., and Moallemi, C. C., 2013. “Bounds for Markov Decision Processes,” in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press, Piscataway, N. J., pp. 452-473.

[DFV03] de Farias, D. P., and Van Roy, B., 2003. “The Linear Programming Approach to Approximate Dynamic Programming,” Operations Research, Vol. 51, pp. 850-865.

[DFV04] de Farias, D. P., and Van Roy, B., 2004. “On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming,” Mathematics of Operations Research, Vol. 29, pp. 462-478.

[DHS12] Duda, R. O., Hart, P. E., and Stork, D. G., 2012. Pattern Classification, J. Wiley, N. Y.

[DNW16] David, O. E., Netanyahu, N. S., and Wolf, L., 2016. “Deepchess: End-to-End Deep Neural Network for Automatic Learning in Chess,” in International Conference on Artificial Neural Networks, pp. 88-96.

[DeF04] De Farias, D. P., 2004. “The Linear Programming Approach to Approximate Dynamic Programming,” in Learning and Approximate Dynamic Programming, by J. Si, A. Barto, W. Powell, and D. Wunsch, (Eds.), IEEE Press, N. Y.

[DeK11] Devlin, S., and Kudenko, D., 2011. “Theoretical Considerations of Potential-Based Reward Shaping for Multi-Agent Systems,” in Proceedings of AAMAS.

[Den67] Denardo, E. V., 1967. “Contraction Mappings in the Theory Underlying Dynamic Programming,” SIAM Review, Vol. 9, pp. 165-177.

[DiL08] Dimitrakakis, C., and Lagoudakis, M. G., 2008. “Rollout Sampling Approximate Policy Iteration,” Machine Learning, Vol. 72, pp. 157-171.

[DiM10] Di Castro, D., and Mannor, S., 2010. “Adaptive Bases for Reinforcement Learning,” Machine Learning and Knowledge Discovery in Databases, Vol. 6321, pp. 312-327.


[DiW02] Dietterich, T. G., and Wang, X., 2002. “Batch Value Function Approximation via Support Vectors,” in Advances in Neural Information Processing Systems, pp. 1491-1498.

[DoJ09] Doucet, A., and Johansen, A. M., 2009. “A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later,” Handbook of Nonlinear Filtering, Oxford University Press, Vol. 12, p. 3.

[DrH01] Drezner, Z., and Hamacher, H. W., eds., 2001. Facility Location: Applications and Theory, Springer Science and Business Media.

[DuV99] Duin, C., and Voss, S., 1999. “The Pilot Method: A Strategy for Heuristic Repetition with Application to the Steiner Problem in Graphs,” Networks: An International Journal, Vol. 34, pp. 181-191.

[EDS18] Efroni, Y., Dalal, G., Scherrer, B., and Mannor, S., 2018. “Beyond the One-Step Greedy Approach in Reinforcement Learning,” in Proc. International Conf. on Machine Learning, pp. 1387-1396.

[EMM05] Engel, Y., Mannor, S., and Meir, R., 2005. “Reinforcement Learning with Gaussian Processes,” in Proc. of the 22nd ICML, pp. 201-208.

[FHS09] Feitzinger, F., Hylla, T., and Sachs, E. W., 2009. “Inexact Kleinman-Newton Method for Riccati Equations,” SIAM Journal on Matrix Analysis and Applications, Vol. 3, pp. 272-288.

[FIA03] Findeisen, R., Imsland, L., Allgower, F., and Foss, B. A., 2003. “State and Output Feedback Nonlinear Model Predictive Control: An Overview,” European Journal of Control, Vol. 9, pp. 190-206.

[FPB15] Farahmand, A. M., Precup, D., Barreto, A. M., and Ghavamzadeh, M., 2015. “Classification-Based Approximate Policy Iteration,” IEEE Trans. on Automatic Control, Vol. 60, pp. 2989-2993.

[FeV02] Ferris, M. C., and Voelker, M. M., 2002. “Neuro-Dynamic Programming for Radiation Treatment Planning,” Numerical Analysis Group Research Report NA-02/06, Oxford University Computing Laboratory, Oxford University.

[FeV04] Ferris, M. C., and Voelker, M. M., 2004. “Fractionation in Radiation Treatment Planning,” Mathematical Programming B, Vol. 102, pp. 387-413.

[Fel60] Feldbaum, A. A., 1960. “Dual Control Theory,” Automation and Remote Control, Vol. 21, pp. 874-1039.

[FoK09] Forrester, A. I., and Keane, A. J., 2009. “Recent Advances in Surrogate-Based Optimization,” Progress in Aerospace Sciences, Vol. 45, pp. 50-79.

[Fra18] Frazier, P. I., 2018. “A Tutorial on Bayesian Optimization,” arXiv preprint arXiv:1807.02811.

[Fu17] Fu, M. C., 2017. “Markov Decision Processes, AlphaGo, and Monte Carlo Tree Search: Back to the Future,” Leading Developments from INFORMS Communities, INFORMS, pp. 68-88.

[Fun89] Funahashi, K., 1989. “On the Approximate Realization of Continuous Mappings by Neural Networks,” Neural Networks, Vol. 2, pp. 183-192.

[GBC16] Goodfellow, I., Bengio, Y., and Courville, A., 2016. Deep Learning, MIT Press, Cambridge, MA.

[GBL19] Goodson, J. C., Bertazzi, L., and Levary, R. R., 2019. “Robust Dynamic Media Selection with Yield Uncertainty: Max-Min Policies and Dual Bounds,” Report.


[GDM19] Guerriero, F., Di Puglia Pugliese, L., and Macrina, G., 2019. “A Rollout Algorithm for the Resource Constrained Elementary Shortest Path Problem,” Optimization Methods and Software, Vol. 34, pp. 1056-1074.

[GGS13] Gabillon, V., Ghavamzadeh, M., and Scherrer, B., 2013. “Approximate Dynamic Programming Finally Performs Well in the Game of Tetris,” in NIPS, pp. 1754-1762.

[GGW11] Gittins, J., Glazebrook, K., and Weber, R., 2011. Multi-Armed Bandit Allocation Indices, J. Wiley, N. Y.

[GLG11] Gabillon, V., Lazaric, A., Ghavamzadeh, M., and Scherrer, B., 2011. “Classification-Based Policy Iteration with a Critic,” in Proc. of ICML.

[GSD06] Goodwin, G., Seron, M. M., and De Dona, J. A., 2006. Constrained Control and Estimation: An Optimisation Approach, Springer, N. Y.

[GSS93] Gordon, N. J., Salmond, D. J., and Smith, A. F., 1993. “Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation,” in IEE Proceedings, Vol. 140, pp. 107-113.

[GTA17] Gommans, T. M. P., Theunisse, T. A. F., Antunes, D. J., and Heemels, W. P. M. H., 2017. “Resource-Aware MPC for Constrained Linear Systems: Two Rollout Approaches,” Journal of Process Control, Vol. 51, pp. 68-83.

[GTO15] Goodson, J. C., Thomas, B. W., and Ohlmann, J. W., 2015. “Restocking-Based Rollout Policies for the Vehicle Routing Problem with Stochastic Demand and Duration Limits,” Transportation Science, Vol. 50, pp. 591-607.

[GTO17] Goodson, J. C., Thomas, B. W., and Ohlmann, J. W., 2017. “A Rollout Algorithm Framework for Heuristic Solutions to Finite-Horizon Stochastic Dynamic Programs,” European Journal of Operational Research, Vol. 258, pp. 216-229.

[Gos15] Gosavi, A., 2015. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2nd Edition, Springer, N. Y.

[Grz17] Grzes, M., 2017. “Reward Shaping in Episodic Reinforcement Learning,” in Proc. of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 565-573.

[GuM01] Guerriero, F., and Musmanno, R., 2001. “Label Correcting Methods to Solve Multicriteria Shortest Path Problems,” J. Optimization Theory Appl., Vol. 111, pp. 589-613.

[GuM03] Guerriero, F., and Mancini, M., 2003. “A Cooperative Parallel Rollout Algorithm for the Sequential Ordering Problem,” Parallel Computing, Vol. 29, pp. 663-677.

[Gup20] Gupta, A., 2020. “Existence of Team-Optimal Solutions in Static Teams with Common Information: A Topology of Information Approach,” SIAM J. on Control and Optimization, Vol. 58, pp. 998-1021.

[HJG16] Huang, Q., Jia, Q. S., and Guan, X., 2016. “Robust Scheduling of EV Charging Load with Uncertain Wind Power Integration,” IEEE Trans. on Smart Grid, Vol. 9, pp. 1043-1054.

[HLS06] Han, J., Lai, T. L., and Spivakovsky, V., 2006. “Approximate Policy Optimization and Adaptive Control in Regression Models,” Computational Economics, Vol. 27, pp. 433-452.

[HMR19] Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J., 2019. “Surprises in High-Dimensional Ridgeless Least Squares Interpolation,” arXiv preprint arXiv:1903.08560.


[HSS08] Hofmann, T., Scholkopf, B., and Smola, A. J., 2008. “Kernel Methods in Machine Learning,” The Annals of Statistics, Vol. 36, pp. 1171-1220.

[HSW89] Hornik, K., Stinchcombe, M., and White, H., 1989. “Multilayer Feedforward Networks are Universal Approximators,” Neural Networks, Vol. 2, pp. 359-366.

[HWM19] Hewing, L., Wabersich, K. P., Menner, M., and Zeilinger, M. N., 2019. “Learning-Based Model Predictive Control: Toward Safe Learning in Control,” Annual Review of Control, Robotics, and Autonomous Systems.

[Han98] Hansen, E. A., 1998. “Solving POMDPs by Searching in Policy Space,” in Proc. of the 14th Conf. on Uncertainty in Artificial Intelligence, pp. 211-219.

[Hay08] Haykin, S., 2008. Neural Networks and Learning Machines, 3rd Edition, Prentice-Hall, Englewood Cliffs, N. J.

[HeZ19] Hewing, L., and Zeilinger, M. N., 2019. “Scenario-Based Probabilistic Reachable Sets for Recursively Feasible Stochastic Model Predictive Control,” IEEE Control Systems Letters, Vol. 4, pp. 450-455.

[Hew71] Hewer, G., 1971. “An Iterative Technique for the Computation of the Steady State Gains for the Discrete Optimal Regulator,” IEEE Trans. on Automatic Control, Vol. 16, pp. 382-384.

[Ho80] Ho, Y. C., 1980. “Team Decision Theory and Information Structures,” Proceedings of the IEEE, Vol. 68, pp. 644-654.

[HuM16] Huan, X., and Marzouk, Y. M., 2016. “Sequential Bayesian Optimal Experimental Design via Approximate Dynamic Programming,” arXiv preprint arXiv:1604.08320.

[Hua15] Huan, X., 2015. Numerical Approaches for Sequential Bayesian Optimal Experimental Design, Ph.D. Thesis, MIT.

[Hyl11] Hylla, T., 2011. Extension of Inexact Kleinman-Newton Methods to a General Monotonicity Preserving Convergence Theory, PhD Thesis, Univ. of Trier.

[IFT19] Issakkimuthu, M., Fern, A., and Tadepalli, P., 2019. “The Choice Function Framework for Online Policy Improvement,” arXiv preprint arXiv:1910.00614.

[IJT18] Iusem, Jofre, A., and Thompson, P., 2018. “Incremental Constraint Projection Methods for Monotone Stochastic Variational Inequalities,” Math. of Operations Research, Vol. 44, pp. 236-263.

[JGJ18] Jones, M., Goldstein, M., Jonathan, P., and Randell, D., 2018. “Bayes Linear Analysis of Risks in Sequential Optimal Design Problems,” Electronic Journal of Statistics, Vol. 12, pp. 4002-4031.

[JiJ17] Jiang, Y., and Jiang, Z. P., 2017. Robust Adaptive Dynamic Programming, J. Wiley, N. Y.

[Jon90] Jones, L. K., 1990. “Constructive Approximations for Neural Networks by Sigmoidal Functions,” Proceedings of the IEEE, Vol. 78, pp. 1586-1589.

[JuP07] Jung, T., and Polani, D., 2007. “Kernelizing LSPE(λ),” Proc. 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, Honolulu, Ha., pp. 338-345.

[KAC15] Kochenderfer, M. J., with Amato, C., Chowdhary, G., How, J. P., Davison Reynolds, H. J., Thornton, J. R., Torres-Carrasquillo, P. A., Ore, N. K., and Vian, J., 2015. Decision Making under Uncertainty: Theory and Application, MIT Press, Cambridge, MA.

[KAH15] Khashooei, B. A., Antunes, D. J., and Heemels, W. P. M. H., 2015. “Rollout Strategies for Output-Based Event-Triggered Control,” in Proc. 2015 European Control Conference, pp. 2168-2173.

[KLC98] Kaelbling, L. P., Littman, M. L., and Cassandra, A. R., 1998. “Planning andActing in Partially Observable Stochastic Domains,” Artificial Intelligence, Vol. 101, pp.99-134.

[KLM82a] Krainak, J. L. S. J. C., Speyer, J., and Marcus, S., 1982. “Static TeamProblems - Part I: Sufficient Conditions and the Exponential Cost Criterion,” IEEETransactions on Automatic Control, Vol. 27, pp. 839-848.

[KLM82b] Krainak, J. L. S. J. C., Speyer, J., and Marcus, S., 1982. “Static Team Prob-lems - Part II: Affine Control Laws, Projections, Algorithms, and the LEGT Problem,”IEEE Transactions on Automatic Control, Vol. 27, pp. 848-859.

[KLM96] Kaelbling, L. P., Littman, M. L., and Moore, A. W., 1996. “ReinforcementLearning: A Survey,” J. of Artificial Intelligence Res., Vol. 4, pp. 237-285.

[KMP06] Keller, P. W., Mannor, S., and Precup, D., 2006. “Automatic Basis FunctionConstruction for Approximate Dynamic Programming and Reinforcement Learning,”Proc. of the 23rd ICML, Pittsburgh, Penn.

[KaW94] Kall, P., and Wallace, S. W., 1994. Stochastic Programming, Wiley, Chichester,UK.

[KeG88] Keerthi, S. S., and Gilbert, E. G., 1988. “Optimal, Infinite Horizon FeedbackLaws for a General Class of Constrained Discrete Time Systems: Stability and Moving-Horizon Approximations,” J. Optimization Theory Appl., Vo. 57, pp. 265-293.

[Kle68] Kleinman, D. L., 1968. “On an Iterative Technique for Riccati Equation Com-putations,” IEEE Trans. Aut. Control, Vol. AC-13, pp. 114-115.

[KoC16] Kouvaritakis, B., and Cannon, M., 2016. Model Predictive Control: Classical,Robust and Stochastic, Springer, N. Y.

[KoG98] Kolmanovsky, I.. and Gilbert, E. G., 1998. “Theory and Computation of Distur-bance Invariant Sets for Discrete-Time Linear Systems,” Math. Problems in Engineering,Vol. 4, pp. 317-367.

[KoS06] Kocsis, L., and Szepesvari, C., 2006. “Bandit Based Monte-Carlo Planning,”Proc. of 17th European Conference on Machine Learning, Berlin, pp. 282-293.

[Kre19] Krener, A. J., 2019. “Adaptive Horizon Model Predictive Control and Al’brekht’sMethod,” arXiv preprint arXiv:1904.00053.

[Kri16] Krishnamurthy, V., 2016. Partially Observed Markov Decision Processes, Cam-bridge Univ. Press.

[KuV86] Kumar, P. R., and Varaiya, P. P., 1986. Stochastic Systems: Estimation, Iden-tification, and Adaptive Control, Prentice-Hall, Englewood Cliffs, N. J.

[Kun14] Kung, S. Y., 2014. Kernel Methods and Machine Learning, Cambridge Univ. Press.

[LEC20] Lee, E. H., Eriksson, D., Cheng, B., McCourt, M., and Bindel, D., 2020. “Efficient Rollout Strategies for Bayesian Optimization,” arXiv preprint arXiv:2002.10539.

[LGM10] Lazaric, A., Ghavamzadeh, M., and Munos, R., 2010. “Analysis of a Classification-Based Policy Iteration Algorithm,” INRIA Report.

[LGW16] Lan, Y., Guan, X., and Wu, J., 2016. “Rollout Strategies for Real-Time Multi-Energy Scheduling in Microgrid with Storage System,” IET Generation, Transmission and Distribution, Vol. 10, pp. 688-696.

[LJM19] Li, Y., Johansson, K. H., and Martensson, J., 2019. “Lambda-Policy Iteration with Randomization for Contractive Models with Infinite Policies: Well Posedness and Convergence,” arXiv preprint arXiv:1912.08504.

[LLL19] Liu, Z., Lu, J., Liu, Z., Liao, G., Zhang, H. H., and Dong, J., 2019. “Patient Scheduling in Hemodialysis Service,” J. of Combinatorial Optimization, Vol. 37, pp. 337-362.

[LLP93] Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S., 1993. “Multilayer Feedforward Networks with a Nonpolynomial Activation Function can Approximate any Function,” Neural Networks, Vol. 6, pp. 861-867.

[LPS20] Liu, M., Pedrielli, G., Sulc, P., Poppleton, E., Bertsekas, D. P., 2020. “ExpertRNA: A New Framework for RNA Structure Prediction,” working paper, Arizona State University.

[LTZ19] Li, Y., Tang, Y., Zhang, R., and Li, N., 2019. “Distributed Reinforcement Learning for Decentralized Linear Quadratic Control: A Derivative-Free Policy Optimization Approach,” arXiv preprint arXiv:1912.09135.

[LWT17] Lowe, L., Wu, Y., Tamar, A., Harb, J., Abbeel, P., Mordatch, I., 2017. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments,” in Advances in Neural Information Processing Systems, pp. 6379-6390.

[LWW16] Lam, R., Willcox, K., and Wolpert, D. H., 2016. “Bayesian Optimization with a Finite Budget: An Approximate Dynamic Programming Approach,” in Advances in Neural Information Processing Systems, pp. 883-891.

[LWW17] Liu, D., Wei, Q., Wang, D., Yang, X., and Li, H., 2017. Adaptive Dynamic Programming with Applications in Optimal Control, Springer, Berlin.

[LaP03] Lagoudakis, M. G., and Parr, R., 2003. “Reinforcement Learning as Classification: Leveraging Modern Classifiers,” in Proc. of ICML, pp. 424-431.

[LaR85] Lai, T., and Robbins, H., 1985. “Asymptotically Efficient Adaptive Allocation Rules,” Advances in Applied Math., Vol. 6, pp. 4-22.

[LaW13] Lavretsky, E., and Wise, K., 2013. Robust and Adaptive Control with Aerospace Applications, Springer.

[LaW17] Lam, R., and Willcox, K., 2017. “Lookahead Bayesian Optimization with Inequality Constraints,” in Advances in Neural Information Processing Systems, pp. 1890-1900.

[LiS16] Liang, S., and Srikant, R., 2016. “Why Deep Neural Networks for Function Approximation?” arXiv preprint arXiv:1610.04161.

[LiW14] Liu, D., and Wei, Q., 2014. “Policy Iteration Adaptive Dynamic Programming Algorithm for Discrete-Time Nonlinear Systems,” IEEE Trans. on Neural Networks and Learning Systems, Vol. 25, pp. 621-634.

[LiW15] Li, H., and Womer, N. K., 2015. “Solving Stochastic Resource-Constrained Project Scheduling Problems by Closed-Loop Approximate Dynamic Programming,” European J. of Operational Research, Vol. 246, pp. 20-33.

[Lib11] Liberzon, D., 2011. Calculus of Variations and Optimal Control Theory: A Concise Introduction, Princeton Univ. Press.

[MCT10] Mishra, N., Choudhary, A. K., Tiwari, M. K., and Shankar, R., 2010. “Rollout Strategy-Based Probabilistic Causal Model Approach for the Multiple Fault Diagnosis,” Robotics and Computer-Integrated Manufacturing, Vol. 26, pp. 325-332.

[MMB02] McGovern, A., Moss, E., and Barto, A., 2002. “Building a Basic Building Block Scheduler Using Reinforcement Learning and Rollouts,” Machine Learning, Vol. 49, pp. 141-160.

[MMS05] Menache, I., Mannor, S., and Shimkin, N., 2005. “Basis Function Adaptation in Temporal Difference Reinforcement Learning,” Ann. Oper. Res., Vol. 134, pp. 215-238.

[MPK99] Meuleau, N., Peshkin, L., Kim, K. E., and Kaelbling, L. P., 1999. “Learning Finite-State Controllers for Partially Observable Environments,” in Proc. of the 15th Conference on Uncertainty in Artificial Intelligence, pp. 427-436.

[MPP04] Meloni, C., Pacciarelli, D., and Pranzo, M., 2004. “A Rollout Metaheuristic for Job Shop Scheduling Problems,” Annals of Operations Research, Vol. 131, pp. 215-235.

[MRR00] Mayne, D., Rawlings, J. B., Rao, C. V., and Scokaert, P. O. M., 2000. “Constrained Model Predictive Control: Stability and Optimality,” Automatica, Vol. 36, pp. 789-814.

[MVS19] Muthukumar, V., Vodrahalli, K., and Sahai, A., 2019. “Harmless Interpolation of Noisy Data in Regression,” arXiv preprint arXiv:1903.09139.

[MYF03] Moriyama, H., Yamashita, N., and Fukushima, M., 2003. “The Incremental Gauss-Newton Algorithm with Adaptive Stepsize Rule,” Computational Optimization and Applications, Vol. 26, pp. 107-141.

[MaJ15] Mastin, A., and Jaillet, P., 2015. “Average-Case Performance of Rollout Algorithms for Knapsack Problems,” J. of Optimization Theory and Applications, Vol. 165, pp. 964-984.

[Mac02] Maciejowski, J. M., 2002. Predictive Control with Constraints, Addison-Wesley, Reading, MA.

[Mar55] Marschak, J., 1955. “Elements for a Theory of Teams,” Management Science, Vol. 1, pp. 127-137.

[Mar84] Martins, E. Q. V., 1984. “On a Multicriteria Shortest Path Problem,” European J. of Operational Research, Vol. 16, pp. 236-245.

[May14] Mayne, D. Q., 2014. “Model Predictive Control: Recent Developments and Future Promise,” Automatica, Vol. 50, pp. 2967-2986.

[MeB99] Meuleau, N., and Bourgine, P., 1999. “Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty,” Machine Learning, Vol. 35, pp. 117-154.

[Mey07] Meyn, S., 2007. Control Techniques for Complex Networks, Cambridge Univ. Press, N. Y.

[Min22] Minorsky, N., 1922. “Directional Stability of Automatically Steered Bodies,” J. Amer. Soc. Naval Eng., Vol. 34, pp. 280-309.

[MoL99] Morari, M., and Lee, J. H., 1999. “Model Predictive Control: Past, Present, and Future,” Computers and Chemical Engineering, Vol. 23, pp. 667-682.

[Mun14] Munos, R., 2014. “From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning,” Foundations and Trends in Machine Learning, Vol. 7, pp. 1-129.

[NHR99] Ng, A. Y., Harada, D., and Russell, S. J., 1999. “Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping,” in Proc. of the 16th International Conference on Machine Learning, pp. 278-287.

[NMT13] Nayyar, A., Mahajan, A., and Teneketzis, D., 2013. “Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach,” IEEE Transactions on Automatic Control, Vol. 58, pp. 1644-1658.

[NSE19] Nozhati, S., Sarkale, Y., Ellingwood, B., Chong, E. K., and Mahmoud, H., 2019. “Near-Optimal Planning Using Approximate Dynamic Programming to Enhance Post-Hazard Community Resilience Management,” Reliability Engineering and System Safety, Vol. 181, pp. 116-126.

[NaA12] Narendra, K. S., and Annaswamy, A. M., 2012. Stable Adaptive Systems, Courier Corporation.

[NaT19] Nayyar, A., and Teneketzis, D., 2019. “Common Knowledge and Sequential Team Problems,” IEEE Trans. on Automatic Control, Vol. 64, pp. 5108-5115.

[Ned11] Nedic, A., 2011. “Random Algorithms for Convex Minimization Problems,” Math. Programming, Ser. B, Vol. 129, pp. 225-253.

[OrS02] Ormoneit, D., and Sen, S., 2002. “Kernel-Based Reinforcement Learning,” Machine Learning, Vol. 49, pp. 161-178.

[PDB92] Pattipati, K. R., Deb, S., Bar-Shalom, Y., and Washburn, R. B., 1992. “A New Relaxation Algorithm and Passive Sensor Data Association,” IEEE Trans. Automatic Control, Vol. 37, pp. 198-213.

[PDC14] Pillonetto, G., Dinuzzo, F., Chen, T., De Nicolao, G., and Ljung, L., 2014. “Kernel Methods in System Identification, Machine Learning and Function Estimation: A Survey,” Automatica, Vol. 50, pp. 657-682.

[PPB01] Popp, R. L., Pattipati, K. R., and Bar-Shalom, Y., 2001. “m-Best SD Assignment Algorithm with Application to Multitarget Tracking,” IEEE Transactions on Aerospace and Electronic Systems, Vol. 37, pp. 22-39.

[PaR12] Papahristou, N., and Refanidis, I., 2012. “On the Design and Training of Bots to Play Backgammon Variants,” in IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 78-87.

[PaT00] Paschalidis, I. C., and Tsitsiklis, J. N., 2000. “Congestion-Dependent Pricing of Network Services,” IEEE/ACM Trans. on Networking, Vol. 8, pp. 171-184.

[PeG04] Peret, L., and Garcia, F., 2004. “On-Line Search for Solving Markov Decision Processes via Heuristic Sampling,” in Proc. of the 16th European Conference on Artificial Intelligence, pp. 530-534.

[PoB04] Poupart, P., and Boutilier, C., 2004. “Bounded Finite State Controllers,” in Advances in Neural Information Processing Systems, pp. 823-830.

[PoF08] Powell, W. B., and Frazier, P., 2008. “Optimal Learning,” in State-of-the-Art Decision-Making Tools in the Information-Intensive Age, INFORMS, pp. 213-246.

[PoR97] Poore, A. B., and Robertson, A. J. A., 1997. “New Lagrangian Relaxation Based Algorithm for a Class of Multidimensional Assignment Problems,” Computational Optimization and Applications, Vol. 8, pp. 129-150.

[PoR12] Powell, W. B., and Ryzhov, I. O., 2012. Optimal Learning, J. Wiley, N. Y.

[Poo94] Poore, A. B., 1994. “Multidimensional Assignment Formulation of Data Association Problems Arising from Multitarget Tracking and Multisensor Data Fusion,” Computational Optimization and Applications, Vol. 3, pp. 27-57.

[Pow11] Powell, W. B., 2011. Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd Edition, J. Wiley and Sons, Hoboken, N. J.

[Pre95] Prekopa, A., 1995. Stochastic Programming, Kluwer, Boston.

[PuB78] Puterman, M. L., and Brumelle, S. L., 1978. “The Analytic Theory of Policy Iteration,” in Dynamic Programming and Its Applications, M. L. Puterman (ed.), Academic Press, N. Y.

[PuB79] Puterman, M. L., and Brumelle, S. L., 1979. “On the Convergence of Policy Iteration in Stationary Dynamic Programming,” Mathematics of Operations Research, Vol. 4, pp. 60-69.

[PuS78] Puterman, M. L., and Shin, M. C., 1978. “Modified Policy Iteration Algorithms for Discounted Markov Decision Problems,” Management Sci., Vol. 24, pp. 1127-1137.

[PuS82] Puterman, M. L., and Shin, M. C., 1982. “Action Elimination Procedures for Modified Policy Iteration Algorithms,” Operations Research, Vol. 30, pp. 301-318.

[Put94] Puterman, M. L., 1994. Markovian Decision Problems, J. Wiley, N. Y.

[QHS05] Queipo, N. V., Haftka, R. T., Shyy, W., Goel, T., Vaidyanathan, R., and Tucker, P. K., 2005. “Surrogate-Based Analysis and Optimization,” Progress in Aerospace Sciences, Vol. 41, pp. 1-28.

[QuL19] Qu, G., and Li, N., 2019. “Exploiting Fast Decaying and Locality in Multi-Agent MDP with Tree Dependence Structure,” Proc. of 2019 CDC, Nice, France.

[RCR17] Rudi, A., Carratino, L., and Rosasco, L., 2017. “Falkon: An Optimal Large Scale Kernel Method,” in Advances in Neural Information Processing Systems, pp. 3888-3898.

[RMD17] Rawlings, J. B., Mayne, D. Q., and Diehl, M. M., 2017. Model Predictive Control: Theory, Computation, and Design, 2nd Ed., Nob Hill Publishing.

[RPF12] Ryzhov, I. O., Powell, W. B., and Frazier, P. I., 2012. “The Knowledge Gradient Algorithm for a General Class of Online Learning Problems,” Operations Research, Vol. 60, pp. 180-195.

[RSM08] Reisinger, J., Stone, P., and Miikkulainen, R., 2008. “Online Kernel Selection for Bayesian Reinforcement Learning,” in Proc. of the 25th International Conference on Machine Learning, pp. 816-823.

[RaR17] Rawlings, J. B., and Risbeck, M. J., 2017. “Model Predictive Control with Discrete Actuators: Theory and Application,” Automatica, Vol. 78, pp. 258-265.

[RaW06] Rasmussen, C. E., and Williams, C. K., 2006. Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA.

[Rad62] Radner, R., 1962. “Team Decision Problems,” Ann. Math. Statist., Vol. 33, pp. 857-881.

[RoB17] Rosolia, U., and Borrelli, F., 2017. “Learning Model Predictive Control for Iterative Tasks. A Data-Driven Control Framework,” IEEE Trans. on Automatic Control, Vol. 63, pp. 1883-1896.

[RoB19] Rosolia, U., and Borrelli, F., 2019. “Sample-Based Learning Model Predictive Control for Linear Uncertain Systems,” 58th Conference on Decision and Control (CDC), pp. 2702-2707.

[Rob52] Robbins, H., 1952. “Some Aspects of the Sequential Design of Experiments,” Bulletin of the American Mathematical Society, Vol. 58, pp. 527-535.

[Ros70] Ross, S. M., 1970. Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, CA.

[Ros12] Ross, S. M., 2012. Simulation, 5th Edition, Academic Press, Orlando, Fla.

[Rot79] Rothblum, U. G., 1979. “Iterated Successive Approximation for Sequential Decision Processes,” in Stochastic Control and Optimization, by J. W. B. van Overhagen and H. C. Tijms (eds), Vrije University, Amsterdam.

[RuK16] Rubinstein, R. Y., and Kroese, D. P., 2016. Simulation and the Monte Carlo Method, 3rd Edition, J. Wiley, N. Y.

[RuN16] Russell, S. J., and Norvig, P., 2016. Artificial Intelligence: A Modern Approach, Pearson Education Limited, Malaysia.

[RuS03] Ruszczynski, A., and Shapiro, A., 2003. “Stochastic Programming Models,” in Handbooks in Operations Research and Management Science, Vol. 10, pp. 1-64.

[SGC02] Savagaonkar, U., Givan, R., and Chong, E. K. P., 2002. “Sampling Techniques for Zero-Sum, Discounted Markov Games,” in Proc. 40th Allerton Conference on Communication, Control and Computing, Monticello, Ill.

[SGG15] Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B., and Geist, M., 2015. “Approximate Modified Policy Iteration and its Application to the Game of Tetris,” J. of Machine Learning Research, Vol. 16, pp. 1629-1676.

[SHB15] Simroth, A., Holfeld, D., and Brunsch, R., 2015. “Job Shop Production Planning under Uncertainty: A Monte Carlo Rollout Approach,” Proc. of the International Scientific and Practical Conference, Vol. 3, pp. 175-179.

[SHM16] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., and Dieleman, S., 2016. “Mastering the Game of Go with Deep Neural Networks and Tree Search,” Nature, Vol. 529, pp. 484-489.

[SHS17] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., and Lillicrap, T., 2017. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” arXiv preprint arXiv:1712.01815.

[SJL18] Soltanolkotabi, M., Javanmard, A., and Lee, J. D., 2018. “Theoretical Insights into the Optimization Landscape of Over-Parameterized Shallow Neural Networks,” IEEE Trans. on Information Theory, Vol. 65, pp. 742-769.

[SLA12] Snoek, J., Larochelle, H., and Adams, R. P., 2012. “Practical Bayesian Optimization of Machine Learning Algorithms,” in Advances in Neural Information Processing Systems, pp. 2951-2959.

[SLJ13] Sun, B., Luh, P. B., Jia, Q. S., Jiang, Z., Wang, F., and Song, C., 2013. “Building Energy Management: Integrated Control of Active and Passive Heating, Cooling, Lighting, Shading, and Ventilation Systems,” IEEE Trans. on Automation Science and Engineering, Vol. 10, pp. 588-602.

[SNC18] Sarkale, Y., Nozhati, S., Chong, E. K., Ellingwood, B. R., and Mahmoud, H., 2018. “Solving Markov Decision Processes for Network-Level Post-Hazard Recovery via Simulation Optimization and Rollout,” in 2018 IEEE 14th International Conference on Automation Science and Engineering, pp. 906-912.

[SSS17] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., and Chen, Y., 2017. “Mastering the Game of Go Without Human Knowledge,” Nature, Vol. 550, pp. 354-359.

[SSW16] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N., 2015. “Taking the Human Out of the Loop: A Review of Bayesian Optimization,” Proc. of IEEE, Vol. 104, pp. 148-175.

[SWM89] Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P., 1989. “Design and Analysis of Computer Experiments,” Statistical Science, pp. 409-423.

[SYL17] Saldi, N., Yuksel, S., and Linder, T., 2017. “Finite Model Approximations for Partially Observed Markov Decision Processes with Discounted Cost,” arXiv preprint arXiv:1710.07009.

[SZL08] Sun, T., Zhao, Q., Lun, P., and Tomastik, R., 2008. “Optimization of Joint Replacement Policies for Multipart Systems by a Rollout Framework,” IEEE Trans. on Automation Science and Engineering, Vol. 5, pp. 609-619.

[Sas02] Sasena, M. J., 2002. Flexibility and Efficiency Enhancements for Constrained Global Design Optimization with Kriging Approximations, PhD Thesis, Univ. of Michigan.

[ScS02] Scholkopf, B., and Smola, A. J., 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA.

[Sch13] Scherrer, B., 2013. “Performance Bounds for Lambda Policy Iteration and Application to the Game of Tetris,” J. of Machine Learning Research, Vol. 14, pp. 1181-1227.

[Sco10] Scott, S. L., 2010. “A Modern Bayesian Look at the Multi-Armed Bandit,” Applied Stochastic Models in Business and Industry, Vol. 26, pp. 639-658.

[Sec00] Secomandi, N., 2000. “Comparing Neuro-Dynamic Programming Algorithms for the Vehicle Routing Problem with Stochastic Demands,” Computers and Operations Research, Vol. 27, pp. 1201-1225.

[Sec01] Secomandi, N., 2001. “A Rollout Policy for the Vehicle Routing Problem with Stochastic Demands,” Operations Research, Vol. 49, pp. 796-802.

[Sec03] Secomandi, N., 2003. “Analysis of a Rollout Approach to Sequencing Problems with Stochastic Routing Applications,” J. of Heuristics, Vol. 9, pp. 321-352.

[ShC04] Shawe-Taylor, J., and Cristianini, N., 2004. Kernel Methods for Pattern Analysis, Cambridge Univ. Press.

[Sha50] Shannon, C., 1950. “Programming a Digital Computer for Playing Chess,” Phil. Mag., Vol. 41, pp. 356-375.

[SiK19] Singh, R., and Kumar, P. R., 2019. “Optimal Decentralized Dynamic Policies for Video Streaming over Wireless Channels,” arXiv preprint arXiv:1902.07418.

[StW91] Stewart, B. S., and White, C. C., 1991. “Multiobjective A∗,” J. ACM, Vol. 38, pp. 775-814.

[SuB18] Sutton, R., and Barto, A. G., 2018. Reinforcement Learning, 2nd Edition, MIT Press, Cambridge, MA.

[SuY19] Su, L., and Yang, P., 2019. “On Learning Over-Parameterized Neural Networks: A Functional Approximation Perspective,” in Advances in Neural Information Processing Systems, pp. 2637-2646.

[Sun19] Sun, R., 2019. “Optimization for Deep Learning: Theory and Algorithms,” arXiv preprint arXiv:1912.08957.

[Sze10] Szepesvari, C., 2010. Algorithms for Reinforcement Learning, Morgan and Claypool Publishers, San Francisco, CA.

[TCW19] Tseng, W. J., Chen, J. C., Wu, I. C., and Wei, T. H., 2019. “Comparison Training for Computer Chinese Chess,” IEEE Trans. on Games.

[TGL13] Tesauro, G., Gondek, D. C., Lenchner, J., Fan, J., and Prager, J. M., 2013. “Analysis of Watson’s Strategies for Playing Jeopardy!,” J. of Artificial Intelligence Research, Vol. 47, pp. 205-251.

[TRV16] Tu, S., Roelofs, R., Venkataraman, S., and Recht, B., 2016. “Large Scale Kernel Learning Using Block Coordinate Descent,” arXiv preprint arXiv:1602.05310.

[TaL20] Tanzanakis, A., and Lygeros, J., 2020. “Data-Driven Control of Unknown Systems: A Linear Programming Approach,” arXiv preprint arXiv:2003.00779.

[TeG96] Tesauro, G., and Galperin, G. R., 1996. “On-Line Policy Improvement Using Monte Carlo Search,” NIPS, Denver, CO.

[Tes89a] Tesauro, G. J., 1989. “Neurogammon Wins Computer Olympiad,” Neural Computation, Vol. 1, pp. 321-323.

[Tes89b] Tesauro, G. J., 1989. “Connectionist Learning of Expert Preferences by Comparison Training,” in Advances in Neural Information Processing Systems, pp. 99-106.

[Tes92] Tesauro, G. J., 1992. “Practical Issues in Temporal Difference Learning,” Machine Learning, Vol. 8, pp. 257-277.

[Tes94] Tesauro, G. J., 1994. “TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play,” Neural Computation, Vol. 6, pp. 215-219.

[Tes95] Tesauro, G. J., 1995. “Temporal Difference Learning and TD-Gammon,” Communications of the ACM, Vol. 38, pp. 58-68.

[Tes01] Tesauro, G. J., 2001. “Comparison Training of Chess Evaluation Functions,” in Machines that Learn to Play Games, Nova Science Publishers, pp. 117-130.

[Tes02] Tesauro, G. J., 2002. “Programming Backgammon Using Self-Teaching Neural Nets,” Artificial Intelligence, Vol. 134, pp. 181-199.

[ThS09] Thiery, C., and Scherrer, B., 2009. “Improvements on Learning Tetris with Cross-Entropy,” International Computer Games Association J., Vol. 32, pp. 23-33.

[TsV96] Tsitsiklis, J. N., and Van Roy, B., 1996. “Feature-Based Methods for Large-Scale Dynamic Programming,” Machine Learning, Vol. 22, pp. 59-94.

[Tse98] Tseng, P., 1998. “Incremental Gradient(-Projection) Method with Momentum Term and Adaptive Stepsize Rule,” SIAM J. on Optimization, Vol. 8, pp. 506-531.

[TuP03] Tu, F., and Pattipati, K. R., 2003. “Rollout Strategies for Sequential Fault Diagnosis,” IEEE Trans. on Systems, Man and Cybernetics, Part A, pp. 86-99.

[UGM18] Ulmer, M. W., Goodson, J. C., Mattfeld, D. C., and Hennig, M., 2018. “Offline-Online Approximate Dynamic Programming for Dynamic Vehicle Routing with Stochastic Requests,” Transportation Science, Vol. 53, pp. 185-202.

[Ulm17] Ulmer, M. W., 2017. Approximate Dynamic Programming for Dynamic Vehicle Routing, Springer, Berlin.

[VBC19] Vinyals, O., Babuschkin, I., Czarnecki, W. M., and thirty nine more authors, 2019. “Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning,” Nature, Vol. 575, p. 350.

[VPA09] Vrabie, D., Pastravanu, O., Abu-Khalaf, M., and Lewis, F. L., 2009. “Adaptive Optimal Control for Continuous-Time Linear Systems Based on Policy Iteration,” Automatica, Vol. 45, pp. 477-484.

[VVL13] Vrabie, D., Vamvoudakis, K. G., and Lewis, F. L., 2013. Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, The Institution of Engineering and Technology, London.

[Van76] Van Nunen, J. A., 1976. Contracting Markov Decision Processes, Mathematical Centre Report, Amsterdam.

[WCG02] Wu, G., Chong, E. K. P., and Givan, R. L., 2002. “Burst-Level Congestion Control Using Hindsight Optimization,” IEEE Transactions on Aut. Control, Vol. 47, pp. 979-991.

[WCG03] Wu, G., Chong, E. K. P., and Givan, R. L., 2003. “Congestion Control Using Policy Rollout,” Proc. 42nd IEEE CDC, Maui, Hawaii, pp. 4825-4830.

[WOB15] Wang, Y., O’Donoghue, B., and Boyd, S., 2015. “Approximate Dynamic Programming via Iterated Bellman Inequalities,” International J. of Robust and Nonlinear Control, Vol. 25, pp. 1472-1496.

[WaB14] Wang, M., and Bertsekas, D. P., 2014. “Incremental Constraint Projection Methods for Variational Inequalities,” Mathematical Programming, pp. 1-43.

[WaB16] Wang, M., and Bertsekas, D. P., 2016. “Stochastic First-Order Methods with Random Constraint Projection,” SIAM Journal on Optimization, Vol. 26, pp. 681-717.

[WaS00] de Waal, P. R., and van Schuppen, J. H., 2000. “A Class of Team Problems with Discrete Action Spaces: Optimality Conditions Based on Multimodularity,” SIAM J. on Control and Optimization, Vol. 38, pp. 875-892.

[Wat89] Watkins, C. J. C. H., 1989. Learning from Delayed Rewards, Ph.D. Thesis, Cambridge Univ., England.

[WeB99] Weaver, L., and Baxter, J., 1999. “Learning from State Differences: STD(λ),”Tech. Report, Dept. of Computer Science, Australian National University.

[WhS94] White, C. C., and Scherer, W. T., 1994. “Finite-Memory Suboptimal Design for Partially Observed Markov Decision Processes,” Operations Research, Vol. 42, pp. 439-455.

[Whi88] Whittle, P., 1988. “Restless Bandits: Activity Allocation in a Changing World,” J. of Applied Probability, pp. 287-298.

[Whi91] White, C. C., 1991. “A Survey of Solution Techniques for the Partially Observed Markov Decision Process,” Annals of Operations Research, Vol. 32, pp. 215-230.

[WiB93] Williams, R. J., and Baird, L. C., 1993. “Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems,” Report NU-CCS-93-11, College of Computer Science, Northeastern University, Boston, MA.

[Wie03] Wiewiora, E., 2003. “Potential-Based Shaping and Q-Value Initialization are Equivalent,” J. of Artificial Intelligence Research, Vol. 19, pp. 205-208.

[Wit68] Witsenhausen, H., 1968. “A Counterexample in Stochastic Optimum Control,” SIAM Journal on Control, Vol. 6, pp. 131-147.

[Wit71a] Witsenhausen, H., 1971. “On Information Structures, Feedback and Causality,” SIAM J. Control, Vol. 9, pp. 149-160.

[Wit71b] Witsenhausen, H., 1971. “Separation of Estimation and Control for Discrete Time Systems,” Proceedings of the IEEE, Vol. 59, pp. 1557-1566.

[YDR04] Yan, X., Diaconis, P., Rusmevichientong, P., and Van Roy, B., 2004. “Solitaire: Man Versus Machine,” Advances in Neural Information Processing Systems, Vol. 17, pp. 1553-1560.

[YYM20] Yu, L., Yang, H., Miao, L., and Zhang, C., 2019. “Rollout Algorithms for Resource Allocation in Humanitarian Logistics,” IISE Transactions, Vol. 51, pp. 887-909.

[Yar17] Yarotsky, D., 2017. “Error Bounds for Approximations with Deep ReLU Networks,” Neural Networks, Vol. 94, pp. 103-114.

[YuB08] Yu, H., and Bertsekas, D. P., 2008. “On Near-Optimality of the Set of Finite-State Controllers for Average Cost POMDP,” Math. of OR, Vol. 33, pp. 1-11.

[YuB09] Yu, H., and Bertsekas, D. P., 2009. “Basis Function Adaptation Methods for Cost Approximation in MDP,” Proceedings of 2009 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2009), Nashville, Tenn.

[YuB13] Yu, H., and Bertsekas, D. P., 2013. “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems,” Annals of Operations Research, Vol. 208, pp. 95-132.

[YuB15] Yu, H., and Bertsekas, D. P., 2015. “A Mixed Value and Policy Iteration Method for Stochastic Control with Universally Measurable Policies,” Math. of OR, Vol. 40, pp. 926-968.

[YuK20] Yue, X., and Kontar, R. A., 2020. “Lookahead Bayesian Optimization via Rollout: Guarantees and Sequential Rolling Horizons,” arXiv preprint arXiv:1911.01004.

[ZBH16] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O., 2016. “Understanding Deep Learning Requires Rethinking Generalization,” arXiv preprint arXiv:1611.03530.

[ZOT18] Zhang, S., Ohlmann, J. W., and Thomas, B. W., 2018. “Dynamic Orienteering on a Network of Queues,” Transportation Science, Vol. 52, pp. 691-706.

[ZSG20] Zoppoli, R., Sanguineti, M., Gnecco, G., and Parisini, T., 2020. Neural Approximations for Optimal Control and Decision, Springer.

[ZuS81] Zuker, M., and Stiegler, P., 1981. “Optimal Computer Folding of Large RNA Sequences Using Thermodynamics and Auxiliary Information,” Nucleic Acids Res., Vol. 9, pp. 133-148.