
An Optimal Approximate Dynamic Programming Algorithm for the Energy Dispatch Problem with Grid-Level Storage

Juliana M. Nascimento
Warren B. Powell

Department of Operations Research and Financial Engineering
Princeton University

September 7, 2009


Abstract

We consider the problem of modeling hourly dispatch and energy allocation decisions in the presence of grid-level storage. The model makes it possible to capture hourly, daily and seasonal variations in wind, solar and demand, while also modeling the presence of hydroelectric storage to smooth energy production over both daily and seasonal cycles. The problem is formulated as a stochastic, dynamic program, where we approximate the value of stored energy. We provide a rigorous convergence proof for an approximate dynamic programming algorithm, which can capture the presence of both the amount of energy held in storage as well as other exogenous variables. Our algorithm exploits the natural concavity of the problem to avoid any need for explicit exploration policies.


1 Introduction

There is a long history of modeling energy dispatch using deterministic optimization models to capture the economic behavior of energy allocation and electric power dispatch. Linear programming has been used for energy policy modeling in models such as NEMS (NEMS (2003)), MARKAL (Fishbone & Abilock (1981)), MESSAGE III (Messner & Strubegger (1995)) and Meta*Net (Lamont (1997)). These models typically work in very aggregate time steps, such as a year or more, which require that average values be used for intermittent energy sources such as wind and solar. However, it is well known (Lamont (2008)) that intermittent energy sources such as wind and solar look far more attractive when we use average production over periods as short as a few hours. Averaging reduces the peaks, where energy is often lost because we are unable to use brief bursts, and raises the valleys, which determine the amount of spinning reserve that must be maintained to supply demand in the event that these sources drop to their minimum. To accurately model the economics of intermittent energy, it is necessary to model energy supply and demand in time increments no greater than an hour.

Some authors recognize the importance of capturing hourly wind variations, but still use deterministic linear programming models which are allowed to make storage and allocation decisions knowing the entire future trajectory (Castronuovo & Lopes (2004)). Garcia-Gonzalez et al. (2008) and Brown et al. (2008) use the idea of stochastic wind scenarios, but allow decisions to "see" the wind in the future within a particular scenario (violating what are known as nonanticipativity conditions). These models will also overstate the value of energy from intermittent sources by being able to make adjustments to other sources of energy in anticipation of future fluctuations in wind and solar. For this reason, it is important to capture not only the variability of wind and solar, but also the uncertainty. Furthermore, uncertainty arises from a number of sources such as demand, prices and rainfall, among others.

We address the problem of allocating energy resources over both the electric power grid (generally referred to as the electric power dispatch problem) as well as other forms of energy allocation (conversion of biomass to liquid fuels, conversion of petroleum to gasoline or electricity, and the use of natural gas in home heating or electric power generation). Determining how much energy can be converted from each source to serve each type of demand can be modeled as a series of single-period linear programs linked by a single, scalar storage variable, as would be the case with grid-level storage.

Let $R_t$ be the scalar quantity of stored energy on hand, and let $W_t$ be a vector-valued stochastic process describing exogenously varying parameters such as available energy from wind and solar, demand of different types (with hourly, daily and weekly patterns), energy prices and rainfall. We assume that $W_t$ is Markovian so that we can model processes such as wind, prices and demand where the future (say, the price $p_{t+1}$ at time $t+1$) depends on the current price $p_t$ plus an exogenous change $\hat{p}_{t+1}$, allowing us to write $p_{t+1} = p_t + \hat{p}_{t+1}$. If $w_t$ is the wind at time $t$, $p_t$ is a price (or vector of prices), and $D_t$ is the demand, we would let $W_t = (w_t, p_t, D_t)$, where the future of each stochastic process depends on the current value. The state of our system, then, is given by $S_t = (R_t, W_t)$.

In our problem, $x_t$ is a vector-valued control giving the allocation of different sources of energy (oil, coal, wind, solar, nuclear, etc.) over different energy pathways (conversion to electricity, transmission over the electric power grid, conversion of biomass to liquid fuel) and to satisfy different types of demands. $x_t$ also includes the amount of energy stored in an aggregate, grid-level storage device such as a reservoir for hydroelectric power.

An optimal policy is described by Bellman's equation

$$V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \left( C(S_t, x_t) + \gamma \mathbb{E}\left[ V_{t+1}(S_{t+1}) \mid S_t \right] \right) \qquad (1)$$

where $S_{t+1} = f(S_t, x_t, W_{t+1})$ and $C(S_t, x_t)$ is a contribution function that is linear in $x_t$. We use the convention that any variable indexed by $t$ is $\mathcal{F}_t$-measurable, where $\mathcal{F}_t$ is the sigma-algebra generated by $W_1, W_2, \ldots, W_t$.

We are going to assume that $W_t$ is a set of discrete random variables, which includes exogenous energy supply (such as wind and solar), demand, prices, rainfall and any other uncertain parameters. When the supplies and demands are discrete (in particular, when they can be written as a scaling factor times an integer), it is possible to show that $V_t(S_t) = V_t(R_t, W_t)$ can be written as a piecewise linear function in $R_t$. We treat $x_t$ as a continuous variable, but we will show that the optimal policy involves solving a linear program which returns an optimal $x_t$ that is also defined on a discrete set of outcomes (the extreme point solutions of the linear program).

In this case, we can ensure that the solution $x_t$ also takes on discrete values, but we solve the optimization problem as a continuous problem using a linear programming solver. Thus, we can view $x_t$ and $R_t$ as continuous, while ensuring that they always take on discrete values.
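The extreme-point argument can be checked numerically: when the objective is linear and all bounds are integer, an LP solver returns a vertex solution, which is then integer-valued. A minimal sketch with made-up data (using `scipy.optimize.linprog`, our choice of solver, not the paper's):

```python
from scipy.optimize import linprog

# maximize 3*x1 + 2*x2  subject to  x1 + x2 <= 5,  0 <= x1 <= 3,  0 <= x2 <= 4.
# linprog minimizes, so we negate the objective.
res = linprog(c=[-3.0, -2.0],
              A_ub=[[1.0, 1.0]], b_ub=[5.0],
              bounds=[(0, 3), (0, 4)])

# The unique optimum is the vertex (3, 2): every coordinate is an integer
# even though x was treated as continuous.
print(res.x, -res.fun)
```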

Although $S_t$ may have only five or ten dimensions, this is often enough to render (1) computationally intractable. There is an extensive literature in approximate dynamic programming which assumes either discrete action spaces, or low dimensional, continuous controls. For our problem, $x_t$ can easily have hundreds or thousands of dimensions, depending on the level of aggregation. In addition, we are unable to compute the expectation analytically, introducing the additional complication of maximizing an expectation that is itself computationally intractable. The problem of vector-valued state, outcome and action spaces is known as the three curses of dimensionality (Powell (2007)).

We propose a Monte Carlo algorithm based on the principles of approximate dynamic programming. We assume that $C(S_t, x_t)$ is linear in $x_t$ given $S_t$. Now define $F_t(R_t)$ by

$$F_t(R_t) = \max_{x_t \in \mathcal{X}_t} C(S_t, x_t)$$

subject to a flow conservation constraint involving the stored energy $R_t$. It is well known (see, e.g., Vanderbei (1997)) that $F_t(R_t)$ is piecewise linear and concave in $R_t$. Backward induction can then be used to show that the value function $V_t(S_t) = V_t(R_t, W_t)$ is piecewise linear and concave in the resource dimension $R_t$. We exploit this concavity to show almost sure convergence of an ADP algorithm that uses pure exploitation. This is important for higher dimensional problems where exploration steps, which may be useful for proving convergence, can be extremely slow in practice (imagine inserting an exploration step when there are millions of states). This algorithmic strategy has proven to be very successful for industrial applications (see, for example, Godfrey & Powell (2002), Topaloglu & Powell (2006)) and most recently in an energy model that is the motivating application for this paper (Powell et al. (2009)), but none of these prior papers provided any formal convergence results.

Our work is motivated by a particular energy application, but the problem class is more general. For example, the theory would apply to problems in logistics involving allocation of a single commodity from multiple suppliers to multiple consumers, with a single inventory linking time periods. We were recently introduced to such an application in Europe involving the distribution of cement provided by multiple suppliers to numerous construction projects. There are many other applications involving the distribution of a single product. This technique has also been applied in transportation to problems with multiple resource classes (Topaloglu & Powell (2006)), but without any formal convergence results.

Our algorithmic strategy is limited to problems with linear contribution functions subject to constraints which can be discretized. For this problem class, $F_t(R_t)$ can be described as a piecewise linear, concave function. We exploit this property to show that we will visit, infinitely often, the slopes near the optimal solution, allowing us to prove convergence in our estimates of these slopes. Concavity allows us to avoid visiting other points infinitely often. Fortunately, there is an extremely wide class of problems that fit these conditions.

Our proof technique combines the convergence proof for a monotone mapping in Bertsekas & Tsitsiklis (1996) with the proof of the SPAR algorithm in Powell et al. (2004). Current proofs of convergence for approximate dynamic programming algorithms such as Q-learning (Tsitsiklis (1994), Jaakkola et al. (1994)) and optimistic policy iteration (Tsitsiklis (2002)) require that we visit states (and possibly actions) infinitely often. A convergence proof for a Real Time Dynamic Programming (Barto et al. (1995)) algorithm that considers a pure exploitation scheme is provided in Bertsekas & Tsitsiklis (1996) [Prop. 5.3 and 5.4], but it assumes that expected values can be computed and that the initial approximations are optimistic, which produces a type of forced exploration. We make no such assumptions, but it is important to emphasize that our result depends on the concavity of the optimal value functions.

Others have taken advantage of concavity (convexity for minimization) in the stochastic control literature. Hernandez-Lerma & Runggaldier (1994) presents a number of theoretical results for general stochastic control problems for the case with i.i.d. noise (an assumption we do not make), but does not present a specific algorithm that could be applied to our problem class. There is an extensive literature on stochastic control which assumes linear, additive noise (see, for example, Bertsekas (2005) and the references cited there). For our problem, noise arises in the costs and, in particular, in the constraints.

Other approaches to deal with our problem class would be different flavors of Benders decomposition (Van Slyke & Wets (1969), Higle & Sen (1991), Chen & Powell (1999)) and sample average approximation (SAA) (Shapiro (2003)). However, techniques based on stochastic programming such as Benders require that we create scenario trees to capture the trajectory of information. We are interested in modeling our problem over a full year to capture seasonal variations, producing a problem with 8,760 time periods. On the other hand, SAA relies on generating random samples outside of the optimization problems and then solving the corresponding deterministic problems using an appropriate optimization algorithm. Numerical experiments with the SAA approach applied to problems where an integer solution is required can be found in Ahmed & Shapiro (2002).

We begin in Section 2 with a brief summary of the energy policy model SMART. Section 3 gives the optimality equations and describes some properties of the value function. Section 4 describes the algorithm in detail. Section 5 gives the convergence proof, which is the heart of the paper. Section 6 describes the use of this algorithm in SMART, and shows that the approximate dynamic programming algorithm almost perfectly matches the solution provided by an optimal linear programming model for a deterministic problem. Section 7 concludes the paper.

2 An energy modeling application

Our problem is based on the SMART model (Powell et al. (2009)), a stochastic, multiscale model of multiple energy supplies and demands, together with an energy conversion network that captures both electric power transmission and activities such as biomass conversion. The model includes hydroelectric storage, which is able to store energy (in the form of water) over long periods of time. This makes it possible to study the storage of wind energy (which can be used to generate electricity to pump water back into the reservoir) along with the effects of seasonal rainfall. In this section, we provide only a sketch of this model, since we only need the structure of the problem and specific mathematical properties such as concavity.

We assume that the exogenous information process $W_t$ is Markov, and that it includes both the demand for energy as well as any other exogenous information. The demand is vector-valued, representing different types of energy demand, which we represent using $D_t = (D_{t1}, \ldots, D_{t,B^d})$, where $B^d$ is the number of different sources of demand. The demand does not need to be fully satisfied, but no backlogging is allowed (our energy model assumes that energy can be imported at a high price). $W_t$ also includes wind and solar energy production, energy commodity prices, and rainfall. We assume that all elements of $W_t \in \mathcal{W}_t$, including demands, have finite support, which means that they have been discretized at an appropriate level.

We denote by $R_t$ the scalar storage level right before a decision is taken. To simplify the presentation of the algorithm, we discretize the storage, and let the amount in storage be given by $\eta R_t$, where $\eta$ is an appropriately chosen scalar parameter and $R_t$ is a nonnegative integer. If the units of all the forms of energy are chosen carefully, we can let $\eta = 1$. We model the problem as if $R_t$ is continuous, but argue below that it will always take on integer values.

We define $S_t = (W_t, R_t)$ to be the pre-decision state (the information just before we make a decision), where we use $S_t$ and $(W_t, R_t)$ interchangeably. The decision $x_t = (x^d_t, x^s_t, x^{tr}_t)$ has three components. The first component, $x^d_t$, is a vector that represents how much to sell to satisfy the demands $D_t$. The second component, $x^s_t$, is also a vector, representing how much to buy from the $B^s$ different energy sources (coal, nuclear, natural gas, hydroelectric, ethanol, biomass, and so on). Finally, $x^{tr}_t$ is a scalar that denotes the amount of energy that should be transferred (discarded) in case the storage level is too high, even after demand has been satisfied.

6

Page 9: An Optimal Approximate Dynamic Programming ... - Castle Labs€¦ · to capture the economic behavior of energy allocation and electric power dispatch. Linear pro-gramming has been

The feasible set, which depends on the current state $(W_t, R_t)$, is given by

$$\mathcal{X}_t(W_t, R_t) = \Big\{ x^d_t \in \mathbb{R}^{B^d},\ x^s_t \in \mathbb{R}^{B^s},\ x^{tr}_t \in \mathbb{R} \ :$$
$$\sum_{i=1}^{B^d} x^d_{ti} + x^{tr}_t \le \eta R_t,$$
$$0 \le x^d_{ti} \le D_{ti}, \quad i = 1, \ldots, B^d,$$
$$0 \le x^s_{ti} \le M_{ti}, \quad i = 1, \ldots, B^s,$$
$$0 \le x^s_{ti} \le u^s_t(W_t), \quad i = 1, \ldots, B^s,$$
$$0 \le x^{tr}_t \le u^{tr}_t(W_t) \Big\},$$

where $u^s_t(\cdot)$ and $u^{tr}_t(\cdot)$ are nonnegative functions which take on discrete values and $M_{ti}$ is a deterministic bound. The first constraint indicates that backlogging is not allowed, the second indicates that we do not sell more than is demanded, and the third indicates that each supplier can only offer a limited amount of resources. The last two constraints impose upper bounds that vary with the information vector $W_t$.

The contribution generated by the decision is given by

$$C_t(W_t, R_t, x_t) = -\sum_{i=1}^{B^d} c^u_{ti}(W_t)\,(D_{ti} - x^d_{ti}) - c^o_t(W_t)\Big(R_t - \sum_{i=1}^{B^d} x^d_{ti}\Big) - c^{tr}_t(W_t)\, x^{tr}_t - \sum_{i=1}^{B^s} c^s_{ti}(W_t)\, x^s_{ti} + \sum_{i=1}^{B^d} c^d_{ti}(W_t)\, x^d_{ti},$$

where $c^u_{ti}(\cdot)$, $c^o_t(\cdot)$, $c^{tr}_t(\cdot)$, $c^s_{ti}(\cdot)$ and $c^d_{ti}(\cdot)$ are nonnegative scalar functions. The first three terms represent underage, overage and transfer costs, respectively. The fourth term represents the buying cost and the last one represents the money that is made satisfying demand.

The decision takes the system to a new storage level, denoted by the scalar $R^x_t$, given by

$$R^x_t = f^x(R_t, x_t) \qquad (2)$$
$$= R_t - \sum_{i=1}^{B^d} x^d_{ti} + \sum_{i=1}^{B^s} x^s_{ti} - x^{tr}_t, \qquad (3)$$

where $f^x(R_t, x_t)$ is our transition function that returns the post-decision resource state.
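To make the single-period model concrete, the sketch below assembles the feasible set $\mathcal{X}_t$, the contribution $C_t$ and the transition $f^x$ into one LP (with $\eta = 1$ and invented costs and bounds; `scipy.optimize.linprog` is our choice of solver). Note that without a value function on the post-decision storage $R^x_t$, a single period never buys energy; purchases become attractive only once $\gamma \bar{V}$ is added, as in the sections that follow.

```python
import numpy as np
from scipy.optimize import linprog

def dispatch_lp(R, D, M, u_s, u_tr, c_u, c_o, c_tr, c_s, c_d):
    """Solve one period of the dispatch problem (eta = 1) as a linear program.

    Decision vector: [x^d (sales to each demand), x^s (purchases from each
    source), x^tr (energy transferred/discarded)].  The objective keeps only
    the terms of C_t that depend on x; constants such as -c_u . D do not
    change the maximizer.
    """
    Bd, Bs = len(D), len(M)
    # maximize sum_i (c_u+c_o+c_d)_i x^d_i - sum_i c_s_i x^s_i - c_tr x^tr
    c = np.concatenate([-(np.asarray(c_u) + c_o + np.asarray(c_d)),
                        np.asarray(c_s, dtype=float), [c_tr]])
    # storage availability: sum_i x^d_i + x^tr <= R
    A_ub = np.concatenate([np.ones(Bd), np.zeros(Bs), [1.0]]).reshape(1, -1)
    bounds = ([(0, d) for d in D] +
              [(0, min(m, u)) for m, u in zip(M, u_s)] +
              [(0, u_tr)])
    res = linprog(c, A_ub=A_ub, b_ub=[R], bounds=bounds)
    xd, xs, xtr = res.x[:Bd], res.x[Bd:Bd + Bs], res.x[-1]
    Rx = R - xd.sum() + xs.sum() - xtr   # transition f^x, equation (3)
    return xd, xs, xtr, Rx

# Two demand types, one supplier; all numbers invented.
xd, xs, xtr, Rx = dispatch_lp(R=8, D=[5, 4], M=[10], u_s=[10], u_tr=5.0,
                              c_u=[2.0, 1.0], c_o=0.5, c_tr=0.1,
                              c_s=[3.0], c_d=[4.0, 4.0])
print(xd, Rx)  # sells 5 and 3 units (storage exhausted), R^x = 0
```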

If the problem is deterministic, it can be formulated as a single linear program of the form

$$\max_x \sum_{t=0}^{T} C_t(W_t, R_t, x_t) \qquad (4)$$

subject to $x_t \in \mathcal{X}_t(W_t, R_t)$ and the transition equation (3). In Section 6, we compare against the optimal solution obtained from this formulation (again, assuming $W_t$ is deterministic) as a benchmark. When the problem is stochastic, we look for a decision function $X^\pi_t(S_t)$ (or policy) that solves

$$\max_{\pi \in \Pi} \mathbb{E} \sum_{t=0}^{T} C_t(W_t, R_t, X^\pi_t(S_t)). \qquad (5)$$

In this paper, we design the decision function using approximate dynamic programming, where $X^\pi_t(S_t)$ is given by

$$X^\pi_t(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \left( C(S_t, x_t) + \gamma \mathbb{E}\left[ \bar{V}_{t+1}(S_{t+1}) \mid S_t \right] \right),$$

where $\bar{V}_t(s)$ is an approximation of the value function at state $s$. This strategy introduces two challenges. The first is the presence of an expectation (which cannot be computed exactly) within the optimization problem. We can remove the expectation by using the concept of the post-decision state, $S^x_t = (W_t, R^x_t)$, which is the state immediately after we have made a decision but before any new information has arrived. Since $W_t$ is not affected by a decision, we only have to worry about $R^x_t$, which is the storage level after we have decided how much to add or subtract. The pre-decision state is then computed using

$$R_{t+1} = R^x_t + \hat{R}_{t+1},$$

where $\hat{R}_{t+1}$ might represent exogenous rainfall into our reservoir ($\hat{R}_{t+1}$ would be part of $W_{t+1}$). The post-decision state (see Powell (2007), chapters 4, 11 and 12) allows us to solve problems of the form

$$X^\pi_t(S_t) = \arg\max_{x_t \in \mathcal{X}_t} \left( C(S_t, x_t) + \gamma \bar{V}_{t+1}(W_t, R^x_t) \right), \qquad (6)$$

which is a deterministic linear program, where $R^x_t$ is given by equation (3).

The second challenge is computing the value function $\bar{V}_t(S^x_t) = \bar{V}_t(W_t, R^x_t)$. $\bar{V}_t(W_t, R)$ can be represented exactly as a piecewise linear function of $R$ indexed by $W_t$. Throughout our presentation, we will write $\bar{V}_t(W_t, R)$ for $R = 0, 1, \ldots, B^R$, by which we mean $R^x_t = \eta R$. We will treat $R$ as if it is continuous, but it is easy to show by induction that $V^x_t(W_t, R^x_t)$ is piecewise linear and concave in $R^x_t$. Since energy supplies and demands are all assumed to be discretized (with the same discretization parameter $\eta$), the optimal solution of the linear program in equation (6) has the property that the optimal solution $x_t$ will also be discrete, and can be represented as integers scaled by $\eta$. For this reason, assuming the initial amount in storage also takes on one of these values (such as zero), we are assured that $R^x_t$ (and $R_t$) will always take on a discrete value, even if we treat it as a continuous variable.

For our energy application, concavity of $\bar{V}_t(W_t, R)$ in $R$ is a significant structural property. In Section 6, we report on comparisons against the optimal solution from a deterministic linear program. To match this solution, we had to choose $\eta$ so that $B^R$ was on the order of 10,000, which means that the value function approximation could have as many as 10,000 segments (and one such function for each of 8,760 time periods). However, our algorithm requires that we visit only a small number of these segments (for each value of $W_t$), accelerating the algorithm by a factor of 1,000. For this reason, the computational performance of the algorithm would not suffer even if we were to discretize the function into a million segments.

Our algorithmic strategy can be extended to more complex problems in an approximate way. In Powell et al. (2009), energy is stored in a hydroelectric reservoir. If we extended the model to handle storage using compressed air, flywheels, batteries and compressed hydrogen, we would lose our convergence proof, because we would have to approximate the value of this vector of energy storage resources using separable, piecewise linear approximations. Topaloglu & Powell (2006) does exactly this for a problem managing a fleet of vehicles of different types, and shows that the use of separable, piecewise linear value functions provides very good, but not optimal, results.


3 The Optimal Value Functions

We define, recursively, the optimal value functions associated with the storage class. We denote by $V^*_t(W_t, R_t)$ the optimal value function around the pre-decision state $(W_t, R_t)$ and by $V^x_t(W_t, R^x_t)$ the optimal value function around the post-decision state $(W_t, R^x_t)$.

At time $t = T$, since it is the end of the planning horizon, the value of being in any state $(W_T, R^x_T)$ is zero. Hence, $V^x_T(W_T, R^x_T) = 0$. At time $t - 1$, for $t = T, \ldots, 1$, the value of being in any post-decision state $(W_{t-1}, R^x_{t-1})$ does not involve the solution of a deterministic optimization problem; it only involves an expectation, since the next pre-decision state $(W_t, R_t)$ depends only on the information that first becomes available at $t$. On the other hand, the value of being in any pre-decision state $(W_t, R_t)$ does not involve expectations, since the next post-decision state $(W_t, R^x_t)$ is a deterministic function of $W_t$; it only requires the solution of a deterministic optimization problem. Therefore,

$$V^x_{t-1}(W_{t-1}, R^x_{t-1}) = \mathbb{E}\left[ V^*_t(W_t, R_t) \mid (W_{t-1}, R^x_{t-1}) \right], \qquad (7)$$
$$V^*_t(W_t, R_t) = \max_{x_t \in \mathcal{X}_t(W_t, R_t)} \left( C_t(W_t, R_t, x_t) + \gamma V^x_t(W_t, R^x_t) \right). \qquad (8)$$

In the remainder of the paper, we only use the value function $V^x_t(S^x_t)$ defined around the post-decision state variable, since it allows us to make decisions by solving a deterministic problem as in (8).

We show that $V^x_t(W_t, R)$ is concave and piecewise linear in $R$ with $B^R$ discrete breakpoints. This structural property, combined with the optimization/expectation inversion, is the foundation of our algorithmic strategy and its proof of convergence.

For $W_t \in \mathcal{W}_t$, let $v_t(W_t) = (v_t(W_t, 1), \ldots, v_t(W_t, B^R))$ be a vector representing the slopes of a function $V_t(W_t, R) : [0, \infty) \to \mathbb{R}$ that is concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$. It is easy to see that the function $F^*_t(v_t(W_t), W_t, \cdot)$, where

$$F^*_t(v_t(W_t), W_t, R_t) = \max_{x_t, y_t} \; C_t(W_t, R_t, x_t) + \gamma \sum_{r=1}^{B^R} v_t(W_t, r)\, y_{tr}$$
$$\text{subject to } x_t \in \mathcal{X}_t(W_t, R_t), \quad \sum_{r=1}^{B^R} y_{tr} = f^x(R_t, x_t), \quad 0 \le y_t \le 1,$$

is also concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$, since the demand vector $D_t$ and the upper bounds $u^s_{ti}(\cdot)$, $u^{tr}_t(\cdot)$ and $M_{ti}$ in the constraint set $\mathcal{X}_t(W_t, R_t)$ are all discrete. Moreover, the optimal solution $(x^*_t, y^*_t)$ of the linear programming problem that defines $F^*_t(v_t(W_t), W_t, R_t)$ does not depend on $V_t(W_t, 0)$ and is a discrete vector whenever the argument $R_t$ is discrete. We also have that $F^*_t(v_t(W_t), W_t, R_t)$ is bounded for all $W_t \in \mathcal{W}_t$. We use $F^*_t$ to prove the following proposition about the optimal value function.

Proposition 1. For $t = 0, \ldots, T$ and information vector $W_t \in \mathcal{W}_t$, the optimal value function $V^x_t(W_t, R)$ is concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$. We denote its slopes by $v^*_t(W_t) = (v^*_t(W_t, 1), \ldots, v^*_t(W_t, B^R))$, where, for $R = 1, \ldots, B^R$ and $t < T$, $v^*_t(W_t, R)$ is given by

$$v^*_t(W_t, R) = V^x_t(W_t, R) - V^x_t(W_t, R-1)$$
$$= \mathbb{E}\big[ F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R + \hat{R}_{t+1}(W_{t+1})) - F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R - 1 + \hat{R}_{t+1}(W_{t+1})) \mid W_t \big]. \qquad (9)$$

Proof: The proof is by backward induction on $t$. The base case $t = T$ holds since $V^x_T(W_T, \cdot)$ is equal to zero for all $W_T \in \mathcal{W}_T$. For $t < T$, the proof is obtained by noting that

$$V^x_t(W_t, R^x_t) = \mathbb{E}\big[ \gamma V^x_{t+1}(W_{t+1}, 0) + F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R^x_t + \hat{R}_{t+1}(W_{t+1})) \mid W_t \big].$$

Due to the concavity of $V^x_t(W_t, \cdot)$, the slope vector $v^*_t(W_t)$ is monotone decreasing, that is, $v^*_t(W_t, R) \ge v^*_t(W_t, R+1)$ (throughout, we use $v^*_t(W_t)$ to refer to the slopes of the value function $V^x_t(W_t, R^x_t)$ defined around the post-decision state variable). Moreover, throughout the paper, we work with the translated version $V^*_t(W_t, \cdot)$ of $V^x_t(W_t, \cdot)$ given by

$$V^*_t(W_t, R + y) = \sum_{r=1}^{R} v^*_t(W_t, r) + y\, v^*_t(W_t, R+1),$$

where $R$ is a nonnegative integer and $0 \le y \le 1$, since the optimal solution $(x^*_t, y^*_t)$ associated with $F^*_{t+1}(v^*_{t+1}(W_{t+1}), W_{t+1}, R)$ does not depend on $V^x_t(W_t, 0)$.
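The translated value $V^*_t(W_t, R + y) = \sum_{r \le R} v^*_t(W_t, r) + y\, v^*_t(W_t, R+1)$ is exactly what the segment variables $y_{tr}$ in $F^*_t$ compute: with monotone decreasing slopes, the linear program over the $y$'s fills segments in order, so no solver is needed. A small sketch with invented slopes:

```python
import numpy as np

def segment_value(v, Rx):
    """Evaluate a concave, piecewise linear value function given its slopes.

    Solves  max sum_r v[r] * y[r]  s.t.  sum_r y[r] = Rx,  0 <= y[r] <= 1.
    Because v[0] >= v[1] >= ... (concavity), the optimal y fills the
    segments from the left, which the closed form below exploits.
    """
    v = np.asarray(v, dtype=float)
    assert np.all(np.diff(v) <= 0.0), "slopes must be monotone decreasing"
    y = np.clip(Rx - np.arange(len(v)), 0.0, 1.0)  # fill segments in order
    return float(v @ y)

v = [5.0, 3.0, 1.0]           # illustrative slopes of the 1st, 2nd, 3rd unit
print(segment_value(v, 2.0))  # 5 + 3 = 8.0
print(segment_value(v, 2.5))  # 5 + 3 + 0.5*1 = 8.5
```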

We next introduce the dynamic programming operator $H$ associated with the storage class. We define $H$ using the slopes of piecewise linear functions instead of the functions themselves. Let $v = \{v_t(W_t) \text{ for } t = 0, \ldots, T,\ W_t \in \mathcal{W}_t\}$ be a set of slope vectors, where $v_t(W_t) = (v_t(W_t, 1), \ldots, v_t(W_t, B^R))$. The dynamic programming operator $H$ maps a set of slope vectors $v$ into a new set $Hv$ as follows. For $t = 0, \ldots, T-1$, $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$,

$$(Hv)_t(W_t, R) = \mathbb{E}\big[ F^*_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R + \hat{R}_{t+1}(W_{t+1})) - F^*_{t+1}(v_{t+1}(W_{t+1}), W_{t+1}, R - 1 + \hat{R}_{t+1}(W_{t+1})) \mid W_t \big]. \qquad (10)$$

It is well known that the optimal value function in a dynamic program is unique. This is a trivial result for the finite horizon problems that we consider in this paper. Therefore, the set of slopes $v^*$ corresponding to the optimal value functions $V^*_t(W_t, R)$ for $t = 0, \ldots, T$, $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$ is unique. Moreover, the dynamic programming operator $H$ defined by (10) is assumed to satisfy the following conditions. Let $v = \{v_t(W_t)\}$ and $\tilde{v} = \{\tilde{v}_t(W_t)\}$, for $t = 0, \ldots, T$ and $W_t \in \mathcal{W}_t$, be sets of slope vectors such that $v_t(W_t) = (v_t(W_t, 1), \ldots, v_t(W_t, B^R))$ and $\tilde{v}_t(W_t) = (\tilde{v}_t(W_t, 1), \ldots, \tilde{v}_t(W_t, B^R))$ are monotone decreasing and $v_t(W_t) \le \tilde{v}_t(W_t)$. Then, for $W_t \in \mathcal{W}_t$ and $R = 1, \ldots, B^R$:

$$(Hv)_t(W_t) \text{ is monotone decreasing}, \qquad (11)$$
$$(Hv)_t(W_t, R) \le (H\tilde{v})_t(W_t, R), \qquad (12)$$
$$(Hv)_t(W_t, R) - \eta \le (H(v - \eta e))_t(W_t, R) \le (H(v + \eta e))_t(W_t, R) \le (Hv)_t(W_t, R) + \eta, \qquad (13)$$

where $\eta$ is a positive integer and $e$ is a vector of ones. The conditions in equations (11) and (13) imply that the mapping $H$ is continuous (see the discussion in Bertsekas & Tsitsiklis (1996), page 158).

The dynamic programming operator $H$ and the associated conditions (11)-(13) are used later on to construct deterministic sequences that are provably convergent to the optimal slopes.

4 The SPAR-Storage Algorithm

We propose a pure exploitation algorithm, namely the SPAR-Storage algorithm, that provably learns the optimal decisions to be taken at the parts of the state space that can be reached by an optimal policy, which are determined by the algorithm itself. This is accomplished by learning the slopes of the optimal value functions at important parts of the state space, through the construction of value function approximations $\bar{V}^n_t(W_t, \cdot)$ that are concave and piecewise linear with breakpoints $R = 1, \ldots, B^R$. The approximation is represented by its slopes $\bar{v}^n_t(W_t) = (\bar{v}^n_t(W_t, 1), \ldots, \bar{v}^n_t(W_t, B^R))$. Figure 1 illustrates the idea. To accomplish this, the algorithm combines Monte Carlo simulation in a pure exploitation scheme with stochastic approximation, integrated with a projection operation.

[Figure 1 (plot not reproduced): the unknown optimal value function $V^*_t(W_t, \cdot)$ and the constructed approximation $\bar{V}^n_t(W_t, \cdot)$ plotted against the asset dimension, with their slopes $v^*_t(W_t, R_{ti})$ and $\bar{v}^n_t(W_t, R_{ti})$ marked over an important region of the state space.]

Figure 1: Optimal value function and the constructed approximation

Figure 2 describes the SPAR-Storage algorithm. The algorithm requires an initial concave piecewise linear value function approximation $\bar{V}^0_t(W_t, \cdot)$, represented by its slopes $\bar{v}^0_t(W_t) = (\bar{v}^0_t(W_t, 1), \ldots, \bar{v}^0_t(W_t, B^R))$, for each information vector $W_t \in \mathcal{W}_t$. Therefore the initial slope vector $\bar{v}^0_t(W_t)$ has to be monotone decreasing. For example, it is valid to set all the initial slopes equal to zero. For completeness, since we know that the optimal value function at the end of the horizon is equal to zero, we set $\bar{v}^n_T(W_T, R) = 0$ for all iterations $n$, information vectors $W_T \in \mathcal{W}_T$ and asset levels $R = 1, \ldots, B^R$. The algorithm also requires an initial asset level to be used in all iterations. Thus, for all $n \ge 0$, $R^{x,n}_{-1}$ is set to be a nonnegative integer, as in STEP 0b.

At the beginning of each iteration $n$, the algorithm observes a sample realization of the information sequence $W^n_0, \ldots, W^n_T$, as in STEP 1. The sample can be obtained from a sample generator or actual data. After that, the algorithm steps over time periods $t = 0, \ldots, T$.

13

Page 16: An Optimal Approximate Dynamic Programming ... - Castle Labs€¦ · to capture the economic behavior of energy allocation and electric power dispatch. Linear pro-gramming has been

STEP 0: Algorithm initialization:
    STEP 0a: Initialize v̄_t^0(W_t), monotone decreasing, for t = 0, . . . , T − 1 and W_t ∈ W_t.
    STEP 0b: Set R_{-1}^{x,n} = r, a nonnegative integer, for all n ≥ 0.
    STEP 0c: Set n = 1.
STEP 1: Sample/observe the information sequence W_0^n, . . . , W_T^n.
Do for t = 0, . . . , T:
    STEP 2: Compute the pre-decision asset level: R_t^n = R_{t-1}^{x,n} + R̂_t(W_t^n).
    STEP 3: Find the optimal solution x_t^n of
        max_{x_t ∈ X_t(W_t^n, R_t^n)} C_t(W_t^n, R_t^n, x_t) + γ V̄_t^{n-1}(W_t^n, f^x(R_t^n, x_t)).
    STEP 4: Compute the post-decision asset level: R_t^{x,n} = f^x(R_t^n, x_t^n).
    STEP 5: Update slopes. If t < T then:
        STEP 5a: Observe v̂_{t+1}^n(R_t^{x,n}) and v̂_{t+1}^n(R_t^{x,n} + 1). See (14).
        STEP 5b: For W_t ∈ W_t and R = 1, . . . , B^R:
            z_t^n(W_t, R) = (1 − ᾱ_t^n(W_t, R)) v̄_t^{n-1}(W_t, R) + ᾱ_t^n(W_t, R) v̂_{t+1}^n(R).
        STEP 5c: Perform the projection operation v̄_t^n = Π_C(z_t^n). See (18).
STEP 6: Increase n by one and go to STEP 1.

Figure 2: SPAR-Storage algorithm
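The loop in Figure 2 can be sketched in code. The following is a minimal illustration, not the paper's implementation: every callable passed in (the sample generator, the exogenous supply R̂_t, the inner optimization of STEP 3, the transition f^x, the sample slopes of (14), the stepsize and the projection) is a hypothetical stand-in supplied by the caller.

```python
def spar_storage_iteration(v_bar, r_init, T, sample_info, exog_supply,
                           solve_step3, transition, sample_slopes,
                           stepsize, project):
    """One iteration (STEPs 1-5) of a SPAR-Storage sketch.

    v_bar[t][W] is the slope list (length B^R, monotone decreasing) of
    the concave piecewise linear approximation; it is updated in place.
    All callables are caller-supplied stand-ins for the paper's pieces.
    """
    W = sample_info()                              # STEP 1: W_0, ..., W_T
    R_post = r_init                                # R^{x,n}_{-1} = r
    for t in range(T + 1):
        R_pre = R_post + exog_supply(t, W[t])      # STEP 2: pre-decision level
        x = solve_step3(t, W[t], R_pre, v_bar)     # STEP 3: inner optimization
        R_post = transition(R_pre, x)              # STEP 4: R^{x,n}_t = f^x(R^n_t, x^n_t)
        if t < T:                                  # STEP 5: slope update
            v_lo, v_hi = sample_slopes(t, W, R_post, v_bar)  # 5a: slopes at R^x, R^x + 1
            z = list(v_bar[t][W[t]])
            a = stepsize(t, W[t], R_post)
            if 1 <= R_post <= len(z):              # 5b: smooth the two observed slopes
                z[R_post - 1] = (1 - a) * z[R_post - 1] + a * v_lo
            if R_post + 1 <= len(z):
                z[R_post] = (1 - a) * z[R_post] + a * v_hi
            v_bar[t][W[t]] = project(z, R_post)    # 5c: restore monotonicity
    return v_bar
```

Because only the two slopes at the visited post-decision level are smoothed (the stepsize rule of section 4), all other entries change only through the projection.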

First, the pre-decision asset level R_t^n is computed, as in STEP 2. Then the decision x_t^n, which is optimal with respect to the current pre-decision state (W_t^n, R_t^n) and value function approximation V̄_t^{n-1}(W_t^n, ·), is taken, as stated in STEP 3. We have that V̄_t^n(W_t, R + y) = Σ_{r=1}^{R} v̄_t^n(W_t, r) + y v̄_t^n(W_t, R + 1), where R is a nonnegative integer and 0 ≤ y ≤ 1. Next, taking the decision into account, the algorithm computes the post-decision asset level R_t^{x,n}, as in STEP 4.
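The evaluation formula above says that the approximation is the running sum of slopes plus a fractional last piece (with V̄(W, 0) normalized to zero). A direct sketch:

```python
def eval_approx(slopes, a):
    """Evaluate V(W, R + y) = sum_{r=1}^{R} v(W, r) + y * v(W, R + 1),
    where slopes[r - 1] holds the slope v(W, r), R = floor(a) is the
    integer part and y = a - R the fractional part; V(W, 0) = 0."""
    R = int(a)
    y = a - R
    value = sum(slopes[:R])        # running sum of the first R slopes
    if y > 0:
        value += y * slopes[R]     # fractional piece on slope R + 1
    return value
```

With monotone decreasing slopes, the resulting function is automatically concave in the resource level.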


Time period t is concluded by updating the slopes of the value function approximation. Steps 5a-5c describe the procedure and figure 3 illustrates it. Sample slopes relative to the post-decision states (W_t^n, R_t^{x,n}) and (W_t^n, R_t^{x,n} + 1) are observed; see STEP 5a and figure 3a. After that, these samples are used to update the approximation slopes v̄_t^{n-1}(W_t^n) through a temporary slope vector z_t^n(W_t^n). This procedure requires the use of a state-dependent stepsize rule, denoted by ᾱ_t^n(W_t, R), and it may lead to a violation of the property that the slopes are monotone decreasing; see STEP 5b and figure 3b. Thus, a projection operation Π_C is performed to restore the property, and updated slopes v̄_t^n(W_t^n) are obtained; see STEP 5c and figure 3c.

After the end of the planning horizon T is reached, the iteration counter is incremented,

as in STEP 6, and a new iteration is started from STEP 1.

We have that x_t^n is easily computed, since it is the first component of the optimal solution to the linear programming problem in the definition of F_t^*(v̄_t^{n-1}(W_t^n), W_t^n, R_t^n). Moreover, given our assumptions and the properties of F_t^* discussed in the previous section, it is clear that R_t^n, x_t^n and R_t^{x,n} are all integer valued. We also know that they are bounded. Therefore, the sequences of decisions, pre- and post-decision states generated by the algorithm, denoted by {x_t^n}_{n≥0}, {S_t^n}_{n≥0} = {(W_t^n, R_t^n)}_{n≥0} and {S_t^{x,n}}_{n≥0} = {(W_t^n, R_t^{x,n})}_{n≥0}, respectively, have at least one accumulation point. Since these are sequences of random variables, their accumulation points, denoted by x_t^*, S_t^* and S_t^{x,*}, respectively, are also random variables.

The sample slopes used to update the approximation slopes are obtained by replacing the expectation and the slopes v_{t+1}^*(W_{t+1}) of the optimal value function in (9) by a sample realization of the information W_{t+1}^n and the current slope approximation v̄_{t+1}^{n-1}(W_{t+1}^n), respectively. Thus, for t = 0, . . . , T − 1, the sample slope is given by

    v̂_{t+1}^n(R) = F_{t+1}^*(v̄_{t+1}^{n-1}(W_{t+1}^n), W_{t+1}^n, R + R̂_{t+1}(W_{t+1}^n))
                 − F_{t+1}^*(v̄_{t+1}^{n-1}(W_{t+1}^n), W_{t+1}^n, R − 1 + R̂_{t+1}(W_{t+1}^n)).   (14)

The update procedure is then divided into two parts. First, a temporary set of slope vectors z_t^n = {z_t^n(W_t, R) : W_t ∈ W_t, R = 1, . . . , B^R} is produced by combining the current approximation


[Figure 3 omitted. Panel 3a: the current approximate function F_t(v̄^{n-1}(W_t^n), W_t^n, R_t^n, x), the optimal decision x_t^n and the sampled slopes v̂_{t+1}^n(R_t^{x,n}) and v̂_{t+1}^n(R_t^{x,n} + 1), plotted against the exact function. Panel 3b: the temporary approximate function with slopes z_t^n(W_t^n, ·), showing a concavity violation. Panel 3c: the level projection operation, giving the updated slopes of V̄_t^n(W_t^n, ·) with concavity restored.]

Figure 3: Update procedure of the approximate slopes

and the sample slopes using the stepsize rule ᾱ_t^n(W_t, R). We have that

    ᾱ_t^n(W_t, R) = α_t^n 1_{W_t = W_t^n} (1_{R = R_t^{x,n}} + 1_{R = R_t^{x,n} + 1}),

where α_t^n is a scalar between 0 and 1 that can depend only on information that became available up until iteration n and time t. Moreover, on the event that (W_t^*, R_t^{x,*}) is an accumulation


point of {(W_t^n, R_t^{x,n})}_{n≥0}, we make the standard assumptions that

    Σ_{n=1}^{∞} ᾱ_t^n(W_t^*, R_t^{x,*}) = ∞ a.s.,   (15)

    Σ_{n=1}^{∞} (ᾱ_t^n(W_t^*, R_t^{x,*}))^2 ≤ B^α < ∞ a.s.,   (16)

where B^α is a constant. Clearly, the rule α_t^n = 1/N^{V,n}(W_t^*, R_t^{x,*}) satisfies all the conditions, where N^{V,n}(W_t^*, R_t^{x,*}) is the number of visits to state (W_t^*, R_t^{x,*}) up until iteration n. Furthermore, for all positive integers N,

    Π_{n=N}^{∞} (1 − ᾱ_t^n(W_t^*, R_t^{x,*})) = 0 a.s.   (17)

The proof of (17) follows directly from the fact that log(1 + x) ≤ x.
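For the visit-counting rule, condition (17) can also be checked directly: with ᾱ^m = 1/m the partial products telescope, Π_{m=2}^{M}(1 − 1/m) = (1/2)(2/3)···((M−1)/M) = 1/M, which vanishes as M grows. A quick numerical illustration:

```python
def partial_product(M, start=2):
    """Partial product prod_{m=start}^{M} (1 - 1/m).  For start = 2 the
    terms telescope to 1/M, so the product goes to zero as M grows,
    which is what condition (17) requires of the stepsizes."""
    p = 1.0
    for m in range(start, M + 1):
        p *= 1.0 - 1.0 / m
    return p
```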

The second part is the projection operation, where the temporary slope vector z_t^n(W_t), which may not be monotone decreasing, is transformed into another slope vector v̄_t^n(W_t) that has this structural property. The projection operator imposes the desired property by simply forcing the violating slopes to be equal to the newly updated ones. For W_t ∈ W_t and R = 1, . . . , B^R, the projection is given by

    Π_C(z_t^n)(W_t, R) =
        z_t^n(W_t^n, R_t^{x,n}),     if W_t = W_t^n, R < R_t^{x,n}, z_t^n(W_t, R) ≤ z_t^n(W_t^n, R_t^{x,n}),
        z_t^n(W_t^n, R_t^{x,n} + 1), if W_t = W_t^n, R > R_t^{x,n} + 1, z_t^n(W_t, R) ≥ z_t^n(W_t^n, R_t^{x,n} + 1),
        z_t^n(W_t, R),               otherwise.   (18)
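The projection (18) only touches slopes on the information vector just visited: violating entries below R_t^{x,n} are raised to the newly updated slope, and violating entries above R_t^{x,n} + 1 are lowered. A sketch on a 0-indexed slope list (slopes[R − 1] holds the slope at level R):

```python
def project_level(z, R_post):
    """Level projection Pi_C of (18) on the visited information vector.

    z[R - 1] holds the temporary slope z^n_t(W^n_t, R).  The entries at
    indices R_post - 1 and R_post were just re-estimated and are kept;
    violating slopes below are raised to z[R_post - 1] and violating
    slopes above are lowered to z[R_post], restoring the monotone
    decreasing (concavity) property."""
    v = list(z)
    lo, hi = R_post - 1, R_post           # indices of the two updated slopes
    for i in range(lo):                   # levels R < R^{x,n}_t
        if v[i] <= v[lo]:
            v[i] = v[lo]
    for i in range(hi + 1, len(v)):       # levels R > R^{x,n}_t + 1
        if v[i] >= v[hi]:
            v[i] = v[hi]
    return v
```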

For t = 0, . . . , T, information vector W_t ∈ W_t and asset level R = 1, . . . , B^R, the sequence of slopes of the value function approximation generated by the algorithm is denoted by {v̄_t^n(W_t, R)}_{n≥0}. Moreover, as the function F_t^*(v̄_t^{n-1}(W_t^n), W_t^n, R_t^n) is bounded and the stepsizes ᾱ_t^n are between 0 and 1, we can easily see that the sample slopes v̂_t^n(R), the temporary slopes z_t^n(W_t, R) and, consequently, the approximate slopes v̄_t^n(W_t, R) are all bounded. Therefore, the slope sequence {v̄_t^n(W_t, R)}_{n≥0} has at least one accumulation point, as the projection operation guarantees that the updated vectors of slopes are elements of a compact set. The accumulation points are random variables and are denoted by v̄_t^*(W_t, R), as opposed to the deterministic optimal slopes v_t^*(W_t, R).


The ability of the SPAR-Storage algorithm to avoid visiting all possible values of R_t was significant in our energy application. In our energy model in Powell et al. (2009), R_t was the energy stored in a hydroelectric reservoir which served as a source of storage for the entire grid. The algorithm required that we discretize R_t into approximately 10,000 elements. However, because of concavity, we need to visit only a small number of these a large number of times for each value of W_t.

An important practical issue is that we effectively have to visit the entire support of W_t infinitely often. If W_t contains a vector of as few as five or ten dimensions, this can be computationally expensive. In a practical application, we would use statistical methods to produce more robust estimates using smaller numbers of observations. For example, we could use the hierarchical aggregation strategy suggested by George et al. (2008) to produce estimates of V(W, R) at different levels of aggregation of W. These estimates can then be combined using a simple weighting formula. It is beyond the scope of this paper to analyze the convergence properties of such a strategy, but it is important to realize that these strategies are available to overcome the discretization of W_t.
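One simple version of such a weighting formula, loosely following the inverse-mean-squared-error idea in George et al. (2008): each aggregation level g of W supplies an estimate together with an estimated squared error, and the combined estimate weights each level inversely to that error. The error model here is illustrative only, not the formula of the cited paper.

```python
def combine_aggregates(estimates, mse):
    """Combine estimates of V(W, R) made at different aggregation levels
    of W.  Each level is weighted inversely to its estimated mean
    squared error (variance plus squared aggregation bias); this error
    model is an illustrative assumption, not George et al.'s formula."""
    inv = [1.0 / m for m in mse]
    total = sum(inv)
    return sum((w / total) * v for w, v in zip(inv, estimates))
```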

5 Convergence Analysis

We start this section by presenting the convergence results we want to prove. The major result is the almost sure convergence of the approximation slopes corresponding to states that are visited infinitely often. On the event that (W_t^*, R_t^{x,*}) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, we obtain

    v̄_t^n(W_t^*, R_t^{x,*}) → v_t^*(W_t^*, R_t^{x,*}) and v̄_t^n(W_t^*, R_t^{x,*} + 1) → v_t^*(W_t^*, R_t^{x,*} + 1) a.s.

As a byproduct of the previous result, we show that, for t = 0, . . . , T, on the event that (W_t^*, R_t^*, x_t^*) is an accumulation point of {(W_t^n, R_t^n, x_t^n)}_{n≥0},

    x_t^* = arg max_{x_t ∈ X_t(W_t^*, R_t^*)} C_t(W_t^*, R_t^*, x_t) + γ V_t^*(W_t^*, f^x(R_t^*, x_t)) a.s.,   (19)

where V_t^*(W_t^*, ·) is the translated optimal value function.


Equation (19) implies that the algorithm has learned, almost surely, an optimal decision for all states that can be reached by an optimal policy. This implication can be justified as follows. Pick ω in the sample space. We omit the dependence of the random variables on ω for the sake of clarity. For t = 0, since R_{-1}^{x,n} = r, a given constant, for all iterations of the algorithm, we have that R_{-1}^{x,*} = r. Moreover, all the elements in W_0 are accumulation points of {W_0^n}_{n≥0}, as W_0 has finite support. Thus, (19) tells us that the accumulation points x_0^* of the sequence {x_0^n}_{n≥0} along the iterations with pre-decision state (W_0^*, R_{-1}^{x,*} + R̂_0(W_0^*)) are in fact an optimal policy for period 0 when the information is W_0^*. This implies that all accumulation points R_0^{x,*} = f^x(R_{-1}^{x,*} + R̂_0(W_0^*), x_0^*) of {R_0^{x,n}}_{n≥0} are post-decision resource levels that can be reached by an optimal policy. By the same token, for t = 1, every element in W_1 is an accumulation point of {W_1^n}_{n≥0}. Hence, (19) tells us that the accumulation points x_1^* of the sequence {x_1^n}_{n≥0} along iterations with (W_1^n, R_1^n) = (W_1^*, R_0^{x,*} + R̂_1(W_1^*)) are indeed an optimal policy for period 1 when the information is W_1^* and the pre-decision resource level is R_1^* = R_0^{x,*} + R̂_1(W_1^*). As before, the accumulation points R_1^{x,*} = f^x(R_1^*, x_1^*) of {R_1^{x,n}}_{n≥0} are post-decision resource levels that can be reached by an optimal policy. The same reasoning applies for t = 2, . . . , T.

5.1 Outline of the Convergence Proofs

Our proofs build on the ideas presented in Bertsekas & Tsitsiklis (1996) and in Powell et

al. (2004). The first proves convergence assuming that all states are visited infinitely often.

The authors do not consider a concavity-preserving step, which is the key element that has

allowed us to obtain a convergence proof when a pure exploitation scheme is considered.

Although the framework in Powell et al. (2004) also considers the concavity of the optimal value functions in the resource dimension, the use of a projection operation to restore concavity, and a pure exploitation routine, their proof is restricted to two-stage problems. The difference is significant. In Powell et al. (2004), it was possible to assume that Monte Carlo estimates of the true value function were unbiased, a critical assumption in that paper. In our paper, estimates of the marginal value of additional resource at time t depend on a value function approximation at t + 1. Since this is an approximation, the estimates of marginal values at time t are biased.


The main idea in proving convergence of the approximation slopes to the optimal ones is to construct deterministic sequences of slopes, namely {L_t^k(W_t, R)}_{k≥0} and {U_t^k(W_t, R)}_{k≥0}, that are provably convergent to the slopes of the optimal value functions. These sequences are based on the dynamic programming operator H, as introduced in (10). We then use these sequences to prove, almost surely, that for all k ≥ 0,

    L_t^k(W_t^*, R_t^{x,*}) ≤ v̄_t^n(W_t^*, R_t^{x,*}) ≤ U_t^k(W_t^*, R_t^{x,*}),   (20)

    L_t^k(W_t^*, R_t^{x,*} + 1) ≤ v̄_t^n(W_t^*, R_t^{x,*} + 1) ≤ U_t^k(W_t^*, R_t^{x,*} + 1),   (21)

on the event that the iteration n is sufficiently large and (W_t^*, R_t^{x,*}) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}; this implies the convergence of the approximation slopes to the optimal ones.

Establishing (20) and (21) requires several intermediate steps that take into consideration the pure exploitation nature of our algorithm and the concavity-preserving operation. We give all the details in the proof of L_t^k(W_t^*, R_t^{x,*}) ≤ v̄_t^n(W_t^*, R_t^{x,*}) and L_t^k(W_t^*, R_t^{x,*} + 1) ≤ v̄_t^n(W_t^*, R_t^{x,*} + 1). The upper bound inequalities are obtained using a symmetrical argument.

First, we define two auxiliary stochastic sequences of slopes, namely the noise and the bounding sequences, denoted by {s_t^n(W_t, R)}_{n≥0} and {l_t^n(W_t, R)}_{n≥0}, respectively. The first sequence represents the noise introduced by the observation of the sample slopes, which replaces the observation of true expectations and the optimal slopes. The second one is a convex combination of the deterministic sequence L_t^k(W_t, R) and the transformed sequence (HL^k)_t(W_t, R).

We then define the set S_t to contain the states (W_t^*, R_t^{x,*}) and (W_t^*, R_t^{x,*} + 1) such that (W_t^*, R_t^{x,*}) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0} and the projection operation decreased, or kept the same, the corresponding unprojected slopes infinitely often. This is not the set of all accumulation points, since there are some points where the slope may have increased infinitely often.

The stochastic sequences s_t^n(W_t, R) and l_t^n(W_t, R) are used to show that, on the event that the iteration n is big enough and (W_t^*, R_t^{x,*}) is an element of the random set S_t,

    v̄_t^{n-1}(W_t^*, R_t^{x,*}) ≥ l_t^{n-1}(W_t^*, R_t^{x,*}) − s_t^{n-1}(W_t^*, R_t^{x,*}) a.s.


Then, on {(W_t^*, R_t^{x,*}) ∈ S_t}, the convergence to zero of the noise sequence, the convex combination property of the bounding sequence and the monotone decreasing property of the approximate slopes give us

    L_t^k(W_t^*, R_t^{x,*}) ≤ v̄_t^{n-1}(W_t^*, R_t^{x,*}) a.s.

Note that this inequality does not cover all the accumulation points of {(W_t^n, R_t^{x,n})}_{n≥0}, since it is restricted to states in the set S_t. Nevertheless, this inequality and some properties of the projection operation are used to fulfill the requirements of a bounding technical lemma, which is applied repeatedly to obtain the desired lower bound inequalities for all accumulation points.

In order to prove (19), we note that

    F_t(v̄_t^{n-1}(W_t^n), W_t^n, R_t^n, x_t) = C_t(W_t^n, R_t^n, x_t) + γ V̄_t^{n-1}(W_t^n, f^x(R_t^n, x_t))

is a concave function of x_t and X_t(W_t^n, R_t^n) is a convex set. Therefore,

    0 ∈ ∂F_t(v̄_t^{n-1}(W_t^n), W_t^n, R_t^n, x_t^n) − X_t^N(W_t^n, R_t^n, x_t^n),

where x_t^n is the optimal decision of the optimization problem in STEP 3 of the algorithm, ∂F_t(v̄_t^{n-1}(W_t^n), W_t^n, R_t^n, x_t^n) is the subdifferential of F_t(v̄_t^{n-1}(W_t^n), W_t^n, R_t^n, ·) at x_t^n and X_t^N(W_t^n, R_t^n, x_t^n) is the normal cone of X_t(W_t^n, R_t^n) at x_t^n. This inclusion and the first convergence result are then combined to prove, as shown in section 5.4, that

    0 ∈ ∂F_t(v̄_t^*(W_t^*), W_t^*, R_t^*, x_t^*) − X_t^N(W_t^*, R_t^*, x_t^*).

5.2 Technical Elements

In this section, we set the stage for the convergence proofs by defining some technical elements. We start with the definition of the deterministic sequence {L_t^k(W_t, R)}_{k≥0}. For this, we let B^v be a deterministic integer that bounds |v̂_t^n(R)|, |z_t^n(W_t, R)|, |v̄_t^n(W_t, R)| and |v_t^*(W_t, R)| for all W_t ∈ W_t and R = 1, . . . , B^R. Then, for t = 0, . . . , T − 1, W_t ∈ W_t and R = 1, . . . , B^R, we have that

    L_t^0(W_t, R) = v_t^*(W_t, R) − 2B^v and L_t^{k+1}(W_t, R) = (L_t^k(W_t, R) + (HL^k)_t(W_t, R)) / 2.   (22)
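To build intuition for why the averaging recursion in (22) converges monotonically from below (the content of properties (24)-(25)), consider a toy scalar operator H that is a γ-contraction with fixed point v*: then L^{k+1} − v* = ((1 + γ)/2)(L^k − v*), so the gap shrinks geometrically while remaining negative. A numeric sketch under this assumed H, which is only an illustration of the recursion, not the paper's operator:

```python
def bounding_sequence(v_star, B_v, gamma, k_max):
    """Iterate L^{k+1} = (L^k + H L^k) / 2 from L^0 = v* - 2 B_v for the
    toy operator H(L) = v* + gamma * (L - v*), a gamma-contraction with
    fixed point v*.  The gap L^k - v* is multiplied by (1 + gamma)/2 at
    each step, so L^k increases monotonically toward v* from below."""
    L = v_star - 2.0 * B_v
    traj = [L]
    for _ in range(k_max):
        HL = v_star + gamma * (L - v_star)   # toy stand-in for (H L^k)_t
        L = (L + HL) / 2.0                   # recursion (22)
        traj.append(L)
    return traj
```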


At the end of the planning horizon T, we have L_T^k(W_T, R) = 0 for all k ≥ 0. The proposition below states the required properties of the deterministic sequence {L_t^k(W_t, R)}_{k≥0}. Its proof is deferred to the appendix.

Proposition 2. For t = 0, . . . , T − 1, information vector W_t ∈ W_t and resource levels R = 1, . . . , B^R,

    L_t^k(W_t, R) is monotone decreasing,   (23)

    (HL^k)_t(W_t, R) ≥ L_t^{k+1}(W_t, R) ≥ L_t^k(W_t, R),   (24)

    L_t^k(W_t, R) < v_t^*(W_t, R), and lim_{k→∞} L_t^k(W_t, R) = v_t^*(W_t, R).   (25)

The deterministic sequence {U_t^k(W_t, R)}_{k≥0} is defined in a symmetrical way. It also has the properties stated in proposition 2, with the inequality signs reversed.

We move on to define the random index N that is used to indicate when an iteration of the algorithm is large enough for convergence analysis purposes. Let N be the smallest integer such that all states visited (and actions taken) by the algorithm after iteration N are accumulation points of the sequences of states (actions) generated by the algorithm. In fact, N can be required to satisfy other constraints of the type: if an event did not happen infinitely often, then it did not happen after N. Since we need N to be finite almost surely, the number of additional constraints has to be finite.

We introduce the set of iterations, namely N_t(W_t, R), that keeps track of the effects produced by the projection operation. For W_t ∈ W_t and R = 1, . . . , B^R, let N_t(W_t, R) be the set of iterations in which the unprojected slope corresponding to state (W_t, R), that is, z_t^n(W_t, R), was too large and had to be decreased by the projection operation. Formally,

    N_t(W_t, R) = {n ∈ IN : z_t^n(W_t, R) > v̄_t^n(W_t, R)}.

For example, based on figure 3c, n ∈ N_t(W_t^n, R_t^n + 2). A related set is the set of states S_t. A state (W_t, R) is an element of S_t if (W_t, R) is equal to an accumulation point (W_t^*, R_t^{x,*}) of {(W_t^n, R_t^{x,n})}_{n≥0} or is equal to (W_t^*, R_t^{x,*} + 1). Its corresponding approximate slope also has to satisfy the condition z_t^n(W_t, R) ≤ v̄_t^n(W_t, R) for all n ≥ N; that is, the projection operation decreased or kept the same the corresponding unprojected slopes infinitely often.


We close this section by dealing with measurability issues. Let (Ω, F, Prob) be the probability space under consideration. The sigma-algebra F is defined by F = σ{(W_t^n, x_t^n), n ≥ 1, t = 0, . . . , T}. Moreover, for n ≥ 1 and t = 0, . . . , T,

    F_t^n = σ({(W_{t'}^m, x_{t'}^m), 0 < m < n, t' = 0, . . . , T} ∪ {(W_{t'}^n, x_{t'}^n), t' = 0, . . . , t}).

Clearly, F_t^n ⊂ F_{t+1}^n and F_T^n ⊂ F_0^{n+1}. Furthermore, given the initial slopes v̄_t^0(W_t) and the initial resource level r, we have that R_t^n, R_t^{x,n} and ᾱ_t^n are in F_t^n, while v̂_{t+1}^n(R_t^x), z_t^n(W_t) and v̄_t^n(W_t) are in F_{t+1}^n. A pointwise argument is used in all the proofs of almost sure convergence presented in this paper. Thus, zero-measure events are discarded on an as-needed basis.

5.3 Almost sure convergence of the slopes

We prove that the approximation slopes produced by the SPAR-Storage algorithm converge

almost surely to the slopes of the optimal value functions of the storage class for states that

can be reached by an optimal policy. This result is stated in theorem 1 below. Along with

the proof of the theorem, we present the noise and the bounding stochastic sequences and

introduce three technical lemmas. Their proofs are given in the appendix so that the main

reasoning is not disrupted.

Before we present the theorem establishing the convergence of the value function, we introduce the following technical condition. Given k ≥ 0 and t = 0, . . . , T − 1, we assume there exists a positive random variable N_t^k such that, on {n ≥ N_t^k},

    (HL^k)_t(W_t^n, R_t^n) ≤ (Hv̄^{n-1})_t(W_t^n, R_t^n) ≤ (HU^k)_t(W_t^n, R_t^n) a.s.,   (26)

    (HL^k)_t(W_t^n, R_t^n + 1) ≤ (Hv̄^{n-1})_t(W_t^n, R_t^n + 1) ≤ (HU^k)_t(W_t^n, R_t^n + 1) a.s.   (27)

Proving this condition is nontrivial. It draws on a proof technique in Bertsekas & Tsitsiklis

(1996), Section 4.3.6, although this technique requires an exploration policy that ensures that

each state is visited infinitely often. Nascimento & Powell (2009) (section 6.1) show how this

proof can be adapted to a similar concave optimization problem that uses pure exploration.

The proof for our problem is essentially the same, and is not duplicated here.

Theorem 1. Assume the stepsize conditions (15)-(16). Also assume (26) and (27). Then, for all k ≥ 0 and t = 0, . . . , T, on the event that (W_t^*, R_t^{x,*}) is an accumulation point of


{(W_t^n, R_t^{x,n})}_{n≥0}, the sequences of slopes {v̄_t^n(W_t^*, R_t^{x,*})}_{n≥0} and {v̄_t^n(W_t^*, R_t^{x,*} + 1)}_{n≥0} generated by the SPAR-Storage algorithm for the storage class converge almost surely to the optimal slopes v_t^*(W_t^*, R_t^{x,*}) and v_t^*(W_t^*, R_t^{x,*} + 1), respectively.

Proof: As discussed in section 5.1, since the deterministic sequences {L_t^k(W_t, R_t^x)}_{k≥0} and {U_t^k(W_t, R_t^x)}_{k≥0} converge to the optimal slopes, the convergence of the approximation sequences is obtained by showing that for each k ≥ 0 there exists a nonnegative random variable N_t^{*,k} such that, on the event that n ≥ N_t^{*,k} and (W_t^*, R_t^{x,*}) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, we have

    L_t^k(W_t^*, R_t^{x,*}) ≤ v̄_t^n(W_t^*, R_t^{x,*}) ≤ U_t^k(W_t^*, R_t^{x,*}) a.s.,

    L_t^k(W_t^*, R_t^{x,*} + 1) ≤ v̄_t^n(W_t^*, R_t^{x,*} + 1) ≤ U_t^k(W_t^*, R_t^{x,*} + 1) a.s.

We concentrate on the inequalities L_t^k(W_t^*, R_t^{x,*}) ≤ v̄_t^n(W_t^*, R_t^{x,*}) and L_t^k(W_t^*, R_t^{x,*} + 1) ≤ v̄_t^n(W_t^*, R_t^{x,*} + 1). The upper bounds are obtained using a symmetrical argument.

The proof is by backward induction on t. The base case t = T is trivial, as L_T^k(W_T, R) = v̄_T^n(W_T, R) = 0 for all W_T ∈ W_T, R = 1, . . . , B^R, k ≥ 0 and iterations n ≥ 0. Thus, we can pick, for example, N_T^{*,k} = N, where N, as defined in section 5.2, is a random variable that indicates when an iteration of the algorithm is large enough for convergence analysis purposes. The backward induction proof is completed when we prove, for t = T − 1, . . . , 0 and k ≥ 0, that there exists N_t^{*,k} such that, on the event that n ≥ N_t^{*,k} and (W_t^*, R_t^{x,*}) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0},

    L_t^k(W_t^*, R_t^{x,*}) ≤ v̄_t^n(W_t^*, R_t^{x,*}) and L_t^k(W_t^*, R_t^{x,*} + 1) ≤ v̄_t^n(W_t^*, R_t^{x,*} + 1) a.s.   (28)

Given the induction hypothesis for t + 1, the proof for time period t is divided into two parts. In the first part, we prove for all k ≥ 0 that there exists a nonnegative random variable N_t^k such that

    L_t^k(W_t^*, R_t^{x,*}) ≤ v̄_t^{n-1}(W_t^*, R_t^{x,*}) a.s. on {n ≥ N_t^k, (W_t^*, R_t^{x,*}) ∈ S_t}.   (29)

Its proof is by induction on k. Note that it applies only to states in the random set S_t. Then, again for t, we take on the second part, which takes care of the states not covered by the first


part, proving the existence of a nonnegative random variable N_t^{*,k} such that the lower bound inequalities are true on {n ≥ N_t^{*,k}} for all accumulation points of {(W_t^n, R_t^{x,n})}_{n≥0}.

We start the backward induction on t. Pick ω ∈ Ω. We omit the dependence of the random elements on ω for compactness. Remember that the base case t = T is trivial and we pick N_T^{*,k} = N. We also pick, for convenience, N_T^k = N.

Induction Hypothesis: Given t = T − 1, . . . , 0, assume, for t + 1 and all k ≥ 0, the existence of integers N_{t+1}^k and N_{t+1}^{*,k} such that, for all n ≥ N_{t+1}^k, (29) is true, and, for all n ≥ N_{t+1}^{*,k}, the inequalities in (28) hold for all accumulation points (W_{t+1}^*, R_{t+1}^{x,*}).

Part 1:

For our fixed time period t, we prove, for any k, the existence of an integer N_t^k such that, for n ≥ N_t^k, inequality (29) is true. The proof is by forward induction on k.

We start with k = 0. For every state (W_t, R), we have −B^v ≤ v_t^*(W_t, R) ≤ B^v, implying, by definition, that L_t^0(W_t, R) ≤ −B^v. Therefore, (29) is satisfied for all n ≥ 1, since we know that v̄_t^{n-1}(W_t, R) is bounded between −B^v and +B^v for all iterations. Thus, N_t^0 = max(1, N_{t+1}^{*,0}) = N_{t+1}^{*,0}.

The induction hypothesis on k assumes that there exists N_t^k such that (29) is true for all n ≥ N_t^k. Note that we can always make N_t^k larger than N_{t+1}^{*,k}; thus we assume that N_t^k ≥ N_{t+1}^{*,k}. The next step is the proof for k + 1.

Before we move on, we depart from our pointwise argument in order to define the stochastic noise sequence and state a lemma describing an important property of this sequence. We start by defining, for R = 1, . . . , B^R, the random variable s_{t+1}^n(R) = (Hv̄^{n-1})_t(W_t^n, R) − v̂_{t+1}^n(R), which measures the error incurred by observing a sample slope. Using s_{t+1}^n(R), we define for each W_t ∈ W_t the stochastic noise sequence {s_t^n(W_t, R)}_{n≥0}. We have that s_t^n(W_t, R) = 0 on {n < N_t^k} and, on {n ≥ N_t^k}, s_t^n(W_t, R) is equal to

    max(0, (1 − ᾱ_t^n(W_t, R)) s_t^{n-1}(W_t, R) + ᾱ_t^n(W_t, R) s_{t+1}^n(R_t^{x,n} 1_{R ≤ R_t^{x,n}} + (R_t^{x,n} + 1) 1_{R > R_t^{x,n}})).


The sample slopes are defined in a way such that

    E[s_{t+1}^n(R) | F_t^n] = 0.   (30)

This conditional expectation is called the unbiasedness property. This property, together with the martingale convergence theorem and the boundedness of both the sample slopes and the approximate slopes, is crucial for proving that the noise introduced by the observation of the sample slopes, which replaces the observation of true expectations, goes to zero as the number of iterations of the algorithm goes to infinity, as stated in the next lemma.

Lemma 5.1. On the event that (W_t^*, R_t^{x,*}) is an accumulation point of {(W_t^n, R_t^{x,n})}_{n≥0}, we have that

    s_t^n(W_t^*, R_t^{x,*}) → 0 and s_t^n(W_t^*, R_t^{x,*} + 1) → 0 a.s.   (31)

Proof: Given in the appendix.

Returning to our pointwise argument, where we have fixed ω ∈ Ω, we use the convention that the minimum of an empty set is +∞. Let

    δ^k = min { ((HL^k)_t(W_t^*, R_t^{x,*}) − L_t^k(W_t^*, R_t^{x,*})) / 4 : (W_t^*, R_t^{x,*}) ∈ S_t, (HL^k)_t(W_t^*, R_t^{x,*}) > L_t^k(W_t^*, R_t^{x,*}) }.

If δ^k < +∞, we define an integer N^L ≥ N_t^k to be such that

    Π_{m=N_t^k}^{N^L − 1} (1 − ᾱ_t^m(W_t^*, R_t^{x,*})) ≤ 1/4 and s_t^{n-1}(W_t^*, R_t^{x,*}) ≤ δ^k,   (32)

for all n ≥ N^L and states (W_t^*, R_t^{x,*}) ∈ S_t. Such an N^L exists because both (17) and (31) are true. If δ^k = +∞ then, for all states (W_t^*, R_t^{x,*}) ∈ S_t, we have (HL^k)_t(W_t^*, R_t^{x,*}) = L_t^k(W_t^*, R_t^{x,*}), since (24) tells us that (HL^k)_t(W_t^*, R_t^{x,*}) ≥ L_t^k(W_t^*, R_t^{x,*}). Thus, L_t^{k+1}(W_t^*, R_t^{x,*}) = L_t^k(W_t^*, R_t^{x,*}) and we define the integer N^L to be equal to N_t^k. We let N_t^{k+1} = max(N^L, N_{t+1}^{*,k+1}) and show that (29) holds for n ≥ N_t^{k+1}.

We pick a state (W_t^*, R_t^{x,*}) ∈ S_t. If L_t^{k+1}(W_t^*, R_t^{x,*}) = L_t^k(W_t^*, R_t^{x,*}), then inequality (29) follows from the induction hypothesis. We therefore concentrate on the case where L_t^{k+1}(W_t^*, R_t^{x,*}) > L_t^k(W_t^*, R_t^{x,*}).


First, we depart one more time from the pointwise argument to introduce the stochastic bounding sequence. We also state a lemma combining this sequence with the stochastic noise sequence. For W_t ∈ W_t and R = 1, . . . , B^R, we have on {n < N_t^k} that l_t^n(W_t, R) = L_t^k(W_t, R) and, on {n ≥ N_t^k},

    l_t^n(W_t, R) = (1 − ᾱ_t^n(W_t, R)) l_t^{n-1}(W_t, R) + ᾱ_t^n(W_t, R) (HL^k)_t(W_t, R).

The next lemma states that the noise and the stochastic bounding sequences can be used to bound the approximate slopes, as follows.

Lemma 5.2. On {n ≥ N_t^k, (W_t^*, R_t^{x,*}) ∈ S_t},

    v̄_t^{n-1}(W_t^*, R_t^{x,*}) ≥ l_t^{n-1}(W_t^*, R_t^{x,*}) − s_t^{n-1}(W_t^*, R_t^{x,*}) a.s.   (33)

Proof: Given in the appendix.

Back to our fixed ω, a simple inductive argument proves that l_t^n(W_t, R) is a convex combination of L_t^k(W_t, R) and (HL^k)_t(W_t, R). Therefore we can write

    l_t^{n-1}(W_t^*, R_t^{x,*}) = b_t^{n-1} L_t^k(W_t^*, R_t^{x,*}) + (1 − b_t^{n-1}) (HL^k)_t(W_t^*, R_t^{x,*}),

where b_t^{n-1} = Π_{m=N_t^k}^{n-1} (1 − ᾱ_t^m(W_t^*, R_t^{x,*})). For n ≥ N_t^{k+1} ≥ N^L, we have b_t^{n-1} ≤ 1/4. Moreover, L_t^k(W_t^*, R_t^{x,*}) ≤ (HL^k)_t(W_t^*, R_t^{x,*}). Thus, using (22) and the definition of δ^k, we obtain

    l_t^{n-1}(W_t^*, R_t^{x,*}) ≥ (1/4) L_t^k(W_t^*, R_t^{x,*}) + (3/4) (HL^k)_t(W_t^*, R_t^{x,*})
        = (1/2) L_t^k(W_t^*, R_t^{x,*}) + (1/2) (HL^k)_t(W_t^*, R_t^{x,*}) + (1/4) ((HL^k)_t(W_t^*, R_t^{x,*}) − L_t^k(W_t^*, R_t^{x,*}))
        ≥ L_t^{k+1}(W_t^*, R_t^{x,*}) + δ^k.   (34)

Combining (33) and (34), we obtain, for all n ≥ N_t^{k+1} ≥ N^L,

    v̄_t^{n-1}(W_t^*, R_t^{x,*}) ≥ L_t^{k+1}(W_t^*, R_t^{x,*}) + δ^k − s_t^{n-1}(W_t^*, R_t^{x,*})
        ≥ L_t^{k+1}(W_t^*, R_t^{x,*}) + δ^k − δ^k
        = L_t^{k+1}(W_t^*, R_t^{x,*}),

where the last inequality follows from (32).


Part 2:

We continue to consider the ω picked at the beginning of the proof of the theorem. In this part, we take care of the states (W_t^*, R_t^{x,*}) that are accumulation points but are not in S_t. In contrast to part 1, the proof technique here is not forward induction on k. We rely entirely on the definition of the projection operation and on the elements defined in section 5.2, as this part of the proof is all about states for which the projection operation decreased the corresponding approximate slopes infinitely often, which might happen when some of the optimal slopes are equal. Of course, this fact is not verifiable in advance, as the optimal slopes are unknown.

Recall that at iteration $n$, time period $t$, we observe the sample slopes $\hat{v}^n_{t+1}(R^{x,n}_t)$ and $\hat{v}^n_{t+1}(R^{x,n}_t + 1)$, and it is always the case that $\hat{v}^n_{t+1}(R^{x,n}_t) \ge \hat{v}^n_{t+1}(R^{x,n}_t + 1)$, implying that the resulting temporary slope $z^n_t(W^n_t, R^{x,n}_t)$ is at least $z^n_t(W^n_t, R^{x,n}_t + 1)$. Therefore, according to our projection operator, the updated slopes $\bar{v}^n_t(W^n_t, R^{x,n}_t)$ and $\bar{v}^n_t(W^n_t, R^{x,n}_t + 1)$ are always equal to $z^n_t(W^n_t, R^{x,n}_t)$ and $z^n_t(W^n_t, R^{x,n}_t + 1)$, respectively. Due to our stepsize rule, as described in section 4, the slopes corresponding to $(W^n_t, R^{x,n}_t)$ and $(W^n_t, R^{x,n}_t + 1)$ are the only ones updated due to a direct observation of sample slopes at iteration $n$, time period $t$. All the other slopes are modified only if a violation of the monotone decreasing property occurs. Therefore, the slopes corresponding to states with an information vector $W_t \in \mathcal{W}$ different from $W^n_t$, no matter the resource level $R = 1, \dots, B^R$, remain the same at iteration $n$, time period $t$; that is, $\bar{v}^{\,n-1}_t(W_t, R) = z^n_t(W_t, R) = \bar{v}^n_t(W_t, R)$. On the other hand, the temporary slopes corresponding to states with information vector $W^n_t$ and resource levels smaller than $R^{x,n}_t$ can only be increased by the projection operation; if necessary, they are increased to equal $\bar{v}^n_t(W^n_t, R^{x,n}_t)$. Similarly, the temporary slopes corresponding to states with information vector $W^n_t$ and resource levels greater than $R^{x,n}_t + 1$ can only be decreased by the projection operation; if necessary, they are decreased to equal $\bar{v}^n_t(W^n_t, R^{x,n}_t + 1)$ (see figure 3c).
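To make the mechanics above concrete, the following minimal sketch (our own illustration, not the authors' implementation; the function name and list representation are assumptions) applies the smoothing update at the two observed resource levels and then restores the monotone decreasing property exactly as described: slopes to the left can only be raised, slopes to the right can only be lowered.

```python
# Sketch of the slope update and monotonicity projection described above,
# for a single information vector W at time t.

def update_and_project(v, r, vhat_r, vhat_r1, alpha):
    """Smooth the two observed sample slopes into entries r and r+1,
    then restore the monotone decreasing property.

    v        : list of slopes, with v[i] >= v[i+1] on entry
    r        : post-decision resource level observed this iteration
    vhat_r   : sample slope at r (assumed vhat_r >= vhat_r1)
    vhat_r1  : sample slope at r + 1
    alpha    : stepsize in (0, 1]
    """
    z = list(v)  # temporary (unprojected) slopes
    z[r] = (1 - alpha) * v[r] + alpha * vhat_r
    z[r + 1] = (1 - alpha) * v[r + 1] + alpha * vhat_r1
    # Projection: entries left of r can only be raised, entries right
    # of r+1 can only be lowered, as in the text.
    for i in range(r - 1, -1, -1):
        z[i] = max(z[i], z[r])
    for i in range(r + 2, len(z)):
        z[i] = min(z[i], z[r + 1])
    return z
```

Only the two directly observed entries receive sample information; all other changes come from the projection, which mirrors the case analysis in the paragraph above.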

Keeping the previous discussion in mind, it is easy to see that, for each $W^*_t \in \mathcal{W}_t$, if $R^{\mathrm{Min}}_t$ is the minimum resource level such that $(W^*_t, R^{\mathrm{Min}}_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n \ge 0}$, then the slope corresponding to $(W^*_t, R^{\mathrm{Min}}_t)$ could be decreased by the projection operation only a finite number of iterations, as a decreasing requirement could only originate from a resource level smaller than $R^{\mathrm{Min}}_t$. However, no state with information vector $W^*_t$ and resource level smaller than $R^{\mathrm{Min}}_t$ is visited by the algorithm after iteration $N$ (as defined in section 5.2), since only accumulation points are visited after $N$. We thus have that $(W^*_t, R^{\mathrm{Min}}_t)$ is an element of the set $\mathcal{S}_t$, showing that $\mathcal{S}_t$ is a proper set.

Hence, for all states $(W^*_t, R^{x,*}_t)$ that are accumulation points of $\{(W^n_t, R^{x,n}_t)\}_{n \ge 0}$ and are not elements of $\mathcal{S}_t$, there exists another state $(W^*_t, \tilde{R}^{x,*}_t)$, where $\tilde{R}^{x,*}_t$ is the maximum resource level smaller than $R^{x,*}_t$ such that $(W^*_t, \tilde{R}^{x,*}_t) \in \mathcal{S}_t$. We argue that for all resource levels $R$ between $\tilde{R}^{x,*}_t + 1$ and $R^{x,*}_t$ (inclusive), we have that $|\mathcal{N}_t(W^*_t, R)| = \infty$. Figure 4 illustrates the situation.

Figure 4: Illustration of technical elements related to the projection operation: the optimal slopes $v^*_t(W^*_t, R)$ plotted against the resource level, marking $R^{\mathrm{Min}}_t$, the accumulation points, the levels in $\mathcal{S}_t$, and the levels with $|\mathcal{N}_t(W^*_t, R)| = \infty$.

As introduced in section 5.2, we have that $\mathcal{N}_t(W^*_t, R) = \{n \in \mathbb{N} : z^n_t(W^*_t, R) > \bar{v}^n_t(W^*_t, R)\}$. By definition, the sets $\mathcal{S}_t$ and $\mathcal{N}_t(W^*_t, R)$ share the following relationship: given that $(W^*_t, R)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n \ge 0}$, then $|\mathcal{N}_t(W^*_t, R)| = \infty$ if and only if the state $(W^*_t, R)$ is not an element of $\mathcal{S}_t$. Therefore, $|\mathcal{N}_t(W^*_t, R^{x,*}_t)| = \infty$, otherwise $(W^*_t, R^{x,*}_t)$ would be an element of $\mathcal{S}_t$. If $\tilde{R}^{x,*}_t = R^{x,*}_t - 1$, we are done. If $\tilde{R}^{x,*}_t < R^{x,*}_t - 1$, we have to consider two cases, namely that $(W^*_t, R^{x,*}_t - 1)$ is an accumulation point and that it is not. In the first case, we have that $|\mathcal{N}_t(W^*_t, R^{x,*}_t - 1)| = \infty$ from the fact that this state is not an element of $\mathcal{S}_t$. In the second case, since $(W^*_t, R^{x,*}_t - 1)$ is not an accumulation point, its corresponding slope is never updated due to a direct observation of sample slopes for $n \ge N$, by the definition of $N$. Moreover, every time the slope of $(W^*_t, R^{x,*}_t)$ is decreased due to a projection (which comes from the left), the slope of $(W^*_t, R^{x,*}_t - 1)$ has to be decreased as well. Therefore, $\mathcal{N}_t(W^*_t, R^{x,*}_t) \cap \{n \ge N\} \subseteq \mathcal{N}_t(W^*_t, R^{x,*}_t - 1) \cap \{n \ge N\}$, implying that $|\mathcal{N}_t(W^*_t, R^{x,*}_t - 1)| = \infty$. We then apply the same reasoning to the states $(W^*_t, R^{x,*}_t - 2), \dots, (W^*_t, \tilde{R}^{x,*}_t + 1)$, obtaining that the corresponding sets of iterations have an infinite number of elements. The same reasoning applies to states $(W^*_t, R^{x,*}_t + 1)$ that are not in $\mathcal{S}_t$.

We state a lemma that is the key element of the proof of Part 2, once again moving away from the pointwise argument.

Lemma 5.3. Given an information vector $W_t \in \mathcal{W}_t$ and a resource level $R = 1, \dots, B^R - 1$, if for all $k \ge 0$ there exists an integer random variable $N^k(W_t, R)$ such that $L^k_t(W_t, R) \le \bar{v}^{\,n-1}_t(W_t, R)$ almost surely on $\{n \ge N^k(W_t, R)\} \cap \{|\mathcal{N}_t(W_t, R+1)| = \infty\}$, then for all $k \ge 0$ there exists another integer random variable $N^k(W_t, R+1)$ such that $L^k_t(W_t, R+1) \le \bar{v}^{\,n-1}_t(W_t, R+1)$ almost surely on $\{n \ge N^k(W_t, R+1)\}$.

Proof: [Proof of lemma 5.3] Given in the appendix.

Using the properties of the projection operator, we return to the proof of Part 2 and to our fixed $\omega$. Pick $k \ge 0$ and a state $(W^*_t, R^{x,*}_t)$ that is an accumulation point but is not in $\mathcal{S}_t$ (the same argument applies if $(W^*_t, R^{x,*}_t + 1) \notin \mathcal{S}_t$). Consider the state $(W^*_t, \tilde{R}^{x,*}_t)$, where $\tilde{R}^{x,*}_t$ is the maximum resource level smaller than $R^{x,*}_t$ such that $(W^*_t, \tilde{R}^{x,*}_t) \in \mathcal{S}_t$. This state satisfies the condition of lemma 5.3 with $N^k(W^*_t, \tilde{R}^{x,*}_t) = N^k_t$ (from part 1 of the proof). Thus, we can apply the lemma to obtain, for all $k \ge 0$, an integer $N^k(W^*_t, \tilde{R}^{x,*}_t + 1)$ such that $L^k_t(W^*_t, \tilde{R}^{x,*}_t + 1) \le \bar{v}^{\,n-1}_t(W^*_t, \tilde{R}^{x,*}_t + 1)$ for all $n \ge N^k(W^*_t, \tilde{R}^{x,*}_t + 1)$.

After that, we use lemma 5.3 again, this time considering the state $(W^*_t, \tilde{R}^{x,*}_t + 1)$. Note that the first application of lemma 5.3 gave us the integer $N^k(W^*_t, \tilde{R}^{x,*}_t + 1)$ necessary to fulfill the conditions of this second application of the lemma. We repeat the same reasoning, applying lemma 5.3 successively to the states $(W^*_t, \tilde{R}^{x,*}_t + 2), \dots, (W^*_t, R^{x,*}_t - 1)$. In the end, we obtain, for each $k \ge 0$, an integer $N^k(W^*_t, R^{x,*}_t)$ such that $L^k_t(W^*_t, R^{x,*}_t) \le \bar{v}^{\,n-1}_t(W^*_t, R^{x,*}_t)$ for all $n \ge N^k(W^*_t, R^{x,*}_t)$. Figure 5 illustrates this process.

Figure 5: Successive applications of lemma 5.3, starting from $(W^*_t, \tilde{R}^{x,*}_t) \in \mathcal{S}_t$ and moving up one resource level at a time until $(W^*_t, R^{x,*}_t)$ is reached.

Finally, if we pick $N^{*,k}_t$ to be greater than the $N^k_t$ of part 1 and greater than $N^k(W^*_t, R^{x,*}_t)$ and $N^k(W^*_t, R^{x,*}_t + 1)$ for all accumulation points $(W^*_t, R^{x,*}_t)$ that are not in $\mathcal{S}_t$, then (28) is true for all accumulation points and $n \ge N^{*,k}_t$.

5.4 Optimality of the Decisions

We finish the convergence analysis by proving that, with probability one, the algorithm learns an optimal decision for all states that can be reached by an optimal policy.

Theorem 2. Assume the conditions of Theorem 1 are satisfied. For $t = 0, \dots, T$, on the event that $(W^*_t, R^*_t, \bar{v}^*_t, x^*_t)$ is an accumulation point of the sequence $\{(W^n_t, R^n_t, \bar{v}^{\,n-1}_t, x^n_t)\}_{n \ge 1}$ generated by the SPAR-Storage algorithm, $x^*_t$ is almost surely an optimal solution of

$$\max_{x_t \in \mathcal{X}_t(W^*_t, R^*_t)} F_t(\bar{v}^*_t(W^*_t), W^*_t, R^*_t, x_t), \qquad (35)$$

where

$$F_t(\bar{v}^*_t(W^*_t), W^*_t, R^*_t, x_t) = C_t(W^*_t, R^*_t, x_t) + \gamma \bar{V}^*_t\Bigl(W^*_t,\; R^*_t - \sum_{i=1}^{B^d_t} x^d_{ti} + \sum_{i=1}^{B^s_t} x^s_{ti} - x^{tr}_t\Bigr).$$

Proof: Fix $\omega \in \Omega$. As before, the dependence on $\omega$ is omitted. At each iteration $n$ and time $t$ of the algorithm, the decision $x^n_t$ in STEP 3 of the algorithm is an optimal solution to the optimization problem $\max_{x_t \in \mathcal{X}_t(W^n_t, R^n_t)} F_t(\bar{v}^{\,n-1}_t(W^n_t), W^n_t, R^n_t, x_t)$. Since $F_t(\bar{v}^{\,n-1}_t(W^n_t), W^n_t, R^n_t, \cdot)$ is concave and $\mathcal{X}_t(W^n_t, R^n_t)$ is convex, we have that

$$0 \in \partial F_t(\bar{v}^{\,n-1}_t(W^n_t), W^n_t, R^n_t, x^n_t) + \mathcal{X}^N_t(W^n_t, R^n_t, x^n_t),$$

where $\partial F_t(\bar{v}^{\,n-1}_t(W^n_t), W^n_t, R^n_t, x^n_t)$ is the subdifferential of $F_t(\bar{v}^{\,n-1}_t(W^n_t), W^n_t, R^n_t, \cdot)$ at $x^n_t$ and $\mathcal{X}^N_t(W^n_t, R^n_t, x^n_t)$ is the normal cone of $\mathcal{X}_t(W^n_t, R^n_t)$ at $x^n_t$.

Then, by passing to the limit, we can conclude that each accumulation point $(W^*_t, R^*_t, \bar{v}^*_t, x^*_t)$ of the sequence $\{(W^n_t, R^n_t, \bar{v}^{\,n-1}_t, x^n_t)\}_{n \ge 1}$ satisfies the condition $0 \in \partial F_t(\bar{v}^*_t(W^*_t), W^*_t, R^*_t, x^*_t) + \mathcal{X}^N_t(W^*_t, R^*_t, x^*_t)$.

We now derive an expression for the subdifferential. We have that

$$\partial F_t(\bar{v}^*_t(W^*_t), W^*_t, R^*_t, x_t) = \nabla C_t(W^*_t, R^*_t, x_t) + \gamma\, \partial \bar{V}^*_t(W^*_t, R^*_t + A \cdot x_t),$$

where $A' = (\underbrace{-1, \dots, -1}_{B^d}, \underbrace{1, \dots, 1}_{B^s}, -1)$, that is, $A \cdot x_t = -\sum_{i=1}^{B^d_t} x^d_{ti} + \sum_{i=1}^{B^s_t} x^s_{ti} - x^{tr}_t$. From (Bertsekas et al., 2003, Proposition 4.2.5), for $x_t \in \mathbb{N}^{B^d + B^s + 1}$,

$$\partial \bar{V}^*_t(W^*_t, R^*_t + A \cdot x_t) = \bigl\{(A_1 y, \dots, A_{B^d + B^s + 1}\, y)^T : y \in [\bar{v}^*_t(W^*_t, R^*_t + A \cdot x_t + 1),\; \bar{v}^*_t(W^*_t, R^*_t + A \cdot x_t)]\bigr\}.$$

Therefore, as $x^*_t$ is integer valued,

$$\partial F_t(\bar{v}^*_t(W^*_t), W^*_t, R^*_t, x^*_t) = \nabla C_t(W^*_t, R^*_t, x^*_t) + \bigl\{\gamma A y : y \in [\bar{v}^*_t(W^*_t, R^*_t + A \cdot x^*_t + 1),\; \bar{v}^*_t(W^*_t, R^*_t + A \cdot x^*_t)]\bigr\}.$$

Since $(W^*_t, R^*_t + A \cdot x^*_t)$ is an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n \ge 0}$, it follows from theorem 1 that

$$\bar{v}^*_t(W^*_t, R^*_t + A \cdot x^*_t) = v^*_t(W^*_t, R^*_t + A \cdot x^*_t) \quad \text{and} \quad \bar{v}^*_t(W^*_t, R^*_t + A \cdot x^*_t + 1) = v^*_t(W^*_t, R^*_t + A \cdot x^*_t + 1).$$

Hence, $\partial F_t(\bar{v}^*_t(W^*_t), W^*_t, R^*_t, x^*_t) = \partial F_t(v^*_t(W^*_t), W^*_t, R^*_t, x^*_t)$ and

$$0 \in \partial F_t(v^*_t(W^*_t), W^*_t, R^*_t, x^*_t) + \mathcal{X}^N_t(W^*_t, R^*_t, x^*_t),$$

which proves that $x^*_t$ is an optimal solution of (35).

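The first-order condition above has a simple computational reading for a scalar release decision: with a concave piecewise-linear value of storage, a unit should be released only while its immediate payoff exceeds the discounted marginal value of keeping it. The sketch below is our own toy illustration of this marginal test (the function name and the price/slope setup are assumptions, not the paper's model).

```python
# Toy marginal test for a scalar release decision under a concave
# piecewise-linear value of stored energy.

def optimal_release(price, gamma, v, R):
    """price : revenue per unit released
    gamma : discount factor
    v     : v[r] = marginal value of the (r+1)-th stored unit,
            monotone decreasing in r
    R     : units currently in storage
    Releases one more unit only while the sale price exceeds the
    discounted slope of the unit that would be given up."""
    x = 0
    while x < R and price > gamma * v[R - x - 1]:
        x += 1
    return x
```

Because the objective is concave in the integer decision, this greedy unit-by-unit comparison stops exactly at a point where no single-unit change improves the objective, which is the discrete analogue of the subgradient condition in the proof.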

6 Numerical experiments

We have implemented the SPAR-Storage algorithm in an energy policy model called SMART (Stochastic Multiscale model for the Analysis of energy Resources, Technology and policy), described in Powell et al. (2009) (the numerical results reported here are based on this paper). The problem involved modeling the allocation of 15 types of energy resources, where some were converted to electricity and distributed over the grid, while the others (notably transportation fuels and home heating) were used to meet various demands through the conversion and movement of physical fuels. The model represents energy supplies and demand for the western United States at an aggregate level.

The model captures hourly variations in wind, solar and demands, over a one-year horizon.

Each time period involves a fairly small linear program (comparable to the model in section 2)

that is linked over time through the presence of a single, grid-level storage facility in the form

of a reservoir for hydroelectric power. Water inflow to the reservoir, in the form of rainfall

and snow melt, is a highly seasonal process with typically high input in the late winter and

spring. Generating electricity from hydroelectric power is extremely cheap on the margin, but

there are benefits to holding water in the reservoir until it is needed in the summer. There is

a limit on how much water can be stored. In the optimal solution, the tendency is to begin

storing water late in the winter, hitting the maximum just before we begin drawing it down for

the summer months with no rainfall. This means that the algorithm has to be smart enough

to realize, in March, that there is a positive value for storing water that is not needed until

August, thousands of time periods later.

We undertook two sets of experiments to test the performance of the algorithm. The

first considered a deterministic dataset which could be solved optimally as a linear program

over all 8,760 time periods at the same time. This is a good test of the algorithm, since the

ADP algorithm is not allowed to take advantage of the deterministic future. The deterministic

model was run with a single rainfall scenario, and two demand patterns: the first used constant

demand over the entire year, while the second used a sinusoidal curve to test more complex

behaviors for storing and releasing water. The ADP algorithm was run for 200 iterations (requiring approximately 5 seconds per iteration for the entire year), and produced an objective function within 0.02 percent of the optimal solution produced by the linear programming model. Despite this close agreement, we were interested in seeing the actual behavior in terms of the amount held in storage, since small differences in the objective function can hide significant differences in the decisions.

Figure 6: Optimal reservoir levels ((a) constant demand, (c) sinusoidal demand) and reservoir levels from the ADP algorithm ((b) constant demand, (d) sinusoidal demand), each plotted with demand and rainfall (from Powell et al. (2009)).

Figure 6 shows the rainfall, demand and, most important, the amount of water in storage

at each time period as produced by the ADP algorithm, and the optimal solution from the

linear programming model. Clearly there is very close agreement. Note also that in both runs,

the ADP solution was able to replicate the behavior of the optimal solution where water was

held in storage for over 2000 hours.

The ADP algorithm, of course, can also handle uncertainty. This was tested in the form

of stochastic rainfall. Figure 7 shows the reservoir level produced by the ADP algorithm (a single sample path, taken from the last iteration). Also shown is the optimal solution

Figure 7: Reservoir level produced by ADP (last iteration) for stochastic precipitation, and optimal reservoir levels produced by the LP when optimized over individual scenarios (from Powell et al. (2009)).

produced by the linear programming model when optimized over each sample path for rainfall. Each of these so-called optimal solutions is allowed to see into the future. Not surprisingly, the optimal, prescient solutions for each sample path are able to hold off longer before storing water, whereas the ADP solution, quite reasonably, stores water earlier in the season because of the possibility that the rainfall might be lower than expected.

The ADP-based algorithm in the SMART model was found to offer some attractive features. First, while this paper focuses purely on storage tested over a one-year horizon, in the SMART model it was used in the context of an energy policy model that extended over 20 years (over 175,000 time periods) with run times of only a few hours on a PC. The same ADP algorithm described in this paper was also used to model energy investment decisions (how many megawatts of capacity should we have for ethanol, wind, solar, coal, natural gas, ...), but in this case we used separable approximations of a nonseparable value function. For this setting we can no longer prove convergence, but we did obtain results within 0.24 percent of the linear programming optimal solution, suggesting that the general algorithmic strategy scales to vector-valued resource allocation problems. This finding is in line with other algorithmic research on this class of algorithms (in particular, see Topaloglu & Powell (2006)).


7 Summary

We propose an approximate dynamic programming algorithm using pure exploitation for the problem of planning energy flows over a year, in hourly increments, in the presence of a single storage device, taking advantage of our ability to represent the value function as a piecewise linear function. We are able to prove almost sure convergence of the algorithm using a pure exploitation strategy. This ability is critical for this application, since we discretized each value function (for 8,760 time periods) into thousands of breakpoints. Rather than needing to visit all the possible values of the storage level many times, we only need to visit a small number of these points (for each value of our information vector $W_t$) many times.

A key feature of our algorithm is that it is able to handle high-dimensional decisions in each

time period. For the energy application, this means that we can solve the relatively small (but

with hundreds of dimensions) decision problem that determines how much energy is coming

from each energy source and being converted to serve different types of energy demands. We

are able to solve sequences of deterministic linear programs very quickly by formulating the

value function around the post-decision state variable.
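To illustrate how a value function approximated around the post-decision state keeps each per-period problem small and deterministic, here is a minimal sketch of a forward pass. This is our own abstraction with integer storage and a generic contribution function; all names are assumptions, and a real implementation would solve a linear program at each step rather than enumerate decisions.

```python
# Forward-pass sketch: each period's problem uses only the current
# information, the pre-decision resource level, and the slopes of the
# approximate value function around the post-decision resource level.

def forward_pass(T, R0, contribution, slopes, feasible):
    """contribution(t, d): one-period reward of net storage change d;
    slopes[t][r]: marginal value of the (r+1)-th stored unit after the
    decision at t; feasible(t, R): iterable of integer net changes."""
    R, total = R0, 0.0
    for t in range(T):
        # choose the net change d maximizing immediate reward plus the
        # approximate value of the post-decision resource level R + d
        d = max(feasible(t, R),
                key=lambda d: contribution(t, d) + sum(slopes[t][:R + d]))
        total += contribution(t, d)
        R += d
    return total, R
```

Note that the stochastic future enters only through the slopes, so the per-period problem is deterministic even over a long stochastic horizon; this is the design choice the paragraph above describes.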

Our numerical experiments with the SMART model suggest that the resulting algorithm is quite fast and, when applied to a deterministic problem, produces solutions that closely match the optimal solution produced by a linear program. This is a true test of the ADP algorithm, since it requires that it be able to determine when to start storing water in the reservoir so that we can meet demands late in the summer, thousands of hours later. The algorithm requires approximately five seconds to do a forward pass over all 8,760 hourly time periods within a year, and high quality solutions are obtained within 100 to 200 iterations.

Additional research is required to develop a practical algorithm in the presence of uncertainty when the information vector $W_t$ is large, since we have to estimate a piecewise linear value function for each possible value of $W_t$. There are a number of statistical techniques that can be used to address this problem. A possible line of investigation, described in George et al. (2008), uses hierarchical aggregation, where a piecewise linear value function is estimated for $W_t$ at different levels of aggregation, and an estimate of the value function is formed using a weighted sum of these different approximations. As the algorithm progresses, the weights are adjusted so that higher weights are put on more disaggregate levels. This strategy appears promising, but we need to show (a) that it would still produce a convergent algorithm and (b) that it actually works in practice.
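As a hedged sketch of the hierarchical-aggregation idea cited above, the function below blends estimates from several aggregation levels using weights inversely proportional to an estimated squared bias plus variance, a rule in the spirit of George et al. (2008); the data layout and names are our own illustrative assumptions, not that paper's specification.

```python
# Blend estimates of a slope made at several aggregation levels of the
# information vector. Disaggregate levels start with high variance and
# low bias; as observations accumulate their variance shrinks and the
# weights shift toward them.

def blended_estimate(levels):
    """levels: list of dicts with keys 'mean', 'var', 'bias' giving the
    statistics of the estimate at one aggregation level.
    Returns the weighted combination of the level means."""
    raw = [1.0 / (lv["bias"] ** 2 + lv["var"]) for lv in levels]
    total = sum(raw)
    weights = [r / total for r in raw]
    return sum(w * lv["mean"] for w, lv in zip(weights, levels))
```

For example, a disaggregate level with small bias and variance dominates the blend, while a heavily biased aggregate level contributes little.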

Acknowledgements

The authors would like to acknowledge the valuable comments and suggestions of Andrzej Ruszczynski. This research was supported in part by AFOSR contract FA9550-08-1-0195.

References

Ahmed, S. & Shapiro, A. (2002), The sample average approximation method for stochastic programs with integer recourse, E-print available at http://www.optimization-online.org.

Barto, A. G., Bradtke, S. J. & Singh, S. P. (1995), 'Learning to act using real-time dynamic programming', Artificial Intelligence, Special Volume on Computational Research on Interaction and Agency 72, 81–138.

Bertsekas, D. (2005), Dynamic Programming and Stochastic Control Vol. I, Athena Scientific.

Bertsekas, D. & Tsitsiklis, J. (1996), Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.

Bertsekas, D., Nedic, A. & Ozdaglar, A. (2003), Convex Analysis and Optimization, Athena Scientific, Belmont, MA.

Brown, P. D., Peças-Lopes, J. A. & Matos, M. A. (2008), 'Optimization of pumped storage capacity in an isolated power system with large renewable penetration', IEEE Transactions on Power Systems 23(2), 523–531.

Castronuovo, E. D. & Lopes, J. A. P. (2004), 'On the optimization of the daily operation of a wind-hydro power plant', IEEE Transactions on Power Systems 19(3), 1599–1606.

Chen, Z.-L. & Powell, W. B. (1999), 'A convergent cutting-plane and partial-sampling algorithm for multistage stochastic linear programs with recourse', Journal of Optimization Theory and Applications 102(3), 497–524.

Fishbone, L. & Abilock, H. (1981), 'MARKAL - A linear programming model for energy systems analysis: Technical description of the BNL version', International Journal of Energy Research 5, 353–375.

Garcia-Gonzalez, J., de la Muela, R. M. R., Santos, L. M. & Gonzalez, A. M. (2008), 'Stochastic joint optimization of wind generation and pumped-storage units in an electricity market', IEEE Transactions on Power Systems 23(2), 460–468.

George, A., Powell, W. B. & Kulkarni, S. (2008), 'Value function approximation using multiple aggregation for multiattribute resource management', Journal of Machine Learning Research, 2079–2111.

Godfrey, G. & Powell, W. B. (2002), 'An adaptive, dynamic programming algorithm for stochastic resource allocation problems I: Single period travel times', Transportation Science 36(1), 21–39.

Hernandez-Lerma, O. & Runggaldier, W. (1994), 'Monotone approximations for convex stochastic control problems', Journal of Mathematical Systems, Estimation, and Control 4, 99–140.

Higle, J. & Sen, S. (1991), 'Stochastic decomposition: An algorithm for two stage linear programs with recourse', Mathematics of Operations Research 16(3), 650–669.

Jaakkola, T., Jordan, M. I. & Singh, S. P. (1994), Convergence of stochastic iterative dynamic programming algorithms, in J. D. Cowan, G. Tesauro & J. Alspector, eds, 'Advances in Neural Information Processing Systems', Vol. 6, Morgan Kaufmann Publishers, San Francisco, pp. 703–710.

Lamont, A. (1997), User's Guide to the META*Net Economic Modeling System Version 2.0, Technical report, Lawrence Livermore National Laboratory.

Lamont, A. D. (2008), 'Assessing the long-term system value of intermittent electric generation technologies', Energy Economics 30(3), 1208–1231.

Messner, S. & Strubegger, M. (1995), User's Guide for MESSAGE III, Technical report, International Institute for Applied Systems Analysis (IIASA).

Nascimento, J. M. & Powell, W. B. (2009), 'An optimal approximate dynamic programming algorithm for the lagged asset acquisition problem', Mathematics of Operations Research 34(1), 210–237.

NEMS (2003), The National Energy Modeling System: An Overview 2003, Technical report, Energy Information Administration, Office of Integrated Analysis and Forecasting, U.S. Department of Energy, Washington, DC 20585.

Powell, W. B. (2007), Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley and Sons, New York.

Powell, W. B., George, A., Lamont, A. & Stewart, J. (2009), 'SMART: A stochastic multiscale model for the analysis of energy resources, technology and policy', INFORMS Journal on Computing.

Powell, W. B., Ruszczynski, A. & Topaloglu, H. (2004), 'Learning algorithms for separable approximations of stochastic optimization problems', Mathematics of Operations Research 29(4), 814–836.

Shapiro, A. (2003), Monte Carlo sampling methods, in A. Ruszczynski & A. Shapiro, eds, 'Handbooks in Operations Research and Management Science: Stochastic Programming', Vol. 10, Elsevier, Amsterdam, pp. 353–425.

Shiryaev, A. (1996), Probability Theory, Vol. 95 of Graduate Texts in Mathematics, Springer-Verlag, New York.

Topaloglu, H. & Powell, W. B. (2006), 'Dynamic programming approximations for stochastic, time-staged integer multicommodity flow problems', INFORMS Journal on Computing 18(1), 31–42.

Tsitsiklis, J. (2002), 'On the convergence of optimistic policy iteration', Journal of Machine Learning Research 3, 59–72.

Tsitsiklis, J. N. (1994), 'Asynchronous stochastic approximation and Q-learning', Machine Learning 16, 185–202.

Van Slyke, R. & Wets, R. (1969), 'L-shaped linear programs with applications to optimal control and stochastic programming', SIAM Journal on Applied Mathematics 17(4), 638–663.

Vanderbei, R. (1997), Linear Programming: Foundations and Extensions, Kluwer International Series.

Appendix

The appendix consists of a brief summary of notation, to assist with reading the proofs, and the proofs themselves.

Notation

For each random element, we provide its measurability.

• Filtrations:
$\mathcal{F} = \sigma\{(W^n_t, x^n_t),\; n \ge 1,\; t = 0, \dots, T\}$.
$\mathcal{F}^n_t = \sigma\bigl(\{(W^m_{t'}, x^m_{t'}),\; 0 < m < n,\; t' = 0, \dots, T\} \cup \{(W^n_{t'}, x^n_{t'}),\; t' = 0, \dots, t\}\bigr)$.

• Post-decision state $(W^n_t, R^{x,n}_t)$:
$R^{x,n}_t \in \mathcal{F}^n_t$: resource level after the decision; integer and bounded by 0 and $B^R$.
$W^n_t \in \mathcal{F}^n_t$: Markovian information vector, independent of the resource level.
$D^n_t \in \mathcal{F}^n_t$: demand vector; nonnegative and integer valued.
$\mathcal{W}_t$: finite support set of $W^n_t$.

• Slopes (monotone decreasing in $R$ and bounded):
$v^*_t(W_t, R)$: slope of the optimal value function at $(W_t, R)$.
$z^n_t(W_t, R) \in \mathcal{F}^n_{t+1}$: unprojected slope of the value function approximation at $(W_t, R)$.
$\bar{v}^n_t(W_t, R) \in \mathcal{F}^n_{t+1}$: slope of the value function approximation at $(W_t, R)$.
$\bar{v}^*_t(W_t, R) \in \mathcal{F}$: accumulation point of $\{\bar{v}^n_t(W_t, R)\}_{n \ge 0}$.
$\hat{v}^n_t(R) \in \mathcal{F}^n_t$: sample slope at $R$.

• Stepsizes (bounded by 0 and 1; sum is $+\infty$, sum of squares is $< +\infty$):
$\alpha^n_t \in \mathcal{F}^n_t$ and $\alpha^n_t(W_t, R) = \alpha^n_t \bigl(\mathbf{1}_{\{W_t = W^n_t,\, R = R^{x,n}_t\}} + \mathbf{1}_{\{W_t = W^n_t,\, R = R^{x,n}_t + 1\}}\bigr)$.

• Finite random variable $N$: an iteration large enough for the convergence analysis.

• Sets of iterations and states (due to the projection operation):
$\mathcal{N}_t(W_t, R) \in \mathcal{F}$: iterations at which the unprojected slope at $(W_t, R)$ was decreased.
$\mathcal{S}_t \in \mathcal{F}$: states at which the projection has not decreased the unprojected slopes infinitely often.

• Dynamic programming operator $H$.

• Deterministic slope sequence $\{L^k_t(W_t, R)\}_{k \ge 0}$; converges to $v^*_t(W_t, R)$.

• Error variable $\hat{s}^n_{t+1}(R) \in \mathcal{F}^n_{t+1}$.

• Stochastic noise sequence $\{s^n_t(W_t, R)\}_{n \ge 0}$.

• Stochastic bounding sequence $\{l^n_t(W_t, R)\}_{n \ge 0}$.

Proofs of Lemmas

We start with the proof of proposition 2; then we present the proofs for the lemmas of the convergence analysis section.

Proof: [Proof of proposition 2] We use the notational convention that $L^k$ is the entire set of slopes $L^k = \{L^k_t(W_t) \text{ for } t = 0, \dots, T,\; W_t \in \mathcal{W}_t\}$, where $L^k_t(W_t) = (L^k_t(W_t, 1), \dots, L^k_t(W_t, B^R))$.

We start by showing (23). Clearly, $L^k_T(W_T)$ is monotone decreasing (MD) for all $k \ge 0$ and $W_T \in \mathcal{W}_T$, since $L^k_T(W_T)$ is a vector of zeros. Thus, using condition (11), we have that $(HL^k)_{T-1}(W_{T-1})$ is MD. We keep in mind that, for $t = 0, \dots, T-1$, $L^0_t(W_t)$ is MD, as $L^0 = v^* - 2B^v e$. By definition, we have that $L^1_{T-1}(W_{T-1})$ is MD. A simple induction argument shows that $L^k_{T-1}(W_{T-1})$ is MD. Using the same argument for $t = T-2, \dots, 0$, we show that $(HL^k)_t(W_t)$ and $L^k_t(W_t)$ are MD.

We now show (24). Since $v^*$ is the unique fixed point of $H$, we have $L^0 = Hv^* - 2B^v e$. Moreover, from condition (13), we have that $(Hv^*)_t(W_t, R) - 2B^v \le (H(v^* - 2B^v e))_t(W_t, R) = (HL^0)_t(W_t, R)$ for all $W_t \in \mathcal{W}_t$ and $R = 1, \dots, B^R$. Hence, $L^0 \le HL^0$ and $L^0 \le L^1 = \frac{L^0 + HL^0}{2} \le HL^0$. Suppose that (24) holds true for some $k > 0$; we shall prove it for $k + 1$. The induction hypothesis and (23) tell us that (12) holds true. Hence, $HL^{k+1} \ge HL^k \ge L^{k+1}$ and $HL^{k+1} \ge L^{k+2} \ge L^{k+1}$.

Finally, we show (25). A simple inductive argument shows that $L^k \le v^*$ for all $k \ge 0$. Thus, as the sequence is monotone and bounded, it is convergent. Let $L$ denote the limit. It is clear that $L_t(W_t)$ is monotone decreasing. Therefore, conditions (11)-(13) are also true when applied to $L$. Hence, as shown in (Bertsekas & Tsitsiklis, 1996, pages 158-159), we have that

$$\|HL^k - HL\|_\infty \le \|L^k - L\|_\infty.$$

With this inequality, it is straightforward to see that

$$\lim_{k \to \infty} HL^k = HL.$$

Therefore, as in the proof of (Bertsekas & Tsitsiklis, 1996, Lemma 3.4),

$$L = \frac{L + HL}{2}.$$

It follows that $L = v^*$, as $v^*$ is the unique fixed point of $H$.
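The limiting argument above can also be seen numerically. The toy sketch below (our own illustration, with a simple scalar contraction standing in for the operator $H$) runs the averaged iteration $L^{k+1} = (L^k + HL^k)/2$ and exhibits the monotone convergence to the fixed point.

```python
# Averaged fixed-point iteration L^{k+1} = (L^k + H L^k) / 2.
# With H(x) = 0.5 x + 1 (unique fixed point x* = 2) and a starting
# point below x*, the iterates increase monotonically toward x*,
# mirroring the behavior established for the bounding sequence L^k.

def averaged_iteration(H, L0, iters):
    L = L0
    seq = [L]
    for _ in range(iters):
        L = 0.5 * (L + H(L))
        seq.append(L)
    return seq
```

The choice of a contraction here is only for illustration; in the paper, the monotonicity conditions (11)-(13) on $H$ play the role that contractivity plays in this toy.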

Each lemma assumes all the conditions imposed and all the results obtained before its

statement in the proof of Theorem 1. To improve the comprehension of each proof, all the

assumptions are presented beforehand.

Proof: [Proof of lemma 5.1]

Assumptions: Assume stepsize conditions (15)-(16).

Fix $\omega \in \Omega$. Omitting the dependence on $\omega$, let $(W^*_t, R^{x,*}_t)$ be an accumulation point of $\{(W^n_t, R^{x,n}_t)\}_{n\ge 0}$. In order to simplify notation, let $s^n_t(W^*_t, R^{x,*}_t)$ be denoted by $s^{*,n}_t$ and $\alpha^n_t(W^*_t, R^{x,*}_t)$ be denoted by $\alpha^{*,n}_t$. Furthermore, let
\[
\theta^n_{t+1} = s^n_{t+1}\bigl(R^{x,n}_t 1_{\{R^{x,*}_t \le R^{x,n}_t\}} + (R^{x,n}_t + 1) 1_{\{R^{x,*}_t > R^{x,n}_t\}}\bigr).
\]
We have, for $n \ge 1$,
\[
(s^{*,n}_t)^2 \le \bigl[(1-\alpha^{*,n}_t)\,s^{*,n-1}_t + \alpha^{*,n}_t\,\theta^n_{t+1}\bigr]^2
= (s^{*,n-1}_t)^2 - 2\alpha^{*,n}_t (s^{*,n-1}_t)^2 + A^n_t, \tag{36}
\]
where $A^n_t = 2\alpha^{*,n}_t s^{*,n-1}_t \theta^n_{t+1} + (\alpha^{*,n}_t)^2(\theta^n_{t+1} - s^{*,n-1}_t)^2$. We want to show that
\[
\sum_{n=1}^{\infty} A^n_t = 2\sum_{n=1}^{\infty} \alpha^{*,n}_t s^{*,n-1}_t \theta^n_{t+1} + \sum_{n=1}^{\infty} (\alpha^{*,n}_t)^2(\theta^n_{t+1} - s^{*,n-1}_t)^2 < \infty.
\]
It is easy to see that both $s^{*,n-1}_t$ and $\theta^n_{t+1}$ are bounded. Thus $(\theta^n_{t+1} - s^{*,n-1}_t)^2$ is bounded, and (16) tells us that
\[
\sum_{n=1}^{\infty} (\alpha^{*,n}_t)^2(\theta^n_{t+1} - s^{*,n-1}_t)^2 < \infty. \tag{37}
\]
Define a new sequence $\{g^n_{t+1}\}_{n\ge 0}$, where $g^0_{t+1} = 0$ and $g^n_{t+1} = \sum_{m=1}^{n} \alpha^{*,m}_t s^{*,m-1}_t \theta^m_{t+1}$. We can easily check that $\{g^n_{t+1}\}_{n\ge 0}$ is an $\mathcal{F}^n_T$-martingale bounded in $L^2$. Measurability is obvious. The martingale equality follows from repeated conditioning and the unbiasedness property. Finally, the $L^2$-boundedness, and consequently the integrability, can be obtained by noticing that
\[
(g^n_{t+1})^2 = (g^{n-1}_{t+1})^2 + 2 g^{n-1}_{t+1}\alpha^{*,n}_t s^{*,n-1}_t \theta^n_{t+1} + (\alpha^{*,n}_t)^2 (s^{*,n-1}_t\theta^n_{t+1})^2.
\]
From the martingale equality and the boundedness of $s^{*,n-1}_t$ and $\theta^n_{t+1}$, we get
\[
\mathbb{E}\bigl[(g^n_{t+1})^2 \mid \mathcal{F}^{n-1}_T\bigr] \le (g^{n-1}_{t+1})^2 + C\,\mathbb{E}\bigl[(\alpha^{*,n}_t)^2 \mid \mathcal{F}^{n-1}_T\bigr],
\]
where $C$ is a constant. Hence, taking expectations and repeating the process, we obtain, from the stepsize assumption (16) and $\mathbb{E}[(g^0_{t+1})^2] = 0$,
\[
\mathbb{E}\bigl[(g^n_{t+1})^2\bigr] \le \mathbb{E}\bigl[(g^{n-1}_{t+1})^2\bigr] + C\,\mathbb{E}\bigl[(\alpha^{*,n}_t)^2\bigr] \le \mathbb{E}\bigl[(g^0_{t+1})^2\bigr] + C\sum_{m=1}^{n} \mathbb{E}\bigl[(\alpha^{*,m}_t)^2\bigr] < \infty.
\]
Therefore, the $L^2$-Bounded Martingale Convergence Theorem (Shiryaev, 1996, page 510) tells us that
\[
\sum_{n=1}^{\infty} \alpha^{*,n}_t s^{*,n-1}_t \theta^n_{t+1} < \infty. \tag{38}
\]

Inequalities (37) and (38) show us that $\sum_{n=1}^{\infty} A^n_t < \infty$, and so it is valid to write
\[
A^n_t = \sum_{m=n}^{\infty} A^m_t - \sum_{m=n+1}^{\infty} A^m_t.
\]
Therefore, as $-2\alpha^{*,n}_t (s^{*,n-1}_t)^2 \le 0$, inequality (36) can be rewritten as
\[
(s^{*,n}_t)^2 + \sum_{m=n+1}^{\infty} A^m_t \le (s^{*,n-1}_t)^2 + \sum_{m=n}^{\infty} A^m_t. \tag{39}
\]
Thus, the sequence $\bigl\{(s^{*,n-1}_t)^2 + \sum_{m=n}^{\infty} A^m_t\bigr\}_{n\ge 1}$ is decreasing and, since $\sum_{m=1}^{\infty} A^m_t < \infty$, bounded from below. Hence, it is convergent. Moreover, as $\sum_{m=n}^{\infty} A^m_t \to 0$ when $n \to \infty$, we can conclude that $\{s^{*,n}_t\}_{n\ge 0}$ converges.

Finally, as inequality (36) holds for all $n \ge 1$, it yields
\begin{align*}
(s^{*,n}_t)^2 &\le (s^{*,n-1}_t)^2 - 2\alpha^{*,n}_t (s^{*,n-1}_t)^2 + A^n_t \\
&\le (s^{*,n-2}_t)^2 - 2\alpha^{*,n-1}_t (s^{*,n-2}_t)^2 + A^{n-1}_t - 2\alpha^{*,n}_t (s^{*,n-1}_t)^2 + A^n_t \\
&\;\;\vdots \\
&\le -2\sum_{m=1}^{n} \alpha^{*,m}_t (s^{*,m-1}_t)^2 + \sum_{m=1}^{n} A^m_t,
\end{align*}
where the last line uses $s^{*,0}_t = 0$. Passing to the limit we obtain
\[
\limsup_{n\to\infty}\, (s^{*,n}_t)^2 + 2\sum_{m=1}^{\infty} \alpha^{*,m}_t (s^{*,m-1}_t)^2 \le \sum_{m=1}^{\infty} A^m_t < \infty.
\]
This implies, together with the convergence of $\{s^{*,n}_t\}_{n\ge 0}$, that $\sum_{m=1}^{\infty} \alpha^{*,m}_t (s^{*,m-1}_t)^2 < \infty$. On the other hand, stepsize assumption (15) tells us that $\sum_{m=1}^{\infty} \alpha^{*,m}_t = \infty$. Hence, there must exist a subsequence of $\{s^{*,n}_t\}_{n\ge 0}$ that converges to zero. Therefore, as every subsequence of a convergent sequence converges to the same limit, it follows that $\{s^{*,n}_t\}_{n\ge 0}$ converges to zero.
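The role of the stepsize conditions (15)-(16) in lemma 5.1 can be illustrated numerically. The sketch below is an illustration only, not the algorithm of the paper: it smooths synthetic zero-mean noise with the harmonic stepsize $\alpha_n = 1/n$, which satisfies $\sum_n \alpha_n = \infty$ and $\sum_n \alpha_n^2 < \infty$, and shows the smoothed sequence collapsing toward zero.

```python
import random

def smoothed_noise(num_iters, seed=0):
    """Run s^n = (1 - a_n) s^{n-1} + a_n * X_n with a_n = 1/n,
    where X_n is synthetic zero-mean noise.  Under the stepsize
    conditions sum a_n = inf and sum a_n^2 < inf, s^n -> 0."""
    rng = random.Random(seed)
    s = 0.0
    for n in range(1, num_iters + 1):
        alpha = 1.0 / n          # harmonic stepsize: satisfies (15)-(16)
        x = rng.gauss(0.0, 1.0)  # zero-mean observation noise
        s = (1.0 - alpha) * s + alpha * x
    return s

print(abs(smoothed_noise(100_000)))  # small: the noise has been averaged out
```

With $\alpha_n = 1/n$ this recursion is simply the running average of the noise, so the smoothed value shrinks at roughly the rate $n^{-1/2}$.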

Proof: [Proof of lemma 5.2]

Assumptions: Given $t = 0, \dots, T-1$, $k \ge 0$ and an integer $N^k_t$, assume that (29), (26) and (27) hold true for $n \ge N^k_t$.

Again, fix $\omega \in \Omega$ and omit the dependence on $\omega$. The proof is by induction on $n$. Let $(W^*_t, R^{x,*}_t)$ be in $\mathcal{S}_t$. The base case $n = N^k_t$ is immediate from the facts that $s^{N^k_t-1}_t(W^*_t, R^{x,*}_t) = 0$ and $l^{N^k_t-1}_t(W^*_t, R^{x,*}_t) = L^k_t(W^*_t, R^{x,*}_t)$, together with the assumption that (29) holds true for $n \ge N^k_t$. Now suppose (33) holds for $n > N^k_t \ge N$; we prove it for $n+1$.

In order to simplify the notation, let $\alpha^n_t(W^*_t, R^{x,*}_t)$ be denoted by $\alpha^n_t$ and $\bar v^n_t(W^*_t, R^{x,*}_t)$ be denoted by $\bar v^n_t$. We use the same shorthand notation for $z^n_t(W^*_t, R^{x,*}_t)$, $l^n_t(W^*_t, R^{x,*}_t)$ and $s^n_t(W^*_t, R^{x,*}_t)$ as well. Keep in mind that the set of iterations $\mathcal{N}_t(W^*_t, R^{x,*}_t)$ is finite and that, for all $n \ge N^k_t \ge N$, $\bar v^n_t(W^*_t, R^{x,*}_t) \ge z^n_t(W^*_t, R^{x,*}_t)$. We consider three different cases:

Case 1: $W^*_t = W^n_t$ and $R^{x,*}_t = R^{x,n}_t$.

In this case, $(W^*_t, R^{x,*}_t)$ is the state visited by the algorithm at iteration $n$ at time $t$. Thus,
\begin{align}
\bar v^n_t \ge z^n_t &= (1-\alpha^n_t)\,\bar v^{n-1}_t + \alpha^n_t\,\hat v^n_{t+1}(R^{x,n}_t) \nonumber\\
&\ge (1-\alpha^n_t)(l^{n-1}_t - s^{n-1}_t) + \alpha^n_t\,\hat v^n_{t+1}(R^{x,n}_t) \nonumber\\
&\quad - \alpha^n_t (H v^{n-1})_t(W^n_t, R^{x,n}_t) + \alpha^n_t (H v^{n-1})_t(W^n_t, R^{x,n}_t) \tag{40}\\
&\ge (1-\alpha^n_t)(l^{n-1}_t - s^{n-1}_t) - \alpha^n_t\,s^n_{t+1}(R^{x,n}_t) + \alpha^n_t (H L^k)_t(W^n_t, R^{x,n}_t) \tag{41}\\
&= l^n_t - \bigl((1-\alpha^n_t)s^{n-1}_t + \alpha^n_t s^n_{t+1}(R^{x,n}_t)\bigr) \tag{42}\\
&\ge l^n_t - \max\bigl(0,\,(1-\alpha^n_t)s^{n-1}_t + \alpha^n_t s^n_{t+1}(R^{x,n}_t)\bigr) \nonumber\\
&= l^n_t - s^n_t. \tag{43}
\end{align}
The first inequality is due to the construction of the set $\mathcal{S}_t$, while (40) is due to the induction hypothesis. Inequalities (26) and (27) for $n \ge N^k_t$ explain (41). Finally, (42) and (43) follow from the definitions of the stochastic sequences $l^n_t$ and $s^n_t$, respectively.

Case 2: $W^*_t = W^n_t$ and $R^{x,*}_t = R^{x,n}_t + 1$.

This case is analogous to the previous one, except that we use the sample slope $\hat v^n_{t+1}(R^{x,n}_t + 1)$ instead of $\hat v^n_{t+1}(R^{x,n}_t)$, and we consider the $(W^n_t, R^{x,n}_t + 1)$ component instead of $(W^n_t, R^{x,n}_t)$.

Case 3: Otherwise.

Here the state $(W^*_t, R^{x,*}_t)$ is not updated by a direct observation of sample slopes at iteration $n$ at time $t$. Then $\alpha^n_t = 0$ and, hence,
\[
l^n_t = l^{n-1}_t \quad \text{and} \quad s^n_t = s^{n-1}_t.
\]
Therefore, from the construction of the set $\mathcal{S}_t$ and the induction hypothesis,
\[
\bar v^n_t \ge z^n_t = \bar v^{n-1}_t \ge l^{n-1}_t - s^{n-1}_t = l^n_t - s^n_t.
\]

Proof: [Proof of lemma 5.3]

Assumptions: Assume stepsize conditions (15)-(16). Moreover, assume for all $k \ge 0$ and integer $N^k_t$ that inequalities (26) and (27) hold true for $n \ge N^k_t$. Finally, assume there exists an integer random variable $N^k(W_t, R)$ such that $\bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R)$ for $n \ge N^k(W_t, R)$, and that the set $\mathcal{N}_t(W_t, R+1)$ is infinite.

Pick $\omega \in \Omega$ and omit the dependence of the random elements on $\omega$. We start by showing that, for each $k \ge 0$, there exists an integer $N^{k,s}_t$ such that $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1) - s^{n-1}_t(W_t, R+1)$ for all $n \ge N^{k,s}_t$. Then we show that, for all $\varepsilon > 0$, there is an integer $N^{k,\varepsilon}_t$ such that $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1) - \varepsilon$ for all $n \ge N^{k,\varepsilon}_t$. Finally, using these results, we prove the existence of an integer $N^k(W_t, R+1)$ such that $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1)$ for all $n \ge N^k(W_t, R+1)$.

Pick $k \ge 0$. Let $N^{k,s}_t = \min\{n \in \mathcal{N}_t(W_t, R+1) : n \ge N^k(W_t, R)\} + 1$, where $N^k(W_t, R)$ is the integer such that $\bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R)$ for all $n \ge N^k(W_t, R)$. Given the projection properties discussed in the proof of Theorem 1, $\mathcal{N}_t(W_t, R+1)$ is infinite and $\bar v^{N^{k,s}_t - 1}_t(W_t, R) = \bar v^{N^{k,s}_t - 1}_t(W_t, R+1)$. Therefore, $N^{k,s}_t$ is well defined. Redefine the noise sequence $\{s^n_t(W_t, R+1)\}_{n \ge 0}$ introduced in the proof of Theorem 1 using $N^{k,s}_t$ instead of $N^k_t$. We prove that $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1) - s^{n-1}_t(W_t, R+1)$ for all $n \ge N^{k,s}_t$ by induction on $n$.

For the base case $n - 1 = N^{k,s}_t - 1$, from our choice of $N^{k,s}_t$, the monotone decreasing property of $L^k_t(W_t)$ and the fact that $s^{N^{k,s}_t - 1}_t(W_t, R+1) = 0$, we have
\[
\bar v^{n-1}_t(W_t, R+1) = \bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R) \ge L^k_t(W_t, R+1) = L^k_t(W_t, R+1) - s^{n-1}_t(W_t, R+1).
\]
Now suppose $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1) - s^{n-1}_t(W_t, R+1)$ is true for $n-1 \ge N^{k,s}_t - 1$, and prove it for $n$. We have to consider two cases:

Case 1: $n \in \mathcal{N}_t(W_t, R+1)$.

In this case, a projection operation took place at iteration $n$. This fact and the monotone decreasing property of $L^k_t(W_t)$ give us
\[
\bar v^{n-1}_t(W_t, R+1) = \bar v^{n-1}_t(W_t, R) \ge L^k_t(W_t, R) \ge L^k_t(W_t, R+1) \ge L^k_t(W_t, R+1) - s^{n-1}_t(W_t, R+1).
\]

Case 2: $n \notin \mathcal{N}_t(W_t, R+1)$.

The analysis of this case is analogous to the proof of inequality (33) of lemma 5.2 for $(W_t, R+1) \in \mathcal{S}_t$. The difference is that we consider $L^k_t$ instead of the stochastic bounding sequence $\{l^n_t(W_t, R+1)\}_{n \ge 0}$, and we note that
\[
\bigl(1 - \alpha^n_t(W_t, R+1)\bigr) L^k_t(W_t, R+1) + \alpha^n_t(W_t, R+1) (H L^k)_t(W_t, R+1) \ge L^k_t(W_t, R+1).
\]

Hence, we have proved that for all $k \ge 0$ there exists an integer $N^{k,s}_t$ such that $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1) - s^{n-1}_t(W_t, R+1)$ for all $n \ge N^{k,s}_t$.

We move on to show that, for all $k \ge 0$ and $\varepsilon > 0$, there is an integer $N^{k,\varepsilon}_t$ such that $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1) - \varepsilon$ for all $n \ge N^{k,\varepsilon}_t$. We have to consider two cases: (i) $(W_t, R+1)$ is equal to an accumulation point $(W^*_t, R^{x,*}_t)$ or to $(W^*_t, R^{x,*}_t + 1)$, and (ii) $(W_t, R+1)$ is equal to neither an accumulation point $(W^*_t, R^{x,*}_t)$ nor $(W^*_t, R^{x,*}_t + 1)$. For the first case, lemma 5.1 tells us that $s^n_t(W_t, R+1)$ goes to zero. Then there exists $N^\varepsilon > 0$ such that $s^n_t(W_t, R+1) < \varepsilon$ for all $n \ge N^\varepsilon$. Therefore, we just need to choose $N^{k,\varepsilon}_t = \max(N^{k,s}_t, N^\varepsilon)$. For the second case, $\alpha^n_t(W_t, R+1) = 0$ for all $n \ge N^{k,s}_t$ and $s^{N^{k,s}_t - 1}_t(W_t, R+1) = 0$. Thus, $s^n_t(W_t, R+1) = s^{N^{k,s}_t - 1}_t(W_t, R+1) = 0$ for all $n \ge N^{k,s}_t$, and we just have to choose $N^{k,\varepsilon}_t = N^{k,s}_t$.

We are ready to conclude the proof, using the result of the previous paragraph. Pick $k \ge 0$ and let $\varepsilon = v^*_t(W_t, R+1) - L^k_t(W_t, R+1) > 0$. Since $\{L^k_t(W_t, R+1)\}_{k \ge 0}$ increases to $v^*_t(W_t, R+1)$, there exists $k' > k$ such that $v^*_t(W_t, R+1) - L^{k'}_t(W_t, R+1) < \varepsilon/2$. Thus, $L^{k'}_t(W_t, R+1) - L^k_t(W_t, R+1) > \varepsilon/2$, and the result of the previous paragraph tells us that there exists $N^{k',\varepsilon/2}_t$ such that $\bar v^{n-1}_t(W_t, R+1) \ge L^{k'}_t(W_t, R+1) - \varepsilon/2 > L^k_t(W_t, R+1) + \varepsilon/2 - \varepsilon/2 = L^k_t(W_t, R+1)$ for all $n \ge N^{k',\varepsilon/2}_t$. Therefore, we just need to choose $N^k(W_t, R+1) = N^{k',\varepsilon/2}_t$, and we have proved that for all $k \ge 0$ there exists $N^k(W_t, R+1)$ such that $\bar v^{n-1}_t(W_t, R+1) \ge L^k_t(W_t, R+1)$ for all $n \ge N^k(W_t, R+1)$.
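The argument above leans on the projection properties from the proof of Theorem 1: the slopes are kept monotonically decreasing in the resource level, and when a projection acts at level $R+1$ it copies the value from level $R$, which is how a lower bound established at $R$ transfers to $R+1$. A toy version of such a concavity-restoring projection (our own illustrative operator, not necessarily the exact one used in the paper) is:

```python
def project_monotone(v, i):
    """After updating slope v[i], restore the monotone (concave)
    structure v[0] >= v[1] >= ... >= v[-1] by leveling any
    violating neighbors to the new value v[i]."""
    for r in range(i + 1, len(v)):   # levels above i: cap at v[i]
        if v[r] > v[i]:
            v[r] = v[i]
    for r in range(i - 1, -1, -1):   # levels below i: raise to v[i]
        if v[r] < v[i]:
            v[r] = v[i]
    return v

v = [9.0, 7.0, 6.0, 4.0]   # monotone decreasing slopes
v[2] = 8.0                 # an update breaks monotonicity: v[1] < v[2]
v = project_monotone(v, 2)
print(v)  # [9.0, 8.0, 8.0, 4.0]
```

After the projection, the violating neighbor holds the same value as the updated level (here `v[1] == v[2]`), illustrating why a bound on the slope at one resource level carries over to the adjacent level whenever a projection occurs.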
