tutorial: a unified modeling and algorithmic framework for

Tutorial: A Unified Modeling and Algorithmic Framework for Optimization under Uncertainty

Informs Optimization Society Meeting

March 23, 2018

Warren B. Powell

Princeton UniversityDepartment of Operations Research

and Financial Engineering

© 2016 Warren B. Powell, Princeton University

Drug discovery

Designing molecules

» X and Y are sites where we can hang substituents to change the behavior of the molecule. We approximate the performance using a linear belief model:

0

ij ijsites i substituents j

Y X

» How to sequence experiments to learn the best molecule as quickly as possible?

Real-time logisticsUber/Lyft» Provides real-time, on-demand

transportation.» Drivers are encouraged to enter or leave

the system using pricing signals and informational guidance.

Decisions:» How to price to get the right balance of

drivers relative to customers.» Real-time management of drivers.» Policies (rules for managing drivers,

customers, …)

Ride sharing

RidersCars

t t+1 t+2

Ride sharing

Matching buyers with sellersNow we have a logistic curve for each origin-destination pair (i,j)

Number of offers for each (i,j) pair is relatively small.Need to generalize the learning across hundreds to thousands of markets.

0

0( , | )1

aij ij ij

aij ij ij

p aY

p a

eP p ae

Buyer Seller

Offered price

Meeting variability with portfolios of generationwith mixtures of dispatchability

Storage applicationsHow much energy to store in a battery to handle the volatility of wind and spot prices to meet demands?

Outline

Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies

Canonical problems

Decision trees

Canonical problems

Stochastic search (derivative based)» Basic problem:

» Stochastic gradient

» Asymptotic convergence:

» Finite time performance

max ( , )x F x W

1 1( , )n n n nn xx x F x W

*lim ( , ) ( , )nn F x W F x W

Manufacturing network (x=design)Unit commitment problem (x=day ahead decisions)Inventory system (x=design, replenishment policy)Battery system (x=choice of material)Patient treatment cost (x=drug, treatments)Trucking company (x=fleet size and mix)

,max ( , ) where is an algorithm (or policy)nF x W

Canonical problems

Ranking and selection (derivative free)» Basic problem:

» We need to design a policy that finds a design given by

» We refer to this objective as maximizing the final reward.

1 ,...,max ( , )Mx x x F x W

( )nX S

,max ,NF x W

,Nx

Canonical problems

Multi-armed bandit problems» We learn the reward from playing each

“arm”

» We need to find a policy for playing machine x that maximizes:

where

We refer to this problem as maximizing cumulative reward.

( )nX S

11

0

max ( ( ), )N

n n

n

F X S W

1 "winnings"State of knowledge

( )

n

n

n n

WSx X S Choose next “arm” to play

New information

What we know about each slot machine

Canonical problems

(Discrete) Markov decision processes» Bellman’s optimality equation

» This is also the same as solving

where the optimal policy has the form

1 1

1 1 1'

( ) min ( , ) ( ) |

min ( , ) ( ' | , ) ( )

t

t

t t a t t t t t

a t t t t t t ts

V S C S a V S S

C S a p S s S a V S

A

A

{ }( )1 1( ) arg min ( , ) ( ) | ,tt x t t t t t tX S C S x V S S xp

+ += +

00

min , ( ) |T

t t tt

E C S X S S

Canonical problems

Optimal stopping I» Model:

• Exogenous process:

• Decision:

• Reward:

» Optimization problem:

where is a “stopping time” (or )

1 If we stop and sell at time ( )

0 Otherwise t

tX

1 2, ,..., Sequence of stock pricesTp p p

Price received if we stop at time tp t

max p X " measurable function"tF

Canonical problems

Optimal stopping II» Model:

• Exogenous process:

• State:

• Policy:

» Optimization problem:

1 ( | )

0 Otherwise t t

t t

p pX S

1 2

1

, ,..., Sequence of stock prices(1 )

T

t t t

p p pp p p

0 0max ( | ) max ( | )

T T

t t t tt t

p X S p X S

1 if we are holding asset, 0 otherwise.( , , )

t

t t t t

RS R p p

=

=

Canonical problems

Linear quadratic regulation (LQR)» A popular optimal control problem in engineering

involves solving:

» where:

» Possible to show that the optimal policy looks like:

where is a complicated function of Q and R.

0 ,...,

0min ( ) ( )

T

TT T

u u t t t tt

x Qx u Ru

1

State at time Control at time (must be measurable)

( , ) ( is random at time )

t

t t

t t t t t

x tu t F

x f x u w w t

*( )t t t tU x K x

tK

Canonical problems

Stochastic programming» A (two-stage) stochastic programming problem

where

» This is the canonical form of stochastic programming, which might also be written over multiple periods:

0 0 0 0 0 1min ( , )x X c x Q x

1 10 1 ( ) ( ) 1 1( , ( )) min ( ) ( )x XQ x c x

0 01

min ( ) ( ) ( )T

t tt

c x p c x

Canonical problems

Stochastic programming» A (two-stage) stochastic programming policy

where

» This is the canonical form of stochastic programming, which might also be written over multiple periods:

1min ( , )t tx X t t t tc x Q x

1 11 ( ) ( ) 1 1( , ( )) min ( ) ( )t tt t x X t tQ x c x

' '' 1

min ( ) ( ) ( )t t

t H

t t t tt ttt t

c x p c x

Canonical problems

A robust optimization problem would be written

» This means finding the best design x for the worst outcome w in an “uncertainty set”

» This has been adapted to multiperiod problems

( )min max ( , )x X w F x w W

0 1 0,..., ( ,..., ) ( ) ' ' '' 0

min max ( )T

T

x x w w t t tt

c w x

( )

,..., ( ,..., ) ( ) ' ' ''

min max ( )t t H t t H

t H

x x w w t t tt t

c w x

Canonical problems

Why do we need a unified framework?

» The classical frameworks and algorithms are fragile.

» Small changes to problems invalidate optimality conditions, or make algorithmic approaches intractable.

» Practitioners need robust approaches that will provide high quality solutions for all problems.

Outline

Canonical problemsElements of a dynamic modelAn energy storage illustrationSolution strategies problem classesModeling uncertaintyDesigning policies

Storage applicationsHow much energy to store in a battery to handle the volatility of wind and spot prices to meet demands?

Modeling

For deterministic problems, we speak the language of mathematical programming» Linear programming:

» For time-staged problems

min x cx

0Ax b

x

1 1

0

t t t t t

t t t

t

A x B x bD x u

x

0 ,...,0

minT

T

x x t tt

c x

Arguably Dantzig’s biggest contribution, more so than the simplex algorithm, was his articulation of optimization problems in a standard format, which has given algorithmic researchers a common language.

Stochastic programming

Markov decision processes

Reinforcement learning

Optimal control

Model predictive

control

Robust optimization

Approximate dynamic

programming

Online computation

Simulation optimization

Stochastic search

Decision

analysis

Stochastic control


DynamicProgramming

andcontrol

Optimal learning

Banditproblems

Modeling

Before we can solve complex problems, we have to know how to think about them.

The biggest challenge when making decisions under uncertainty is modeling.

Min E {cx}Ax = bx > 0

Mathematician

Software

Organize classlibraries, and set up

communications and databases

Modeling

We lack a standard language for modeling sequential, stochastic decision problems.» In the slides that follow, we propose to model problems

along five fundamental dimensions:

• State variables• Decision variables• Exogenous information• Transition function• Objective function

» This framework draws heavily from Markov decision processes and the control theory communities, but it is not the standard form used anywhere.

Modeling dynamic problems

The state variable:

Controls community "Information state"Operations research/MDP/Computer science , , System state, where: Resource state (physical state) Loca

t

t t t t

t

x

S R I BR

tion/status of truck/train/plane Energy in storage Information state Prices Weather Belief state ("state of knowl

t

t

I

B

edge") Belief about traffic delays Belief about the status of equipment

The state variable

Illustrating state variables» A deterministic graph

1

2

3

4

5

6

7

8

9

10

11

12.6

8.4

9.2 3.6

8.1

17.4

15.9

16.5 20.2

13.5

8.912.7

15.9

2.34.5

7.3

9.65.7

?tS ( ) 6t tS N

The state variable

Illustrating state variables» A stochastic graph

1

2

3

4

5

6

7

8

9

10

11

3.6

8.1

13.5

8.9

12.712.6

8.4

9.2

15.9

?tS

The state variable

Illustrating state variables» A stochastic graph

1

2

3

4

5

6

7

8

9

10

11

3.6

8.1

13.5

8.9

12.712.6

8.4

9.2

15.9

?tS , ,, 6, 12.7,8.9,13.5tt t t N j j

S N c

tR

tI

The state variable

Illustrating state variables» A stochastic graph with left turn penalties

1

2

3

4

5

6

7

8

9

10

11

3.6

8.1

13.5

8.9

12.6

8.4

9.2

15.9

?tS , , 1, , 6, 12,7,8.9,13.5 ,3tt t t N j tj

S N c N

tR

tI

12.7( .7)

The state variable

Variant of problem in Puterman (2005):» Find best path from 1 to 11 that minimizes the second

highest arc cost along the path:

» If the traveler is at node 9, what is her state?

1 3

2

4

6

5

7

8

8

1212

14

87

3

11 14

11

8

9

8

10

112

4

510

15

815

( , highest,second highest) (9,15,12)t tS N ?tS

The state variables

What is a state variable?» Bellman’s classic text on dynamic programming (1957)

describes the state variable with:• “… we have a physical system characterized at any stage by a

small set of parameters, the state variables.”

» The most popular book on dynamic programming (Puterman, 2005, p.18) “defines” a state variable with the following sentence:

• “At each decision epoch, the system occupies a state.”

» Wikipedia:• A state variable is one of the set of variables that are used to

describe the mathematical ‘state’ of a dynamical system

The state variable

My definition of a state variable:

» The first depends on a policy. The second depends only on the problem (and includes the constraints).

» Using either definition, all properly modeled problems are Markovian!


Decisions:Markov decision processes/Computer science Discrete actionControl theory Low-dimensional continuous vectorOperations research Usually a discrete or continuous but high-dimensional

t

t

t

a

u

x

vector of decisions.

At this point, we do not specify to make a decision.Instead, we define the function ( ) (or ( ) or ( )), where specifies the type of policy. " " carries informationabout the type of functi

howX s A s U s

.on , and any tunable parameters ff

The decision variables

Styles of decisions» Binary

» Finite

» Continuous scalar

» Continuous vector

» Discrete vector

» Categorical

0,1x X

1,2,...,x X M

,x X a b

1( ,..., ), K kx x x x

1( ,..., ), K kx x x x

1( ,..., ), is a category (e.g. red/green/blue)I ix a a a


Exogenous information:

New information that first became known at time

ˆ ˆ ˆˆ = , , ,

ˆ Equipment failures, delays, new arrivals New drivers being hired to the network

ˆ New customer demands

t

t t t t

t

t

W t

R D p E

R

D

ˆ Changes in pricesˆ Information about the environment (temperature, ...) t

t

p

E

Note: Any variable indexed by t is known at time t. This convention, which is not standard in control theory, dramatically simplifies the modeling of information.

1 2Below, we let represent a sequence of actual observations , ,....

refers to a sample realization of the random variable .t t

W WW W


The transition function

1 1

1 1

1 1

1 1

( , , )ˆ Inventoriesˆ Spot pricesˆ Market demands

Mt t t t

t t t t

t t t

t t t

S S S x W

R R x Rp p p

D D D

Also known as the:“System model”“State transition model”“Plant model”“Plant equation”“State equation”

“Transfer function”“Transformation function”“Law of motion”“Model”“transition function”

For many applications, these equations are unknown. This is known as “model-free” dynamic programming.

1 00

max , ( ), |

T

t t t t tt

E C S X S W S

Modeling stochastic, dynamic problemsThe universal objective function

Given a system model (transition function)

and a stochastic process:

Now we just have to find the best policy.

Decision function (policy)State variable

Contribution function

Finding the best policy

Expectation over allrandom outcomes

1 1, , ( )Mt t t tS S S x W

Initial state variableNew information

0 1 2, , ,..., TS W W W

1 00

max , ( ), |

T

t t t t tt

E C S X S W S

Modeling stochastic, dynamic problemsThe universal objective function

Given a system model (transition function)

and a stochastic process:

Now we just have to find the best policy.

1 1, , ( )Mt t t tS S S x W

0 1 2, , ,..., TS W W W

Modeling Deterministic » Objective function

» Decision variables:

» Constraints: • at time t

• Transition function

Stochastic» Objective function

» Policy

» Constraints at time t

» Transition function

» Exogenous information

1 00

max , ( ), |

T

t t t t tt

E C S X S W S0 ,..., 0min

T

T

t tx x tc x

( )t t t tx X S

1 1t t t tR b B x 1 1, ,Mt t t tS S S x W

0 1 2( , , ,..., )TS W W W

0t t t

t

A x Rx

t

0 ,..., Tx x :X S

Outline


An energy storage problem

Consider a basic energy storage problem:

» We are going to show that with minor variations in the characteristics of this problem, we can make each class of policy work best.


A model of our problem

» State variables

» Decision variables

» Exogenous information


» Objective function


State variables

» We will present the full model, accumulating the information we need in the state variable.

» We will highlight information we need as we proceed. This information will make up our state variable.

E

G

B

L


Decision variables

» Constraints;

E

G

B

L

, , , , ,EL EB GL GB BLt t t t t tx x x x x x


Exogenous information

E

G

B

L

' '

' '

ˆ Change in energy from wind between 1 and

Noise in the price process between 1 and

Forecast of demand provided by vendor at time

Provided exogenously

Differenc

tp

tD

tt t

D Dt tt t tDt

E t t

t t

f D t

f f

e between actual demand and forecast

tW


Transition function

E

G

B

L

1 1

1 0 1 1 2 2 1 1

1 , 1 1

1

ˆ

( )t t t

p T pt t t t t t t t t t t

D Dt t t t

battery batteryt t t

E E E

p p p p p

D f

R R x

1

2

t

t t

t

pp p

p

Learning in stochastic optimization

Updating the demand parameter» Let be the new price and let

» We update our estimate using our recursive least squares equations:

1 11

1t t t t t

t

B pq q eg+ +

+

= -

( )1 1

11

1

( | ) , 1 ( )

1 ( )

t t t t t

Tt t t t t t

t

Tt t t t

F p p

B B B p p B

p B p

e q

g

g

+ +

++

+

= -

= -

= +

0 1 1 2 2( | ) ( )price Tt t t t t t t t t t tF p p p p p


Objective function

E

G

B

L

( , ) GB GLt t t t tC S x p x x

1 00

min ( , ( ), ) | T

t t t t tt

C S X S W S

1 0 1 1 2 2 1

1 , 1 1

1 11

1 ( )

pt t t t t t t t

D Dt t t t

t t t t tt

p p p p

D f

B x


State variables» Cost function

» Decision functionConstraints:


Price of electricitytp

1 2, , , ( , , ) , , ( , )Dt t t t t t t t t tS E L R p p p f B

Outline

Elements of a dynamic modelAn energy storage illustrationSolution strategies and problem classesModeling uncertaintyDesigning policies

Solution strategies and problem classesSpecial structure» There are special cases where we can solve

exactly. But not very many.

Sampled problems (SAA, scenario trees)» If the only problem is that we cannot compute the expectation, we

might solve a sampled approximation

Adaptive learning algorithms» This is what we have to turn to for most problems, and is the focus

of this tutorial.

max ( , )x F x W

ˆmax ( , )x F x W

Solution strategies and problem classes

State independent problems» The problem does not depend on the state of the system.

» The only state variable is what we know (or believe) about the unknown function , called the belief state , so .

State dependent problems» Now the problem may depend on what we know at time t:

» Now the state is

max ( , ) min( , )xF x W p x W cx

( , )F x W

0max ( , , ) min( , )

tx R tC S x W p x W cx


Offline (final reward)» We can iteratively search for the best solution, but only

care about the final answer.» Asymptotic formulation:

» Finite horizon formulation:

Online (cumulative reward)» We have to learn as we go

max ( , )xF x W

,max ( , )NF x W

11

0

max ( ( ), )N

n n

n

F X S W

üïïïïïýïïïïïþ

“ranking and selection”or

“stochastic search”


Learning policies:Approximate dynamic programmingQ-learningSDDP…


“Online” (cumulative reward) dynamic programming is recognized as the “dynamic programming problem,” but the entire literature on solving dynamic programs describes class (4) problems. This appears to be an open problem.

Outline


Modeling uncertainty

Observational uncertaintyPrognostic uncertainty (forecasting)Experimental noise/variabilityTransitional uncertaintyInferential uncertaintyModel uncertaintySystematic exogenous uncertaintyControl/implementation uncertaintyAlgorithmic noiseGoal uncertainty

Modeling uncertainty in the context of stochastic optimization is a relatively untapped area of research.

Outline


Designing policies

We have to start by describing what we mean by a policy.» Definition:

A policy is a mapping from a state to an action. … any mapping.

How do we search over an arbitrary space of policies?

Designing policies

Two fundamental strategies:

1) Policy search – Search over a class of functions for making decisions to optimize some metric.

2) Lookahead approximations – Approximate the impact of a decision now on the future.

0( , )0

max , ( | ) |f f

T

t t tf Ft

E C S X S S

*' ' ' 1

' 1( ) arg max ( , ) max ( , ( )) | | ,

t

T

t t x t t t t t t t tt t

X S C S x C S X S S S x

Designing policiesPolicy search:1a) Policy function approximations (PFAs)

• Lookup tables – “when in this state, take this action”

• Parametric functions– Order-up-to policies: if inventory is less than s, order up to S.– Affine policies -– Neural networks

• Locally/semi/non parametric– Requires optimizing over local regions

1b) Cost function approximations (CFAs)• Optimizing a deterministic model modified to handle uncertainty

(buffer stocks, schedule slack)

( )( | ) arg max ( , | )

t t

CFAt t tx

X S C S x

X

( | )PFAt tx X S

( | ) ( )PFAt t f f t

f Fx X S S

Designing policiesLookahead approximations – Approximate the impact of a decision now on the future:» An optimal policy (based on looking ahead):

2a) Approximating the value of being in a downstream state using machine learning (“value function approximations”)

*1 1

1 1

( ) arg max ( , ) ( ) | ,

( ) arg max ( , ) ( ) | ,

arg max ( , ) ( )

t

t

t

t t x t t t t t t

VFAt t x t t t t t t

x xx t t t t

X S C S x V S S x

X S C S x V S S x

C S x V S

*' ' ' 1

' 1( ) arg max ( , ) max ( , ( )) | | ,

t

T



The ultimate lookahead policy is optimal*

' ' ' 1' 1

( ) arg max ( , ) max ( , ( )) | | ,

t

T



Designing policies

*' ' ' 1

' 1( ) arg max ( , ) max ( , ( )) | | ,

t

T



Designing policies

The ultimate lookahead policy is optimal

Expectations that we cannot compute

Maximization that we cannot compute

Designing policies

The ultimate lookahead policy is optimal

» 2b) Instead, we have to solve an approximation called the lookahead model:

» A lookahead policy works by approximating the lookahead model.

*' ' ' 1

' 1( ) arg max ( , ) max ( , ( )) | | ,

t

T



*' ' ' , 1

' 1( ) arg max ( , ) max ( , ( )) | | ,

t

t H

t t x t t tt tt tt t t t tt t


Designing policiesTypes of lookahead approximations » One-step lookahead – Widely used in pure learning

policies:• Bayes greedy/naïve Bayes• Expected improvement• Value of information (knowledge gradient)

» Multi-step lookahead• Deterministic lookahead, also known as model predictive

control, rolling horizon procedure• Stochastic lookahead:

– Two-stage (widely used in stochastic linear programming)– Multistage

» Monte carlo tree search (MCTS) for discrete action spaces

» Multistage scenario trees (stochastic linear programming) – typically not tractable.

1) Policy function approximations (PFAs)» Lookup tables, rules, parametric/nonparametric functions

2) Cost function approximation (CFAs)»

3) Policies based on value function approximations (VFAs)»

4) Direct lookahead policies (DLAs)» Deterministic lookahead/rolling horizon proc./model predictive control

» Chance constrained programming

» Stochastic lookahead /stochastic prog/Monte Carlo tree search

» “Robust optimization”

Four (meta)classes of policies

( )( | ) arg max ( , | )

t t

CFAt t tx

X S C S x

X

( ) arg max ( , ) ( , )t

VFA x xt t x t t t t t tX S C S x V S S x

' '' 1

( ) arg max ( , ) ( ) ( ( ), ( ))t

TLA St t tt tt tt tt

t tX S C S x p C S x

xtt , xt ,t1,..., xt ,tT

,' '( ),..., ' 1

( ) arg max min ( , ) ( ( ), ( ))ttt t t H

TLA ROt t tt tt tt ttw Wx x t t

X S C S x C S w x w

[ ( )] 1t tP A x f W

Polic

y se

arch

,' ',..., ' 1

( ) arg max ( , ) ( , )

tt t t H

TLA Dt t tt tt tt ttx x t t

X S C S x C S x

Look

ahea

dap

prox

imat

ions









( )( | ) arg max ( , | )

t t

CFAt t tx

X S C S x

X

,' ',..., ' 1

( ) arg max ( , ) ( , )

tt t t H


X S C S x C S x

( ) arg max ( , ) ( , )t


' '' 1

( ) arg max ( , ) ( ) ( ( ), ( ))t




,' '( ),..., ' 1



X S C S x C S w x w

[ ( )] 1t tP A x f W

Func

tion

appr

ox.









( )( | ) arg max ( , | )

t t

CFAt t tx

X S C S x

X

,' ',..., ' 1

( ) arg max ( , ) ( , )

tt t t H


X S C S x C S x

( ) arg max ( , ) ( , )t


' '' 1

( ) arg max ( , ) ( ) ( ( ), ( ))t




,' '( ),..., ' 1



X S C S x C S w x w

[ ( )] 1t tP A x f W

Imbe

dded

opt

imiz

atio

n

Learning problems

Functions we have to learn:» Approximating the objective

.» Designing a policy .» A value function approximation

.» Designing a cost function approximation:

• The objective function ̅ , | .• The constraints |

» Approximating the transition function

Approximation strategies

Approximation strategies» Lookup tables

• Independent beliefs • Correlated beliefs

» Linear parametric models• Linear models • Sparse-linear• Tree regression

» Nonlinear parametric models• Logistic regression• Neural networks

» Nonparametric models• Gaussian process regression• Kernel regression• Support vector machines• Deep neural networks

Designing policies

Finding the best policy» We have to first articulate our classes of policies

» So minimizing over means:

» We then have to pick an objective such as

or

10 0

max , ( | ) ( | ),T T

t t t tt t

C S X S F X S W

, , ,

Parameters that characterize each family.f

f PFAs CFAs VFAs DLAs

, ff

max , ( , )T T TC S X F X W

Outline

The four classes of policies» Policy function approximations (PFAs)» Cost function approximations (CFAs)» Value function approximations (VFAs)» Direct lookahead policies (DLAs)» A hybrid lookahead/CFA

Policy function approximations

Battery arbitrage – When to charge, when to discharge, given volatile LMPs

0.00

20.00

40.00

60.00

80.00

100.00

120.00

140.00

1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71

Grid operators require that batteries bid charge and discharge prices, an hour in advance.

We have to search for the best values for the policy parameters

DischargeCharge

Charge Dischargeand .



Our policy function might be the parametric model (this is nonlinear in the parameters):

charge

charge discharge

charge

1 if ( | ) 0 if

1 if

t

t t

t

pX S p

p

Energy in storage:

Price of electricity:


Finding the best policy» We need to maximize

» We cannot compute the expectation, so we run simulations:

DischargeCharge

0

max ( ) , ( | )T

tt t t

tF C S X S

Outline


Cost function approximations

Lookup table» We can organize potential catalysts into groups» Scientists using domain knowledge can estimate

correlations in experiments between similar catalysts.

92


Correlated beliefs: Testing one material teaches us about other materials

1 2 3 4 4 5


Designing a policy» Examples of policies for making decisions:

• Interval estimation:

• Upper confidence bounding

• Thompson sampling:

• Knowledge gradient (expected value of information):

( | ) arg max Std. dev. of IE n IE n IE n n nx x x x xX S

ˆ ˆ( ) arg max ( , )TS n n n n nx x x x xX S N

log( | ) arg max No. of times is tested.UCB n UCB n UCB nx x xn

x

nX S N xN

, 1( ) max ( , ( )) max ( , )KG n n ny yx E F y B x F y B

94


Picking means we are evaluating each choice at the mean.

1 2 3 4 4 5

95


Picking means we are evaluating each choice at the 95th percentile.

1 2 3 4 4 5

Cost function approximationsOptimizing the policy» We optimize to maximize:

where

Notes:» This can handle any belief model,

including correlated beliefs, nonlinear belief models.

» All we require is that we be able to simulate a policy.

( | ) arg max ( , )n IE n IE n IE n n n nx x x x xx X S S

,max ( ) ,IEIE NF F x W


Inventory management

» How much product should I order to anticipate future demands?

» Need to accommodate different sources of uncertainty.

• Market behavior• Transit times• Supplier uncertainty• Product quality


Imagine that we want to purchase parts from different suppliers. Let be the amount of product we purchase at time t from supplier p to meet forecasted demand . We would solve

» This assumes our demand forecast is accurate.

tD

tpx

tD

( ) arg maxtt t x p tp

p PX S c x

subject to

0

tp tp P

tp p

tp

x D

x u

x

t

Imagine that we want to purchase parts from different suppliers. Let be the amount of product we purchase at time t from supplier p to meet forecasted demand . We would solve

» This is a “parametric cost function approximation”

( )

Reserve

buffer

( | ) arg max

subject to

tt t p tpx

p P

tp tp P

tp p

tp

X S c x

x D

x u

x

( )t


tD

tpx

Reserve

Buffer stock

A general way of creating CFAs:» Define our policy:

subject to

» We tune by optimizing:


( ) arg max ( , | )t x t tX C S x

0

min ( ) ( , ( ))T

t tt

F C S X

( )Ax b

Parametricallymodified costs

Parametricallymodified constraints

Outline


Locational marginal prices on the gridLMPs – Locational marginal prices

$977/MW !!!

Grid level storage Optimizing battery storage

Slide 103

Grid level storage Optimizing battery storage

Slide 104

Value function approximations

Slide 105

Value function approximationsMonday

Time :05 :10 :15 :20

Slide 106


Time :05 :10 :15 :20

Slide 107


Time :05 :10 :15 :20

Slide 108

Exploiting concavity

Derivatives are used to estimate a piecewise linear approximation

( )t tV R

tR

Slide 109

Approximate value iteration

Step 1: Start with a pre-decision state Step 2: Solve the deterministic optimization using

an approximate value function:

to obtain . Step 3: Update the value function approximation

Step 4: Obtain Monte Carlo sample of andcompute the next pre-decision state:

Step 5: Return to step 1.

, 1 ,1 1 1 1 1 1 ˆ( ) (1 ) ( )n x n n x n n

t t n t t n tV S V S v

1 ,ˆ min ( , ) ( ( , ) )n n n M x nt x t t t t t tv C S x V S S x

ntS

( )ntW

1 1( , , ( ))n M n n nt t t tS S S x W

Simulation

Deterministicoptimization

Recursivestatistics

“on policy learning”

nx

Approximate value iteration

Step 1: Start with a pre-decision state Step 2: Solve the deterministic optimization using

an approximate value function:

to obtain . Step 3: Update the value function approximation

Step 4: Obtain Monte Carlo sample of andcompute the next pre-decision state:

Step 5: Return to step 1.

, 1 ,1 1 1 1 1 1 ˆ( ) (1 ) ( )n x n n x n n

t t n t t n tV S V S v

1 ,ˆ min ( , ) ( ( , ) )n n n M x nt x t t t t t tv C S x V S S x

ntS

( )ntW

1 1( , , ( ))n M n n nt t t tS S S x W

Simulation

Deterministicoptimization

Recursivestatistics

nx

1200000

1300000

1400000

1500000

1600000

1700000

1800000

1900000

0 100 200 300 400 500 600 700 800 900 1000

Obj

ectiv

e fu

nctio

n

Iterations

Approximate dynamic programming

… a typical performance graph.

The value of grid level storage

Without storage

The value of grid level storage

With storage

“I think you give a too rosy a picture of ADP….”Andy Barto, in comments on a paper (2009)

“Is the RL glass half full, or half empty?” Rich Sutton, NIPS workshop, (2014)

Approximate value functions can work very well, but you need structure to guide the learning process. ADP needs benchmarks and careful tuning.

Outline


Lookahead policies

Planning your next chess move:

» You put your finger on the piece while you think about moves into the future. This is a lookahead policy, illustrated for a problem with discrete actions.

Lookahead policies

Decision trees:

Lookahead policies

Modeling lookahead policies» Lookahead policies solve a lookahead model, which is an

approximation of the future.» It is important to understand the difference between the:

• Base model – this is the model we are trying to solve by finding the best policy. This is usually some form of simulator.

• The lookahead model, which is our approximation of the future to help us make better decisions now.

» The base model is typically a simulator, or it might be the real world.

Lookahead policies

Lookahead models use five classes of approximations:» Horizon truncation – Replacing a longer horizon problem

with a shorter horizon» Stage aggregation – Replacing multistage problems with

two-stage approximation.» Outcome aggregation/sampling – Simplifying the

exogenous information process» Discretization – Of time, states and decisions» Dimensionality reduction – We may ignore some variables

(such as forecasts) in the lookahead model that we capture in the base model (these become latent variables in the lookahead model).

Lookahead policies

Lookahead policies are the trickiest to model:» We create “tilde variables” for the lookahead model:

» All variables are indexed by t (when the lookaheadmodel is being generated) and t’ (the time within the lookahead model).

, '

, '

, , 1 ,

Approximated state variable (e.g coarse discretization)Decision we plan on implementing at time ' when we are

planning at time , ' , 1,...,

, ,...,

t t

t t

t t t t t t t H

Sx t

t t t t t H

x x x x

, '

, '

, '

Approximation of information processForecast of costs at time ' made at time

Forecast of right hand sides for time ' made at time

t t

t t

t t

Wc t t

b t t

Lookahead policies

We can use this notation to create a policy based on our lookahead model:

» Simplest lookahead is deterministic.

*' ' ' , 1

' 1( ) arg max ( , ) max ( , ( )) | | ,

t H

t t t t tt tt tt t t t tt t


Restricted/simplified set of policies

Simplified/discretized set of state variables

Simplified/discretized set of decision variables

Sampled set of realizations (or deterministic);Aggregated staging of decisions and information

Limited horizon

Lookahead policies

Deterministic lookahead

Stochastic lookahead (with two-stage approximation)

' '' 1

( ) arg max ( , ) ( , )T

LA Dt t tt tt tt tt

t t

X S C S x C S x


' '' 1

( ) arg max ( , ) ( ) ( ( ), ( ))t




Scenario trees

Lookahead policies

Lookahead policies peek into the future» Optimize over deterministic lookahead model

The real process

. . . .

The

look

ahea

dm

odel

t 1t 2t 3t

Lookahead policies

Stochastic lookahead» Here, we approximate the information model by using a

Monte Carlo sample to create a scenario tree: 1am 2am 3am 4am 5am …..

Change in wind speed



Lookahead policies

We can then simulate this lookahead policy over time:

. . . .

t 1t 2t 3t

The

look

ahea

dm

odel

The base model

. . . .

t 1t 2t 3t

Lookahead policies

We can then simulate this lookahead policy over time:

The

look

ahea

dm

odel

The base model

Learning damaged networks

XX

X

134callcall

call

call

134

call

call

Which way should the truck go? X

X

X

call

Decision Outcome Decision Outcome Decision

Lookahead policies

Monte Carlo tree search:

C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis and S. Colton, “A survey of Monte Carlo tree search methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–49, March 2012.

Lookahead policies

Notes:

» Solving stochastic lookahead policies can be hard!

» … but this is still just a lookahead policy which is a class of rolling horizon heuristic.

» Even if solving the lookahead model is hard, an optimal solution of a lookahead model (even a stochastic one) is (with rare exceptions) not an optimal policy.

Outline


Parametric cost function approximation

An energy storage problem:

Parametric cost function approximationForecasts evolve over time as new information arrives:

Actual

Rolling forecasts, updated each hour.

Forecast made at midnight:


Benchmark policy – Deterministic lookahead

' ' ' '

' ' '

' ' '

max' ' '

' ' 'arg

' 'arg

' '

wd rd gd Dtt tt tt tt

gd gr Gtt tt tt

rd rgtt tt tt

wr grtt tt ttwr wd Ett tt ttwr gr ch ett ttrd rg disch ett tt

x x x f

x x f

x x R

x x R R

x x f

x x

x x

b

g

g

+ + £

+ £

+ £

+ £ -

+ £

+ £

+ £


Parametric cost function approximations» Replace the constraint

with:» Lookup table modified forecasts (one adjustment term for

each time in the future):

» Exponential function for adjustments (just two parameters)

» Constant adjustment (one parameter)

't t

' ' ' 'wr wd Ett tt t t ttx x f

'wrttx '

wdttx

2 ( ' )

' ' 1 '

t twr wd Ett tt ttx x e f

' ' 'wr wd Ett tt ttx x f


Optimizing the CFA:» Let be a simulation of our policy given by

» We then compute the gradient with respect to

» The parameter is found using a classical stochastic gradient algorithm:

We tested several stepsize formulas and found that ADAGRAD worked best:

( , ) ( , )F F

( , )F

0

( , ) ( ), ( ( ) | )T

t t tt

F C S X S

1 1( , )n n n nnF

( )21

' 0

( , )t

n t x t ttt

G F x WGh

ae +

=

= = +

å


Optimizing the CFA:» We compute the gradient by applying the chain rule

where the interaction from one time period to the next is captured using

» Assuming there are no integer variables, these equations are quite easy to compute.

0fs = 10fs =

20fs = 30fs =Lookup table

Constant parameterExponential function


Improvement over deterministic benchmark:

Lookup tableExponential

Constant


The parametric CFA represents a fundamental rethinking of the modeling of stochastic programming problems:» From thinking of the lookahead model as the objective

function:

» To acknowledging that the lookahead model is a policy for solving the base model…

…. which is a simulator where we do not have to make any of the standard approximations required in stochastic programming.

0 01

max ( ) ( ) ( )T

t tt

c x p c x

1 00

max , ( | ), |T

t t t t tt

E C S X S W S


Consider a basic energy storage problem:

» We are going to show that with minor variations in the characteristics of this problem, we can make each class of policy work best.


We can create distinct flavors of this problem:» Problem class 1 – Best for PFAs

• Highly stochastic (heavy tailed) electricity prices• Stationary data

» Problem class 2 – Best for CFAs• Stochastic prices and wind (but not heavy tailed)• Stationary data

» Problem class 3 - Best for VFAs• Stochastic wind and prices (but not too random)• Time varying loads, but inaccurate wind forecasts

» Problem class 4 – Best for deterministic lookaheads• Relatively low noise problem with accurate forecasts

» Problem class 5 – A hybrid policy worked best here• Stochastic prices and wind, nonstationary data, noisy forecasts.

An energy storage problemThe policies» The PFA:

• Charge battery when price is below p1• Discharge when price is above p2

» The CFA• Optimize over a horizon H; maintain upper and lower bounds (u, l)

for every time period except the first (note that this is a hybrid with a lookahead).

» The VFA• Piecewise linear, concave value function in terms of energy, indexed

by time.» The lookahead (deterministic)

• Optimize over a horizon H (only tunable parameter) using forecasts of demand, prices and wind energy

» The lookahead CFA• Use a lookahead policy (deterministic), but with a tunable parameter

that improves robustness.


Each policy is best on certain problems» Results are percent of posterior optimal solution

» … any policy might be best depending on the data.Joint research with Prof. Stephan Meisel, University of Muenster, Germany.


Markov decision processes

Reinforcement learning

Optimal control

Model predictive

control

Robust optimization

Approximate dynamic

programming

Online computation


Stochastic search

Decision

analysis

Stochastic control


DynamicProgramming

andcontrol

Optimal learning

Banditproblems

© Warren Powell 2017

Multi-armed bandits/optimal learning

Reinforcementlearning/ADP

Stochastic search

Simulationoptimization


Optimal controlMarkov decisionprocesses

Lookahead policies (DLAs)

Stochastic gradients

© Warren Powell 2017

Ranking andselection

Policy search(PFAs)

Cost functionapprox. (CFAs)

Value function approx. (VFAs)

Theory

Applications

Computation

Modeling

Thank you!http://www.castlelab.princeton.edu/

A tutorial on this topic is available at the top of

http://www.castlelab.princeton.edu/jungle/

tutorial: a unified modeling and algorithmic framework for

Documents