Three Interrelated Research Directions Aggregation and Seminorm Projected Equations Simulation-Based Solution
Some New Directions in Dynamic Programming with Cost Function Approximation

Dimitri P. Bertsekas, joint work with Huizhen Yu

Department of Electrical Engineering and Computer Science
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology

IEEE Symposium on ADPRL, December 2014
Outline
1 Three Interrelated Research Directions
   Seminorm Projections (Unifying Projected Equation and Aggregation Approaches)
   Generalized Bellman Equations (Multistep with State-Dependent Weights)
   Free-Form Sampling (A Flexible Alternative to Single Long Trajectory Simulation)
2 Aggregation and Seminorm Projected Equations
3 Simulation-Based Solution
   Iterative and Matrix Inversion Methods
   Free-Form Sampling
Bellman Equations and their Fixed Points
Bellman equation for a policy µ of an n-state α-discounted MDP
J = TµJ
where
(TµJ)(i) ≜ ∑_{j=1}^{n} pij(µ(i)) ( g(i, µ(i), j) + αJ(j) ),   i = 1, …, n

pij(µ(i)): transition probabilities; g(i, µ(i), j): cost per stage under µ
Bellman equation for the optimal cost function of an n-state MDP
J = TJ
where
(TJ)(i) ≜ min_{u ∈ U(i)} ∑_{j=1}^{n} pij(u) ( g(i, u, j) + αJ(j) ),   i = 1, …, n

pij(u): transition probabilities; g(i, u, j): cost per stage for a control u
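The two mappings above can be sketched numerically. A minimal Python sketch, using a made-up 3-state MDP and expected stage costs g(i, u) = ∑_j pij(u) g(i, u, j) (all numbers invented for illustration):

```python
import numpy as np

# A small 3-state discounted MDP, invented for illustration.
n, alpha = 3, 0.9
# P[u] is the transition matrix under control u; g[u] the expected stage cost vector.
P = {0: np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.3, 0.7]]),
     1: np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.1, 0.8]])}
g = {0: np.array([1.0, 2.0, 0.5]),
     1: np.array([1.5, 0.5, 1.0])}

def T_mu(J, mu):
    """(T_mu J)(i) = g(i, mu(i)) + alpha * sum_j p_ij(mu(i)) J(j)."""
    return np.array([g[mu[i]][i] + alpha * P[mu[i]][i] @ J for i in range(n)])

def T(J):
    """(T J)(i) = min over controls u of the same expression."""
    return np.array([min(g[u][i] + alpha * P[u][i] @ J for u in P) for i in range(n)])

# Value iteration J_{k+1} = T J_k converges to the fixed point J* = T J*.
J = np.zeros(n)
for _ in range(2000):
    J = T(J)
```

Since T is an α-contraction, the loop converges geometrically to the unique fixed point of the Bellman equation J = TJ.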
Subspace Approximation J ≈ Φr (Using a Matrix of Basis Functions Φ)
Methods with subspace approximation
Projected equation (Galerkin) approach: Φr = ΠTµ(Φr) (Π is projection with respect to some weighted Euclidean norm)
Aggregation approach: Φr = ΦDTµ(Φr) (Φ and D are matrices whose rows are probability distributions)
Bellman error method: Φr = ΠT̂µ(Φr), for a modified mapping T̂µ that has the same fixed points as Tµ
The first research direction aims to connect all of these
All of them can be written as Φr = ΠTµ(Φr), where Π is a seminorm-weighted Euclidean projection
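For a fixed policy with transition matrix P and expected stage cost vector g, the projected equation Φr = ΠTµ(Φr), with Π the ξ-weighted Euclidean projection, reduces to the linear system Cr = d, where C = ΦᵀΞ(I − αP)Φ and d = ΦᵀΞg with Ξ = diag(ξ). A numerical sketch with invented matrices:

```python
import numpy as np

np.random.seed(0)
n, s, alpha = 5, 2, 0.9
P = np.random.rand(n, n); P /= P.sum(axis=1, keepdims=True)  # a stochastic matrix
g = np.random.rand(n)                                        # expected stage costs
Phi = np.random.rand(n, s)                                   # basis functions (columns)
xi = np.full(n, 1.0 / n)                                     # projection weights
Xi = np.diag(xi)

# Projected equation Phi r = Pi T(Phi r)  <=>  C r = d:
C = Phi.T @ Xi @ (np.eye(n) - alpha * P) @ Phi
d = Phi.T @ Xi @ g
r = np.linalg.solve(C, d)

# Check the fixed-point property: Phi r equals the projection of T(Phi r).
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)     # projection onto S
T_Phir = g + alpha * P @ (Phi @ r)
```

This is the matrix form that simulation-based methods (e.g. LSTD-type estimators) approximate by sampling C and d.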
Another Direction of Research: Generalized Bellman Equations
Ordinary Bellman equation for a policy µ of an n-state MDP
J = TµJ
Generalized Bellman equation
J = T (w)µ J
where w is a matrix of weights wiℓ:

(T(w)µ J)(i) ≜ ∑_{ℓ=1}^{∞} wiℓ (TℓµJ)(i),   wiℓ ≥ 0,   ∑_{ℓ=1}^{∞} wiℓ = 1   (for each i = 1, …, n)
Both can be solved for Jµ, the cost vector of policy µ.
Two differences of generalized vs ordinary Bellman equations
Multistep mappings (an old idea, e.g., TD(λ))
State dependent weights (a new idea)
Special Cases
Classical TD(λ) mapping, λ ∈ [0, 1)
T(λ)J = (1 − λ) ∑_{ℓ=1}^{∞} λ^{ℓ−1} TℓJ,   wiℓ = (1 − λ) λ^{ℓ−1}

A generalization: State-dependent λi ∈ [0, 1)

(T(w)J)(i) = (1 − λi) ∑_{ℓ=1}^{∞} λi^{ℓ−1} (TℓJ)(i),   wiℓ = (1 − λi) λi^{ℓ−1}
Why state dependent weights?
They may allow exploitation of prior knowledge for better approximation (emphasize important states)
They may facilitate simulation (for special cases such as aggregation)
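The state-dependent-λ mapping above can be checked numerically: since Jµ is a fixed point of Tµ, it is also a fixed point of T(w) for any choice of the λi. A sketch with an invented policy, truncating the infinite sum at L terms:

```python
import numpy as np

np.random.seed(1)
n, alpha = 4, 0.9
P = np.random.rand(n, n); P /= P.sum(axis=1, keepdims=True)  # policy's transition matrix
g = np.random.rand(n)                                        # expected stage costs

def T(J):
    return g + alpha * P @ J

# One lambda per state: w_{i,l} = (1 - lam[i]) * lam[i]**(l - 1).
lam = np.array([0.0, 0.3, 0.6, 0.9])

def T_w(J, L=500):
    """(T^(w) J)(i) = (1 - lam_i) * sum_{l>=1} lam_i**(l-1) (T^l J)(i), truncated at L."""
    out = np.zeros(n)
    TlJ = J.copy()
    for l in range(1, L + 1):
        TlJ = T(TlJ)                              # TlJ now holds T^l J
        out += (1 - lam) * lam ** (l - 1) * TlJ
    return out

J_mu = np.linalg.solve(np.eye(n) - alpha * P, g)  # exact cost vector of the policy
```

Because T^ℓ Jµ = Jµ for every ℓ and the weights sum to 1 (up to the λ^L truncation error), T_w(J_mu) reproduces J_mu.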
A Third Direction for Research: Flexible/Free-Form Simulation
Classical TD Sampling
T(λ)J = (1 − λ) ∑_{ℓ=1}^{∞} λ^{ℓ−1} TℓJ
Simulate one single infinitely long trajectory, and move the starting state to generate multiple (infinitely long) trajectories
This is well-matched to the structure of TD
Does not work well in the aggregation context, where there are both regular and aggregate transitions (powers TℓJ involve ℓ regular transitions but no aggregate transitions)
TD sampling matches well with regular transitions but not with aggregate transitions
Free-form sampling
Generates many short trajectories (a trajectory of length ℓ corresponds to the term TℓJ)
Arbitrary restart distribution
Connects well with state-dependent weights (and allows restarting at an aggregate state in the case of aggregation)
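Free-form sampling can be sketched as follows: draw a trajectory length ℓ from the weights wiℓ (here geometric, matching the TD-style weights), simulate ℓ regular transitions from a restart state chosen arbitrarily, and record the discounted stage costs plus a terminal cost. All quantities below are invented for illustration; taking J = Jµ as the terminal function makes each sample an unbiased estimate of Jµ at the restart state:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, lam = 4, 0.9, 0.5
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # policy's transition matrix
g = rng.random(n)                                          # expected stage costs
J = np.linalg.solve(np.eye(n) - alpha * P, g)              # terminal cost: J = J_mu here

def short_trajectory_estimate(i0):
    """One free-form sample: length l ~ geometric(1 - lam) matches w_l = (1-lam) lam**(l-1);
    the sample equals sum_k alpha**k g(i_k) + alpha**l J(i_l), an estimate of (T^l J)(i0)."""
    l = rng.geometric(1 - lam)          # number of regular transitions
    i, cost, disc = i0, 0.0, 1.0
    for _ in range(l):
        cost += disc * g[i]
        i = rng.choice(n, p=P[i])
        disc *= alpha
    return cost + disc * J[i]           # terminal cost at the end state

# Restart distribution is arbitrary; here every short trajectory restarts at state 0.
est = np.mean([short_trajectory_estimate(0) for _ in range(20000)])
```

With a large number of short trajectories the average converges to (T^(w)Jµ)(0) = Jµ(0), with no single long trajectory required.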
References
D. P. Bertsekas, “λ-Policy Iteration: A Review and a New Implementation,” in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. Lewis and D. Liu (eds.), IEEE Press, Computational Intelligence Series, 2012 (simulation with short trajectories and restart, as a means to control exploration).
H. Yu and D. P. Bertsekas, “Weighted Bellman Equations and their Applications in Approximate Dynamic Programming,” Report LIDS-P-2876, MIT, 2012 (weighted Bellman equations and seminorm projections).
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, 4th Edition: Approximate Dynamic Programming, Athena Scientific, Belmont, MA, 2012 (a general reference where all the ideas are mentioned with limited analysis).
Generalized Bellman Eqs with Seminorm Projection: Φr = ΠT (w)(Φr)
Φ is an n × s matrix of features, defining the subspace S = {Φr | r ∈ ℝˢ}; r ∈ ℝˢ is a vector of weights.
Π is projection onto S with respect to a weighted Euclidean seminorm ‖J‖²_ξ = ∑_{i=1}^{n} ξi (J(i))², where ξ = (ξ1, …, ξn) with ξi ≥ 0.
The bias-variance tradeoff applies to both the norm and seminorm cases.
Example: TD(λ)

T(λ)J = (1 − λ) ∑_{ℓ=1}^{∞} λ^{ℓ−1} TℓJ,   λ ∈ [0, 1)
[Figure: bias-variance tradeoff. The solution Φr of the projected equation Φr = ΠT(λ)(Φr) lies in the subspace S = {Φr | r ∈ ℝˢ}; as λ ranges from 0 to 1, the bias relative to ΠJµ shrinks while the simulation error grows.]
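The seminorm projection can be formed explicitly as Π = Φ(ΦᵀΞΦ)⁻¹ΦᵀΞ whenever ΦᵀΞΦ is invertible; states with ξi = 0 are simply ignored by the projection. A small numerical check with invented matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 6, 2
Phi = rng.random((n, s))                         # feature matrix
xi = np.array([0.3, 0.3, 0.2, 0.2, 0.0, 0.0])    # a seminorm: zero weight on the last 2 states
Xi = np.diag(xi)

# Projection onto S = {Phi r} w.r.t. the xi-weighted (semi)norm,
# well defined here because Phi' Xi Phi is invertible:
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

J = rng.random(n)
J2 = J.copy()
J2[4:] += 100.0    # perturb J only on the zero-weight states
```

Pi is idempotent (a genuine projection), and Pi @ J is unchanged by arbitrary perturbations on the zero-weight states, which is what makes the seminorm view flexible enough to cover aggregation.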
Aggregation Framework
[Figure: composite system. Original system states i, j with transition probabilities pij(u); aggregate states x, y; disaggregation probabilities dxi (matrix D) and aggregation probabilities φjy (matrix Φ).]
Introduce s aggregate states, with aggregation and disaggregation probabilities
A composite system with both regular and aggregate states
Two single-step Bellman equations
r = DT (Φr), Φr = ΦDT (Φr)
r is the cost vector of the aggregate states, Φr the cost vector of the regular states
Natural multistep versions for the bias-variance tradeoff:
Φr = ΦDT (λ)(Φr) or Φr = ΦDT (w)(Φr)
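For a fixed policy, the aggregate equation r = DT(Φr) is linear and can be solved directly: r = (I − αDPΦ)⁻¹ Dg. A sketch with an invented 6-state system and two aggregate states:

```python
import numpy as np

rng = np.random.default_rng(3)
n, s_agg, alpha = 6, 2, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # policy's transition matrix
g = rng.random(n)                                          # expected stage costs

# Hard aggregation: states {0,1,2} -> aggregate 0, states {3,4,5} -> aggregate 1.
Phi = np.zeros((n, s_agg)); Phi[:3, 0] = 1.0; Phi[3:, 1] = 1.0
D = np.zeros((s_agg, n)); D[0, :3] = 1/3; D[1, 3:] = 1/3   # uniform disaggregation

# Aggregate Bellman equation r = D T(Phi r), with T(J) = g + alpha P J:
r = np.linalg.solve(np.eye(s_agg) - alpha * D @ P @ Phi, g @ D.T)

J_approx = Phi @ r    # resulting cost approximation for the regular states
```

Multiplying the fixed-point relation by Φ recovers the second form Φr = ΦDT(Φr) on the regular states.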
Two Common Types of Aggregation
Hard aggregation: The aggregate states are disjoint subsets Sx of states with ∪x Sx = {1, …, n}; dxi > 0 only if i ∈ Sx, and φix = 1 if i ∈ Sx.
Example: states 1, …, 9 on a 3 × 3 grid,

    1 2 3
    4 5 6
    7 8 9

partitioned into aggregate states x1 = {1, 2, 4, 5}, x2 = {3, 6}, x3 = {7, 8}, x4 = {9}, so that

    Φ = [ 1 0 0 0
          1 0 0 0
          0 1 0 0
          1 0 0 0
          1 0 0 0
          0 1 0 0
          0 0 1 0
          0 0 1 0
          0 0 0 1 ]
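The matrices of a hard-aggregation example like this can be built and sanity-checked in a few lines; the uniform disaggregation distribution below is one valid choice for D, not the only one:

```python
import numpy as np

# Hard-aggregation Phi for a 3x3 grid: states 1..9 (0-indexed below), aggregates
# x1 = {1,2,4,5}, x2 = {3,6}, x3 = {7,8}, x4 = {9}.
groups = [[0, 1, 3, 4], [2, 5], [6, 7], [8]]
n, s = 9, 4

Phi = np.zeros((n, s))
for x, members in enumerate(groups):
    Phi[members, x] = 1.0                 # phi_ix = 1 iff i in S_x

# One valid disaggregation matrix D: uniform over each subset (d_xi > 0 only if i in S_x).
D = np.zeros((s, n))
for x, members in enumerate(groups):
    D[x, members] = 1.0 / len(members)
```

Each row of Φ and each row of D is a probability distribution, and disjointness of the subsets gives DΦ = I.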
Aggregation with discretization grid of representative states: Each aggregate state is a single original system state x ∈ {1, …, n}, and dxx = 1.
A Generalization: Aggregation with Representative Features
[Figure: original state space with representative features. Successor states j1, j2, j3, reached from x = i with probabilities pij(u), are mapped to representative/aggregate states y1, y2, y3 through aggregation probabilities φj1y1, φj1y2, φj1y3.]
Modern view: Post 1990s
LPs are often best solved by nonsimplex/convex methods.
Convex problems are often solved by the same methods as LPs.
Nondi#erentiability and piecewise linearity are common features.
1
The aggregate states are disjoint subsets Sx of “similar" states
Common case: Sx is a group of states with “similar features"
Hard aggregation is a special case: ∪x Sx = {1, . . . , n}
Aggregation with representative states is a special case: each Sx consists of just one state
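As a concrete illustration of the hard-aggregation special case, the matrices Φ and D might be built as follows; the 6-state grouping and the uniform disaggregation distributions are assumptions made up for this sketch, not part of the talk:

```python
import numpy as np

# Hypothetical 6-state example: hard aggregation into 3 disjoint subsets.
# groups[x] lists the original states belonging to aggregate state Sx.
groups = [[0, 1], [2, 3], [4, 5]]
n, s = 6, len(groups)

# Phi: n x s membership matrix (phi_ix = 1 iff i in Sx -> piecewise-constant approx.)
Phi = np.zeros((n, s))
for x, Sx in enumerate(groups):
    for i in Sx:
        Phi[i, x] = 1.0

# D: s x n disaggregation matrix; row x is a probability distribution
# with d_xi > 0 only for i in Sx (uniform over Sx here, for simplicity).
D = np.zeros((s, n))
for x, Sx in enumerate(groups):
    D[x, Sx] = 1.0 / len(Sx)

# Sanity checks: rows of D are distributions, and disaggregation followed
# by aggregation is the identity on the aggregate space (D Phi = I).
assert np.allclose(D.sum(axis=1), 1.0)
assert np.allclose(D @ Phi, np.eye(s))
```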
Three Interrelated Research Directions Aggregation and Seminorm Projected Equations Simulation-Based Solution
Connection with Seminorm Projection
Consider the aggregation equations
r = DT (w)(Φr), (low-dimensional) Φr = ΦDT (w)(Φr), (high-dimensional)
Compare them with projected equation case Φr = ΠT (w)(Φr)
Assume that the approximation is piecewise constant with interpolation: constant within the aggregate states and interpolated at the other states; i.e., the disaggregation and aggregation probabilities satisfy
φix = 1 ∀ i ∈ Sx , dxi > 0 iff i ∈ Sx
Then ΦD is a seminorm projection with
ξi = dxi/s, ∀ i ∈ Sx (s: the number of aggregate states)
This is true for the preceding aggregation schemes. Moreover, the multistepequation Φr = ΦDT (w)(Φr) is a sup-norm contraction if T is.
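This claim admits a quick numerical check; the groups and the random disaggregation weights below are illustrative assumptions. With ξi = dxi/s, the weighted projection Φ(Φ′ΞΦ)⁻¹Φ′Ξ onto the span of Φ coincides with ΦD, which is in turn idempotent:

```python
import numpy as np

# Hypothetical hard-aggregation setup: phi_ix = 1 for i in Sx,
# and d_xi > 0 only for i in Sx (random positive weights, normalized).
groups = [[0, 1], [2, 3, 4], [5]]
n, s = 6, len(groups)
Phi = np.zeros((n, s))
D = np.zeros((s, n))
rng = np.random.default_rng(0)
for x, Sx in enumerate(groups):
    Phi[Sx, x] = 1.0
    d = rng.random(len(Sx))
    D[x, Sx] = d / d.sum()          # row x: a distribution over Sx

# Projection weights xi_i = d_xi / s (subsets are disjoint, so summing the
# columns of D recovers d_xi), and Pi = Phi (Phi' Xi Phi)^{-1} Phi' Xi.
xi = D.sum(axis=0) / s
Xi = np.diag(xi)
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)

# Pi coincides with Phi D, and Phi D is a projection (idempotent).
assert np.allclose(Pi, Phi @ D)
assert np.allclose((Phi @ D) @ (Phi @ D), Phi @ D)
```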
Sampling for Aggregation
The classic form of TD sampling does not work for multistep aggregation.
Reason: In aggregation we need to simulate multistep cost samplesinvolving both regular and aggregate states. This cannot be easily donewith classical TD sampling.
So we introduce a more general (free-form) sampling.
Generate many short trajectories.
In aggregation, the start and end states of each trajectory must be aggregate states.
A side benefit: A lot of flexibility for “exploration".
An Example: Projected Value Iteration for Equation Φr = ΠT (w)(Φr)
Exact form of projected value iteration
Φrk+1 = ΠT (w)(Φrk )
or
rk+1 = arg min_r Σ_{i=1}^{n} ξi ( φ(i)′r − Σ_{ℓ=1}^{∞} wiℓ (T^ℓ(Φrk))(i) )²    (φ(i)′: ith row of Φ)
We view the expression minimized as an expected value that can besimulated with Markov chain trajectories:
ξi will be the “frequency" of i as start state of the trajectories
wi` will be the “frequency" of trajectory length ` when i is the start state
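These two "frequency" interpretations can be mimicked with a toy free-form sampler. The 4-state chain, the exploration distribution ξ, and the geometric length distributions (which correspond to TD(λ)-style weights wiℓ = (1 − λi)λi^{ℓ−1}) are all invented for the sketch:

```python
import numpy as np

# Toy free-form sampler: start states drawn i.i.d. from an exploration
# distribution xi; each trajectory's length drawn from a state-dependent
# distribution w_i (geometric here, purely for illustration).
rng = np.random.default_rng(1)
n = 4
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # transition probs
xi = np.array([0.4, 0.3, 0.2, 0.1])                         # start-state freqs
lam = np.array([0.5, 0.5, 0.7, 0.7])                        # per-state lambda

def sample_trajectory(rng):
    i = rng.choice(n, p=xi)                 # start state ~ xi
    length = rng.geometric(1.0 - lam[i])    # length ~ w_i: (1-lam) lam^(l-1)
    traj = [i]
    for _ in range(length):
        traj.append(rng.choice(n, p=P[traj[-1]]))
    return traj

trajs = [sample_trajectory(rng) for _ in range(20000)]
# Empirical start-state frequencies approach xi, as required.
starts = np.bincount([t[0] for t in trajs], minlength=n) / len(trajs)
assert np.allclose(starts, xi, atol=0.02)
```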
Simulation-Based Implementation of Projected Value Iteration
[Figure: sampling with many short trajectories; trajectory t has start state it, end state jt, and length ℓt; its sample cost is Ct(Φrk) = discounted transition costs + terminal cost α^{ℓt}(Φrk)jt; ξi are the start-state frequencies and wiℓ the conditional length frequencies.]
Approximation using trajectories t = 1, . . . , m

rk+1 = arg min_r Σ_{t=1}^{m} ( φ(it)′r − Ct(Φrk) )²    (it: start state of trajectory t, Ct(Φrk): its sample cost)
Since the frequency of start state i converges to ξi, and the frequency of the start-state/length pair (i, ℓ) converges to ξiwiℓ, the optimality condition of the simulation-based least-squares problem converges to the optimality condition of the exact least-squares problem.
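One simulation-based iteration of this scheme might be sketched as follows; the 4-state chain, stage costs, and features are invented for the example. Each sample cost accumulates discounted transition costs plus the terminal cost (Φrk)jt, and rk+1 is the least-squares fit:

```python
import numpy as np

# One simulation-based projected value iteration on a hypothetical
# 4-state alpha-discounted chain (all numbers illustrative).
rng = np.random.default_rng(2)
n, alpha = 4, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.random((n, n))                          # stage costs g(i, j)
Phi = np.vstack([np.ones(n), np.arange(n)]).T   # two basis functions
r_k = np.zeros(2)                               # current iterate

rows, costs = [], []
for _ in range(5000):
    i = rng.integers(n)                 # free-form: any start state
    length = rng.integers(1, 6)         # free-form: short, varied lengths
    C, disc, state = 0.0, 1.0, i
    for _ in range(length):             # accumulate discounted stage costs
        nxt = rng.choice(n, p=P[state])
        C += disc * g[state, nxt]
        disc *= alpha
        state = nxt
    C += disc * (Phi @ r_k)[state]      # terminal cost (Phi r_k)_{j_t}
    rows.append(Phi[i]); costs.append(C)

# Least-squares fit: r_{k+1} = argmin_r sum_t (phi(i_t)' r - C_t)^2
r_next, *_ = np.linalg.lstsq(np.array(rows), np.array(costs), rcond=None)
```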
Matrix Inversion Method (Extension of LSTD(λ))
Find r̂ such that

r̂ = arg min_r Σ_{t=1}^{m} ( φ(it)′r − Ct(r̂) )²

Since the sample cost Ct(r̂) is affine in r̂ (through the terminal cost term), the equivalent optimality condition is a linear system of equations in r̂.
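A sketch of the resulting solve, on an invented 4-state example: substituting the affine dependence Ct(r) = ct + α^{ℓt} φ(jt)′r into the least-squares optimality condition gives a linear system A r̂ = b that is accumulated from the samples and solved by matrix inversion, in the spirit of LSTD:

```python
import numpy as np

# LSTD-like matrix-inversion solve on a hypothetical 4-state chain.
rng = np.random.default_rng(3)
n, alpha = 4, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.random((n, n))                          # stage costs g(i, j)
Phi = np.vstack([np.ones(n), np.arange(n)]).T   # two basis functions

A = np.zeros((2, 2)); b = np.zeros(2)
for _ in range(5000):
    i = rng.integers(n)                 # trajectory start state i_t
    length = rng.integers(1, 6)         # trajectory length l_t
    c, disc, state = 0.0, 1.0, i
    for _ in range(length):             # discounted transition costs
        nxt = rng.choice(n, p=P[state])
        c += disc * g[state, nxt]
        disc *= alpha
        state = nxt                     # final state is j_t; disc = alpha^l_t
    # Optimality condition, rearranged to be linear in r:
    #   sum_t phi(i_t) (phi(i_t) - alpha^{l_t} phi(j_t))' r = sum_t phi(i_t) c_t
    A += np.outer(Phi[i], Phi[i] - disc * Phi[state])
    b += Phi[i] * c

r_hat = np.linalg.solve(A, b)
```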
Concluding Remarks
Extension of the cost function approximation methodology in DP via three interlocking ideas:
Seminorm projections.
Generalized weighted Bellman equations.
Free-form simulation.
The approximation framework is general enough to include both multistep projected equations and aggregation (and other methods). Some of the highlights:
Connection between projected equations and aggregation equations.
Multistep aggregation methods of the TD(λ) type.
Use of a variety of sampling methods.
Flexible treatment of the bias-variance tradeoff.
The methodology extends to the much broader field of Galerkinapproximation for solving general linear equations.
Thank you!