
Data-driven Economic NMPC using Reinforcement Learning

Mario Zanon, Alberto Bemporad

IMT Lucca


Reinforcement Learning 2 / 27

Reinforcement Learning

model-free

optimal for the actual system

≈ sample-based stochastic optimal control

learning can be slow, expensive, unsafe

no stability guarantee (commonly based on DNN)

(Economic) Model Predictive Control

optimal for the nominal model

constraint satisfaction

can represent complex control policies

stability and recursive feasibility guarantees

Combine MPC and RL

simple MPC formulations as proxy for complicated ones

recover optimality, safety and stability for the true system


Reinforcement Learning 3 / 27

The Basics

Assumption: the system is a Markov Decision Process (MDP)

state, action s, a

stochastic transition dynamics P[s+ | s, a] ⇔ s+ = f(s, a, w)

scalar reward / stage cost L(s, a)

discount factor 0 < γ ≤ 1

[Diagram: the RL loop. The system P[s+ | s, a] receives the action a from the controller a = πθ(s) and returns the state s and the stage cost L(s, a); RL updates the parameters θ.]

Goal: learn the optimal policy using no prior knowledge, observing the reward only.
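To make the setup concrete, here is a minimal Python sketch (not part of the slides) of this interaction loop on a small synthetic MDP; the kernel P, the cost table L, and the simple table-lookup policy are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.95

# Synthetic MDP: transition kernel P[s+ | s, a] and stage cost L(s, a) (illustrative values).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalise to a probability kernel
L = rng.random((n_states, n_actions))      # stage cost / negative reward

def policy(s, theta):
    """A simple parametric policy: pick the action indexed by theta[s]."""
    return int(theta[s])

theta = rng.integers(0, n_actions, size=n_states)   # one action per state
s = rng.integers(n_states)
total_cost = 0.0
for k in range(50):                        # one episode of the closed loop
    a = policy(s, theta)                   # controller: a = pi_theta(s)
    total_cost += gamma**k * L[s, a]       # only the stage cost is observed
    s = rng.choice(n_states, p=P[s, a])    # system: s+ ~ P[. | s, a]
print("discounted cost of this rollout:", total_cost)
```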


Reinforcement Learning 4 / 27

Main Concepts

Optimal policy:

π⋆(s) = arg min_a  L(s, a) + γ E[V⋆(s+) | s, a]

Optimal value function:

V⋆(s) = L(s, π⋆(s)) + γ E[V⋆(s+) | s, π⋆(s)]

If we know V⋆, we can compute π⋆, but only if we know the model.

Optimal action-value function:

Q⋆(s, a) = L(s, a) + γ E[V⋆(s+) | s, a]
         = L(s, a) + γ E[ min_{a+} Q⋆(s+, a+) | s, a ]

Optimal policy:

π⋆(s) = arg min_a Q⋆(s, a)

If we know Q⋆, we know π⋆ (if we know how to minimise Q⋆).

LQR example:

P solves the Riccati equation

K⋆ = (R + B⊤PB)⁻¹ (S⊤ + B⊤PA)
π⋆(s) = −K⋆ s
V⋆(s) = s⊤ P s + V₀
Q⋆(s, a) = [s; a]⊤ M [s; a] + V₀,  with

M = [ Q + A⊤PA    S + A⊤PB
      S⊤ + B⊤PA   R + B⊤PB ],   K⋆ = M_aa⁻¹ M_as

If we learn M directly, we do not need a model!

How can we evaluate V⋆ and Q⋆?
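As a sanity check of the LQR relations above, the following sketch (assuming no cross term, i.e. S = 0, and no discount) computes P with SciPy's discrete Riccati solver, forms the blocks of M, and verifies that M_aa⁻¹ M_as reproduces the Riccati gain; the system matrices are arbitrary illustrative values.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative LQR data (no cross term, no discount).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = solve_discrete_are(A, B, Q, R)                    # Riccati solution
K_riccati = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Blocks of M = [[Q + A'PA, A'PB], [B'PA, R + B'PB]]
M_as = B.T @ P @ A
M_aa = R + B.T @ P @ B

K_from_M = np.linalg.solve(M_aa, M_as)                # K* = M_aa^{-1} M_as
print(np.allclose(K_riccati, K_from_M))               # True
```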


Reinforcement Learning 5 / 27

How does it work?

Policy Evaluation

Monte Carlo
Temporal Difference

Policy Optimization

Greedy policy updates
ε-greedy
Exploration vs Exploitation

Abstract / generalize

Curse of dimensionality
Function approximation

Q-learning

Policy search


Reinforcement Learning 6 / 27

Policy Evaluation - Monte Carlo

Consider the discrete case first: only finitely many states s and actions a.

Given a policy π, what are Vπ and Qπ?

Vπ(s):

pick a random s, increase the counter N(s)

compute the cost-to-go

C_i(s) = Σ_{k=0}^{∞} γ^k L(s_k, π(s_k)),   s_0 = s

empirical expectation:

V(s) ≈ (1 / N(s)) Σ_{i=1}^{N(s)} C_i(s)

Recursive formulation:

V(s) ← V(s) + (1 / N(s)) (C_i − V(s))

Alternative (constant step size):

V(s) ← V(s) + α (C_i − V(s))

Qπ(s, a): analogously, with counter N(s, a) and

C_i(s, a) = L(s, a) + Σ_{k=1}^{∞} γ^k L(s_k, π(s_k))

empirical expectation:

Q(s, a) ≈ (1 / N(s, a)) Σ_{i=1}^{N(s, a)} C_i(s, a)
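A minimal sketch of Monte Carlo policy evaluation on a small synthetic MDP, truncating the infinite sum at a finite horizon (reasonable since γ < 1) and using the recursive empirical-mean update above; all problem data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma, horizon = 5, 3, 0.9, 200

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
L = rng.random((n_states, n_actions))
pi = rng.integers(0, n_actions, size=n_states)     # fixed deterministic policy

V = np.zeros(n_states)
N = np.zeros(n_states)                             # visit counters N(s)

for episode in range(2000):
    s0 = rng.integers(n_states)                    # pick a random start state
    # roll out the policy and accumulate the (truncated) cost-to-go C_i(s0)
    C, s = 0.0, s0
    for k in range(horizon):
        a = pi[s]
        C += gamma**k * L[s, a]
        s = rng.choice(n_states, p=P[s, a])
    # recursive empirical mean: V(s0) <- V(s0) + (C - V(s0)) / N(s0)
    N[s0] += 1
    V[s0] += (C - V[s0]) / N[s0]

print(V)   # Monte Carlo estimate of V_pi
```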


Reinforcement Learning 7 / 27

Policy Evaluation - Temporal Difference

Remember: Vπ(s) = L(s, π(s)) + γE[Vπ(s+) | s, π(s)]

Idea of TD(0):

V(s) ← V(s) + α (L(s, π(s)) + γ V(s+) − V(s))

TD-target: L(s, π(s)) + γ V(s+) is a proxy for the infinite-horizon cost

TD-error: δ = L(s, π(s)) + γ V(s+) − V(s)

Sample-based dynamic programming

learn before the episode ends

very efficient in Markov environments

We can do the same for the action-value function:

Q(s, a) ← Q(s, a) + α (L(s, a) + γ Q(s+, π(s+)) − Q(s, a))

We can evaluate Vπ and Qπ, but how can we optimize them?
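A minimal TD(0) sketch on the same kind of synthetic MDP, compared against the exact Vπ obtained by solving the linear Bellman equation; step size, run length, and problem data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.05

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
L = rng.random((n_states, n_actions))
pi = rng.integers(0, n_actions, size=n_states)     # fixed deterministic policy

V = np.zeros(n_states)
s = rng.integers(n_states)
for step in range(200_000):
    a = pi[s]
    s_next = rng.choice(n_states, p=P[s, a])
    delta = L[s, a] + gamma * V[s_next] - V[s]     # TD error
    V[s] += alpha * delta                          # TD(0) update
    s = s_next

# Exact V_pi from the Bellman equation, for comparison
P_pi = np.array([P[i, pi[i]] for i in range(n_states)])
L_pi = np.array([L[i, pi[i]] for i in range(n_states)])
V_exact = np.linalg.solve(np.eye(n_states) - gamma * P_pi, L_pi)
print(V, V_exact)
```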


Reinforcement Learning 8 / 27

Learning

Greedy policy improvement

model-based: π′(s) = arg min_a  L(s, a) + γ E[V(s+) | s, a]

model-free: π′(s) = arg min_a Q(s, a)

Problem:

keep acting on-policy, i.e., a = π(s)

how to ensure enough exploration?

Simplest idea: ε-greedy:

π(s) = { arg min_a Q(s, a)      with probability 1 − ε
         a ∼ U{1, …, n_a}       with probability ε

Theorem

For any ε-greedy policy π, the ε-greedy policy π′ with respect to Qπ is an improvement, i.e., Vπ′(s) ≤ Vπ(s).

In order to get optimality we need to be GLIE: greedy in the limit with infinite exploration, e.g., ε-greedy with ε → 0.
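A small sketch of ε-greedy action selection over a tabular Q with a decaying ε (one common GLIE-style schedule); the 1/k decay law and the table sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
Q = rng.random((5, 3))                  # tabular Q(s, a): 5 states, 3 actions (illustrative)

def epsilon_greedy(Q_row, eps, rng):
    """Greedy (cost-minimising) action w.p. 1 - eps, a uniformly random action w.p. eps."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmin(Q_row))

s = 0
for k in range(1, 1001):
    eps = 1.0 / k                       # decay schedule: eps -> 0 in the limit
    a = epsilon_greedy(Q[s], eps, rng)
    # ... apply a, observe L(s, a) and s+, and update Q(s, a) here ...
```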


Reinforcement Learning 9 / 27

Q-Learning (basic version)

Update the action-value function as follows:

δ ← L(s, a) + γ min_{a+} Q(s+, a+) − Q(s, a)
Q(s, a) ← Q(s, a) + α δ

Curse of Dimensionality

in general: too many state-action pairs

need to generalize / extrapolate

Function Approximation

Features φ(s, a) and weights θ yield Qθ(s, a) = θ⊤ φ(s, a). Weight update:

δ ← L(s, a) + γ min_{a+} Qθ(s+, a+) − Qθ(s, a)
θ ← θ + α δ ∇θ Qθ(s, a)

this is a linear function approximation

deep neural networks are nonlinear
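A sketch of Q-learning with the linear function approximation above; one-hot features are used so that Qθ = θ⊤φ reduces to the tabular case, which keeps the example easy to check. Problem data and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma, alpha, eps = 5, 3, 0.9, 0.05, 0.1

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
L = rng.random((n_states, n_actions))

def phi(s, a):
    """One-hot features: with these, Q_theta = theta^T phi is exactly tabular."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

theta = np.zeros(n_states * n_actions)
Q = lambda s, a: theta @ phi(s, a)

s = rng.integers(n_states)
for step in range(100_000):
    # epsilon-greedy behaviour policy
    a = int(rng.integers(n_actions)) if rng.random() < eps else \
        int(np.argmin([Q(s, b) for b in range(n_actions)]))
    s_next = rng.choice(n_states, p=P[s, a])
    target = L[s, a] + gamma * min(Q(s_next, b) for b in range(n_actions))
    delta = target - Q(s, a)                 # TD error
    theta += alpha * delta * phi(s, a)       # theta <- theta + alpha * delta * grad Q
    s = s_next
```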


Reinforcement Learning 10 / 27

Policy Search

Q-learning: fits E[(Qθ − Q⋆)²]

no guarantee that πθ is close to π⋆

what we want is min_θ J(πθ) := E[ Σ_{k=0}^{∞} γ^k L(s_k, πθ(s_k)) ]

Policy Search parametrizes πθ and directly minimizes J(πθ):

θ ← θ − α ∇θ J

model-based: construct a model f and simulate forward in time:

J ≈ (1 / B) Σ_{i=1}^{B} Σ_{k=0}^{N} γ^k L(s_k^(i), πθ(s_k^(i))),   s_{k+1}^(i) = f(s_k^(i), πθ(s_k^(i)), w)

actor-critic (model-free): advantage A(s, a) = Q(s, a) − V(s)

Deterministic policy: ∇θJ = ∇θπθ ∇a Aπθ
Stochastic policy: ∇θJ = ∇θ log πθ Aπθ

use, e.g., Q-learning for A or V

if you are careful: convergence to a local minimum of J(πθ)
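A crude model-based policy-search sketch: J(πθ) is estimated from simulated rollouts of an assumed scalar model, and the gradient step uses a finite-difference estimate of ∇θJ rather than the actor-critic estimators listed above; all numbers are illustrative.

```python
import numpy as np

# Illustrative scalar model s+ = A s + B a + w, linear policy a = -theta*s,
# and a quadratic stage cost; J is estimated from simulated rollouts.
A, B, gamma, N, batch = 0.95, 0.2, 0.99, 80, 16

def rollout_cost(theta, rng):
    J, s = 0.0, rng.normal()
    for k in range(N):
        a = -theta * s
        J += gamma**k * (s**2 + 0.1 * a**2)       # stage cost L(s, a)
        s = A * s + B * a + 0.01 * rng.normal()   # simulated model step
    return J

def J_hat(theta, seed):
    rng = np.random.default_rng(seed)
    return np.mean([rollout_cost(theta, rng) for _ in range(batch)])

theta, alpha, h = 0.0, 0.05, 1e-2
for it in range(300):
    # finite-difference gradient with common random numbers (same seed on both sides)
    grad = (J_hat(theta + h, it) - J_hat(theta - h, it)) / (2 * h)
    theta -= alpha * grad                         # theta <- theta - alpha * grad J
print("learned feedback gain:", theta)
```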


Reinforcement Learning 11 / 27

Main issues

Can we guarantee anything?

Safety
Stability
Optimality

Learning is potentially

Dangerous
Expensive
Slow

Prior knowledge is valuable

Why learning from scratch?

Can we use RL to improve existing controllers?

Retain stability and safety
Improve performance


MPC-based RL 12 / 27

Optimal Control - Model Predictive Control (MPC)

use a model to predict the future

constraint enforcement

performance and stability guarantees

min_{x,u}  V^f(x_N) + Σ_{k=0}^{N−1} L(x_k, u_k)

s.t.  x_0 = s,
      x_{k+1} = f(x_k, u_k),
      h(x_k, u_k) ≤ 0,
      x_N ∈ X^f.

Optimality hinges on

quality of the model (how descriptive)

system identification (estimate the correct model parameters)

but then, if the model is not “perfect”

can we recover optimality through learning?

use MPC as function approximator


The same problem, parametrized in θ, serves as a function approximator:

Vθ(s) := min_{x,u}  V^f_θ(x_N) + Σ_{k=0}^{N−1} Lθ(x_k, u_k)

         s.t.  x_0 = s,
               x_{k+1} = fθ(x_k, u_k),
               hθ(x_k, u_k) ≤ 0,
               x_N ∈ X^f_θ.

πθ(s) := the first input u_0⋆ of the arg min of the same problem.

Qθ(s, a) := the same problem with the additional constraint u_0 = a.
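A sketch of how one parametric MPC problem yields Vθ, πθ and Qθ, here for an illustrative linear model with quadratic costs and input bounds, solved with cvxpy; Qθ(s, a) is obtained by simply adding the constraint u_0 = a.

```python
import numpy as np
import cvxpy as cp

# Illustrative linear model and quadratic costs standing in for f_theta, L_theta, V^f_theta.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
Q, R, P_f = np.eye(2), 0.1 * np.eye(1), 10.0 * np.eye(2)
N, u_max = 20, 1.0

def mpc(s, a0=None):
    """Solve the MPC problem from state s; optionally fix the first input u_0 = a0."""
    x = cp.Variable((2, N + 1))
    u = cp.Variable((1, N))
    cost = cp.quad_form(x[:, N], P_f)                       # terminal cost V^f
    constr = [x[:, 0] == s]
    for k in range(N):
        cost += cp.quad_form(x[:, k], Q) + cp.quad_form(u[:, k], R)
        constr += [x[:, k + 1] == A @ x[:, k] + B @ u[:, k],
                   cp.abs(u[:, k]) <= u_max]
    if a0 is not None:
        constr += [u[:, 0] == a0]                           # Q_theta(s, a): u_0 = a
    prob = cp.Problem(cp.Minimize(cost), constr)
    prob.solve()
    return prob.value, u[:, 0].value

s = np.array([1.0, 0.0])
V_s, pi_s = mpc(s)                    # V_theta(s) and pi_theta(s) = u_0*
Q_sa, _ = mpc(s, a0=pi_s)             # Q_theta(s, pi_theta(s)) equals V_theta(s)
print(V_s, pi_s, Q_sa)
```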


MPC-based RL 13 / 27

Learn the true action-value function with MPC

inaccurate MPC model: P̂[s+ | s, a] ≠ P[s+ | s, a]

Theorem [Gros, Zanon TAC2020]

Assume that the MPC parametrization (through θ) is rich enough. Then the exact V⋆, Q⋆, π⋆ are recovered.

SysId and RL are “orthogonal”

RL cannot learn the true model

min_{x,u}  V^f_θ(x_N) + Σ_{k=0}^{N−1} Lθ(x_k, u_k)

s.t.  x_0 = s,
      x_{k+1} = fθ(x_k, u_k),
      hθ(x_k, u_k) ≤ 0,
      x_N ∈ X^f_θ.

Algorithmic framework [Zanon, Gros, Bemporad ECC2019]:

Enforce Lθ, V^f_θ ≻ 0

Can use condensed MPC formulation

Can use globalization techniques
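A toy sketch of the idea: the MPC value Qθ is used as the function approximator in the Q-learning update θ ← θ + α δ ∇θQθ(s, a). The learned parameter here is a single constant model offset, the sensitivity ∇θQθ is replaced by a finite difference, and the model and cost data are invented; the cited framework instead uses exact parametric-NLP sensitivities.

```python
import numpy as np
import cvxpy as cp

# Scalar "real" system (unknown to the learner) and an MPC with a deliberately
# wrong model; the learned parameter theta is a constant model offset (illustrative).
A_true, B_true, sig = 0.9, 0.5, 0.02
A_mod, B_mod = 0.8, 0.5
gamma, alpha, N, u_max, h = 0.95, 1e-3, 10, 1.0, 1e-4

def mpc(s, theta, a0=None):
    """Return (value, first input) of the MPC; with a0 fixed this is Q_theta(s, a0)."""
    x, u = cp.Variable(N + 1), cp.Variable(N)
    cost = gamma**N * 10 * cp.square(x[N])                  # terminal cost
    constr = [x[0] == s, cp.abs(u) <= u_max]
    if a0 is not None:
        constr += [u[0] == a0]                              # Q_theta: u_0 = a
    for k in range(N):
        cost += gamma**k * (cp.square(x[k]) + 0.1 * cp.square(u[k]))
        constr += [x[k + 1] == A_mod * x[k] + B_mod * u[k] + theta]
    cp.Problem(cp.Minimize(cost), constr).solve()
    return cost.value, u[0].value

rng = np.random.default_rng(6)
theta, s = 0.0, 1.0
for step in range(100):
    _, a_greedy = mpc(s, theta)                             # pi_theta(s) = u_0*
    a = float(np.clip(a_greedy + 0.1 * rng.normal(), -u_max, u_max))  # exploration
    s_next = A_true * s + B_true * a + sig * rng.normal()   # real system step
    V_next, _ = mpc(s_next, theta)                          # V_theta(s+) = min_a' Q_theta(s+, a')
    Q_sa, _ = mpc(s, theta, a0=a)
    delta = (s**2 + 0.1 * a**2) + gamma * V_next - Q_sa     # TD error with the observed cost
    dQ = (mpc(s, theta + h, a0=a)[0] - mpc(s, theta - h, a0=a)[0]) / (2 * h)
    theta += alpha * delta * dQ                             # theta <- theta + alpha * delta * dQ
    s = s_next
print("learned model offset:", theta)
```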


MPC-based RL 14 / 27

Economic MPC and RL

If Lθ(s, a) ≻ 0, then Vθ(s) ≻ 0 is a Lyapunov function

In RL we can have L(s, a) ⊁ 0 ⇒ V⋆(s) ⊁ 0

ENMPC theory:

if we rotate the cost we can have V⋆(s) = λ(s) + Vθ(s)

e.g., if 0 ≺ Lθ(s, a) = L(s, a) − λ(s) + λ(f(s, a))

Therefore, use

Vθ(s) = min_{x,u}  λθ(s) + V^f_θ(x_N) + Σ_{k=0}^{N−1} Lθ(x_k, u_k)

        s.t.  x_0 = s,
              x_{k+1} = fθ(x_k, u_k),
              hθ(x_k, u_k) ≤ 0,
              x_N ∈ X^f_θ.

Enforce stability with Lθ(s, a) ≻ 0 and also learn λθ(s)
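A small numeric check (undiscounted, finite horizon, invented scalar dynamics and λ) that rotating the stage cost only shifts the accumulated cost by λ(s_0) − λ(s_N): the rotation changes the value function but not the minimiser.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative scalar dynamics, economic stage cost, and candidate rotation lambda.
f = lambda s, a: 0.9 * s + 0.3 * a
L = lambda s, a: s + 0.5 * a**2          # an "economic" (not positive definite) cost
lam = lambda s: 2.0 * s                  # candidate rotation / storage function
L_rot = lambda s, a: L(s, a) - lam(s) + lam(f(s, a))

# Along any trajectory the rotated cost telescopes:
# sum L_rot = sum L - lambda(s_0) + lambda(s_N)
s0, N = rng.normal(), 30
actions = rng.normal(size=N)
s, J, J_rot = s0, 0.0, 0.0
for a in actions:
    J += L(s, a)
    J_rot += L_rot(s, a)
    s = f(s, a)
print(np.isclose(J_rot, J - lam(s0) + lam(s)))   # True: same minimiser, shifted value
```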


MPC-based RL 15 / 27

Evaporation Process

Model and cost:

[X2; P2]+ = f([X2; P2], [P100; F200]),   ℓ(x, u) = something complicated.

Bounds

X2 ≥ 25 %, 40 kPa ≤ P2 ≤ 80 kPa,

P100 ≤ 400 kPa, F200 ≤ 400 kg/min.

Nominal optimal steady state:

[X2; P2] = [25 %; 49.743 kPa],   [P100; F200] = [191.713 kPa; 215.888 kg/min].

Nominal Economic MPC gain:

large in the nominal case

about 1.5 % in the stochastic case: a gain cannot even be guaranteed


MPC-based RL 16 / 27

Evaporation Process

Reinforcement learning based on

min_{x,u,σ}  λ(x_0) + γ^N V^f(x_N) + Σ_{k=0}^{N−1} γ^k ℓ(x_k, u_k, σ_k)

with
λ(x_0) = x_0⊤ Hλ x_0 + hλ⊤ x_0 + cλ,
V^f(x_N) = x_N⊤ H_Vf x_N + h_Vf⊤ x_N + c_Vf,
ℓ(x_k, u_k, σ_k) = [x_k; u_k]⊤ Hℓ [x_k; u_k] + hℓ⊤ [x_k; u_k] + cℓ + σ_k⊤ Hσ σ_k + hσ⊤ σ_k,

s.t.  x_0 = s,
      x_{k+1} = f(x_k, u_k) + c_f,
      u^l ≤ u_k ≤ u^u,
      x^l − σ^l_k ≤ x_k ≤ x^u + σ^u_k.

Parameters to learn: θ = {Hλ, hλ, cλ, H_Vf, h_Vf, c_Vf, Hℓ, hℓ, cℓ, c_f, x^l, x^u}

Initial guess: θ = {0, 0, 0, H_Vf, 0, 0, Hℓ, 0, 0, 0, x^l, x^u}

HV f = I , H` = I , xl =

[25

100

], xu =

[10080

]
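As a minimal sketch of how the learnable parameters θ enter this objective (dictionary keys and helper names are illustrative; the constraints and the solver are omitted, so this is not the authors' implementation):

```python
import numpy as np

def quad(H, h, c, v):
    """Generic quadratic v' H v + h' v + c used by every term above."""
    return float(v @ H @ v + h @ v + c)

def mpc_objective(theta, x_traj, u_traj, sig_traj, gamma=1.0):
    """Evaluate lambda(x0) + gamma^N Vf(xN) + sum_k gamma^k l(xk, uk, sigk).

    theta: dict with keys H_lam, h_lam, c_lam, H_Vf, h_Vf, c_Vf, H_l, h_l, c_l,
           H_sig, h_sig (illustrative names); x_traj: (N+1, nx); u_traj: (N, nu);
           sig_traj: (N, ns) slack values.
    """
    N = len(u_traj)
    J = quad(theta["H_lam"], theta["h_lam"], theta["c_lam"], x_traj[0])           # lambda(x0)
    J += gamma**N * quad(theta["H_Vf"], theta["h_Vf"], theta["c_Vf"], x_traj[N])  # terminal cost
    for k in range(N):
        z = np.concatenate([x_traj[k], u_traj[k]])
        stage = quad(theta["H_l"], theta["h_l"], theta["c_l"], z)                 # economic stage cost
        stage += sig_traj[k] @ theta["H_sig"] @ sig_traj[k] + theta["h_sig"] @ sig_traj[k]  # slack penalty
        J += gamma**k * stage
    return J
```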


MPC-based RL 17 / 27

Evaporation Process


[Figure: closed-loop trajectories of the learned controller over 1000 time steps, three panels.]

14 % gain


MPC-based RL 18 / 27

Enforcing Safety [Gros, Zanon, Bemporad (IFAC2020,rev)]

Ensure that π(s) ∈ S:

1 Penalize constraint violation

- violations are rare but cannot be excluded
- ok when safety is not at stake

2 Project the policy onto the feasible set (a minimal sketch follows this list):

π⊥θ(x) = arg min_u ‖u − πθ(x)‖   s.t.   u ∈ S

- cannot explore outside S ⇒ constrained RL problem
- Q-learning: projection must be done carefully
- DPG: ∇θπ⊥θ ∇u Aπ⊥ = ∇θπθ M ∇u Aπ⊥
- SPG: 1. draw a sample; 2. project
  - Dirac-like structure on the boundary
  - use an IP method for the projection: ≈ Dirac, but continuous
  - ∇θJ(π⊥θ) = ∇θ log πθ ∇u Aπ⊥: evaluate the score-function gradient on the unprojected sample

3 Safety by construction
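A minimal sketch of option 2 for the simplest case, a box-shaped safe set S = {u : u_lo ≤ u ≤ u_hi}, where the Euclidean projection reduces to element-wise clipping (a general polytopic S would instead need a small QP); the policy and noise below are dummy stand-ins:

```python
import numpy as np

def project_policy(u, u_lo, u_hi):
    """Euclidean projection of an action onto the box S = {u : u_lo <= u <= u_hi}."""
    return np.clip(u, u_lo, u_hi)

# Dummy stand-in for a learned policy plus exploration noise, to make the sketch runnable.
rng = np.random.default_rng(0)
u_lo, u_hi = np.array([0.0, -1.0]), np.array([1.0, 1.0])
u_raw = np.array([0.7, 0.2]) + rng.normal(scale=0.5, size=2)   # pi_theta(s) + noise
a = project_policy(u_raw, u_lo, u_hi)
assert np.all((u_lo <= a) & (a <= u_hi))                       # the applied action is always safe
```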


MPC-based RL 19 / 27

Safe RL [Zanon, Gros (TAC,rev.)], [Gros, Zanon (TAC,rev.)]

Robust MPC as function approximator

- Model used for data compression
- Q-learning: no adaptation required
- Actor-critic: best possible performance, but constraints pose technical difficulties

How to explore safely? Perturb the MPC cost (a toy sketch follows the problem below):

min_{x,u}   dᵀ u0 + Vfθ(xN) + Σ_{k=0}^{N−1} Lθ(xk, uk)

s.t.   x0 = s,
       xk+1 = fθ(xk, uk),
       hθ(xk, uk) ≤ 0,
       xN ∈ Xfθ.

The gradient d perturbs the MPC solution.
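To make the cost-perturbation idea concrete, here is a toy one-step stand-in for the robust MPC above (not the paper's formulation): minimizing (u − u_ref)² + d·u over a box has a closed-form solution, and however the exploration term d is drawn, the resulting action stays feasible by construction:

```python
import numpy as np

def perturbed_mpc_action(u_ref, d, lo, hi):
    """argmin_u (u - u_ref)^2 + d*u  s.t.  lo <= u <= hi  (closed form for this toy QP)."""
    return float(np.clip(u_ref - 0.5 * d, lo, hi))

rng = np.random.default_rng(0)
for _ in range(1000):
    d = rng.normal(scale=0.2)                    # random exploration "gradient" d
    a = perturbed_mpc_action(u_ref=1.0, d=d, lo=0.0, hi=2.0)
    assert 0.0 <= a <= 2.0                       # exploration never leaves the feasible set
```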


MPC-based RL 20 / 27

Safe Q-learning [Zanon, Gros (TAC,rev.)]

Evaporation process. Nominal optimal steady state:

[X2; P2] = [25 %; 49.743 kPa],   [P100; F200] = [191.713 kPa; 215.888 kg/min].

Satisfy X2 ≥ 25 % robustly (a backoff sketch follows this list):

- Adjust the uncertainty-set representation
- Adjust the RMPC cost
- Adjust the feedback K
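A hedged sketch of what "satisfy X2 ≥ 25 robustly" amounts to in a tube-style RMPC: tighten the nominal bound by a backoff computed from the uncertainty set and the feedback K. The interval arithmetic, matrices, and names below are illustrative and not taken from the paper:

```python
import numpy as np

def tightened_lower_bounds(x_lb, A, B, K, w_max, N):
    """Stage-wise tightened lower bounds x_lb + e_k for the error tube
    e_{k+1} = (A + B K) e_k + w_k with |w_k| <= w_max element-wise."""
    Acl = A + B @ K
    e_max = np.zeros(A.shape[0])              # worst-case tube radius, element-wise
    bounds = []
    for _ in range(N):
        e_max = np.abs(Acl) @ e_max + w_max   # propagate the tube one step
        bounds.append(x_lb + e_max)           # back off the nominal constraint
    return bounds

# Toy numbers: a 2-state linearization, lower bounds X2 >= 25 and P2 >= 40.
A = np.array([[0.9, 0.05], [0.0, 0.8]])
B = np.array([[0.1, 0.0], [0.0, 0.2]])
K = np.array([[-0.5, 0.0], [0.0, -0.3]])
print(tightened_lower_bounds(np.array([25.0, 40.0]), A, B, K,
                             w_max=np.array([0.1, 0.2]), N=3))
```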


MPC-based RL 21 / 27

Safe Actor-Critic RL [Gros, Zanon (TAC,rev.)]

Safe exploration distorts the distribution of d and a.

Deterministic PG

∇θJ = ∇θπθ ∇a Aπθ,   Aπθ = advantage function

Bias in ∇a Aπθ:

- E[a − πθ(s)] ≠ 0
- Cov[a − πθ(s)] becomes rank deficient

Stochastic PG

∇θJ = ∇θ log πθ δVπθ,   δVπθ = TD error

- sampling ∇θ log πθ is too expensive if done naively
- if the action is infeasible: resample
- gradient perturbation is the solution
- ∇J more expensive than DPG

Summary: in the deterministic case the advantage-gradient estimate is biased; in the stochastic case πθ and its score function ∇θ log πθ are expensive to evaluate (because of the Dirac-like terms) unless the right techniques are used. The stochastic case needs no extra assumptions, while the deterministic case needs some mild ones that are realistic, though technical. (A toy comparison of the two estimators follows.)
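As a toy numerical illustration of the two gradient forms above (without any safety projection, so no distortion yet), consider a 1-D linear policy πθ(s) = θs and a known advantage A(s, a) = −(a − 2s)²; both estimators then target the same ascent direction. Everything here is illustrative and unrelated to the MPC-based policy:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = 0.5, 0.2
a_star = lambda s: 2.0 * s                         # maximizer of A(s, a) = -(a - 2s)^2

def dpg_sample(s):
    a = theta * s                                  # deterministic policy pi_theta(s)
    dA_da = -2.0 * (a - a_star(s))                 # grad_a A evaluated at a = pi_theta(s)
    return s * dA_da                               # grad_theta pi_theta * grad_a A

def spg_sample(s):
    a = theta * s + sigma * rng.normal()           # Gaussian stochastic policy
    score = (a - theta * s) / sigma**2 * s         # grad_theta log pi_theta(a|s)
    return score * (-(a - a_star(s))**2)           # score * critic signal (stand-in for the TD error)

s_batch = rng.uniform(0.5, 1.5, size=100_000)
print(np.mean([dpg_sample(s) for s in s_batch]),   # ~3.25
      np.mean([spg_sample(s) for s in s_batch]))   # same value, but with much higher variance
```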


MPC-based RL 22 / 27

MPC Sensitivities [Gros, Zanon (TAC2020)], [Zanon, Gros (TAC,rev.)], [Gros, Zanon (TAC,rev.)]

Parameter update:   θ ← θ + α · { δ ∇θQθ(s, a)  (Q-learning),   ∇θπθ ∇a Aπθ  (DPG),   ∇θ log πθ δπθ  (SPG) }

Differentiate the MPC [Gros, Zanon TAC2020], [Zanon, Gros, Bemporad ECC2019]

Need to compute ∇θQθ(s, a), ∇θπθ(s), ∇a Aπθ.

Results from parametric optimization (a toy check of the first identity follows below):

- ∇θQθ(s, a) = ∇θLθ,   with Lθ the Lagrangian of the MPC
- M ∇θπθ(s) = ∂rθ/∂θ,   with M, r the KKT matrix and residual
- ∇a Aπθ = ν,   with ν the multiplier of the constraint u0 = a

Derivatives are cheap!

- ∇θLθ is much cheaper than solving the MPC
- M is already factorized inside the MPC solver
- ν comes for free

Safe RL: the ∇ of the constraint tightening is also needed.

Actor-critic: additional computations; DPG cheaper than SPG.
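A minimal CasADi sketch of the first identity, ∇θQθ(s, a) = ∇θLθ, on a toy one-step problem (assuming the casadi Python package with its bundled IPOPT; this only illustrates the parametric-sensitivity result and is not the authors' code):

```python
import casadi as ca

# Toy Q-function: Q_theta(s, a) = min_x1 (x1 - theta)^2 + a^2  s.t.  x1 = s + a.
s_val, a_val, theta_val = 1.0, 0.3, 0.5
x1, theta = ca.SX.sym("x1"), ca.SX.sym("theta")
f = (x1 - theta) ** 2 + a_val**2
g = x1 - (s_val + a_val)                                    # dynamics as an equality constraint

solver = ca.nlpsol("solver", "ipopt", {"x": x1, "p": theta, "f": f, "g": g},
                   {"ipopt.print_level": 0, "print_time": 0})
sol = solver(x0=0.0, p=theta_val, lbg=0.0, ubg=0.0)

lam = ca.SX.sym("lam")
L = f + lam * g                                             # MPC Lagrangian L_theta
dL_dtheta = ca.Function("dL", [x1, lam, theta], [ca.gradient(L, theta)])
grad_Q = float(dL_dtheta(sol["x"], sol["lam_g"], theta_val))

# Finite-difference check of the same sensitivity: the two numbers agree (-1.6).
eps = 1e-5
Qp = float(solver(x0=0.0, p=theta_val + eps, lbg=0.0, ubg=0.0)["f"])
Qm = float(solver(x0=0.0, p=theta_val - eps, lbg=0.0, ubg=0.0)["f"])
print(grad_Q, (Qp - Qm) / (2 * eps))
```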


MPC-based RL 23 / 27

Real-Time NMPC and RL [Zanon, Kungurtsev, Gros (IFAC2020,rev)]

Real-time feasibility:

NMPC can be computationally heavy

use RTI

Sensitivities:

formulae only hold at convergence

compute sensitivities of RTI QP

justified in a path-following framework

Evaporation process:

Similar results to fully converged NMPC

Gain ≈ 15–20 % over standard RTI


MPC-based RL 24 / 27

Mixed-Integer Problems [Gros, Zanon IFAC2020,rev.]

Can we handle mixed-integer problems?

RL was born for integer problems

Apply, e.g., SPG: expensive

What about a combination of SPG and DPG?

separate continuous and integer parts

continuous: DPG

integer: SPG

Real system:

xk+1 = xk + uk ik + wk,   wk ∼ U[0, 0.05]
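The "real system" above is simple enough to simulate directly; here is a short sketch with an arbitrary illustrative policy, deterministic in the continuous input (the part DPG would handle) and sampled in the integer input (the part SPG would handle):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, u, i):
    """Real system from the slide: x+ = x + u*i + w,  w ~ U[0, 0.05]."""
    return x + u * i + rng.uniform(0.0, 0.05)

x = 1.0
for k in range(20):
    u = -0.5 * x                              # continuous input (deterministic, DPG-style)
    p_on = 1.0 / (1.0 + np.exp(-abs(x)))      # probability of switching the integer input on
    i = int(rng.random() < p_on)              # integer input in {0, 1} (sampled, SPG-style)
    x = step(x, u, i)
print(x)
```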


MPC-based RL 25 / 27

Conclusions

The goal:

simplify the computational aspects

provide safety and stability guarantees

achieve true optimality

self-tuning optimal controllers

We are not quite there yet! Challenges:

exploration vs exploitation (identifiability and persistent excitation)

data noise and cost

can we combine SYSID and RL effectively?


MPC-based RL 26 / 27

Our contribution

1 Gros, S. and Zanon, M. Data-Driven Economic NMPC using Reinforcement Learning. IEEE Transactions on Automatic Control, 2020, in press.

2 Zanon, M., Gros, S., and Bemporad, A. Practical Reinforcement Learning of Stabilizing Economic MPC. European Control Conference, 2019.

3 Zanon, M. and Gros, S. Safe Reinforcement Learning Using Robust MPC. IEEE Transactions on Automatic Control (under review).

4 Gros, S. and Zanon, M. Safe Reinforcement Learning Based on Robust MPC and Policy Gradient Methods. IEEE Transactions on Automatic Control (under review).

5 Gros, S., Zanon, M., and Bemporad, A. Safe Reinforcement Learning via Projection on a Safe Set: How to Achieve Optimality? IFAC World Congress, 2020 (submitted).

6 Gros, S. and Zanon, M. Reinforcement Learning for Mixed-Integer Problems Based on MPC. IFAC World Congress, 2020 (submitted).


27 / 27

Thank you for your attention!