
Chance Constrained Policy Optimization for Process Control and Optimization

Panagiotis Petsagkourakis, Ilya Orson Sandoval, Eric Bradford, Federico Galvanin, Dongda Zhang and Ehecatl Antonio del Rio-Chanona

P. Petsagkourakis and F. Galvanin are with the Centre for Process Systems Engineering (CPSE), University College London, Torrington Place, London, United Kingdom. I. O. Sandoval and E. A. del Rio-Chanona are with the Centre for Process Systems Engineering (CPSE), Imperial College London, UK. E. Bradford is with the Department of Engineering Cybernetics, Norwegian University of Science and Technology, Trondheim, Norway. D. Zhang is with the Centre for Process Integration, Department of Chemical Engineering and Analytical Science, The University of Manchester, UK, and the Centre for Process Systems Engineering (CPSE), Imperial College London, UK.

Abstract—Chemical process optimization and control are affected by 1) plant-model mismatch, 2) process disturbances, and 3) constraints for safe operation. Reinforcement learning by policy optimization would be a natural way to solve this due to its ability to address stochasticity, plant-model mismatch, and directly account for the effect of future uncertainty and its feedback in a proper closed-loop manner; all without the need of an inner optimization loop. One of the main reasons why reinforcement learning has not been considered for industrial processes (or almost any engineering application) is that it lacks a framework to deal with safety critical constraints. Present algorithms for policy optimization use difficult-to-tune penalty parameters, fail to reliably satisfy state constraints, or present guarantees only in expectation. We propose a chance constrained policy optimization (CCPO) algorithm which guarantees the satisfaction of joint chance constraints with a high probability - which is crucial for safety critical tasks. This is achieved by the introduction of constraint tightening (backoffs), which are computed simultaneously with the feedback policy. Backoffs are adjusted with Bayesian optimization using the empirical cumulative distribution function of the probabilistic constraints, and are therefore self-tuned. This results in a general methodology that can be imbued into present policy optimization algorithms to enable them to satisfy joint chance constraints with high probability. We present case studies that analyze the performance of the proposed approach.

I. INTRODUCTION

The optimization of chemical processes presents distinctive challenges to the stochastic systems community given that they suffer from three conditions: 1) there is no precise known model for most industrial scale processes (plant-model mismatch), leading to inaccurate predictions and convergence to suboptimal solutions, 2) the process is affected by disturbances (i.e. it is stochastic), and 3) state constraints must be satisfied due to operational and safety concerns, therefore constraint violation can be detrimental or even dangerous. In this work we use constrained policy search, a reinforcement learning (RL) technique, to address the above challenges.

RL is a machine learning technique that computes a policy which learns to perform a task by interacting with the (stochastic) environment. RL has been shown to be a powerful control approach, and one of the few control techniques able to handle nonlinear stochastic optimal control problems [1, 2]. RL in the approximate dynamic programming (ADP) sense has been studied for chemical process control in [3]. A model-based strategy and a model-free strategy for control of nonlinear processes were proposed in [4]. ADP strategies were used to address fed-batch reactor optimization; in [5] mixed-integer decision problems were addressed with applications to scheduling. In [6] RL was combined with distributed optimization techniques to solve an input-constrained optimal control problem, among other works (e.g. [7], [8]). All these approaches rely on action-value methods, which approximate the solution of the Hamilton–Jacobi–Bellman equation (HJBE), and have been shown to be reliable for some problem instances. RL also shares similar features with multiparametric model predictive control (or explicit MPC) [9, 10], where an explicit representation of the controller can be found such that it satisfies the optimality conditions. However, explicit MPC usually requires some approximations or assumptions on the models.

Policy gradient RL methods [11] have been proposed to directly optimize the control policy. Unlike action-value RL methods, where convergence to local optima is not guaranteed, policy gradient methods can guarantee convergence to local optimality even in high dimensional continuous state and action spaces. However, the inclusion of constraints in policy gradient methods is not straightforward. Existing methods cannot guarantee strict feasibility of the policies even when initialized with feasible initial policies [12]. The main approaches to incorporate constraints make use of trust-region methods, fixed penalties [13, 14], and cross entropy [12]. Furthermore, if online optimization is to be avoided, addressing the constraints by the use of penalties is a natural choice. Various approaches have been proposed in this direction; however, current approaches easily lose optimality or feasibility [13] and guarantee feasibility only in expectation. Following this thread of thought, a Lyapunov-based approach is implemented in [15], where a Lyapunov function is constructed and the unconstrained policy is projected to a safety layer, allowing the satisfaction of constraints in expectation. In [16] an upper bound on the expected constraint violation is provided for projection-based policy optimization, where the initial unconstrained policy is projected back to the constraint set. In [17] an interior-point inspired method that is widely used in control [18] is proposed, where constraints are incorporated into the reward using a logarithmic barrier function, allowing the satisfaction of the constraints in expectation.

The above methods all guarantee constraint satisfaction in expectation, which is inadequate for safety critical engineering applications (very loosely speaking, this means violating the constraints 50% of the time). Furthermore, although penalties are a natural way to avoid an online optimization loop (one of the main advantages of policy gradients), tuning these penalties is not always straightforward and usually relies on heuristics which significantly affect the performance of policy gradient algorithms [19]. This suggests constrained policy gradient methods would benefit from self-tuning parameters. As mentioned earlier, it is well known that policy gradient methods present many advantages; however, their application domain will remain limited until they can handle constraint satisfaction reliably. This is the main challenge addressed in this work. Our proposed method - chance constrained policy optimization (CCPO) - guarantees the satisfaction of joint chance constraints for the optimal policy. This allows for the satisfaction of constraints with a high probability, rather than only in expectation. To achieve the satisfaction of joint chance constraints without the need of an online optimization, we use tightening of constraints, which are appended to the objective. We treat the problem of finding the least tightening that satisfies the probabilistic constraints as a black-box optimization problem, and can therefore solve it efficiently by existing methods. Upon convergence the proposed algorithm finds an optimal policy which guarantees the constraint satisfaction to the desired tolerance.

The structure of this paper is as follows. The problem statement is outlined in Section II; the details of the proposed method for probabilistic constraint satisfaction in RL are presented in Sections III and IV. Case studies are presented in Section V, where the framework is applied to a dynamic bioreactor system, and in the last section conclusions are outlined.

II. PROBLEM STATEMENT

In this work, the dynamic system is assumed to be given by a probability distribution, following a Markov process,

$$x_{t+1} \sim p(x_{t+1}\,|\,x_t, u_t), \qquad x_0 \sim p(x_0), \tag{1}$$

where p(x_i) is the probability density function of x_i, with x ∈ R^{n_x} representing the states, u ∈ R^{n_u} the control inputs and t discrete time. This behaviour is observed in systems when stochastic disturbances are present and/or other uncertainties affect the physical system, like parametric uncertainties. A discrete-time system with disturbances and parametric uncertainties can be written as:

$$x_{t+1} = f(x_t, u_t, p, w_t), \tag{2}$$

where w ∈ R^{n_w} is a vector of disturbances and p ∈ R^{n_p} are uncertain parameters. This can be represented by (1). In this work we seek to maximize an objective function in expectation by using an optimal stochastic policy subject to probabilistic constraints, despite the uncertainty of the system. This problem can be written as a stochastic optimal control problem (SOCP):

$$
\mathcal{P}(\pi(\cdot)) := \begin{aligned}[t]
\max_{\pi(\cdot)}\ & \mathbb{E}\big(J(x_0,\ldots,x_T, u_0,\ldots,u_T)\big) \\
\text{s.t.}\ & x_0 \sim p(x_0) \\
& x_{t+1} \sim p(x_{t+1}\,|\,x_t, u_t) \\
& u_t \sim p(u_t\,|\,x_t) = \pi(x_t) \\
& u_t \in \mathbb{U} \\
& \mathbb{P}\Big(\bigcap_{i=0}^{T}\{x_i \in \mathbb{X}_i\}\Big) \geq 1-\alpha \\
& \forall t \in \{0,\ldots,T-1\}
\end{aligned} \tag{3}
$$

where J is the objective function, U denotes the set of hard constraints for the control inputs, and X_i represents the constraints for the states that must be satisfied with a probability of 1 − α. Specifically,

$$\mathbb{X}_t = \{x_t \in \mathbb{R}^{n_x} \mid g_{j,t}(x_t) \leq 0,\ j = 1,\ldots,n_g\}, \tag{4}$$

with g_{j,t} being the jth constraint to be satisfied at time instant t; the joint chance (also called probabilistic) constraint $\mathbb{P}\big(\bigcap_{t=0}^{T}\{x_t \in \mathbb{X}_t\}\big) \geq 1-\alpha$ must be satisfied for the full trajectory, over all t ∈ {0, ..., T}. The probability density function of u_t given state x_t is p(u_t|x_t), and π(x_t) is the state feedback policy; π(·) is the stochastic policy. Unfortunately, this SOCP is generally intractable and approximations must be sought, hence the use of RL [1, 20, 21].

In RL a policy π_θ(·) parametrized by the parameters θ is constructed. This policy maximizes the expectation of the objective function J(·). In the finite horizon discrete-time case, this objective function can be defined as

$$J = \sum_{t=0}^{T} G^{t} R_t(u_t, x_t), \tag{5}$$

where G ∈ [0, 1] is the discount factor and R_t a given reward at time instance t for the values of u_t, x_t.
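For concreteness, a one-line sketch of this finite-horizon objective (the function name is ours):

```python
def discounted_return(rewards, G=0.99):
    """Finite-horizon objective of Eq. (5): J = sum_t G^t * R_t."""
    return sum((G ** t) * r for t, r in enumerate(rewards))
```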

Notice that we seek a policy that satisfies joint chance constraints with some high probability. Previous approaches have addressed the satisfaction of constraints in expectation; this means that constraints are violated (roughly speaking) half of the time.

III. CONSTRAINED POLICY OPTIMIZATION FOR CHANCE CONSTRAINTS

In RL, agents take actions to maximize some expected reward given a performance metric. In process control, these agents become the controller, which uses a feedback policy π(·) to optimize the expected value of an economic metric of the process (J). The physical system (or the model) at each sampling time produces a value for the reward R which reflects the performance of the policy. The RL algorithm determines the feedback policy that produces the greatest reward in expectation, which is referred to as policy optimization. However, there is no natural way to handle constraints, and a constraint satisfied in expectation may still have a very high probability of not being satisfied. To satisfy the constraints with some high probability, and not only in expectation, we tighten the constraints with backoffs [22, 23] b_{j,t} as:

$$\mathbb{X}_t = \{x_t \in \mathbb{R}^{n_x} \mid g_{j,t}(x_t) + b_{j,t} \leq 0,\ j = 1,\ldots,n_g\}, \tag{6}$$

where the variables b_{j,t} ≥ 0 represent the backoffs which tighten the original feasible set X_t defined in (4). Backoffs restrict the feasible space perceived by the controller, and allow guarantees on the satisfaction of chance constraints.

We denote by τ the joint random variable of states, controls and rewards for a trajectory with a time horizon T:

$$\tau = (x_0, u_0, R_0, \ldots, x_{T-1}, u_{T-1}, R_{T-1}, x_T, R_T). \tag{7}$$

We also assume that the policy can be parametrized by a finite number of parameters θ, and denote this parametrized policy as:

$$u_t \sim p(u_t\,|\,x_t) = \pi_\theta(x_t, D_t), \qquad u_t \in \mathbb{U}, \tag{8}$$

where u_t ∈ U is inherently satisfied by the construction of the policy, e.g. the policy output passes through a bounded and differentiable squashing function [24], and D_t is a window of past inputs and states used by the policy; for example, if the parametrized policy is a recurrent neural network, then D_t corresponds to a number of past states and controls.

Recently a methodology was proposed that satisfies the expected value of the constraints [12, 14]; however, this is not adequate for safety critical constraints in chemical processes. Instead, to account for constraint violations, we incorporate probabilistic constraints. The problem can then be reformulated as:

$$
\begin{aligned}
\pi_{\theta^*} = \arg\max_{\pi_\theta(\cdot)}\ & \mathbb{E}_{\tau}\big(J(\tau)\big) \\
\text{s.t.}\ & u_t \sim p(u_t\,|\,x_t) = \pi_\theta(x_t, D_t), \quad \forall t \in \{0,\ldots,T-1\} \\
& u_t \in \mathbb{U} \\
& \tau \sim p(\tau\,|\,\theta) \\
& \mathbb{P}_{\tau}\Big(\bigcap_{i=1}^{T}\{x_i \in \mathbb{X}_i\}\Big) \geq 1-\alpha
\end{aligned} \tag{9}
$$

where p(τ|θ) represents the probability of the trajectory τ given the parametrization θ of the feedback policy, and π_{θ*} is the optimal policy.

In order to solve (9) we propose to:

1) Parameterize the feedback policy by a multilayer neural network that computes the mean and variance of the control actions, which are consequently sampled from a normal distribution, resulting in a stochastic policy.

2) Substitute the probabilistic constraint in (9) by a tightened constraint set X_t to guarantee closed-loop probabilistic constraint satisfaction.

3) Incorporate the tightened constraints g_{j,t}(x_t) + b_{j,t} ≤ 0, j = 1, ..., n_g, into the objective function, as can also be done in previous approaches [13, 25]. This avoids the need to explicitly solve an optimization problem at run-time.

4) Perform the policy optimization by a policy gradient framework [11].

5) Set the backoffs to the smallest values that guarantee constraint satisfaction with a probability of at least 1 − α. This is formulated as a black-box optimization problem, which can be solved efficiently by existing methods [26, 27, 28].

Notice that the value of the backoffs implies a trade-off: large values guarantee constraint satisfaction, but they make the problem over-conservative and degrade performance, while smaller values produce solutions with high rewards but may not guarantee the constraint satisfaction to the desired accuracy (1 − α). Also notice that if the backoffs are too large, the problem might become infeasible.

In the next subsections we introduce the components that were outlined above.

A. Policy parametrization

For the policy parametrization we use recurrent neural networks (RNNs) [29, 30]. Let N be the length of the window of past states and controls to be used by the parametrized feedback policy, i.e. $D_t = \big[x_{t-1}^T, u_{t-1}^T, x_{t-2}^T, u_{t-2}^T, \ldots, u_{t-N-1}^T\big]^T$. Then the stochastic policy can be defined as:

$$\pi_\theta(x_t, D_t) = \mathcal{N}\big(u_t \,|\, \mu_t^u, \Sigma_t^u\big), \qquad [\mu_t^u, \Sigma_t^u] = s_\theta(x_t, D_t), \tag{10}$$

where s_θ is the multilayer RNN parametrized by θ, and μ_t^u and Σ_t^u are the mean and covariance of the normal distribution from which the control u_t is drawn. Deep structures are employed to enhance the performance of the learning process [31, 32].

B. Probabilistic constraints

In this section, we present a strategy to handle (joint) chance constraints by policy gradient methods. There is no closed form solution for general probabilistic constraints, and therefore we compute their empirical cumulative distribution function (ECDF) instead. To guarantee the satisfaction of constraints with high probability (and not only in expectation), we introduce constraint tightening using backoffs b_{j,t} (see Eq. (6)), and we conduct closed-loop Monte Carlo (MC) simulations to compute the backoffs. We pose the problem of attaining the smallest backoffs (least restrictive) that still satisfy our constraints with a probability of at least 1 − α as an expensive black-box optimization problem, which is solved by Bayesian optimization. In this way we compute a policy that satisfies our constraints with a probability of at least 1 − α, and we can be certain of this with a confidence of at least 1 − ε. The rest of this subsection details the aforementioned methodology.

For convenience we define a single-variate random variable C(·) representing the satisfaction of the joint chance constraint:

$$C(\mathbf{X}) = \max_{(j,t) \in \{1,\ldots,n_g\}\times\{1,\ldots,T\}} g_{j,t}(x_t), \tag{11}$$

$$F(c) = \mathbb{P}\big(C(\mathbf{X}) \leq c\big), \tag{12}$$

then

$$F(c := 0) = \mathbb{P}\big(C(\mathbf{X}) \leq 0\big) = \mathbb{P}\Big(\bigcap_{t=0}^{T}\{x_t \in \mathbb{X}_t\}\Big), \tag{13}$$


where X = [x_1, ..., x_T]^T, and F is the cumulative distribution function (CDF). There is no analytical expression for general probabilistic constraints, and we therefore approximate them by a non-parametric approximation, i.e. their empirical cumulative distribution function (ECDF). We can approximate the value of F(0) by its sample approximation on S Monte Carlo (MC) simulations of X:

$$F(0) \approx F_S(0) = \frac{1}{S}\sum_{s=1}^{S} \mathbb{1}\big(C(\mathbf{X}^s) \leq 0\big), \tag{14}$$

where X^s = [x_1^s, ..., x_T^s]^T is the sth MC sample of the state trajectory. Notice that F_S(0) is a random variable. Define 1{C(X) ≤ 0} as the indicator function for a single trajectory to satisfy all constraints:

$$\mathbb{1}\{C(\mathbf{X}) \leq 0\} = \begin{cases} 1, & C(\mathbf{X}) \leq 0 \\ 0, & \text{otherwise.} \end{cases}$$

Notice that F_S(0) follows a Binomial distribution, as the indicator function is a Bernoulli random variable. Hence, $F_S(0) \sim \frac{1}{S}\,\mathrm{Bin}(S, F(0))$. Later on we will use a realization from this random variable to approximate the probability for a trajectory to satisfy all constraints. The confidence bound for the ECDF can then be computed from the Binomial CDF via the Clopper-Pearson interval [22, 33, 34].

Notice that the quality of the approximation in (14) strongly depends on the number of samples S used, and it is therefore desirable to quantify the uncertainty of the sample approximation itself. This problem has been studied to a great extent in the statistics literature [33, 34], leading to the following lemma.

Lemma 1 ([34]). Given the random variable F_S(0) (the ECDF, see (14)) based on S i.i.d. samples, the true value of the CDF, F(0), has the following lower bound F_lb with a confidence level of 1 − ε:

$$\mathbb{P}\big(F(0) \geq F_{lb}\big) \geq 1 - \varepsilon, \qquad F_{lb} = 1 - \mathrm{betainv}\big(\varepsilon,\ S + 1 - S\,F_S(0),\ S\,F_S(0)\big), \tag{15}$$

with betainv(·, ·, ·) being the inverse of the beta CDF with parameters {S + 1 − S F_S(0)} and {S F_S(0)}.

Lemma 1 states that it is possible to compute a probabilistic lower bound, with an arbitrary confidence of at least 1 − ε, for the ECDF of the chance constraints. The following lemma follows directly using the above result.

Lemma 2. Given a realization F̂_S of the ECDF random variable F_S(0) based on S independent samples, we can compute a corresponding realization of the random variable F_lb, which is denoted F̂_lb. If F̂_lb ≥ 1 − α, then, with a confidence level of 1 − ε, our original chance constraint in Eq. (13) holds.

The implication of Lemma 2 is that we can guarantee the satisfaction of joint chance constraints with a user-defined probability of at least 1 − α and a user-defined confidence of at least 1 − ε. Values of F_lb that are higher than 1 − α lead to more conservative solutions, and consequently worse values for the objective function. On the other hand, when F_lb is lower than 1 − α, the objective attains better values, but the constraints are not satisfied to the required degree. Therefore, the best trade-off is realized if F_lb is as close as possible to 1 − α.

Hence, we aim to compute tightened constraints employing backoffs b_{j,t} such that F_lb − (1 − α) is close to 0 when the policy optimization has terminated. Lemma 2 allows us to find a solution to the original problem having satisfied all joint chance constraints with a confidence of at least 1 − ε.
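For illustration, the sketch below (assuming NumPy and SciPy are available, with scipy.stats.beta.ppf standing in for betainv) computes the ECDF of Eq. (14) from Monte Carlo roll-outs together with a one-sided Clopper-Pearson lower bound of the kind used in Lemma 1; the array layout and function name are ours.

```python
import numpy as np
from scipy.stats import beta

def ecdf_lower_bound(g_samples, eps=0.01):
    """g_samples: array of shape (S, T, ng) with constraint values g_{j,t}(x_t^s)
    from S closed-loop Monte Carlo trajectories.

    Returns the ECDF F_S(0) of Eq. (14) and a one-sided Clopper-Pearson
    lower bound on F(0) with confidence 1 - eps (cf. Lemma 1)."""
    S = g_samples.shape[0]
    # C(X^s) = max over (j, t) of g_{j,t}(x_t^s); a trajectory is feasible iff C <= 0
    C = g_samples.reshape(S, -1).max(axis=1)
    k = int(np.sum(C <= 0.0))              # number of feasible trajectories
    F_S = k / S                            # ECDF at 0, Eq. (14)
    # one-sided Clopper-Pearson lower bound (0 if no trajectory is feasible)
    F_lb = beta.ppf(eps, k, S - k + 1) if k > 0 else 0.0
    return F_S, F_lb
```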

To compute the tightened constraint set denoted in Eq. (6), we first compute an initial set of backoffs (b^0_{j,t}), as proposed in [35], where

$$\mathbb{E}\big(g_{j,t}(x_t)\big) + b^0_{j,t} = 0 \ \text{ gives } \ \mathbb{P}\big(g_{j,t}(x_t) \leq 0\big) \geq 1-\delta \quad \forall j, t, \tag{16}$$

with δ being a tuning parameter for the initial backoffs b^0_{j,t}. Additionally, the expected value of g_{j,t} is approximated with a sample average approximation (SAA):

$$\bar{g}_{j,t} = \frac{1}{S}\sum_{s=1}^{S} g_{j,t}(x_t^s). \tag{17}$$

The initial backoff values are computed to probabilistically satisfy each constraint:

$$\mathbb{P}\big(g_{j,t}(x_t) \leq 0\big) \geq 1-\delta, \tag{18a}$$

$$b^0_{j,t} = F^{-1}_{g_{j,t}}(1-\delta) - \bar{g}_{j,t}(x_t), \quad \forall j, t, \tag{18b}$$

with $\mathbf{g}_{j,t} = [g_{j,t}(x_t^1), \ldots, g_{j,t}(x_t^S)]^T$ and $F^{-1}_{g_{j,t}}(1-\delta)$ the 1 − δ quantile of g_{j,t} (cf. (18a)).
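A minimal sketch of this initialization, assuming the constraint values of the nominal closed-loop Monte Carlo runs are stored in one array (names are illustrative):

```python
import numpy as np

def initial_backoffs(g_samples, delta=0.05):
    """g_samples: array of shape (S, T, ng) with g_{j,t}(x_t^s) from S
    closed-loop Monte Carlo simulations of the nominal (b = 0) policy.

    Returns b0 of shape (T, ng) following Eq. (18b): the (1 - delta) empirical
    quantile of each constraint minus its sample average (Eq. (17))."""
    quantile = np.quantile(g_samples, 1.0 - delta, axis=0)   # F^{-1}_{g_{j,t}}(1 - delta)
    mean = g_samples.mean(axis=0)                            # SAA of E(g_{j,t})
    return quantile - mean
```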

We wish the backoffs b (with elements b_{j,t}, ∀ j, t ∈ {1, ..., n_g} × {1, ..., T}) to be large enough to ensure the probabilistic satisfaction of constraints. However, backoffs that are too large may result in a conservative solution, and therefore worse performance, or even infeasibility of the problem. To obtain the least conservative solution that still guarantees the probabilistic constraint satisfaction specified by 1 − α, we solve a root-finding problem using Lemma 1 to find a vector γ that parametrizes b_{j,t} := γ_j b^0_{j,t}, ∀ j, t, such that

$$F_{lb}(\gamma) - (1-\alpha) = 0. \tag{19}$$

Notice that F_lb is now a function of γ = [γ_1, ..., γ_{n_g}]. In fact, Eq. (19) is an ideal scenario, where the lower bound could be forced to be exactly 1 − α; in practice we minimize the distance between F_lb and 1 − α. With the above procedure we compute a deterministic surrogate for the constraints that allows us to satisfy the joint chance constraints in the original optimization problem (9). In the subsequent section we explain how this constraint surrogate is incorporated into the reinforcement learning framework.

IV. CCPO: CHANCE CONSTRAINED POLICY OPTIMIZATION

A. Policy gradient for fixed backoffs

In this work we reformulate the chance constraints such that their satisfaction is guaranteed with high probability, and such that they can be incorporated into the objective of the policy gradient method [11]. We therefore avoid the need for a numerical optimization every time the agent/controller outputs an action/control input.


Loosely speaking, policy gradient methods aim to update the parametrized policy using the gradient of the reward. Without loss of generality, we outline the implementation for the REINFORCE [36] algorithm for ease of presentation, but this can be generalized to any policy gradient or actor-critic method [20, 37].

Previous works have embedded Lagrangian methods as adaptive penalty coefficients to enforce the satisfaction of constraints [14]. However, an adaptive scheme such as gradient ascent or its variants on inequality multipliers is difficult to justify in theory (see for example [38] chapter 12 or [39] chapter 5), and in practice tends to have numerical issues when an arbitrary number of constraints are enforced and different constraints are active in different instances [38].

In this work, we instead propose a p-norm penalty:

$$J(\tau, b) = J(\tau) - \kappa \sum_{t=1}^{T} \big\| [g_t(x_t) + b_t]^- \big\|_p^p, \tag{20}$$

where $[g_{j,t}(x_t) + b_{j,t}]^- = \max\{g_{j,t}(x_t) + b_{j,t},\ 0\}$ and ‖·‖_p is the p-norm of the vector $g_t = [g_{1,t}, \ldots, g_{n_g,t}]$. We advocate the use of the l1 (p = 1) and l2 (p = 2) norms due to the arbitrary number of constraints that may be considered and their numerical stability [38].
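As an illustration, the penalised objective of Eq. (20) for a single sampled trajectory could be evaluated as follows (a sketch; the function name and array layout are ours):

```python
import numpy as np

def penalized_return(rewards, g_traj, backoffs, kappa=1.0, p=1):
    """Penalised objective of Eq. (20) for a single trajectory.

    rewards:  rewards R_t collected along the trajectory.
    g_traj:   array (T, ng) with the constraint values g_{j,t}(x_t).
    backoffs: array (T, ng) with the backoffs b_{j,t}.
    """
    J = float(np.sum(rewards))
    violation = np.maximum(g_traj + backoffs, 0.0)   # [g_t(x_t) + b_t]^-
    penalty = float(np.sum(violation ** p))          # sum_t ||.||_p^p
    return J - kappa * penalty
```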

The following theorem can be stated for the satisfaction of the constraints given S Monte Carlo trajectories.

Theorem 3. Consider the chance constrained stochastic optimal control problem (9) and let π_{θ*} be the trained policy that maximizes the expected reward (see [13, 14] or (20)) and satisfies (19) using backoffs b. Then the joint chance constraints in (3) will be satisfied with a confidence level of 1 − ε.

Proof. Assume that a policy π_{θ*} is evaluated on the chance constraints (4) and a realization of the ECDF, F̂_S, is computed, which leads to a realization F̂_lb of the random variable F_lb. Then, by Lemma 2, if F̂_lb ≥ 1 − α the policy π_{θ*} will satisfy (13) with confidence 1 − ε.

Remark 1. Further a posteriori analysis can be performed on the system following the work of [40]. Given the number of trajectories S and a confidence level 1 − ε_S, the probability of constraint violation can be bounded with confidence 1 − ε_S. More details can be found in [40].

Remark 2. The feasibility of the constraints is guaranteed independently of the selection of κ (see Theorem 3).

Remark 3. The parameter κ can be updated using standard techniques from constrained optimization [38].

We now use the policy gradient theorem [11] to obtain an explicit gradient estimate of the reward with respect to the parameters of our policy:

$$\nabla_\theta \mathbb{E}_{\tau}\big(J\big) \approx \frac{1}{S}\sum_{s=1}^{S}\bigg[ J(\tau^s, b)\, \nabla_\theta \sum_{t=0}^{T-1} \log\big(\pi_\theta(x_t^s, D_t^s)\big) \bigg], \tag{21}$$

where τ^s = (x_0^s, u_0^s, R_0^s, ..., x_{T-1}^s, u_{T-1}^s, R_{T-1}^s, x_T^s, R_T^s) denotes the realization of the sth trajectory, with R^s being the reward for sample s, and

$$D_t^s = \big[(x_{t-1}^s)^T, (u_{t-1}^s)^T, \ldots, (u_{t-N-1}^s)^T\big]^T \tag{22}$$

denotes the past states and controls used by the policy for sample s. The variance of the gradient estimate can be reduced with the aid of an action-independent baseline β_S, which does not introduce a bias [41]. A simple but effective baseline is the expectation of the reward under the current policy, approximated by the mean of the sampled paths:

$$\beta_S = \frac{1}{S}\sum_{s=1}^{S} J(\tau^s, b), \tag{23}$$

which leads to:

$$\nabla_\theta \mathbb{E}_{\tau}\big(J\big) \approx \frac{1}{S}\sum_{s=1}^{S}\bigg[ \big(J(\tau^s, b) - \beta_S\big)\, \nabla_\theta \sum_{t=0}^{T-1} \log\big(\pi_\theta(x_t^s, D_t^s)\big) \bigg]. \tag{24}$$

Using the above gradient of the expected reward with respect to the parameters, the policy can now be iteratively adjusted in a steepest ascent framework or any of its variants (e.g. Adam):

$$\theta_{k+1} := \theta_k + \frac{\ell_k}{S}\sum_{s=1}^{S}\bigg[ \big(J(\tau^s, b) - \beta_S\big)\, \nabla_\theta \sum_{t=0}^{T-1} \log\big(\pi_\theta(x_t^s, D_t^s)\big) \bigg], \tag{25}$$

where ℓ_k is the adaptive learning rate. The algorithm that trains the policy network for a fixed backoff value b is outlined in Algorithm 1.

Algorithm 1 Policy gradient for fixed backoff

Input: Given the optimization problem in Eq. (9) and the modified objective in Eq. (20), initialize the policy π_θ with parameters θ := θ_0, initial learning rate ℓ_0, learning rate update rule L(ℓ_k), number of samples for the gradient approximation S, number of epochs K, tolerance tol, and backoffs b.
for k = 1 to K do
  1. Collect τ^s and J(τ^s, b) for samples s = 1, ..., S.
  2. Update the policy π_θ:
     θ_{k+1} := θ_k + (ℓ_k / S) Σ_{s=1}^{S} [ (J(τ^s, b) − β_S) ∇_θ Σ_{t=0}^{T−1} log(π_θ(x_t^s, D_t^s)) ].
  3. Update the learning rate ℓ_{k+1} := L(ℓ_k).
  4. if |J̄_{k+1} − J̄_k| ≤ tol then exit, where J̄ refers to the sample average of J(τ^s, b).
end for
Output: Optimal policy π_{θ*}, with θ* := θ_{k+1}.
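In PyTorch, step 2 of Algorithm 1 reduces to a standard score-function update; the sketch below is a minimal illustration (function name and tensor layout are ours), assuming the log-probabilities were accumulated with gradients attached during the roll-outs:

```python
import torch

def reinforce_update(optimizer, log_prob_sums, returns):
    """One policy update of Algorithm 1 (step 2) for S sampled trajectories.

    log_prob_sums: tensor (S,), each entry the sum over t of
                   log pi_theta(u_t^s | x_t^s, D_t^s), built with gradients
                   attached while rolling out trajectory s.
    returns:       tensor (S,), the penalised returns J(tau^s, b) of Eq. (20).
    """
    baseline = returns.mean()                    # beta_S, Eq. (23)
    advantage = (returns - baseline).detach()    # (J(tau^s, b) - beta_S)
    loss = -(advantage * log_prob_sums).mean()   # gradient ascent on Eq. (24)/(25)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```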

B. Backoff iterations

Solving Eq. (19) is complicated because there is no closed form solution, or even an explicit expression. To solve the root-finding problem in Eq. (19) efficiently, we formulate it as a least-squares expensive black-box optimization problem [27] and solve it via Bayesian optimization with the objective function

$$\mathcal{F}(\gamma) := \big(F_{lb}(\gamma) - (1-\alpha)\big)^2. \tag{26}$$


Algorithm 1 yields an optimal stochastic policy for fixed values of the backoffs. We propose Algorithm 2 to iteratively adjust the backoffs to guarantee probabilistic constraint satisfaction.

A description of the steps conducted in Algorithm 2 is presented here:

Step (1): The policy is trained with b = 0 using Algorithm 1.

Step (2): The initial estimates of the backoffs are computed by Monte Carlo closed-loop simulations.

Step (3): (3i) An initial set of backoff parameter values Γ = [γ^1, ..., γ^{N_Γ}] is used to construct a set of optimal policies π_{θ*}(·|γ^1), ..., π_{θ*}(·|γ^{N_Γ}); recall that γ parametrizes b_{j,t} := γ_j b^0_{j,t}, ∀ j, t. (3ii) For each optimal policy π_{θ*}(·|γ^i), i ∈ {1, ..., N_Γ}, the lower bound on the ECDF, F_lb(γ^i), is computed by Monte Carlo (a cross-validation of sorts), along with the squared residual in Eq. (26). This yields a set of backoff parameter values Γ = [γ^1, ..., γ^{N_Γ}] that correspond to residuals F = [F(γ^1), ..., F(γ^{N_Γ})] for Eq. (26). This is the initial sample set used to map backoff values to residual values via Gaussian process regression, which is subsequently exploited by a Bayesian optimization framework to enforce Eq. (19).

Step (4): For each mth iteration: (4i) A Gaussian process which maps the backoff values to the squared residuals in Eq. (26) is constructed. We compute the objective function value as in Eq. (26) to find F(γ^{N_Γ+m−1}) := (F_lb(γ^{N_Γ+m−1}) − (1 − α))^2. (4ii) We conduct a Bayesian optimization step, via lower confidence bound minimization [42] of the GP. This allows us to obtain the next value γ^{N_Γ+m}. (4iii) We compute a new policy π(·|γ^{N_Γ+m}) using Algorithm 1. (4iv) We compute the new backoffs b with the new value of the backoff parameter γ^{N_Γ+m}. We also compute the new value of the residual (objective) F. The data matrices Γ and F are updated. The algorithm goes back to (4i), where a new GP is constructed incorporating the new values of γ^{N_Γ+m} and F, and proceeds until some tolerance is achieved (e.g. F(γ^{N_Γ+m}) ≤ tol).

It should be noted that every time a new backoff is computed the policy is re-optimized. This may look inefficient at first glance; however, convergence is fast, as at every iteration the previous policy is used as an initial guess for the next one, making the first iteration the most expensive. Additionally, because the problem is treated as an expensive black-box optimization problem, the number of iterations of this outer loop is in general quite small (in our examples no more than 13 iterations; see also section IV-C). By the end of Algorithm 2, a probabilistically constrained policy will have been constructed.

Algorithm 2 Backoff-Based Policy Optimization

Input: Initialize the policy parameters θ := θ_0, initial learning rate ℓ_0, learning rate update rule L(ℓ_k), define the initial data matrix Γ := [γ^1, ..., γ^{N_Γ}] with N_Γ values for γ, 0 < δ < 1, 0 < α < 1, tolerance tol_0, maximum number of backoff iterations M, number S of Monte Carlo samples used to compute F_lb, and number of epochs K.
1. Perform policy optimization with b = 0 using Algorithm 1, to obtain the nominal policy π_{θ*}.
2. Estimate the initial backoffs using S samples generated by Monte Carlo simulations of the state trajectories under the nominal policy:
   b^0_{j,t} := F^{-1}_{g_{j,t}}(1 − δ) − ḡ_{j,t}(x_t), ∀ j, t.
3. (i) Use the initially proposed set of backoff parameter values Γ = [γ^1, ..., γ^{N_Γ}] (notice that the backoffs are b_{j,t} := γ_j b^0_{j,t}, ∀ j, t) to obtain the optimal policies π_{θ*}(·|γ^1), ..., π_{θ*}(·|γ^{N_Γ}).
   (ii) Evaluate the policies π_{θ*}(·|γ^1), ..., π_{θ*}(·|γ^{N_Γ}) and compute their corresponding lower bounds [F_lb(γ^1), ..., F_lb(γ^{N_Γ})] by Monte Carlo using Eq. (19), and compute their residuals F = [F(γ^1), ..., F(γ^{N_Γ})] using Eq. (26).
4. Perform Bayesian optimization:
for m = 1 to ... do
   i) Construct a mapping from the backoff parameter values Γ = [γ^1, ..., γ^{N_Γ+m−1}] to their residuals F = [F(γ^1), ..., F(γ^{N_Γ+m−1})] using GP regression.
   ii) Perform Bayesian optimization over the GP to minimize F(γ^{N_Γ+m}) := (F_lb(γ^{N_Γ+m}) − (1 − α))^2.
   iii) Perform policy optimization with b_{j,t} := γ^{N_Γ+m}_j b^0_{j,t}, ∀ j, t, using Algorithm 1, to obtain the policy π_{θ*}(·|γ^{N_Γ+m}).
   iv) Update the data matrices Γ := [Γ, γ^{N_Γ+m}] and F := [F, F(γ^{N_Γ+m})] by Monte Carlo using Eq. (19).
   if F(γ^{N_Γ+m}) ≤ tol_0 then
      exit
   end if
end for
Output: policy π*_θ := π^{N_Γ+m}_{θ*}.

C. Policy initialization

Reinforcement learning methods (particularly policy gradient methods) are computationally expensive, mainly because initially the agent (or controller, in our case) explores the control action space randomly. In the case of process optimization and control, it is possible to use a preliminary controller, along with supervised learning, to hot-start the policy and significantly speed up convergence. The initial parametrization of the policy (before Step (1)) is trained in a supervised learning fashion, where the states are the inputs and the control actions are the outputs. This has been further discussed in [43, 44, 45].

V. CASE STUDIES

The case studies in this paper focus on the photo-production of phycocyanin synthesized by the cyanobacterium Arthrospira platensis. The two case studies differ in the type of uncertainty: Case study 1 considers parametric uncertainty and the model is assumed to be known, while the second case study assumes that only data is available, with additive disturbances and measurement noise present (no knowledge of the system's equations). Additionally, different penalizations of the constraints are implemented in the two case studies: Case study 1 uses Eq. (20) with an arbitrary κ = 0.1 and p = 1, and Case study 2 uses Eq. (20) with an arbitrary κ = 1 and p = 2.

Phycocyanin is a high-value bioproduct and its biological function is to enhance the photosynthetic efficiency of cyanobacteria and red algae. It has applications as a natural colorant to replace other toxic synthetic pigments in both food and cosmetic production. Additionally, the pharmaceutical industry considers it beneficial because of its unique antioxidant, neuroprotective, and anti-inflammatory properties.

The dynamic system consists of the following system of ODEs describing the evolution of the concentrations (c) of biomass (x), nitrate (N), and product (q). The dynamic model is based on Monod kinetics, which describes microorganism growth in nutrient-sufficient cultures, where the intracellular nutrient concentration is kept constant because of rapid replenishment. We assume a fixed-volume fed-batch. The manipulated variables are the light intensity (u_1 = I) and the inflow rate (u_2 = F_N). The mass balance equations are

$$\frac{dc_x}{dt} = \frac{u_m I}{I + k_s + I^2/k_i}\,\frac{c_x c_N}{c_N + K_N} - u_d c_x, \tag{27}$$

$$\frac{dc_N}{dt} = -Y_{N/X}\,\frac{u_m I}{I + k_s + I^2/k_i}\,\frac{c_x c_N}{c_N + K_N} + F_N, \tag{28}$$

$$\frac{dc_q}{dt} = \frac{k_m I}{I + k_{sq} + I^2/k_{iq}}\,c_x - \frac{k_d c_q}{c_N + K_{Nq}}. \tag{29}$$

The parameter values are adopted from [22].
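For illustration, one sampling interval of this model can be simulated with SciPy as sketched below; all numerical values here are illustrative placeholders rather than the values of [22], and the function names are ours.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Placeholder kinetic parameters for illustration only (not the values of [22])
p = dict(u_m=0.05, u_d=0.01, k_s=180.0, k_i=450.0, K_N=390.0,
         Y_NX=500.0, k_m=0.05, k_sq=20.0, k_iq=400.0, k_d=0.05, K_Nq=20.0)

def rhs(t, c, I, F_N):
    """Mass balances (27)-(29); c = [c_x, c_N, c_q], controls u = (I, F_N)."""
    c_x, c_N, c_q = c
    growth = p["u_m"] * I / (I + p["k_s"] + I**2 / p["k_i"]) * c_x * c_N / (c_N + p["K_N"])
    dc_x = growth - p["u_d"] * c_x
    dc_N = -p["Y_NX"] * growth + F_N
    dc_q = p["k_m"] * I / (I + p["k_sq"] + I**2 / p["k_iq"]) * c_x \
           - p["k_d"] * c_q / (c_N + p["K_Nq"])
    return [dc_x, dc_N, dc_q]

# one sampling interval under a constant control action (illustrative values)
sol = solve_ivp(rhs, (0.0, 1.0), y0=[1.0, 150.0, 0.0], args=(200.0, 20.0))
x_next = sol.y[:, -1]   # state at the next sampling instant
```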

A. Case Study 1

Uncertainty is assumed for the initial concentrations, where $[c_x(0),\ c_N(0)]^T \sim \mathcal{N}\big([1,\ 150]^T,\ \mathrm{diag}(1\times 10^{-3},\ 22.5)\big)$ and c_q(0) = 0. Additionally, 10% parametric uncertainty is assumed for the system: k_s (µmol/m²/s) ∼ N(178.9, 17.89), k_i (mg/L) ∼ N(447.1, 44.71), k_N (µmol/m²/s) ∼ N(393.1, 39.31). This type of uncertainty is common in engineering settings, as the parameters are obtained using experimental data and are subject to their respective confidence regions after they are estimated using regression techniques. The objective (reward) in this work is to maximize the product concentration (c_q) at the end of the batch. The objective is additionally penalized by the change of the control actions u(t) = [I, F_N]^T. As a result the reward is:

$$R_t = -\|\Delta u_t\|_r, \qquad R_T = c_q(T), \qquad r = \mathrm{diag}(3.125\times 10^{-8},\ 3.125\times 10^{-6}), \tag{30}$$

where t ∈ {0, ..., T − 1} and Δu_t = u_t − u_{t−1}. The constraints at each time step are c_N ≤ 800 and c_q ≤ 0.011 c_x. These constraints have been normalized as:

$$g_{1,t} = \frac{c_N}{800} - 1 \leq 0, \qquad g_{2,t} = \frac{c_q}{0.011\, c_x} - 1 \leq 0, \tag{31}$$

and the joint chance constraint is to be satisfied with probability 99% (α = 0.01) at a confidence level of 99% (ε = 0.01). The constraints are added as a penalty with κ = 1 using (20) and p = 1. The control actions are constrained to the intervals 0 ≤ F_N ≤ 40 and 120 ≤ I ≤ 400; these constraints are considered hard. The control policy RNN is designed to contain 4 hidden layers, each of which comprises 20 neurons with a leaky rectified linear unit (ReLU) as activation function. A unified policy network with diagonal variance is utilized, such that the control actions share memory, and the previous states are used by the RNN (together with the current measured states). The online computational cost of each control action is insignificant, since it only requires the evaluation of the corresponding RNN. First the algorithm computes the policy for zero backoffs (b = 0); then the backoffs are updated according to Algorithm 2. The training parameters are: M = 200, S = 1000, K = 200, tol = tol_0 = 10^{-4}, N_Γ = 5, and the two previous states and controls are used by the policy. Additionally, the Gaussian process has a zero-mean prior and a squared-exponential (SE) kernel as the covariance function, and the inputs and outputs are normalized using their mean and variance, respectively. After the completion of the training, the backoffs have been computed to satisfy (19). In a rather small number of iterations the backoff values managed to drive F_lb to 0.99.
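As a small helper (names are ours), the normalized path constraints of Eq. (31) can be evaluated as:

```python
import numpy as np

def path_constraints(c_N, c_q, c_x):
    """Normalized constraints of Eq. (31); both entries must be <= 0."""
    g1 = c_N / 800.0 - 1.0              # nitrate bound c_N <= 800
    g2 = c_q / (0.011 * c_x) - 1.0      # product/biomass ratio c_q <= 0.011 c_x
    return np.array([g1, g2])
```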

The actual closed-loop constraint satisfaction is depicted in Fig. 1(a) and Fig. 2(a), where the shaded areas are the 98% and 2% percentiles. Notice that even though the figures for both methods look similar, the constraint satisfaction is significantly different; this can be observed in Fig. 1(b) and Fig. 2(b), where the region in which the violation of constraints occurs has been zoomed in. This result is also quantitatively depicted in Table I. The column labelled 'actual' is the probability of constraint satisfaction, i.e. the fraction of the 1000 Monte Carlo trajectories that satisfied both constraints (see Eq. (14)). The column labelled 'desired' is the goal for the probability of constraint satisfaction. It is clear that when backoffs are not applied, the constraints are violated in almost 50% of the trajectories. On the other hand, when backoffs are applied all the constraints are satisfied, which is an expected result, as the goal was for the lower bound of the ECDF (F_lb) to be 0.99, which means that the actual probability is equal or higher.

Fig. 1: Case study 1: Constraint g_{1,t} when backoffs are applied (blue) and when they are set to zero (yellow) (a), and zoomed into the region where the violation of constraints occurs (b). The shaded areas are the 98% and 2% percentiles.


Fig. 2: Case study 1: Constraint g_{2,t} when backoffs are applied (blue) and when they are set to zero (yellow) (a), and zoomed into the region where the violation of constraints occurs (b). The shaded areas are the 98% and 2% percentiles.

The backoff values for each update are shown in Fig. 3(a, b), where the red dashed line represents the converged final value. It should be noted that the final value of the product c_q in (30) is 0.163 when backoffs are applied and 0.167 when they are not. This difference is to be expected given that in the nominal case (b = 0) the policy allows the violation of constraints, which leads to an increase in the concentration of the product.

Fig. 3: Case study 1: Backoff values b_{1,t} (a) and b_{2,t} (b) plotted over the number of epochs; lines are faded out towards earlier iterations.

B. Case Study 2

The second case study considers the same system, but this time no model equations are assumed to be available. In this case, a Gaussian process [46] is used to model the available data. Different kernels can have different effects on performance; in this case the Matern32 [46] kernel is employed. The data was generated using the system in Eqs. (27)-(29), with additive normally distributed disturbance (w) and measurement noise (v) with zero mean and Σ_w = diag(4 × 10^{-4}, 0.1, 1 × 10^{-8}), Σ_v = diag(4 × 10^{-5}, 0.01, 1 × 10^{-9}). The same uncertainty on the initial conditions is assumed.

Bayesian frameworks have been used in reinforcement learning [47, 48] as they can model both epistemic and aleatoric uncertainty (in Gaussian processes, usually homoscedastic aleatoric uncertainty is considered). One of the fundamental steps here is the propagation of the uncertainty at each step. The nature of Gaussian processes has been exploited in [22, 49, 50], where the generated samples from the GP are conditioned on the posterior (without noise). Then, the GP predicts a mean μ_x and variance Σ_x, and the state x is sampled from a normal distribution: x ∼ N(μ_x, Σ_x). In this setting, 8 episodes were randomly generated using a Sobol sequence [51] for the control variables of all 12 time intervals and the initial conditions, in order to train the Gaussian process.
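A minimal sketch of this one-step propagation, assuming one independent scikit-learn GP per state dimension (a simplification of the sample-conditioning scheme of [22, 49, 50]; names are ours):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propagate_state(gp_models, x, u, rng):
    """One-step propagation x_{t+1} ~ N(mu_x, Sigma_x) with per-state GPs.

    gp_models: list of fitted GaussianProcessRegressor, one per state dimension.
    x, u:      current state and control vectors.
    rng:       a numpy random Generator.
    """
    z = np.concatenate([x, u]).reshape(1, -1)
    mu, sd = [], []
    for gp in gp_models:
        m, s = gp.predict(z, return_std=True)
        mu.append(m[0])
        sd.append(s[0])
    # sample the next state from the GP predictive distribution
    return rng.normal(np.array(mu), np.array(sd))
```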

After the training, the actual closed-loop constraint satisfaction can be validated. Fig. 4(a) and Fig. 5(a) illustrate the path constraints for each time interval, where the shaded areas are the 98% and 2% percentiles. The results are compared with the case without backoffs (b = 0). It should be noticed that in the absence of backoffs the mean of the constraint samples barely satisfies the bounds. The difference is clear and more apparent in Fig. 5(a). This can further be noticed in Fig. 4(b) and Fig. 5(b), where focus has been put on the region of constraint violation. Table I shows the probability of constraint satisfaction, i.e. the fraction of the 1000 Monte Carlo trajectories that satisfied the constraints (see Eq. (14)), compared to the 'desired' probability of constraint satisfaction. The absence of constraint tightening results in only 24% constraint satisfaction, compared to our proposed method where 97% of trajectories satisfy the constraints. Notice that the desired probability, designed in terms of the lower bound of the ECDF (F_lb), is set to 0.95.

Fig. 4: Case study 2: Constraint g_{1,t} when backoffs are applied (blue) and when they are set to zero (yellow) (a), and zoomed into the region where the violation of constraints occurs (b). The shaded areas are the 98% and 2% percentiles.

Fig. 5: Case study 2: Constraint g_{2,t} when backoffs are applied (blue) and when they are set to zero (yellow) (a), and zoomed into the region where the violation of constraints occurs (b). The shaded areas are the 98% and 2% percentiles.

The backoff values for each update are shown in Fig. 6(a, b), where the red dashed line represents the converged final value. It should be noted that the final value of the product c_q in (30) is 0.153 when backoffs are applied and 0.171 when they are not.


Fig. 6: Case study 2: Backoff values b_{1,t} (a) and b_{2,t} (b) plotted over the number of epochs; lines are faded out towards earlier iterations.

Case Study                            Desired   Actual
Case Study 1: Parametric, Nominal     0.99      0.51*
Case Study 1: Parametric, Proposed    0.99      1.00
Case Study 2: GP, Nominal             0.95      0.24*
Case Study 2: GP, Proposed            0.95      0.97

TABLE I: Comparison of closed-loop constraint satisfaction. The symbol * marks cases in which the constraint satisfaction is less than the desired value.

The algorithm is implemented in PyTorch [52] version 0.4.1. Adam [53] is employed to compute the network's parameter values, using a step size of 10^{-2} with the rest of the hyperparameters at their default values.

VI. CONCLUSIONS

In this paper we address the problem of finding a policy that satisfies constraints with high probability. The proposed algorithm, chance constrained policy optimization (CCPO), uses constraint tightening by applying backoffs to the original feasible set. Backoffs restrict the feasible space perceived by the controller and allow guarantees on the satisfaction of chance constraints. We find the smallest backoffs (least conservative) that still guarantee the desired probability of satisfaction by stating the root-finding problem as a black-box optimization problem. This allows the algorithm to construct a policy that can guarantee the satisfaction of joint chance constraints with a user-defined probability of at least 1 − α and a user-defined confidence of at least 1 − ε. Furthermore, the proposed methodology can be combined with other penalty or Lagrangian approaches for constrained policy search. Being able to solve constrained policy optimization problems with high-probability satisfaction has been one of the main bottlenecks for the wider use of reinforcement learning in engineering applications. This work aims to take a step towards applying RL to the real world, where constraints on policies are necessary for the sake of safety and product quality.

ACKNOWLEDGMENT

This project has received funding from the EPSRC projects (EP/R032807/1) and (EP/P016650/1).

REFERENCES

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, 2nd ed. Athena Scientific, 2000.

[2] S. Spielberg, A. Tulsyan, N. P. Lawrence, P. D. Loewen, and R. Bhushan Gopaluni, "Toward self-driving processes: A deep reinforcement learning approach to control," AIChE Journal, vol. 65, no. 10, p. e16689, 2019.

[3] J. M. Lee and J. H. Lee, "Approximate dynamic programming-based approaches for input–output data-driven control of nonlinear processes," Automatica, vol. 41, no. 7, pp. 1281–1288, 2005.

[4] C. Peroni, N. Kaisare, and J. Lee, "Optimal control of a fed-batch bioreactor using simulation-based approximate dynamic programming," IEEE Transactions on Control Systems Technology, vol. 13, no. 5, pp. 786–790, 2005.

[5] J. H. Lee and J. M. Lee, "Approximate dynamic programming based approach to process control and scheduling," Computers & Chemical Engineering, vol. 30, no. 10-12, pp. 1603–1618, 2006.

[6] W. Tang and P. Daoutidis, "Distributed adaptive dynamic programming for data-driven optimal control," Systems & Control Letters, vol. 120, pp. 36–43, 2018.

[7] D. Chaffart and L. A. Ricardez-Sandoval, "Optimization and control of a thin film growth process: A hybrid first principles/artificial neural network based multiscale modelling approach," Computers & Chemical Engineering, vol. 119, pp. 465–479, 2018.

[8] H. Shah and M. Gopal, "Model-Free Predictive Control of Nonlinear Processes Based on Reinforcement Learning," IFAC-PapersOnLine, vol. 49, no. 1, pp. 89–94, 2016.

[9] A. Bemporad and M. Morari, "Robust model predictive control: A survey," Robustness in Identification and Control, vol. 245, pp. 207–226, 1999.

[10] V. M. Charitopoulos and V. Dua, "Explicit model predictive control of hybrid systems and multiparametric mixed integer polynomial programming," AIChE Journal, vol. 62, no. 9, pp. 3441–3460, 2016. [Online]. Available: https://aiche.onlinelibrary.wiley.com/doi/abs/10.1002/aic.15396

[11] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Proceedings of the 12th International Conference on Neural Information Processing Systems, ser. NIPS'99. Cambridge, MA, USA: MIT Press, 1999, pp. 1057–1063.

[12] M. Wen, "Constrained Cross-Entropy Method for Safe Reinforcement Learning," Neural Information Processing Systems (NIPS), 2018.

[13] J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained Policy Optimization," arXiv preprint arXiv:1705.10528, 2017.

[14] C. Tessler, D. J. Mankowitz, and S. Mannor, "Reward Constrained Policy Optimization," arXiv preprint arXiv:1805.11074, 2018.

[15] Y. Chow, O. Nachum, A. Faust, E. Duenez-Guzman, and M. Ghavamzadeh, "Lyapunov-based Safe Policy Optimization for Continuous Control," 2019. [Online]. Available: http://arxiv.org/abs/1901.10031

[16] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, "Projection-based constrained policy optimization," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=rke3TJrtPS

[17] Y. Liu, J. Ding, and X. Liu, “IPO: Interior-pointPolicy Optimization under Constraints,” 2019. [Online].Available: http://arxiv.org/abs/1910.09615

[18] P. Petsagkourakis, W. P. Heath, J. Carrasco, andC. Theodoropoulos, “Robust stability of barrier-basedmodel predictive control,” IEEE Transactions onAutomatic Control, 2020. [Online]. Available: doi:10.1109/TAC.2020.3010770.

[19] L. Engstrom, A. Ilyas, S. Santurkar, D. Tsipras, F. Janoos,L. Rudolph, and A. Madry, “Implementation matters indeep rl: A case study on ppo and trpo,” in InternationalConference on Learning Representations, 2020. [Online].Available: https://openreview.net/forum?id=r1etN1rtPB

[20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, andO. Klimov, “Proximal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017.

[21] P. Petsagkourakis, I. Sandoval, E. Bradford, D. Zhang,and E. del Rio-Chanona, “Reinforcement learning forbatch bioprocess optimization,” Computers & ChemicalEngineering, vol. 133, p. 106649, 2020.

[22] E. Bradford, L. Imsland, D. Zhang, and E. A. delRio Chanona, “Stochastic data-driven model predictivecontrol using gaussian processes,” Computers & ChemicalEngineering, vol. 139, p. 106844, 2020.

[23] M. Rafiei and L. A. Ricardez-Sandoval, “StochasticBack-Off Approach for Integration of Design andControl Under Uncertainty,” Industrial & EngineeringChemistry Research, vol. 57, no. 12, pp. 4351–4365, mar2018. [Online]. Available: https://doi.org/10.1021/acs.iecr.7b03935

[24] M. P. Deisenroth, D. Fox, and C. E. Rasmussen, “GaussianProcesses for Data-Efficient Learning in Robotics andControl,” IEEE Transactions on pattern analysis andmachine intelligence, vol. 37, no. 2, 2015.

[25] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone,“Risk-constrained reinforcement learning with percentilerisk criteria,” Journal of Machine Learning Research,vol. 18, pp. 1–51, 2018.

[26] D. Zhan and H. Xing, “Expected improvementfor expensive optimization: a review,” Journal ofGlobal Optimization, 2020. [Online]. Available: https://doi.org/10.1007/s10898-020-00923-x

[27] D. R. Jones, M. Schonlau, and W. J. Welch, “EfficientGlobal Optimization of Expensive Black-Box Functions,”Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998.

[28] C. Cartis, L. Roberts, and O. Sheridan-Methven, “Es-caping local minima with derivative-free methods: anumerical investigation,” 2018.

[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams,“Learning representations by back-propagating errors,”Nature, vol. 323, p. 533, 1986.

[30] K. Greff, R. K. Srivastava, J. Koutnık, B. R. Steunebrink,and J. Schmidhuber, “Lstm: A search space odyssey,”IEEE Transactions on Neural Networks and LearningSystems, vol. 28, no. 10, pp. 2222–2232, Oct 2017.

[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, pp. 1–9, 2013.

[32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, p. 529, 2015.

[33] C. J. Clopper and E. S. Pearson, “The use of confidence or fiducial limits illustrated in the case of the binomial,” Biometrika, vol. 26, no. 4, pp. 404–413, 1934.

[34] L. D. Brown, T. T. Cai, and A. DasGupta, “Interval estimation for a binomial proportion,” Statistical Science, pp. 101–117, 2001.

[35] J. A. Paulson and A. Mesbah, “Nonlinear Model Predictive Control with Explicit Backoffs for Stochastic Systems under Arbitrary Uncertainty,” IFAC-PapersOnLine, vol. 51, no. 20, pp. 523–534, 2018.

[36] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3–4, pp. 229–256, 1992.

[37] S. Kakade, “A natural policy gradient,” Advances in Neural Information Processing Systems, 2002.

[38] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. New York, NY, USA: Springer, 2006.

[39] S. J. Wright, Primal-Dual Interior-Point Methods. Society for Industrial and Applied Mathematics, 1997. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611971453

[40] M. C. Campi, S. Garatti, and F. A. Ramponi, “A General Scenario Theory for Nonconvex Optimization and Decision Making,” IEEE Transactions on Automatic Control, vol. 63, no. 12, pp. 4067–4078, 2018.

[41] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018.

[42] P. I. Frazier, “A tutorial on Bayesian optimization,” arXiv preprint arXiv:1807.02811, 2018.

[43] P. Petsagkourakis, I. O. Sandoval, E. Bradford, D. Zhang, and E. A. del Río Chanona, “Constrained reinforcement learning for dynamic optimization under uncertainty,” 2020.

[44] J. A. Paulson and A. Mesbah, “Approximate closed-loop robust model predictive control with guaranteed stability and constraint satisfaction,” IEEE Control Systems Letters, vol. 4, no. 3, pp. 719–724, 2020.

[45] B. Karg and S. Lucia, “Efficient representation and approximation of model predictive control laws via deep learning,” IEEE Transactions on Cybernetics, pp. 1–13, 2020.

[46] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, ser. Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press, 2006.

[47] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.”


[48] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to Trust Your Model: Model-Based Policy Optimization.”

[49] J. Umlauft, T. Beckers, and S. Hirche, “Scenario-based Optimal Control for Gaussian Process State Space Models,” 2018 European Control Conference (ECC), pp. 1386–1392, 2018.

[50] L. Hewing, E. Arcari, L. P. Fröhlich, and M. N. Zeilinger, “On Simulation and Trajectory Prediction with Gaussian Process Dynamics,” 2019. [Online]. Available: http://arxiv.org/abs/1912.10900

[51] I. Sobol, “Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates,” Mathematics and Computers in Simulation, vol. 55, no. 1, pp. 271–280, 2001, The Second IMACS Seminar on Monte Carlo Methods.

[52] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., 2019, pp. 8024–8035.

[53] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 2014.

[54] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.

[55] R. L. Iman, J. C. Helton, and J. E. Campbell, “An approach to sensitivity analysis of computer models: Part I—Introduction, input variable selection and preliminary variable assessment,” Journal of Quality Technology, vol. 13, no. 3, pp. 174–183, 1981.


APPENDIX A
BAYESIAN OPTIMIZATION

Bayesian optimization is a popular approach for black-box optimization [27]. A review of Bayesian optimization can be found in [54].

In this paper, Bayesian optimization is exploited to obtain the minimizer $\boldsymbol{\gamma}^*$ of $F(\boldsymbol{\gamma}) = \left(F_{lb} - (1-\alpha)\right)^2$:

\[
\boldsymbol{\gamma}^* \in \arg\min_{\boldsymbol{\gamma}} F(\boldsymbol{\gamma}) \tag{32}
\]
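For illustration only, the sketch below shows how a single noisy evaluation of $F(\boldsymbol{\gamma})$ might be assembled. It assumes that $F_{lb}$ is obtained as a Clopper–Pearson-type lower confidence bound [33, 34] (via the inverse Beta CDF, cf. the betainv entry in the nomenclature) on the fraction of $S$ closed-loop Monte Carlo rollouts that satisfy the joint constraints; the rollout callable and all parameter values are placeholders, not the construction used in the paper.

\begin{verbatim}
# Illustrative sketch of one noisy evaluation of F(gamma);
# see the assumptions stated in the text above.
from scipy.stats import beta


def evaluate_F(gamma, rollout, S=1000, epsilon=0.05, alpha=0.05):
    """rollout(gamma) is a hypothetical callable that runs one closed-loop
    Monte Carlo simulation under the policy with backoffs gamma and returns
    True if the joint constraints hold along the whole trajectory."""
    # Number of constraint-satisfying rollouts out of S.
    k = sum(bool(rollout(gamma)) for _ in range(S))

    # Clopper-Pearson-style one-sided (1 - epsilon) lower confidence bound
    # on the satisfaction probability (inverse Beta CDF, i.e. "betainv").
    F_lb = beta.ppf(epsilon, k, S - k + 1) if k > 0 else 0.0

    # Squared deviation from the target probability 1 - alpha, cf. eq. (32).
    return (F_lb - (1.0 - alpha)) ** 2
\end{verbatim}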

The function $F(\boldsymbol{\gamma})$ to be minimized can only be observed through an unbiased noisy observation based on MC estimates of $F_{lb}$:

\[
y = F(\boldsymbol{\gamma}) + \omega \tag{33}
\]

where $\omega$ is assumed to be Gaussian white noise. The noise $\omega$ is assumed to be unknown and is estimated within the GP framework. Gaussian processes (GPs) are commonly employed as nonparametric models. To start the algorithm we first require some data to build the initial GP, i.e. the function $F(\cdot)$ is queried at $N_\Gamma$ points. In general, the input data-points $\boldsymbol{\Gamma} = [\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_{N_\Gamma}]$ are selected based on a space-filling design, such as a Latin hypercube design [55]. From this we obtain the corresponding responses $\mathbf{F}_{lb} = [F(\boldsymbol{\gamma}_1), \ldots, F(\boldsymbol{\gamma}_{N_\Gamma})]$. A GP model can then be trained on the input-output data; in particular, the hyperparameters of the GP are trained using maximum likelihood estimation. The GP model can then be utilized to obtain the Gaussian distribution of $F(\boldsymbol{\gamma})$ at an arbitrary query point $\boldsymbol{\gamma}$ [46]:

\[
F(\boldsymbol{\gamma}) \mid \boldsymbol{\Gamma}, \mathbf{F}_{lb} \sim \mathcal{N}\!\left(\mu_{GP}(\boldsymbol{\gamma}),\, \sigma^2_{GP}(\boldsymbol{\gamma})\right) \tag{34}
\]

where $\mu_{GP}(\boldsymbol{\gamma})$ and $\sigma^2_{GP}(\boldsymbol{\gamma})$ are the mean and variance prediction of the GP, respectively. In this setting, the approach sequentially selects a location $\boldsymbol{\gamma}$ at which to query $F(\cdot)$ and observe $y$. After a selected number of iterations $M$, the algorithm returns a best-estimate of $\boldsymbol{\gamma}^*$. To accomplish this, the GP of $F(\cdot)$ is iteratively updated from the available data of $F(\cdot)$. One could simply sample at the minimum of the mean function $\mu_{GP}(\boldsymbol{\gamma})$; however, sampling at a point with higher uncertainty could yield a lower minimum, i.e. there is a trade-off between sampling at points with low values of the mean function $\mu_{GP}(\boldsymbol{\gamma})$ and high values of the variance function $\sigma^2_{GP}(\boldsymbol{\gamma})$. The query points are selected by a so-called acquisition function, for which we used the lower confidence bound:

\[
\boldsymbol{\gamma}_m = \arg\min_{\boldsymbol{\gamma}} \; \mu_{GP}(\boldsymbol{\gamma}) - 3\,\sigma_{GP}(\boldsymbol{\gamma}) \tag{35}
\]

where $\boldsymbol{\gamma}_m$ denotes the query point at iteration $m$. Note that the query point at each iteration is chosen where the mean function predicts low values, but where the variance function indicates that even lower values may be attainable.
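For concreteness, a minimal self-contained sketch of this Bayesian optimization loop is given below. It assumes scikit-learn's GaussianProcessRegressor (Matérn kernel plus a white-noise term, hyperparameters fitted by maximum likelihood) as the GP surrogate, a Latin hypercube initial design, and a simple random-candidate search to minimize the lower-confidence-bound acquisition; the objective F, the box bounds and all numerical settings are placeholders rather than the configuration used in the paper.

\begin{verbatim}
# Sketch of the Bayesian optimization loop described above (illustrative only).
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel


def bayes_opt_backoffs(F, lb, ub, n_init=10, n_iter=30, n_cand=2000, seed=0):
    """F(gamma) returns a noisy scalar evaluation; lb, ub are arrays bounding
    the backoff parametrization gamma."""
    rng = np.random.default_rng(seed)
    d = len(lb)

    # Initial space-filling (Latin hypercube) design, cf. [55].
    X = qmc.scale(qmc.LatinHypercube(d=d, seed=seed).random(n_init), lb, ub)
    y = np.array([F(x) for x in X])

    # GP surrogate with observation-noise variance estimated by maximum likelihood.
    kernel = Matern(nu=2.5) + WhiteKernel()
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                      n_restarts_optimizer=5).fit(X, y)

        # Lower-confidence-bound acquisition, eq. (35): minimize mu - 3*sigma,
        # here over a cloud of random candidate points for simplicity.
        cand = rng.uniform(lb, ub, size=(n_cand, d))
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmin(mu - 3.0 * sigma)]

        # Query the black box at the selected point and augment the data set.
        X = np.vstack([X, x_next])
        y = np.append(y, F(x_next))

    # Return the sampled point with the lowest observed value as a simple
    # best-estimate of the minimizer gamma*.
    return X[np.argmin(y)]
\end{verbatim}

In practice the acquisition in eq. (35) would typically be minimized with a multi-start gradient-based solver rather than over random candidates; the random search is used here only to keep the sketch short.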

APPENDIX B
NOMENCLATURE

TABLE II: Nomenclature

Symbol        Description
N_x           Size of variable x
x_t           State at time t
u_t           Manipulated variables at time t
w_t           External disturbances at time t
p(a|b)        Probability of event a given b
N(·, ·)       Normal distribution defined by mean and covariance
∼             Distributed according to; example: x ∼ N(µ, σ²)
D             Available data
E             Expectation
P             Probability
X             Feasible space for state variables
U             Feasible space for manipulated variables
∩             Intersection of sets
G             Discount factor
R_t           Reward at time t
g_{j,t}       Constraint j at time t
π_θ(·)        Policy parametrized by θ
b_{j,t}       Backoff of constraint j at time t
τ             Joint random variable of states, controls and rewards
1 − α         Probability of constraint satisfaction
1 − ε         Confidence level
F(c)          Cumulative distribution function (CDF) at c
1(·)          Indicator function
Bin           Binomial distribution
F_S           Empirical cumulative distribution function (ECDF) of F with S samples
F̂_S          Realization of F_S
betainv       Inverse cumulative distribution function of the Beta distribution
F_lb          Lower bound of the ECDF F_S
F̂_lb         Realization of F_lb
γ             Parametrization of backoffs
‖·‖_p         l_p norm
∇_θ           Partial derivative with respect to θ