
Proceedings of the 2020 Winter Simulation Conference
K.-H. Bae, B. Feng, S. Kim, S. Lazarova-Molnar, Z. Zheng, T. Roeder, and R. Thiesing, eds.

GREEN SIMULATION ASSISTED REINFORCEMENT LEARNING WITH MODEL RISK FOR BIOMANUFACTURING LEARNING AND CONTROL

Hua Zheng, Wei Xie

Department of Mechanical and Industrial Engineering
Northeastern University

Boston, MA 02115, USA

M. Ben Feng

Department of Statistics and Actuarial Science
University of Waterloo

Waterloo, Ontario, CANADA

ABSTRACT

Biopharmaceutical manufacturing faces critical challenges, including complexity, high variability, lengthy lead time, and limited historical data and knowledge of the underlying system stochastic process. To address these challenges, we propose a green simulation assisted model-based reinforcement learning to support process online learning and guide dynamic decision making. Basically, the process model risk is quantified by the posterior distribution. At any given policy, we predict the expected system response with prediction risk accounting for both inherent stochastic uncertainty and model risk. Then, we propose green simulation assisted reinforcement learning and derive the mixture proposal distribution of the decision process and the likelihood ratio based metamodel for the policy gradient, which can selectively reuse process trajectory outputs collected from previous experiments to increase the simulation data efficiency, improve the policy gradient estimation accuracy, and speed up the search for the optimal policy. Our numerical study indicates that the proposed approach demonstrates promising performance.

1 INTRODUCTION

To address critical needs in biomanufacturing automation, in this paper, we introduce a green simulation assisted Bayesian reinforcement learning approach to support bioprocess online learning and guide dynamic decision making. The biomanufacturing industry is growing rapidly and becoming one of the key drivers of personalized medicine. However, biopharmaceutical production faces critical challenges, including complexity, high variability, long lead time, and very limited process data. Biotherapeutics are manufactured in living cells whose biological processes are complex and have highly variable outputs (e.g., product critical quality attributes (CQAs)) whose values are determined by many factors (e.g., raw materials, media, critical process parameters (CPPs)). As new biotherapeutics (e.g., cell and gene therapies) become more and more “personalized,” biomanufacturing requires more advanced manufacturing protocols. In addition, the analytical testing time required by biopharmaceuticals of complex molecular structure is lengthy, and the process observations are relatively limited.

Driven by these challenges, we consider model-based reinforcement learning (MBRL) or a Markov decision process (MDP) to fully leverage the existing bioprocess domain knowledge, utilize the limited process data, support online learning, and guide dynamic decision making. At each time step t, the system is in state s_t, and the decision maker takes the action a_t by following a policy π_t(a_t|s_t). At the next time step (t+1), the system evolves to a new state s_{t+1} by following the state transition probabilistic model P(s_{t+1}|s_t, a_t; ω), and then we collect a reward r_t(a_t, s_t). Thus, the statistical properties and dynamic evolution of the stochastic control process depend on the decision policy π_t and the state transition model P(s_{t+1}|s_t, a_t; ω). In biomanufacturing, the prior knowledge of the state transition model is constructed based on the existing biological/physical/chemical mechanisms and dynamics. The unknown model parameters ω (e.g., cell growth, protein production, and substrate consumption rates in cell culture; nucleation rate and heat transfer


coefficients in freeze drying) will be learned online and updated as new process data arrive. The optimal policy depends on the current knowledge of the process model parameters.

In this paper, we propose a green simulation assisted Bayesian reinforcement learning (GS-RL) approach to guide dynamic decision making. Given any policy, we predict the expected system response with prediction risk accounting for both the process inherent stochastic uncertainty and the model estimation uncertainty, called model risk. The model risk is quantified by the posterior distribution, which can efficiently leverage the existing bioprocess domain knowledge through the selection of the prior and support online learning. Thus, the proposed Bayesian reinforcement learning can provide robust dynamic decision guidance, which is applicable to cases with various amounts of process historical data. In addition, motivated by the studies on green simulation (i.e., Feng and Staum (2017) and Dong et al. (2018)), we propose a stochastic control process likelihood ratio based metamodel to improve the policy gradient estimation, which can fully leverage the historical trajectories generated with various state transition models and policies. Therefore, the proposed green simulation assisted Bayesian reinforcement learning can: (1) incorporate the existing process domain knowledge; (2) facilitate interpretable online learning; (3) guide complex bioprocess dynamic decision making; and (4) provide reliable, flexible, robust, and coherent decision guidance.

For model-free reinforcement learning, Mnih et al. (2015) introduce experience replay (ER) to reuse the past experience, increase the data efficiency, and decrease the data correlation. It randomly samples and reuses the past trajectories. Built on ER, Schaul et al. (2016) further propose prioritized experience replay (PER), which prioritizes the historical trajectories based on the temporal-difference error.

The main contribution of our study is to propose a green simulation assisted Bayesian reinforcement learning (GS-RL) approach. Even though both GS-RL and PER are motivated by “experience replay” and reuse the historical data, there is a fundamental difference between GS-RL and PER. In our approach, the posterior distribution of the state transition model can provide risk- and science-based knowledge of the underlying bioprocess dynamic mechanisms and facilitate online learning. Then, the likelihood ratio of the stochastic decision process is used to construct the metamodel of the policy gradient in the complex decision process space, accounting for the selection and impact of both the policy and the state transition model. It allows us to reuse the trajectories from previous experiments, and the weight assigned to each trajectory depends on its importance measured by the spatial-temporal distance of decision processes. In addition, a mixture process proposal distribution used in the likelihood ratio can improve the estimation accuracy and stability of the policy gradient and speed up the search for the optimal policy. Since the model risk is automatically updated during the learning, our approach can dynamically adjust the importance weights on the previous trajectories, which makes GS-RL flexible, efficient, and able to automatically handle non-stationary bioprocesses.

The organization of the paper is as follows. In Section 2, we provide the problem description. To facilitate biomanufacturing process online learning and automation, we focus on model-based reinforcement learning with the posterior distribution quantifying the model risk. Then, in Section 3, we propose the green simulation assisted policy gradient, which can fully leverage the process trajectories obtained from previous experiments and speed up the search for the optimal policy. After that, a biomanufacturing example is used to study the performance of the proposed approach and compare it with the state-of-the-art policy gradient approaches in Section 4. We conclude this paper in Section 5.

2 PROBLEM DESCRIPTION AND MODEL BASED REINFORCEMENT LEARNING

To facilitate biomanufacturing automation, we consider reinforcement learning for the finite-horizon problem. In Section 2.1, we suppose the underlying model of the production process is known and review model-based reinforcement learning. Since the process model is typically unknown and estimated by very limited process data in biomanufacturing, in Section 2.2, the posterior distribution is used to quantify the model risk, and the posterior predictive distribution, accounting for stochastic uncertainty and model risk, is used to generate the trajectories characterizing the overall prediction risk. Thus, in this paper, we focus on model-based reinforcement learning with model risk so that we can efficiently leverage the existing process knowledge, support online learning, and guide process dynamic decision making.


2.1 Model-Based Reinforcement Learning for Dynamic Decision Making

We formalize model-based reinforcement learning or the Markov decision process (MDP) over a finite horizon H as (S, A, P, r, s_1, H), where S is the set of states, s_1 is the starting state, and A is the set of actions. The process proceeds in discrete time steps t = 1, 2, ..., H. In the t-th time step, the agent observes the current state s_t ∈ S, takes an action a_t ∈ A, and observes feedback in the form of a reward signal r_{t+1} ∈ R. Moreover, let π_θ : S → A denote a policy specified by a parameter vector θ ∈ R^d. The policy is a function of the current state, a_t = π_θ(s_t), whose output is the action for a deterministic policy or its selection probabilities for a random policy. For a non-stationary finite-horizon MDP, we can write π_θ = (π^1_θ, ..., π^H_θ).

Let P(s_{t+1}|s_t, a_t; ω^c) represent the state transition model characterizing the probability of transitioning to a particular state s_{t+1} from state s_t. Suppose the underlying process model can be characterized by parameters ω^c. Let D^{π_θ}_{P_{ω^c}}(τ) denote the probability distribution of the trajectory

τ = τ_{[1:H−1]} ≡ (s_1, a_1, s_2, a_2, ..., s_{H−1}, a_{H−1}, s_H)

of the state-action sequence over transition probabilities parameterized by the transition model P(s_{t+1}|s_t, a_t; ω^c), starting from state s_1 and following policy π_θ. The bioprocess trajectory length H can be scenario-dependent. For example, it can depend on the CQAs of raw materials and working cells. We write the distribution of the decision process trajectory as

D^{\pi_\theta}_{P_{\omega^c}}(\tau) \equiv p(s_1;\omega^c) \prod_{t=1}^{H-1} \pi^t_\theta(a_t|s_t)\, p(s_{t+1}|s_t,a_t;\omega^c). \quad (1)

Let R(τ) denote the expected total reward for the trajectory (sample path) τ starting from s_1, i.e., R(τ) ≡ ∑_{t=1}^{H−1} γ^{t−1} r_t(s_t, a_t), where γ is the discount factor and the reward r_t(s_t, a_t) occurring in the t-th time step depends on the state s_t and action a_t. Therefore, given the process model specified by ω^c, we are interested in finding the optimal policy maximizing the expected total reward,

\pi^\star_\theta(\cdot|\omega^c) = \arg\max_{\pi_\theta} \mu^c(\pi_\theta) \equiv \arg\max_{\pi_\theta} E_{\tau\sim D^{\pi_\theta}_{P_{\omega^c}}(\tau)}\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1\right]. \quad (2)
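To make the setup concrete, the following minimal Python sketch simulates one trajectory from D^{π_θ}_{P_{ω^c}}(τ) in eq. (1) and computes the discounted total reward in eq. (2). The callables sample_initial_state, sample_transition, policy, and reward_fn are hypothetical placeholders for the bioprocess-specific components, not the authors' implementation.

import numpy as np

def simulate_trajectory(sample_initial_state, sample_transition, policy, reward_fn, H, rng):
    # Draw one trajectory tau = (s_1, a_1, ..., s_H) following policy pi_theta and
    # transition model P(.|s_t, a_t; omega), recording the rewards r_t(s_t, a_t).
    s = sample_initial_state(rng)
    states, actions, rewards = [s], [], []
    for t in range(H - 1):
        a = policy(s, rng)                   # a_t ~ pi_theta(. | s_t)
        r = reward_fn(s, a, t)               # r_t(s_t, a_t)
        s = sample_transition(s, a, rng)     # s_{t+1} ~ P(. | s_t, a_t; omega)
        actions.append(a)
        rewards.append(r)
        states.append(s)
    return states, actions, np.array(rewards)

def discounted_return(rewards, gamma):
    # R(tau) = sum_{t=1}^{H-1} gamma^{t-1} r_t, as in eq. (2).
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))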

2.2 Model Risk Quantification and Bayesian Reinforcement Learning

However, the underlying process model is typically unknown and estimated by the limited historical real-world data. Here, we focus on Bayesian reinforcement learning (RL) with model risk quantified by the posterior distribution. We consider the growing-batch RL setting (Laroche and Tachet des Combes 2019). The process consists of successive periods: in each p-th period, a batch of data is collected with a fixed policy from the distributed complex bioprocess, it is used to update the knowledge of the bioprocess state transition model, and then the policy is updated for the next period. At any p-th period, given all real-world historical data collected so far, denoted by D_p, we construct the posterior distribution of the state transition model quantifying the model risk, p(ω|D_p) ∝ p(D_p|ω)p(ω), where the prior p(ω) quantifies the existing knowledge on bioprocess dynamic mechanisms. Since the posterior of the previous time period can be the prior for the next update, the posterior will be updated as new process data are collected. There are various advantages of using the posterior distribution to quantify the model risk, including: (1) it can incorporate the existing domain knowledge on bioprocess dynamic mechanisms; (2) it is valid even when the historical process data are very limited, which often happens in biomanufacturing; and (3) it facilitates online learning and automatic updating of bioprocess knowledge.

At any p-th period, to provide reliable guidance on dynamic decision making, we need to consider both the process inherent stochastic uncertainty and the model risk. Let μ(π_θ) denote the total expected reward accounting for both sources of uncertainty:

\mu(\pi_\theta) \equiv E_{\omega\sim p(\omega|D_p)}\left[E_{\tau\sim D^{\pi_\theta}_{P_\omega}(\tau)}\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1, \omega\right]\right],

with the inner conditional expectation, \mu(\pi_\theta;\omega) \equiv E_{\tau\sim D^{\pi_\theta}_{P_\omega}(\tau)}\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\big|\, \pi_\theta, s_1, \omega\right], accounting for the stochastic uncertainty and the outer expectation accounting for the model risk. Therefore, given the partial information


of the bioprocess characterized by p(ω|D_p), we are interested in finding the optimal policy,

\pi^\star_\theta(\cdot \,|\, p(\omega|D_p)) = \arg\max_{\pi_\theta}\mu(\pi_\theta) \equiv \arg\max_{\pi_\theta} E_{\omega\sim p(\omega|D_p)}\left[E_{\tau\sim D^{\pi_\theta}_{P_\omega}(\tau)}\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1, \omega\right]\right]. \quad (3)
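A plain nested Monte Carlo estimate of μ(π_θ) in eq. (3) averages simulated returns over posterior samples of ω. The sketch below illustrates this under the assumption that sample_posterior and simulate_return are user-supplied callables (hypothetical names, not from the paper).

import numpy as np

def estimate_mu(sample_posterior, simulate_return, theta, n_omega=100, n_traj=10, seed=0):
    # Outer average over omega ~ p(omega | D_p) accounts for model risk;
    # inner average over trajectories tau ~ D^{pi_theta}_{P_omega} accounts for
    # the inherent stochastic uncertainty, as in eq. (3).
    rng = np.random.default_rng(seed)
    outer = []
    for _ in range(n_omega):
        omega = sample_posterior(rng)
        returns = [simulate_return(theta, omega, rng) for _ in range(n_traj)]
        outer.append(np.mean(returns))        # estimate of mu(pi_theta; omega)
    return float(np.mean(outer))              # estimate of mu(pi_theta)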

3 GREEN SIMULATION ASSISTED REINFORCEMENT LEARNING WITH MODEL RISK

In this section, we present the green simulation assisted Bayesian reinforcement learning, which can efficiently leverage the information from historical process trajectory data and accelerate the search for the optimal policy. In Section 3.1, at each p-th period and given real-world data D_p, we derive the policy gradient solving the stochastic optimization problem (3) and develop the likelihood ratio based green simulation to improve the gradient estimation. Motivated by the metamodel study in Dong, Feng, and Nelson (2018), a decision process mixture proposal distribution and the likelihood ratio based metamodel for the policy gradient are derived, which can reuse the process trajectories generated from previous experiments to improve the gradient estimation stability and speed up the search for the optimal policy. In Section 3.2, we provide the algorithm for the proposed online green simulation assisted policy gradient with model risk.

3.1 Green Simulation Assisted Policy Gradient

At each p-th period and given real-world data D_p, we develop the green simulation based likelihood ratio to efficiently use the existing process data and facilitate the policy gradient search. Conditional on the posterior distribution p(ω|D_p), the objective of reinforcement learning is to maximize the expected performance, π^⋆_θ(· | p(ω|D_p)) = argmax_{π_θ} μ(π_θ). Based on eq. (3), for any proposal policy and transition model parameters (θ̄, ω̄) under which existing trajectories were generated, we can rewrite the objective function,

\mu(\pi_\theta) = E_{\omega\sim p(\omega|D_p)}\left[E_{\tau\sim D^{\pi_\theta}_{P_\omega}(\tau)}\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1, \omega\right]\right]

= \int\!\!\int \frac{p_\omega(s_1)\prod_{t=1}^{H-1}\pi_\theta(a_t|s_t)\, p_\omega(s_{t+1}|s_t,a_t)}{p_{\bar\omega}(s_1)\prod_{t=1}^{H-1}\pi_{\bar\theta}(a_t|s_t)\, p_{\bar\omega}(s_{t+1}|s_t,a_t)}\; p_{\bar\omega}(s_1)\prod_{t=1}^{H-1}\pi_{\bar\theta}(a_t|s_t)\, p_{\bar\omega}(s_{t+1}|s_t,a_t)\,\sum_{t=1}^{H-1}\gamma^{t-1} r_t\; p(\omega|D_p)\, d\tau\, d\omega

= E_{\omega\sim p(\omega|D_p)}\left[E_{\tau\sim D^{\pi_{\bar\theta}}_{P_{\bar\omega}}(\tau)}\left[\frac{p_\omega(s_1)\prod_{t=1}^{H-1}\pi_\theta(a_t|s_t)\, p_\omega(s_{t+1}|s_t,a_t)}{p_{\bar\omega}(s_1)\prod_{t=1}^{H-1}\pi_{\bar\theta}(a_t|s_t)\, p_{\bar\omega}(s_{t+1}|s_t,a_t)}\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1, \omega\right]\right]

= E_{\omega\sim p(\omega|D_p)}\left[E_{\tau\sim D^{\pi_{\bar\theta}}_{P_{\bar\omega}}(\tau)}\left[\frac{D^{\pi_\theta}_{P_\omega}(\tau)}{D^{\pi_{\bar\theta}}_{P_{\bar\omega}}(\tau)}\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1, \omega\right]\right]. \quad (4)

The likelihood ratio D^{π_θ}_{P_ω}(τ)/D^{π_θ̄}_{P_ω̄}(τ) in eq. (4) can adjust the existing trajectories generated by the policy π_θ̄ and transition model p(s_{t+1}|s_t, a_t; ω̄) to predict the mean response at the new policy, μ(π_θ).

Let k denote the accumulated number of iterations of the optimal search occurring in the previous p periods. For notational simplification, suppose there is a fixed number of iterations in each period (say K). At the k-th iteration, we only generate one posterior sample ω_k ∼ p(ω|D_p) to estimate the outer expectation in eq. (4). For the candidate policy π_{θ_k}, the likelihood ratio based green simulation is used to estimate the mean response μ(π_{θ_k}). It can reuse the process trajectories obtained from previous simulation experiments generated by using the policies and state transition models (π_{θ_i}, ω_i) with i = 1, 2, ..., k. They are obtained in the previous p periods with different posterior distributions, i.e., p(ω|D_ℓ) with ℓ = 1, 2, ..., p. Then, since each proposal distribution is based on a single decision process distribution D^{π_{θ_i}}_{P_{ω_i}}(τ) specified by (π_{θ_i}, ω_i), we create the green simulation individual likelihood ratio (ILR) estimator of μ(π_{θ_k}),

\mu^{ILR}_{k,\mathbf{n}} \equiv \frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\frac{D^{\pi_{\theta_k}}_{P_{\omega_k}}(\tau^{(i,j)})}{D^{\pi_{\theta_i}}_{P_{\omega_i}}(\tau^{(i,j)})}\sum_{t=1}^{H_{ij}-1}\gamma^{t-1} r_t(a^{(i,j)}_t, s^{(i,j)}_t)\right], \qquad \tau^{(i,j)} \overset{i.i.d.}{\sim} D^{\pi_{\theta_i}}_{P_{\omega_i}},


where τ^{(i,j)} is the j-th sample path generated by using (π_{θ_i}, ω_i) and n = (n_1, n_2, ..., n_k) is the combination of replications allocated at each (π_{θ_i}, ω_i) for i = 1, 2, ..., k. Since the process trajectory length is scenario-dependent, we replace the horizon H with H_{ij} to indicate its trajectory dependence.
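The ILR estimator can be computed directly from stored trajectories together with their per-step log-probabilities. The sketch below is illustrative only: the trajectory container and the callables log_pi and log_trans are hypothetical interfaces, and the initial-state density p(s_1; ω) is omitted from the log-density (it should be included whenever it depends on ω).

import numpy as np

def traj_log_density(traj, log_pi, log_trans, omega):
    # log D^{pi_theta}_{P_omega}(tau), omitting the initial-state term p(s_1; omega).
    return sum(log_pi(a, s) + log_trans(s_next, s, a, omega)
               for s, a, s_next in zip(traj["states"][:-1], traj["actions"], traj["states"][1:]))

def ilr_estimator(batches, log_pi_k, log_trans, omega_k, gamma):
    # Individual likelihood ratio (ILR) estimate of mu(pi_theta_k): every stored trajectory,
    # generated under (pi_theta_i, omega_i), is reweighted toward (pi_theta_k, omega_k).
    per_iteration = []
    for batch in batches:                     # batch i holds trajectories and its (log_pi_i, omega_i)
        vals = []
        for traj in batch["trajectories"]:
            log_w = (traj_log_density(traj, log_pi_k, log_trans, omega_k)
                     - traj_log_density(traj, batch["log_pi"], log_trans, batch["omega"]))
            r = np.asarray(traj["rewards"])
            vals.append(np.exp(log_w) * np.sum(gamma ** np.arange(len(r)) * r))
        per_iteration.append(np.mean(vals))   # inner average over the n_i replications
    return float(np.mean(per_iteration))      # outer average over the k iterations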

This expected total reward estimator μ^{ILR}_{k,n} can be used in the policy gradient to search for the optimal policy. Under some regularity conditions, we provide the derivation for the policy gradient estimator.

\nabla_\theta\,\mu(\pi_\theta;\omega) = \nabla_\theta\, E_{\tau\sim D^{\pi_\theta}_{P_\omega}}\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1, \omega\right] = \int \nabla_\theta D^{\pi_\theta}_{P_\omega}(\tau)\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t(s_t,a_t)\right] d\tau

= \int D^{\pi_\theta}_{P_\omega}(\tau)\,\nabla_\theta \log\!\big(D^{\pi_\theta}_{P_\omega}(\tau)\big)\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t(s_t,a_t)\right] d\tau

= \int D^{\pi_\theta}_{P_\omega}(\tau)\sum_{t=1}^{H-1}\big[\nabla_\theta \log(\pi_\theta(a_t|s_t)) + \nabla_\theta \log(p(s_{t+1}|s_t,a_t))\big]\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t(s_t,a_t)\right] d\tau

= \int D^{\pi_\theta}_{P_\omega}(\tau)\sum_{t=1}^{H-1}\big[\nabla_\theta \log(\pi_\theta(a_t|s_t))\big]\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t(s_t,a_t)\right] d\tau

= E_{\tau\sim D^{\pi_\theta}_{P_\omega}}\left[\sum_{t=1}^{H-1}\nabla_\theta \log(\pi_\theta(a_t|s_t))\left[\sum_{t'=1}^{t-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'}) + \sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'})\right] \Big|\, \pi_\theta, s_1, \omega\right]

= \sum_{t=1}^{H-1} E_{\tau_{[1:t-1]}}\left[E_{\tau_{[t:H-1]}}\left[\nabla_\theta \log(\pi_\theta(a_t|s_t))\sum_{t'=1}^{t-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'}) \,\Big|\, \tau_{[1:t-1]}\right] \Big|\, \pi_\theta, s_1, \omega\right] + E_{\tau\sim D^{\pi_\theta}_{P_\omega}}\left[\sum_{t=1}^{H-1}\nabla_\theta \log(\pi_\theta(a_t|s_t))\sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'}) \,\Big|\, \pi_\theta, s_1, \omega\right]

= E_{\tau\sim D^{\pi_\theta}_{P_\omega}}\left[\sum_{t=1}^{H-1}\nabla_\theta \log(\pi_\theta(a_t|s_t))\sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'}) \,\Big|\, \pi_\theta, s_1, \omega\right] \quad (5)

= E_{\tau\sim D^{\pi_{\bar\theta}}_{P_{\bar\omega}}}\left[\frac{D^{\pi_\theta}_{P_\omega}(\tau)}{D^{\pi_{\bar\theta}}_{P_{\bar\omega}}(\tau)}\sum_{t=1}^{H-1}\nabla_\theta \log(\pi_\theta(a_t|s_t))\sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'}) \,\Big|\, \pi_\theta, s_1, \omega\right] \quad (6)

where eq. (6) holds by a derivation similar to that of eq. (4). Eq. (5) holds because

E_{\tau_{[1:t-1]}}\left[E_{\tau_{[t:H-1]}}\left[\nabla_\theta \log(\pi_\theta(a_t|s_t))\sum_{t'=1}^{t-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'}) \,\Big|\, \tau_{[1:t-1]}\right] \Big|\, \pi_\theta, s_1, \omega\right]

= E_{\tau_{[1:t-1]}}\left[\sum_{t'=1}^{t-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'})\, E_{\tau_{[t:H-1]}}\left[\nabla_\theta \log(\pi_\theta(a_t|s_t)) \,\big|\, \tau_{[1:t-1]}\right] \Big|\, \pi_\theta, s_1, \omega\right], \quad (7)

where

E_{\tau_{[t:H-1]}}\left[\nabla_\theta \log(\pi_\theta(a_t|s_t)) \,\big|\, \tau_{[1:t-1]}\right]
= \prod_{t'=t+1}^{H-1}\int \pi_\theta(a_{t'}|s_{t'})\, p(s_{t'+1}|s_{t'},a_{t'})\, da_{t'}\, ds_{t'+1} \int \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t,a_t)\, \nabla_\theta \log\!\big(p(s_{t+1}|s_t,a_t)\,\pi_\theta(a_t|s_t)\big)\, da_t\, ds_{t+1}
= \int p(s_{t+1},a_t|s_t)\,\nabla_\theta \log p(s_{t+1},a_t|s_t)\, da_t\, ds_{t+1}, \quad \text{since } p(s_{t+1},a_t|s_t) = \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t,a_t),
= \nabla_\theta \int p(s_{t+1},a_t|s_t)\, da_t\, ds_{t+1} = \nabla_\theta 1 = 0.

By plugging in eq. (6), the policy gradient becomes,


\nabla_\theta\,\mu(\pi_\theta) = \nabla_\theta\, E_\omega\!\left[E_{\tau\sim D^{\pi_\theta}_{P_\omega}(\tau)}\left[\sum_{t=1}^{H-1}\gamma^{t-1} r_t \,\Big|\, \pi_\theta, s_1, \omega\right]\right] = \nabla_\theta\, E_\omega[\mu(\pi_\theta;\omega)] = E_\omega[\nabla_\theta\,\mu(\pi_\theta;\omega)]

= E_\omega\!\left[E_{\tau\sim D^{\pi_{\bar\theta}}_{P_{\bar\omega}}}\left[\sum_{t=1}^{H-1}\nabla_\theta \log(\pi_\theta(a_t|s_t))\,\frac{D^{\pi_\theta}_{P_\omega}(\tau)}{D^{\pi_{\bar\theta}}_{P_{\bar\omega}}(\tau)}\sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(s_{t'},a_{t'})\right]\right]. \quad (8)

Then, we obtain the individual likelihood ratio based policy gradient estimator,

\nabla_\theta\,\mu^{ILR}_{k,\mathbf{n}} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\sum_{t=1}^{H-1}\nabla_\theta \log\!\big(\pi_{\theta_k}(a^{(i,j)}_t|s^{(i,j)}_t)\big)\,\frac{D^{\pi_{\theta_k}}_{P_{\omega_k}}(\tau^{(i,j)})}{D^{\pi_{\theta_i}}_{P_{\omega_i}}(\tau^{(i,j)})}\sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(a^{(i,j)}_{t'}, s^{(i,j)}_{t'})\right]. \quad (9)

The importance weight or likelihood ratio D^{π_{θ_k}}_{P_{ω_k}}(τ)/D^{π_{θ_i}}_{P_{ω_i}}(τ) is larger for the trajectories τ that are more likely to be generated by the policy π_{θ_k} and transition probabilities P_{ω_k}. During the model learning process, the current policy candidate π_{θ_k} can be quite different from the policies π_{θ_i} for i = 1, 2, ..., k−1 that generated the existing trajectories. Although this importance weighted estimator is unbiased, its variance can grow exponentially as the horizon H increases, which restricts its application.

Since the likelihood ratio with a single proposal distribution can lead to high estimation variance, inspired by the BLR-M metamodel proposed in Dong et al. (2018), we develop the bioprocess Mixture proposal distribution and Likelihood Ratio based policy gradient estimation (MLR), which allows us to selectively reuse the previous experiment trajectories and reduce the gradient estimation variance. Specifically, at the k-th iteration of the search for the optimal policy, we generate a posterior sample of the process model parameters, ω_k ∼ p(ω|D_p). During the optimal policy search, if new process data arrive, the posterior is automatically updated. The policy candidate π_{θ_k} and the transition probability model P(s_{t+1}|s_t, a_t; ω_k) uniquely define the trajectory distribution D^{π_{θ_k}}_{P_{ω_k}}(τ). Based on the historical trajectories generated during the previous p periods, we create a mixture proposal distribution ∑_{i=1}^{k} α^k_i D^{π_{θ_i}}_{P_{ω_i}}(τ), and then use it to construct the likelihood ratio,

f_k(\tau|\theta,\omega) \equiv \frac{D^{\pi_{\theta_k}}_{P_{\omega_k}}(\tau)}{\sum_{i=1}^{k}\alpha^k_i\, D^{\pi_{\theta_i}}_{P_{\omega_i}}(\tau)}, \quad (10)

where θ = (θ_1, ..., θ_k), ω = (ω_1, ..., ω_k), α^k_i = n_i / ∑_{i'=1}^{k} n_{i'}, and n_i is the number of trajectories generated during the previous i-th iteration with (θ_i, ω_i) for i = 1, ..., k. By replacing the likelihood ratio D^{π_{θ_k}}_{P_{ω_k}}(τ)/D^{π_{θ_i}}_{P_{ω_i}}(τ) in eq. (9) with f_k(τ|θ, ω), the green simulation based policy gradient estimator becomes,

\nabla_\theta\,\mu^{MLR}_{k,\mathbf{n}} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\sum_{t=1}^{H_{ij}-1}\nabla_\theta \log\!\big(\pi_{\theta_k}(a^{(i,j)}_t|s^{(i,j)}_t)\big)\, f_k(\tau^{(i,j)}|\theta,\omega)\sum_{t'=t}^{H_{ij}-1}\gamma^{t'-1} r_{t'}(a^{(i,j)}_{t'}, s^{(i,j)}_{t'})\right], \quad (11)

where τ^{(i,j)} i.i.d.∼ D^{π_{θ_i}}_{P_{ω_i}}(τ) with j = 1, 2, ..., n_i represent the trajectories generated in the previous i-th iteration. Notice that the mixture proposal distribution based likelihood ratio f_k(τ|θ, ω) is bounded above by 1/α^k_k. In this way, the mixture likelihood ratio puts higher weight on the existing trajectories that are more likely to be generated by D^{π_{θ_k}}_{P_{ω_k}}(τ) in the k-th iteration without assigning extremely large weights to the others.

Since the parameterization plays an important role in the optimal policy gradient approach, we briefly discuss several possible policy functions. The policy function in reinforcement learning can be either stochastic or deterministic; see Silver et al. (2014) and Sutton and Barto (2018). The policy for discrete actions can be defined as the softmax function,

\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a'\in A} e^{\theta^\top \phi(s,a')}},

where φ(s, a) ∈ R^d is the feature vector of the state-action pair (s, a). The gradient of the log policy is

\nabla_\theta \log(\pi_\theta(a|s)) = \phi(s,a) - \sum_{a'\in A}\phi(s,a')\,\pi_\theta(a'|s).

For continuous action spaces, we can apply a Gaussian policy, say π_θ(a|s) = N(θ^⊤ φ(s), σ²)


for some constant σ, where φ(s) is the feature representation of s. The gradient of the log policy is

\nabla_\theta \log(\pi_\theta(a|s)) = \nabla_\theta\left[\frac{-(a-\theta^\top\phi(s))^2}{2\sigma^2}\right] = \frac{a-\theta^\top\phi(s)}{\sigma^2}\,\phi(s).

In general, as long as the predictive models have a gradient descent learning algorithm, they can be applied in our approach, such as deep neural networks, generalized linear regression, SVM, etc. In the empirical study, we considered a two-layer MLP model as our policy function.
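As an illustration of how eqs. (10) and (11) can be evaluated from stored trajectories, the sketch below combines the softmax score function discussed above with a log-domain computation of the mixture likelihood ratio. The data layout (one batch of trajectories per past iteration, each trajectory storing states, actions, and rewards) and the callables log_density_k and log_densities are hypothetical interfaces assumed for this sketch, not the authors' code.

import numpy as np

def softmax_scores(theta, phi_sa):
    # For a softmax policy pi_theta(a|s) with feature rows phi(s,a), return
    # grad_theta log pi_theta(a|s) for every action a: phi(s,a) - sum_a' pi(a'|s) phi(s,a').
    logits = phi_sa @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return phi_sa - probs @ phi_sa

def mixture_likelihood_ratio(log_target, log_props, alphas):
    # f_k(tau) in eq. (10), evaluated from log trajectory densities for numerical stability.
    m = log_props.max()
    return np.exp(log_target - (m + np.log(np.sum(alphas * np.exp(log_props - m)))))

def mlr_gradient(batches, score_fn, log_density_k, log_densities, alphas, gamma, dim):
    # Sketch of eq. (11): (1/k) sum_i (1/n_i) sum_j [ sum_t score_t * f_k(tau) * reward-to-go_t ].
    grad = np.zeros(dim)
    for batch in batches:                                  # trajectories from past iteration i
        g_i = np.zeros(dim)
        for traj in batch:
            log_props = np.array([ld(traj) for ld in log_densities])
            w = mixture_likelihood_ratio(log_density_k(traj), log_props, alphas)
            r = np.asarray(traj["rewards"])
            togo = np.cumsum((gamma ** np.arange(len(r)) * r)[::-1])[::-1]  # sum_{t'>=t} gamma^{t'-1} r_t'
            for t, (s, a) in enumerate(zip(traj["states"][:-1], traj["actions"])):
                g_i += w * togo[t] * score_fn(s, a)
        grad += g_i / len(batch)
    return grad / len(batches)

Here score_fn(s, a) would return ∇_θ log π_{θ_k}(a|s), e.g., the row of softmax_scores corresponding to the chosen action.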

3.2 Optimal Policy Search Algorithm

Algorithm 1 provides the procedure for the green simulation assisted policy gradient approach to support online learning and guide dynamic decision making.

Algorithm 1: Online Green Simulation Assisted Policy Gradient with Model Risk

Input: the number of periods P for real-world dynamic data collection; the number of iterations K for optimal policy search in each period; a differentiable policy π_θ(a|s), ∀a ∈ A, s ∈ S, θ ∈ R^d; and initial real-world data D_1. Initialize the set of sample trajectories E_1, the set of transition model parameters Ω_1, and the set of policy parameters Θ_1 to be empty sets.

for p = 1, 2, ..., P (at each new real-world data collection point) do
    for k = (p−1)K + 1, (p−1)K + 2, ..., pK do
        1. Generate a posterior sample ω_k ∼ p(ω|D_p) and build the transition model with the new parameter ω_k, i.e., p(s_{t+1}|s_t, a_t, ω_k) for t = 1, 2, ..., H−1;
        2. Generate n_k trajectories by using the current policy π_{θ_k} and model parameter ω_k;
        for j = 1, 2, ..., n_k do
            (a) Generate the j-th episode τ^{(k,j)} = (s^{(k,j)}_1, a^{(k,j)}_1, s^{(k,j)}_2, a^{(k,j)}_2, ..., s^{(k,j)}_{H−1}, a^{(k,j)}_{H−1}, s^{(k,j)}_H) of the state-action sequence, starting from the initial state s^{(k,j)}_1 ∼ p(s_1|ω_k), interacting with the transition model s^{(k,j)}_{t+1} ∼ p(s_{t+1}|s^{(k,j)}_t, a^{(k,j)}_t; ω_k), and following the policy a^{(k,j)}_t ∼ π_{θ_k}(a_t|s^{(k,j)}_t) for a stochastic policy or a^{(k,j)}_t = π_{θ_k}(s^{(k,j)}_t) for a deterministic policy;
        end
        3. Reuse the trajectories generated in the current and all previous iterations to improve the gradient estimation;
        for i = 1, 2, ..., k and j = 1, 2, ..., n_i do
            Construct the mixture proposal distribution based likelihood ratio, f_k(τ^{(i,j)}|θ, ω), by using eq. (10);
        end
        4. Calculate the gradient ∇_θ μ^{MLR}_{k,n} based on eq. (11) and update the policy: θ_{k+1} ← θ_k + η_k · ∇_θ μ^{MLR}_{k,n};
        5. Record the newly generated trajectories E_{k+1} = E_k ∪ {τ^{(k,j)} | j = 1, 2, ..., n_k}, transition model parameters Ω_{k+1} = Ω_k ∪ {ω_k}, and policy parameters Θ_{k+1} = Θ_k ∪ {θ_k};
    end
    6. Collect new real-world process data L_p by following the estimated optimal policy π^⋆_{θ_k}(a|s) from Step (4). Then, update the historical data set D_{p+1} = D_p ∪ L_p and the posterior distribution p(ω|D_{p+1}).
end

At any p-th period, given the real-world data D_p collected so far, the model risk is quantified by the posterior distribution p(ω|D_p), and then we apply the green simulation assisted policy gradient to search for the optimal policy in Steps (1)–(4). Specifically, in each k-th iteration, we first generate the posterior


sample for the state transition probability model in Step (1), ω_k ∼ p(ω|D_p), and then generate n_k trajectories by using the current policy π_{θ_k} and model parameter ω_k in Step (2). Then, in Steps (3) and (4), we reuse all historical trajectories and apply the green simulation based policy gradient to speed up the search for the optimal policy. After that, as new real-world data arrive, we update the posterior of the transition model in Step (6), and then repeat the above procedure. In the empirical study, we use a fixed learning rate η_k = 0.01. Notice that the proposed mixture likelihood ratio based policy gradient can be easily extended to broader reinforcement learning settings, such as online, offline, and model-free cases.
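For reference, the control flow of Algorithm 1 can be summarized by the following skeleton. All callables (sample_posterior, simulate, mlr_gradient, collect_real_data, update_posterior) are placeholders for the components described above; the sketch only fixes the loop structure and the bookkeeping of reused trajectories, and it assumes theta is a NumPy parameter vector.

def gs_rl(P, K, n_k, theta, posterior, sample_posterior, simulate,
          mlr_gradient, collect_real_data, update_posterior, eta=0.01):
    # Skeleton of Algorithm 1: outer loop over data-collection periods p,
    # inner loop over policy-gradient iterations k that reuse all stored trajectories.
    trajectories, omegas, thetas = [], [], []
    for p in range(P):
        for _ in range(K):
            omega_k = sample_posterior(posterior)                       # Step 1
            new_trajs = [simulate(theta, omega_k) for _ in range(n_k)]  # Step 2
            trajectories.append(new_trajs)                              # Step 5 (record)
            omegas.append(omega_k)
            thetas.append(theta.copy())
            grad = mlr_gradient(trajectories, thetas, omegas,           # Steps 3-4, eq. (11)
                                theta, omega_k)
            theta = theta + eta * grad                                  # Step 4 (policy update)
        new_data = collect_real_data(theta)                             # Step 6
        posterior = update_posterior(posterior, new_data)
    return theta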

4 EMPIRICAL STUDY

In this section, we study the performance of MLR using a biomanufacturing example. The upstream simulation model was built based on a first-principle model proposed by Jahic et al. (2002), and the downstream chromatography purification process follows Martagan et al. (2018). The empirical study results show that MLR outperforms the state-of-the-art policy search and baseline model-based stochastic gradient algorithms without the BLR-M metamodel.

4.1 A Biomanufacturing Example

In this paper, we consider batch-based biomanufacturing and use the stochastic simulation model built based on our previous study (Wang et al. 2019) to characterize the dynamic evolution of the biomanufacturing process. A reinforcement learning model with a continuous state space and a discrete action space is then constructed to search for the optimal decisions on the chromatography pooling window, which was studied by Martagan et al. (2018). Instead of assuming that each chromatography step removes a uniformly distributed random proportion of protein and impurity (Martagan et al. 2018), we let the random removal fraction follow a Beta distribution with a more realistic and flexible shape.

This biomanufacturing process consists of: (1) upstream fermentation, where cells produce the target protein; and (2) downstream purification, which removes the impurities through multiple chromatography steps. The primary output of fermentation is a mixture including the target protein and a significant amount of unwanted impurity derived from the host cells or fermentation medium. After fermentation, each batch needs to be purified using chromatography to meet the specified quality requirements, i.e., the purity concentration reaching a certain threshold level. Since chromatography typically contributes the main cost of downstream purification, in this paper, we focus on optimizing the integrated protein purification decisions related to chromatography operations or pooling window selection. To guide the downstream purification dynamic decision making, we formulate the reinforcement learning model for the biomanufacturing process as follows.

Decision Epoch: Following Martagan et al. (2018), we consider three-step chromatography. During chromatography, we observe measurements and make decisions at each decision epoch t ∈ T = {1, 2, 3}.

State Space: The state s_t at any decision time t is denoted by the protein-impurity-step tuple s_t ≜ (p_t, i_t, t) on the finite space P × I × T, where P ≡ [0, P̄] and I ≡ [0, Ī]. The state space P is bounded by a predefined constant threshold P̄ due to limitations in cell viability, growth rate, antibody production rate, etc. The state space I is bounded by a predefined constant threshold Ī following FDA process quality standards.

Action Space: Let a_t denote the selection of the pooling window given the state s_t = (p_t, i_t, t) at time t ∈ T, following a policy π_θ(s_t). To simplify the problem, we consider 10 candidate pooling windows per chromatography step.

Reward: At the end of the downstream process, we record the reward,

r(p_t, i_t, t=3) = \begin{cases} -c_f, & \text{if } r_t < r_d,\\ r(p_d), & \text{if } r_t \ge r_d,\ p_t \ge p_d,\\ r(p_t) - c_l(p_d - p_t), & \text{if } r_t \ge r_d,\ p_t \le p_d. \end{cases}

We set the failure cost c_f = $48, the protein shortage cost c_l = $6 per milligram (mg), the product price $5 per mg, r(p_t) = $5 × p_t, the required protein amount p_d = 8 mg, and the purity requirement


r_d ≥ 85%. The operational cost for each chromatography column is $8 for t ∈ {1, 2, 3}, and r(p_t, i_t, t) = −$8 for t ∈ {1, 2}.

Initial State: The random protein and impurity inputs for downstream chromatography are generated with the cell culture first-principle model, which is based on differential equations with random noise. Here, we consider a fed-batch bioreactor dynamic model proposed by Jahic et al. (2002),

\frac{dX}{dt} = \left(-\frac{F}{V}+\mu\right)X, \qquad \frac{dS}{dt} = \frac{F}{V}(S_i - S) - q_s X, \qquad P = \nu_1 X \ \text{and} \ I = \nu_2 X, \quad (12)

where ν_1 ∼ N(0.11, 0.01²) and ν_2 ∼ N(0.11, 0.01²) denote the constant specific mAb protein production and impurity rates, X denotes the biomass concentration from dry weight (g L^{-1}), V = 1000 is the medium volume (L), S_i ∼ N(780, 40) denotes the inlet substrate concentration (g L^{-1}), S is the substrate concentration (g L^{-1}), q_{s,max} = 0.57 is the specific maximum rate of substrate consumption (g g^{-1} h^{-1}), q_s = q_{s,max} S/(S + 0.1) is the specific rate of substrate consumption (g g^{-1} h^{-1}), μ = (q_s − q_m) · Y_em is the specific growth rate (h^{-1}), Y_em = 0.3 is the biomass yield coefficient exclusive of maintenance, and q_m = 0.013 is the maintenance coefficient (g g^{-1} h^{-1}). The initial biomass and substrate are set to (0 g L^{-1}, 40 g L^{-1}). We set the total time of the production fermentation to 50 days, and obtain p^{(u)} mg of target protein and i^{(u)} mg of impurity by applying the differential equations in (12). After the harvest, we further add noise, following the normal distribution N(0, 5²), to account for the overall impact of other factors introduced during the cell production process. Then, the protein p_1 and impurity i_1 inputs for downstream purification become p_1 ∼ N(p^{(u)}, 5²) and i_1 ∼ N(i^{(u)}, 5²). Therefore, in the empirical study, this first-principle model based simulation is used to generate the random initial state or input s_1 = (p_1, i_1, 1) for downstream chromatography purification.

State Transitions: In each step of chromatography, random proportions of protein and impurity will be removed, which depend on the selection of the pooling window a_t. Specifically, given a pooling window, each chromatography step removes random proportions of protein and impurity,

i_{t+1} = (\Psi_t | a_t)\, i_t \quad \text{and} \quad p_{t+1} = (H_t | a_t)\, p_t,

where the fractions Ψ_t | a_t ∼ Beta(ψ^l_t|a_t, ψ^u_t|a_t) and H_t | a_t ∼ Beta(η^l_t|a_t, η^u_t|a_t) for all a_t ∈ A and t ∈ T. We use the posterior distribution of the model parameters ψ^l_t|a_t, ψ^u_t|a_t, η^l_t|a_t, and η^u_t|a_t to quantify the model risk. Here we use a uniform prior Unif(0, 300) for all parameters and generate the posterior samples based on MCMC using “PyMC3”.

Policy: We use a 2-layer perceptron (MLP) with a D = 16 dimensional first layer and a 10 dimensional output layer with a softmax activation function to parameterize our policy; see Section 11 in Hastie, Tibshirani, and Friedman (2001) for more discussion. For the 10 pooling window outputs, there are 10 units T_ℓ with ℓ = 1, ..., 10 at the second stage, with the ℓ-th unit modeling the probability of selecting action a_ℓ with ℓ = 1, ..., 10. There are 10 pooling window candidate actions a_ℓ, ℓ = 1, 2, ..., 10, each being coded as a 0-1 variable. The derived feature Z_d depends on a linear combination of the input state s, and then the output T_ℓ is modeled as a function of linear combinations of the Z_d,

Z_d = \text{Sigmoid}(w_{0d} + w_d^\top s), \quad d = 1, \ldots, D,
T_\ell = \beta_{0\ell} + \beta_\ell^\top Z, \quad \ell = 1, \ldots, 10,
\text{Prob}(a_\ell|s) \equiv \text{MLP}_\ell(s) = g_\ell(T), \quad \ell = 1, \ldots, 10, \quad (13)

where Z = (Z_1, Z_2, ..., Z_D), T = (T_1, ..., T_{10}), w = (w_{0d}, w_d^⊤), and β = (β_{0ℓ}, β_ℓ^⊤). We can obtain the policy parameters θ = (w, β). The activation function is set to be the sigmoid function, i.e., Sigmoid(x) = 1/(1 + e^{−x}). The output function g_ℓ(T) allows a final transformation of the vector of outputs T, which is set to be the softmax function, g_ℓ(T) = e^{T_ℓ} / ∑_{ℓ'=1}^{10} e^{T_{ℓ'}}.
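To make the downstream decision process concrete, the sketch below simulates one purification episode with the Beta-distributed removal fractions and the reward defined above. The purity measure used in the terminal reward is taken here as p/(p + i), and the terminal reward is evaluated on the material after the third column; both are assumptions of this illustrative sketch, as are the policy and beta_params interfaces.

import numpy as np

def chromatography_episode(p1, i1, policy, beta_params, rng,
                           c_f=48.0, c_l=6.0, price=5.0, p_d=8.0, r_d=0.85):
    # Three chromatography steps: each pooling-window choice a_t removes Beta-distributed
    # fractions of protein and impurity; rewards follow the specification above.
    p, i = p1, i1
    rewards = []
    for t in (1, 2, 3):
        a = policy((p, i, t), rng)                        # select one of the 10 pooling windows
        eta_l, eta_u, psi_l, psi_u = beta_params[(t, a)]  # Beta parameters for step t, window a
        p = rng.beta(eta_l, eta_u) * p                    # p_{t+1} = (H_t | a_t) p_t
        i = rng.beta(psi_l, psi_u) * i                    # i_{t+1} = (Psi_t | a_t) i_t
        if t < 3:
            rewards.append(-8.0)                          # $8 operational cost per column
        else:
            purity = p / (p + i)                          # assumed purity measure
            if purity < r_d:
                rewards.append(-c_f)                      # failed batch
            elif p >= p_d:
                rewards.append(price * p_d)               # r(p_d)
            else:
                rewards.append(price * p - c_l * (p_d - p))   # r(p_t) - c_l (p_d - p_t)
    return rewards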


4.2 Study the Performance of Green Simulation Assisted Policy Gradient

In this section, we compare the performance of the proposed green simulation assisted policy gradient (MLR) with the individual likelihood ratio based policy gradient (ILR), the classical policy gradient (PG), and the likelihood ratio based policy gradient with the true transition model known (TLR).

• Likelihood ratio based policy gradient with mixture proposal distribution (MLR): To reduce the computational complexity, instead of reusing all previous iterations, we introduce a rolling-window parameter k_r to control how many historical trajectories we use,

\nabla_\theta\,\mu^{MLR}_{k,\mathbf{n}} = \frac{1}{k_r}\sum_{i=k-k_r+1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\sum_{t=1}^{H_{ij}-1}\nabla_\theta \log\!\big(\pi_{\theta_k}(a^{(i,j)}_t|s^{(i,j)}_t)\big)\, f_k(\tau^{(i,j)}|\theta,\omega)\sum_{t'=t}^{H_{ij}-1}\gamma^{t'-1} r_{t'}(a^{(i,j)}_{t'}, s^{(i,j)}_{t'})\right].

In the empirical study, we use the most recent k_r = 10 iterations.

• Likelihood ratio based policy gradient with the true transition model known (TLR):

\nabla_\theta\,\mu^{TLR}_{k,\mathbf{n}} = \frac{1}{k_r}\sum_{i=k-k_r+1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\sum_{t=1}^{H_{ij}-1}\nabla_\theta \log\!\big(\pi_{\theta_k}(a^{(i,j)}_t|s^{(i,j)}_t)\big)\, f_k(\tau^{(i,j)}|\theta,\omega^c)\sum_{t'=t}^{H_{ij}-1}\gamma^{t'-1} r_{t'}(a^{(i,j)}_{t'}, s^{(i,j)}_{t'})\right]

= \frac{1}{k_r}\sum_{i=k-k_r+1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\sum_{t=1}^{H_{ij}-1}\nabla_\theta \log\!\big(\pi_{\theta_k}(a^{(i,j)}_t|s^{(i,j)}_t)\big)\,\frac{\prod_{t=1}^{H-1}\pi_{\theta_k}(a_t|s_t)}{\sum_{i'=1}^{k}\prod_{t=1}^{H-1}\pi_{\theta_{i'}}(a_t|s_t)}\sum_{t'=t}^{H_{ij}-1}\gamma^{t'-1} r_{t'}(a^{(i,j)}_{t'}, s^{(i,j)}_{t'})\right],

where the last step holds because

\frac{p_{\omega^c}(s_1)\prod_{t=1}^{H-1}\pi_{\theta_k}(a_t|s_t)\, p_{\omega^c}(s_{t+1}|s_t,a_t)}{\sum_{i=1}^{k} p_{\omega^c}(s_1)\prod_{t=1}^{H-1}\pi_{\theta_i}(a_t|s_t)\, p_{\omega^c}(s_{t+1}|s_t,a_t)} = \frac{\prod_{t=1}^{H-1}\pi_{\theta_k}(a_t|s_t)}{\sum_{i=1}^{k}\prod_{t=1}^{H-1}\pi_{\theta_i}(a_t|s_t)}.

• Individual likelihood ratio based policy gradient (ILR): It is obtained based on eq. (9),

\nabla_\theta\,\mu^{ILR}_{k,\mathbf{n}} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{n_i}\sum_{j=1}^{n_i}\left[\sum_{t=1}^{H-1}\nabla_\theta \log\!\big(\pi_{\theta_k}(a^{(i,j)}_t|s^{(i,j)}_t)\big)\,\frac{D^{\pi_{\theta_k}}_{P_{\omega_k}}(\tau^{(i,j)})}{D^{\pi_{\theta_i}}_{P_{\omega_i}}(\tau^{(i,j)})}\sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(a^{(i,j)}_{t'}, s^{(i,j)}_{t'})\right].

• Empirical policy gradient (PG): It uses the point estimator of the state transition model parameter as the true one,

\nabla_\theta\,\mu^{PG} = \frac{1}{n_i}\sum_{j=1}^{n_i}\left[\sum_{t=1}^{H-1}\nabla_\theta \log\!\big(\pi_\theta(a^{(i,j)}_t|s^{(i,j)}_t)\big)\sum_{t'=t}^{H-1}\gamma^{t'-1} r_{t'}(a^{(i,j)}_{t'}, s^{(i,j)}_{t'})\right].

Notice that in the MLR, ILR, and PG approaches, the underlying state transition model is unknown and estimated by finite real-world data. In TLR, we assume the model is known.

Here we set the amount of real-world data to m = 20 for the chromatography operation. Fig. 1 shows the convergence performance of MLR, TLR, ILR, and PG. The results are based on M = 5 macro replications. The x-axis represents the iteration index k, and the vertical dashed lines indicate the times when the new real-world process data are collected. Let r_h(k) denote the average reward of the policy obtained from the k-th iteration in the h-th macro replication, which is estimated by running r_test = 200 trajectories with the true state transition model. The y-axis reports r(k) = (1/M) ∑_{h=1}^{M} r_h(k). We also plot the 95% confidence band for each approach, [r(k) − 1.96 × SE(r(k)), r(k) + 1.96 × SE(r(k))], where

SE(r(k)) = \frac{1}{\sqrt{M(M-1)}}\sqrt{\sum_{h=1}^{M}\big(r_h(k) - r(k)\big)^2}.

Fig. 1 shows that MLR (red line) converges faster than PG and ILR. To better compare the performance of the candidate algorithms, we apply common random numbers (CRNs) for each macro replication.
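A minimal sketch of how the plotted curve and its 95% confidence band can be computed from the macro-replication results (the array layout is an assumption of this sketch):

import numpy as np

def mean_and_confidence_band(rewards):
    # rewards[h, k] = average test reward of the policy from iteration k in macro replication h.
    # Returns r(k), the mean over replications, and the band r(k) -/+ 1.96 * SE(r(k)).
    M = rewards.shape[0]
    r_bar = rewards.mean(axis=0)
    se = np.sqrt(((rewards - r_bar) ** 2).sum(axis=0) / (M * (M - 1)))
    return r_bar, r_bar - 1.96 * se, r_bar + 1.96 * se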

From Fig. 1, we can see that the algorithms have already converged after 400 iterations. We compare the performance of the policies obtained from MLR, PG, and ILR based on the results from the last 100 iterations. We record the sample mean μ_a = (1/100) ∑_{k=401}^{500} r(k) and the standard error

SE = \frac{1}{10}\sqrt{\frac{1}{99}\sum_{k=401}^{500}\big(r(k) - \mu_a\big)^2}

in Table 1. The results show that MLR tends to have better performance than both the PG and ILR approaches. When n_i = 25, based on M = 5 macro replications, the average runtime for MLR is 53.0 mins (12.6 mins for updating the posterior distribution and 40.4 mins for the policy search). The average runtime for PG is 34.3 mins (12.9 mins for updating the posterior distribution and 21.4 mins for the policy search). The average runtime for ILR is 50.1 mins (12.2 mins for updating the posterior distribution and 37.9 mins for the policy search).


Figure 1: Convergence results of MLR, TLR, ILR, and PG. Panels: (a) n_i = 50, (b) n_i = 25, (c) n_i = 10, (d) n_i = 5.

Table 1: Average reward estimated based on the last 100 iterations for MLR, TLR, ILR, and PG.

        n_i = 50        n_i = 25        n_i = 10        n_i = 5
        Mean    SE      Mean    SE      Mean    SE      Mean    SE
MLR     2.23    0.10    3.25    0.09    3.07    0.09    2.92    0.11
TLR     2.75    0.10    3.14    0.09    3.08    0.09    2.83    0.11
PG      1.80    0.09    3.04    0.10    3.10    0.10    2.53    0.11
ILR     1.83    0.10    2.36    0.10    3.01    0.10    2.39    0.13

5 CONCLUSIONS

We propose a green simulation assisted policy gradient algorithm. It can reduce the policy gradient estimation variance by selectively reusing the experiment data and automatically allocating more weight to those historical trajectories that are more likely to be generated by the stochastic decision process of interest. In addition, since we quantify the state transition probabilistic model risk with the posterior distribution, our model-based reinforcement learning can simultaneously support online learning and guide dynamic decision making. Thus, the proposed approach is robust to model risk, and it is applicable to various cases with different amounts of real-world data and process dynamic knowledge. In this paper, the empirical study of a biomanufacturing example is used to illustrate that our approach can perform better than the state-of-the-art reinforcement learning and policy gradient approaches.


REFERENCES

Dong, J., M. B. Feng, and B. L. Nelson. 2018, December. “Unbiased Metamodeling via Likelihood Ratios”. In 2018 Winter Simulation Conference (WSC), 1778–1789.

Feng, M., and J. Staum. 2017, October. “Green Simulation: Reusing the Output of Repeated Experiments”. ACM Transactions on Modeling and Computer Simulation (TOMACS) 27(4):23:1–23:28.

Hastie, T., R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.

Jahic, M., J. Rotticci-Mulder, M. Martinelle, K. Hult, and S.-O. Enfors. 2002. “Modeling of growth and energy metabolism of Pichia pastoris producing a fusion protein”. Bioprocess and Biosystems Engineering 24(6):385–393.

Laroche, R., and R. Tachet des Combes. 2019, July. “Multi-batch Reinforcement Learning”. In The 4th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM).

Martagan, T., A. Krishnamurthy, P. A. Leland, and C. T. Maravelias. 2018, January. “Performance Guarantees and Optimal Purification Decisions for Engineered Proteins”. Operations Research 66(1):18–41.

Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. 2015, February. “Human-level control through deep reinforcement learning”. Nature 518(7540):529–533.

Schaul, T., J. Quan, I. Antonoglou, and D. Silver. 2016. “Prioritized Experience Replay”. CoRR abs/1511.05952.

Silver, D., G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. 2014, June. “Deterministic Policy Gradient Algorithms”. In Proceedings of the 31st International Conference on Machine Learning, ICML. Beijing, China.

Sutton, R. S., and A. G. Barto. 2018. Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book.

Wang, B., W. Xie, T. Martagan, A. Akcay, and C. G. Corlu. 2019. “Stochastic Simulation Model Development for Biopharmaceutical Production Process Risk Analysis and Stability Control”. In Proceedings of the 2019 Winter Simulation Conference: IEEE, Inc.

AUTHOR BIOGRAPHIES

HUA ZHENG is a Ph.D. student in the Department of Mechanical and Industrial Engineering (MIE) at Northeastern University. His research interests include machine learning, data analytics, computer simulation, and stochastic optimization. His email address is [email protected].

WEI XIE is an assistant professor in MIE at Northeastern University. She received her M.S. and Ph.D. in Industrial Engineering and Management Sciences (IEMS) at Northwestern University. Her research interests include interpretable Artificial Intelligence (AI), computer simulation, data analytics, stochastic optimization, and blockchain development for cyber-physical system risk management, learning, and automation. Her email address is [email protected]. Her website is http://www1.coe.neu.edu/∼wxie/.

BEN MINGBIN FENG is an assistant professor in actuarial science at the University of Waterloo. He earned his Ph.D. in IEMS at Northwestern University. His research interests include stochastic simulation design and analysis, optimization via simulation, nonlinear optimization, and financial and actuarial applications of simulation and optimization methodologies. His e-mail address is [email protected]. His website is http://www.math.uwaterloo.ca/∼mbfeng/.