new approaches for enhancing simulation metamodeling · new approaches for enhancing simulation...

New Approaches for Enhancing Simulation Metamodeling

Xiaowei Zhang

HKUST

Joint work with Haihui Shen (CityU), L. Jeff Hong (CityU), and Liang Ding (HKUST)

Outline

1. Introduction

2. Stylized-model Enhanced Stochastic Kriging

3. Regularized Stochastic Kriging

4. Markovian Stochastic Kriging

5. Conclusions

Introduction

I Simulation is a popular tool for studying complex stochastic systems• e.g., supply chains, call centers, manufacturing systems

I We can optimize system performance by varying input parameters ordesign points of the simulation model

I However, simulation is often computationally expensive• e.g., large-scale queueing networks

Metamodeling

I “Model of the simulation model”

I Run simulation at a small number of design points

I Use the simulation outputs at the selected design points to predict thesimulation outputs at others without doing simulation

Sim.%Model

Metamodel

Real%System

I x = (x1, . . . , xd)ᵀ: vector of decision variables of a real system

• arrival rate to a critical care facility, service rates at different careunits, and the routing probabilities among the units

I η(x): mean performance of the simulation model

A Popular Metamodel: Stochastic Kriging (SK)

I Kriging was originally used in geostatistics (Matheron, 1963) and laterin the design and analysis of computer experiments (Sacks et al., 1989)

I SK was proposed by Ankenman et al. (2010)

I SK metamodel: represent η(x) by

η(x) = f(x)ᵀβββ +M(x)

• f(x): vector of known functions• βββ: vector of unknown parameters• M(·): zero-mean Gaussian random field

I Simulation output:

Yj(x) = f(x)ᵀβββ +M(x) + εj(x)

• ε1(x), ε2(x), . . . are the simulation errors

SK (cont’d)

I Slow: simulate η(x) at (xi : i = 1, . . . , k)• Yj(xi): simulation output on replication j at location xi

Y (xi) :=1

ni

ni∑j=1

Yj(xi)

I Fast: predict η(x0) with (Y (xi) : i = 1, . . . , k) for any x0

η(x0) = f(x0)ᵀβββ + ΣΣΣM (x0, ·)ᵀ(ΣΣΣM + ΣΣΣε)−1(Y − Fβββ)

Parameter Estimation via Maximum Likelihood

I Covariance function of M : Cov[M(x,x′)] = τ2R(x− x′;θθθ)• Typical example: R(x− x′;θθθ) = exp[−θ

∑i(xi − x

′i)

2]

I Log-likelihood

`(βββ, τ2, θθθ) = − ln |KKK(τ2, θθθ)| − (Y − Fβββ)ᵀKKK(τ2, θθθ)−1(Y − Fβββ),

where K(τ2, θθθ) = ΣΣΣM (τ2, θθθ) + ΣΣΣε

Potential Issues

1. Specification the “trend term” f(x)• Stylized-model enhanced SK : if a rough analytical approximation isavailable• Regularized SK : otherwise, apply statistical learning methods toautomatically select from a large collection of candidates

2. Computation of the inverse covariance matrix KKK−1

• Complexity is O(k3)• Prone to numerical instability: KKK becomes close to being singular ifthere are two design points that “close” to each other• Markovian SK : model K−1 directly and introduce sparsity byimposing Markovian structure

Literature on Enhancing Metamodels

I Incorporating gradient information• Morris et al. (1993), Mitchell et al. (1994) in DACE• Chen et al. (2013), Qu and Fu (2014) in stochastic simulation

I Leveraging another coarser but faster simulation model• Kennedy and O’Hagan (2000), Forrester et al. (2007)

I Let f(x) = p(x)/(1− x)n with p(x) being a polynomial• Cheng and Kleijnen (1999), Yang et al. (2007)• hard to generalize if x is multidimensional

Ordinary Stochastic Kriging (OSK)

I Despite its general form, in applications f(x) is mostly taken as aconstant, i.e., f(x)ᵀβββ ≡ β0

• Assuming no prior knowledge about η(x)

I Hard to capture highly nonlinear response surfaces

0.0 0.2 0.4 0.6 0.8 1.0Arrival rate x

0

2

4

6

8

10Ex

pect

ed q

ueue

leng

thTrue surfaceMean simulation outputFitted surface of OSK

±1.96 MSE interval

Stylized-model Enhanced Stochastic Kriging (SESK)

I A bulk of simulation models in practice are queueing networks

I Assuming no prior knowledge seems overly simplified given the longhistory of queueing theory

I SESK: add a stylized model with a closed-form solution to the trendterm, f(x) = (1, q(x))• e.g., q(x) is the mean queue length of the Jackson network

I We do not expect much quantitative accuracy from q(x) but merely arough prediction of the qualitative behavior of η(x).

Example: M/G/1 Queue

I True surface η(x) = 1.5x2/(1− x)

I q(1)(x) = x2/(1− x), q(2)(x) = 3x9, q(3)(x) = 10(x− 0.52)2.

0.0 0.2 0.4 0.6 0.8 1.0Arrival rate x

0

2

4

6

8

10

Expe

cted

que

ue le

ngth

True surfaceMean simulation outputFitted surface of OSK

±1.96 MSE interval

0.0 0.2 0.4 0.6 0.8 1.0Arrival rate x

0

2

4

6

8

10

Expe

cted

que

ue le

ngth

True surfaceMean simulation outputFitted surface of SESK

±1.96 MSE intervalStylized model (i) - M/M/1

0.0 0.2 0.4 0.6 0.8 1.0Arrival rate x

0

2

4

6

8

10

Expe

cted

que

ue le

ngth


±1.96 MSE intervalStylized model (ii) - rougher

0.0 0.2 0.4 0.6 0.8 1.0Arrival rate x

0

2

4

6

8

10

Expe

cted

que

ue le

ngth


±1.96 MSE intervalStylized model (iii) - irrelevant

Example: Patient Flow in a Hospital

I An open queueing network with 9 servers (medical units)

I Each server has a finite capacity and patients may be blocked

I Stylized model: treat each server as an isolated M/M/s/c queue• Adjust arrival rate and service rate via a system of heuristicequations

8 12 16 20 24 28x4

26

28

30

32

34

36

38

Mea

n so

jour

n tim

e (h

r.)

True responseMean simulation outputFitted surface of OSK

±1.96 MSE interval

8 12 16 20 24 28x4

26

28

30

32

34

36

38

Mea

n so

jour

n tim

e (h

r.)

True responseMean simulation outputFitted surface of SESK

±1.96 MSE intervalStylized model

Example: Dock Allocation at an Air Cargo Terminal

I Four multi-server queues with time-varying arrivals: Mt/G/s queues

I Find the optimal scheme for allocating servers to the four queues tominimize the mean waiting time

0 4 8 12 16 20 2424 one-hour intervals

0

10

20

30

40

50

60

70

80

Hour

ly a

rriva

l rat

es

Cargo 1Cargo 2

Cargo 3Cargo 4

Cargo Type Service Time Distribution (min.) Number of Docks

1 WEIB(21.8, 1.3) x1

2 7 + WEIB(67.6, 1.5) x2

3 7 + GAMM(25.7, 0.9) x3

4 7 + GAMM(9.4, 3.0) x4

Dock Allocation (cont’d)

I Consider two distinct stylized models• Stationary approximation for each queue: M/M/s• Fluid approximation of Mt/M/s

62 64 66 68 70 72 74x2

0

20

40

60

80

Mea

n wa

iting

tim

e (m

in.)

True surfaceMean simulation outputFitted surface of OSK

±1.96 MSE interval

62 64 66 68 70 72 74x2

0

20

40

60

80

Mea

n wa

iting

tim

e (m

in.)


±1.96 MSE intervalStylized model 1

62 64 66 68 70 72 74x2

0

20

40

60

80

Mea

n wa

iting

tim

e (m

in.)


±1.96 MSE intervalStylized model 2

62 64 66 68 70 72 74x2

0

20

40

60

80

Mea

n wa

iting

tim

e (m

in.)

True surfaceMean simulation outputFitted surface of SESK with both stylized models

±1.96 MSE interval

Dock Allocation (cont’d)

I There are 111 servers in total

I Decision variable{x ∈ N4

+ :∑i xi ≤ 111, x1 ≥ 5, x2 ≥ 61, x3 ≥ 5, x4 ≥ 21}

• 8,855 possible values in total

I Apply the Gaussian process-based search (GPS) algorithm (Sun et al.,2014) for optimization• The original version uses OSK• Replace it with SESK

0 100 200 300 400Total observations

30

35

40

45

50

55

Estim

ated

opt

imal

val

ue

GPSSEGPS

Regularized Stochastic Kriging (RSK)

I What if an analytical approximation is not easy to find or implement?

I f(x) is analogous to basis functions in nonparametric regression• Nontrivial to select (form, number of terms, etc.) manually

I Treat it as a feature selection problem in statistical learning

I Use the regularization technique to automatically select proper basisfunctions from a large collection• Penalize the magnitude of βββ properly in its estimation• L1 penalty drives the estimated coefficients of the insignificantfunctions to zero• Different from LASSO regression because of the correlated noise

Penalized Maximum Likelihood Estimation

I Log-likelihood of SK:

`(βββ, τ2, θθθ) = − ln |KKK(τ2, θθθ)| − (Y − Fβββ)ᵀKKK(τ2, θθθ)−1(Y − Fβββ),

where K(τ2, θθθ) = ΣΣΣM (τ2, θθθ) + ΣΣΣε

I Penalized log-likelihood of RSK:

˜(βββ, τ2, θθθ) = `(βββ, τ2, θθθ)− p(βββ),

where p(·) is a penalty function• L1 penalty: p(βββ) = λ‖βββ‖1• Elastic net penalty: p(βββ) = λ1‖βββ‖1 + λ2‖βββ‖2

I Use the block-coordinate descent method for numerical optimization• Alternately maximize over one of βββ, τ2, θθθ by fixing the other two

I Nontrivial to prove the “oracle” property• Sparsity: estimated coefficients of the insignificant basis functionsbecome zero asymptotically• Asymptotic optimality: estimated coefficients follow a multivariatenormal distribution asymptotically

Example: M/M/1 Queue

0 0.2 0.4 0.6 0.8 10

2

4

6

8

10

12truthdataOSKRSK

0 0.2 0.4 0.6 0.8 1-2

0

2

4

6

8

10

12truthdataOSKRSK

I True surface η(x) = x/(1− x)

I Improvement relative to OSK is significant but not as much as SESK• SESK is better if a good stylized model is available• RSK is more widely applicable

Numerical Issues

I MLE requires repeated computation of K(τ2, θθθ)−1

I Computational complexity is O(k3)• k becomes large easily if x is multidimensional

I A more serious numerical issue is K becomes near-singular easily

I θθθ is hard to estimate (Li and Sudjianto, 2005)• θθθ controls the correlation: Corr[M(x),M(x′)] = exp[−θ

∑i(xi − x

′i)

2]• Log-likelihood function is “flat” near the optimum of θθθ

I Solution: model Q = K−1 directly• Other approaches: tampering, low-rank approximation, etc.

Markovian Structure and Sparsity

I Crucial property: Qi,j = 0 if xi ⊥ xj conditional on the others• {M(xi) : i = 1, . . . , k} forms a Markov chain

I M(xi) and M(xj) are independent unless they are “neighbors”

I “Neighborhood” is defined by a user-specified graph, so Q can be madesparse• Accelerate the related matrix computation dramatically• Solve the near-singularity issue

Salemi, Nelson, and Staum

x

(a)

x

(b)

xq1 q1

q2

q2

(c)

Figure 2: (a) and (b) are two possible neighborhood structures for a lattice point x in a two-dimensionalinteger lattice; (c) shows parameter values assigned to each edge in (a) using the function p.

2.1 Parameterization

In our DOvS problem with integer-ordered decision variables, the feasible region X is a finite subset ofthe d-dimensional integer lattice Zd . Thus, the straightforward construction of the graph G starts withdefining the nodes of the graph to be the lattice points in the feasible region. To finish the constructionof G , we must specify when two nodes will be connected with an edge, which we do by specifyingthe neighborhood structure for any given node. Examples of two-dimensional neighborhood structuresare given by Figure 2a and Figure 2b. The neighborhood represented in Figure 2a, given by the set ofnodes {x0 2 X : ||x�x0||2 = 1}, has an upper bound of 2d neighbors in d dimensions. The neighborhoodrepresented in Figure 2b, given by the set of nodes {x0 2 X : ||x�x0||• = 1}, has an upper bound of 2d+1

neighbors in d dimensions. Two nodes are connected by an edge if they are neighbors as determined bythe neighborhood structure.

Since we are particularly focused on DOvS problems with large feasible regions, we will use theneighborhood structure represented by Figure 2a. Although we assume a specific neighborhood structure forefficient computations, our method can be used with any neighborhood structure. From the neighborhoodstructure represented by Figure 2a, the neighbors of a feasible point x are the set of feasible points{x0 2 X : ||x�x0||2 = 1}. This neighborhood structure leads to at most 2d|X| edges in the set E , and thusthe fraction of non-zero entries in the precision matrix Q is bounded from above by 2d/|X|. Although wehave defined the graph G , and thus the non-zero pattern of the precision matrix, we have not yet specifiedthe non-zero entries of the precision matrix. The non-zero entries of Q are determined by the functionp(x,x0;q), where q = (q0,q1,q2, . . . ,qd)

> is a vector of parameters, i.e., the i jth entry of Q is given byQi j , p(xi,x j;q). We use the function

p(x,x0;q) =

8<:

q0 if x = x0�q0q j if |x�x0| = e j

0 otherwise(3)

for x,x0 2 Zd , where e j is the jth standard basis vector. Since the conditional precisions in Equation(1) must be positive, it follows that q0 must be positive. We also want neighbors to have non-negativepartial correlations, given by Equation (2), so we restrict the values of q1,q2, . . . ,qd to be non-negative.Furthermore, we need qi 1 for i = 1,2, . . . ,d, since the partial correlations must be less than one. Lastly,we need Q to be positive-definite. We will use these restrictions later in Section 3.2 when we calculate themaximum likelihood estimates (MLEs) of the parameters. With this parameterization and these restrictions,Q is a non-singular M-matrix so its inverse is nonnegative, i.e., [Q�1]i j � 0 for all i and j. In other words,there are no negative correlations among nodes in the GMRF, which is a property that we want. Thefunction p, illustrated in Figure 2c, leads to a nonstationary GMRF; the variances at feasible points with

3812

I If x is discrete, such M(x) is a Gaussian Markov random field

Markovian Stochastic Kriging (MSK)

I With x being continuous, we assume M(x) is a Gaussian free field

I The domain of x must be specified

I G(x,y) := Cov[M(x),M(y)] is the solution to a PDE (heat equationwith Dirichlet boundary)

• E.g., if the domain is [0, L], then G(x, y) can computed analyticallyand Q is a tridiagonal matrix

Q =

b −a 0 · · · 0−a b −a · · · 00 −a b · · · 0...

......

. . ....

0 0 0 · · · b

,

• Log-likelihood becomes

`(βββ, a, b) = − ln |Q(a, b)−1| − (Y − Fβββ)ᵀQ(a, b)(Y − Fβββ)

Markovian Stochastic Kriging (MSK)

I With x being continuous, we assume M(x) is a Gaussian free field

I The domain of x must be specified

I G(x,y) := Cov[M(x),M(y)] is the solution to a PDE (heat equationwith Dirichlet boundary)• E.g., if the domain is [0, L], then G(x, y) can computed analyticallyand Q is a tridiagonal matrix

Q =

b −a 0 · · · 0−a b −a · · · 00 −a b · · · 0...

......

. . ....

0 0 0 · · · b

,

• Log-likelihood becomes

`(βββ, a, b) = − ln |Q(a, b)−1| − (Y − Fβββ)ᵀQ(a, b)(Y − Fβββ)

Example: M/M/1 Queue

0 20 40 60 80-1

0

1

2

3

4

5truthdataOSKMSK

0 20 40 60 80 100-2

0

2

4

6

8

10truthdataOSKMSK

I True surface η(x) = x/(100− x)

I Set f(x) ≡ 1 for MSK

I OSK does not perform well because the correlation parameter θ is hardto estimate (close to 0)

Conclusions

I Discussed several issues of the SK metamodel• Hard to specify the trend term f(x)• High computational complexity• Numerical instability

I Proposed three approached for enhancing SK• They address different issues but can be used in a combined way

I SESK performs very well if a good stylized model is available

I RSK provides moderate enhancement but its applicability is higherthan SESK

I MSK is very promising but the shape of the domain must be chosencarefully so that the PDE can be solved analytically

References I

B. Ankenman, B. L. Nelson, and J. Staum. Stochastic kriging forsimulation metamodeling. Oper. Res., 58(2):371–382, 2010.

X. Chen, B. Ankenman, and B. L. Nelson. Enhancing stochastic krigingmetamodels with gradient estimators. Oper. Res., 61(2):512–528, 2013.

R. C. Cheng and J. P. Kleijnen. Improved design of queueing simulationexperiments with highly heteroscedastic responses. Oper. Res., 47(5):762–777, 1999.

A. I. Forrester, A. Sobester, and A. J. Keane. Multi-fidelity optimizationvia surrogate modelling. In Proc. R. Soc. A, volume 463, pages3251–3269, 2007.

M. C. Kennedy and A. O’Hagan. Predicting the output from a complexcomputer code when fast approximations are available. Biometrika, 87(1):1–13, 2000.

R. Li and A. Sudjianto. Analysis of computer experiments using penalizedlikelihood in Gaussian kriging models. Technometrics, 47(2):111–120,2005.

G. Matheron. Principles of geostatistics. Econ. Geol., 58(8):1246–1266,1963.

References II

T. Mitchell, M. Morris, and D. Ylvisaker. Asymptotically optimumexperimental designs for prediction of deterministic functions givenderivative information. J. Stat. Plann. Infer., 41(3):377–389, 1994.

M. D. Morris, T. J. Mitchell, and D. Ylvisaker. Bayesian design andanalysis of computer experiments: Use of derivatives in surfaceprediction. Technometrics, 35(3):243–255, 1993.

H. Qu and M. C. Fu. Gradient extrapolated stochastic kriging. ACMTrans. Model. Comput. Simul., 24(4):23:1–23:25, 2014.

J. Sacks, S. B. Schiller, and W. J. Welch. Designs for computerexperiments. Technometrics, 31(1):41–47, 1989.

L. Sun, L. J. Hong, and Z. Hu. Balancing exploitation and exploration indiscrete optimization via simulation through a Gaussian process-basedsearch. Oper. Res., 62(6):1416–1438, 2014.

F. Yang, B. Ankenman, and B. L. Nelson. Efficient generation of cycletime-throughput curves through simulation and metamodeling. NavalRes. Logist., 54(1):78–93, 2007.

new approaches for enhancing simulation metamodeling · new approaches for enhancing simulation...

Documents