TRANSCRIPT
A Scalable Approach to Gradient-Enhanced
Stochastic Kriging
Haojun Huo†, Xiaowei Zhang∗, and Zeyu Zheng‡
† Hong Kong University of Science and Technology, IEDA; ∗ City University of Hong Kong, MS; ‡ UC Berkeley, IEOR
Table of Contents
1 Stochastic Kriging and Big n Problem
2 Markovian Covariance Functions
3 Scalable Gradient Extrapolated Stochastic Kriging
4 Conclusions
1
Stochastic Kriging and Big n Problem
Metamodeling
[Diagram: Real System → Simulation Model → Metamodel]
• Simulation models are often computationally expensive
• Metamodel: fast approximation of the simulation model
  • Run the simulation at a small number of design points
  • Predict responses based on the simulation outputs
2
Stochastic Kriging
• Also called Gaussian process (GP) regression
• Unknown surface is modeled as a Gaussian process:
  Z(x) = β + M(x), x ∈ X ⊆ ℝ^d
• M(x) is characterized by a covariance function k(x, y)
• Leverage spatial correlation for prediction
3
Partial Literature
• Quantification of input uncertainty
  • Barton, Nelson, and Xie (2014)
  • Xie, Nelson, and Barton (2014)
• Simulation/black-box/Bayesian optimization
  • Huang et al. (2006)
  • Sun, Hong, and Hu (2014)
  • Scott, Frazier, and Powell (2011)
  • Shahriari et al. (2016)
4
The Big n Problem
• Response surface is observed at x_1, ..., x_n with noise:
  z(x_i) = β + M(x_i) + ε(x_i)
• Best linear unbiased predictor (BLUP) of Z(x_0):
  Ẑ(x_0) = β + Σ_M(x_0, ·)[Σ_M + Σ_ε]⁻¹[z − β1_n]
• Maximum likelihood estimation:
  max_{β, θ} { −log det(Σ_M + Σ_ε) − [z − β1_n]⊺[Σ_M + Σ_ε]⁻¹[z − β1_n] }
• Slow: [Σ_M + Σ_ε] ∈ ℝ^{n×n}, and inverting it takes O(n³) time
• Numerically unstable: [Σ_M + Σ_ε] is often nearly singular
  • Especially for the popular Gaussian covariance function
  • Usually runs into trouble when n > 100, which can easily happen when d ≥ 3
5
Enhancing SK with Gradient Information
• The j-th run of the simulation model at x_i produces
  • a response estimate z_j(x_i)
  • a gradient estimate g_j(x_i) = (g_j^1(x_i), ..., g_j^d(x_i))⊺, with
    g_j^r(x_i) = G^r(x_i) + δ_j^r(x_i), r = 1, ..., d,
    where G^r(x_i) is the true r-th partial derivative
• Predict Z(x_0) using both response estimates and gradient estimates
• Qu and Fu (2014): gradient extrapolated stochastic kriging (GESK);
  simple, using gradients indirectly
• Chen, Ankenman, and Nelson (2013): stochastic kriging with gradient
  estimators (SKG); sophisticated, using gradients directly
6
GESK (Qu and Fu 2014)
• Use gradient estimates to create "pseudo" response estimates:
  z̃_j(x̃_i) ≈ z_j(x_i) + g_j(x_i)⊺Δx_i,
  where x̃_i = x_i + Δx_i
  • Δx_i: the direction and step size of the linear extrapolation (see the sketch below)
• Predict Z(x_0) using the augmented data
  (z(x_1), ..., z(x_n), z̃(x̃_1), ..., z̃(x̃_n))
• The size of the covariance matrix now becomes 2n × 2n
• One could create d pseudo response estimates at each x_i, resulting in
  inverting a matrix of size (d + 1)n × (d + 1)n
• Similar problem for SKG
7
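The extrapolation step above is easy to illustrate in code. The following is a minimal sketch, not the authors' implementation; the toy simulator `simulate` and the step size `delta` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(x):
    """Toy simulator: returns a noisy response and a noisy gradient estimate."""
    z = np.sum(x**2) + rng.normal(scale=0.1)           # response estimate z_j(x_i)
    g = 2.0 * x + rng.normal(scale=0.1, size=x.shape)  # gradient estimate g_j(x_i)
    return z, g

# Original design points (n points in d = 2 dimensions)
X = rng.uniform(0.0, 1.0, size=(20, 2))
delta = 0.05 * np.ones(2)   # hypothetical extrapolation direction/step size Δx_i

Z, Z_tilde, X_tilde = [], [], []
for x in X:
    z, g = simulate(x)
    Z.append(z)
    X_tilde.append(x + delta)        # pseudo design point  x~_i = x_i + Δx_i
    Z_tilde.append(z + g @ delta)    # pseudo response      z~_i ≈ z_i + g_i⊺ Δx_i

# Augmented data used by GESK: 2n points instead of n
X_aug = np.vstack([X, np.array(X_tilde)])
z_aug = np.concatenate([Z, Z_tilde])
print(X_aug.shape, z_aug.shape)   # (40, 2) (40,)
```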
Approximation Schemes
• Well developed in spatial statistics and machine learning
  • Banerjee et al. (2015)
  • Rasmussen and Williams (2006)
• Reduced-rank approximations emphasize long-range dependence
• Sparse approximations emphasize short-range dependence
Figure 1: Posterior means and variances. Source: Shahriari et al. (2016).
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
• M is multivariate normal with sparsity specified on Σ_M⁻¹
• A discrete model using a graph to describe the Markovian structure
  • Given all its neighbors, node i is conditionally independent of its non-neighbors
  • E.g., M(x_2) ⊥ (M(x_0), M(x_4)) given (M(x_1), M(x_3))
  • Σ_M⁻¹(i, j) ≠ 0 ⟺ i and j are neighbors
  [Chain graph: 0 — 1 — 2 — 3 — 4]
• The sparsity can reduce the necessary computation to O(n²)
9
Disadvantages
• No explicit expression for the covariances
• Cannot predict at locations "off the grid":
  Ẑ(x_0) = β + Σ_M(x_0, ·)[Σ_M + Σ_ε]⁻¹[z − β1_n],
  where Σ_M(x_0, ·) is unknown
10
Markovian Covariance Function Best of Two Worlds
• Construct a class of covariance functions for which
  1. Σ_M can be inverted analytically
  2. Σ_M⁻¹ is sparse
• Explicit link between covariance function and sparsity

Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y) − p(y)q(x) < 0 for all x < y. Then
k(x, y) = p(x)q(y) 1{x ≤ y} + p(y)q(x) 1{x > y} is called a 1-d MCF.

• Brownian motion: k_BM(x, y) = x 1{x ≤ y} + y 1{x > y}
• Brownian bridge: k_BR(x, y) = x(1 − y) 1{x ≤ y} + y(1 − x) 1{x > y}
• OU process: k_OU(x, y) = e^x e^{−y} 1{x ≤ y} + e^y e^{−x} 1{x > y}
  (a numerical check of the OU example follows this slide)
11
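As a quick sanity check of the definition, the sketch below builds K for the OU-type MCF at unequally spaced points and confirms numerically that K⁻¹ is tridiagonal. It is an illustrative snippet, not code from the paper.

```python
import numpy as np

def k_ou(x, y):
    """1-d Markovian covariance of the OU process: p(x) = e^x, q(x) = e^{-x}."""
    return np.where(x <= y, np.exp(x) * np.exp(-y), np.exp(y) * np.exp(-x))

# Unequally spaced design points
x = np.sort(np.random.default_rng(1).uniform(0.0, 2.0, size=8))
K = k_ou(x[:, None], x[None, :])

K_inv = np.linalg.inv(K)
# Entries more than one position off the diagonal should be (numerically) zero
off = np.abs(np.triu(K_inv, k=2))
print("max |off-tridiagonal entry| =", off.max())   # ~1e-12
```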
Markovian Covariance Function
• x_1, ..., x_n are not necessarily equally spaced

Theorem (Ding and Z. 2018)
Write p_i = p(x_i) and q_i = q(x_i). Then K⁻¹ is tridiagonal and its nonzero entries are

  (K⁻¹)_{ii} = p_2 / [p_1 (p_2 q_1 − p_1 q_2)]                                  if i = 1,
  (K⁻¹)_{ii} = (p_{i+1} q_{i−1} − p_{i−1} q_{i+1}) / [(p_i q_{i−1} − p_{i−1} q_i)(p_{i+1} q_i − p_i q_{i+1})]   if 2 ≤ i ≤ n − 1,
  (K⁻¹)_{ii} = q_{n−1} / [q_n (p_n q_{n−1} − p_{n−1} q_n)]                       if i = n,

and

  (K⁻¹)_{i−1,i} = (K⁻¹)_{i,i−1} = −1 / (p_i q_{i−1} − p_{i−1} q_i),   i = 2, ..., n.
12
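The theorem gives an analytic inverse, so K⁻¹ can be assembled directly without any numerical matrix inversion. A minimal sketch, assuming the OU choice p(x) = e^x, q(x) = e^{−x}:

```python
import numpy as np
from scipy.sparse import diags

def mcf_precision(x, p, q):
    """Assemble the tridiagonal K^{-1} from the closed-form entries of the theorem."""
    pv, qv = p(x), q(x)
    n = len(x)
    d = np.empty(n)
    d[0] = pv[1] / (pv[0] * (pv[1] * qv[0] - pv[0] * qv[1]))
    for i in range(1, n - 1):
        d[i] = (pv[i+1] * qv[i-1] - pv[i-1] * qv[i+1]) / (
            (pv[i] * qv[i-1] - pv[i-1] * qv[i]) * (pv[i+1] * qv[i] - pv[i] * qv[i+1]))
    d[n-1] = qv[n-2] / (qv[n-1] * (pv[n-1] * qv[n-2] - pv[n-2] * qv[n-1]))
    off = -1.0 / (pv[1:] * qv[:-1] - pv[:-1] * qv[1:])
    return diags([off, d, off], offsets=[-1, 0, 1]).tocsc()

x = np.sort(np.random.default_rng(2).uniform(0.0, 2.0, size=6))
Kinv = mcf_precision(x, np.exp, lambda t: np.exp(-t))
K = np.exp(-np.abs(x[:, None] - x[None, :]))   # OU covariance: e^x e^{-y} for x <= y
print(np.allclose(Kinv.toarray(), np.linalg.inv(K)))   # True
```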
Reduction in Complexity
• Woodbury matrix identity:
  [Σ_M + Σ_ε]⁻¹ = Σ_M⁻¹ − Σ_M⁻¹ [Σ_M⁻¹ + Σ_ε⁻¹]⁻¹ Σ_M⁻¹,
  where Σ_M⁻¹ is known in closed form, and both Σ_M⁻¹ and [Σ_M⁻¹ + Σ_ε⁻¹] are sparse
  • inversion: O(n²)
  • multiplications: O(n²)
  • addition: O(n²)
• It takes O(n²) time to compute the BLUP
  Ẑ(x_0) = β + Σ_M(x_0, ·)[Σ_M + Σ_ε]⁻¹[z − β1_n],
  where Σ_M(x_0, ·) is now known (see the sketch after this slide)
• If the noise is negligible (Σ_ε ≈ 0), then no numerical inversion is
  needed and computing the BLUP is O(n)
13
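A minimal sketch of the prediction step above, assuming a 1-d OU MCF, homoscedastic noise, and the hypothetical `mcf_precision` helper from the previous snippet; it is illustrative only, not the paper's implementation.

```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve

def blup(x0, x, z, beta, noise_var, p=np.exp, q=lambda t: np.exp(-t)):
    """BLUP Z_hat(x0) = beta + k(x0, x) [K + Sigma_eps]^{-1} (z - beta).

    Uses the Woodbury identity so only the sparse matrix
    K^{-1} + Sigma_eps^{-1} is ever factorized.
    """
    Kinv = mcf_precision(x, p, q)                        # sparse tridiagonal, closed form
    Sig_eps_inv = identity(len(x), format="csc") / noise_var
    r = z - beta
    # [K + Sigma_eps]^{-1} r = K^{-1} r - K^{-1} (K^{-1} + Sigma_eps^{-1})^{-1} K^{-1} r
    w = Kinv @ r
    alpha = w - Kinv @ spsolve(Kinv + Sig_eps_inv, w)
    k0 = np.where(x0 <= x, p(x0) * q(x), p(x) * q(x0))   # k(x0, x_i), available off the grid
    return beta + k0 @ alpha
```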
Improvement in Stability
1. Σ_M can be made much better conditioned
2. Woodbury also improves numerical stability:
   [Σ_M + Σ_ε]⁻¹ = Σ_M⁻¹ − Σ_M⁻¹ [Σ_M⁻¹ + Σ_ε⁻¹]⁻¹ Σ_M⁻¹
   • The diagonal entries of Σ_ε⁻¹ are often large
14
Uncertainty Quantification
15
Extension for d gt 1
• Product form: k(x, y) = ∏_{i=1}^d k_i(x^i, y^i)
• Limitation: x_1, ..., x_n must form a regular lattice
• Then K = ⊗_{i=1}^d K_i and K⁻¹ = ⊗_{i=1}^d K_i⁻¹, preserving sparsity
  (see the sketch after this slide)
  [Example: 3 × 3 lattice of design points (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)]
16
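A small sketch of the Kronecker-product structure for a 2-d lattice, reusing the hypothetical `mcf_precision` helper from above; the full precision matrix is assembled from the two small per-dimension factors.

```python
import numpy as np
from scipy.sparse import kron

# Per-dimension grids (need not be equally spaced) forming a regular lattice
x1 = np.sort(np.random.default_rng(3).uniform(0.0, 1.0, size=30))
x2 = np.sort(np.random.default_rng(4).uniform(0.0, 1.0, size=40))

K1_inv = mcf_precision(x1, np.exp, lambda t: np.exp(-t))   # 30 x 30, tridiagonal
K2_inv = mcf_precision(x2, np.exp, lambda t: np.exp(-t))   # 40 x 40, tridiagonal

# K^{-1} = K1^{-1} ⊗ K2^{-1} for the n = 30 * 40 = 1200 lattice points
K_inv = kron(K1_inv, K2_inv, format="csc")
print(K_inv.shape, K_inv.nnz)   # (1200, 1200), still sparse
```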
Two-Dimensional Response Surfaces
Function Name       Expression
Three-Hump Camel    Z(x, y) = 2x² − 1.05x⁴ + x⁶/6 + xy + y²
Bohachevsky         Z(x, y) = x² + 2y² − 0.3 cos(3πx) − 0.4 cos(4πy) + 0.7
17
Prediction Accuracy
• Standardized RMSE over K check points x_1, ..., x_K:
  SRMSE = sqrt( Σ_{i=1}^K [Ẑ(x_i) − Z(x_i)]² ) / sqrt( Σ_{i=1}^K [Z(x_i) − Z̄]² ),
  where Z̄ = K⁻¹ Σ_{h=1}^K Z(x_h)
  (computed as in the snippet below)
18
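For concreteness, the standardized RMSE above can be computed as follows; this is illustrative, with `z_hat` and `z_true` as hypothetical arrays of predictions and true values at the check points.

```python
import numpy as np

def standardized_rmse(z_hat, z_true):
    """RMSE of the predictor divided by the RMSE of the constant (sample-mean) predictor."""
    num = np.sqrt(np.sum((z_hat - z_true) ** 2))
    den = np.sqrt(np.sum((z_true - z_true.mean()) ** 2))
    return num / den
```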
Condition Number of Σ_M + Σ_ε
• C = λ_max(K)/λ_min(K) measures "closeness to singularity"
19
Scalability Demonstration
• 4-d Griewank function (see the sketch below):
  Z(x) = Σ_{i=1}^4 (x^(i)/20)² − 10 ∏_{i=1}^4 cos(x^(i)/√i) + 10
• Mean cycle time of an N-station Jackson network with D different
  types of arrivals (Yang et al. 2011), N = D = 4:
  E[CT_1] = Σ_{j=1}^N (δ_{1j}/μ_j) · [1 − ρ (Σ_{i=1}^D α_i δ_{ij}/μ_j) / (max_h Σ_{i=1}^D α_i δ_{ih}/μ_h)]⁻¹
20
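A sketch of the scaled 4-d Griewank test function as reconstructed above (the scaling constants are taken from the slide and may differ from the standard Griewank definition):

```python
import numpy as np

def griewank_4d(x):
    """Scaled 4-d Griewank surface: sum_i (x_i/20)^2 - 10 * prod_i cos(x_i / sqrt(i)) + 10."""
    x = np.asarray(x, dtype=float)
    i = np.arange(1, 5)
    return np.sum((x / 20.0) ** 2) - 10.0 * np.prod(np.cos(x / np.sqrt(i))) + 10.0

print(griewank_4d([0.0, 0.0, 0.0, 0.0]))   # 0.0 at the global minimum
```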
Computational Efficiency
21
Scalable Gradient Extrapolated Stochastic Kriging
Enhancing Scalability of GESK with MCFs
• GESK creates an augmented set of response estimates for SK
• MCFs can be applied if the design points form a regular lattice of
  size n = n_1 × n_2 × ··· × n_d
• This results in 2^d n points in the augmented dataset
• Σ_M then has size 2^d n × 2^d n, but we can leverage the Kronecker product
  to reduce its inversion to inverting d much smaller matrices, each
  of size 2n_r × 2n_r
22
Numerical Illustration
[Figure: EIMSE comparison of SK vs. GESK for n = 5⁴ = 625, n = 8⁴ = 4096, and n = 10⁴ = 10000 design points; y-axis from 0 to 0.08]
• 4-dimensional Griewank function
• Can manage n = 10⁴ design points
23
Conclusions
Remarks on MCFs
• Allow modeling association directly while retaining sparsity in the precision matrix
• Improve the scalability of SK so that it can be used for simulation
  models with a high-dimensional design space
• Reduce computational cost from O(n³) to O(n²) without approximation
  • Further reduce to O(n) if observations are noise-free
• Enhance numerical stability substantially
• Limitation: design points must form a regular lattice, though not
  necessarily equally spaced
24
Remarks on Gradient Enhanced SK
• GESK (Qu and Fu 2014) can easily benefit from MCFs
• But there are two issues:
  • Extrapolation error is hard to characterize
  • Each design point needs (2^d − 1) pseudo response estimates: a great
    deal of redundancy in using gradient info
• SKG (Chen, Ankenman, and Nelson 2013) does not incur such
  computational overhead, but requires calculating the gradient
  surface of the Gaussian process (ongoing work)
25
Markovian covariances without approximation
vs.
Good approximation for all covariances
25
Table of Contents
1 Stochastic Kriging and Big n Problem
2 Markovian Covariance Functions
3 Scalable Gradient Extrapolated Stochastic Kriging
4 Conclusions
1
Stochastic Kriging and Big n
Problem
Metamodeling
SimModel
RealSystem
Meta-model
bull Simulation models are often computationally expensive
bull Metamodel fast approximation of simulation model
bull Run simulation at a small number of design points
bull Predict responses based on the simulation outputs
2
Stochastic Kriging
bull Also called Gaussian process (GP) regression
bull Unknown surface is modeled as a Gaussian process
Z(x) = β +M(x) x isin X sube Rd
bull M(x) is characterized by covariance function k(x y)bull Leverage spatial correlation for prediction
3
Partial Literature
bull Quantification of input uncertainty
bull Barton Nelson and Xie (2014)
bull Xie Nelson and Barton (2014)
bull Simulationblack-boxBayesian optimization
bull Huang et al (2006)
bull Sun Hong and Hu (2014)
bull Scott Frazier and Powell (2011)
bull Shahriari et al (2016)
4
The Big n Problem
bull Response surface is observed at x1 xn with noise
z(xi ) = β +M(xi ) + ε(xi )
bull Best linear unbiased predictor of Z(x0)
983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull Maximum likelihood estimation
maxβθθθ
983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]
⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052
bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time
bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular
bull Especially for the popular Gaussian covariance function
bull Usually run into trouble when n gt 100 which can easily happen
when d ge 3
5
Enhancing SK with Gradient Information
bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1
j (xi ) gdj (xi ))
⊺
g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d
where G r (xi ) is the true r -th partial derivative
bull Predict Z(x0) using both response estimates and gradient estimates
bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)
simple using gradients indirectly
bull Chen Ankenman and Nelson (2013) stochastic kriging with
gradient estimators (SKG) sophisticated using gradients directly
6
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
bull M is multivariate normal with sparsity specified on ΣΣΣminus1M
bull A discrete model using graph to describe Markovian structure
bull Given all its neighbors node i is conditionally independent of its
non-neighbors
bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1
M (i j) ∕= 0 lArrrArr i and j are neighbors
0 1 2 3 4
bull The sparsity can reduce necessary computation to O(n2)
9
Disadvantages
bull Has no explicit expression for the covariances
bull Cannot predict locations ldquooff the gridrdquo
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
10
Markovian Covariance Function Best of Two Worlds
bull Construct a class of covariance functions for which
1 ΣΣΣM can be inverted analytically
2 ΣΣΣminus1M is sparse
bull Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then
k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF
bull Brownian motion kBM(x y) = x Ixley +y Ixgty
bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty
bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty
11
Markovian Covariance Function
bull x1 xn are not necessarily equally spaced
Theorem (Ding and Z 2018)
Kminus1 is tridiagonal and its nonzero entries are
(Kminus1)ii =
983099983105983105983105983105983105983105983103
983105983105983105983105983105983105983101
p2p1(p2q1 minus p1q2)
if i = 1
pi+1qiminus1 minus piminus1qi+1
(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1
qnminus1
qn(pnqnminus1 minus pnminus1qn) if i = n
and
(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1
piqiminus1 minus piminus1qi i = 2 n
12
Reduction in Complexity
bull Woodbury matrix identity
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M983167983166983165983168known
+ ΣΣΣminus1M983167983166983165983168
sparse
983147ΣΣΣminus1
M +ΣΣΣminus1ε983167 983166983165 983168
sparse
983148minus1
ΣΣΣminus1M
bull inversion O(n2)
bull multiplications O(n2)
bull addition O(n2)
bull It takes O(n2) time to compute BLUP
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is
needed and computing BLUP is O(n)
13
Improvement in Stability
1 ΣΣΣM can be made much better conditioned
2 Woodbury also improves numerical stability
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M +ΣΣΣminus1M
983147ΣΣΣminus1
M +ΣΣΣminus1ε
983148minus1
ΣΣΣminus1M
bull The diagonal entries of ΣΣΣminus1ε are often large
14
Uncertainty Quantification
15
Extension for d gt 1
bull Product form k(x y) =983124d
i=1 ki (xi y i )
bull Limitation x1 xn must form a regular lattice
bull Then K =983121d
i=1 Ki and Kminus1 =983121d
i=1 Kminus1i preserving sparsity
(00)
(01)
(02)
(10)
(11)
(12)
(20)
(21)
(22)
16
Two-Dimensional Response Surfaces
Function Name Expression
Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6
6+ xy + y 2
Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07
17
Prediction Accuracy
bull Standardized RMSE =
983155983123K
i=1[Z(xi )minusZ(xi )]2
raquo983123K
i=1[Z(xi )minusKminus1983123K
h=1Z(xh)]
2
18
Condition Number of ΣΣΣM +ΣΣΣε
bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo
19
Scalability Demonstration
bull 4-d Griewank func Z(x) =9831234
i=1
Aumlx (i)
20
auml2minus 10
983124Di=1 cos
Aumlx (i)radici
auml+ 10
bull Mean cycle time of a N-station Jackson network with D different
types of arrivals (Yang et al 2011) N = D = 4
E[CT1] =N983131
j=1
δ1j
microj
iuml1minus ρ
Aring 983123D
i=1αiδijmicroj
maxh983123D
i=1αiδihmicroh
atildeograve
20
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Stochastic Kriging and Big n
Problem
Metamodeling
SimModel
RealSystem
Meta-model
bull Simulation models are often computationally expensive
bull Metamodel fast approximation of simulation model
bull Run simulation at a small number of design points
bull Predict responses based on the simulation outputs
2
Stochastic Kriging
bull Also called Gaussian process (GP) regression
bull Unknown surface is modeled as a Gaussian process
Z(x) = β +M(x) x isin X sube Rd
bull M(x) is characterized by covariance function k(x y)bull Leverage spatial correlation for prediction
3
Partial Literature
bull Quantification of input uncertainty
bull Barton Nelson and Xie (2014)
bull Xie Nelson and Barton (2014)
bull Simulationblack-boxBayesian optimization
bull Huang et al (2006)
bull Sun Hong and Hu (2014)
bull Scott Frazier and Powell (2011)
bull Shahriari et al (2016)
4
The Big n Problem
bull Response surface is observed at x1 xn with noise
z(xi ) = β +M(xi ) + ε(xi )
bull Best linear unbiased predictor of Z(x0)
983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull Maximum likelihood estimation
maxβθθθ
983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]
⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052
bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time
bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular
bull Especially for the popular Gaussian covariance function
bull Usually run into trouble when n gt 100 which can easily happen
when d ge 3
5
Enhancing SK with Gradient Information
bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1
j (xi ) gdj (xi ))
⊺
g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d
where G r (xi ) is the true r -th partial derivative
bull Predict Z(x0) using both response estimates and gradient estimates
bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)
simple using gradients indirectly
bull Chen Ankenman and Nelson (2013) stochastic kriging with
gradient estimators (SKG) sophisticated using gradients directly
6
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
bull M is multivariate normal with sparsity specified on ΣΣΣminus1M
bull A discrete model using graph to describe Markovian structure
bull Given all its neighbors node i is conditionally independent of its
non-neighbors
bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1
M (i j) ∕= 0 lArrrArr i and j are neighbors
0 1 2 3 4
bull The sparsity can reduce necessary computation to O(n2)
9
Disadvantages
bull Has no explicit expression for the covariances
bull Cannot predict locations ldquooff the gridrdquo
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
10
Markovian Covariance Function Best of Two Worlds
bull Construct a class of covariance functions for which
1 ΣΣΣM can be inverted analytically
2 ΣΣΣminus1M is sparse
bull Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then
k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF
bull Brownian motion kBM(x y) = x Ixley +y Ixgty
bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty
bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty
11
Markovian Covariance Function
bull x1 xn are not necessarily equally spaced
Theorem (Ding and Z 2018)
Kminus1 is tridiagonal and its nonzero entries are
(Kminus1)ii =
983099983105983105983105983105983105983105983103
983105983105983105983105983105983105983101
p2p1(p2q1 minus p1q2)
if i = 1
pi+1qiminus1 minus piminus1qi+1
(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1
qnminus1
qn(pnqnminus1 minus pnminus1qn) if i = n
and
(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1
piqiminus1 minus piminus1qi i = 2 n
12
Reduction in Complexity
bull Woodbury matrix identity
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M983167983166983165983168known
+ ΣΣΣminus1M983167983166983165983168
sparse
983147ΣΣΣminus1
M +ΣΣΣminus1ε983167 983166983165 983168
sparse
983148minus1
ΣΣΣminus1M
bull inversion O(n2)
bull multiplications O(n2)
bull addition O(n2)
bull It takes O(n2) time to compute BLUP
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is
needed and computing BLUP is O(n)
13
Improvement in Stability
1 ΣΣΣM can be made much better conditioned
2 Woodbury also improves numerical stability
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M +ΣΣΣminus1M
983147ΣΣΣminus1
M +ΣΣΣminus1ε
983148minus1
ΣΣΣminus1M
bull The diagonal entries of ΣΣΣminus1ε are often large
14
Uncertainty Quantification
15
Extension for d gt 1
bull Product form k(x y) =983124d
i=1 ki (xi y i )
bull Limitation x1 xn must form a regular lattice
bull Then K =983121d
i=1 Ki and Kminus1 =983121d
i=1 Kminus1i preserving sparsity
(00)
(01)
(02)
(10)
(11)
(12)
(20)
(21)
(22)
16
Two-Dimensional Response Surfaces
Function Name Expression
Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6
6+ xy + y 2
Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07
17
Prediction Accuracy
bull Standardized RMSE =
983155983123K
i=1[Z(xi )minusZ(xi )]2
raquo983123K
i=1[Z(xi )minusKminus1983123K
h=1Z(xh)]
2
18
Condition Number of ΣΣΣM +ΣΣΣε
bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo
19
Scalability Demonstration
bull 4-d Griewank func Z(x) =9831234
i=1
Aumlx (i)
20
auml2minus 10
983124Di=1 cos
Aumlx (i)radici
auml+ 10
bull Mean cycle time of a N-station Jackson network with D different
types of arrivals (Yang et al 2011) N = D = 4
E[CT1] =N983131
j=1
δ1j
microj
iuml1minus ρ
Aring 983123D
i=1αiδijmicroj
maxh983123D
i=1αiδihmicroh
atildeograve
20
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Metamodeling
SimModel
RealSystem
Meta-model
bull Simulation models are often computationally expensive
bull Metamodel fast approximation of simulation model
bull Run simulation at a small number of design points
bull Predict responses based on the simulation outputs
2
Stochastic Kriging
bull Also called Gaussian process (GP) regression
bull Unknown surface is modeled as a Gaussian process
Z(x) = β +M(x) x isin X sube Rd
bull M(x) is characterized by covariance function k(x y)bull Leverage spatial correlation for prediction
3
Partial Literature
bull Quantification of input uncertainty
bull Barton Nelson and Xie (2014)
bull Xie Nelson and Barton (2014)
bull Simulationblack-boxBayesian optimization
bull Huang et al (2006)
bull Sun Hong and Hu (2014)
bull Scott Frazier and Powell (2011)
bull Shahriari et al (2016)
4
The Big n Problem
bull Response surface is observed at x1 xn with noise
z(xi ) = β +M(xi ) + ε(xi )
bull Best linear unbiased predictor of Z(x0)
983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull Maximum likelihood estimation
maxβθθθ
983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]
⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052
bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time
bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular
bull Especially for the popular Gaussian covariance function
bull Usually run into trouble when n gt 100 which can easily happen
when d ge 3
5
Enhancing SK with Gradient Information
bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1
j (xi ) gdj (xi ))
⊺
g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d
where G r (xi ) is the true r -th partial derivative
bull Predict Z(x0) using both response estimates and gradient estimates
bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)
simple using gradients indirectly
bull Chen Ankenman and Nelson (2013) stochastic kriging with
gradient estimators (SKG) sophisticated using gradients directly
6
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
bull M is multivariate normal with sparsity specified on ΣΣΣminus1M
bull A discrete model using graph to describe Markovian structure
bull Given all its neighbors node i is conditionally independent of its
non-neighbors
bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1
M (i j) ∕= 0 lArrrArr i and j are neighbors
0 1 2 3 4
bull The sparsity can reduce necessary computation to O(n2)
9
Disadvantages
bull Has no explicit expression for the covariances
bull Cannot predict locations ldquooff the gridrdquo
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
10
Markovian Covariance Function Best of Two Worlds
bull Construct a class of covariance functions for which
1 ΣΣΣM can be inverted analytically
2 ΣΣΣminus1M is sparse
bull Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then
k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF
bull Brownian motion kBM(x y) = x Ixley +y Ixgty
bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty
bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty
11
Markovian Covariance Function
bull x1 xn are not necessarily equally spaced
Theorem (Ding and Z 2018)
Kminus1 is tridiagonal and its nonzero entries are
(Kminus1)ii =
983099983105983105983105983105983105983105983103
983105983105983105983105983105983105983101
p2p1(p2q1 minus p1q2)
if i = 1
pi+1qiminus1 minus piminus1qi+1
(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1
qnminus1
qn(pnqnminus1 minus pnminus1qn) if i = n
and
(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1
piqiminus1 minus piminus1qi i = 2 n
12
Reduction in Complexity
bull Woodbury matrix identity
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M983167983166983165983168known
+ ΣΣΣminus1M983167983166983165983168
sparse
983147ΣΣΣminus1
M +ΣΣΣminus1ε983167 983166983165 983168
sparse
983148minus1
ΣΣΣminus1M
bull inversion O(n2)
bull multiplications O(n2)
bull addition O(n2)
bull It takes O(n2) time to compute BLUP
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is
needed and computing BLUP is O(n)
13
Improvement in Stability
1 ΣΣΣM can be made much better conditioned
2 Woodbury also improves numerical stability
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M +ΣΣΣminus1M
983147ΣΣΣminus1
M +ΣΣΣminus1ε
983148minus1
ΣΣΣminus1M
bull The diagonal entries of ΣΣΣminus1ε are often large
14
Uncertainty Quantification
15
Extension for d gt 1
bull Product form k(x y) =983124d
i=1 ki (xi y i )
bull Limitation x1 xn must form a regular lattice
bull Then K =983121d
i=1 Ki and Kminus1 =983121d
i=1 Kminus1i preserving sparsity
(00)
(01)
(02)
(10)
(11)
(12)
(20)
(21)
(22)
16
Two-Dimensional Response Surfaces
Function Name Expression
Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6
6+ xy + y 2
Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07
17
Prediction Accuracy
bull Standardized RMSE =
983155983123K
i=1[Z(xi )minusZ(xi )]2
raquo983123K
i=1[Z(xi )minusKminus1983123K
h=1Z(xh)]
2
18
Condition Number of ΣΣΣM +ΣΣΣε
bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo
19
Scalability Demonstration
bull 4-d Griewank func Z(x) =9831234
i=1
Aumlx (i)
20
auml2minus 10
983124Di=1 cos
Aumlx (i)radici
auml+ 10
bull Mean cycle time of a N-station Jackson network with D different
types of arrivals (Yang et al 2011) N = D = 4
E[CT1] =N983131
j=1
δ1j
microj
iuml1minus ρ
Aring 983123D
i=1αiδijmicroj
maxh983123D
i=1αiδihmicroh
atildeograve
20
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Stochastic Kriging
bull Also called Gaussian process (GP) regression
bull Unknown surface is modeled as a Gaussian process
Z(x) = β +M(x) x isin X sube Rd
bull M(x) is characterized by covariance function k(x y)bull Leverage spatial correlation for prediction
3
Partial Literature
bull Quantification of input uncertainty
bull Barton Nelson and Xie (2014)
bull Xie Nelson and Barton (2014)
bull Simulationblack-boxBayesian optimization
bull Huang et al (2006)
bull Sun Hong and Hu (2014)
bull Scott Frazier and Powell (2011)
bull Shahriari et al (2016)
4
The Big n Problem
bull Response surface is observed at x1 xn with noise
z(xi ) = β +M(xi ) + ε(xi )
bull Best linear unbiased predictor of Z(x0)
983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull Maximum likelihood estimation
maxβθθθ
983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]
⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052
bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time
bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular
bull Especially for the popular Gaussian covariance function
bull Usually run into trouble when n gt 100 which can easily happen
when d ge 3
5
Enhancing SK with Gradient Information
bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1
j (xi ) gdj (xi ))
⊺
g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d
where G r (xi ) is the true r -th partial derivative
bull Predict Z(x0) using both response estimates and gradient estimates
bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)
simple using gradients indirectly
bull Chen Ankenman and Nelson (2013) stochastic kriging with
gradient estimators (SKG) sophisticated using gradients directly
6
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
Figure 1: Posterior means and variances. Source: Shahriari et al. (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
• M is multivariate normal with sparsity specified on $\Sigma_M^{-1}$
• A discrete model that uses a graph to describe the Markovian structure
• Given all its neighbors, node i is conditionally independent of its non-neighbors
• E.g., $M(x_2) \perp (M(x_0), M(x_4))$ given $(M(x_1), M(x_3))$
• $\Sigma_M^{-1}(i, j) \neq 0 \iff$ i and j are neighbors (illustrated in the sketch below)
[Chain graph: 0 - 1 - 2 - 3 - 4]
• The sparsity can reduce the necessary computation to $O(n^2)$
9
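To make the neighbor structure concrete, here is a minimal numpy sketch (the chain and the numerical values are made up for illustration): the precision matrix of the chain 0-1-2-3-4 is tridiagonal, while the implied covariance matrix is dense, which is exactly the trade-off the next slide points to.

```python
import numpy as np

# Precision matrix Q = Sigma_M^{-1} for the chain 0 - 1 - 2 - 3 - 4:
# off-diagonal entries are nonzero only for neighboring nodes.
n = 5
Q = np.zeros((n, n))
np.fill_diagonal(Q, 2.0)                 # arbitrary positive diagonal
for i in range(n - 1):
    Q[i, i + 1] = Q[i + 1, i] = -0.9     # neighbors (i, i+1)

Sigma = np.linalg.inv(Q)                 # implied covariance of (M(x_0), ..., M(x_4))

print((Q != 0).astype(int))              # sparse: tridiagonal zero pattern
print(np.round(Sigma, 3))                # dense: covariances have no simple closed form
```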
Disadvantages
• No explicit expression for the covariances
• Cannot predict at locations "off the grid"
$$\widehat{Z}(x_0) = \beta + \underbrace{\Sigma_M(x_0, \cdot)}_{\text{unknown}}\,[\Sigma_M + \Sigma_\varepsilon]^{-1}[z - \beta 1_n]$$
10
Markovian Covariance Function: Best of Two Worlds
• Construct a class of covariance functions for which
  1. $\Sigma_M$ can be inverted analytically
  2. $\Sigma_M^{-1}$ is sparse
• Explicit link between covariance function and sparsity

Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy $p(x)q(y) - p(y)q(x) < 0$ for all $x < y$. Then $k(x, y) = p(x)q(y)\,\mathbb{I}_{x \le y} + p(y)q(x)\,\mathbb{I}_{x > y}$ is called a 1-d MCF.

• Brownian motion: $k_{\mathrm{BM}}(x, y) = x\,\mathbb{I}_{x \le y} + y\,\mathbb{I}_{x > y}$
• Brownian bridge: $k_{\mathrm{BR}}(x, y) = x(1-y)\,\mathbb{I}_{x \le y} + y(1-x)\,\mathbb{I}_{x > y}$
• OU process: $k_{\mathrm{OU}}(x, y) = e^{x}e^{-y}\,\mathbb{I}_{x \le y} + e^{y}e^{-x}\,\mathbb{I}_{x > y}$ (checked numerically in the sketch below)
11
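A quick sketch, assuming numpy and an arbitrary grid in (0, 1), that encodes the three examples as (p, q) pairs and verifies the defining condition and the positive definiteness of the resulting covariance matrix numerically.

```python
import numpy as np

mcfs = {
    "Brownian motion": (lambda t: t,         lambda t: np.ones_like(t)),
    "Brownian bridge": (lambda t: t,         lambda t: 1.0 - t),
    "OU process":      (lambda t: np.exp(t), lambda t: np.exp(-t)),
}

x = np.linspace(0.05, 0.95, 12)                        # any grid in (0, 1) works here
xs, ys = np.meshgrid(x, x, indexing="ij")
for name, (p, q) in mcfs.items():
    cond = (p(xs) * q(ys) - p(ys) * q(xs))[xs < ys]    # p(x)q(y) - p(y)q(x) for x < y
    K = p(np.minimum(xs, ys)) * q(np.maximum(xs, ys))  # K[i, j] = k(x_i, x_j)
    pd = np.all(np.linalg.eigvalsh(K) > 0)
    print(name, "| MCF condition holds:", np.all(cond < 0), "| K positive definite:", pd)
```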
Markovian Covariance Function
• $x_1, \dots, x_n$ are not necessarily equally spaced

Theorem (Ding and Z. 2018)
$K^{-1}$ is tridiagonal and its nonzero entries are
$$(K^{-1})_{ii} = \begin{cases} \dfrac{p_2}{p_1(p_2 q_1 - p_1 q_2)}, & \text{if } i = 1, \\[6pt] \dfrac{p_{i+1} q_{i-1} - p_{i-1} q_{i+1}}{(p_i q_{i-1} - p_{i-1} q_i)(p_{i+1} q_i - p_i q_{i+1})}, & \text{if } 2 \le i \le n-1, \\[6pt] \dfrac{q_{n-1}}{q_n(p_n q_{n-1} - p_{n-1} q_n)}, & \text{if } i = n, \end{cases}$$
and
$$(K^{-1})_{i-1,i} = (K^{-1})_{i,i-1} = \frac{-1}{p_i q_{i-1} - p_{i-1} q_i}, \qquad i = 2, \dots, n,$$
where $p_i = p(x_i)$ and $q_i = q(x_i)$.
12
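A minimal numerical check of the theorem, assuming numpy; the OU covariance and the unequally spaced grid are arbitrary choices. It assembles $K^{-1}$ from the closed-form tridiagonal entries and compares it with the dense inverse.

```python
import numpy as np

def mcf_precision(x, p, q):
    """Tridiagonal K^{-1} assembled from the closed-form entries, with p_i = p(x_i), q_i = q(x_i)."""
    pv, qv = p(x), q(x)
    n = len(x)
    Kinv = np.zeros((n, n))
    d = pv[1:] * qv[:-1] - pv[:-1] * qv[1:]            # d[i-2] = p_i q_{i-1} - p_{i-1} q_i (1-based i)
    Kinv[0, 0] = pv[1] / (pv[0] * d[0])
    Kinv[n - 1, n - 1] = qv[n - 2] / (qv[n - 1] * d[n - 2])
    for i in range(1, n - 1):                          # interior diagonal entries
        Kinv[i, i] = (pv[i + 1] * qv[i - 1] - pv[i - 1] * qv[i + 1]) / (d[i - 1] * d[i])
    for i in range(1, n):                              # sub- and super-diagonal entries
        Kinv[i - 1, i] = Kinv[i, i - 1] = -1.0 / d[i - 1]
    return Kinv

x = np.array([0.1, 0.35, 0.4, 0.7, 1.3, 2.0])          # unequally spaced design points
p, q = (lambda t: np.exp(t)), (lambda t: np.exp(-t))   # OU covariance
K = p(np.minimum.outer(x, x)) * q(np.maximum.outer(x, x))
print(np.allclose(mcf_precision(x, p, q), np.linalg.inv(K)))   # True
```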
Reduction in Complexity
• Woodbury matrix identity:
$$[\Sigma_M + \Sigma_\varepsilon]^{-1} = \underbrace{\Sigma_M^{-1}}_{\text{known}} - \underbrace{\Sigma_M^{-1}}_{\text{sparse}}\,\Big[\underbrace{\Sigma_M^{-1} + \Sigma_\varepsilon^{-1}}_{\text{sparse}}\Big]^{-1}\Sigma_M^{-1}$$
  • inversion: $O(n^2)$
  • multiplications: $O(n^2)$
  • addition: $O(n^2)$
• It takes $O(n^2)$ time to compute the BLUP (see the sketch below)
$$\widehat{Z}(x_0) = \beta + \underbrace{\Sigma_M(x_0, \cdot)}_{\text{known}}\,[\Sigma_M + \Sigma_\varepsilon]^{-1}[z - \beta 1_n]$$
• If the noise is negligible ($\Sigma_\varepsilon \approx 0$), then no numerical inversion is needed and computing the BLUP is $O(n)$
13
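A sketch of the BLUP computed through the identity above, assuming numpy and reusing the mcf_precision helper from the previous sketch; the design, the noise levels, and the simulated responses are synthetic placeholders. Only the analytically known $\Sigma_M^{-1}$, the diagonal $\Sigma_\varepsilon^{-1}$, and one solve with their sparse sum are needed.

```python
import numpy as np

# mcf_precision(x, p, q) as defined in the sketch following the theorem above.
rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0.0, 2.0, n))                  # unequally spaced design points
p, q = (lambda t: np.exp(t)), (lambda t: np.exp(-t))   # OU-type MCF
Sigma_M = p(np.minimum.outer(x, x)) * q(np.maximum.outer(x, x))
Q_M = mcf_precision(x, p, q)                           # tridiagonal Sigma_M^{-1}, known analytically
noise_var = 0.05 * np.ones(n)                          # simulation noise variances (illustrative)
beta = 1.0
z = beta + rng.multivariate_normal(np.zeros(n), Sigma_M) + rng.normal(0.0, np.sqrt(noise_var))

# Woodbury: [Sigma_M + Sigma_eps]^{-1} r = Q_M r - Q_M (Q_M + Q_eps)^{-1} Q_M r,
# where Q_eps = Sigma_eps^{-1} is diagonal, so (Q_M + Q_eps) stays tridiagonal in 1-d
# (a banded solver such as scipy.linalg.solve_banded would exploit this; a dense solve is used here).
Q_eps = np.diag(1.0 / noise_var)
r = z - beta
w = Q_M @ r - Q_M @ np.linalg.solve(Q_M + Q_eps, Q_M @ r)

x0 = 1.234                                             # prediction point off the design grid
k0 = p(np.minimum(x0, x)) * q(np.maximum(x0, x))       # Sigma_M(x0, .), known in closed form
z_hat = beta + k0 @ w                                  # BLUP at x0
print(z_hat)
```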
Improvement in Stability
1. $\Sigma_M$ can be made much better conditioned
2. The Woodbury identity also improves numerical stability:
$$[\Sigma_M + \Sigma_\varepsilon]^{-1} = \Sigma_M^{-1} - \Sigma_M^{-1}\Big[\Sigma_M^{-1} + \Sigma_\varepsilon^{-1}\Big]^{-1}\Sigma_M^{-1}$$
• The diagonal entries of $\Sigma_\varepsilon^{-1}$ are often large
14
Uncertainty Quantification
15
Extension for d > 1
• Product form: $k(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{d} k_i\big(x^{(i)}, y^{(i)}\big)$
• Limitation: $x_1, \dots, x_n$ must form a regular lattice
• Then $K = \bigotimes_{i=1}^{d} K_i$ and $K^{-1} = \bigotimes_{i=1}^{d} K_i^{-1}$, preserving sparsity (see the sketch below)
[Figure: a 3 × 3 regular lattice with nodes (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)]
16
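A sketch of the d = 2 case on a 3 x 3 lattice, assuming numpy: the covariance over the lattice is the Kronecker product of two 1-d MCF matrices, and its inverse is the Kronecker product of the tridiagonal 1-d inverses, so the sparsity survives the extension.

```python
import numpy as np

def mcf_matrix(x, p, q):
    """K[i, j] = k(x_i, x_j) for a 1-d MCF given by (p, q)."""
    return p(np.minimum.outer(x, x)) * q(np.maximum.outer(x, x))

# 1-d grids for the two coordinates (a 3 x 3 lattice) and a Brownian-motion MCF per coordinate.
g1 = np.array([1.0, 2.0, 3.5])
g2 = np.array([0.5, 1.5, 2.0])
p, q = (lambda t: t), (lambda t: np.ones_like(t))

K1, K2 = mcf_matrix(g1, p, q), mcf_matrix(g2, p, q)
K = np.kron(K1, K2)                                    # 9 x 9 covariance over the lattice
Kinv = np.kron(np.linalg.inv(K1), np.linalg.inv(K2))   # Kronecker of tridiagonal inverses

print(np.allclose(Kinv, np.linalg.inv(K)))             # True
print(np.count_nonzero(np.round(Kinv, 10)), "nonzeros out of", K.size)   # 49 out of 81
```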
Two-Dimensional Response Surfaces

Function Name      Expression
Three-Hump Camel   $Z(x, y) = 2x^2 - 1.05x^4 + x^6/6 + xy + y^2$
Bohachevsky        $Z(x, y) = x^2 + 2y^2 - 0.3\cos(3\pi x) - 0.4\cos(4\pi y) + 0.7$
17
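The two benchmark surfaces transcribed as Python functions (numpy assumed), exactly as written above.

```python
import numpy as np

def three_hump_camel(x, y):
    return 2 * x**2 - 1.05 * x**4 + x**6 / 6 + x * y + y**2

def bohachevsky(x, y):
    return x**2 + 2 * y**2 - 0.3 * np.cos(3 * np.pi * x) - 0.4 * np.cos(4 * np.pi * y) + 0.7
```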
Prediction Accuracy
• Standardized RMSE $= \sqrt{\dfrac{\sum_{i=1}^{K}\big[\widehat{Z}(x_i) - Z(x_i)\big]^2}{\sum_{i=1}^{K}\big[Z(x_i) - K^{-1}\sum_{h=1}^{K} Z(x_h)\big]^2}}$, computed over K check points $x_1, \dots, x_K$
18
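A direct transcription of the metric, assuming numpy; Z_true and Z_pred stand for the true and predicted responses at the K check points.

```python
import numpy as np

def standardized_rmse(Z_true, Z_pred):
    """sqrt( sum_i (Zhat_i - Z_i)^2 / sum_i (Z_i - mean(Z))^2 ) over the K check points."""
    num = np.sum((Z_pred - Z_true) ** 2)
    den = np.sum((Z_true - Z_true.mean()) ** 2)
    return np.sqrt(num / den)
```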
Condition Number of $\Sigma_M + \Sigma_\varepsilon$
• $\mathcal{C} = \lambda_{\max}(K)/\lambda_{\min}(K)$ measures "closeness to singularity"
19
Scalability Demonstration
• 4-d Griewank function (transcribed in the sketch below): $Z(\mathbf{x}) = \sum_{i=1}^{4}\Big(\dfrac{x^{(i)}}{20}\Big)^{2} - 10\prod_{i=1}^{4}\cos\Big(\dfrac{x^{(i)}}{\sqrt{i}}\Big) + 10$
• Mean cycle time of an N-station Jackson network with D different types of arrivals (Yang et al. 2011), N = D = 4:
$$E[CT_1] = \sum_{j=1}^{N} \frac{\delta_{1j}}{\mu_j}\left[1 - \rho\,\frac{\sum_{i=1}^{D}\alpha_i\delta_{ij}/\mu_j}{\max_h \sum_{i=1}^{D}\alpha_i\delta_{ih}/\mu_h}\right]^{-1}$$
20
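A transcription of the 4-d Griewank response as a Python function (numpy assumed), evaluated row-wise on an (m, 4) array of design points; the cycle-time response is left to the cited reference.

```python
import numpy as np

def griewank_4d(X):
    """Z(x) = sum_i (x^(i)/20)^2 - 10 * prod_i cos(x^(i)/sqrt(i)) + 10 for 4-d rows of X."""
    X = np.atleast_2d(X)                          # shape (m, 4)
    idx = np.arange(1, X.shape[1] + 1)            # i = 1, ..., 4
    return (np.sum((X / 20.0) ** 2, axis=1)
            - 10.0 * np.prod(np.cos(X / np.sqrt(idx)), axis=1) + 10.0)
```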
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
• GESK creates an augmented set of response estimates for SK
• MCFs can be applied if the design points form a regular lattice of size $n = n_1 \times n_2 \times \cdots \times n_d$
• This results in $2^d n$ points in the augmented dataset (see the sketch below)
• $\Sigma_M$ then has size $2^d n \times 2^d n$, but we can leverage the Kronecker product to reduce its inversion to inverting d much smaller matrices, each of size $2n_r \times 2n_r$
22
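A sketch of the bookkeeping behind these counts, assuming numpy and an illustrative extrapolation step delta; shifting every 1-d grid is one way to realize the augmented lattice described above. Each n_r-point grid becomes a 2n_r-point grid, the augmented lattice has 2^d n points, and the covariance over it still factors as a Kronecker product of d matrices of size 2n_r x 2n_r.

```python
import numpy as np
from functools import reduce

def mcf_matrix(x, p, q):
    return p(np.minimum.outer(x, x)) * q(np.maximum.outer(x, x))

p, q = (lambda t: np.exp(t)), (lambda t: np.exp(-t))       # OU-type MCF in each coordinate
delta = 0.01                                               # extrapolation step (illustrative)
grids = [np.linspace(0.1, 1.0, nr) for nr in (5, 4, 3)]    # d = 3, n = 5 * 4 * 3 = 60

# Augment each 1-d grid with its shifted copy; the product of the augmented grids is still a lattice.
aug_grids = [np.sort(np.concatenate([g, g + delta])) for g in grids]
Ks = [mcf_matrix(g, p, q) for g in aug_grids]              # d matrices, each 2n_r x 2n_r
Kinvs = [np.linalg.inv(K) for K in Ks]                     # tridiagonal 1-d inverses

n = int(np.prod([len(g) for g in grids]))
print("augmented points:", 2 ** len(grids) * n)            # 2^d * n = 480
K_full_inv = reduce(np.kron, Kinvs)                        # kept in factored, sparse form in practice
print(K_full_inv.shape)                                    # (480, 480)
```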
Numerical Illustration
[Figure: EIMSE of SK vs. GESK for n = 5^4 = 625, n = 8^4 = 4096, and n = 10^4 = 10000 design points; vertical axis from 0 to 0.08]
• 4-dimensional Griewank function
• Can manage n = 10^4 design points
23
Conclusions
Remarks on MCFs
• Allow modeling the association directly while retaining sparsity in the precision matrix
• Improve the scalability of SK so that it can be used for simulation models with a high-dimensional design space
• Reduce the computational cost from $O(n^3)$ to $O(n^2)$ without approximation
• Further reduce it to $O(n)$ if observations are noise-free
• Enhance numerical stability substantially
• Limitation: design points must form a regular lattice, though not necessarily equally spaced
24
Remarks on Gradient-Enhanced SK
• GESK (Qu and Fu 2014) can easily benefit from MCFs
• But there are two issues:
  • Extrapolation error is hard to characterize
  • Each design point needs $(2^d - 1)$ pseudo response estimates, a great deal of redundancy in using gradient info
• SKG (Chen, Ankenman, and Nelson 2013) does not incur such computational overhead, but it requires calculating the gradient surface of the Gaussian process (ongoing work)
25
Markovian covariances without approximation
vs.
Good approximations for all covariances
25
Partial Literature
bull Quantification of input uncertainty
bull Barton Nelson and Xie (2014)
bull Xie Nelson and Barton (2014)
bull Simulationblack-boxBayesian optimization
bull Huang et al (2006)
bull Sun Hong and Hu (2014)
bull Scott Frazier and Powell (2011)
bull Shahriari et al (2016)
4
The Big n Problem
bull Response surface is observed at x1 xn with noise
z(xi ) = β +M(xi ) + ε(xi )
bull Best linear unbiased predictor of Z(x0)
983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull Maximum likelihood estimation
maxβθθθ
983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]
⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052
bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time
bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular
bull Especially for the popular Gaussian covariance function
bull Usually run into trouble when n gt 100 which can easily happen
when d ge 3
5
Enhancing SK with Gradient Information
bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1
j (xi ) gdj (xi ))
⊺
g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d
where G r (xi ) is the true r -th partial derivative
bull Predict Z(x0) using both response estimates and gradient estimates
bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)
simple using gradients indirectly
bull Chen Ankenman and Nelson (2013) stochastic kriging with
gradient estimators (SKG) sophisticated using gradients directly
6
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
bull M is multivariate normal with sparsity specified on ΣΣΣminus1M
bull A discrete model using graph to describe Markovian structure
bull Given all its neighbors node i is conditionally independent of its
non-neighbors
bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1
M (i j) ∕= 0 lArrrArr i and j are neighbors
0 1 2 3 4
bull The sparsity can reduce necessary computation to O(n2)
9
Disadvantages
bull Has no explicit expression for the covariances
bull Cannot predict locations ldquooff the gridrdquo
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
10
Markovian Covariance Function Best of Two Worlds
bull Construct a class of covariance functions for which
1 ΣΣΣM can be inverted analytically
2 ΣΣΣminus1M is sparse
bull Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then
k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF
bull Brownian motion kBM(x y) = x Ixley +y Ixgty
bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty
bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty
11
Markovian Covariance Function
bull x1 xn are not necessarily equally spaced
Theorem (Ding and Z 2018)
Kminus1 is tridiagonal and its nonzero entries are
(Kminus1)ii =
983099983105983105983105983105983105983105983103
983105983105983105983105983105983105983101
p2p1(p2q1 minus p1q2)
if i = 1
pi+1qiminus1 minus piminus1qi+1
(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1
qnminus1
qn(pnqnminus1 minus pnminus1qn) if i = n
and
(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1
piqiminus1 minus piminus1qi i = 2 n
12
Reduction in Complexity
bull Woodbury matrix identity
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M983167983166983165983168known
+ ΣΣΣminus1M983167983166983165983168
sparse
983147ΣΣΣminus1
M +ΣΣΣminus1ε983167 983166983165 983168
sparse
983148minus1
ΣΣΣminus1M
bull inversion O(n2)
bull multiplications O(n2)
bull addition O(n2)
bull It takes O(n2) time to compute BLUP
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is
needed and computing BLUP is O(n)
13
Improvement in Stability
1 ΣΣΣM can be made much better conditioned
2 Woodbury also improves numerical stability
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M +ΣΣΣminus1M
983147ΣΣΣminus1
M +ΣΣΣminus1ε
983148minus1
ΣΣΣminus1M
bull The diagonal entries of ΣΣΣminus1ε are often large
14
Uncertainty Quantification
15
Extension for d gt 1
bull Product form k(x y) =983124d
i=1 ki (xi y i )
bull Limitation x1 xn must form a regular lattice
bull Then K =983121d
i=1 Ki and Kminus1 =983121d
i=1 Kminus1i preserving sparsity
(00)
(01)
(02)
(10)
(11)
(12)
(20)
(21)
(22)
16
Two-Dimensional Response Surfaces
Function Name Expression
Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6
6+ xy + y 2
Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07
17
Prediction Accuracy
bull Standardized RMSE =
983155983123K
i=1[Z(xi )minusZ(xi )]2
raquo983123K
i=1[Z(xi )minusKminus1983123K
h=1Z(xh)]
2
18
Condition Number of ΣΣΣM +ΣΣΣε
bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo
19
Scalability Demonstration
bull 4-d Griewank func Z(x) =9831234
i=1
Aumlx (i)
20
auml2minus 10
983124Di=1 cos
Aumlx (i)radici
auml+ 10
bull Mean cycle time of a N-station Jackson network with D different
types of arrivals (Yang et al 2011) N = D = 4
E[CT1] =N983131
j=1
δ1j
microj
iuml1minus ρ
Aring 983123D
i=1αiδijmicroj
maxh983123D
i=1αiδihmicroh
atildeograve
20
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
The Big n Problem
bull Response surface is observed at x1 xn with noise
z(xi ) = β +M(xi ) + ε(xi )
bull Best linear unbiased predictor of Z(x0)
983141Z(x0) = β +ΣΣΣM(x0 middot)[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull Maximum likelihood estimation
maxβθθθ
983051minus log[det(ΣΣΣM +ΣΣΣε)]minus [z minus β1n]
⊺[ΣΣΣM +ΣΣΣε][z minus β1n]983052
bull Slow [ΣΣΣM +ΣΣΣε] isin Rntimesn and inverting it takes O(n3) time
bull Numerically unstable [ΣΣΣM +ΣΣΣε] is often nearly singular
bull Especially for the popular Gaussian covariance function
bull Usually run into trouble when n gt 100 which can easily happen
when d ge 3
5
Enhancing SK with Gradient Information
bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1
j (xi ) gdj (xi ))
⊺
g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d
where G r (xi ) is the true r -th partial derivative
bull Predict Z(x0) using both response estimates and gradient estimates
bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)
simple using gradients indirectly
bull Chen Ankenman and Nelson (2013) stochastic kriging with
gradient estimators (SKG) sophisticated using gradients directly
6
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
bull M is multivariate normal with sparsity specified on ΣΣΣminus1M
bull A discrete model using graph to describe Markovian structure
bull Given all its neighbors node i is conditionally independent of its
non-neighbors
bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1
M (i j) ∕= 0 lArrrArr i and j are neighbors
0 1 2 3 4
bull The sparsity can reduce necessary computation to O(n2)
9
Disadvantages
bull Has no explicit expression for the covariances
bull Cannot predict locations ldquooff the gridrdquo
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
10
Markovian Covariance Function Best of Two Worlds
bull Construct a class of covariance functions for which
1 ΣΣΣM can be inverted analytically
2 ΣΣΣminus1M is sparse
bull Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then
k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF
bull Brownian motion kBM(x y) = x Ixley +y Ixgty
bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty
bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty
11
Markovian Covariance Function
bull x1 xn are not necessarily equally spaced
Theorem (Ding and Z 2018)
Kminus1 is tridiagonal and its nonzero entries are
(Kminus1)ii =
983099983105983105983105983105983105983105983103
983105983105983105983105983105983105983101
p2p1(p2q1 minus p1q2)
if i = 1
pi+1qiminus1 minus piminus1qi+1
(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1
qnminus1
qn(pnqnminus1 minus pnminus1qn) if i = n
and
(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1
piqiminus1 minus piminus1qi i = 2 n
12
Reduction in Complexity
bull Woodbury matrix identity
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M983167983166983165983168known
+ ΣΣΣminus1M983167983166983165983168
sparse
983147ΣΣΣminus1
M +ΣΣΣminus1ε983167 983166983165 983168
sparse
983148minus1
ΣΣΣminus1M
bull inversion O(n2)
bull multiplications O(n2)
bull addition O(n2)
bull It takes O(n2) time to compute BLUP
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is
needed and computing BLUP is O(n)
13
Improvement in Stability
1 ΣΣΣM can be made much better conditioned
2 Woodbury also improves numerical stability
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M +ΣΣΣminus1M
983147ΣΣΣminus1
M +ΣΣΣminus1ε
983148minus1
ΣΣΣminus1M
bull The diagonal entries of ΣΣΣminus1ε are often large
14
Uncertainty Quantification
15
Extension for d gt 1
bull Product form k(x y) =983124d
i=1 ki (xi y i )
bull Limitation x1 xn must form a regular lattice
bull Then K =983121d
i=1 Ki and Kminus1 =983121d
i=1 Kminus1i preserving sparsity
(00)
(01)
(02)
(10)
(11)
(12)
(20)
(21)
(22)
16
Two-Dimensional Response Surfaces
Function Name Expression
Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6
6+ xy + y 2
Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07
17
Prediction Accuracy
bull Standardized RMSE =
983155983123K
i=1[Z(xi )minusZ(xi )]2
raquo983123K
i=1[Z(xi )minusKminus1983123K
h=1Z(xh)]
2
18
Condition Number of ΣΣΣM +ΣΣΣε
bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo
19
Scalability Demonstration
bull 4-d Griewank func Z(x) =9831234
i=1
Aumlx (i)
20
auml2minus 10
983124Di=1 cos
Aumlx (i)radici
auml+ 10
bull Mean cycle time of a N-station Jackson network with D different
types of arrivals (Yang et al 2011) N = D = 4
E[CT1] =N983131
j=1
δ1j
microj
iuml1minus ρ
Aring 983123D
i=1αiδijmicroj
maxh983123D
i=1αiδihmicroh
atildeograve
20
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Enhancing SK with Gradient Information
bull j-th run of the simulation model at xi producesbull response estimate zj(xi )bull gradient estimate gj(xi ) = (g 1
j (xi ) gdj (xi ))
⊺
g rj (xi ) = G r (xi ) + δrj (xi ) r = 1 d
where G r (xi ) is the true r -th partial derivative
bull Predict Z(x0) using both response estimates and gradient estimates
bull Qu and Fu (2014) gradient extrapolated stochastic kriging (GESK)
simple using gradients indirectly
bull Chen Ankenman and Nelson (2013) stochastic kriging with
gradient estimators (SKG) sophisticated using gradients directly
6
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
bull M is multivariate normal with sparsity specified on ΣΣΣminus1M
bull A discrete model using graph to describe Markovian structure
bull Given all its neighbors node i is conditionally independent of its
non-neighbors
bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1
M (i j) ∕= 0 lArrrArr i and j are neighbors
0 1 2 3 4
bull The sparsity can reduce necessary computation to O(n2)
9
Disadvantages
bull Has no explicit expression for the covariances
bull Cannot predict locations ldquooff the gridrdquo
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
10
Markovian Covariance Function Best of Two Worlds
bull Construct a class of covariance functions for which
1 ΣΣΣM can be inverted analytically
2 ΣΣΣminus1M is sparse
bull Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then
k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF
bull Brownian motion kBM(x y) = x Ixley +y Ixgty
bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty
bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty
11
Markovian Covariance Function
bull x1 xn are not necessarily equally spaced
Theorem (Ding and Z 2018)
Kminus1 is tridiagonal and its nonzero entries are
(Kminus1)ii =
983099983105983105983105983105983105983105983103
983105983105983105983105983105983105983101
p2p1(p2q1 minus p1q2)
if i = 1
pi+1qiminus1 minus piminus1qi+1
(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1
qnminus1
qn(pnqnminus1 minus pnminus1qn) if i = n
and
(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1
piqiminus1 minus piminus1qi i = 2 n
12
Reduction in Complexity
bull Woodbury matrix identity
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M983167983166983165983168known
+ ΣΣΣminus1M983167983166983165983168
sparse
983147ΣΣΣminus1
M +ΣΣΣminus1ε983167 983166983165 983168
sparse
983148minus1
ΣΣΣminus1M
bull inversion O(n2)
bull multiplications O(n2)
bull addition O(n2)
bull It takes O(n2) time to compute BLUP
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is
needed and computing BLUP is O(n)
13
Improvement in Stability
1 ΣΣΣM can be made much better conditioned
2 Woodbury also improves numerical stability
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M +ΣΣΣminus1M
983147ΣΣΣminus1
M +ΣΣΣminus1ε
983148minus1
ΣΣΣminus1M
bull The diagonal entries of ΣΣΣminus1ε are often large
14
Uncertainty Quantification
15
Extension for d gt 1
bull Product form k(x y) =983124d
i=1 ki (xi y i )
bull Limitation x1 xn must form a regular lattice
bull Then K =983121d
i=1 Ki and Kminus1 =983121d
i=1 Kminus1i preserving sparsity
(00)
(01)
(02)
(10)
(11)
(12)
(20)
(21)
(22)
16
Two-Dimensional Response Surfaces
Function Name Expression
Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6
6+ xy + y 2
Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07
17
Prediction Accuracy
bull Standardized RMSE =
983155983123K
i=1[Z(xi )minusZ(xi )]2
raquo983123K
i=1[Z(xi )minusKminus1983123K
h=1Z(xh)]
2
18
Condition Number of ΣΣΣM +ΣΣΣε
bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo
19
Scalability Demonstration
bull 4-d Griewank func Z(x) =9831234
i=1
Aumlx (i)
20
auml2minus 10
983124Di=1 cos
Aumlx (i)radici
auml+ 10
bull Mean cycle time of a N-station Jackson network with D different
types of arrivals (Yang et al 2011) N = D = 4
E[CT1] =N983131
j=1
δ1j
microj
iuml1minus ρ
Aring 983123D
i=1αiδijmicroj
maxh983123D
i=1αiδihmicroh
atildeograve
20
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
GESK (Qu and Fu 2014)
bull Use gradient estimates to create ldquopseudordquo response estimates
zj(xi ) asymp zj(xi ) + gj(xi )⊺∆xi
where xi = xi +∆xibull ∆xi the direction and step size of the linear extrpolation
bull Predict Z(x0) using the augmented data
(z(x1) z(xn) z(x1) z(xn))
bull The size of the covariance matrix now becomes 2n times 2n
bull One could create d pseudo response estimates at each xi resultingin inverting a matrix of size (d + 1)n times (d + 1)n
bull Similar problem for SKG
7
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
bull M is multivariate normal with sparsity specified on ΣΣΣminus1M
bull A discrete model using graph to describe Markovian structure
bull Given all its neighbors node i is conditionally independent of its
non-neighbors
bull Eg M(x2) perp (M(x0)M(x4)) given (M(x1)M(x3))bull ΣΣΣminus1
M (i j) ∕= 0 lArrrArr i and j are neighbors
0 1 2 3 4
bull The sparsity can reduce necessary computation to O(n2)
9
Disadvantages
bull Has no explicit expression for the covariances
bull Cannot predict locations ldquooff the gridrdquo
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168unknown
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
10
Markovian Covariance Function Best of Two Worlds
bull Construct a class of covariance functions for which
1 ΣΣΣM can be inverted analytically
2 ΣΣΣminus1M is sparse
bull Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy
p(x)q(y)minus p(y)q(x) lt 0 for all x lt y Then
k(x y) = p(x)q(y) Ixley +p(y)q(x) Ixgty is called a 1-d MCF
bull Brownian motion kBM(x y) = x Ixley +y Ixgty
bull Brownian bridge kBR(x y) = x(1minus y) Ixley +y(1minus x) Ixgty
bull OU process kOU(x y) = exeminusy Ixley +eyeminusx Ixgty
11
Markovian Covariance Function
bull x1 xn are not necessarily equally spaced
Theorem (Ding and Z 2018)
Kminus1 is tridiagonal and its nonzero entries are
(Kminus1)ii =
983099983105983105983105983105983105983105983103
983105983105983105983105983105983105983101
p2p1(p2q1 minus p1q2)
if i = 1
pi+1qiminus1 minus piminus1qi+1
(piqiminus1 minus piminus1qi )(pi+1qi minus piqi+1) if 2 le i le n minus 1
qnminus1
qn(pnqnminus1 minus pnminus1qn) if i = n
and
(Kminus1)iminus1i = (Kminus1)iiminus1 =minus1
piqiminus1 minus piminus1qi i = 2 n
12
Reduction in Complexity
bull Woodbury matrix identity
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M983167983166983165983168known
+ ΣΣΣminus1M983167983166983165983168
sparse
983147ΣΣΣminus1
M +ΣΣΣminus1ε983167 983166983165 983168
sparse
983148minus1
ΣΣΣminus1M
bull inversion O(n2)
bull multiplications O(n2)
bull addition O(n2)
bull It takes O(n2) time to compute BLUP
983141Z(x0) = β +ΣΣΣM(x0 middot)983167 983166983165 983168known
[ΣΣΣM +ΣΣΣε]minus1[z minus β1n]
bull If the noise is negligible (ΣΣΣε asymp 0) then no numerical inversion is
needed and computing BLUP is O(n)
13
Improvement in Stability
1 ΣΣΣM can be made much better conditioned
2 Woodbury also improves numerical stability
[ΣΣΣM +ΣΣΣε]minus1 = ΣΣΣminus1
M +ΣΣΣminus1M
983147ΣΣΣminus1
M +ΣΣΣminus1ε
983148minus1
ΣΣΣminus1M
bull The diagonal entries of ΣΣΣminus1ε are often large
14
Uncertainty Quantification
15
Extension for d gt 1
bull Product form k(x y) =983124d
i=1 ki (xi y i )
bull Limitation x1 xn must form a regular lattice
bull Then K =983121d
i=1 Ki and Kminus1 =983121d
i=1 Kminus1i preserving sparsity
(00)
(01)
(02)
(10)
(11)
(12)
(20)
(21)
(22)
16
Two-Dimensional Response Surfaces
Function Name Expression
Three-Hump Camel Z(x y) = 2x2 minus 105x4 + x6
6+ xy + y 2
Bohachevsky Z(x y) = x2 + 2y 2 minus 03 cos(3πx)minus 04 cos(4πy) + 07
17
Prediction Accuracy
bull Standardized RMSE =
983155983123K
i=1[Z(xi )minusZ(xi )]2
raquo983123K
i=1[Z(xi )minusKminus1983123K
h=1Z(xh)]
2
18
Condition Number of ΣΣΣM +ΣΣΣε
bull C = λmax(K )λmin(K ) measures ldquocloseness to singularityrdquo
19
Scalability Demonstration
bull 4-d Griewank func Z(x) =9831234
i=1
Aumlx (i)
20
auml2minus 10
983124Di=1 cos
Aumlx (i)radici
auml+ 10
bull Mean cycle time of a N-station Jackson network with D different
types of arrivals (Yang et al 2011) N = D = 4
E[CT1] =N983131
j=1
δ1j
microj
iuml1minus ρ
Aring 983123D
i=1αiδijmicroj
maxh983123D
i=1αiδihmicroh
atildeograve
20
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Approximation Schemes
bull Well developed in spatial statistics and machine learning
bull Banerjee et al (2015)
bull Rasmussen and Williams (2006)
bull Reduced-rank approximations emphasize long-range dependences
bull Sparse approximations emphasize short-range dependences
optimized to emphasize the potential pathologies of themethod Since in Bayesian optimization we use thecredible intervals to guide exploration these artefactscan mislead our search
2) Sparse Spectrum Gaussian Processes (SSGPs) Whileinducing pseudoinputs reduce computational complexityby using a fixed number of points in the search spacesparse spectrum Gaussian processes (SSGPs) take a similarapproach to the kernelrsquos spectral space [94] Bochnerrsquostheorem states that any stationary kernel kethxx0THORN frac14kethx $ x0THORN has a positive and finite Fourier spectrum sethWTHORN ie
kethxTHORN frac14 1
eth2THORNd
Ze$ iWTxsethWTHORNdW (38)
Since the spectrum is positive and bounded it can benormalized such that pethWTHORN frac14 sethWTHORN= is a valid probabilitydensity function In this formulation evaluating thestationary kernel is equivalent to computing the expecta-tion of the Fourier basis with respect to its specific spectraldensity pethWTHORN as in the following
kethxx0THORN frac14 EW e$ iWTethx$ x0THORNh i
(39)
As the name suggests SSGP approximates this expectationvia MC estimation using m samples drawn from thespectral density so that
kethxx0THORN
m
Xm
ifrac14 1
e$ iWethiTHORNTxeiWethiTHORN
Tx0 (40)
where WethiTHORN amp sethWTHORN= The resulting finite-dimensionalproblem is equivalent to Bayesian linear regression with mbasis functions and the computational cost is once againreduced to Oethnm2 thorn m3THORN
As with the pseudoinputs the spectral points can also betuned via marginal likelihood optimization Although thisviolates the MC assumption and introduces a risk ofoverfitting it allows for a smaller number of basis functionswith good predictive power [94] Once again in Fig 4 wehave not tuned the 80 spectral points in this way Whereasaround observed data (red crosses) the uncertainty estimatesare smoother than the pseudoinputs method away fromobservations both the prediction and uncertainty regionsexhibit spurious oscillations This is highly undesirable forBayesian optimization where we expect our surrogate modelto fall back on the prior away from observed data
3) Random Forests Finally as an alternative to GPsrandom forest regression has been proposed as anexpressive and flexible surrogate model in the context ofsequential model-based algorithm configuration (SMAC)[79] Introduced in 2001 [24] random forests are a class ofscalable and highly parallelizable regression models thathave been very successful in practice [42] More preciselythe random forest is an ensemble method where the weaklearners are decision trees trained on random subsamplesof the data [24] Averaging the predictions of theindividual trees produces an accurate response surface
Subsampling the data and the inherent parallelism ofthe random forest regression model give SMAC the abilityto readily scale to large evaluation budgets beyond wherethe cubic cost of an exact GP would be infeasibleSimilarly at every decision node of every tree a fixed-sized subset of the available dimensions is sampled to fit adecision rule this subsampling also helps the randomforest scale to high-dimensional search spaces Perhapsmost importantly random forests inherit the flexibility ofdecision trees when dealing with various data types theycan easily handle categorical and conditional variables Forexample when considering a decision node the algorithmcan exclude certain search dimensions from considerationwhen the path leading up to said node includes a particularboolean feature that is turned off
The exploration strategy in SMAC still requires anuncertainty estimate for predictions at test points Whilethe random forest does not provide an estimate of the
Fig 4 Comparison of surrogate regression models Four different surrogate model posteriors are shown in blue (shaded area delimits 95
credible intervals) given noisy evaluations (red crosses) of a synthetic function (dashed line) The ten pseudoinputs for the SPGP method are
shown as black crosses The SSGP model used a basis of 80 Fourier features
Shahriari et al Taking the Human Out of the Loop A Review of Bayesian Optimization
Vol 104 No 1 January 2016 | Proceedings of the IEEE 159
Figure 1 Posterior means and variances Source Shahriari et al (2016)
8
Approximation-free
8
Markovian Covariance Functions
Gaussian Markov Random Field (GMRF)
• M is multivariate normal with sparsity specified on Σ_M^{-1}
• A discrete model using a graph to describe the Markovian structure
• Given all its neighbors, node i is conditionally independent of its non-neighbors
• E.g., M(x2) ⊥ (M(x0), M(x4)) given (M(x1), M(x3))
• Σ_M^{-1}(i, j) ≠ 0 ⟺ i and j are neighbors
• Chain graph on nodes 0, 1, 2, 3, 4 (each node linked only to its immediate neighbors)
• The sparsity can reduce the necessary computation to O(n^2)
9
Disadvantages
• Has no explicit expression for the covariances
• Cannot predict at locations "off the grid":
  Ẑ(x0) = β + Σ_M(x0, ·) [Σ_M + Σ_ε]^{-1} (z − β 1_n), where Σ_M(x0, ·) is unknown
10
Markovian Covariance Function: Best of Two Worlds
• Construct a class of covariance functions for which
  1. Σ_M can be inverted analytically
  2. Σ_M^{-1} is sparse
• Explicit link between covariance function and sparsity
Definition (1-d MCF)
Let p and q be two positive continuous functions that satisfy p(x)q(y) − p(y)q(x) < 0 for all x < y. Then
k(x, y) = p(x)q(y) I{x ≤ y} + p(y)q(x) I{x > y} is called a 1-d MCF.
• Brownian motion: k_BM(x, y) = x I{x ≤ y} + y I{x > y}
• Brownian bridge: k_BR(x, y) = x(1 − y) I{x ≤ y} + y(1 − x) I{x > y}
• OU process: k_OU(x, y) = e^x e^{−y} I{x ≤ y} + e^y e^{−x} I{x > y}
(A small code sketch of these kernels follows.)
11
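To make the definition concrete, here is a minimal NumPy sketch (not from the talk; design points and helper names are illustrative) that assembles the covariance matrix of a 1-d MCF from user-supplied p and q, with the Brownian motion, Brownian bridge, and OU examples above.

```python
import numpy as np

def mcf_kernel(x, p, q):
    """Covariance matrix of a 1-d MCF: k(x, y) = p(x)q(y) if x <= y, else p(y)q(x)."""
    x = np.asarray(x, dtype=float)
    lo = np.minimum.outer(x, x)   # the smaller of (x_i, x_j)
    hi = np.maximum.outer(x, x)   # the larger of (x_i, x_j)
    return p(lo) * q(hi)          # p at the smaller point, q at the larger

# Illustrative (not equally spaced) design points
xs = np.array([0.1, 0.35, 0.4, 0.8])
K_bm = mcf_kernel(xs, lambda t: t,         lambda t: np.ones_like(t))  # Brownian motion
K_br = mcf_kernel(xs, lambda t: t,         lambda t: 1.0 - t)          # Brownian bridge
K_ou = mcf_kernel(xs, lambda t: np.exp(t), lambda t: np.exp(-t))       # OU process
```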
Markovian Covariance Function
• x1, ..., xn are not necessarily equally spaced
Theorem (Ding and Zhang 2018)
Write K = [k(x_i, x_j)]_{i,j}, p_i = p(x_i), and q_i = q(x_i), with x_1 < ··· < x_n. Then K^{-1} is tridiagonal and its nonzero entries are
  (K^{-1})_{11} = p_2 / [p_1 (p_2 q_1 − p_1 q_2)],
  (K^{-1})_{ii} = (p_{i+1} q_{i−1} − p_{i−1} q_{i+1}) / [(p_i q_{i−1} − p_{i−1} q_i)(p_{i+1} q_i − p_i q_{i+1})] for 2 ≤ i ≤ n − 1,
  (K^{-1})_{nn} = q_{n−1} / [q_n (p_n q_{n−1} − p_{n−1} q_n)],
and
  (K^{-1})_{i−1,i} = (K^{-1})_{i,i−1} = −1 / (p_i q_{i−1} − p_{i−1} q_i), i = 2, ..., n.
(A numerical check of these formulas is sketched below.)
12
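A quick sanity check of the theorem (my own sketch, not code from the paper): the closed-form tridiagonal entries are assembled and compared against a brute-force numerical inverse for the OU example, whose covariance matrix is simply exp(−|x_i − x_j|).

```python
import numpy as np

def mcf_precision(x, p, q):
    """Tridiagonal K^{-1} assembled from the closed-form entries of the theorem."""
    x = np.asarray(x, dtype=float)
    pv, qv = p(x), q(x)
    n = len(x)
    d = pv[1:] * qv[:-1] - pv[:-1] * qv[1:]        # d[i-2] = p_i q_{i-1} - p_{i-1} q_i (1-based i)
    Kinv = np.zeros((n, n))
    Kinv[0, 0] = pv[1] / (pv[0] * d[0])
    Kinv[-1, -1] = qv[-2] / (qv[-1] * d[-1])
    for i in range(1, n - 1):                      # interior diagonal entries
        Kinv[i, i] = (pv[i + 1] * qv[i - 1] - pv[i - 1] * qv[i + 1]) / (d[i - 1] * d[i])
    for i in range(1, n):                          # symmetric off-diagonal entries
        Kinv[i - 1, i] = Kinv[i, i - 1] = -1.0 / d[i - 1]
    return Kinv

# OU check: k(x_i, x_j) = exp(min) * exp(-max) = exp(-|x_i - x_j|)
xs = np.array([0.05, 0.2, 0.35, 0.6, 0.8, 0.95])
K = np.exp(-np.abs(np.subtract.outer(xs, xs)))
Kinv = mcf_precision(xs, lambda t: np.exp(t), lambda t: np.exp(-t))
assert np.allclose(Kinv, np.linalg.inv(K))
```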
Reduction in Complexity
• Woodbury matrix identity:
  [Σ_M + Σ_ε]^{-1} = Σ_M^{-1} − Σ_M^{-1} [Σ_M^{-1} + Σ_ε^{-1}]^{-1} Σ_M^{-1},
  where Σ_M^{-1} is known in closed form and both Σ_M^{-1} and Σ_ε^{-1} are sparse
  • inversion: O(n^2)
  • multiplications: O(n^2)
  • addition: O(n^2)
• It takes O(n^2) time to compute the BLUP
  Ẑ(x0) = β + Σ_M(x0, ·) [Σ_M + Σ_ε]^{-1} (z − β 1_n), with Σ_M(x0, ·) known in closed form
• If the noise is negligible (Σ_ε ≈ 0), then no numerical inversion is needed and computing the BLUP is O(n)
(An end-to-end code sketch follows.)
13
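To illustrate this workflow end to end, here is a minimal sketch assuming an OU-type MCF in one dimension, synthetic responses, and a homoscedastic noise variance; the variable names and constants are mine, not the authors'. It builds the sparse tridiagonal Σ_M^{-1} in closed form, applies the Woodbury identity using sparse solves only, and evaluates the BLUP at a new point.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(0)
n = 5000
x = np.sort(rng.uniform(0.0, 1.0, n))                     # 1-d design points
z = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)  # synthetic noisy responses
beta, noise_var = z.mean(), 0.01                          # assumed homoscedastic noise

# Closed-form tridiagonal precision of the OU MCF (entries from the theorem above)
p, q = np.exp(x), np.exp(-x)
d = p[1:] * q[:-1] - p[:-1] * q[1:]
main = np.empty(n)
main[0] = p[1] / (p[0] * d[0])
main[-1] = q[-2] / (q[-1] * d[-1])
main[1:-1] = (p[2:] * q[:-2] - p[:-2] * q[2:]) / (d[:-1] * d[1:])
P = sparse.diags([-1.0 / d, main, -1.0 / d], offsets=[-1, 0, 1], format="csc")  # Sigma_M^{-1}

# Woodbury: [Sigma_M + Sigma_eps]^{-1} r = P r - P (P + Sigma_eps^{-1})^{-1} P r
r = z - beta
Pr = P @ r
inner = (P + sparse.diags(np.full(n, 1.0 / noise_var))).tocsc()
w = Pr - P @ spsolve(inner, Pr)

# BLUP at a new point x0; the OU covariance vector Sigma_M(x0, .) is explicit
x0 = 0.37
z_hat = beta + np.exp(-np.abs(x0 - x)) @ w
```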
Improvement in Stability
1. Σ_M can be made much better conditioned
2. Woodbury also improves numerical stability:
   [Σ_M + Σ_ε]^{-1} = Σ_M^{-1} − Σ_M^{-1} [Σ_M^{-1} + Σ_ε^{-1}]^{-1} Σ_M^{-1}
   • The diagonal entries of Σ_ε^{-1} are often large, so the inner matrix Σ_M^{-1} + Σ_ε^{-1} is well conditioned
14
Uncertainty Quantification
15
Extension for d > 1
• Product form: k(x, y) = ∏_{i=1}^d k_i(x^i, y^i)
• Limitation: x1, ..., xn must form a regular lattice
• Then K = ⊗_{i=1}^d K_i and K^{-1} = ⊗_{i=1}^d K_i^{-1}, preserving sparsity (a small check is sketched below)
• Example lattice in 2-d: the 3 × 3 grid with nodes (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)
16
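A small check of the Kronecker identity (my own sketch, with illustrative marginal grids), using the OU covariance along each axis; for MCFs each 1-d inverse is tridiagonal, so the Kronecker product of the inverses remains sparse.

```python
import numpy as np

ou = lambda g: np.exp(-np.abs(np.subtract.outer(g, g)))  # 1-d OU covariance matrix

x1 = np.linspace(0.0, 1.0, 7)   # marginal grid along axis 1
x2 = np.linspace(0.0, 2.0, 5)   # marginal grid along axis 2 (need not match axis 1)
K1, K2 = ou(x1), ou(x2)

# Covariance over the 7 x 5 lattice and its inverse via the Kronecker identity
K = np.kron(K1, K2)
K_inv = np.kron(np.linalg.inv(K1), np.linalg.inv(K2))
assert np.allclose(K_inv @ K, np.eye(K.shape[0]))
```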
Two-Dimensional Response Surfaces
Function Name      Expression
Three-Hump Camel   Z(x, y) = 2x^2 − 1.05x^4 + x^6/6 + xy + y^2
Bohachevsky        Z(x, y) = x^2 + 2y^2 − 0.3 cos(3πx) − 0.4 cos(4πy) + 0.7
17
Prediction Accuracy
• Standardized RMSE = sqrt( Σ_{i=1}^K [Ẑ(x_i) − Z(x_i)]^2 / Σ_{i=1}^K [Z(x_i) − (1/K) Σ_{h=1}^K Z(x_h)]^2 ), where K denotes the number of check points (a small code sketch follows)
18
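A one-function sketch of this metric (the helper name is mine), assuming the true responses and the predictions are available at the K check points.

```python
import numpy as np

def standardized_rmse(z_true, z_hat):
    """RMSE of the predictor divided by the RMSE of the constant sample-mean predictor."""
    z_true, z_hat = np.asarray(z_true, float), np.asarray(z_hat, float)
    num = np.sum((z_hat - z_true) ** 2)
    den = np.sum((z_true - z_true.mean()) ** 2)
    return np.sqrt(num / den)
```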
Condition Number of Σ_M + Σ_ε
• C = λ_max(K) / λ_min(K) measures "closeness to singularity"
19
Scalability Demonstration
• 4-d Griewank function: Z(x) = Σ_{i=1}^4 (x^{(i)}/20)^2 − 10 ∏_{i=1}^4 cos(x^{(i)}/√i) + 10 (transcribed as code below)
• Mean cycle time of an N-station Jackson network with D different types of arrivals (Yang et al. 2011), N = D = 4:
  E[CT_1] = Σ_{j=1}^N (δ_{1j}/μ_j) · [1 − ρ (Σ_{i=1}^D α_i δ_{ij}/μ_j) / (max_h Σ_{i=1}^D α_i δ_{ih}/μ_h)]^{-1}
20
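For reference, the Griewank test surface can be written as a short NumPy function; this transcribes the expression as reconstructed above (the Jackson-network surface is omitted since its routing and rate parameters are not spelled out here).

```python
import numpy as np

def griewank4(x):
    """Z(x) = sum_i (x_i / 20)^2 - 10 * prod_i cos(x_i / sqrt(i)) + 10, for 4-d inputs."""
    x = np.atleast_2d(x)                         # rows are design points in R^4
    i = np.arange(1, x.shape[1] + 1)
    return ((x / 20.0) ** 2).sum(axis=1) - 10.0 * np.cos(x / np.sqrt(i)).prod(axis=1) + 10.0
```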
Computational Efficiency
21
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
• GESK creates an augmented set of response estimates for SK
• MCFs can be applied if the design points form a regular lattice of size n = n1 × n2 × ··· × nd
• This results in 2^d n points in the augmented dataset
• Σ_M has size 2^d n × 2^d n, but we can leverage the Kronecker product to reduce its inversion to inverting d much smaller matrices, each of size 2n_r × 2n_r (a small sketch of the augmentation follows)
22
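A sketch of the lattice augmentation described above (step size, toy response, and function names are mine): each design point spawns 2^d − 1 pseudo points by shifting along subsets of the axes, with responses extrapolated linearly from the gradient estimate.

```python
import numpy as np
from itertools import product

def gesk_augment(X, z, G, step):
    """Each row x of X spawns 2^d - 1 pseudo points x + delta, delta in {0, step}^d minus {0},
    with first-order extrapolated responses z + G @ delta."""
    n, d = X.shape
    X_aug, z_aug = [X], [z]
    for mask in product((0, 1), repeat=d):
        if not any(mask):
            continue                          # skip the all-zero shift (the original points)
        delta = step * np.asarray(mask, dtype=float)
        X_aug.append(X + delta)
        z_aug.append(z + G @ delta)           # linear extrapolation along the shifted axes
    return np.vstack(X_aug), np.concatenate(z_aug)

# Toy example: a 5 x 5 lattice in d = 2 becomes a 10 x 10 lattice (2^d n = 100 points)
g = np.linspace(0.0, 1.0, 5)
X = np.array(list(product(g, g)))
z = np.sin(X).sum(axis=1)                     # toy noise-free responses
G = np.cos(X)                                 # their exact gradients
X_aug, z_aug = gesk_augment(X, z, G, step=0.05)
assert X_aug.shape == (100, 2)
```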
Numerical Illustration
[Bar charts: EIMSE of SK vs. GESK (vertical axis from 0 to 0.08) for n = 5^4 = 625, n = 8^4 = 4096, and n = 10^4 = 10000 design points; the remaining panel labels read 10^{-8}, 10^{-7}, 10^{-6}]
• 4-dimensional Griewank function
• Can manage n = 10^4 design points
23
Conclusions
Remarks on MCFs
• Allow modeling association directly while retaining sparsity in the precision matrix
• Improve the scalability of SK so that it can be used for simulation models with a high-dimensional design space
• Reduce computational cost from O(n^3) to O(n^2) without approximation
• Further reduce to O(n) if observations are noise-free
• Enhance numerical stability substantially
• Limitation: design points must form a regular lattice, though not necessarily equally spaced
24
Remarks on Gradient Enhanced SK
• GESK (Qu and Fu 2014) can easily benefit from MCFs
• But there are two issues:
  • Extrapolation error is hard to characterize
  • Each design point needs (2^d − 1) pseudo response estimates, a great deal of redundancy in using gradient info
• SKG (Chen, Ankenman, and Nelson 2013) does not incur such computational overhead, but requires calculating the gradient surface of the Gaussian process (on-going work)
25
Markovian covariances without approximation
vs.
Good approximations for all covariances
25
vs
Good approx for all covariances
25
Scalable Gradient Extrapolated
Stochastic Kriging
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Enhancing Scalability of GESK with MCFs
bull GESK creates an augmented set of response estimates for SK
bull MCFs can be applied if the design points form a regular lattice of
size n = n1 times n2 times middot middot middot nd
bull Result in 2dn points in the augmented dataset
bull ΣM has size 2dn times 2dn but we can leverage the Kronecker product
to reduce its inversion to inverting d much smaller matrices each
having size 2nr times 2nr
22
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Numerical Illustration
SK GESK
n=54=625 =108
0
001
002
003
004
005
006
007
008
EIM
SE
SK GESK
n=84=4096 =07
0
001
002
003
004
005
006
007
008
SK GESK
n=104=10000 =06
0
001
002
003
004
005
006
007
008
bull 4-dimensional Griewank function
bull Can manage n = 104 design points
23
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Conclusions
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Remarks on MCFs
bull Allow modeling association directly while retaining sparsity in the
precision matrix
bull Improve the scalability of SK so that it can be used for simulation
models with a high-dimensional design space
bull Reduce computational cost from O(n3) to O(n2) without approx
bull Further reduce to O(n) if observations are noise-free
bull Enhance numerical stability substantially
bull Limitation design points must form a regular lattice though not
necessarily equally spaced
24
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Remarks on Gradient Enhanced SK
bull GESK (Qu and Fu 2014) can easily benefit from MCFs
bull But there are two issues
bull Extrapolation error is hard to characterize
bull Each design point needs (2d minus 1) pseudo response estimates a great
deal of redundancy in using gradient info
bull SKG (Chenn Ankenman and Nelson 2013) does not incur such
computational overhead but requires calculating the gradient
surface of the Gaussian process (on-going work)
25
Markovian covariances without approx
vs
Good approx for all covariances
25
Markovian covariances without approx
vs
Good approx for all covariances
25