APTS Design of Experiments, September 2019

Dave Woods ([email protected] ; http://www.southampton.ac.uk/~davew) Statistical Sciences Research Institute, University of Southampton

Preliminaries

These notes and other resources can be found at https://statsdavew.github.io/apts.doe/

A few example designs and data sets for this module are available in the R package apts.doe, which can be installed from GitHub:

library(devtools)
install_github("statsdavew/apts.doe", quiet = T)
library(apts.doe)

References will be provided throughout but some good general purpose texts are:

· Atkinson, Donev and Tobias (2007). Optimum Experimental Design, with SAS. OUP.
· Wu and Hamada (2009). Experiments: Planning, Analysis, and Parameter Design Optimization (2nd ed.). Wiley.
· Morris (2011). Design of Experiments: An Introduction Based on Linear Models. Chapman and Hall/CRC Press.
· Santner, Williams and Notz (2019). The Design and Analysis of Computer Experiments (2nd ed.). Springer.
Motivation and background

Modes of data collection

· Observational studies
· Sample surveys
· Designed experiments
Experiments

Definition: An experiment is a procedure whereby controllable factors, or features, of a system or process are deliberately varied in order to understand the impact of these changes on one or more measurable responses.

· "prehistory": Bacon, Lind, Peirce, … (establishing the scientific method)
· agriculture (1920s)
· clinical trials (1940s)
· industry (1950s)
· in-silico (1980s)
Role of experimentation

Why do we experiment?

· key to the scientific method (hypothesis – experiment – observe – infer – conclude)
· potential to establish causality …
· … and to understand/improve complex systems depending on many factors
· comparison of treatments, factor screening, prediction, optimisation, …

Design of experiments: a statistical approach to the arrangement of the operational details of the experiment (eg sample size, specific experimental conditions investigated, …) so that the quality of the answers to be derived from the data is as high as possible.
Motivating examples

1. Multi-factor experiment in pharmaceutical development.

Key to developing new medicines is the identification of optimal and robust process conditions (e.g. settings of temperature, pressure etc.) at which the active pharmaceutical ingredient should be synthesized.

[Somewhat confusingly, the FDA refer to this as identification of a "design space".]

An important step in this methodology is a robustness experiment to assess the sensitivity of identified conditions to changes in all (or at least very many) controllable factors.

While developing a new melanoma drug, GlaxoSmithKline performed an experiment to investigate sensitivity to 20 factors. Their experimental budget allowed only 10 individual experiments (runs) to be performed.
Motivating examples

2. Optimal design to calibrate a physical model.

Physical (mechanistic, mathematical, …) models are used in many scientific fields. Typically, they are derived from fundamental understanding of the physics, chemistry, biology …

Most commonly, these models are solutions to differential equations. The models usually contain unknown parameters that should be estimated from experimental data.

Biologists at Southampton were studying the transfer of amino acids between mother and baby through the placenta. They could control the times at which observations were taken and the initial concentrations of amino acids (see Overstall, Woods, and Parker 2019).
Motivating examples

3. Computer experiments to optimise ride performance in luxury cars.

Suspension settings can be used to improve the ride performance in cars. Optimising settings across many different car models would take many hundreds of hours of testing, so computer simulations are used.

Jaguar-Land Rover wanted to find suspension settings robust across different car models using a computer experiment (KTN workshop).
Simple motivating example

Consider an experiment to compare two treatments (eg drugs, diets, fertilisers).

We have $n$ subjects (eg people, mice, plots of land), each of which can be assigned to one of the two treatments.

A response (eg protein measurement, weight, yield) is then measured from each subject.

Question: How should the two treatments be assigned to the subjects to gain the most precise inference about the difference in expected response from the two treatments?
Assume a linear model for the response

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\,, \quad i = 1, \ldots, n\,,$$

with $\varepsilon_i \sim N(0, \sigma^2)$ independently, unknown parameters $\beta_0, \beta_1$ and

$$x_i = \begin{cases} -1 & \text{if treatment 1 is applied to subject } i\,, \\ +1 & \text{if treatment 2 is applied to subject } i\,. \end{cases}$$

The difference in expected response between treatments 1 and 2 is

$$E(y_i \mid x_i = +1) - E(y_i \mid x_i = -1) = \beta_0 + \beta_1 - \beta_0 + \beta_1 = 2\beta_1\,.$$

So we need the most precise possible estimator of $\beta_1$.
Both $\beta_0$ and $\beta_1$ can be estimated using least squares (or equivalently maximum likelihood).

Writing

$$y = X\beta + \varepsilon\,,$$

we obtain estimators

$$\hat{\beta} = (X^T X)^{-1} X^T y\,,$$

with

$$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}\,.$$

In this simple example, we are interested in estimating $\beta_1$, and we have

$$\mathrm{Var}(\hat{\beta}_1) = \frac{n\sigma^2}{n \sum x_i^2 - \left(\sum x_i\right)^2} = \frac{n\sigma^2}{n^2 - \left(\sum x_i\right)^2}\,.$$
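The variance formula can be checked numerically; a minimal sketch for an illustrative unbalanced allocation with $n = 10$ and $n_1 = 3$ (the values here are invented for the example):

n <- 10; n1 <- 3
x <- c(rep(-1, n1), rep(1, n - n1))
X <- cbind(1, x)
solve(t(X) %*% X)[2, 2]   ## Var(hat beta_1) / sigma^2, directly from (X^T X)^{-1}
n / (n^2 - sum(x)^2)      ## the formula above: identical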
Hence, we need to pick $x_1, \ldots, x_n$ to minimise $\left(\sum x_i\right)^2$.

· Denote by $n_1$ the number of subjects assigned to treatment 1, and by $n_2$ the number assigned to treatment 2, with $n_1 + n_2 = n$; then $\left(\sum x_i\right)^2 = (n_1 - n_2)^2$.
· It is obvious that $\sum x_i = 0$ if and only if $n_1 = n_2$.

Assuming $n$ is even, the "optimal design" has $n_1 = n_2 = n/2$.

For $n$ odd, let $n_1 = \frac{n+1}{2}$ and $n_2 = \frac{n-1}{2}$.

We can assess a design, labelled $\xi$, via its efficiency relative to the optimal design $\xi^\star$:

$$\mathrm{Eff}(\xi) = \frac{\mathrm{Var}(\hat{\beta}_1 \mid \xi^\star)}{\mathrm{Var}(\hat{\beta}_1 \mid \xi)}\,.$$
n <- 50
eff <- function(n1) 1 - ((2 * n1 - n) / n)^2
curve(eff, from = 0, to = n, ylab = "Eff", xlab = expression(n[1]))
Definitions

· Treatment – entities of scientific interest to be studied in the experiment, eg varieties of crop, doses of a drug, combinations of temperature and pressure
· Unit – smallest subdivision of the experimental material such that two units may receive different treatments, eg plots of land, subjects in a clinical trial, samples of reagent
· Run – application of a treatment to a unit
Example

Fabrication of integrated circuits (see Wu and Hamada 2009)

An initial step in fabricating integrated circuits is the growth of an epitaxial layer on polished silicon wafers via chemical deposition.

· Unit: set of six wafers (mounted in a rotating cylinder)
· Treatment: combination of settings of the factors
  - A: rotation method ($x_1$)
  - B: nozzle position ($x_2$)
  - C: deposition temperature ($x_3$)
  - D: deposition time ($x_4$)
A unit-treatment statistical model

$$y_{ij} = \tau_i + \varepsilon_{ij}\,, \quad i = 1, \ldots, t; \; j = 1, \ldots, n_i\,,$$

where

· $y_{ij}$: measured response from the $j$th unit to which treatment $i$ has been applied
· $\tau_i$: treatment effect (expected response from application of the $i$th treatment)
· $\varepsilon_{ij}$: random deviation from the expected response [typically $\varepsilon_{ij} \sim N(0, \sigma^2)$]

The aims of the experiment are achieved by estimating comparisons between the treatment effects, $\tau_k - \tau_l$.

Experimental precision and accuracy are largely obtained through control and comparison.
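A small simulated illustration of this model in R (a sketch; the number of treatments, replication and effect values are invented for the example):

set.seed(1)
tau <- c(10, 12, 9)                                   ## illustrative treatment effects
d <- data.frame(treatment = factor(rep(1:3, each = 4)))
d$y <- tau[d$treatment] + rnorm(nrow(d), sd = 0.5)    ## y_ij = tau_i + eps_ij
coef(lm(y ~ 0 + treatment, data = d))                 ## least squares estimates of the tau_i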
Model assumptions

Three key model assumptions are:

· additivity (response = treatment effect + unit effect)
· constancy of treatment effects (treatment effect does not depend on the unit to which it is applied)
· no interference between units (the effect of a treatment applied to unit $j$ does not depend on the treatment applied to any other unit)

See Dasgupta, Pillai, and Rubin (2015) for discussion of these assumptions for factorial experiments.
Principles of experimentation

Stratification (blocking)

· account for systematic differences between batches of experimental units by arranging them in homogeneous sets (blocks)
  - if the same treatment was applied to all units, within-block variation in the response would be much less than between-block
  - compare treatments within the same block and hence eliminate block effects

Replication

· the application of each treatment to multiple experimental units
  - provides an estimate of experimental error against which to judge treatment differences
  - reduces the variance of the estimators of treatment differences
Randomisation

Randomisation is perhaps the key principle in the design of experiments.

· we randomise features such as the allocation of units to treatments, the order in which treatments are applied, …
  - protects against lurking (uncontrolled) variables (model-robust) and subjectivity in the allocation of treatments to units
· it protects against model misspecification (bias), and hence allows causality to be established
  - a clear difference between treatments can only be an accident of the randomisation or a consequence of the treatments
· unbiased estimation of $\tau$ and $\sigma^2$, even if the errors are not normally distributed
· exact tests for differences between treatment effects are available (Basu 1980)
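A minimal sketch of generating a completely randomised allocation in R (the sample size and treatment labels are illustrative, not from the notes):

set.seed(42)
## randomly allocate n = 12 units to two treatments, 6 each
allocation <- sample(rep(c("T1", "T2"), each = 6))
data.frame(unit = 1:12, treatment = allocation)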
Without randomisation, unobserved confounders ($U$) can induce a dependency between the response ($Y$) and treatment ($T$).

cf Cox and Reid (2000), p.35
With randomisation, unobserved confounders ($U$) are independent of the treatment ($T$). Marginalisation over $U$ does not induce an edge between $T$ and $Y$.

cf Cox and Reid (2000), p.35
Factorial designs

Example revisited

Fabrication of integrated circuits (Wu and Hamada 2009, p155)

Assume each factor has two levels, coded -1 and +1.

· Treatment: combination of settings of the factors
  - A: rotation method ($x_1$)
  - B: nozzle position ($x_2$)
  - C: deposition temperature ($x_3$)
  - D: deposition time ($x_4$)
Treatments and a regression model

Each factor has two levels, $x_k = \pm 1$, $k = 1, \ldots, 4$.

A treatment is then defined as a combination of four values of $-1, +1$:

· eg $x_1 = -1, x_2 = -1, x_3 = +1, x_4 = -1$ specifies a setting of the process

Assume each treatment effect is determined by a regression model in the four factors, eg

$$\tau(\mathbf{x}) = \beta_0 + \sum_{i=1}^{4} \beta_i x_i + \sum_{j=1}^{4} \sum_{i > j}^{4} \beta_{ij} x_i x_j\,.$$
(Two-level) Factorial design

with(cirfab, cirfab[order(x1, x2, x3, x4), ])

##    x1 x2 x3 x4     ybar
## 2  -1 -1 -1 -1 13.58983
## 1  -1 -1 -1  1 14.59000
## 4  -1 -1  1 -1 14.04983
## 3  -1 -1  1  1 14.24000
## 6  -1  1 -1 -1 13.94000
## 5  -1  1 -1  1 14.65000
## 8  -1  1  1 -1 14.14017
## 7  -1  1  1  1 14.40000
## 10  1 -1 -1 -1 13.72000
## 9   1 -1 -1  1 14.67000
## 12  1 -1  1 -1 13.90000
## 11  1 -1  1  1 13.84017
## 14  1  1 -1 -1 13.87983
## 13  1  1 -1  1 14.56000
## 16  1  1  1 -1 14.11017
## 15  1  1  1  1 14.30000

· treatments in standard order
· $\bar{y}$ – average response from the six wafers
Regression model and least squares

$$Y = X\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 I)\,, \quad \hat{\beta} = (X^T X)^{-1} X^T Y$$

· model matrix $X$ has columns corresponding to intercept, linear and cross-product terms
· information matrix $X^T X = nI$
· regression coefficients are estimated by independent contrasts in the data

cirfab.lm <- lm(ybar ~ (.) ^ 2, data = cirfab)
coef(cirfab.lm)

##  (Intercept)           x1           x2           x3           x4        x1:x2 
## 14.161250000 -0.038729167  0.086270833 -0.038708333  0.245020833  0.003708333 
##        x1:x3        x1:x4        x2:x3        x2:x4        x3:x4 
## -0.046229167 -0.025000000  0.028770833 -0.015041667 -0.172520833
Main effects and interactions

Main effect of $x_k$:

$$[\text{Avg. response when } x_k = 1] - [\text{Avg. response when } x_k = -1]$$

Interaction between $x_j$ and $x_k$:

$$[\text{Avg. response when } x_j x_k = 1] - [\text{Avg. response when } x_j x_k = -1]$$

Higher-order interactions defined similarly.

Assuming -1, +1 coding, there is a straightforward relationship between factorial effects and regression coefficients:

· main effect of $x_k$ is equal to $2\beta_k$
· interaction between $x_j$ and $x_k$ is equal to $2\beta_{jk}$
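A quick check of this relationship using the cirfab data and the model fitted above (a sketch, assuming the columns are numeric $\pm 1$ as in the printed design):

me.x4 <- with(cirfab, mean(ybar[x4 == 1]) - mean(ybar[x4 == -1]))  ## main effect of x4
c(main.effect = me.x4, twice.coef = 2 * coef(cirfab.lm)["x4"])     ## equal, by the identity above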
Using the effects package:

library(effects)
plot(Effect("x1", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)
plot(Effect("x2", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)
plot(Effect("x3", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)
plot(Effect("x4", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)

Main effects
Interactions

plot(Effect(c("x3", "x4"), cirfab.lm), main = "", rug = F, ylim = c(13.5, 15), x.var = "x4")
Orthogonality

$X^T X = nI \Rightarrow$ the $\hat{\beta}$ are independently normally distributed with equal variance.

Hence, we can treat the identification of important effects (ie large $\hat{\beta}$) as an outlier identification problem (Cuthbert 1959):

· plot (absolute) ordered factorial effects against (absolute) quantiles from a standard normal
· outlying effects are identified as important
Using the FrF2 package:

library(FrF2)
par(pty = "s", mar = c(8, 4, 1, 2))
DanielPlot(cirfab.lm, main = "", datax = F, half = T)
Replication

An unreplicated factorial design provides no model-independent estimate of $\sigma^2$ (Gilmour and Trinca 2012).

· any unsaturated model does provide an estimate, but it may be biased by ignored (significant) model terms
· this is one reason why graphical (or associated) analysis methods are popular

Replication also increases the power of the design.

· common to replicate a centre point
· allows a portmanteau test of curvature
Refit a linear model to the fabricated circuits experiment excluding interactions.

cirfab2.lm <- lm(ybar ~ (.), data = cirfab)
coef(cirfab.lm)

##  (Intercept)           x1           x2           x3           x4        x1:x2 
## 14.161250000 -0.038729167  0.086270833 -0.038708333  0.245020833  0.003708333 
##        x1:x3        x1:x4        x2:x3        x2:x4        x3:x4 
## -0.046229167 -0.025000000  0.028770833 -0.015041667 -0.172520833

coef(cirfab2.lm)

## (Intercept)          x1          x2          x3          x4 
## 14.16125000 -0.03872917  0.08627083 -0.03870833  0.24502083

cbind(sigma1 = summary(cirfab.lm)$sigma, df1 = summary(cirfab.lm)$df[2],
      sigma2 = summary(cirfab2.lm)$sigma, df2 = summary(cirfab2.lm)$df[2])

##        sigma1 df1    sigma2 df2
## [1,] 0.137152   5 0.2396108  11
Principles of factorial experimentation

Effect sparsity

· the number of important effects in a factorial experiment is small relative to the total number of effects investigated (cf Box and Meyer 1986)

Effect hierarchy

· lower-order effects are more likely to be important than higher-order effects
· effects of the same order are equally likely to be important

Effect heredity

· interactions where at least one parent main effect is important are more likely to be important themselves

Wu and Hamada (2009), p.172
Regular fractional factorial designs

Choosing subsets of treatments

$2^m$ factorial designs can require a large number of runs for only a moderate number of factors ($2^5 = 32$).

Resource constraints (eg cost) may mean not all combinations can be run.

Lots of degrees of freedom are devoted to estimating higher-order interactions:

· eg in a $2^5$ experiment, 16 degrees of freedom are used to estimate three-factor and higher-order interactions
· principles of effect hierarchy and sparsity suggest this may be wasteful

Need to trade-off what you want to estimate against the number of runs you can afford.
Example

Production of bacteriocin (Morris 2011, p231)

· bacteriocin is a natural food preservative grown from bacteria

· Unit: a single bio-reaction
· Treatment: combination of settings of the factors
  - A: amount of glucose ($x_1$)
  - B: initial inoculum size ($x_2$)
  - C: level of aeration ($x_3$)
  - D: temperature ($x_4$)
  - E: amount of sodium ($x_5$)

Assume each factor has two levels, coded -1 and +1.
Find an $n = 8$ run design using FrF2.

· $8 = 32/4 = 2^5/2^2 = 2^{5-2}$
· we need a principled way of choosing one-quarter of the runs from the $2^5$ factorial design that leads to clarity in the analysis

bact.design <- FrF2(8, 5, factor.names = paste0("x", 1:5),
                    generators = list(c(1, 3), c(2, 3)), randomize = F, alias.info = 3)
bact.design

##   x1 x2 x3 x4 x5
## 1 -1 -1 -1  1  1
## 2  1 -1 -1 -1  1
## 3 -1  1 -1  1 -1
## 4  1  1 -1 -1 -1
## 5 -1 -1  1 -1 -1
## 6  1 -1  1  1 -1
## 7 -1  1  1 -1  1
## 8  1  1  1  1  1
## class=design, type= FrF2.generators
Assuming the number of runs is a power of two, $n = 2^{k-q}$, we can construct $2^{k-q} - 1$ orthogonal vectors (with inner product zero), spanned by $k - q = \log_2(n)$ vectors:

· construct the full factorial design for $k - q$ factors
· assign the remaining $q$ factors to interaction columns

model.matrix(~ (x1 + x2 + x3) ^ 3, bact.design[, 1:3])[, ]

##   (Intercept) x11 x21 x31 x11:x21 x11:x31 x21:x31 x11:x21:x31
## 1           1  -1  -1  -1       1       1       1          -1
## 2           1   1  -1  -1      -1      -1       1           1
## 3           1  -1   1  -1      -1       1      -1           1
## 4           1   1   1  -1       1      -1      -1          -1
## 5           1  -1  -1   1       1      -1      -1           1
## 6           1   1  -1   1      -1       1      -1          -1
## 7           1  -1   1   1      -1      -1       1          -1
## 8           1   1   1   1       1       1       1           1
Aliasing scheme

The design has been deliberately chosen so that

· $x_4 = x_1 x_3$
· $x_5 = x_2 x_3$

[$x_1 x_2$ is shorthand for the Hadamard (Schur or entrywise) product of two vectors, $x_1 \circ x_2$]

What other consequences are there?

· $x_4 x_5 = x_1 x_3 x_2 x_3 = x_1 x_2 x_3^2$
· the product of any column with itself is the constant column (the identity)
· hence, $x_4 x_5 = x_1 x_2$
Now we can obtain the defining relation

· $I = x_1 x_3 x_4 = x_2 x_3 x_5 = x_1 x_2 x_4 x_5$

and the complete aliasing scheme

· $x_1 = x_3 x_4 = x_1 x_2 x_3 x_5 = x_2 x_4 x_5$
· $x_2 = x_1 x_2 x_3 x_4 = x_3 x_5 = x_1 x_4 x_5$
· $x_3 = x_1 x_4 = x_2 x_5 = x_1 x_2 x_3 x_4 x_5$
· $x_4 = x_1 x_3 = x_2 x_3 x_4 x_5 = x_1 x_2 x_5$
· $x_5 = x_1 x_3 x_4 x_5 = x_2 x_3 = x_1 x_2 x_4$
· $x_1 x_2 = x_2 x_3 x_4 = x_1 x_3 x_5 = x_4 x_5$
· $x_1 x_5 = x_3 x_4 x_5 = x_1 x_2 x_3 = x_2 x_4$
· …
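The generators can be verified numerically from the design itself; a minimal sketch (assuming, as is the FrF2 default, that the columns are stored as factors with levels -1 and 1):

## numeric +/-1 version of the design
d.num <- sapply(bact.design, function(z) as.numeric(as.character(z)))
all(d.num[, "x4"] == d.num[, "x1"] * d.num[, "x3"])                  ## TRUE
all(d.num[, "x5"] == d.num[, "x2"] * d.num[, "x3"])                  ## TRUE
all(d.num[, "x4"] * d.num[, "x5"] == d.num[, "x1"] * d.num[, "x2"])  ## TRUE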
FrF2 will summarise the aliasing amongst main effects and two- and three-factor interactions.

design.info(bact.design)$aliased

## $legend
## [1] "A=x1" "B=x2" "C=x3" "D=x4" "E=x5"
## 
## $main
## [1] "A=CD=BDE" "B=CE=ADE" "C=AD=BE"  "D=AC=ABE" "E=BC=ABD"
## 
## $fi2
## [1] "AB=DE=ACE=BCD" "AE=BD=ABC=CDE"
## 
## $fi3
## [1] "ACD=BCE"
The alias matrix

What is the consequence of this aliasing?

· assumed model $Y = X_1\beta_1 + \varepsilon$
· true model $Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$

$$\begin{aligned} E(\hat{\beta}_1) &= (X_1^T X_1)^{-1} X_1^T E(Y) \\ &= (X_1^T X_1)^{-1} X_1^T (X_1\beta_1 + X_2\beta_2) \\ &= \beta_1 + (X_1^T X_1)^{-1} X_1^T X_2 \beta_2 \\ &= \beta_1 + A\beta_2 \end{aligned}$$

$A$ is the alias matrix.

· if the columns of $X_1$ and $X_2$ are not orthogonal, $\hat{\beta}_1$ is biased

If more than one effect in each alias string is non-zero, the least squares estimators will be biased.
For the example $2^{5-2}$ design:

· $X_1$ is an $8 \times 6$ matrix with columns for the intercept and five linear terms ("main effects")
· $X_2$ is an $8 \times 10$ matrix with columns for the 10 product terms ("two-factor interactions")

$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

For a regular design, the matrix $A$ will only have entries $0, \pm 1$ (no aliasing or complete aliasing).
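The alias matrix can be computed directly from the two model matrices; a sketch using the numeric version of the design (d.num) constructed above:

X1 <- model.matrix(~ ., data.frame(d.num))            ## intercept + main effects
X2 <- model.matrix(~ (.)^2, data.frame(d.num))[, -(1:6)]  ## the 10 product columns
A <- solve(t(X1) %*% X1) %*% t(X1) %*% X2
round(A, 10)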
The transpose of the alias matrix is provided by the alias function.

ff.alias <- alias(y ~ (.)^2, data = data.frame(bact.design, y = vector(length = 8)))
ff.alias$Complete

##         (Intercept) x11 x21 x31 x41 x51 x11:x21 x11:x51
## x21:x31           0   0   0   0   0   1       0       0
## x21:x41           0   0   0   0   0   0       0       1
## x21:x51           0   0   0   1   0   0       0       0
## x31:x41           0   1   0   0   0   0       0       0
## x31:x51           0   0   1   0   0   0       0       0
## x41:x51           0   0   0   0   0   0       1       0
## x11:x31           0   0   0   0   1   0       0       0
## x11:x41           0   0   0   1   0   0       0       0
These linear dependencies can be seen if we attempt to fit a linear model.

bact.lm <- lm(yB ~ (x1 + x2 + x3 + x4 + x5)^2, data = bact)
summary(bact.lm)

## 
## Call:
## lm.default(formula = yB ~ (x1 + x2 + x3 + x4 + x5)^2, data = bact)
## 
## Residuals:
## ALL 8 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (8 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.42625         NA      NA       NA
## x1           0.26625         NA      NA       NA
## x2           0.24625         NA      NA       NA
## x3          -0.22875         NA      NA       NA
## x4          -1.11875         NA      NA       NA
## x5          -0.66375         NA      NA       NA
## x1:x2       -0.05375         NA      NA       NA
## x1:x3             NA         NA      NA       NA
## x1:x4             NA         NA      NA       NA
## x1:x5       -0.13375         NA      NA       NA
## x2:x3             NA         NA      NA       NA
## x2:x4             NA         NA      NA       NA
## x2:x5             NA         NA      NA       NA
## x3:x4             NA         NA      NA       NA
## x3:x5             NA         NA      NA       NA
## x4:x5             NA         NA      NA       NA
The role of fractional factorial designs in a sequential strategy

Typically, in a first experiment, fractional factorial designs are used in screening:

· investigate which of many factors have a substantive effect on the response
· main effects and two-factor interactions
· centre points to check for curvature

At second and later stages, augment the design:

· to resolve ambiguities due to the aliasing of factorial effects ("break the alias strings")
· to allow estimation of curvature and prediction from a more complex model
D-optimality and non-regular designs

Introduction

Regular fractional factorial designs have the number of runs equal to a power of the number of levels:

· eg $2^{5-2}$, $3^{3-1}$
· this inflexibility in run sizes can be a problem in practical experiments

Non-regular designs can have any number of runs (usually with $n > p$, the number of parameters to be estimated).

Often the clarity provided by a regular design is lost:

· no defining relation or straightforward aliasing scheme
· partial aliasing and fractional entries in $A$

One approach to finding non-regular designs is via a design optimality criterion.
D-optimality

Notation: let $\xi = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ denote a design (choice of treatments and their replications).

Assuming the model $Y = X\beta + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2 I_n)$, a D-optimal design maximises

$$\phi(\xi) = \det\left(X^T X\right)\,.$$

That is, a D-optimal design maximises the determinant of the (expected) Fisher information matrix:

· equivalent to minimising the volume of the joint confidence ellipsoid for $\beta$

Also useful to define a Bayesian version, with $R$ a prior precision matrix (see later):

$$\phi_B(\xi) = \det\left(X^T X + R\right)\,.$$
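As an illustration of the criterion (a sketch; the random comparison design and seed are invented for the example), $\det(X^T X)$ can be computed for the $2^4$ factorial design in cirfab and for a random two-level design of the same size, under the main-effects model:

Dcrit <- function(X) det(crossprod(X))   ## det(X^T X)
X.fact <- model.matrix(~ x1 + x2 + x3 + x4, cirfab)
set.seed(1)
X.rand <- cbind(1, matrix(sample(c(-1, 1), 16 * 4, replace = TRUE), nrow = 16))
c(factorial = Dcrit(X.fact), random = Dcrit(X.rand))  ## orthogonal design has the larger determinant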
Comments

D-optimal designs are model dependent:

· if the model (ie the columns of $X$) changes, the optimal design may change
· model-robust design is an active area of research

D-optimality promotes orthogonality in the $X$ matrix:

· if there are sufficient runs, the D-optimal design will usually be orthogonal
· for particular models and choices of $n$, regular fractional factorial designs are D-optimal

There are many other optimality criteria, tailored to other experimental goals:

· prediction, model discrimination, space-filling, …
Example: Plackett-Burman design

$k = 11$ factors in $n = 12$ runs, first-order (main effects) model (Plackett and Burman 1946).

A particular D-optimal design is the following orthogonal array, using the pb function in the FrF2 package:

pb.design <- pb(12, factor.names = paste0("x", 1:11))
pb.design

##    x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
## 1  -1  1 -1  1  1 -1  1  1  1  -1  -1
## 2  -1 -1  1 -1  1  1 -1  1  1   1  -1
## 3  -1 -1 -1  1 -1  1  1 -1  1   1   1
## 4   1  1  1 -1 -1 -1  1 -1  1   1  -1
## 5  -1  1  1 -1  1  1  1 -1 -1  -1   1
## 6   1 -1 -1 -1  1 -1  1  1 -1   1   1
## 7  -1  1  1  1 -1 -1 -1  1 -1   1   1
## 8  -1 -1 -1 -1 -1 -1 -1 -1 -1  -1  -1
## 9   1 -1  1  1 -1  1  1  1 -1  -1  -1
## 10  1  1 -1  1  1  1 -1 -1 -1   1  -1
## 11  1 -1  1  1  1 -1 -1 -1  1  -1   1
## 12  1  1 -1 -1 -1  1 -1  1  1  -1   1
## class=design, type= pb
This 12-run PB design is probably the most studied non-regular design:

· orthogonal columns
· complex aliasing between main effects and two-factor interactions

pb.alias <- alias(y ~ (.)^2, data = data.frame(pb.design, y = vector(length = 12)))
head(pb.alias$Complete, n = 15)

##          (Intercept)  x11  x21  x31  x41  x51  x61  x71  x81  x91 x101 x111
## x11:x21            0    0    0 -1/3 -1/3 -1/3  1/3 -1/3 -1/3  1/3  1/3 -1/3
## x11:x31            0    0 -1/3    0  1/3 -1/3 -1/3  1/3 -1/3  1/3 -1/3 -1/3
## x11:x41            0    0 -1/3  1/3    0  1/3  1/3 -1/3 -1/3 -1/3 -1/3 -1/3
## x11:x51            0    0 -1/3 -1/3  1/3    0 -1/3 -1/3 -1/3 -1/3  1/3  1/3
## x11:x61            0    0  1/3 -1/3  1/3 -1/3    0 -1/3  1/3 -1/3 -1/3 -1/3
## x11:x71            0    0 -1/3  1/3 -1/3 -1/3 -1/3    0  1/3 -1/3  1/3 -1/3
## x11:x81            0    0 -1/3 -1/3 -1/3 -1/3  1/3  1/3    0 -1/3 -1/3  1/3
## x11:x91            0    0  1/3  1/3 -1/3 -1/3 -1/3 -1/3 -1/3    0 -1/3  1/3
## x11:x101           0    0  1/3 -1/3 -1/3  1/3 -1/3  1/3 -1/3 -1/3    0 -1/3
## x11:x111           0    0 -1/3 -1/3 -1/3  1/3 -1/3 -1/3  1/3  1/3 -1/3    0
## x21:x31            0 -1/3   0    0 -1/3 -1/3 -1/3  1/3 -1/3 -1/3  1/3  1/3
## x21:x41            0 -1/3   0 -1/3    0  1/3 -1/3 -1/3  1/3 -1/3  1/3 -1/3
## x21:x51            0 -1/3   0 -1/3  1/3    0  1/3  1/3 -1/3 -1/3 -1/3 -1/3
## x21:x61            0  1/3   0 -1/3 -1/3  1/3    0 -1/3 -1/3 -1/3 -1/3  1/3
## x21:x71            0 -1/3   0  1/3 -1/3  1/3 -1/3    0 -1/3  1/3 -1/3 -1/3
Example: supersaturated design

Screening designs with fewer runs than factors (see Woods and Lewis 2017):

· can't use ordinary least squares/maximum likelihood as $X$ does not have full column rank
· Bayesian D-optimality with $R = [0 \mid \tau I_m]$

Supersaturated experiment used by GlaxoSmithKline in the development of a new oncology drug:

· $k = 16$ factors: e.g. temperature, solvent amount, reaction time
· $n = 10$ runs
· Bayesian D-optimal design with $\tau = 0.2$
ssd

##    x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16
## 1   1  1 -1  1  1 -1 -1 -1 -1  -1  -1   1   1   1   1   1
## 2   1  1  1 -1 -1 -1 -1 -1  1   1  -1  -1   1  -1  -1   1
## 3  -1 -1  1 -1  1 -1  1 -1 -1   1  -1  -1  -1   1   1  -1
## 4  -1  1  1  1  1 -1 -1  1 -1  -1   1  -1  -1  -1  -1  -1
## 5  -1 -1 -1 -1 -1  1 -1  1  1  -1  -1  -1  -1  -1   1   1
## 6   1  1  1 -1  1  1 -1  1  1   1   1   1   1   1   1  -1
## 7  -1 -1  1  1 -1 -1  1 -1  1  -1   1   1   1  -1   1  -1
## 8   1 -1 -1  1  1  1  1  1 -1   1  -1   1   1  -1  -1  -1
## 9  -1  1 -1  1 -1  1  1 -1  1   1   1  -1  -1   1  -1   1
## 10  1 -1  1 -1 -1 -1  1  1 -1  -1   1   1  -1   1  -1   1
Partial aliasing between main effects

Heatmap of column correlations:

library(fields)
par(mar = c(8, 2, 0, 0))
image.plot(1:16, 1:16, cor(ssd), zlim = c(-1, 1), xlab = "Factors", ylab = "", asp = 1, axes = F)
axis(1, at = seq(2, 16, by = 2), line = .5)
axis(2, at = seq(2, 16, by = 2), line = -5)

Analysis via regularised (shrinkage) methods (eg lasso, Dantzig selector; see APTS High Dimensional Statistics):

· small coefficients shrunk to zero
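A sketch of such an analysis using the lasso (via glmnet); ssd contains no response, so one is simulated here with two active factors purely for illustration:

library(glmnet)
set.seed(2)
y.sim <- 2 * ssd[, "x4"] - 1.5 * ssd[, "x5"] + rnorm(10, sd = 0.25)  ## invented response
fit <- cv.glmnet(as.matrix(ssd), y.sim, nfolds = 5)
coef(fit, s = "lambda.min")  ## most coefficients shrunk to exactly zero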
Bayesian optimal design

Introduction

Now consider a more general class of models (cf preliminary material).

Let $y = (y_1, \ldots, y_n)^T$ be iid observations from a distribution with density/mass function $\pi(y_i; \theta, \mathbf{x}_i)$:

· $\theta$ is a $q$-vector of unknown parameters
· $\mathbf{x}_i = (x_{1i}, \ldots, x_{ki})^T$ is a vector of values of $k$ controllable variables.

The (expected) information matrix

$$M(\theta) = E_y\left[-\frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta^T}\right]$$

is an important quantity for design, where $l(\theta) = \sum_{i=1}^{n} \log \pi(y_i; \theta, \mathbf{x}_i)$ (the log-likelihood).

· $M(\theta)$ is the (asymptotic) precision for the maximum likelihood estimators $\hat{\theta}$.
· $M(\theta)$ is also an asymptotic approximation to the posterior precision for $\theta$ in a Bayesian analysis.
Pharmacokinetics

Example 1: Compartmental model (Ryan et al. 2014)

$$y_i \sim N\left(c(\theta)\,\mu(\theta; x_i),\; \sigma^2 \nu(\theta; x_i)\right)\,, \quad x_i \in [0, 24]\,,$$

with

$$\mu(\theta; x) = \exp(-\theta_1 x) - \exp(-\theta_2 x)\,, \quad c(\theta) = \frac{400\,\theta_2}{\theta_3(\theta_2 - \theta_1)}\,, \quad \nu(\theta; x) = 1 + \frac{\tau^2}{\sigma^2}\, c(\theta)^2 \mu(\theta; x)^2\,,$$

for $\theta_1, \theta_2, \theta_3, \tau^2, \sigma^2 > 0$.

Prior distributions (for later use):

· $\log \theta_i \sim N(m_i, 0.05)$, with $m_1 = \log 0.1$, $m_2 = 0$, $m_3 = \log 20$
comp <- function(x, theta, D = 400) {
  mu <- exp(-theta[1] * x) - exp(-theta[2] * x)
  c <- (D / theta[3]) * (theta[2]) / (theta[2] - theta[1])
  c * mu
}
theta <- c(.1, 1, 20)
M <- 100
par(mar = c(6, 4, 0, 1) + .1)
lapply(1:M, function(l) {
  thetat <- rlnorm(3, log(theta), rep(0.05, 3))
  curve(comp(x, theta = thetat), from = 0, to = 24, ylab = "Expected concentration",
        xlab = "Time", ylim = c(0, 20), xlim = c(0, 24), add = l != 1)
})
Crystallography

Example 2: multi-factor experiments to build (hierarchical) logistic regression models for pharmaceutical salt formation.

Four controllable variables:

· rate of agitation during mixing ($x_1$)
· volume of composition ($x_2$)
· temperature ($x_3$)
· evaporation rate ($x_4$)
For the $j$th observation in the $i$th group $(i = 1, \ldots, g;\; j = 1, \ldots, n_g)$:

$$y_{ij} \sim \text{Bernoulli}\left(\rho(\mathbf{x}_{ij})\right)\,,$$

with

$$\log\left(\frac{\rho(\mathbf{x}_{ij})}{1 - \rho(\mathbf{x}_{ij})}\right) = (\beta_0 + \omega_{i0}) + \sum_{r=1}^{k} (\beta_r + \omega_{ir})\, x_{ijr}\,,$$

where $x_{ijr}$ is the value taken by the $r$th variable.

· $\beta = (\beta_0, \beta_1, \ldots, \beta_{q-1})^T$ are unknown parameters of interest
· $\omega_i = (\omega_{i0}, \omega_{i1}, \ldots, \omega_{i,q-1})^T$ are group-specific parameters for the $i$th group

Prior distributions (for later use):

· $\beta_0 \sim U(-3, 3)$, $\beta_1 \sim U(4, 10)$, $\beta_2 \sim U(5, 11)$, $\beta_3 \sim U(-6, 0)$, $\beta_4 \sim U(-2.5, 3.5)$

1. standard logistic regression – $\omega_{ir} = 0$
2. hierarchical logistic regression – $\omega_{ir} \sim U(-s_r, +s_r)$, with $s_r > 0$ following a triangular distribution
Classical optimal designs

Many Frequentist criteria for finding optimal designs for both linear and nonlinear models optimise a function of the information matrix; see Atkinson, Donev, and Tobias (2007), ch.10.

Let $\xi = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T$ denote a design, and set $M(\xi; \theta) = M(\theta)$ to explicitly acknowledge the dependence of the information matrix on the design.

· we have already seen $D$-optimality

· $D$-optimality: maximise $\phi_D(\xi) = \det M(\xi; \theta)$
· $A$-optimality: minimise $\phi_A(\xi) = \mathrm{trace}\, M(\xi; \theta)^{-1}$
· $G$-optimality: minimise $\phi_G(\xi) = \max_{\mathbf{x}} \mathrm{Var}(\hat{y}(\mathbf{x}))$
  - where $\hat{y}(\mathbf{x})$ is the predicted response at $\mathbf{x}$ and the (asymptotic) prediction variance is a function of $M(\xi; \theta)$
· $V$- (or $I$-) optimality: minimise $\phi_V(\xi) = \int \mathrm{Var}(\hat{y}(\mathbf{x}))\, d\mathbf{x}$
Optimal design for nonlinear models

For most nonlinear models, $M(\xi; \theta)$ will be a function of the unknown parameters $\theta$ (unlike for the linear model, where $M(\xi; \beta) = X^T X$).

This leads to a "chicken and egg" situation:

· if you can tell me the values of the unknown parameters, I can give you an optimal design
· but if you knew the value of $\theta$, you probably wouldn't need to perform the experiment!

For some models/experiments, the quality of a design may change a lot with the value of $\theta$.
A simple example

rho <- function(x, beta0 = 0, beta1 = 1) {
  eta <- beta0 + beta1 * x
  1 / (1 + exp(-eta))
}
par(mar = c(8, 4, 1, 2) + 0.1)
curve(rho, from = -5, to = 5, ylab = expression(rho), xlab = expression(italic(x)),
      cex.lab = 1.5, cex.axis = 1.5, ylim = c(0, 1), lwd = 2)
For simple logistic regression, the information matrix has the form

$$M(\xi; \beta) = X^T W X\,,$$

with $X$ the $n \times 2$ model matrix and $W = \mathrm{diag}\{\rho(x_i)[1 - \rho(x_i)]\}$.

For example, with $n = 2$, $\xi = (-1, 1)$, $\beta_0 = 0$ and $\beta_1 = 1$:

$$M(\xi; \beta) = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 0.2 & 0 \\ 0 & 0.2 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$$

Minfo <- function(xi, beta0 = 0, beta1 = 1) {
  X <- cbind(c(1, 1), xi)
  v <- function(x) rho(x, beta0, beta1) * (1 - rho(x, beta0, beta1))
  W <- diag(c(v(xi[1]), v(xi[2])))
  t(X) %*% W %*% X
}
Dcrit <- function(xi, beta0 = 0, beta1 = 1) {
  d <- det(Minfo(xi, beta0, beta1))
  ifelse(is.nan(d), -Inf, d)
}
Locally D-optimal designs

$\beta_0 = 0$, $\beta_1 = 1$

dopt <- optim(par = c(-1, 1), Dcrit, control = list(fnscale = -1))
xi.opt1 <- dopt$par
xi.opt1

## [1] -1.543421  1.543530
Locally D-optimal designs

$\beta_0 = 0$, $\beta_1 = 2$

dopt <- optim(par = c(-1, 1), Dcrit, control = list(fnscale = -1), beta1 = 2)
xi.opt2 <- dopt$par
xi.opt2

## [1] -0.7717705  0.7717418
Locally D-optimal designs

$\beta_0 = 0$, $\beta_1 = 0.5$

dopt <- optim(par = c(-1, 1), Dcrit, control = list(fnscale = -1), beta1 = .5)
xi.opt3 <- dopt$par
xi.opt3

## [1] -3.086913  3.086981
Getting $\beta_1$ wrong: design for $\beta_1 = 0.5$ when actually $\beta_1 = 2$.

$D$-efficiency:

(Dcrit(xi.opt3, beta1 = 2) / Dcrit(xi.opt2, beta1 = 2)) ^ (1 / 2)

## [1] 0.05720905
Use of the "wrong" design can lead to uninformative experiments (with "small" information matrices)
For the logistic regression example, the drop in efficiency is closely related to the phenomenon ofseparation (see Firth 1993)
Motivates the need for designs which are robust to the values of the model parameters
maximin designs (focus on worst case performance)
Bayesian designs
·
·
74/129
Bayesian optimal design

Decision-theoretic design starts with a utility function $u(\xi, y, \theta)$ that defines the usefulness of a design for a particular purpose, given data $y$ and parameters $\theta$.

Common choices of utility function include:

· negative squared error loss

$$u(\xi, y, \theta) = -[\theta - E(\theta \mid y)]^2$$

  - negative squared difference between $\theta$ and the posterior mean

· surprisal or self information

$$u(\xi, y, \theta) = \log \pi(\theta \mid y, \xi) - \log \pi(\theta) = \log \pi(y \mid \theta, \xi) - \log \pi(y \mid \xi)$$

  - difference between log posterior and log prior densities, or between the log-likelihood and the log-evidence
A priori (before the experiment), we do not know $y$ or $\theta$ (we will never know $\theta$).

So, we take the expectation of the utility function with respect to the joint distribution of $y, \theta$:

$$\begin{aligned} U(\xi) &= E_{y,\theta \mid \xi}\left[u(\xi, y, \theta)\right] \\ &= \int u(\xi, y, \theta)\, \pi(y, \theta \mid \xi)\, d\theta\, dy \\ &= \int u(\xi, y, \theta)\, \pi(\theta \mid y, \xi)\, \pi(y \mid \xi)\, d\theta\, dy \\ &= \int u(\xi, y, \theta)\, \pi(y \mid \theta, \xi)\, \pi(\theta \mid \xi)\, d\theta\, dy \end{aligned}$$

The equivalence of the third and fourth equations follows from Bayes theorem:

· the third equation more clearly shows the dependence on the posterior distribution
· the fourth equation is often more useful for calculations and computation

See Chaloner and Verdinelli (1995).
Surprisal

$$U(\xi) = \int \log \frac{\pi(\theta \mid y, \xi)}{\pi(\theta)}\, \pi(y, \theta \mid \xi)\, d\theta\, dy = \int \log \frac{\pi(y \mid \theta, \xi)}{\pi(y \mid \xi)}\, \pi(y, \theta \mid \xi)\, d\theta\, dy$$

– the expected Shannon information gain (SIG) or expected Kullback–Leibler divergence between prior and posterior densities.

Negative squared error loss

$$U(\xi) = -\int [\theta - E(\theta \mid y)]^2\, \pi(y, \theta \mid \xi)\, d\theta\, dy = -\int \mathrm{tr}\left\{\mathrm{Var}(\theta \mid y, \xi)\right\} \pi(y \mid \xi)\, dy$$

– the expected negative squared error loss (NSEL).
Challenges

In general, Bayesian design is easy in principle but hard in practice.

1. For most nonlinear models, the expected utility will be intractable and involves high-dimensional integrals with respect to $y$:
   · often, obtaining the utility function itself requires the solution of intractable integrals (cf both ESIG and NSEL)
   · numerical or analytical approximation is required (eg Ryan et al. 2016)
2. A high-dimensional optimisation problem results for multi-factor experiments with many design points.
Asymptotic approximations

For large $n$, the inverse information matrix $M(\xi; \theta)^{-1}$ is an asymptotic approximation to the posterior variance-covariance matrix.

Using this approximation, we can define Bayesian analogues of classical optimality criteria:

· $D$-optimality: maximise

$$U_D(\xi) = \int \log \det M(\xi; \theta)\, \pi(\theta)\, d\theta$$

  - approximation to ESIG

· $A$-optimality: maximise

$$U_A(\xi) = -\int \mathrm{tr}\, M^{-1}(\xi; \theta)\, \pi(\theta)\, d\theta$$

  - approximation to NSEL
These integrals, with respect to $\theta$, are lower dimensional and more amenable to deterministic (quadrature) approximation, eg Gotwalt, Jones, and Steinberg (2009).

The acebayes package provides functions for constructing approximations to expected utilities:

· default is to use quadrature to approximate the Bayesian $D$-optimality objective function

library(acebayes)
prior <- list(support = matrix(c(0, 0, .5, 2), nrow = 2))
logreg.util <- utilityglm(formula = ~ x, family = binomial, prior = prior)$utility
BDcrit <- function(xi) logreg.util(data.frame(x = xi))
bdopt <- optim(par = c(-1, 1), BDcrit, control = list(fnscale = -1))
bdopt$par

## [1] -1.202960  1.203132
Monte Carlo approximation

As an alternative to analytical approximations, Monte Carlo approximation to the expected utility is simple to implement and intuitively appealing:

$$\tilde{U}(\xi) = \frac{1}{B} \sum_{i=1}^{B} u(\xi, y_i, \theta_i)\,,$$

where

· $\{\theta_i, y_i\}_{i=1}^{B}$ is a random sample from $\pi(\theta, y \mid \xi)$
· $u(\xi, y, \theta)$ is, where necessary, an approximation to the utility function (often, nested Monte Carlo is required)

How to construct the approximation $\tilde{u}(\xi, y, \theta)$ is an active area of research, eg Overstall, McGree, and Drovandi (2018), Beck et al. (2018).
Optimisation

Find an optimal design using Monte Carlo:

priorMC <- function(B) cbind(rep(0, B), runif(n = B, min = .5, max = 2))
logreg.utilSIG <- utilityglm(formula = ~ x, family = binomial, prior = priorMC, criterion = "SIG")$utility
BDcritSIG <- function(xi, B = 1000) mean(logreg.utilSIG(data.frame(x = xi), B))
bdoptSIG <- optim(par = c(-1, 1), BDcritSIG, control = list(fnscale = -1))
bdoptSIG$par

## [1] -1.205382  1.182047

bdopt$par

## [1] -1.202960  1.203132

Larger Monte Carlo sample sizes will produce results more similar to the design found using quadrature (in this example).
In general, direct optimisation of the Monte Carlo approximation requires large $B$ to generate a suitably smooth objective function and/or expensive stochastic algorithms (eg genetic algorithms):

· Hamada et al. (2001)

Alternatively, the optimisation can be embedded within a simulation scheme and samples generated from the joint artificial distribution of $\xi, y, \theta$:

· take $\xi^*$, the optimal design, to be the posterior mode of the marginal distribution
· most effective for small experiments (both numbers of variables and runs)
· Müller (1999), Müller, Sansó, and De Iorio (2004)
Smoothing-based optimisation

Instead of directly maximising a Monte Carlo approximation to the expected utility, find designs via curve fitting (Müller and Parmigiani 1996):

1. Evaluate the Monte Carlo approximation $\tilde{U}(\xi)$ for a small number of designs, $\xi_1, \ldots, \xi_Q$.
2. Smooth the "data" $\{\xi_i, \tilde{U}(\xi_i)\}$, i.e. fit a statistical model, to obtain a surrogate $\hat{U}(\xi)$.
3. Find $\xi$ that maximises $\hat{U}(\xi)$.

Return to Example 1, the compartmental model:

· find a design with $n = 2$ runs, with $x_1 = 5$ fixed
· use Monte Carlo approximation to SIG for 10 values of $x_2$

library(DiceKriging)
library(DiceDesign)
n <- 10; x1 <- -0.583; x2 <- 2 * maximinSA_LHS(lhsDesign(n, 1)$design)$design - 1
u <- NULL; for(i in 1:n) u[i] <- mean(utilcomp15sig(c(x1, x2[i]), B = 1000))
par(mar = c(4, 4, 2, 2) + 0.1)
plot(12 * (x2 + 1), u, xlab = expression(x[2]), ylab = "Approx. expected SIG",
     xlim = c(0, 24), ylim = c(0, 2), pch = 16, cex = 1.5); abline(v = 12 * (x1 + 1), lwd = 2)
usmooth <- km(design = 12 * (x2 + 1), response = u, nugget = 1e-3, control = list(trace = F))
xgrid <- matrix(seq(0, 24, l = 1000), ncol = 1); pred <- predict(usmooth, xgrid, type = "SK")$mean
lines(seq(0, 24, l = 1000), pred, col = "blue", lwd = 2); abline(v = xgrid[which.max(pred), ], lty = 2)
Approximate coordinate exchange

Coordinate exchange, a version of cyclic ascent, is a popular algorithm for finding optimal designs (Meyer and Nachtsheim 1995):

· optimisation of $\xi = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ proceeds coordinate-wise, i.e. just one of the $x_{ij}$ is varied at a time

Approximate coordinate exchange (ACE) combines coordinate exchange with smoothing to find high-dimensional designs under computationally expensive approximate expected utilities (Overstall and Woods 2017):

· a nonparametric regression model (a Gaussian process) is used to smooth the Monte Carlo approximations of $\tilde{U}(\xi)$ as a function of one coordinate
  - reduces the computational burden
  - facilitates optimisation of a noisy function
Return to the multifactor logistic regression (crystallography) example.

## set up prior
priorMFL <- function(B) {
  b0 <- runif(B, -3, 3)
  b1 <- runif(B, 4, 10)
  b2 <- runif(B, 5, 11)
  b3 <- runif(B, -6, 0)
  b4 <- runif(B, -2.5, 2.5)
  cbind(b0, b1, b2, b3, b4)
}
## define the utility function
MFL.utilSIG <- utilityglm(formula = ~ x1 + x2 + x3 + x4, family = binomial,
                          prior = priorMFL, criterion = "SIG")$utility
## starting design with n=18 runs, on [-1, 1]
d <- 2 * randomLHS(18, 4) - 1
colnames(d) <- paste0("x", 1:4)
## approximate expected utility for starting design
mean(MFL.utilSIG(d, 1000))

## [1] 1.139546

## not run - quite computationally expensive
MLF.ace <- ace(utility = MFL.utilSIG, start.d = d, progress = T)
For this logistic regression example, acebayes has some designs precomputed.

pairs(optdeslrsig(18), pch = 16, labels = c(expression(x[1]), expression(x[2]),
      expression(x[3]), expression(x[4])), cex = 2)

Hierarchical logistic regression with $g = 3$ groups (blocks, eg wellplates):

pairs(optdeshlrsig(18), pch = 16, labels = c(expression(x[1]), expression(x[2]),
      expression(x[3]), expression(x[4])),
      col = c("black", "red", "blue")[rep(1:3, rep(6, 3))], cex = 2)
Computer experiments

Introduction

Many physical and social processes can be approximated by computer codes which encapsulate mathematical models:

· eg partial differential equations solved using finite element methods
· eg reaction kinetics modelling in computational biology, in-silico chemistry

– computer code: numerical implementation of the mathematical model

Key feature: the model does not have a closed form; it can only be evaluated numerically, and this is typically (relatively) expensive.

We will focus on deterministic computer models.
Computer experiments

Assumption: $g(\mathbf{x})$ can only be evaluated numerically; i.e. $g(\mathbf{x})$ can be computed for a given $\mathbf{x}$ but the general form is unknown.

How do we learn about the function $g(\mathbf{x})$?

In an analogy to a physical system, we experiment on $g(\mathbf{x})$, i.e.

· choose a design $\xi = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$
· evaluate $g(\mathbf{x}_i)$ (run the computer code)

Use the "data" $\{\mathbf{x}_i, g(\mathbf{x}_i)\}$ to build statistical models linking $\mathbf{x}$ and $g(\mathbf{x})$:

· called emulators; typically use a Gaussian process

See Santner, Williams, and Notz (2003).
(Very) simple example

Climate modelling involves the solution of many intractable equations, leading to mathematical models evaluated via computationally expensive computer codes:

· lots of applications of computer experiments

We will illustrate methods on a very simple example: a time-stepping advective/diffusive surface layer meridional EBM (energy balance model).

See https://wiki.aston.ac.uk/foswiki/bin/view/MUCM/SurfebmModel

· 2D earth with no land
· each surface object has a percentage of ice cover
· different albedo (fraction of solar energy reflected) for ice vs non-ice surfaces
· ocean circulation is explicitly modelled (cf Atlantic gulf stream)
· two variables: $x_1$ – solar constant; $x_2$ – non-ice albedo
· output is mean temperature
## design and data are in 'ebm'
library(akima)
fld <- interp(x = ebm$x1, y = ebm$x2, z = ebm$y)
filled.contour(x = fld$x, y = fld$y, z = fld$z, asp = 1)
Space-filling designs

As we will see later, emulators are usually constructed using nonparametric statistical models.

This choice leads naturally to using space-filling designs:

· such designs do not rely on the functional form of the relationship between the code inputs and the response
· good coverage is important for prediction (we will predict "better" near points where we have already run the computer model)

Common designs are chosen to optimise some space-filling metric, or formed from (stratified) random sampling.

Space-filling designs do not have replication, so are ideal for deterministic computer models.
Uniform designs

Many designs proposed for computer experiments are related to ideas underpinning quadrature, and the approximation of an expectation.

Let $\bar{g} = \frac{1}{n}\sum_{i=1}^{n} g(\mathbf{x}_i)$, the sample mean of $g(\cdot)$ for $\xi$. Then

$$\left| E_{\mathbf{x}}[g(\mathbf{x})] - \bar{g} \right| \le \text{constant} \times D(\xi)\,,$$

where $D(\xi)$ is the star discrepancy of the design:

· $D(\xi)$ is a measure of the uniformity of the design points

This relationship leads to the criterion of design selection via minimising discrepancy:

· $D(\xi)$ is difficult to compute for moderate to high numbers of dimensions
· therefore, it is more common to minimise the related centred $L_2$-discrepancy

Fang, Li, and Sudjianto (2006), Ch.3
Designs based on measures of distance

Two sensible criteria for the selection of a space-filling design are:

· make sure no two points in the design are too close together
· make sure no point in the design region is too far from a design point

(Johnson, Moore, and Ylvisaker 1990)

The Euclidean distance between points $\mathbf{x}$ and $\mathbf{x}'$ is given by

$$\delta(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{k} \left(x_j - x'_j\right)^2}\,.$$
Mm and mM designs

Using Euclidean distance, we can define:

· maximin (Mm) criterion: maximise

$$\min_{\mathbf{x}_i, \mathbf{x}_j \in \xi} \delta(\mathbf{x}_i, \mathbf{x}_j)$$

· minimax (mM) criterion: minimise

$$\max_{\mathbf{x}} \delta(\mathbf{x}, \xi)\,,$$

where the distance between a point $\mathbf{x}$ and a design $\xi$ is defined as

$$\delta(\mathbf{x}, \xi) = \min_{\mathbf{x}_j \in \xi} \delta(\mathbf{x}, \mathbf{x}_j)\,.$$

Roughly speaking, an Mm design spreads out the design points, and an mM design covers the design region.

Intuitively, covering the design region seems more desirable (eg for prediction), but optimising the mM objective function is computationally challenging. Hence, Mm designs are more commonly used.
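Both criteria are easy to evaluate for a given design; a minimal sketch in R, where the random design and the grid used to approximate the minimax criterion over $[0, 1]^2$ are illustrative choices:

mm.crit <- function(X) min(dist(X))   ## maximin criterion: smallest pairwise distance
mM.crit <- function(X, grid) {        ## approximate minimax criterion over a grid
  d <- as.matrix(dist(rbind(grid, X)))[seq_len(nrow(grid)), -seq_len(nrow(grid))]
  max(apply(d, 1, min))               ## largest nearest-design-point distance
}
set.seed(3)
X <- matrix(runif(20), ncol = 2)      ## a random 10-point design in [0, 1]^2
grid <- as.matrix(expand.grid(seq(0, 1, l = 21), seq(0, 1, l = 21)))
c(maximin = mm.crit(X), minimax = mM.crit(X, grid))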
Latin hypercube designs

For high-dimensional problems, space-filling is difficult:

· many points are required to adequately space-fill a high-dimensional space (curse of dimensionality)

Latin hypercube designs (LHDs) are randomly chosen sets of points with the restriction of uniform one-dimensional projections (McKay, Beckman, and Conover 1979):

· each variable has no overlapping points, and good coverage (compare with a factorial design, which has hidden replication)
· can be easily constructed using permutations of integers

An LHD only guarantees space-filling properties in each one-dimensional projection, not overall. So we normally combine the Latin hypercube principle with a space-filling criterion, eg to find a Mm LHD.

LH <- function(n = 3, d = 2) {
  D <- NULL
  for(i in 1:d) D <- cbind(D, sample(1:n, n))
  D
}
set.seed(4)
par(mar = c(5, 6, 2, 4) + 0.1, pty = "s")
plot((LH() - .5) / 3, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2)
abline(v = c(0, 1/3, 2/3, 1), lty = 2)
abline(h = c(0, 1/3, 2/3, 1), lty = 2)
The DiceDesign package has functions to generate various LHDs.

library(DiceDesign)
lhs.d <- lhsDesign(9, 2)
plot(lhs.d$design, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2,
     main = "random", cex.main = 2)
abline(v = seq(0, 9) / 9, lty = 2)
abline(h = seq(0, 9) / 9, lty = 2)

discrep.d <- discrepSA_LHS(lhs.d$design, criterion = "C2")
plot(discrep.d$design, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2,
     main = "discrepancy", cex.main = 2)
abline(v = seq(0, 9) / 9, lty = 2)
abline(h = seq(0, 9) / 9, lty = 2)

maximin.d <- maximinSA_LHS(discrep.d$design)
plot(maximin.d$design, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2,
     main = "maximin", cex.main = 2)
abline(v = seq(0, 9) / 9, lty = 2)
abline(h = seq(0, 9) / 9, lty = 2)
The design for the EBM example is a Mm LHD.

par(mar = c(5, 6, 2, 4) + 0.1, pty = "s")
plot(ebm[, 2:3], xlim = c(-1, 1), ylim = c(-1, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, asp = 1)
abline(v = 2 * seq(0:20) / 20 - 1, lty = 2)
abline(h = 2 * seq(0:20) / 20 - 1, lty = 2)
Gaussian process

The most common statistical model used to emulate computer models is the Gaussian process (GP):

· flexible, nonparametric regression model (few assumptions made about $g(\mathbf{x})$)
· naturally allows for uncertainty quantification (eg prediction intervals)
· interpolates observed responses

An intuitive way to think about a GP is as a prior for the unknown function $g(\mathbf{x})$ within a Bayesian framework.
We say that

$$g(\mathbf{x}) \sim GP\left(f(\mathbf{x})^T\beta,\; \sigma^2 \kappa(\mathbf{x}, \mathbf{x}'; \theta)\right)\,,$$

where $f(\mathbf{x})^T\beta$ is the mean, $\kappa(\mathbf{x}, \mathbf{x}'; \theta)$ is the correlation function, $\theta$ is the vector of correlation parameters and $\sigma^2$ is the constant variance, if:

· any vector $\mathbf{g} = (g(\mathbf{x}_1), \ldots, g(\mathbf{x}_m))^T$ satisfies

$$\mathbf{g} \sim N\left(F\beta,\; \sigma^2 K(\theta)\right)\,,$$

with $F$ a model matrix and $K$ the $m \times m$ covariance matrix defined by $K(\theta)_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j; \theta)$.

See Rasmussen and Williams (2006).
Typically, very simple mean functions are chosen for the GP, eg

· constant: $f(\mathbf{x})^T\beta = \beta_0$ (sometimes called ordinary kriging)
· linear: $f(\mathbf{x})^T\beta = \beta_0 + \sum_{j=1}^{k} \beta_j x_j$ (universal kriging)

The most commonly used correlation functions are separable and stationary:

· squared exponential:

$$\kappa(\mathbf{x}, \mathbf{x}'; \theta) = \exp\left[-\sum_j \left(\frac{|x_j - x'_j|}{\theta_j}\right)^2\right]$$

· Matérn $\nu = 5/2$:

$$\kappa(\mathbf{x}, \mathbf{x}'; \theta) = \prod_j \left(1 + \frac{\sqrt{5}\,|x_j - x'_j|}{\theta_j} + \frac{5}{3}\left(\frac{|x_j - x'_j|}{\theta_j}\right)^2\right) \exp\left(-\frac{\sqrt{5}\,|x_j - x'_j|}{\theta_j}\right)$$

The Matérn function can be defined for other values of $\nu$; for $\nu \to \infty$, the squared exponential function is obtained.
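A sketch of the two correlation functions as R code (the arguments are assumed to be numeric vectors of length $k$; the test values are illustrative):

kappa.sqexp <- function(x, xp, theta) exp(-sum(((x - xp) / theta)^2))
kappa.matern52 <- function(x, xp, theta) {
  h <- abs(x - xp) / theta
  prod((1 + sqrt(5) * h + (5 / 3) * h^2) * exp(-sqrt(5) * h))
}
kappa.sqexp(c(0, 0), c(0.5, 0.2), theta = c(1, 1))
kappa.matern52(c(0, 0), c(0.5, 0.2), theta = c(1, 1))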
Given model evaluations $\mathbf{g} = [g(\mathbf{x}_1), \ldots, g(\mathbf{x}_n)]$, a posterior GP can be obtained:

$$g(\mathbf{x}) \mid \mathbf{g}, \beta, \theta, \sigma^2 \sim N\left(m(\mathbf{x}),\, s^2(\mathbf{x})\right)$$

· $m(\mathbf{x}) = f(\mathbf{x})^T\beta + \kappa_n^T K^{-1}(\mathbf{g} - F\beta)$
· $s^2(\mathbf{x}) = \sigma^2\left(1 - \kappa_n^T K^{-1} \kappa_n\right)$

where $\kappa_n = [\kappa(\mathbf{x}, \mathbf{x}_i; \theta)]_{i=1}^{n}$ is a vector of correlations between $g(\mathbf{x})$ and $g(\mathbf{x}_1), \ldots, g(\mathbf{x}_n)$.

The updating of the prior mean and variance depends on the "distance" between $\mathbf{x}$ and the points in $\xi$:

· the posterior mean will be adjusted more for points closer to the design
· predictions at these points will have smaller posterior variance
If $\mathbf{x} = \mathbf{x}_i$ (so we are predicting at a design point), $K^{-1}\kappa_n = e_i$, the $i$th unit vector:

· $m(\mathbf{x}_i) = f(\mathbf{x}_i)^T\beta + e_i^T(\mathbf{g} - F\beta) = g(\mathbf{x}_i)$
· $s^2(\mathbf{x}_i) = \sigma^2(1 - \kappa_n^T e_i) = \sigma^2(1 - \kappa(\mathbf{x}_i, \mathbf{x}_i; \theta)) = 0$

The posterior GP interpolates – exactly what you want for a deterministic computer code.

Inference unconditional on all the hyperparameters requires numerical approximation (eg Markov chain Monte Carlo):

· it is common to estimate the parameters, eg using maximum likelihood, to "plug-in" to the posterior predictive distribution
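The posterior mean and variance formulas, including the interpolation property, can be verified directly; a minimal sketch assuming a constant mean $f(\mathbf{x})^T\beta = \beta_0$ and the squared exponential correlation function (all names and values are illustrative):

sqexp <- function(x, xp, theta) exp(-((x - xp) / theta)^2)
gp.post <- function(x, xi, g, beta0, sigma2, theta) {
  K <- outer(xi, xi, sqexp, theta = theta)        ## n x n correlation matrix
  kn <- sqexp(x, xi, theta)                       ## correlations with the new point
  m <- beta0 + sum(kn * solve(K, g - beta0))      ## posterior mean m(x)
  s2 <- sigma2 * (1 - sum(kn * solve(K, kn)))     ## posterior variance s^2(x)
  c(mean = m, var = s2)
}
xi <- c(0.1, 0.5, 0.9); g <- sin(2 * pi * xi)
gp.post(0.5, xi, g, beta0 = 0, sigma2 = 1, theta = 0.2)  ## interpolates: mean = g(0.5), var = 0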
A simple example: $g(x) = \sin(2\pi x)$, using the DiceKriging package.

library(DiceDesign)
library(DiceKriging)
xi <- lhsDesign(6, 1)$design
y <- sin(2 * pi * xi)
gp <- km(design = xi, response = y, control = list(trace = F))
xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp, newdata = xs, type = "SK")

plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "", xlab = "x")
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
legend(x = "topright", legend = c("posterior mean of g", "posterior quantiles for g",
       expression(paste("observed data ", g(x[i])))), lty = c(1, 2, NA),
       pch = c(NA, NA, 4), lwd = c(4, 4, 4), col = c("red", "black", "blue"))
Return to the EBM example.

gpebm <- km(formula = ~., design = ebm[, 2:3], response = ebm[, 1], control = list(trace = F))
gpebm

## 
## Call:
## km(formula = ~., design = ebm[, 2:3], response = ebm[, 1], control = list(trace = F))
## 
## Trend coeff.:
##              Estimate
## (Intercept)   16.3262
## x1             2.4078
## x2           -28.9973
## 
## Covar. type : matern5_2 
## Covar. coeff.:
##            Estimate
## theta(x1)    2.8829
## theta(x2)    0.2722
## 
## Variance estimate: 2.215045

xs1 <- sort(c(seq(-1, 1, length = 10), ebm[, 2]))
xs2 <- sort(c(seq(-1, 1, length = 10), ebm[, 3]))
xs <- expand.grid(x1 = xs1, x2 = xs2)
gppebm <- predict(gpebm, newdata = xs, type = "UK")
filled.contour(x = xs1, y = xs2, z = matrix(gppebm$mean, nrow = length(xs1)))
filled.contour(x = xs1, y = xs2, z = matrix(gppebm$sd, nrow = length(xs1)))
Bayesian optimisation

A common task is optimisation of $g(\mathbf{x})$.

When $g(\mathbf{x})$ is computationally expensive to evaluate, computer experiments and emulators can be used to facilitate the optimisation.

The field of Bayesian optimisation uses sequentially collected evaluations of $g(\mathbf{x})$:

· place a prior distribution (eg GP) on $g(\mathbf{x})$
· collect function evaluations at points chosen sequentially via an acquisition function
· update the prior to a posterior distribution, and infer the maximum/minimum of $g(\mathbf{x})$

Uncertainty in the posterior (i.e. for $g(\mathbf{x})$ at unobserved $\mathbf{x}$) leads to an exploration/exploitation trade-off.

The most common acquisition function is expected improvement (EI); see Jones, Schonlau, and Welch (1998).
For a deterministic computer model and a minimisation problem, the improvement from performing one more run is given by

$$\max\left(g_{\min} - g(\mathbf{x}),\, 0\right)\,,$$

where $g_{\min}$ is the minimum across the model runs performed to date.

This quantity is a random variable – we are uncertain about $g(\mathbf{x})$ at a point we have not observed.

EI chooses $\mathbf{x}$ to maximise

$$E_g\left[\max\left(g_{\min} - g(\mathbf{x}),\, 0\right) ;\, \mathbf{g}\right] = \left[g_{\min} - m(\mathbf{x})\right] \Phi\left(\frac{g_{\min} - m(\mathbf{x})}{s(\mathbf{x})}\right) + s(\mathbf{x})\,\phi\left(\frac{g_{\min} - m(\mathbf{x})}{s(\mathbf{x})}\right)\,,$$

where $\phi$ and $\Phi$ are the standard normal pdf and cdf, respectively.

EI is a decreasing function of $m(\mathbf{x})$ and an increasing function of $s^2(\mathbf{x})$, so it leads to choosing design points that either minimise the posterior mean or maximise the posterior variance:

· experiment either where our uncertainty is high or near where we predict the minimum to be (explore or exploit)
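A direct transcription of the EI formula as R code (a sketch; the EI() function from the DiceOptim package, used below, computes this quantity from a fitted km model):

ei <- function(m, s, gmin) {
  z <- (gmin - m) / s
  (gmin - m) * pnorm(z) + s * dnorm(z)   ## [g_min - m(x)] Phi(z) + s(x) phi(z)
}
ei(m = 0.5, s = 0.3, gmin = 0.2)  ## larger for smaller m or larger s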
A simple example: $g(x) = \sin(2\pi x)$, but with a different starting design, using DiceOptim.

xi <- matrix(c(0.1, 0.8, 0.9), ncol = 1)
fn <- function(x) sin(2 * pi * x)
y <- fn(xi)
gp <- km(design = xi, response = y, control = list(trace = F))
xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp, newdata = xs, type = "SK")

plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "", xlab = "x")
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
library(DiceOptim)
xin <- max_EI(model = gp, lower = 0, upper = 1)$par

## 
## 
## Mon Aug 26 15:55:26 2019
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+00 
## 
## NOTE: The total number of operators greater than population size
## NOTE: I'm increasing the population size to 10 (operators+1).
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  0
##  (2) Uniform Mutation..................  1
##  (3) Boundary Mutation.................  1
##  (4) Non-Uniform Mutation..............  1
##  (5) Polytope Crossover................  1
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  1
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## HARD Maximum Number of Generations: 12
## Maximum Nonchanging Generations: 2
## Population size       : 10
## Convergence Tolerance: 1.000000e-21
## 
## Using the BFGS Derivative Based Optimizer on the Best Individual Each Generation.
EI(xin, gp)

## [1] 0.1211283

plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "",
     xlab = "x", cex.lab = 2)
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
abline(v = xin)
plot(xs, sapply(xs, EI, model = gp), type = "l", lwd = 3, ylab = "", xlab = "x", cex.lab = 2)
xi <- rbind(xi, xin)
y <- c(y, fn(xin))
gp2 <- km(design = xi, response = y, control = list(trace = F))
xin <- max_EI(model = gp2, lower = 0, upper = 1, control = list(trace = F))$par

## 
## 
## Mon Aug 26 15:55:26 2019
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+00 
## 
## NOTE: The total number of operators greater than population size
## NOTE: I'm increasing the population size to 10 (operators+1).
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  0
##  (2) Uniform Mutation..................  1
##  (3) Boundary Mutation.................  1
##  (4) Non-Uniform Mutation..............  1
##  (5) Polytope Crossover................  1
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  1
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## HARD Maximum Number of Generations: 12
## Maximum Nonchanging Generations: 2
## Population size       : 10
## Convergence Tolerance: 1.000000e-21
EI(xin, gp2)

## [1] 0.052808

xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp2, newdata = xs, type = "SK")
plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "",
     xlab = "x", cex.lab = 2)
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
abline(v = xin)
plot(xs, sapply(xs, EI, model = gp2), type = "l", lwd = 3, ylab = "", xlab = "x", cex.lab = 2)
xi <- rbind(xi, xin)
y <- c(y, fn(xin))
gp3 <- km(design = xi, response = y, control = list(trace = F))
xin <- max_EI(model = gp3, lower = 0, upper = 1, control = list(trace = F))$par

## 
## 
## Mon Aug 26 15:55:26 2019
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+00 
## 
## NOTE: The total number of operators greater than population size
## NOTE: I'm increasing the population size to 10 (operators+1).
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  0
##  (2) Uniform Mutation..................  1
##  (3) Boundary Mutation.................  1
##  (4) Non-Uniform Mutation..............  1
##  (5) Polytope Crossover................  1
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  1
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## HARD Maximum Number of Generations: 12
## Maximum Nonchanging Generations: 2
## Population size       : 10
## Convergence Tolerance: 1.000000e-21
EI(xin, gp3)

## [1] 0.0005309279

xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp3, newdata = xs, type = "SK")
plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "",
     xlab = "x", cex.lab = 2)
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
abline(v = xin)
plot(xs, sapply(xs, EI, model = gp3), type = "l", lwd = 3, ylab = "", xlab = "x", cex.lab = 2)
Sensitivity analysis

To analyse the relative importance of different model inputs, functional analysis of variance (FANOVA) can be used to decompose $g(\mathbf{x})$ into an additive form (see Saltelli et al. 2008).

Assume $x_i \sim w_i(x_i)$, eg the input variables may be normally distributed or uniformly distributed. Then

$$g(\mathbf{x}) = \mu_0 + \sum_{i=1}^{k} \mu_i(\mathbf{x}) + \sum\sum_{i>j} \mu_{ij}(\mathbf{x}) + \ldots + \mu_{12\cdots k}(\mathbf{x})\,,$$

where

$$\mu_0 = \int g(\mathbf{x}) \prod_i w_i(x_i)\, d\mathbf{x}\,, \quad \mu_i(\mathbf{x}) = \int g(\mathbf{x}) \prod_{j \ne i} w_j(x_j)\, dx_j - \mu_0\,,$$

and

$$\mu_{ij}(\mathbf{x}) = \int g(\mathbf{x}) \prod_{l \ne i,j} w_l(x_l)\, dx_l - \mu_i(\mathbf{x}) - \mu_j(\mathbf{x}) - \mu_0\,,$$

with higher-order terms defined similarly.
A (normalised) measure of the impact of each variable can be obtained by assessing the variance of each term $\mu_i$ via a Sobol' index:

$$S_i = \frac{\mathrm{Var}_{x_i}\{\mu_i(\mathbf{x})\}}{\mathrm{Var}_{\mathbf{x}}\{g(\mathbf{x})\}} \in [0, 1]$$

The impact of higher-order terms can be assessed via the total variance index:

$$T_i = 1 - \frac{\mathrm{Var}_{\mathbf{x}_{(i)}}\{\mu_{-i}(\mathbf{x})\}}{\mathrm{Var}_{\mathbf{x}}\{g(\mathbf{x})\}} \in [0, 1]\,,$$

where $\mathbf{x}_{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k)^T$ and

$$\mu_{-i}(\mathbf{x}) = \int g(\mathbf{x})\, w_i(x_i)\, dx_i - \mu_0\,.$$
measures the "main effect" of the th variable, and measures it's total effect including through"interactions" with other variables
The sensitivity indices can be estimated using Monte Carlo methods (essentially nested MC to estimatethe expectations and variances)
Various methods are implemented in the sensitivity package
Si i Ti
big differences between and suggest the th variable interacts with other variables· Si Ti i
for computationally expensive models, sample from a GP emulator·
125/129
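A brute-force sketch of estimating $S_i$ for a cheap toy function with independent $U(0, 1)$ inputs (the function, sample sizes, and grid here are invented for illustration; this is not the efficient scheme used below):

g.toy <- function(x) x[, 1] + 2 * x[, 2] + x[, 1] * x[, 2]
B <- 10000
x <- matrix(runif(2 * B), ncol = 2)
vtot <- var(g.toy(x))                      ## Var_x{g(x)}
S <- sapply(1:2, function(i) {
  xi <- seq(0.005, 0.995, by = 0.01)
  ## E[g | x_i] estimated by fixing x_i and averaging over the other input
  mu.i <- sapply(xi, function(v) { xx <- x; xx[, i] <- v; mean(g.toy(xx)) })
  var(mu.i) / vtot                         ## Var_{x_i}{mu_i} / Var_x{g}
})
S  ## main-effect indices for x1 and x2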
EBM example:

library(sensitivity)
d <- 2; n <- 1000
X1 <- data.frame(matrix(runif(d * n), nrow = n))
X2 <- data.frame(matrix(runif(d * n), nrow = n))
colnames(X1) <- colnames(ebm)[2:3]; colnames(X2) <- colnames(ebm)[2:3]
res <- sobolGP(model = gpebm, type = "UK", MCmethod = "sobol2002", X1, X2)
res

## 
## Method: sobol2002
## 
## Model runs: 20 
## 
## Number of GP realizations: 100 
## 
## Kriging type: UK 
## 
##      estimate   std. error  min. c.i.  max. c.i.
## S1 0.09603572 0.0006806932 0.09478208 0.09728398
## S2 0.98825845 0.0007460611 0.98702092 0.98961288
## 
##      estimate   std. error  min. c.i.  max. c.i.
## T1 0.01232153 0.0005711950 0.01126854 0.01360374
## T2 0.93075208 0.0005361857 0.92958659 0.93178356
Uncertainty quantification

Computer experiments are an important statistical contribution to the field of uncertainty quantification (UQ):

· interdisciplinary topic on the interface of Statistics and Applied Maths
· methodologies for taking account of uncertainties when mathematical and computer models are used to describe real-world phenomena

Space-filling designs and (GP) emulators are very general, and can be applied to a variety of black-box computer models:

· typically require a lot less knowledge about the model than alternative methods from numerical analysis (although at some loss of efficiency)

GP emulators can be used as priors for Bayesian calibration of computer models (Kennedy and O'Hagan 2001):

· eg learning tuning parameters (cf parameter estimation, albeit with various important nuances around interpretation and physical understanding)
· data fusion: combining computer model runs and data from real experiments
References

Atkinson, A. C., A. N. Donev, and R. D. Tobias. 2007. Optimum Experimental Design, with SAS. 2nd ed. Oxford: Oxford University Press.

Basu, D. 1980. "Randomization Analysis of Experimental Data: The Fisher Randomization Test." Journal of the American Statistical Association 75: 575–82.

Beck, L., B. Mansour Dia, L. F. R. Espath, Q. Long, and R. Tempone. 2018. "Fast Bayesian Experimental Design: Laplace-Based Importance Sampling for the Expected Information Gain." Computer Methods in Applied Mechanics and Engineering 334: 523–53.

Box, G. E. P., and R. D. Meyer. 1986. "An Analysis of Unreplicated Fractional Factorials." Technometrics 28: 11–18.

Chaloner, K., and I. Verdinelli. 1995. "Bayesian Experimental Design: A Review." Statistical Science 10: 273–304.

Cox, D. R., and N. Reid. 2000. The Theory of the Design of Experiments. Boca Raton: Chapman and Hall/CRC Press.

Cuthbert, D. 1959. "Use of Half-Normal Plots in Interpreting Factorial Two-Level Experiments." Technometrics 1: 311–41.

Dasgupta, T., N. S. Pillai, and D. B. Rubin. 2015. "Causal Inference from $2^K$ Factorial Designs by Using Potential Outcomes." Journal of the Royal Statistical Society B 77: 727–53.

Fang, K.-T., R. Li, and A. Sudjianto. 2006. Design and Modelling for Computer Experiments. Boca Raton: Chapman and Hall/CRC Press.