APTS Design of Experiments, September 2019

Dave Woods ([email protected] ; http://www.southampton.ac.uk/~davew) Statistical Sciences Research Institute, University of Southampton

Preliminaries

These notes and other resources can be found at https://statsdavew.github.io/apts.doe/

A few example designs and data sets for this module are available in the R package apts.doe, which can be installed from GitHub:

library(devtools)
install_github("statsdavew/apts.doe", quiet = T)
library(apts.doe)

References will be provided throughout but some good general purpose texts are:

· Atkinson, Donev and Tobias (2007). Optimum Experimental Design, with SAS. OUP.
· Wu and Hamada (2009). Experiments: Planning, Analysis, and Parameter Design Optimization (2nd ed.). Wiley.
· Morris (2011). Design of Experiments: An Introduction Based on Linear Models. Chapman and Hall/CRC Press.
· Santner, Williams and Notz (2019). The Design and Analysis of Computer Experiments (2nd ed.). Springer.
Motivation and background

Modes of data collection

· Observational studies
· Sample surveys
· Designed experiments
Experiments

Definition: An experiment is a procedure whereby controllable factors, or features, of a system or process are deliberately varied in order to understand the impact of these changes on one or more measurable responses.

· "prehistory": Bacon, Lind, Peirce, … (establishing the scientific method)
· agriculture (1920s)
· clinical trials (1940s)
· industry (1950s)
· in-silico (1980s)
Role of experimentation

Why do we experiment?

· key to the scientific method (hypothesis – experiment – observe – infer – conclude)
· potential to establish causality …
· … and to understand/improve complex systems depending on many factors
· comparison of treatments, factor screening, prediction, optimisation, …

Design of experiments: a statistical approach to the arrangement of the operational details of the experiment (eg sample size, specific experimental conditions investigated, …) so that the quality of the answers to be derived from the data is as high as possible.
Motivating examples

1. Multi-factor experiment in pharmaceutical development.

Key to developing new medicines is the identification of optimal and robust process conditions (e.g. settings of temperature, pressure etc.) at which the active pharmaceutical ingredient should be synthesized.

[Somewhat confusingly, the FDA refer to this as identification of a "design space".]

An important step in this methodology is a robustness experiment to assess the sensitivity of identified conditions to changes in all (or at least very many) controllable factors.

While developing a new melanoma drug, GlaxoSmithKline performed an experiment to investigate sensitivity to 20 factors. Their experimental budget allowed only 10 individual experiments (runs) to be performed.
Motivating examples

2. Optimal design to calibrate a physical model.

Physical (mechanistic, mathematical, …) models are used in many scientific fields. Typically, they are derived from fundamental understanding of the physics, chemistry, biology …

Most commonly, these models are solutions to differential equations. The models usually contain unknown parameters that should be estimated from experimental data.

Biologists at Southampton were studying the transfer of amino acids between mother and baby through the placenta. They could control the times at which observations were taken and the initial concentrations of amino acids (see Overstall, Woods, and Parker 2019).
Motivating examples

3. Computer experiments to optimise ride performance in luxury cars.

Suspension settings can be used to improve the ride performance in cars. Optimising settings across many different car models would take many hundreds of hours of testing, so computer simulations are used.

Jaguar-Land Rover wanted to find suspension settings robust across different car models using a computer experiment (KTN workshop).
Simple motivating example

Consider an experiment to compare two treatments (eg drugs, diets, fertilisers).

We have $n$ subjects (eg people, mice, plots of land), each of which can be assigned to one of the two treatments.

A response (eg protein measurement, weight, yield) is then measured from each subject.

Question: How should the two treatments be assigned to the subjects to gain the most precise inference about the difference in expected response from the two treatments?
Assume a linear model for the response

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\,, \quad i = 1, \ldots, n\,,$$

with $\varepsilon_i \sim N(0, \sigma^2)$ independently, unknown parameters $\beta_0, \beta_1$ and

$$x_i = \begin{cases} -1 & \text{if treatment 1 is applied to subject } i\,, \\ +1 & \text{if treatment 2 is applied to subject } i\,. \end{cases}$$

The difference in expected response between treatments 1 and 2 is

$$E(y_i \mid x_i = +1) - E(y_i \mid x_i = -1) = \beta_0 + \beta_1 - \beta_0 + \beta_1 = 2\beta_1\,.$$

So we need the most precise possible estimator of $\beta_1$.
Both $\beta_0$ and $\beta_1$ can be estimated using least squares (or equivalently maximum likelihood).

Writing

$$y = X\beta + \varepsilon\,,$$

we obtain estimators

$$\hat{\beta} = (X^T X)^{-1} X^T y\,,$$

with

$$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}\,.$$

In this simple example, we are interested in estimating $\beta_1$, and we have

$$\mathrm{Var}(\hat{\beta}_1) = \frac{n\sigma^2}{n \sum x_i^2 - \left(\sum x_i\right)^2} = \frac{n\sigma^2}{n^2 - \left(\sum x_i\right)^2}\,.$$
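The variance formula can be checked numerically; a minimal sketch for an illustrative unbalanced allocation with $n = 10$ and $n_1 = 3$ (the values here are invented for the example):

n <- 10; n1 <- 3
x <- c(rep(-1, n1), rep(1, n - n1))
X <- cbind(1, x)
solve(t(X) %*% X)[2, 2]   ## Var(hat beta_1) / sigma^2, directly from (X^T X)^{-1}
n / (n^2 - sum(x)^2)      ## the formula above: identical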
Hence, we need to pick $x_1, \ldots, x_n$ to minimise $\left(\sum x_i\right)^2$.

· Denote by $n_1$ the number of subjects assigned to treatment 1, and by $n_2$ the number assigned to treatment 2, with $n_1 + n_2 = n$; then $\left(\sum x_i\right)^2 = (n_1 - n_2)^2$.
· It is obvious that $\sum x_i = 0$ if and only if $n_1 = n_2$.

Assuming $n$ is even, the "optimal design" has $n_1 = n_2 = n/2$.

For $n$ odd, let $n_1 = \frac{n+1}{2}$ and $n_2 = \frac{n-1}{2}$.

We can assess a design, labelled $\xi$, via its efficiency relative to the optimal design $\xi^\star$:

$$\mathrm{Eff}(\xi) = \frac{\mathrm{Var}(\hat{\beta}_1 \mid \xi^\star)}{\mathrm{Var}(\hat{\beta}_1 \mid \xi)}\,.$$
n <- 50
eff <- function(n1) 1 - ((2 * n1 - n) / n)^2
curve(eff, from = 0, to = n, ylab = "Eff", xlab = expression(n[1]))
Definitions

· Treatment – entities of scientific interest to be studied in the experiment, eg varieties of crop, doses of a drug, combinations of temperature and pressure
· Unit – smallest subdivision of the experimental material such that two units may receive different treatments, eg plots of land, subjects in a clinical trial, samples of reagent
· Run – application of a treatment to a unit
Example

Fabrication of integrated circuits (see Wu and Hamada 2009)

An initial step in fabricating integrated circuits is the growth of an epitaxial layer on polished silicon wafers via chemical deposition.

· Unit: set of six wafers (mounted in a rotating cylinder)
· Treatment: combination of settings of the factors
  - A: rotation method ($x_1$)
  - B: nozzle position ($x_2$)
  - C: deposition temperature ($x_3$)
  - D: deposition time ($x_4$)
A unit-treatment statistical model

$$y_{ij} = \tau_i + \varepsilon_{ij}\,, \quad i = 1, \ldots, t; \; j = 1, \ldots, n_i\,,$$

where

· $y_{ij}$: measured response from the $j$th unit to which treatment $i$ has been applied
· $\tau_i$: treatment effect (expected response from application of the $i$th treatment)
· $\varepsilon_{ij}$: random deviation from the expected response [typically $\varepsilon_{ij} \sim N(0, \sigma^2)$]

The aims of the experiment are achieved by estimating comparisons between the treatment effects, $\tau_k - \tau_l$.

Experimental precision and accuracy are largely obtained through control and comparison.
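A small simulated illustration of this model in R (a sketch; the number of treatments, replication and effect values are invented for the example):

set.seed(1)
tau <- c(10, 12, 9)                                   ## illustrative treatment effects
d <- data.frame(treatment = factor(rep(1:3, each = 4)))
d$y <- tau[d$treatment] + rnorm(nrow(d), sd = 0.5)    ## y_ij = tau_i + eps_ij
coef(lm(y ~ 0 + treatment, data = d))                 ## least squares estimates of the tau_i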
Model assumptions

Three key model assumptions are:

· additivity (response = treatment effect + unit effect)
· constancy of treatment effects (treatment effect does not depend on the unit to which it is applied)
· no interference between units (the effect of a treatment applied to unit $j$ does not depend on the treatment applied to any other unit)

See Dasgupta, Pillai, and Rubin (2015) for discussion of these assumptions for factorial experiments.
Principles of experimentation

Stratification (blocking)

· account for systematic differences between batches of experimental units by arranging them in homogeneous sets (blocks)
  - if the same treatment was applied to all units, within-block variation in the response would be much less than between-block
  - compare treatments within the same block and hence eliminate block effects

Replication

· the application of each treatment to multiple experimental units
  - provides an estimate of experimental error against which to judge treatment differences
  - reduces the variance of the estimators of treatment differences
Randomisation

Randomisation is perhaps the key principle in the design of experiments.

· we randomise features such as the allocation of units to treatments, the order in which treatments are applied, …
  - protects against lurking (uncontrolled) variables (model-robust) and subjectivity in the allocation of treatments to units
· it protects against model misspecification (bias), and hence allows causality to be established
  - a clear difference between treatments can only be an accident of the randomisation or a consequence of the treatments
· unbiased estimation of $\tau$ and $\sigma^2$, even if the errors are not normally distributed
· exact tests for differences between treatment effects are available (Basu 1980)
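A minimal sketch of generating a completely randomised allocation in R (the sample size and treatment labels are illustrative, not from the notes):

set.seed(42)
## randomly allocate n = 12 units to two treatments, 6 each
allocation <- sample(rep(c("T1", "T2"), each = 6))
data.frame(unit = 1:12, treatment = allocation)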
Without randomisation, unobserved confounders ($U$) can induce a dependency between the response ($Y$) and treatment ($T$).

cf Cox and Reid (2000), p.35
With randomisation, unobserved confounders ($U$) are independent of the treatment ($T$). Marginalisation over $U$ does not induce an edge between $T$ and $Y$.

cf Cox and Reid (2000), p.35
Factorial designs

Example revisited

Fabrication of integrated circuits (Wu and Hamada 2009, p155)

Assume each factor has two levels, coded -1 and +1.

· Treatment: combination of settings of the factors
  - A: rotation method ($x_1$)
  - B: nozzle position ($x_2$)
  - C: deposition temperature ($x_3$)
  - D: deposition time ($x_4$)
Treatments and a regression model

Each factor has two levels, $x_k = \pm 1$, $k = 1, \ldots, 4$.

A treatment is then defined as a combination of four values of $-1, +1$:

· eg $x_1 = -1, x_2 = -1, x_3 = +1, x_4 = -1$ specifies a setting of the process

Assume each treatment effect is determined by a regression model in the four factors, eg

$$\tau(\mathbf{x}) = \beta_0 + \sum_{i=1}^{4} \beta_i x_i + \sum_{j=1}^{4} \sum_{i > j}^{4} \beta_{ij} x_i x_j\,.$$
(Two-level) Factorial design

with(cirfab, cirfab[order(x1, x2, x3, x4), ])

##    x1 x2 x3 x4     ybar
## 2  -1 -1 -1 -1 13.58983
## 1  -1 -1 -1  1 14.59000
## 4  -1 -1  1 -1 14.04983
## 3  -1 -1  1  1 14.24000
## 6  -1  1 -1 -1 13.94000
## 5  -1  1 -1  1 14.65000
## 8  -1  1  1 -1 14.14017
## 7  -1  1  1  1 14.40000
## 10  1 -1 -1 -1 13.72000
## 9   1 -1 -1  1 14.67000
## 12  1 -1  1 -1 13.90000
## 11  1 -1  1  1 13.84017
## 14  1  1 -1 -1 13.87983
## 13  1  1 -1  1 14.56000
## 16  1  1  1 -1 14.11017
## 15  1  1  1  1 14.30000

· treatments in standard order
· $\bar{y}$ – average response from the six wafers
Regression model and least squares

$$Y = X\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 I)\,, \quad \hat{\beta} = (X^T X)^{-1} X^T Y$$

· model matrix $X$ has columns corresponding to intercept, linear and cross-product terms
· information matrix $X^T X = nI$
· regression coefficients are estimated by independent contrasts in the data

cirfab.lm <- lm(ybar ~ (.) ^ 2, data = cirfab)
coef(cirfab.lm)

##  (Intercept)           x1           x2           x3           x4        x1:x2 
## 14.161250000 -0.038729167  0.086270833 -0.038708333  0.245020833  0.003708333 
##        x1:x3        x1:x4        x2:x3        x2:x4        x3:x4 
## -0.046229167 -0.025000000  0.028770833 -0.015041667 -0.172520833
Main effects and interactions

Main effect of $x_k$:

$$[\text{Avg. response when } x_k = 1] - [\text{Avg. response when } x_k = -1]$$

Interaction between $x_j$ and $x_k$:

$$[\text{Avg. response when } x_j x_k = 1] - [\text{Avg. response when } x_j x_k = -1]$$

Higher-order interactions defined similarly.

Assuming -1, +1 coding, there is a straightforward relationship between factorial effects and regression coefficients:

· main effect of $x_k$ is equal to $2\beta_k$
· interaction between $x_j$ and $x_k$ is equal to $2\beta_{jk}$
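A quick check of this relationship using the cirfab data and the model fitted above (a sketch, assuming the columns are numeric $\pm 1$ as in the printed design):

me.x4 <- with(cirfab, mean(ybar[x4 == 1]) - mean(ybar[x4 == -1]))  ## main effect of x4
c(main.effect = me.x4, twice.coef = 2 * coef(cirfab.lm)["x4"])     ## equal, by the identity above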
Using the effects package:

library(effects)
plot(Effect("x1", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)
plot(Effect("x2", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)
plot(Effect("x3", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)
plot(Effect("x4", cirfab.lm), main = "", rug = F, ylim = c(13.5, 14.5), aspect = 1)

Main effects
Interactions

plot(Effect(c("x3", "x4"), cirfab.lm), main = "", rug = F, ylim = c(13.5, 15), x.var = "x4")
Orthogonality

$X^T X = nI \Rightarrow$ the $\hat{\beta}$ are independently normally distributed with equal variance.

Hence, we can treat the identification of important effects (ie large $\hat{\beta}$) as an outlier identification problem (Cuthbert 1959):

· plot (absolute) ordered factorial effects against (absolute) quantiles from a standard normal
· outlying effects are identified as important
Using the FrF2 package:

library(FrF2)
par(pty = "s", mar = c(8, 4, 1, 2))
DanielPlot(cirfab.lm, main = "", datax = F, half = T)
Replication

An unreplicated factorial design provides no model-independent estimate of $\sigma^2$ (Gilmour and Trinca 2012).

· any unsaturated model does provide an estimate, but it may be biased by ignored (significant) model terms
· this is one reason why graphical (or associated) analysis methods are popular

Replication also increases the power of the design.

· common to replicate a centre point
· allows a portmanteau test of curvature
Refit a linear model to the fabricated circuits experiment excluding interactions.

cirfab2.lm <- lm(ybar ~ (.), data = cirfab)
coef(cirfab.lm)

##  (Intercept)           x1           x2           x3           x4        x1:x2 
## 14.161250000 -0.038729167  0.086270833 -0.038708333  0.245020833  0.003708333 
##        x1:x3        x1:x4        x2:x3        x2:x4        x3:x4 
## -0.046229167 -0.025000000  0.028770833 -0.015041667 -0.172520833

coef(cirfab2.lm)

## (Intercept)          x1          x2          x3          x4 
## 14.16125000 -0.03872917  0.08627083 -0.03870833  0.24502083

cbind(sigma1 = summary(cirfab.lm)$sigma, df1 = summary(cirfab.lm)$df[2],
      sigma2 = summary(cirfab2.lm)$sigma, df2 = summary(cirfab2.lm)$df[2])

##        sigma1 df1    sigma2 df2
## [1,] 0.137152   5 0.2396108  11
Principles of factorial experimentation

Effect sparsity

· the number of important effects in a factorial experiment is small relative to the total number of effects investigated (cf Box and Meyer 1986)

Effect hierarchy

· lower-order effects are more likely to be important than higher-order effects
· effects of the same order are equally likely to be important

Effect heredity

· interactions where at least one parent main effect is important are more likely to be important themselves

Wu and Hamada (2009), p.172
Regular fractional factorial designs

Choosing subsets of treatments

$2^m$ factorial designs can require a large number of runs for only a moderate number of factors ($2^5 = 32$).

Resource constraints (eg cost) may mean not all combinations can be run.

Lots of degrees of freedom are devoted to estimating higher-order interactions:

· eg in a $2^5$ experiment, 16 degrees of freedom are used to estimate three-factor and higher-order interactions
· principles of effect hierarchy and sparsity suggest this may be wasteful

Need to trade-off what you want to estimate against the number of runs you can afford.
Example

Production of bacteriocin (Morris 2011, p231)

· bacteriocin is a natural food preservative grown from bacteria

· Unit: a single bio-reaction
· Treatment: combination of settings of the factors
  - A: amount of glucose ($x_1$)
  - B: initial inoculum size ($x_2$)
  - C: level of aeration ($x_3$)
  - D: temperature ($x_4$)
  - E: amount of sodium ($x_5$)

Assume each factor has two levels, coded -1 and +1.
Find an $n = 8$ run design using FrF2.

· $8 = 32/4 = 2^5/2^2 = 2^{5-2}$
· we need a principled way of choosing one-quarter of the runs from the $2^5$ factorial design that leads to clarity in the analysis

bact.design <- FrF2(8, 5, factor.names = paste0("x", 1:5),
                    generators = list(c(1, 3), c(2, 3)), randomize = F, alias.info = 3)
bact.design

##   x1 x2 x3 x4 x5
## 1 -1 -1 -1  1  1
## 2  1 -1 -1 -1  1
## 3 -1  1 -1  1 -1
## 4  1  1 -1 -1 -1
## 5 -1 -1  1 -1 -1
## 6  1 -1  1  1 -1
## 7 -1  1  1 -1  1
## 8  1  1  1  1  1
## class=design, type= FrF2.generators
Assuming the number of runs is a power of two, $n = 2^{k-q}$, we can construct $2^{k-q} - 1$ orthogonal vectors (with inner product zero), spanned by $k - q = \log_2(n)$ vectors:

· construct the full factorial design for $k - q$ factors
· assign the remaining $q$ factors to interaction columns

model.matrix(~ (x1 + x2 + x3) ^ 3, bact.design[, 1:3])[, ]

##   (Intercept) x11 x21 x31 x11:x21 x11:x31 x21:x31 x11:x21:x31
## 1           1  -1  -1  -1       1       1       1          -1
## 2           1   1  -1  -1      -1      -1       1           1
## 3           1  -1   1  -1      -1       1      -1           1
## 4           1   1   1  -1       1      -1      -1          -1
## 5           1  -1  -1   1       1      -1      -1           1
## 6           1   1  -1   1      -1       1      -1          -1
## 7           1  -1   1   1      -1      -1       1          -1
## 8           1   1   1   1       1       1       1           1
Aliasing scheme

The design has been deliberately chosen so that

· $x_4 = x_1 x_3$
· $x_5 = x_2 x_3$

[$x_1 x_2$ is shorthand for the Hadamard (Schur or entrywise) product of two vectors, $x_1 \circ x_2$]

What other consequences are there?

· $x_4 x_5 = x_1 x_3 x_2 x_3 = x_1 x_2 x_3^2$
· the product of any column with itself is the constant column (the identity)
· hence, $x_4 x_5 = x_1 x_2$
Now we can obtain the defining relation

· $I = x_1 x_3 x_4 = x_2 x_3 x_5 = x_1 x_2 x_4 x_5$

and the complete aliasing scheme

· $x_1 = x_3 x_4 = x_1 x_2 x_3 x_5 = x_2 x_4 x_5$
· $x_2 = x_1 x_2 x_3 x_4 = x_3 x_5 = x_1 x_4 x_5$
· $x_3 = x_1 x_4 = x_2 x_5 = x_1 x_2 x_3 x_4 x_5$
· $x_4 = x_1 x_3 = x_2 x_3 x_4 x_5 = x_1 x_2 x_5$
· $x_5 = x_1 x_3 x_4 x_5 = x_2 x_3 = x_1 x_2 x_4$
· $x_1 x_2 = x_2 x_3 x_4 = x_1 x_3 x_5 = x_4 x_5$
· $x_1 x_5 = x_3 x_4 x_5 = x_1 x_2 x_3 = x_2 x_4$
· …
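The generators can be verified numerically from the design itself; a minimal sketch (assuming, as is the FrF2 default, that the columns are stored as factors with levels -1 and 1):

## numeric +/-1 version of the design
d.num <- sapply(bact.design, function(z) as.numeric(as.character(z)))
all(d.num[, "x4"] == d.num[, "x1"] * d.num[, "x3"])                  ## TRUE
all(d.num[, "x5"] == d.num[, "x2"] * d.num[, "x3"])                  ## TRUE
all(d.num[, "x4"] * d.num[, "x5"] == d.num[, "x1"] * d.num[, "x2"])  ## TRUE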
FrF2 will summarise the aliasing amongst main effects and two- and three-factor interactions.

design.info(bact.design)$aliased

## $legend
## [1] "A=x1" "B=x2" "C=x3" "D=x4" "E=x5"
## 
## $main
## [1] "A=CD=BDE" "B=CE=ADE" "C=AD=BE"  "D=AC=ABE" "E=BC=ABD"
## 
## $fi2
## [1] "AB=DE=ACE=BCD" "AE=BD=ABC=CDE"
## 
## $fi3
## [1] "ACD=BCE"
The alias matrix

What is the consequence of this aliasing?

· assumed model $Y = X_1\beta_1 + \varepsilon$
· true model $Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$

$$\begin{aligned} E(\hat{\beta}_1) &= (X_1^T X_1)^{-1} X_1^T E(Y) \\ &= (X_1^T X_1)^{-1} X_1^T (X_1\beta_1 + X_2\beta_2) \\ &= \beta_1 + (X_1^T X_1)^{-1} X_1^T X_2 \beta_2 \\ &= \beta_1 + A\beta_2 \end{aligned}$$

$A$ is the alias matrix.

· if the columns of $X_1$ and $X_2$ are not orthogonal, $\hat{\beta}_1$ is biased

If more than one effect in each alias string is non-zero, the least squares estimators will be biased.
For the example $2^{5-2}$ design:

· $X_1$ is an $8 \times 6$ matrix with columns for the intercept and five linear terms ("main effects")
· $X_2$ is an $8 \times 10$ matrix with columns for the 10 product terms ("two-factor interactions")

$$A = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

For a regular design, the matrix $A$ will only have entries $0, \pm 1$ (no aliasing or complete aliasing).
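The alias matrix can be computed directly from the two model matrices; a sketch using the numeric version of the design (d.num) constructed above:

X1 <- model.matrix(~ ., data.frame(d.num))            ## intercept + main effects
X2 <- model.matrix(~ (.)^2, data.frame(d.num))[, -(1:6)]  ## the 10 product columns
A <- solve(t(X1) %*% X1) %*% t(X1) %*% X2
round(A, 10)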
The transpose of the alias matrix is provided by the alias function.

ff.alias <- alias(y ~ (.)^2, data = data.frame(bact.design, y = vector(length = 8)))
ff.alias$Complete

##         (Intercept) x11 x21 x31 x41 x51 x11:x21 x11:x51
## x21:x31           0   0   0   0   0   1       0       0
## x21:x41           0   0   0   0   0   0       0       1
## x21:x51           0   0   0   1   0   0       0       0
## x31:x41           0   1   0   0   0   0       0       0
## x31:x51           0   0   1   0   0   0       0       0
## x41:x51           0   0   0   0   0   0       1       0
## x11:x31           0   0   0   0   1   0       0       0
## x11:x41           0   0   0   1   0   0       0       0
These linear dependencies can be seen if we attempt to fit a linear model.

bact.lm <- lm(yB ~ (x1 + x2 + x3 + x4 + x5)^2, data = bact)
summary(bact.lm)

## 
## Call:
## lm.default(formula = yB ~ (x1 + x2 + x3 + x4 + x5)^2, data = bact)
## 
## Residuals:
## ALL 8 residuals are 0: no residual degrees of freedom!
## 
## Coefficients: (8 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.42625         NA      NA       NA
## x1           0.26625         NA      NA       NA
## x2           0.24625         NA      NA       NA
## x3          -0.22875         NA      NA       NA
## x4          -1.11875         NA      NA       NA
## x5          -0.66375         NA      NA       NA
## x1:x2       -0.05375         NA      NA       NA
## x1:x3             NA         NA      NA       NA
## x1:x4             NA         NA      NA       NA
## x1:x5       -0.13375         NA      NA       NA
## x2:x3             NA         NA      NA       NA
## x2:x4             NA         NA      NA       NA
## x2:x5             NA         NA      NA       NA
## x3:x4             NA         NA      NA       NA
## x3:x5             NA         NA      NA       NA
## x4:x5             NA         NA      NA       NA
The role of fractional factorial designs in a sequential strategy

Typically, in a first experiment, fractional factorial designs are used in screening:

· investigate which of many factors have a substantive effect on the response
· main effects and two-factor interactions
· centre points to check for curvature

At second and later stages, augment the design:

· to resolve ambiguities due to the aliasing of factorial effects ("break the alias strings")
· to allow estimation of curvature and prediction from a more complex model
D-optimality and non-regular designs

Introduction

Regular fractional factorial designs have the number of runs equal to a power of the number of levels:

· eg $2^{5-2}$, $3^{3-1}$
· this inflexibility in run sizes can be a problem in practical experiments

Non-regular designs can have any number of runs (usually with $n > p$, the number of parameters to be estimated).

Often the clarity provided by a regular design is lost:

· no defining relation or straightforward aliasing scheme
· partial aliasing and fractional entries in $A$

One approach to finding non-regular designs is via a design optimality criterion.
D-optimality

Notation: let $\xi = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ denote a design (choice of treatments and their replications).

Assuming the model $Y = X\beta + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2 I_n)$, a D-optimal design maximises

$$\phi(\xi) = \det\left(X^T X\right)\,.$$

That is, a D-optimal design maximises the determinant of the (expected) Fisher information matrix:

· equivalent to minimising the volume of the joint confidence ellipsoid for $\beta$

Also useful to define a Bayesian version, with $R$ a prior precision matrix (see later):

$$\phi_B(\xi) = \det\left(X^T X + R\right)\,.$$
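As an illustration of the criterion (a sketch; the random comparison design and seed are invented for the example), $\det(X^T X)$ can be computed for the $2^4$ factorial design in cirfab and for a random two-level design of the same size, under the main-effects model:

Dcrit <- function(X) det(crossprod(X))   ## det(X^T X)
X.fact <- model.matrix(~ x1 + x2 + x3 + x4, cirfab)
set.seed(1)
X.rand <- cbind(1, matrix(sample(c(-1, 1), 16 * 4, replace = TRUE), nrow = 16))
c(factorial = Dcrit(X.fact), random = Dcrit(X.rand))  ## orthogonal design has the larger determinant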
Comments

D-optimal designs are model dependent:

· if the model (ie the columns of $X$) changes, the optimal design may change
· model-robust design is an active area of research

D-optimality promotes orthogonality in the $X$ matrix:

· if there are sufficient runs, the D-optimal design will usually be orthogonal
· for particular models and choices of $n$, regular fractional factorial designs are D-optimal

There are many other optimality criteria, tailored to other experimental goals:

· prediction, model discrimination, space-filling, …
Example: Plackett-Burman design

$k = 11$ factors in $n = 12$ runs, first-order (main effects) model (Plackett and Burman 1946).

A particular D-optimal design is the following orthogonal array, using the pb function in the FrF2 package:

pb.design <- pb(12, factor.names = paste0("x", 1:11))
pb.design

##    x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11
## 1  -1  1 -1  1  1 -1  1  1  1  -1  -1
## 2  -1 -1  1 -1  1  1 -1  1  1   1  -1
## 3  -1 -1 -1  1 -1  1  1 -1  1   1   1
## 4   1  1  1 -1 -1 -1  1 -1  1   1  -1
## 5  -1  1  1 -1  1  1  1 -1 -1  -1   1
## 6   1 -1 -1 -1  1 -1  1  1 -1   1   1
## 7  -1  1  1  1 -1 -1 -1  1 -1   1   1
## 8  -1 -1 -1 -1 -1 -1 -1 -1 -1  -1  -1
## 9   1 -1  1  1 -1  1  1  1 -1  -1  -1
## 10  1  1 -1  1  1  1 -1 -1 -1   1  -1
## 11  1 -1  1  1  1 -1 -1 -1  1  -1   1
## 12  1  1 -1 -1 -1  1 -1  1  1  -1   1
## class=design, type= pb
This 12-run PB design is probably the most studied non-regular design:

· orthogonal columns
· complex aliasing between main effects and two-factor interactions

pb.alias <- alias(y ~ (.)^2, data = data.frame(pb.design, y = vector(length = 12)))
head(pb.alias$Complete, n = 15)

##          (Intercept)  x11  x21  x31  x41  x51  x61  x71  x81  x91 x101 x111
## x11:x21            0    0    0 -1/3 -1/3 -1/3  1/3 -1/3 -1/3  1/3  1/3 -1/3
## x11:x31            0    0 -1/3    0  1/3 -1/3 -1/3  1/3 -1/3  1/3 -1/3 -1/3
## x11:x41            0    0 -1/3  1/3    0  1/3  1/3 -1/3 -1/3 -1/3 -1/3 -1/3
## x11:x51            0    0 -1/3 -1/3  1/3    0 -1/3 -1/3 -1/3 -1/3  1/3  1/3
## x11:x61            0    0  1/3 -1/3  1/3 -1/3    0 -1/3  1/3 -1/3 -1/3 -1/3
## x11:x71            0    0 -1/3  1/3 -1/3 -1/3 -1/3    0  1/3 -1/3  1/3 -1/3
## x11:x81            0    0 -1/3 -1/3 -1/3 -1/3  1/3  1/3    0 -1/3 -1/3  1/3
## x11:x91            0    0  1/3  1/3 -1/3 -1/3 -1/3 -1/3 -1/3    0 -1/3  1/3
## x11:x101           0    0  1/3 -1/3 -1/3  1/3 -1/3  1/3 -1/3 -1/3    0 -1/3
## x11:x111           0    0 -1/3 -1/3 -1/3  1/3 -1/3 -1/3  1/3  1/3 -1/3    0
## x21:x31            0 -1/3   0    0 -1/3 -1/3 -1/3  1/3 -1/3 -1/3  1/3  1/3
## x21:x41            0 -1/3   0 -1/3    0  1/3 -1/3 -1/3  1/3 -1/3  1/3 -1/3
## x21:x51            0 -1/3   0 -1/3  1/3    0  1/3  1/3 -1/3 -1/3 -1/3 -1/3
## x21:x61            0  1/3   0 -1/3 -1/3  1/3    0 -1/3 -1/3 -1/3 -1/3  1/3
## x21:x71            0 -1/3   0  1/3 -1/3  1/3 -1/3    0 -1/3  1/3 -1/3 -1/3
Example: supersaturated design

Screening designs with fewer runs than factors (see Woods and Lewis 2017):

· can't use ordinary least squares/maximum likelihood as $X$ does not have full column rank
· Bayesian D-optimality with $R = [0 \mid \tau I_m]$

Supersaturated experiment used by GlaxoSmithKline in the development of a new oncology drug:

· $k = 16$ factors: e.g. temperature, solvent amount, reaction time
· $n = 10$ runs
· Bayesian D-optimal design with $\tau = 0.2$
ssd

##    x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16
## 1   1  1 -1  1  1 -1 -1 -1 -1  -1  -1   1   1   1   1   1
## 2   1  1  1 -1 -1 -1 -1 -1  1   1  -1  -1   1  -1  -1   1
## 3  -1 -1  1 -1  1 -1  1 -1 -1   1  -1  -1  -1   1   1  -1
## 4  -1  1  1  1  1 -1 -1  1 -1  -1   1  -1  -1  -1  -1  -1
## 5  -1 -1 -1 -1 -1  1 -1  1  1  -1  -1  -1  -1  -1   1   1
## 6   1  1  1 -1  1  1 -1  1  1   1   1   1   1   1   1  -1
## 7  -1 -1  1  1 -1 -1  1 -1  1  -1   1   1   1  -1   1  -1
## 8   1 -1 -1  1  1  1  1  1 -1   1  -1   1   1  -1  -1  -1
## 9  -1  1 -1  1 -1  1  1 -1  1   1   1  -1  -1   1  -1   1
## 10  1 -1  1 -1 -1 -1  1  1 -1  -1   1   1  -1   1  -1   1
Partial aliasing between main effects

Heatmap of column correlations:

library(fields)
par(mar = c(8, 2, 0, 0))
image.plot(1:16, 1:16, cor(ssd), zlim = c(-1, 1), xlab = "Factors", ylab = "", asp = 1, axes = F)
axis(1, at = seq(2, 16, by = 2), line = .5)
axis(2, at = seq(2, 16, by = 2), line = -5)

Analysis via regularised (shrinkage) methods (eg lasso, Dantzig selector; see APTS High Dimensional Statistics):

· small coefficients shrunk to zero
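A sketch of such an analysis using the lasso (via glmnet); ssd contains no response, so one is simulated here with two active factors purely for illustration:

library(glmnet)
set.seed(2)
y.sim <- 2 * ssd[, "x4"] - 1.5 * ssd[, "x5"] + rnorm(10, sd = 0.25)  ## invented response
fit <- cv.glmnet(as.matrix(ssd), y.sim, nfolds = 5)
coef(fit, s = "lambda.min")  ## most coefficients shrunk to exactly zero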
Bayesian optimal design

Introduction

Now consider a more general class of models (cf preliminary material).

Let $y = (y_1, \ldots, y_n)^T$ be iid observations from a distribution with density/mass function $\pi(y_i; \theta, \mathbf{x}_i)$:

· $\theta$ is a $q$-vector of unknown parameters
· $\mathbf{x}_i = (x_{1i}, \ldots, x_{ki})^T$ is a vector of values of $k$ controllable variables.

The (expected) information matrix

$$M(\theta) = E_y\left[-\frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta^T}\right]$$

is an important quantity for design, where $l(\theta) = \sum_{i=1}^{n} \log \pi(y_i; \theta, \mathbf{x}_i)$ (the log-likelihood).

· $M(\theta)$ is the (asymptotic) precision for the maximum likelihood estimators $\hat{\theta}$.
· $M(\theta)$ is also an asymptotic approximation to the posterior precision for $\theta$ in a Bayesian analysis.
Pharmacokinetics

Example 1: Compartmental model (Ryan et al. 2014)

$$y_i \sim N\left(c(\theta)\,\mu(\theta; x_i),\; \sigma^2 \nu(\theta; x_i)\right)\,, \quad x_i \in [0, 24]\,,$$

with

$$\mu(\theta; x) = \exp(-\theta_1 x) - \exp(-\theta_2 x)\,, \quad c(\theta) = \frac{400\,\theta_2}{\theta_3(\theta_2 - \theta_1)}\,, \quad \nu(\theta; x) = 1 + \frac{\tau^2}{\sigma^2}\, c(\theta)^2 \mu(\theta; x)^2\,,$$

for $\theta_1, \theta_2, \theta_3, \tau^2, \sigma^2 > 0$.

Prior distributions (for later use):

· $\log \theta_i \sim N(m_i, 0.05)$, with $m_1 = \log 0.1$, $m_2 = 0$, $m_3 = \log 20$
comp <- function(x, theta, D = 400) {
  mu <- exp(-theta[1] * x) - exp(-theta[2] * x)
  c <- (D / theta[3]) * (theta[2]) / (theta[2] - theta[1])
  c * mu
}
theta <- c(.1, 1, 20)
M <- 100
par(mar = c(6, 4, 0, 1) + .1)
lapply(1:M, function(l) {
  thetat <- rlnorm(3, log(theta), rep(0.05, 3))
  curve(comp(x, theta = thetat), from = 0, to = 24, ylab = "Expected concentration",
        xlab = "Time", ylim = c(0, 20), xlim = c(0, 24), add = l != 1)
})
Crystallography

Example 2: multi-factor experiments to build (hierarchical) logistic regression models for pharmaceutical salt formation.

Four controllable variables:

· rate of agitation during mixing ($x_1$)
· volume of composition ($x_2$)
· temperature ($x_3$)
· evaporation rate ($x_4$)
For the $j$th observation in the $i$th group $(i = 1, \ldots, g;\; j = 1, \ldots, n_g)$:

$$y_{ij} \sim \text{Bernoulli}\left(\rho(\mathbf{x}_{ij})\right)\,,$$

with

$$\log\left(\frac{\rho(\mathbf{x}_{ij})}{1 - \rho(\mathbf{x}_{ij})}\right) = (\beta_0 + \omega_{i0}) + \sum_{r=1}^{k} (\beta_r + \omega_{ir})\, x_{ijr}\,,$$

where $x_{ijr}$ is the value taken by the $r$th variable.

· $\beta = (\beta_0, \beta_1, \ldots, \beta_{q-1})^T$ are unknown parameters of interest
· $\omega_i = (\omega_{i0}, \omega_{i1}, \ldots, \omega_{i,q-1})^T$ are group-specific parameters for the $i$th group

Prior distributions (for later use):

· $\beta_0 \sim U(-3, 3)$, $\beta_1 \sim U(4, 10)$, $\beta_2 \sim U(5, 11)$, $\beta_3 \sim U(-6, 0)$, $\beta_4 \sim U(-2.5, 3.5)$

1. standard logistic regression – $\omega_{ir} = 0$
2. hierarchical logistic regression – $\omega_{ir} \sim U(-s_r, +s_r)$, with $s_r > 0$ following a triangular distribution
Classical optimal designs

Many Frequentist criteria for finding optimal designs for both linear and nonlinear models optimise a function of the information matrix; see Atkinson, Donev, and Tobias (2007), ch.10.

Let $\xi = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T$ denote a design, and set $M(\xi; \theta) = M(\theta)$ to explicitly acknowledge the dependence of the information matrix on the design.

· we have already seen $D$-optimality

· $D$-optimality: maximise $\phi_D(\xi) = \det M(\xi; \theta)$
· $A$-optimality: minimise $\phi_A(\xi) = \mathrm{trace}\, M(\xi; \theta)^{-1}$
· $G$-optimality: minimise $\phi_G(\xi) = \max_{\mathbf{x}} \mathrm{Var}(\hat{y}(\mathbf{x}))$
  - where $\hat{y}(\mathbf{x})$ is the predicted response at $\mathbf{x}$ and the (asymptotic) prediction variance is a function of $M(\xi; \theta)$
· $V$- (or $I$-) optimality: minimise $\phi_V(\xi) = \int \mathrm{Var}(\hat{y}(\mathbf{x}))\, d\mathbf{x}$
Optimal design for nonlinear models

For most nonlinear models, $M(\xi; \theta)$ will be a function of the unknown parameters $\theta$ (unlike for the linear model, where $M(\xi; \beta) = X^T X$).

This leads to a "chicken and egg" situation:

· if you can tell me the values of the unknown parameters, I can give you an optimal design
· but if you knew the value of $\theta$, you probably wouldn't need to perform the experiment!

For some models/experiments, the quality of a design may change a lot with the value of $\theta$.
A simple example

rho <- function(x, beta0 = 0, beta1 = 1) {
  eta <- beta0 + beta1 * x
  1 / (1 + exp(-eta))
}
par(mar = c(8, 4, 1, 2) + 0.1)
curve(rho, from = -5, to = 5, ylab = expression(rho), xlab = expression(italic(x)),
      cex.lab = 1.5, cex.axis = 1.5, ylim = c(0, 1), lwd = 2)
For simple logistic regression, the information matrix has the form

$$M(\xi; \beta) = X^T W X\,,$$

with $X$ the $n \times 2$ model matrix and $W = \mathrm{diag}\{\rho(x_i)[1 - \rho(x_i)]\}$.

For example, with $n = 2$, $\xi = (-1, 1)$, $\beta_0 = 0$ and $\beta_1 = 1$:

$$M(\xi; \beta) = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 0.2 & 0 \\ 0 & 0.2 \end{pmatrix} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}$$

Minfo <- function(xi, beta0 = 0, beta1 = 1) {
  X <- cbind(c(1, 1), xi)
  v <- function(x) rho(x, beta0, beta1) * (1 - rho(x, beta0, beta1))
  W <- diag(c(v(xi[1]), v(xi[2])))
  t(X) %*% W %*% X
}
Dcrit <- function(xi, beta0 = 0, beta1 = 1) {
  d <- det(Minfo(xi, beta0, beta1))
  ifelse(is.nan(d), -Inf, d)
}
Locally D-optimal designs

$\beta_0 = 0$, $\beta_1 = 1$

dopt <- optim(par = c(-1, 1), Dcrit, control = list(fnscale = -1))
xi.opt1 <- dopt$par
xi.opt1

## [1] -1.543421  1.543530
Locally D-optimal designs

$\beta_0 = 0$, $\beta_1 = 2$

dopt <- optim(par = c(-1, 1), Dcrit, control = list(fnscale = -1), beta1 = 2)
xi.opt2 <- dopt$par
xi.opt2

## [1] -0.7717705  0.7717418
Locally D-optimal designs

$\beta_0 = 0$, $\beta_1 = 0.5$

dopt <- optim(par = c(-1, 1), Dcrit, control = list(fnscale = -1), beta1 = .5)
xi.opt3 <- dopt$par
xi.opt3

## [1] -3.086913  3.086981
Getting $\beta_1$ wrong: design for $\beta_1 = 0.5$ when actually $\beta_1 = 2$.

$D$-efficiency:

(Dcrit(xi.opt3, beta1 = 2) / Dcrit(xi.opt2, beta1 = 2)) ^ (1 / 2)

## [1] 0.05720905
Use of the "wrong" design can lead to uninformative experiments (with "small" information matrices)
For the logistic regression example, the drop in efficiency is closely related to the phenomenon ofseparation (see Firth 1993)
Motivates the need for designs which are robust to the values of the model parameters
maximin designs (focus on worst case performance)
Bayesian designs
·
·
74/129
Bayesian optimal design

Decision-theoretic design starts with a utility function $u(\xi, y, \theta)$ that defines the usefulness of a design for a particular purpose, given data $y$ and parameters $\theta$.

Common choices of utility function include:

· negative squared error loss

$$u(\xi, y, \theta) = -[\theta - E(\theta \mid y)]^2$$

  - negative squared difference between $\theta$ and the posterior mean

· surprisal or self information

$$u(\xi, y, \theta) = \log \pi(\theta \mid y, \xi) - \log \pi(\theta) = \log \pi(y \mid \theta, \xi) - \log \pi(y \mid \xi)$$

  - difference between log posterior and log prior densities, or between the log-likelihood and the log-evidence
A priori (before the experiment), we do not know $y$ or $\theta$ (we will never know $\theta$).

So, we take the expectation of the utility function with respect to the joint distribution of $y, \theta$:

$$\begin{aligned} U(\xi) &= E_{y,\theta \mid \xi}\left[u(\xi, y, \theta)\right] \\ &= \int u(\xi, y, \theta)\, \pi(y, \theta \mid \xi)\, d\theta\, dy \\ &= \int u(\xi, y, \theta)\, \pi(\theta \mid y, \xi)\, \pi(y \mid \xi)\, d\theta\, dy \\ &= \int u(\xi, y, \theta)\, \pi(y \mid \theta, \xi)\, \pi(\theta \mid \xi)\, d\theta\, dy \end{aligned}$$

The equivalence of the third and fourth equations follows from Bayes theorem:

· the third equation more clearly shows the dependence on the posterior distribution
· the fourth equation is often more useful for calculations and computation

See Chaloner and Verdinelli (1995).
Surprisal

$$U(\xi) = \int \log \frac{\pi(\theta \mid y, \xi)}{\pi(\theta)}\, \pi(y, \theta \mid \xi)\, d\theta\, dy = \int \log \frac{\pi(y \mid \theta, \xi)}{\pi(y \mid \xi)}\, \pi(y, \theta \mid \xi)\, d\theta\, dy$$

– the expected Shannon information gain (SIG) or expected Kullback–Leibler divergence between prior and posterior densities.

Negative squared error loss

$$U(\xi) = -\int [\theta - E(\theta \mid y)]^2\, \pi(y, \theta \mid \xi)\, d\theta\, dy = -\int \mathrm{tr}\left\{\mathrm{Var}(\theta \mid y, \xi)\right\} \pi(y \mid \xi)\, dy$$

– the expected negative squared error loss (NSEL).
Challenges

In general, Bayesian design is easy in principle but hard in practice.

1. For most nonlinear models, the expected utility will be intractable and involves high-dimensional integrals with respect to $y$:
   · often, obtaining the utility function itself requires the solution of intractable integrals (cf both ESIG and NSEL)
   · numerical or analytical approximation is required (eg Ryan et al. 2016)
2. A high-dimensional optimisation problem results for multi-factor experiments with many design points.
Asymptotic approximations

For large $n$, the inverse information matrix $M(\xi; \theta)^{-1}$ is an asymptotic approximation to the posterior variance-covariance matrix.

Using this approximation, we can define Bayesian analogues of classical optimality criteria:

· $D$-optimality: maximise

$$U_D(\xi) = \int \log \det M(\xi; \theta)\, \pi(\theta)\, d\theta$$

  - approximation to ESIG

· $A$-optimality: maximise

$$U_A(\xi) = -\int \mathrm{tr}\, M^{-1}(\xi; \theta)\, \pi(\theta)\, d\theta$$

  - approximation to NSEL
These integrals, with respect to $\theta$, are lower dimensional and more amenable to deterministic (quadrature) approximation, eg Gotwalt, Jones, and Steinberg (2009).

The acebayes package provides functions for constructing approximations to expected utilities:

· default is to use quadrature to approximate the Bayesian $D$-optimality objective function

library(acebayes)
prior <- list(support = matrix(c(0, 0, .5, 2), nrow = 2))
logreg.util <- utilityglm(formula = ~ x, family = binomial, prior = prior)$utility
BDcrit <- function(xi) logreg.util(data.frame(x = xi))
bdopt <- optim(par = c(-1, 1), BDcrit, control = list(fnscale = -1))
bdopt$par

## [1] -1.202960  1.203132
Monte Carlo approximation

As an alternative to analytical approximations, Monte Carlo approximation to the expected utility is simple to implement and intuitively appealing:

$$\tilde{U}(\xi) = \frac{1}{B} \sum_{i=1}^{B} u(\xi, y_i, \theta_i)\,,$$

where

· $\{\theta_i, y_i\}_{i=1}^{B}$ is a random sample from $\pi(\theta, y \mid \xi)$
· $u(\xi, y, \theta)$ is, where necessary, an approximation to the utility function (often, nested Monte Carlo is required)

How to construct the approximation $\tilde{u}(\xi, y, \theta)$ is an active area of research, eg Overstall, McGree, and Drovandi (2018), Beck et al. (2018).
Optimisation

Find an optimal design using Monte Carlo:

priorMC <- function(B) cbind(rep(0, B), runif(n = B, min = .5, max = 2))
logreg.utilSIG <- utilityglm(formula = ~ x, family = binomial, prior = priorMC, criterion = "SIG")$utility
BDcritSIG <- function(xi, B = 1000) mean(logreg.utilSIG(data.frame(x = xi), B))
bdoptSIG <- optim(par = c(-1, 1), BDcritSIG, control = list(fnscale = -1))
bdoptSIG$par

## [1] -1.205382  1.182047

bdopt$par

## [1] -1.202960  1.203132

Larger Monte Carlo sample sizes will produce results more similar to the design found using quadrature (in this example).
In general, direct optimisation of the Monte Carlo approximation requires large $B$ to generate a suitably smooth objective function and/or expensive stochastic algorithms (eg genetic algorithms):

· Hamada et al. (2001)

Alternatively, the optimisation can be embedded within a simulation scheme and samples generated from the joint artificial distribution of $\xi, y, \theta$:

· take $\xi^*$, the optimal design, to be the posterior mode of the marginal distribution
· most effective for small experiments (both numbers of variables and runs)
· Müller (1999), Müller, Sansó, and De Iorio (2004)
Smoothing-based optimisation

Instead of directly maximising a Monte Carlo approximation to the expected utility, find designs via curve fitting (Müller and Parmigiani 1996):

1. Evaluate the Monte Carlo approximation $\tilde{U}(\xi)$ for a small number of designs, $\xi_1, \ldots, \xi_Q$.
2. Smooth the "data" $\{\xi_i, \tilde{U}(\xi_i)\}$, i.e. fit a statistical model, to obtain a surrogate $\hat{U}(\xi)$.
3. Find $\xi$ that maximises $\hat{U}(\xi)$.

Return to Example 1, the compartmental model:

· find a design with $n = 2$ runs, with $x_1 = 5$ fixed
· use Monte Carlo approximation to SIG for 10 values of $x_2$

library(DiceKriging)
library(DiceDesign)
n <- 10; x1 <- -0.583; x2 <- 2 * maximinSA_LHS(lhsDesign(n, 1)$design)$design - 1
u <- NULL; for(i in 1:n) u[i] <- mean(utilcomp15sig(c(x1, x2[i]), B = 1000))
par(mar = c(4, 4, 2, 2) + 0.1)
plot(12 * (x2 + 1), u, xlab = expression(x[2]), ylab = "Approx. expected SIG",
     xlim = c(0, 24), ylim = c(0, 2), pch = 16, cex = 1.5); abline(v = 12 * (x1 + 1), lwd = 2)
usmooth <- km(design = 12 * (x2 + 1), response = u, nugget = 1e-3, control = list(trace = F))
xgrid <- matrix(seq(0, 24, l = 1000), ncol = 1); pred <- predict(usmooth, xgrid, type = "SK")$mean
lines(seq(0, 24, l = 1000), pred, col = "blue", lwd = 2); abline(v = xgrid[which.max(pred), ], lty = 2)
Approximate coordinate exchange

Coordinate exchange, a version of cyclic ascent, is a popular algorithm for finding optimal designs (Meyer and Nachtsheim 1995):

· optimisation of $\xi = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ proceeds coordinate-wise, i.e. just one of the $x_{ij}$ is varied at a time

Approximate coordinate exchange (ACE) combines coordinate exchange with smoothing to find high-dimensional designs under computationally expensive approximate expected utilities (Overstall and Woods 2017):

· a nonparametric regression model (a Gaussian process) is used to smooth the Monte Carlo approximations of $\tilde{U}(\xi)$ as a function of one coordinate
  - reduces the computational burden
  - facilitates optimisation of a noisy function
Return to the multifactor logistic regression (crystallography) example.

## set up prior
priorMFL <- function(B) {
  b0 <- runif(B, -3, 3)
  b1 <- runif(B, 4, 10)
  b2 <- runif(B, 5, 11)
  b3 <- runif(B, -6, 0)
  b4 <- runif(B, -2.5, 2.5)
  cbind(b0, b1, b2, b3, b4)
}
## define the utility function
MFL.utilSIG <- utilityglm(formula = ~ x1 + x2 + x3 + x4, family = binomial,
                          prior = priorMFL, criterion = "SIG")$utility
## starting design with n=18 runs, on [-1, 1]
d <- 2 * randomLHS(18, 4) - 1
colnames(d) <- paste0("x", 1:4)
## approximate expected utility for starting design
mean(MFL.utilSIG(d, 1000))

## [1] 1.139546

## not run - quite computationally expensive
MLF.ace <- ace(utility = MFL.utilSIG, start.d = d, progress = T)
For this logistic regression example, acebayes has some designs precomputed.

pairs(optdeslrsig(18), pch = 16, labels = c(expression(x[1]), expression(x[2]),
      expression(x[3]), expression(x[4])), cex = 2)

Hierarchical logistic regression with $g = 3$ groups (blocks, eg wellplates):

pairs(optdeshlrsig(18), pch = 16, labels = c(expression(x[1]), expression(x[2]),
      expression(x[3]), expression(x[4])),
      col = c("black", "red", "blue")[rep(1:3, rep(6, 3))], cex = 2)
Computer experiments

Introduction

Many physical and social processes can be approximated by computer codes which encapsulate mathematical models:

· eg partial differential equations solved using finite element methods
· eg reaction kinetics modelling in computational biology, in-silico chemistry

– computer code: numerical implementation of the mathematical model

Key feature: the model does not have a closed form; it can only be evaluated numerically, and this is typically (relatively) expensive.

We will focus on deterministic computer models.
Computer experiments

Assumption: $g(\mathbf{x})$ can only be evaluated numerically; i.e. $g(\mathbf{x})$ can be computed for a given $\mathbf{x}$ but the general form is unknown.

How do we learn about the function $g(\mathbf{x})$?

In an analogy to a physical system, we experiment on $g(\mathbf{x})$, i.e.

· choose a design $\xi = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$
· evaluate $g(\mathbf{x}_i)$ (run the computer code)

Use the "data" $\{\mathbf{x}_i, g(\mathbf{x}_i)\}$ to build statistical models linking $\mathbf{x}$ and $g(\mathbf{x})$:

· called emulators; typically use a Gaussian process

See Santner, Williams, and Notz (2003).
(Very) simple example

Climate modelling involves the solution of many intractable equations, leading to mathematical models evaluated via computationally expensive computer codes:

· lots of applications of computer experiments

We will illustrate methods on a very simple example: a time-stepping advective/diffusive surface layer meridional EBM (energy balance model).

See https://wiki.aston.ac.uk/foswiki/bin/view/MUCM/SurfebmModel

· 2D earth with no land
· each surface object has a percentage of ice cover
· different albedo (fraction of solar energy reflected) for ice vs non-ice surfaces
· ocean circulation is explicitly modelled (cf Atlantic gulf stream)
· two variables: $x_1$ – solar constant; $x_2$ – non-ice albedo
· output is mean temperature
## design and data are in 'ebm'
library(akima)
fld <- interp(x = ebm$x1, y = ebm$x2, z = ebm$y)
filled.contour(x = fld$x, y = fld$y, z = fld$z, asp = 1)
Space-filling designs

As we will see later, emulators are usually constructed using nonparametric statistical models.

This choice leads naturally to using space-filling designs:

· such designs do not rely on the functional form of the relationship between the code inputs and the response
· good coverage is important for prediction (we will predict "better" near points where we have already run the computer model)

Common designs are chosen to optimise some space-filling metric, or formed from (stratified) random sampling.

Space-filling designs do not have replication, so are ideal for deterministic computer models.
Uniform designs

Many designs proposed for computer experiments are related to ideas underpinning quadrature, and the approximation of an expectation.

Let $\bar{g} = \frac{1}{n}\sum_{i=1}^{n} g(\mathbf{x}_i)$, the sample mean of $g(\cdot)$ for $\xi$. Then

$$\left| E_{\mathbf{x}}[g(\mathbf{x})] - \bar{g} \right| \le \text{constant} \times D(\xi)\,,$$

where $D(\xi)$ is the star discrepancy of the design:

· $D(\xi)$ is a measure of the uniformity of the design points

This relationship leads to the criterion of design selection via minimising discrepancy:

· $D(\xi)$ is difficult to compute for moderate to high numbers of dimensions
· therefore, it is more common to minimise the related centred $L_2$-discrepancy

Fang, Li, and Sudjianto (2006), Ch.3
Designs based on measures of distance

Two sensible criteria for the selection of a space-filling design are:

· make sure no two points in the design are too close together
· make sure no point in the design region is too far from a design point

(Johnson, Moore, and Ylvisaker 1990)

The Euclidean distance between points $\mathbf{x}$ and $\mathbf{x}'$ is given by

$$\delta(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{j=1}^{k} \left(x_j - x'_j\right)^2}\,.$$
Mm and mM designs

Using Euclidean distance, we can define:

· maximin (Mm) criterion: maximise

$$\min_{\mathbf{x}_i, \mathbf{x}_j \in \xi} \delta(\mathbf{x}_i, \mathbf{x}_j)$$

· minimax (mM) criterion: minimise

$$\max_{\mathbf{x}} \delta(\mathbf{x}, \xi)\,,$$

where the distance between a point $\mathbf{x}$ and a design $\xi$ is defined as

$$\delta(\mathbf{x}, \xi) = \min_{\mathbf{x}_j \in \xi} \delta(\mathbf{x}, \mathbf{x}_j)\,.$$

Roughly speaking, an Mm design spreads out the design points, and an mM design covers the design region.

Intuitively, covering the design region seems more desirable (eg for prediction), but optimising the mM objective function is computationally challenging. Hence, Mm designs are more commonly used.
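Both criteria are easy to evaluate for a given design; a minimal sketch in R, where the random design and the grid used to approximate the minimax criterion over $[0, 1]^2$ are illustrative choices:

mm.crit <- function(X) min(dist(X))   ## maximin criterion: smallest pairwise distance
mM.crit <- function(X, grid) {        ## approximate minimax criterion over a grid
  d <- as.matrix(dist(rbind(grid, X)))[seq_len(nrow(grid)), -seq_len(nrow(grid))]
  max(apply(d, 1, min))               ## largest nearest-design-point distance
}
set.seed(3)
X <- matrix(runif(20), ncol = 2)      ## a random 10-point design in [0, 1]^2
grid <- as.matrix(expand.grid(seq(0, 1, l = 21), seq(0, 1, l = 21)))
c(maximin = mm.crit(X), minimax = mM.crit(X, grid))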
Latin hypercube designs

For high-dimensional problems, space-filling is difficult:

· many points are required to adequately space-fill a high-dimensional space (curse of dimensionality)

Latin hypercube designs (LHDs) are randomly chosen sets of points with the restriction of uniform one-dimensional projections (McKay, Beckman, and Conover 1979):

· each variable has no overlapping points, and good coverage (compare with a factorial design, which has hidden replication)
· can be easily constructed using permutations of integers

An LHD only guarantees space-filling properties in each one-dimensional projection, not overall. So we normally combine the Latin hypercube principle with a space-filling criterion, eg to find a Mm LHD.

LH <- function(n = 3, d = 2) {
  D <- NULL
  for(i in 1:d) D <- cbind(D, sample(1:n, n))
  D
}
set.seed(4)
par(mar = c(5, 6, 2, 4) + 0.1, pty = "s")
plot((LH() - .5) / 3, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2)
abline(v = c(0, 1/3, 2/3, 1), lty = 2)
abline(h = c(0, 1/3, 2/3, 1), lty = 2)
The DiceDesign package has functions to generate various LHDs.

library(DiceDesign)
lhs.d <- lhsDesign(9, 2)
plot(lhs.d$design, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2,
     main = "random", cex.main = 2)
abline(v = seq(0, 9) / 9, lty = 2)
abline(h = seq(0, 9) / 9, lty = 2)

discrep.d <- discrepSA_LHS(lhs.d$design, criterion = "C2")
plot(discrep.d$design, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2,
     main = "discrepancy", cex.main = 2)
abline(v = seq(0, 9) / 9, lty = 2)
abline(h = seq(0, 9) / 9, lty = 2)

maximin.d <- maximinSA_LHS(discrep.d$design)
plot(maximin.d$design, xlim = c(0, 1), ylim = c(0, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, cex.lab = 2, cex.axis = 2, cex = 2,
     main = "maximin", cex.main = 2)
abline(v = seq(0, 9) / 9, lty = 2)
abline(h = seq(0, 9) / 9, lty = 2)
The design for the EBM example is a Mm LHD.

par(mar = c(5, 6, 2, 4) + 0.1, pty = "s")
plot(ebm[, 2:3], xlim = c(-1, 1), ylim = c(-1, 1), pty = "s", xlab = expression(x[1]),
     ylab = expression(x[2]), pch = 16, asp = 1)
abline(v = 2 * seq(0:20) / 20 - 1, lty = 2)
abline(h = 2 * seq(0:20) / 20 - 1, lty = 2)
Gaussian process

The most common statistical model used to emulate computer models is the Gaussian process (GP):

· flexible, nonparametric regression model (few assumptions made about $g(\mathbf{x})$)
· naturally allows for uncertainty quantification (eg prediction intervals)
· interpolates observed responses

An intuitive way to think about a GP is as a prior for the unknown function $g(\mathbf{x})$ within a Bayesian framework.
We say that

$$g(\mathbf{x}) \sim GP\left(f(\mathbf{x})^T\beta,\; \sigma^2 \kappa(\mathbf{x}, \mathbf{x}'; \theta)\right)\,,$$

where $f(\mathbf{x})^T\beta$ is the mean, $\kappa(\mathbf{x}, \mathbf{x}'; \theta)$ is the correlation function, $\theta$ is the vector of correlation parameters and $\sigma^2$ is the constant variance, if:

· any vector $\mathbf{g} = (g(\mathbf{x}_1), \ldots, g(\mathbf{x}_m))^T$ satisfies

$$\mathbf{g} \sim N\left(F\beta,\; \sigma^2 K(\theta)\right)\,,$$

with $F$ a model matrix and $K$ the $m \times m$ covariance matrix defined by $K(\theta)_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j; \theta)$.

See Rasmussen and Williams (2006).
Typically, very simple mean functions are chosen for the GP, eg

· constant: $f(\mathbf{x})^T\beta = \beta_0$ (sometimes called ordinary kriging)
· linear: $f(\mathbf{x})^T\beta = \beta_0 + \sum_{j=1}^{k} \beta_j x_j$ (universal kriging)

The most commonly used correlation functions are separable and stationary:

· squared exponential:

$$\kappa(\mathbf{x}, \mathbf{x}'; \theta) = \exp\left[-\sum_j \left(\frac{|x_j - x'_j|}{\theta_j}\right)^2\right]$$

· Matérn $\nu = 5/2$:

$$\kappa(\mathbf{x}, \mathbf{x}'; \theta) = \prod_j \left(1 + \frac{\sqrt{5}\,|x_j - x'_j|}{\theta_j} + \frac{5}{3}\left(\frac{|x_j - x'_j|}{\theta_j}\right)^2\right) \exp\left(-\frac{\sqrt{5}\,|x_j - x'_j|}{\theta_j}\right)$$

The Matérn function can be defined for other values of $\nu$; for $\nu \to \infty$, the squared exponential function is obtained.
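A sketch of the two correlation functions as R code (the arguments are assumed to be numeric vectors of length $k$; the test values are illustrative):

kappa.sqexp <- function(x, xp, theta) exp(-sum(((x - xp) / theta)^2))
kappa.matern52 <- function(x, xp, theta) {
  h <- abs(x - xp) / theta
  prod((1 + sqrt(5) * h + (5 / 3) * h^2) * exp(-sqrt(5) * h))
}
kappa.sqexp(c(0, 0), c(0.5, 0.2), theta = c(1, 1))
kappa.matern52(c(0, 0), c(0.5, 0.2), theta = c(1, 1))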
Given model evaluations $\mathbf{g} = [g(\mathbf{x}_1), \ldots, g(\mathbf{x}_n)]$, a posterior GP can be obtained:

$$g(\mathbf{x}) \mid \mathbf{g}, \beta, \theta, \sigma^2 \sim N\left(m(\mathbf{x}),\, s^2(\mathbf{x})\right)$$

· $m(\mathbf{x}) = f(\mathbf{x})^T\beta + \kappa_n^T K^{-1}(\mathbf{g} - F\beta)$
· $s^2(\mathbf{x}) = \sigma^2\left(1 - \kappa_n^T K^{-1} \kappa_n\right)$

where $\kappa_n = [\kappa(\mathbf{x}, \mathbf{x}_i; \theta)]_{i=1}^{n}$ is a vector of correlations between $g(\mathbf{x})$ and $g(\mathbf{x}_1), \ldots, g(\mathbf{x}_n)$.

The updating of the prior mean and variance depends on the "distance" between $\mathbf{x}$ and the points in $\xi$:

· the posterior mean will be adjusted more for points closer to the design
· predictions at these points will have smaller posterior variance
If $\mathbf{x} = \mathbf{x}_i$ (so we are predicting at a design point), $K^{-1}\kappa_n = e_i$, the $i$th unit vector:

· $m(\mathbf{x}_i) = f(\mathbf{x}_i)^T\beta + e_i^T(\mathbf{g} - F\beta) = g(\mathbf{x}_i)$
· $s^2(\mathbf{x}_i) = \sigma^2(1 - \kappa_n^T e_i) = \sigma^2(1 - \kappa(\mathbf{x}_i, \mathbf{x}_i; \theta)) = 0$

The posterior GP interpolates – exactly what you want for a deterministic computer code.

Inference unconditional on all the hyperparameters requires numerical approximation (eg Markov chain Monte Carlo):

· it is common to estimate the parameters, eg using maximum likelihood, to "plug-in" to the posterior predictive distribution
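The posterior mean and variance formulas, including the interpolation property, can be verified directly; a minimal sketch assuming a constant mean $f(\mathbf{x})^T\beta = \beta_0$ and the squared exponential correlation function (all names and values are illustrative):

sqexp <- function(x, xp, theta) exp(-((x - xp) / theta)^2)
gp.post <- function(x, xi, g, beta0, sigma2, theta) {
  K <- outer(xi, xi, sqexp, theta = theta)        ## n x n correlation matrix
  kn <- sqexp(x, xi, theta)                       ## correlations with the new point
  m <- beta0 + sum(kn * solve(K, g - beta0))      ## posterior mean m(x)
  s2 <- sigma2 * (1 - sum(kn * solve(K, kn)))     ## posterior variance s^2(x)
  c(mean = m, var = s2)
}
xi <- c(0.1, 0.5, 0.9); g <- sin(2 * pi * xi)
gp.post(0.5, xi, g, beta0 = 0, sigma2 = 1, theta = 0.2)  ## interpolates: mean = g(0.5), var = 0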
A simple example: $g(x) = \sin(2\pi x)$, using the DiceKriging package.

library(DiceDesign)
library(DiceKriging)
xi <- lhsDesign(6, 1)$design
y <- sin(2 * pi * xi)
gp <- km(design = xi, response = y, control = list(trace = F))
xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp, newdata = xs, type = "SK")

plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "", xlab = "x")
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
legend(x = "topright", legend = c("posterior mean of g", "posterior quantiles for g",
       expression(paste("observed data ", g(x[i])))), lty = c(1, 2, NA),
       pch = c(NA, NA, 4), lwd = c(4, 4, 4), col = c("red", "black", "blue"))
Return to the EBM example.

gpebm <- km(formula = ~., design = ebm[, 2:3], response = ebm[, 1], control = list(trace = F))
gpebm

## 
## Call:
## km(formula = ~., design = ebm[, 2:3], response = ebm[, 1], control = list(trace = F))
## 
## Trend coeff.:
##              Estimate
## (Intercept)   16.3262
## x1             2.4078
## x2           -28.9973
## 
## Covar. type : matern5_2 
## Covar. coeff.:
##            Estimate
## theta(x1)    2.8829
## theta(x2)    0.2722
## 
## Variance estimate: 2.215045

xs1 <- sort(c(seq(-1, 1, length = 10), ebm[, 2]))
xs2 <- sort(c(seq(-1, 1, length = 10), ebm[, 3]))
xs <- expand.grid(x1 = xs1, x2 = xs2)
gppebm <- predict(gpebm, newdata = xs, type = "UK")
filled.contour(x = xs1, y = xs2, z = matrix(gppebm$mean, nrow = length(xs1)))
filled.contour(x = xs1, y = xs2, z = matrix(gppebm$sd, nrow = length(xs1)))
Bayesian optimisation

A common task is optimisation of $g(\mathbf{x})$.

When $g(\mathbf{x})$ is computationally expensive to evaluate, computer experiments and emulators can be used to facilitate the optimisation.

The field of Bayesian optimisation uses sequentially collected evaluations of $g(\mathbf{x})$:

· place a prior distribution (eg GP) on $g(\mathbf{x})$
· collect function evaluations at points chosen sequentially via an acquisition function
· update the prior to a posterior distribution, and infer the maximum/minimum of $g(\mathbf{x})$

Uncertainty in the posterior (i.e. for $g(\mathbf{x})$ at unobserved $\mathbf{x}$) leads to an exploration/exploitation trade-off.

The most common acquisition function is expected improvement (EI); see Jones, Schonlau, and Welch (1998).
For a deterministic computer model and a minimisation problem, the improvement from performing one more run is given by

$$\max\left(g_{\min} - g(\mathbf{x}),\, 0\right)\,,$$

where $g_{\min}$ is the minimum across the model runs performed to date.

This quantity is a random variable – we are uncertain about $g(\mathbf{x})$ at a point we have not observed.

EI chooses $\mathbf{x}$ to maximise

$$E_g\left[\max\left(g_{\min} - g(\mathbf{x}),\, 0\right) ;\, \mathbf{g}\right] = \left[g_{\min} - m(\mathbf{x})\right] \Phi\left(\frac{g_{\min} - m(\mathbf{x})}{s(\mathbf{x})}\right) + s(\mathbf{x})\,\phi\left(\frac{g_{\min} - m(\mathbf{x})}{s(\mathbf{x})}\right)\,,$$

where $\phi$ and $\Phi$ are the standard normal pdf and cdf, respectively.

EI is a decreasing function of $m(\mathbf{x})$ and an increasing function of $s^2(\mathbf{x})$, so it leads to choosing design points that either minimise the posterior mean or maximise the posterior variance:

· experiment either where our uncertainty is high or near where we predict the minimum to be (explore or exploit)
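A direct transcription of the EI formula as R code (a sketch; the EI() function from the DiceOptim package, used below, computes this quantity from a fitted km model):

ei <- function(m, s, gmin) {
  z <- (gmin - m) / s
  (gmin - m) * pnorm(z) + s * dnorm(z)   ## [g_min - m(x)] Phi(z) + s(x) phi(z)
}
ei(m = 0.5, s = 0.3, gmin = 0.2)  ## larger for smaller m or larger s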
A simple example: $g(x) = \sin(2\pi x)$, but with a different starting design, using DiceOptim.

xi <- matrix(c(0.1, 0.8, 0.9), ncol = 1)
fn <- function(x) sin(2 * pi * x)
y <- fn(xi)
gp <- km(design = xi, response = y, control = list(trace = F))
xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp, newdata = xs, type = "SK")

plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "", xlab = "x")
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
library(DiceOptim)
xin <- max_EI(model = gp, lower = 0, upper = 1)$par

## 
## 
## Mon Aug 26 15:55:26 2019
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+00 
## 
## NOTE: The total number of operators greater than population size
## NOTE: I'm increasing the population size to 10 (operators+1).
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  0
##  (2) Uniform Mutation..................  1
##  (3) Boundary Mutation.................  1
##  (4) Non-Uniform Mutation..............  1
##  (5) Polytope Crossover................  1
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  1
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## HARD Maximum Number of Generations: 12
## Maximum Nonchanging Generations: 2
## Population size       : 10
## Convergence Tolerance: 1.000000e-21
## 
## Using the BFGS Derivative Based Optimizer on the Best Individual Each Generation.
EI(xin, gp)

## [1] 0.1211283

plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "",
     xlab = "x", cex.lab = 2)
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
abline(v = xin)
plot(xs, sapply(xs, EI, model = gp), type = "l", lwd = 3, ylab = "", xlab = "x", cex.lab = 2)
xi <- rbind(xi, xin)
y <- c(y, fn(xin))
gp2 <- km(design = xi, response = y, control = list(trace = F))
xin <- max_EI(model = gp2, lower = 0, upper = 1, control = list(trace = F))$par

## 
## 
## Mon Aug 26 15:55:26 2019
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+00 
## 
## NOTE: The total number of operators greater than population size
## NOTE: I'm increasing the population size to 10 (operators+1).
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  0
##  (2) Uniform Mutation..................  1
##  (3) Boundary Mutation.................  1
##  (4) Non-Uniform Mutation..............  1
##  (5) Polytope Crossover................  1
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  1
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## HARD Maximum Number of Generations: 12
## Maximum Nonchanging Generations: 2
## Population size       : 10
## Convergence Tolerance: 1.000000e-21
EI(xin, gp2)

## [1] 0.052808

xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp2, newdata = xs, type = "SK")
plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "",
     xlab = "x", cex.lab = 2)
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
abline(v = xin)
plot(xs, sapply(xs, EI, model = gp2), type = "l", lwd = 3, ylab = "", xlab = "x", cex.lab = 2)
xi <- rbind(xi, xin)
y <- c(y, fn(xin))
gp3 <- km(design = xi, response = y, control = list(trace = F))
xin <- max_EI(model = gp3, lower = 0, upper = 1, control = list(trace = F))$par

## 
## 
## Mon Aug 26 15:55:26 2019
## Domains:
##  0.000000e+00   <=  X1   <=    1.000000e+00 
## 
## NOTE: The total number of operators greater than population size
## NOTE: I'm increasing the population size to 10 (operators+1).
## 
## Data Type: Floating Point
## Operators (code number, name, population) 
##  (1) Cloning...........................  0
##  (2) Uniform Mutation..................  1
##  (3) Boundary Mutation.................  1
##  (4) Non-Uniform Mutation..............  1
##  (5) Polytope Crossover................  1
##  (6) Simple Crossover..................  2
##  (7) Whole Non-Uniform Mutation........  1
##  (8) Heuristic Crossover...............  2
##  (9) Local-Minimum Crossover...........  0
## 
## HARD Maximum Number of Generations: 12
## Maximum Nonchanging Generations: 2
## Population size       : 10
## Convergence Tolerance: 1.000000e-21
EI(xin, gp3)

## [1] 0.0005309279

xs <- sort(c(seq(0, 1, length = 100), xi))
gpp <- predict(gp3, newdata = xs, type = "SK")
plot(xs, gpp$mean, ylim = c(-2, 2), type = "l", col = "red", lwd = 3, ylab = "",
     xlab = "x", cex.lab = 2)
points(xi, y, pch = 4, lwd = 4, col = "blue")
lines(xs, gpp$upper95, lty = 2, lwd = 3)
lines(xs, gpp$lower95, lty = 2, lwd = 3)
abline(v = xin)
plot(xs, sapply(xs, EI, model = gp3), type = "l", lwd = 3, ylab = "", xlab = "x", cex.lab = 2)
Sensitivity analysis

To analyse the relative importance of different model inputs, functional analysis of variance (FANOVA) can be used to decompose $g(\mathbf{x})$ into an additive form (see Saltelli et al. 2008).

Assume $x_i \sim w_i(x_i)$, eg the input variables may be normally distributed or uniformly distributed. Then

$$g(\mathbf{x}) = \mu_0 + \sum_{i=1}^{k} \mu_i(\mathbf{x}) + \sum\sum_{i>j} \mu_{ij}(\mathbf{x}) + \ldots + \mu_{12\cdots k}(\mathbf{x})\,,$$

where

$$\mu_0 = \int g(\mathbf{x}) \prod_i w_i(x_i)\, d\mathbf{x}\,, \quad \mu_i(\mathbf{x}) = \int g(\mathbf{x}) \prod_{j \ne i} w_j(x_j)\, dx_j - \mu_0\,,$$

and

$$\mu_{ij}(\mathbf{x}) = \int g(\mathbf{x}) \prod_{l \ne i,j} w_l(x_l)\, dx_l - \mu_i(\mathbf{x}) - \mu_j(\mathbf{x}) - \mu_0\,,$$

with higher-order terms defined similarly.
A (normalised) measure of the impact of each variable can be obtained by assessing the variance of each term $\mu_i$ via a Sobol' index:

$$S_i = \frac{\mathrm{Var}_{x_i}\{\mu_i(\mathbf{x})\}}{\mathrm{Var}_{\mathbf{x}}\{g(\mathbf{x})\}} \in [0, 1]$$

The impact of higher-order terms can be assessed via the total variance index:

$$T_i = 1 - \frac{\mathrm{Var}_{\mathbf{x}_{(i)}}\{\mu_{-i}(\mathbf{x})\}}{\mathrm{Var}_{\mathbf{x}}\{g(\mathbf{x})\}} \in [0, 1]\,,$$

where $\mathbf{x}_{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_k)^T$ and

$$\mu_{-i}(\mathbf{x}) = \int g(\mathbf{x})\, w_i(x_i)\, dx_i - \mu_0\,.$$
measures the "main effect" of the th variable, and measures it's total effect including through"interactions" with other variables
The sensitivity indices can be estimated using Monte Carlo methods (essentially nested MC to estimatethe expectations and variances)
Various methods are implemented in the sensitivity package
Si i Ti
big differences between and suggest the th variable interacts with other variables· Si Ti i
for computationally expensive models, sample from a GP emulator·
125/129
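A brute-force sketch of estimating $S_i$ for a cheap toy function with independent $U(0, 1)$ inputs (the function, sample sizes, and grid here are invented for illustration; this is not the efficient scheme used below):

g.toy <- function(x) x[, 1] + 2 * x[, 2] + x[, 1] * x[, 2]
B <- 10000
x <- matrix(runif(2 * B), ncol = 2)
vtot <- var(g.toy(x))                      ## Var_x{g(x)}
S <- sapply(1:2, function(i) {
  xi <- seq(0.005, 0.995, by = 0.01)
  ## E[g | x_i] estimated by fixing x_i and averaging over the other input
  mu.i <- sapply(xi, function(v) { xx <- x; xx[, i] <- v; mean(g.toy(xx)) })
  var(mu.i) / vtot                         ## Var_{x_i}{mu_i} / Var_x{g}
})
S  ## main-effect indices for x1 and x2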
EBM example:

library(sensitivity)
d <- 2; n <- 1000
X1 <- data.frame(matrix(runif(d * n), nrow = n))
X2 <- data.frame(matrix(runif(d * n), nrow = n))
colnames(X1) <- colnames(ebm)[2:3]; colnames(X2) <- colnames(ebm)[2:3]
res <- sobolGP(model = gpebm, type = "UK", MCmethod = "sobol2002", X1, X2)
res

## 
## Method: sobol2002
## 
## Model runs: 20 
## 
## Number of GP realizations: 100 
## 
## Kriging type: UK 
## 
##      estimate   std. error  min. c.i.  max. c.i.
## S1 0.09603572 0.0006806932 0.09478208 0.09728398
## S2 0.98825845 0.0007460611 0.98702092 0.98961288
## 
##      estimate   std. error  min. c.i.  max. c.i.
## T1 0.01232153 0.0005711950 0.01126854 0.01360374
## T2 0.93075208 0.0005361857 0.92958659 0.93178356
Uncertainty quantification

Computer experiments are an important statistical contribution to the field of uncertainty quantification (UQ):

· interdisciplinary topic on the interface of Statistics and Applied Maths
· methodologies for taking account of uncertainties when mathematical and computer models are used to describe real-world phenomena

Space-filling designs and (GP) emulators are very general, and can be applied to a variety of black-box computer models:

· typically require a lot less knowledge about the model than alternative methods from numerical analysis (although at some loss of efficiency)

GP emulators can be used as priors for Bayesian calibration of computer models (Kennedy and O'Hagan 2001):

· eg learning tuning parameters (cf parameter estimation, albeit with various important nuances around interpretation and physical understanding)
· data fusion: combining computer model runs and data from real experiments
References

Atkinson, A. C., A. N. Donev, and R. D. Tobias. 2007. Optimum Experimental Design, with SAS. 2nd ed. Oxford: Oxford University Press.

Basu, D. 1980. "Randomization Analysis of Experimental Data: The Fisher Randomization Test." Journal of the American Statistical Association 75: 575–82.

Beck, L., B. Mansour Dia, L. F. R. Espath, Q. Long, and R. Tempone. 2018. "Fast Bayesian Experimental Design: Laplace-Based Importance Sampling for the Expected Information Gain." Computer Methods in Applied Mechanics and Engineering 334: 523–53.

Box, G. E. P., and R. D. Meyer. 1986. "An Analysis of Unreplicated Fractional Factorials." Technometrics 28: 11–18.

Chaloner, K., and I. Verdinelli. 1995. "Bayesian Experimental Design: A Review." Statistical Science 10: 273–304.

Cox, D. R., and N. Reid. 2000. The Theory of the Design of Experiments. Boca Raton: Chapman and Hall/CRC Press.

Cuthbert, D. 1959. "Use of Half-Normal Plots in Interpreting Factorial Two-Level Experiments." Technometrics 1: 311–41.

Dasgupta, T., N. S. Pillai, and D. B. Rubin. 2015. "Causal Inference from $2^K$ Factorial Designs by Using Potential Outcomes." Journal of the Royal Statistical Society B 77: 727–53.

Fang, K.-T., R. Li, and A. Sudjianto. 2006. Design and Modelling for Computer Experiments. Boca Raton: Chapman and Hall/CRC Press.