
arXiv:1608.01118v3 [stat.ML] 30 Aug 2018

A supermartingale approach to Gaussian process based sequential design of experiments

Julien Bect∗,1, François Bachoc2 and David Ginsbourger3,4

1Laboratoire des Signaux et Systèmes (L2S), CentraleSupélec, CNRS, Univ. Paris-Sud, Université Paris-Saclay, Gif-sur-Yvette, France. e-mail: [email protected]

2Toulouse Mathematics Institute, University Paul Sabatier, France. e-mail: [email protected]

3Uncertainty Quantification and Optimal Design group, Idiap Research Institute, Martigny, Switzerland. e-mail: [email protected]

4Institute of Mathematical Statistics and Actuarial Science, Department of Mathematics and Statistics, University of Bern, Switzerland. e-mail: [email protected]

∗Corresponding author.

Abstract: Gaussian process (GP) models have become a well-established framework for the adaptive design of costly experiments, and notably of computer experiments. GP-based sequential designs have been found practically efficient for various objectives, such as global optimization (estimating the global maximum or maximizer(s) of a function), reliability analysis (estimating a probability of failure) or the estimation of level sets and excursion sets. In this paper, we study the consistency of an important class of sequential designs, known as stepwise uncertainty reduction (SUR) strategies. Our approach relies on the key observation that the sequence of residual uncertainty measures, in SUR strategies, is generally a supermartingale with respect to the filtration generated by the observations. This observation enables us to establish generic consistency results for a broad class of SUR strategies. The consistency of several popular sequential design strategies is then obtained by means of this general result. Notably, we establish the consistency of two SUR strategies proposed by Bect, Ginsbourger, Li, Picheny and Vazquez (Stat. Comp., 2012)—to the best of our knowledge, these are the first proofs of consistency for GP-based sequential design algorithms dedicated to the estimation of excursion sets and their measure. We also establish a new, more general proof of consistency for the expected improvement algorithm for global optimization which, unlike previous results in the literature, applies to any GP with continuous sample paths.

MSC 2010 subject classifications: Primary 60G15, 62L05; secondary 68T05.
Keywords and phrases: Sequential Design of Experiments, Active Learning, Stepwise Uncertainty Reduction, Supermartingale, Uncertainty Functional, Convergence.

1. Introduction

Sequential design of experiments is an important and lively research field at a crossroads between applied probability, statistics and optimization, where the goal is to allocate experimental resources step by step so as to reduce the uncertainty about some quantity, or function, of interest. While the experimental design vocabulary traditionally refers to observations of natural phenomena presenting aleatory uncertainties, the design of computer experiments—in which observations are replaced by numerical simulations—has become a field of research per se [37, 50, 51], where Gaussian process models are massively used to define efficient sequential designs in cases of costly evaluations. The predominance of Gaussian processes in this field is probably due to their unique combination of modeling flexibility and computational tractability, which makes it possible to work out sampling criteria accounting for the potential effect of adding new experiments. The definition, calculation and optimization of sampling criteria tailored to various application goals have inspired a significant number of research contributions in the last decades [see, e.g., 4, 10–12, 20, 22–24, 26, 28, 46–48, 53, 54, 63]. Yet, available convergence results for the associated sequential designs are quite heterogeneous in terms of their respective extent and underlying hypotheses [9, 29, 53, 55, 61]. Here we develop a probabilistic approach to the analysis of a large class of strategies. This enables us to establish generic consistency results, whose broad applicability is subsequently illustrated on four popular sequential design strategies. The crux is that each of these strategies turns out to involve some uncertainty functional applied to a sequence of conditional probability distributions, and our main results rely on the key property—which will be referred to as the supermartingale property—that, for any sequential design, the sequence of random variables produced by these functionals is a supermartingale with respect to the filtration generated by the observations.

Among the sampling criteria considered in our examples, probably the most famous one is the expected improvement (EI), that arose in sequential design for global optimization. Following the foundations laid by Mockus et al. [41] and the considerable impact of the work of Jones et al. [34], EI and other Bayesian optimization strategies have spread in a variety of application fields. They are now commonly used in engineering design [21] and, in the field of machine learning, for automatic configuration algorithms (see [54] and references therein). Extensions to constrained, multi-objective and/or robust optimization constitute an active field of research [see, e.g., 6, 19, 20, 28, 46, 65]. In a different context, sequential design strategies based on Gaussian process models have been used to estimate contour lines, probabilities of failures, profile optima and excursion sets of expensive-to-evaluate simulators [see, notably, 4, 11, 26, 47, 48, 60, 64, 68].

More specifically, we consider in this paper sequential design strategies built according to the stepwise uncertainty reduction (SUR) paradigm [see 4, 10, 63, and references therein]. Our main focus is the consistency of these algorithms under the assumption that the function of interest is a sample path of the Gaussian process model that is used to construct the sequential design. Almost sure consistency has been proved for the EI algorithm in [61], but only under the restrictive assumption that the covariance function satisfies a certain condition—the “No Empty Ball” (NEB) property—which excludes very regular Gaussian processes1. Moreover, to the authors’ knowledge, no proof of consistency has yet been established for algorithms dedicated to probability of excursion and/or excursion set estimation (referred to as excursion case henceforth) such as those of Bect et al. [4]. The scheme of proof developed in this work allows us to address the excursion case and also to revisit the consistency of the knowledge gradient algorithm [22–24], as well as that of the EI algorithm—which can also be seen as a particular case of a SUR strategy [10]—without requiring the NEB assumption. Before outlining the paper in more detail, let us briefly introduce the general setting and, in particular, what we mean by SUR strategies. We will focus directly on the case of Gaussian processes for clarity, but the SUR principle in itself is much more general, and can be used with other types of models [see, e.g., 12, 25, 33, 36, 40, 43].

1On a related note, Bull [9] proves an upper bound for the convergence rate of the expected improvement algorithm under the assumption that the covariance function is Hölder, but his result only holds for functions that belong to the reproducing kernel Hilbert space (RKHS) of the covariance—a condition which, under appropriate assumptions, is almost surely not satisfied by sample paths of the Gaussian process according to Driscoll’s theorem [39]. Another result in the same vein is provided by Yarotsky [67] for the squared exponential covariance in the univariate case, assuming that the objective function is analytic in a sufficiently large complex domain around its interval of definition.


Let ξ be a real-valued Gaussian process defined on a measurable space X—typically, ξ will be a continuous Gaussian process on a compact metric space, such as X = [0, 1]ℓ—and assume that evaluations (observations) Zn = ξ(Xn) + εn are to be made, sequentially, in order to estimate some quantity of interest (e.g., the maximum of ξ, or the volume of its excursion above some given threshold). We will assume the sequence of observation errors (εn)n∈N∗ to be independent of the Gaussian process ξ, and composed of independent centered Gaussian variables. The definition of a SUR strategy starts with the choice of a “measure of residual uncertainty” for the quantity of interest after n evaluations, which is a functional

Hn = H(Pξn)    (1.1)

of the conditional distribution Pξn of ξ given Fn, where Fn is the σ-algebra generated by X1, Z1, . . . , Xn, Zn. Assuming that the Hn’s are Fn-measurable random variables, a SUR sampling criterion is then defined as

Jn(x) = En,x (Hn+1),    (1.2)

where En,x denotes the conditional expectation with respect to Fn with Xn+1 = x (assuming that Hn+1 is integrable, for any choice of x ∈ X). The value of the sampling criterion Jn(x) at time n quantifies the expected residual uncertainty at time n + 1 if the next evaluation is made at x. Finally, a (non-randomized) sequential design is constructed by greedily choosing at each step a point that provides the smallest expected residual uncertainty—equivalently, the largest expected uncertainty reduction—that is,

Xn+1 ∈ argminx∈X Jn(x).    (1.3)

Such a greedy strategy is sometimes called myopic or one-step look-ahead (as opposed to a Bayes-optimal strategy, which would consider the reduction of uncertainty achieved at the end of the sequential design, i.e., when the entire experimental budget has been spent). Our goal is to establish the consistency of these strategies, where consistency means that the residual uncertainty Hn goes almost surely to zero.
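In algorithmic form, the greedy loop (1.2)–(1.3) can be sketched as follows. This is a schematic illustration, not code from the paper; observe, J_criterion and update_posterior are hypothetical callables standing for the evaluation of the function, the criterion Jn and the Bayesian update of Pξn.

```python
# Schematic SUR loop implementing (1.2)-(1.3): at each step, evaluate the
# criterion J_n(x) = E_{n,x}[H_{n+1}] and observe where it is smallest.
def sur_loop(prior, candidates, observe, J_criterion, update_posterior, budget):
    posterior = prior                                   # P_0^xi: the GP prior
    design, observations = [], []
    for n in range(budget):
        scores = [J_criterion(posterior, x) for x in candidates]
        x_next = candidates[scores.index(min(scores))]  # X_{n+1} in argmin J_n
        z_next = observe(x_next)                        # Z_{n+1}: noisy evaluation
        posterior = update_posterior(posterior, x_next, z_next)  # P_{n+1}^xi
        design.append(x_next)
        observations.append(z_next)
    return posterior, design, observations
```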

Given a finite measure µ over X and an excursion threshold T ∈ R, a typical choice of measure of residual uncertainty in the excursion case [4] is the integrated indicator variance Hn = H(Pξn) = ∫X pn (1 − pn) dµ (also called integrated Bernoulli variance in what follows), where pn(x) = Pn (ξ(x) ≥ T) and Pn denotes the conditional probability with respect to Fn. Note that pn(x)(1 − pn(x)) = varn(1ξ(x)≥T), where varn denotes the conditional variance with respect to Fn. Recalling the definition of Jn from (1.2), and using the law of total variance, we obtain that

Jn(x) = ∫X En,x (varn+1(1ξ(u)≥T)) µ(du) ≤ ∫X varn(1ξ(u)≥T) µ(du) = Hn,    (1.4)

which shows that (Hn) is an (Fn)-supermartingale. Another related measure of uncertainty, for which a semi-analytical formula is provided in [11], is the variance of the excursion volume, Hn = varn(µ({x ∈ X : ξ(x) ≥ T})). The supermartingale property follows again in this case from the law of total variance. In the optimization case, on the other hand, it turns out [see, e.g., 10, Section 3.3] that the EI criterion is underlaid by the measure of residual uncertainty Hn = En (max ξ − Mn), where Mn is defined as Mn = maxi≤n ξ(Xi) for non-degenerate Gaussian processes (see Remark 4.10), and En denotes the conditional expectation with respect to Fn. A similar construction can be obtained for the knowledge gradient, as developed later. It turns out, as shown later in the paper, that for both criteria the associated measures of residual uncertainty also possess the aforementioned supermartingale property.
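Since the conditional marginals are Gaussian, pn has the closed form pn(u) = Φ((mn(u) − T)/σn(u)) with σ2n(u) = kn(u, u) and Φ the standard normal cdf, so Hn can be approximated by discretizing µ on a grid. A minimal numerical sketch, assuming NumPy/SciPy; mn, sn and weights are placeholders (not notation from the paper) for the posterior mean, posterior standard deviation and µ-weights at the grid points:

```python
import numpy as np
from scipy.stats import norm

def integrated_bernoulli_variance(mn, sn, T, weights):
    """H_n = int_X p_n (1 - p_n) dmu, discretized with mu-weights on a grid."""
    safe = np.where(sn > 0, sn, 1.0)          # avoid 0/0; value unused there
    pn = np.where(sn > 0,
                  norm.cdf((mn - T) / safe),  # p_n(u) = P_n(xi(u) >= T)
                  (mn >= T).astype(float))    # degenerate case sigma_n(u) = 0
    return float(np.sum(weights * pn * (1.0 - pn)))
```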


It must be pointed out here that, under very weak assumptions about the Gaussian process model and the uncertainty functional, consistency can also be achieved more simply using any dense (“space filling”) deterministic sequence of design points. It has been largely demonstrated, however, that SUR strategies typically outperform these simple deterministic designs in practice (see references above). Hence there remains a gap between theory and practice that is not filled by our consistency results since, by themselves, they do not provide a very strong theoretical support for the choice of SUR sequential designs over other types of designs, and in particular over non-sequential designs.

The main practical interest of our consistency results is rather to answer the natural concern that SUR strategies, because of their greedy nature, might fail to be consistent in some situations. Such a concern is justified for instance by the explicit counterexample, provided by Yarotsky [66], of a particular function for which the sequence of points generated by the EI strategy fails to produce a consistent estimate of the optimum of the function. Our results show that such functions are negligible under the distribution of the Gaussian process used to construct the sequential design. Further study of the convergence rate of SUR sequential designs is nevertheless needed to provide a full theoretical support for their practical effectiveness, and will be the subject of future work. Understanding their consistency, in relation with the properties of the uncertainty functionals that define them, is an important first step in this direction.

The rest of the paper is structured as follows. Section 2 defines more precisely the statistical model and design problem considered in the paper, and addresses properties of conditioning and convergence of Gaussian measures that are instrumental in proving the main results of the paper. Section 3 discusses uncertainty functionals and their properties, and formulates general sufficient conditions for the consistency of SUR sequential designs in terms of properties of the associated uncertainty functionals. Finally, Section 4 presents applications of the general result to four popular examples of SUR sequential designs, establishing in each case both convergence to zero for the considered measure of residual uncertainty and convergence of the corresponding estimator to the quantity of interest, in the almost sure and L1 sense.

2. Preliminaries: Gaussian processes and sequential design

2.1. Model

Let (ξ(x))x∈X denote a Gaussian process with mean function m and covariance function k, defined on a probability space (Ω, F, P) and indexed by a metric space X. Assume that ξ can be observed at sequentially selected (data-dependent) design points X1, X2, . . . , with additive heteroscedastic Gaussian noise:

Zn = ξ(Xn) + τ(Xn) Un,  n = 1, 2, . . .    (2.1)

where τ : X → [0, +∞) gives the (known) standard deviation τ(x) of an observation at x ∈ X, and (Ui)i≥1 denotes a sequence of independent and identically distributed N(0, 1) variables, independent of ξ. Let Fn denote the σ-algebra generated by X1, Z1, . . . , Xn, Zn.

Definition 2.1. A sequence (Xn)n≥1 will be said to form a (non-randomized) sequential design if, for all n ≥ 1, Xn is Fn−1-measurable.

Standing assumptions 2.2. We will assume in the rest of the paper that

i) X is a compact metric space,
ii) ξ has continuous sample paths,
iii) τ is continuous.

Remark 2.3. Note that the variance function τ2 is not assumed to be strictly positive. Indeed, the special case where τ2 ≡ 0 is actually an important model to consider given its widespread use in Bayesian numerical analysis [see, e.g., 18, 31, 44, 49] and in the design and analysis of deterministic computer experiments [see, e.g., 3, 50, 51].

Remark 2.4. A Gaussian process with continuous sample paths automatically has continuous mean and covariance functions (see, e.g., Lemma 1 in [32]). Conversely, assuming continuity of the mean function, let us recall a classical sufficient condition for sample path continuity on X ⊂ Rd [see, e.g., 1, Theorem 3.4.1]: if there exist C > 0 and η > 0 such that

k(x, x) + k(y, y) − 2k(x, y) ≤ C / |log ‖x − y‖|1+η,  ∀x, y ∈ X,

then there exists a version of ξ with continuous sample paths.

Remark 2.5. The setting described in this section arises, notably, when considering from a Bayesian point of view the following nonparametric interpolation/regression model with heteroscedastic Gaussian noise:

Zn = f(Xn) + τ(Xn) Un,  n = 1, 2, . . .    (2.2)

with a continuous Gaussian process prior on the unknown regression function f. In this case, m and k are the prior mean and covariance functions of ξ.

2.2. Gaussian random elements and Gaussian measures on C(X)

Let S = C(X) denote the set of all continuous functions on X. Since X is assumed compact, S becomes a separable Banach space when equipped with the supremum norm ‖·‖∞. We recall [see, e.g., 2, Theorem 2.9] that any Gaussian process (ξ(x))x∈X with continuous sample paths on a compact metric space satisfies E(‖ξ‖∞) < ∞.

Any Gaussian process (ξ(x))x∈X with continuous sample paths can be seen as a Gaussian random element in S. More precisely, the mapping ξ : Ω → S, ω 7→ ξ(ω, ·) is F/S-measurable, where S denotes the Borel σ-algebra on S, and the probability distribution Pξ of ξ is a Gaussian measure on S. The reader is referred to Vakhania et al. [58] and Ledoux and Talagrand [38] for background information concerning random elements and measures in Banach spaces, and to van der Vaart et al. [59] and Bogachev [7] for more information on the case of Gaussian random elements and measures.

We will denote by M the set of all Gaussian measures on S. Any ν ∈ M is the probability distribution of some Gaussian process with continuous sample paths, seen as a random element in S. The mean function mν and covariance function kν of this Gaussian process are continuous (see Remark 2.4) and fully characterize the measure, which we will denote as GP(mν, kν). We endow M with the σ-algebra M generated by the evaluation maps πA : ν 7→ ν(A), A ∈ S. Using this σ-algebra, conditional distributions on S—seen as transition kernels from Ω to S—can be conveniently identified to random elements in M [see, e.g., 35, p. 105–106].

Given a Gaussian random element ξ in S, we will denote by P(ξ) the set of all Gaussian conditional distributions of ξ, i.e., the set of all random Gaussian measures ν such that ν = P(ξ ∈ · | F′) for some σ-algebra F′ ⊂ F. Note that we use a bold letter ν to denote a random element in M (i.e., a random Gaussian measure), and a normal letter ν to denote a point in the same space (i.e., a Gaussian measure). Not all conditional distributions of the form ν = P(ξ ∈ · | F′) are Gaussian, but an important class of such Gaussian conditional distributions is discussed in the following section and in Proposition 2.9.

2.3. Conditioning on finitely many observations

It is well known that Gaussian processes remain Gaussian under conditioning with respect to pointwise evaluations, or more generally linear combinations of pointwise evaluations, possibly corrupted by independent additive Gaussian noise (explicit expressions of the conditional mean and covariance functions are recalled in Appendix A.2). In the language of nonparametric Bayesian statistics (see Remark 2.5), Gaussian process priors are conjugate with respect to this sampling model. The following result formalizes this fact in the framework of Gaussian measures on S, and states that the conjugation property still holds when the observations are made according to a sequential design.

Proposition 2.6. For all n ≥ 1, there exists a measurable mapping

(X × R)n × M → M,  (x1, z1, . . . , xn, zn, ν) 7→ Condx1,z1,...,xn,zn(ν),    (2.3)

such that, for any Pξ ∈ M and any sequential design (Xn)n≥1, CondX1,Z1,...,Xn,Zn(Pξ) is a conditional distribution of ξ given Fn.

A proof of this result is provided in Appendix A.2. In the rest of the paper, we will denote by Pξn = GP(mn, kn) the conditional distribution CondX1,Z1,...,Xn,Zn(Pξ) of ξ given Fn, which can be seen as a random element in (M, M). The posterior mean mn is also referred to as the kriging predictor [see, e.g., 14, 21, 56]. Note that mn (respectively kn) is an Fn-measurable process2 on X (respectively X × X), with continuous sample paths. Note also that m0 = m and k0 = k. Conditionally to Fn, the next observation follows a normal distribution:

Zn+1 | Fn ∼ N(mn(Xn+1), s2n(Xn+1)),    (2.4)

where s2n(x) = kn(x, x) + τ2(x).
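The explicit formulas behind Condx1,z1,...,xn,zn, recalled in Appendix A.2, are the standard Gaussian conditioning (kriging) equations. The following sketch implements the textbook batch form under the observation model (2.1); it assumes NumPy, and the calling conventions for m, k and tau are ours, not the paper’s:

```python
import numpy as np

def condition_gp(m, k, tau, X, z):
    """Posterior mean and covariance functions of GP(m, k) after observing
    z_i = xi(X_i) + tau(X_i) U_i at the design points X, cf. (2.1).

    Conventions: m(x) maps a 1-d array of points to a 1-d array of means;
    k(x, y) maps two 1-d arrays to the matrix (k(x_i, y_j))_ij.
    """
    X, z = np.asarray(X, float), np.asarray(z, float)
    K = k(X, X) + np.diag(tau(X) ** 2)                  # Gram matrix + noise
    L = np.linalg.cholesky(K + 1e-12 * np.eye(len(X)))  # jitter for stability
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z - m(X)))

    def m_n(x):                                         # posterior mean m_n
        return m(x) + k(x, X) @ alpha

    def k_n(x, y):                                      # posterior covariance k_n
        A = np.linalg.solve(L, k(X, x))
        B = np.linalg.solve(L, k(X, y))
        return k(x, y) - A.T @ B

    return m_n, k_n
```

In the noiseless case τ ≡ 0 of Remark 2.3 (e.g., tau = lambda x: np.zeros_like(x)), m_n interpolates the observations, up to the numerical jitter.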

2.4. Convergence in M

We consider in this paper the following notion of convergence on M:

Definition 2.7. Let νn = GP(mn, kn) ∈ M, n ∈ N ∪ {+∞}. We will say that (νn) converges to ν∞, and write νn → ν∞, if mn → m∞ uniformly on X (i.e., mn → m∞ in S) and kn → k∞ uniformly on X × X.

Remark 2.8. In other words, we consider the topology on M induced by the strong topology on the Banach space C(X) × C(X × X), where M is identified to a subset of this space through the injection ν 7→ (mν, kν).

Let us now state two important convergence results in this topology, that will be needed in Section 3. In the first of them, and later in the paper, we denote by F∞ = ∨n≥1 Fn the σ-algebra generated by ∪n≥1 Fn.

2i.e., a measurable process when considered as defined on (Ω, Fn) instead of (Ω, F)


Proposition 2.9. For any Gaussian random element ξ in S, defined on any probability space, and for any sequential design (Xn)n≥1, the conditional distribution of ξ given F∞ admits a version Pξ∞ which is an F∞-measurable random element in M, and Pξn → Pξ∞ almost surely.

Proposition 2.10. Let ν ∈ M and let (xj, zj) → (x, z) in X × R. Assume that kν(x, x) + τ2(x) > 0. Then Condxj,zj(ν) → Condx,z(ν).

Proof. See Appendix A.3 for proofs of both results.

3. Stepwise Uncertainty Reduction

3.1. Uncertainty functionals and uncertainty reduction

As explained in the introduction, the definition of a SUR strategy starts with the choice of an uncertainty functional H, which maps the conditional distribution Pξn to a measure Hn of residual uncertainty for the quantity of interest. The minimal value of the uncertainty functional represents an absence of uncertainty on the quantity of interest: for clarity, and without loss of generality as long as H is bounded from below and attains its minimum, we will assume in the rest of this section that min H = 0, thus restricting our attention to non-negative uncertainty functionals.

More formally, let H denote a measurable functional from M to [0, +∞). Since Pξn is an Fn-measurable random element in (M, M), the residual uncertainty Hn = H(Pξn) is an Fn-measurable random variable. A key observation for the convergence results of this paper is that many uncertainty functionals of interest—examples of which will be given in Section 4—enjoy the following property:

Definition 3.1. A measurable functional H on M will be said to have the supermartingale property if, for any Gaussian random element ξ in S, defined on any probability space, and for any sequential design (Xn)n≥1, the sequence (H(Pξn))n≥0 is an (Fn)-supermartingale.

The supermartingale property echoes DeGroot’s observation that “reasonable” measures of uncertainty should be decreasing on average for any possible experiment [15]. To discuss this connection more precisely in our particular setting, let us consider the following definition.

Definition 3.2. Let M0 denote a set of probability measures on a measurable space (E, E), and let M0 denote the σ-algebra generated on M0 by the evaluation maps. For any random element ν in (M0, M0), defined on any probability space, let ν̄ denote the probability measure defined by ν̄(A) = E(ν(A)), A ∈ E. We will say that a non-negative measurable functional H on M0 is decreasing on average (DoA) if, for any random element ν in (M0, M0) such that ν̄ ∈ M0, E(H(ν)) ≤ H(ν̄).

Note that, if the set M0 is convex, DoA functionals on M0 are concave. The converse statement is expected to be false, however, since Jensen’s inequality does not hold for all concave functionals in infinite dimensional settings [see 45, for extensions of Jensen’s inequality under various assumptions]. The set M of all Gaussian measures on S is not convex, but all the uncertainty functionals presented in Section 4 can in fact be extended to DoA—hence concave—functionals defined on some larger convex set of probability measures3.

3More precisely, the functionals discussed in Sections 4.1 and 4.2 can be extended to DoA functionals on the set of all probability measures on S, and those of Sections 4.3 and 4.4 to DoA functionals on the set of all probability measures ν on S such that E(max ξ − min ξ) < +∞ for ξ ∼ ν.


Remark 3.3. Let β : S → Rp denote a measurable function, and let νβ denote the image of ν by β. Then it is easy to see that any functional of the form H(ν) = H′(νβ) is DoA, where H′ denotes a DoA functional defined on some appropriate subset of the set of all probability measures on Rp; the reader is referred to [30] for a variety of examples of such functionals. Section 4.2 provides an example of this construction, with p = 1 and H′ the variance functional.

The supermartingale and DoA properties are easily seen to be connected as follows:

Proposition 3.4. If H is DoA on M, then it has the supermartingale property.

Let us conclude this section with another useful property of functionals. Recall that P(ξ) denotes the set of all Gaussian conditional distributions of ξ (see Section 2.2).

Definition 3.5. A measurable functional H on M will be said to be P-uniformly integrable if, for any Gaussian random element ξ in S, defined on any probability space, the family (H(ν))ν∈P(ξ) is uniformly integrable.

Proposition 3.6. Let H denote a measurable functional on M. If there exists L+ ∈ ∩ν∈M L1(S, S, ν) such that |H(ν)| ≤ ∫S L+ dν for all ν ∈ M, then H is P-uniformly integrable.

Proof. Let ξ denote a Gaussian random element in S and let ν = P(ξ ∈ · | F′) ∈ P(ξ). Then we have |H(ν)| ≤ E(L+(ξ) | F′), and the result follows from the uniform integrability of conditional expectations [see, e.g., 35, Lemma 5.5].

Remark 3.7. If H is P-uniformly integrable and has the supermartingale property then, for any sequential design, the sequence (Hn) is a uniformly integrable supermartingale (since Pξn ∈ P(ξ) for all n), and thus converges almost surely and in L1.

3.2. SUR sequential designs and associated functionals

The SUR sampling criterion introduced informally as Jn(x) = En,x (Hn+1) in (1.2) can now be more precisely defined as

Jn(x) = En (H(Condx,Zn+1(x)(Pξn))),    (3.1)

where Zn+1(x) = ξ(x) + Un+1 τ(x). A SUR sequential design is then built by selecting at each step, possibly after some initial design, the next design point as a minimizer of the SUR sampling criterion Jn:

Definition 3.8. Let H denote a non-negative measurable functional on M.

i) We will say that (Xn) is a SUR sequential design associated with the uncertainty functional H if it is a sequential design such that Xn+1 ∈ argmin Jn for all n ≥ n0, for some integer n0.

ii) Given a sequence ε = (εn) of non-negative real numbers such that εn → 0, we will say that (Xn) is an ε-quasi-SUR sequential design if it is a sequential design such that Jn(Xn+1) ≤ inf Jn + εn for all n ≥ n0, for some integer n0.

Remark 3.9. In practice it is not always easy to guarantee that, for a given uncertainty functional H, the sampling criteria Jn attain their infimum over X. Moreover, the actual minimization of Jn is typically carried out by means of a numerical optimization algorithm, which cannot be expected to provide the exact minimizer. For these reasons, it seems important to study the convergence of quasi-SUR designs, as introduced by Definition 3.8.ii, instead of the more restrictive case of (exact) SUR designs. General existence results for SUR and quasi-SUR designs, based on the measurable selection theorem for random closed sets, are provided in Appendix A.4.
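As a small illustration of Definition 3.8.ii, here is a schematic selection step (our sketch; a plain grid search stands in for the numerical optimizer mentioned in Remark 3.9):

```python
def quasi_sur_step(J_n, candidates, eps_n):
    """Pick X_{n+1} with J_n(X_{n+1}) <= inf J_n + eps_n (Definition 3.8.ii).

    Any numerical optimizer whose result lands within eps_n of the infimum
    qualifies; here the "optimizer" is a plain search over candidate points.
    """
    scores = [J_n(x) for x in candidates]
    best = min(scores)
    # any candidate within the tolerance band is admissible
    admissible = [x for x, s in zip(candidates, scores) if s <= best + eps_n]
    return admissible[0]
```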

Let us now introduce some useful functionals associated to a given (non-negative) uncertainty functional H. First, observe that Jn(x) = Jx(Pξn), where the functional Jx : M → [0, +∞] is defined for all x ∈ X and ν ∈ M by

Jx(ν) = ∫∫S×R H(Condx,f(x)+u τ(x)(ν)) ν(df) φ(u) du    (3.2)
      = ∫R H(Condx,mν(x)+v sν(x)(ν)) φ(v) dv,    (3.3)

with s2ν(x) = kν(x, x) + τ2(x) and φ the probability density function of the standard normal distribution. The mapping (x, ν) 7→ Jx(ν) is B(X) ⊗ M-measurable (see Proposition A.6), and it is easy to see that H has the supermartingale property if, and only if,

Jx(ν) ≤ H(ν), for all x ∈ X and ν ∈ M.    (3.4)

Assuming that H has the supermartingale property, we will then denote by Gx : M → [0, +∞) the corresponding expected gain functional at x:

Gx(ν) = H(ν) − Jx(ν),    (3.5)

and by G : M → [0, +∞) the associated maximal expected gain functional:

G(ν) = supx∈X Gx(ν).    (3.6)
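Expression (3.3) reduces Jx to a one-dimensional Gaussian integral, so SUR criteria can be evaluated numerically with, e.g., Gauss–Hermite quadrature. A generic sketch, assuming NumPy and hypothetical objects of ours: nu carries the mean and covariance functions of ν, H_func evaluates H, and condition implements Condx,z:

```python
import numpy as np

def J_x(nu, x, H_func, condition, tau, n_nodes=20):
    """J_x(nu) = int_R H(Cond_{x, m_nu(x) + v s_nu(x)}(nu)) phi(v) dv, cf. (3.3),
    evaluated by Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)  # nodes for int f(t) e^{-t^2} dt
    v = np.sqrt(2.0) * t                             # rescale to N(0, 1) nodes
    w = w / np.sqrt(np.pi)                           # weights now sum to 1
    s_x = np.sqrt(nu.cov(x, x) + tau(x) ** 2)        # s_nu(x)
    z = nu.mean(x) + v * s_x                         # simulated observations at x
    return float(sum(wi * H_func(condition(nu, x, zi)) for wi, zi in zip(w, z)))
```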

Remark 3.10. Following [15], Gx could be called the “information” brought by an evaluation at x about the quantity of interest. This would be consistent with the usual definition of mutual information, when H is taken to be the Shannon entropy of some discrete quantity of interest [see, e.g., 13]. Note that DeGroot renamed it “expected information” in some of his subsequent work on this topic [see, e.g., 16, 17].

Remark 3.11. Alternatively, SUR sequential designs can be defined by the relation Xn+1 ∈ argmax Gn, where Gn denotes the sampling criterion x 7→ Gn(x) := Gx(Pξn) = Hn − Jn(x). In the particular cases discussed in Sections 4.3 and 4.4, Gn corresponds to the knowledge gradient and expected improvement criteria, respectively.

3.3. General convergence results

Denote by ZH and ZG the subsets of M where the functionals H and G vanish, respectively. The inclusion ZH ⊂ ZG always holds: indeed, 0 ≤ Jx ≤ H for all x by (3.4), thus 0 ≤ Gx ≤ H, and therefore 0 ≤ G ≤ H. The reverse inclusion plays a capital role in the following result, which provides sufficient conditions for the almost sure convergence of quasi-SUR sequential designs associated with uncertainty functionals that enjoy the supermartingale property.

Theorem 3.12. Let H denote a non-negative, measurable functional on M with the supermartingale property. Let (Xn) denote a quasi-SUR sequential design for H. Then G(Pξn) → 0 almost surely. If, moreover,

i) Hn = H(Pξn) → H(Pξ∞) almost surely,
ii) G(Pξn) → G(Pξ∞) almost surely,
iii) ZH = ZG,

then Hn → 0 almost surely.

The proof of Theorem 3.12 relies on two main ideas. First, because the sequence (Hn) is a non-negative supermartingale, the conditional mean of its increments goes to zero almost surely, which implies by the quasi-SUR assumption that the maximal expected gain goes to zero as well. Second, using Assumptions i and ii, it is enough to study the limiting distribution Pξ∞: this is where the reverse inclusion ZG ⊂ ZH is used to conclude that the uncertainty in the limiting distribution is zero.

Proof. Since Xn+1 is Fn-measurable, we have:

Jn(Xn+1) = En (H(Condx,Zn+1(x)(Pξn)))|x=Xn+1 = En (H(CondXn+1,Zn+1(Pξn))) = En (Hn+1).    (3.7)

Set ∆n+1 = Hn − Hn+1 and ∆̄n+1 = En (∆n+1) = Hn − En (Hn+1). The random variables ∆̄n are non-negative since (Hn) is a supermartingale and, using that (Xn) is an ε-quasi-SUR design, we have for all n ≥ n0:

∆̄n+1 = Hn − En (Hn+1) = Hn − Jn(Xn+1) ≥ Hn − infx∈X Jn(x) − εn,    (3.8)

i.e., since Jn(x) = Jx(Pξn) and Gx = H − Jx,

∆̄n+1 ≥ supx∈X Gx(Pξn) − εn = G(Pξn) − εn.    (3.9)

Moreover, for any n, we have ∑k<n ∆k = H0 − Hn, and therefore

E(∑k<n ∆̄k) = E(∑k<n ∆k) = E(H0 − Hn) ≤ E(H0) < +∞.

It follows that E(∑k≥0 ∆̄k) < +∞, and thus ∆̄n → 0 almost surely. As a consequence, G(Pξn) → 0 almost surely, since 0 ≤ G(Pξn) ≤ ∆̄n+1 + εn.

Let now Assumptions i–iii hold. It follows from the first part of the proof that G(Pξn) → 0 almost surely. Thus, G(Pξ∞) = 0 almost surely according to Assumption ii. Then H(Pξ∞) = 0 since ZG ⊂ ZH, and the conclusion follows from Assumption i.

Remark 3.13. Note that the conclusions of Theorem 3.12 still hold partially if it is only assumed that the condition Jn(Xn+1) ≤ inf Jn + εn holds infinitely often, almost surely: in this case the conclusion of the first part of the theorem is weakened to lim inf G(Pξn) = 0, but the final conclusion (Hn → 0 a.s.) remains the same.

Since Pξn → Pξ∞ almost surely by Proposition 2.9, Assumptions i and ii of Theorem 3.12 hold if H and G, respectively, are continuous. Assuming H to be continuous, however, would be too strong a requirement, one that some important examples would fail to satisfy. For instance, the uncertainty functional

H : ν 7→ ∫X pν (1 − pν) dµ    (3.10)

studied in Section 4.1, where pν(u) = ∫S 1f(u)≥T ν(df) for some threshold T ∈ R, is clearly discontinuous at the degenerate measure ν = GP(T 1X, 0). The following weaker notion of continuity will turn out to be suitable for our needs:

Definition 3.14. A measurable functional H on M will be said to be P-continuous if, for any Gaussian random element ξ in S, defined on any probability space, and any sequence of random measures νn ∈ P(ξ) such that νn → ν∞ ∈ P(ξ) almost surely, the convergence H(νn) → H(ν∞) holds almost surely.

Remark 3.15. The uncertainty functional (3.10) provides an explicit example of a functional which is P-continuous (cf. the proof of Theorem 4.1) but not continuous. The expected improvement functional, discussed in Section 4.4, provides an example of a functional which is not even P-continuous (see Proposition 4.11), but for which consistency can nonetheless be proved by a direct application of Theorem 3.12.

Checking that G is P-continuous, however, is not easy in practice. The following result provides sufficient conditions for Assumption 3.12.ii that are easier to check.

Theorem 3.16. Let H denote a non-negative, measurable uncertainty functional on M, and let G denote the associated maximal expected gain functional. Assume that H = H0 + H1, where

i) H0(ν) = ∫S L0 dν for some L0 ∈ ∩ν∈M L1(S, S, ν), and
ii) H1 is P-uniformly integrable, P-continuous and has the supermartingale property.

Then, for any quasi-SUR sequential design associated with H, G(Pξ∞) = 0 almost surely.

Proof. First, note that H0(Pξn) = En (L0(ξ)). Thus, since L0 ∈ L1(S, S, Pξ), the sequence (H0(Pξn)) is a uniformly integrable martingale (see, e.g., Kallenberg [35], Theorem 6.23), which converges almost surely and in L1 to E∞ (L0(ξ)) = H0(Pξ∞). As a consequence, H(Pξn) → H(Pξ∞) almost surely, since H1 is P-continuous and Pξn → Pξ∞ almost surely by Proposition 2.9.

Let x ∈ X. The functional H0 has the supermartingale property by the preceding argument, and therefore H = H0 + H1 also has the supermartingale property. Then, it follows from the first part of Theorem 3.12 that Gx(Pξn) → 0 almost surely, and thus

Jx(Pξn) = H(Pξn) − Gx(Pξn) → H(Pξ∞) almost surely.    (3.11)

Let Pξn,x = Condx,Z(x)(Pξn), with Z(x) = ξ(x) + τ(x)U and U ∼ N(0, 1) independent from ξ and the Un’s, and observe that Jx(Pξn) = En (H(Pξn,x)). Consider then the decomposition:

Jx(Pξn) = En (H(Pξn,x) − H(Pξ∞,x)) + En (H(Pξ∞,x))
        = En (H1(Pξn,x) − H1(Pξ∞,x)) + En (H(Pξ∞,x)),    (3.12)

where the second equality simply follows from the fact that En (H0(Pξn,x)) = En (H0(Pξ∞,x)) = En (L0(ξ)) by the law of total expectation. The second conditional expectation in (3.12) is, again, a uniformly integrable martingale that converges almost surely and in L1:

En (H(Pξ∞,x)) → E∞ (H(Pξ∞,x)) almost surely and in L1.    (3.13)

Moreover, note that

Pξn,x = CondX1,Z1,...,Xn,Zn,x,Z(x)(Pξ0) = Condx,Z(x),X1,Z1,...,Xn,Zn(Pξ0)

is the conditional distribution of ξ at the (n + 1)th step of the modified sequential design (X̃n), where X̃1 = x and X̃n+1 = Xn for all n ≥ 1, with a modified sequence of “noise variables” (Ũn) defined by Ũ1 = U and Ũn+1 = Un for all n ≥ 1. Note also that Pξ∞,x corresponds to the conditional distribution with respect to the σ-algebra generated by X̃1, Z̃1, X̃2, Z̃2, . . ., where the Z̃n’s have been defined accordingly. As a result,

En (H1(Pξn,x) − H1(Pξ∞,x)) → 0 in L1,    (3.14)

since H1 is P-continuous and P-uniformly integrable. Combine (3.12), (3.13) and (3.14) to prove that Jx(Pξn) → E∞ (H(Pξ∞,x)) in L1. Then, it follows from a comparison with (3.11) that H(Pξ∞) = E∞ (H(Pξ∞,x)) almost surely, and therefore

Gx(Pξ∞) = H(Pξ∞) − E∞ (H(Pξ∞,x)) = 0 almost surely.    (3.15)

To conclude, note that by Assertion (a) of Theorem A.8 the sample paths of J∞ : x 7→ E∞ (H(Pξ∞,x)) are continuous on {x ∈ X : s2∞(x) > 0}. Let {xj} denote a countable dense subset of X. We have proved that, almost surely, Gxj(Pξ∞) = 0 for all j. Using the continuity of J∞ on {s2∞ > 0}, and the fact that Gx = 0 on {s2∞ = 0}, we conclude that, almost surely, Gx(Pξ∞) = 0 for all x, and therefore G(Pξ∞) = 0, which concludes the proof.

3.4. Uncertainty functionals based on a loss function

Let us now consider, more specifically, uncertainty functionals H defined in the form of a risk:

H(ν) = infd∈D ∫S L(f, d) ν(df) = infd∈D Lν(d),    (3.16)

where D is a set of “decisions”, L : S × D → [0, +∞] a “loss function” such that L(·, d) is S-measurable for all d ∈ D, and Lν(d) = ∫S L(f, d) ν(df). All the examples that will be discussed in Section 4 can be written in this particular form.

The following result formalizes an important observation of DeGroot [15, p. 408] about such uncertainty functionals—namely, that they always enjoy the DoA property introduced in Section 3.1 (and thus can be studied using Theorem 3.12).

Proposition 3.17. Let H denote a measurable functional on M. If H is of the form (3.16), then it is DoA on M, and consequently has the supermartingale property.

Proof. The result follows directly from the fact that H is the infimum of a family of linear functionals (ν 7→ Lν(d), for d ∈ D) that commute with expectations: for any random element ν in M and any d ∈ D,

E(Lν(d)) = E(∫S L(f, d) ν(df)) = Lν̄(d),    (3.17)

where ν̄ is defined as in Definition 3.2. (In other words, the linear functionals ν 7→ Lν(d) are DoA themselves, with an equality in (3.17) instead of the inequality in Definition 3.2.)


An uncertainty functional of the form (3.16) is clearly M-measurable if the infimum over d can be restricted to a countable subset of D (since the linear functionals ν 7→ Lν(d) are M-measurable by Lemma A.1). This is true, for instance, if D is separable and d 7→ Lν(d) is continuous for all ν. See Proposition A.4 for an example where L is discontinuous.
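For a concrete illustration of the construction (3.16) (a toy example of ours, not one treated in the paper): take D = R and L(f, d) = (f(x0) − d)2 for some fixed x0 ∈ X. Then Lν(d) = (mν(x0) − d)2 + kν(x0, x0) is finite and continuous in d, the infimum is attained at d = mν(x0), and H(ν) = kν(x0, x0) is simply the posterior variance at x0; since D = R is separable, the countable-restriction argument above applies and H is M-measurable.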

Three of the examples of SUR sequential designs from the literature that will be analyzed in Section 4 are based on regular non-negative loss functions in the following sense.

Definition 3.18. We will say that a non-negative loss function L : S × D → [0, +∞) is regular if

i) D is a separable space,
ii) for all d ∈ D, L(·, d) is S-measurable,
iii) for all ν ∈ M, Lν takes finite values and is continuous on D,

and if the corresponding functionals H and G satisfy:

iv) H = H0 + H1, where H0(ν) = ∫S L0 dν for some L0 ∈ ∩ν∈M L1(S, S, ν), and H1 is P-uniformly integrable and P-continuous,
v) ZH = ZG.

The following corollary is provided as a convenient summary of the results that hold for uncertainty functionals based on regular non-negative loss functions.

Corollary 3.19. Let H denote a functional of the form (3.16) for some non-negative loss function L. If L is regular, then H is a measurable functional that satisfies the assumptions of Theorems A.8, 3.12 and 3.16. In particular, for any quasi-SUR design associated with H, Hn = H(Pξn) → 0 almost surely.

4. Applications to popular sequential design strategies

This section presents applications of our results to four popular sequential design strategies, two of them addressing the excursion case (Sections 4.1 and 4.2), and the other two addressing the optimization case (Sections 4.3 and 4.4). For each example, the convergence results are preceded by details on the associated loss functions, uncertainty functionals and sampling criteria.

4.1. The integrated Bernoulli variance functional

Assume that X is endowed with a finite measure µ and let T ∈ R be a given excursion threshold. For any measurable function f : X → R, let Γ(f) = {u ∈ X : f(u) ≥ T} and α(f) = µ(Γ(f)). The quantities of interest are then Γ(ξ) and α(ξ). Let pn(u) = En (1Γ(ξ)(u)) = Pn (ξ(u) ≥ T). A typical choice of measure of residual uncertainty in this case is the integrated indicator—or “Bernoulli”—variance [4]:

Hn = ∫X pn (1 − pn) dµ,    (4.1)

which corresponds to the uncertainty functional

H(ν) = ∫X pν (1 − pν) dµ, ν ∈ M,    (4.2)


where pν(u) = ∫S 1f(u)≥T ν(df). See [11] for more information on the computation of the corresponding SUR sampling criterion

Jn(x) = En,x (∫X pn+1 (1 − pn+1) dµ).
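The semi-analytical computation of this criterion is developed in [11]; a transparent, if slow, alternative is to simulate Zn+1 from (2.4) and average the resulting one-step-ahead uncertainties. A naive Monte Carlo sketch of ours, reusing the hypothetical condition_gp and integrated_bernoulli_variance helpers from the earlier sketches:

```python
import numpy as np

def J_bernoulli(m_n, k_n, tau, x, T, grid, weights, n_sim=64, rng=None):
    """Monte Carlo estimate of J_n(x) = E_{n,x}[int_X p_{n+1}(1 - p_{n+1}) dmu]."""
    rng = np.random.default_rng(rng)
    xa = np.array([x], dtype=float)
    s2 = k_n(xa, xa)[0, 0] + tau(xa)[0] ** 2       # s_n^2(x), cf. (2.4)
    z_draws = m_n(xa)[0] + np.sqrt(s2) * rng.standard_normal(n_sim)
    total = 0.0
    for z in z_draws:  # one-step-ahead posterior for each simulated Z_{n+1}
        m1, k1 = condition_gp(m_n, k_n, tau, xa, np.array([z]))
        sn1 = np.sqrt(np.maximum(np.diag(k1(grid, grid)), 0.0))
        total += integrated_bernoulli_variance(m1(grid), sn1, T, weights)
    return total / n_sim
```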

The functional (4.2) can be seen as the uncertainty functional induced by the loss function

L : S × D → [0, +∞), (f, d) 7→ ‖1Γ(f) − d‖2L2(X),    (4.3)

where D ⊂ L2(X) is the set of “soft classification” functions on X (i.e., measurable functions defined on X and taking values in [0, 1]). Indeed, for all ν ∈ M and ξ ∼ ν,

Lν(d) = E(L(ξ, d)) = ‖pν − d‖2L2(X) + ∫ pν(1 − pν) dµ

is minimal for d = pν, and therefore H(ν) = infd∈D Lν(d).

The following theorem establishes the convergence of SUR (or quasi-SUR) designs associated to this uncertainty functional using the theory developed in Section 3.4 for regular loss functions.

Theorem 4.1. The loss function (4.3) is regular in the sense of Definition 3.18. As a consequence, all the conclusions of Corollary 3.19 hold, and in particular H(Pξn) → 0 almost surely for any quasi-SUR design associated with H.

Proof. The proof consists in six points, as follows:

a) D is separable
The space L2(X) is a separable metric space since X is a separable measure space (see, e.g., Theorem 4.13 in [8]). Hence D is also separable.

b) for all d ∈ D, L(·, d) is S-measurable
Indeed, f 7→ ∫X (1f(x)≥T − d(x))2 µ(dx) is S-measurable by Fubini’s theorem since the integrand is S ⊗ B(X)-jointly measurable in (f, x).

c) for all ν ∈ M, Lν takes finite values and is continuous on D
Here Lν is clearly finite since the loss is upper-bounded by µ(X), and its continuity directly follows from the continuity of the norm.

d) H = H0 + H1, where H0(ν) = ∫S L0 dν for some L0 ∈ ∩ν∈M L1(S, S, ν), and H1 is P-uniformly integrable
Here this holds with L0 = 0 and H1 = H. Indeed, H is trivially P-uniformly integrable since the loss is upper-bounded.

e) H1 is P-continuous
Let ξ ∼ GP(m, k) and let (νn) be a sequence of random measures νn ∈ P(ξ) such that νn → ν∞ ∈ P(ξ) almost surely. For n ∈ N ∪ {∞}, let mn and kn be the (random) mean and covariance functions of νn. For u ∈ X and n ∈ N ∪ {∞}, let also σ2(u) = k(u, u), σ2n(u) = kn(u, u), and

gn(u) = g(Φ((T − mn(u))/σn(u))),

where g(p) = p(1 − p) and Φ(t) = P(Z ≥ t) for a standard Gaussian variable Z, with the convention that Φ(0/0) = 1. We will prove below that, for all n ∈ N ∪ {+∞},

H(νn) = ∫X gn(u) µ(du) = ∫A gn(u) µ(du) almost surely,    (4.4)

where A denotes the random subset of X defined by

A(ω) = {u ∈ X : σ(u) > 0, σ∞(ω, u) = 0, m∞(ω, u) ≠ T} ∪ {u ∈ X : σ(u) > 0, σ∞(ω, u) > 0}.

The motivation for using (4.4) is that it is easy to prove the convergence of gn(ω, u) for u ∈ A(ω), and that the set A(ω) does not depend on n, which makes it possible to conclude using the dominated convergence theorem on the set A(ω) for almost all ω. In more detail: since νn → ν∞ almost surely, it holds for almost all ω ∈ Ω that mn(ω, ·) → m∞(ω, ·) and σn(ω, ·) → σ∞(ω, ·) uniformly on X. Furthermore, for each u ∈ A(ω), either σ∞(ω, u) > 0, or σ∞(ω, u) = 0 and m∞(ω, u) ≠ T. In both cases, we have that g(Φ([T − mn(ω, u)]/σn(ω, u))) → g(Φ([T − m∞(ω, u)]/σ∞(ω, u))). So, for almost all ω ∈ Ω we can apply the dominated convergence theorem on A(ω) and thus obtain that

H(νn) = ∫A gn(u) µ(du) → H(ν∞) = ∫A g∞(u) µ(du) almost surely,

which proves the claim.

Let us now prove (4.4). Observe first that, for any u such that σ(u) = 0, we have σn(u) = 0 almost surely for all n ∈ N ∪ {∞}, since νn ∈ P(ξ). Hence, gn(u) = 0 almost surely when σ(u) = 0. [This is because of the convention Φ(0/0) = 1, which yields gn(u) = g(1mn(u)≥T) = 0 when σn(u) = 0, regardless of whether mn(u) = T or not.] Thus, setting B(ω) = {u ∈ X : σ(u) > 0, σ∞(ω, u) = 0, m∞(ω, u) = T}, we have for all ω ∈ Ω

X = {u ∈ X : σ(u) = 0} ∪ A(ω) ∪ B(ω),

and the three sets on the right-hand side of the previous display are disjoint. Since, as discussed above, gn(u) = 0 almost surely for any u such that σ(u) = 0, we obtain

H(νn) = ∫A gn(u) µ(du) + ∫B gn(u) µ(du) almost surely.

Thus, in order to prove (4.4), it is sufficient to show that

∫B gn(u) µ(du) = 0 almost surely.    (4.5)

We will now establish (4.5) by proving that, in fact, µ(B) = 0 almost surely. First, since (ω, u) 7→ m∞(ω, u) and (ω, u) 7→ σ∞(ω, u) are jointly measurable (by continuity of m∞(ω, ·) and σ∞(ω, ·) for all ω ∈ Ω), it follows from the Fubini–Tonelli theorem that

E(µ(B)) = ∫X 1σ(u)>0 E(1σ∞(u)=0 1m∞(u)=T) µ(du).    (4.6)

Then we have, for any u ∈ X,

E(1σ∞(u)=0 (ξ(u) − m∞(u))2) = E(E[1σ∞(u)=0 (ξ(u) − m∞(u))2 | F′∞])
  = E(1σ∞(u)=0 E[(ξ(u) − m∞(u))2 | F′∞])
  = E(1σ∞(u)=0 σ2∞(u)) = 0,

where F′∞ denotes the σ-algebra such that ν∞ = P(ξ ∈ · | F′∞). Hence, the random variable 1σ∞(u)=0 (ξ(u) − m∞(u))2 is almost surely zero, since it is non-negative and has a zero expectation. Thus, for any u ∈ X, the implication

σ∞(u) = 0 =⇒ ξ(u) = m∞(u)

holds almost surely. As a consequence, we have

1σ(u)>0 1σ∞(u)=0 1m∞(u)=T = 1σ(u)>0 1σ∞(u)=0 1ξ(u)=T ≤ 1σ(u)>0 1ξ(u)=T = 0

almost surely, since ξ(u) ∼ N(m(u), σ2(u)) and P(ξ(u) = T) = 0 when σ(u) > 0. Hence, the integrand in the right-hand side of (4.6) is zero, which implies that E(µ(B)) = 0, and therefore µ(B) = 0 almost surely, since µ(B) is a non-negative random variable. Thus (4.4) holds and the proof of e) is complete.

f) ZH = ZG
Let ν ∈ ZG and let ξ ∼ ν. Let m, k, σ2 be defined as above. Let U ∼ N(0, 1) be independent of ξ. Since G(ν) = 0, we have from the law of total variance

∫X var(E(1ξ(u)≥T | Zx)) µ(du) = 0

for all x ∈ X, where Zx = ξ(x) + τ(x)U. Hence, for all x ∈ X, for almost all u ∈ X, we have

var( Φ( [T − m(u) − k(x, u)(Zx − m(x))/(σ2(x) + τ2(x))] / √(σ2(u) − k(x, u)2/(σ2(x) + τ2(x))) ) ) = 0,

which implies that k(x, u) = 0 (as can be proved without difficulty by separating the cases of nullity and non-nullity of the denominator). Thus, if there exists x∗ for which σ2(x∗) = k(x∗, x∗) > 0, we obtain a contradiction, since then k(x, u) > 0 in a neighborhood of x∗ by continuity. We conclude that σ2(x) = 0 for all x ∈ X, and therefore H(ν) = 0.

In the next proposition, we refine Theorem 4.1 by showing that it entails a consistent estimation of the excursion set Γ(ξ).

Proposition 4.2. For any quasi-SUR design associated with H, as n → ∞, almost surely and in L1,

∫X (1ξ(u)≥T − pn(u))2 µ(du) → 0

and

∫X (1ξ(u)≥T − 1pn(u)≥1/2)2 µ(du) → 0.

Proof. From steps e) and f) in the proof of Theorem 4.1, it follows that

∫X (1ξ(u)≥T − pn(u))2 µ(du) = ∫A (1ξ(u)≥T − pn(u))2 µ(du) almost surely.

Also, for almost all ω ∈ Ω and all u ∈ A(ω), pn(ω, u) → 1ξ(ω,u)≥T as n → ∞, since σ∞ ≡ 0 almost surely from the proof of f) in Theorem 4.1 and the conclusion of this theorem. Hence the first part of the proposition follows by applying the dominated convergence theorem twice. The proof of the second part of the proposition is identical.


4.2. The variance of excursion volume functional

Following up on the example of Section 4.1, we consider now the alternative measure of residual uncertainty Hn = varn(α(ξ)) from [4, 11]; in other words, we consider the uncertainty functional

H(ν) = ∫S (α(f) − αν)2 ν(df),    (4.7)

where αν = ∫S α dν. The corresponding sampling criterion is

Jn(x) = En,x (varn+1(α(ξ))).
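Like Hn itself, this criterion admits a semi-analytical expression [11]; a simple fallback, sketched below under our grid conventions (assuming NumPy and posterior functions m_n, k_n as in the conditioning sketch of Section 2.3), is to estimate varn(α(ξ)) from posterior sample paths drawn on a grid:

```python
import numpy as np

def excursion_volume_variance(m_n, k_n, T, grid, weights, n_paths=500, rng=None):
    """Monte Carlo estimate of H_n = var_n(alpha(xi)), where
    alpha(f) = mu({u : f(u) >= T}) is discretized with mu-weights on a grid."""
    rng = np.random.default_rng(rng)
    cov = k_n(grid, grid) + 1e-12 * np.eye(len(grid))  # jitter for PSD safety
    paths = rng.multivariate_normal(m_n(grid), cov, size=n_paths)  # posterior draws
    alphas = (weights * (paths >= T)).sum(axis=1)      # alpha(f) for each draw
    return float(alphas.var())
```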

This uncertainty functional again derives from a loss function: indeed, L(f, d) = (α(f) − d)2 with D = R leads to

LPξn(d) = En[(α(ξ) − d)2] = varn(α(ξ)) + (En(α(ξ)) − d)2,

where LPξn reaches its infimum for d = En(α(ξ)), and therefore Hn = infd∈D LPξn(d). As in the previous section, consistency is established in the following theorem by proving that the loss function L is regular.

Theorem 4.3. The loss function L(f, d) = (α(f) − d)2, where d ∈ D = R, is regular in the sense of Definition 3.18. As a consequence, all the conclusions of Corollary 3.19 hold, and in particular H(Pξn) → 0 almost surely for any quasi-SUR design associated with H.

Proof. The proof consists of the same six points as the proof of Theorem 4.1. Here a) and c) areobvious. Let us now prove the four remaining points.

b) Since L(f, d) = (∫X1f(u)≥T µ(du)−d)2, it can be shown similarly as in the proof of Theorem 4.1

that, for any fixed d ∈ D, L(f, d) is an S-measurable function of f .

d) We use again L0 = 0 for this criterion. The functional H1 = H is trivially P-uniformly integrablesince 0 ≤ H1 ≤ µ(X)2.

e) Let us now show that H1 = H is P-continuous. Let ξ denote a random element in S, ξ ∼GP(m, k), and let (νn) be a sequence of random measures νn ∈ P(ξ) such that νn → ν∞ ∈ P(ξ)almost surely. For all n ∈ N∪+∞, let F ′

n denote a σ-algebra such that νn = P (ξ ∈ · | F ′n). Then,

using Fubini’s theorem, we can rewrite H(νn) as

H(νn) =

X

X

cn(u1, u2)µ(du1)µ(du2),

where cn(u1, u2) = cov(1ξ(u1)≥T ,1ξ(u2)≥T | F ′

n

)for n ∈ N ∪ +∞. Consider the partition

X = u ∈ X, σ(u) = 0 ∪ A(ω) ∪B(ω), ω ∈ Ω,

where σ, A and B are defined as in the proof of Theorem 4.1. Recalling from step e) of this proof that var(1_{ξ(u)≥T} | F′_n) a.s.= 0 when σ(u) = 0, and that µ(B) a.s.= 0, we obtain that, for all n ∈ N ∪ {+∞},

H(ν_n)  a.s.=  ∫_A ∫_A c_n(u₁, u₂) µ(du₁) µ(du₂).


For j = 1, 2 and u_j ∈ A(ω), we have either σ_∞(u_j) > 0, or σ_∞(u_j) = 0 and m_∞(u_j) ≠ T. Hence, for almost all ω ∈ Ω, for u₁ ∈ A(ω) and u₂ ∈ A(ω), we obtain c_n(u₁, u₂) → c_∞(u₁, u₂) by the following lemma (proved later):

Lemma 4.4. Let m_n = (m_{n1}, m_{n2})ᵗ → (m₁, m₂)ᵗ = m as n → ∞. Consider a sequence of covariance matrices Σ_n such that

Σ_n = (σ_{n1}, σ_{n12}; σ_{n12}, σ_{n2}) → (σ₁, σ₁₂; σ₁₂, σ₂) = Σ  as n → ∞

(2 × 2 matrices written row by row). Assume that for i = 1, 2 we have m_i ≠ T or σ_i > 0. Let Z_n ∼ N(m_n, Σ_n) and Z ∼ N(m, Σ). Then, as n → ∞, cov(1_{Z_{n1}≥T}, 1_{Z_{n2}≥T}) → cov(1_{Z₁≥T}, 1_{Z₂≥T}).

Finally, using the dominated convergence theorem on A(ω) × A(ω) for almost all ω ∈ Ω, we conclude that H(ν_n) → H(ν_∞) almost surely, which proves that H is P-continuous.

f) Let ν ∈ M and let ξ ∼ ν. Let also Z_x = ξ(x) + τ(x)U, with U ∼ N(0, 1) independent of ξ, so that J_x(ν) = E(var(α(ξ) | Z_x)). We first remark that, from (3.5) and the law of total variance, for any x ∈ X,

G_x(ν) = var(α(ξ)) − E(var(α(ξ) | Z_x)) = var(E(α(ξ) | Z_x)).    (4.8)

Then we have the following sequence of equivalences:

G(ν) = 0 ⟺ ∀x ∈ X, G_x(ν) = 0
       ⟺ ∀x ∈ X, var(E(α(ξ) | Z_x)) = 0
       ⟺ ∀x ∈ X, α(ξ) − E(α(ξ)) ⊥ L²(Z_x).    (4.9)

The first equivalence follows directly from the definition of G: G(ν) = sup_{x∈X} G_x(ν), since G_x(ν) is non-negative for all x ∈ X; the second one from (4.8); and the third one from the fact that E(α(ξ) | Z_x) is the orthogonal projection of α(ξ) onto L²(Z_x).

Let now ν ∈ Z_G. Using Lemma A.10, it follows from (4.9) that α(ξ) − E(α(ξ)) ⊥ L²(ξ(x)) for all x ∈ X. In particular, α(ξ) − E(α(ξ)) ⊥ 1_{ξ(x)≥T} for all x ∈ X, and thus

var(α(ξ)) = ∫ cov(α(ξ), 1_{ξ(x)≥T}) µ(dx) = 0,

which concludes the proof.

Proof of Lemma 4.4. By the convergence of moments and Gaussianity, (Z_{n1}, Z_{n2}) converges in distribution to (Z₁, Z₂). Furthermore, from the assumptions, the cumulative distribution functions of Z₁ and Z₂ are continuous at T, which implies, by the portmanteau theorem, that P(Z_{ni} ≥ T) → P(Z_i ≥ T). In addition, Y := min(Z₁, Z₂) also has a cumulative distribution function that is continuous at T and, as E(1_{Z₁≥T} 1_{Z₂≥T}) = P(Y ≥ T), we get similarly that E(1_{Z_{n1}≥T} 1_{Z_{n2}≥T}) → E(1_{Z₁≥T} 1_{Z₂≥T}), which completes the proof.

As before, in the next proposition we show that Theorem 4.3 yields a consistent estimation of the excursion volume.

Proposition 4.5. For any quasi-SUR design associated with H, as n → ∞, almost surely and in L1, E_n(α(ξ)) → α(ξ).


Proof. Let α = α(ξ). We know from, e.g., Theorem 6.23 in [35], that E_n(α) → E_∞(α) almost surely and in L1. Moreover, it follows from Theorem 4.3 that var_n(α) → 0 almost surely, and therefore E(var_n(α)) → 0 by dominated convergence. Hence, E(E_n[(E_n(α) − α)²]) → 0, which shows that E_n(α) converges in L1, and thus almost surely as well, to E_∞(α) a.s.= α.

Remark 4.6. Contrary to the case of the integrated Bernoulli variance functional in Proposition 4.2, it is not possible to prove that a SUR sequential design associated with the uncertainty functional (4.7) results in a consistent estimation of the set Γ(ξ) = {u ∈ X : ξ(u) ≥ T}. Indeed, for instance, let X = [−1, 1], let µ be the Lebesgue measure, let T = 0 and let ξ have zero mean function and covariance function k defined by k(u, v) = uv for u, v ∈ [−1, 1]. Then, almost surely, the set Γ(ξ) is equal to [−1, 0] or to [0, 1] (with probability 1/2 for each case). Hence we have, with ν the distribution of ξ, H(ν) = 0, because α(ξ) = 1 almost surely. Thus, any sequential design (X_n)_{n≥1} is a SUR sequential design associated with (4.7), since we have J_n(x) = 0 for any x ∈ X and n ≥ 1. However, the conclusions of Proposition 4.2 clearly do not hold for any sequential design. For instance, if X_n = 0 for all n ≥ 1, we have p_n(u) = 1/2 for all u ∈ X.

In conclusion, for the SUR strategy associated with the uncertainty functional (4.7), and thus based on µ(Γ(ξ)), it can only be guaranteed that µ(Γ(ξ)), but not Γ(ξ) in general, is estimated consistently.
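The counterexample of Remark 4.6 is easy to visualize numerically: the process with covariance k(u, v) = uv and zero mean can be simulated as ξ(u) = uW with W ∼ N(0, 1), and every sample path then has excursion volume exactly 1, while the excursion set itself flips between [−1, 0] and [0, 1]. A small sketch (ours, purely illustrative):

import numpy as np

rng = np.random.default_rng(42)
u = np.linspace(-1.0, 1.0, 2001)
du = u[1] - u[0]
volumes = []
for _ in range(1000):
    w = rng.standard_normal()
    xi = u * w                        # one sample path of the GP
    volumes.append(np.sum(xi >= 0.0) * du)
print(np.var(volumes))                # ~0 up to grid effects: H(nu) = 0,
                                      # although Gamma(xi) itself is random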

4.3. The knowledge gradient functional

Coming to the topic of sequential design for global optimization, we now focus on the knowledge gradient criterion [23, 24, 53], which is an extension to the general (noisy) case of the strategy proposed in the 70's by [41] for the noiseless case. We shall consider, here and in the next section, the case of a maximization problem. The knowledge gradient sampling criterion, to be maximized, is then defined by

G_n(x) = E_{n,x}(max m_{n+1}) − max m_n,    (4.10)

with maxima taken over the whole domain X, as in [23, 24] for the case of a discrete X and in Section 3 of [53] for the case of a “continuous” X; we do not consider the KGCP approximation introduced in Section 4 of [53].

Remark 4.7. The quantity max m_n in (4.10) does not depend on the sampling point x, and thus plays no part in the selection of the next observation point. The motivation for writing G_n(x) in this form is that the sampling criterion thus defined is non-negative, and becomes equal to zero when σ_n ≡ 0. The first term in the right-hand side of (4.10) is exactly the sampling criterion proposed by [41] in the noiseless case.
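For a finite grid X, the expectation in (4.10) is one-dimensional and can be approximated by simple Monte Carlo: writing s² = k_n(x, x) + τ²(x), the next posterior mean is m_{n+1} = m_n + (k_n(·, x)/s) Z, with Z standard normal under P_n. The sketch below is ours, with assumed names; [23, 24] give an exact algorithm for the discrete case, which we do not reproduce.

import numpy as np

def knowledge_gradient(m, K, tau2, j, n_mc=10_000, seed=0):
    """Monte Carlo estimate of G_n(x_j) = E_{n,x_j}(max m_{n+1}) - max m_n.

    m    : (p,) current posterior mean on the grid
    K    : (p, p) current posterior covariance on the grid
    tau2 : (p,) observation noise variances
    j    : index of the candidate observation point in the grid
    """
    rng = np.random.default_rng(seed)
    s2 = K[j, j] + tau2[j]
    if s2 <= 0.0:
        return 0.0                        # a new observation brings no information
    sigma_tilde = K[:, j] / np.sqrt(s2)   # update direction of the posterior mean
    z = rng.standard_normal(n_mc)
    new_means = m[None, :] + z[:, None] * sigma_tilde[None, :]   # (n_mc, p)
    return new_means.max(axis=1).mean() - m.max()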

The following criterion, to be minimized, clearly defines the same strategy as (4.10):

J_n(x) = E_n(max ξ) − E_{n,x}(max m_{n+1})
       = E_{n,x}(E_{n+1}(max ξ) − max m_{n+1}),

and clearly appears, under the second form, as the SUR sampling criterion corresponding to the uncertainty functional

H(ν) = ∫_S max f ν(df) − max m_ν.    (4.11)

Moreover, the original sampling criterion (4.10) is easily seen to be the value G_n(x) = G_x(P^ξ_n) of the associated expected gain functional.


This time again, the uncertainty functional H derives from a loss function, with D = X and L(f, d) = max f − f(d), leading to

L_{P^ξ_n}(d) = E_n(max ξ) − m_n(d).

The average loss L_{P^ξ_n} reaches its infimum for d ∈ argmax m_n, and so H_n = inf_{d∈D} L_{P^ξ_n}(d). Following the same route as in the last two sections, we have:

Theorem 4.8. The loss function L(f, d) = max f − f(d), where d ∈ D = X, is regular in the sense of Definition 3.18. As a consequence, all the conclusions of Corollary 3.19 hold, and in particular H(P^ξ_n) → 0 almost surely for any quasi-SUR design associated with H.

Proof. The proof consists of the same six points as the proof of Theorem 4.1.

a) X is a compact metric space, hence separable.

b) The mapping L(·, d): f ↦ max f − f(d) is continuous on S, hence S-measurable.

c) L_ν: d ↦ ∫ max f ν(df) − m_ν(d) is continuous, since m_ν ∈ S for all ν ∈ M.

d) Let L₀(f) = max f. Since X is compact, it holds for any Gaussian measure ν ∈ M and any ξ ∼ ν that E(max_X |ξ|) < ∞ (see Section 2.2), and thus we have L₀ ∈ ∩_{ν∈M} L¹(S, S, ν). Moreover, it follows from Proposition 3.6 that H₁: ν ↦ −max m_ν is P-uniformly integrable, since |H₁(ν)| ≤ ∫ L⁺ dν with L⁺(f) := max |f|, and L⁺ ∈ ∩_{ν∈M} L¹(S, S, ν).

e) H₁: ν ↦ −max m_ν is continuous, hence P-continuous. Indeed, consider a sequence of measures ν_n ∈ M converging to a limit ν_∞ ∈ M in the sense of Definition 2.7. Then m_{ν_n} converges uniformly to m_{ν_∞} as n → ∞, and therefore H₁(ν_n) = −max m_{ν_n} converges to H₁(ν_∞) = −max m_{ν_∞} by continuity of f ↦ max f on S.

f) Let ν ∈ Z_G and let ξ ∼ ν. Let m, k, σ² be defined, w.r.t. ξ, as in the proof of Theorem 4.1. Let Z_x = ξ(x) + τ(x)U with U ∼ N(0, 1) independent of ξ. Let x∗ ∈ argmax m. We have, for all x ∈ X,

0 = G_x(ν) = H(ν) − J_x(ν)
  = (E(max ξ) − max m) − E(E(max ξ | Z_x) − max_{u∈X} E(ξ(u) | Z_x))
  = E(max_{u∈X} E(ξ(u) | Z_x)) − m(x∗),    (4.12)

by the law of total expectation and the optimality property of x∗. For all x, y ∈ X it holds that

m(x∗) = E(E(ξ(x∗) | Z_x)) ≤ E(max(E(ξ(y) | Z_x), E(ξ(x∗) | Z_x))) ≤ E(max_{u∈X} E(ξ(u) | Z_x)),

and therefore, using (4.12),

0 ≤ E(max(E(ξ(y) | Z_x), E(ξ(x∗) | Z_x)) − E(ξ(x∗) | Z_x)) ≤ G_x(ν) = 0.


Setting W_{x,y} := E(ξ(y) | Z_x) − E(ξ(x∗) | Z_x), we have thus proved that E(max(0, W_{x,y})) = 0, from which it follows that var(W_{x,y}) = 0 since

W_{x,y} = m(y) − m(x∗) + 1_{σ²(x)+τ²(x)>0} · (k(x, y) − k(x, x∗))/(σ²(x) + τ²(x)) · (Z_x − m(x))

is Gaussian. Observe now that

var(W_{x,y}) = 1_{σ²(x)+τ²(x)>0} · (k(x, y) − k(x, x∗))²/(σ²(x) + τ²(x)).

Hence it must be the case that either σ²(x) + τ²(x) = 0 or k(x, y) = k(x, x∗). But, if σ²(x) + τ²(x) = 0, then σ(x) = 0 and therefore k(x, y) = k(x, x∗) = 0. Summing up, we have proved that

k(x, y) = k(x, x∗),    ∀x, y ∈ X.

As a consequence, for all x, y ∈ X, we have

k(x, x) − k(x, y) = k(x, x∗) − k(x, x∗) = 0,

and therefore

var(ξ(x) − ξ(y)) = (k(x, x) − k(x, y)) + (k(y, y) − k(x, y)) = 0.

It follows that, almost surely, the sample paths of ξ − m are constant over X, and so max ξ = ξ(x∗) − m(x∗) + max m = ξ(x∗). We have thus proved that H(ν) = E(max ξ) − m(x∗) = 0 for any ν ∈ Z_G, which concludes the proof.

In the next proposition, we refine Theorem 4.8 by showing that the loss max ξ − ξ(X∗_n) goes to zero for any sequence of optimal decisions X∗_n ∈ argmax m_n.

Proposition 4.9. Let (X∗_n) be any sequence of F_n-measurable X-valued random variables such that X∗_n ∈ argmax m_n almost surely for all n. Then, for any quasi-SUR design associated with H, ξ(X∗_n) → max ξ almost surely and in L1.

Proof. From step f) in the proof of Theorem 4.8, and the fact that H(P^ξ_∞) a.s.= 0, it follows that the sample paths of ξ − m_∞ are almost surely constant over X. Let X∗ denote an X-valued random variable such that X∗ ∈ argmax ξ. Then we have

lim sup_{n→∞} (ξ(X∗) − ξ(X∗_n)) = lim sup_{n→∞} (m_∞(X∗) − m_∞(X∗_n)) = lim sup_{n→∞} (m_n(X∗) − m_n(X∗_n)) ≤ 0

almost surely. This implies, since ξ(X∗) − ξ(X∗_n) ≥ 0, that ξ(X∗_n) → ξ(X∗) = max ξ almost surely. Convergence in the L1 sense is finally obtained by the dominated convergence theorem.

4.4. The expected improvement functional

This section addresses the celebrated expected improvement strategy [34, 41].⁴ Assume that exact (noiseless) evaluations can be made, in other words, that τ(x) = 0 for all x ∈ X: then, we define


the expected improvement criterion, to be maximized, as

G_n(x) = E_{n,x}(M_{n+1} − M_n),    (4.13)

with M_n = max_{x∈X: σ_n(x)=0} ξ(x) and σ²_n(x) = k_n(x, x). Observe that, on the right-hand side of (4.13), similarly to Remark 4.7 for the knowledge gradient criterion, only M_{n+1} actually depends on the new observation point X_{n+1} = x. Note also that we need at least one x ∈ X such that σ_n(x) = 0 for M_n to be well defined, which is always true as soon as n ≥ 1 (in practice, (4.13) is typically used after an initial design of size n₀ > 0).

⁴ As explained in Section 4.3, the strategy originally proposed by [41] (and earlier work in Russian by J. Mockus and A. Žilinskas) is more accurately described, in principle, as a special (noiseless) case of what we have called the “knowledge gradient” strategy. However, for the particular Brownian motion-based Gaussian process prior used in [41], the maximum of m_n always occurs at an observation point, and the criterion of [41] then coincides (see page 21 of the paper) with what is currently commonly referred to as the expected improvement criterion.

Remark 4.10. Our definition of the EI strategy, and in particular of the current best value M_n, differs slightly from the usual one [34], which takes M_n = max(ξ(X₁), . . . , ξ(X_n)). This minor variation is necessary if we want to see the EI strategy as stemming from some uncertainty functional. Remark that, in the case of a non-degenerate Gaussian process (i.e., when σ_n(x) = 0 if and only if x ∈ {X₁, . . . , X_n}), the two definitions of M_n coincide and the criterion can be written more familiarly as

G_n(x) = E_n(max(0, ξ(x) − M_n)).    (4.14)

The sampling criteria (4.13) and (4.14) no longer agree in general, since it can happen for degenerate Gaussian processes that M_n > max(ξ(X₁), . . . , ξ(X_n)). Degeneracy occurs, e.g., in the case of finite-dimensional Gaussian processes (i.e., linear models with a Gaussian prior on their coefficients), or for processes with pathwise invariance properties [27] (for instance ξ(x) = −ξ(−x) for all x ∈ X, almost surely).

It turns out that the sequential design obtained by iteratively maximizing (4.13) can be interpreted as a SUR sequential design. Indeed, we have

E_{n,x}(M_{n+1} − M_n) = E_n(max ξ − M_n) − E_{n,x}(max ξ − M_{n+1}),

where the subscript “x” has been dropped from the first expectation since its argument does not depend on the position X_{n+1} of the next evaluation. Thus, using the fact that E_{n,x}(max ξ) = E_{n,x}(E_{n+1}(max ξ)) by the law of total expectation, we see that maximizing (4.13) is equivalent to minimizing

J_n(x) = E_{n,x}(E_{n+1}(max ξ) − M_{n+1}),

which is precisely the SUR strategy associated with the uncertainty functional defined, for any ν ∈ M such that σ_ν vanishes at at least one x ∈ X, by

H(ν) = ∫_S max f ν(df) − max_{x∈X: σ_ν(x)=0} m_ν(x).    (4.15)

Indeed, max_{x∈X: σ_ν(x)=0} ξ(x) a.s.= max_{x∈X: σ_ν(x)=0} m_ν(x) for any such ν and any ξ ∼ ν. Moreover, the expected improvement criterion G_n turns out to be the value G_n(x) = G_x(P^ξ_n) of the associated expected gain functional. In order to have an uncertainty functional H that is well defined on all of M, set

H(ν) = ∫_S (max f − min f) ν(df)    (4.16)

for all ν ∈ M such that σ_ν does not vanish.

The uncertainty functional (4.15)–(4.16) is measurable (see Proposition A.4) and can be associated with a certain loss function, as shown in the following result. Contrary to the case of the three previous criteria, however, this loss function is not regular in general.


Proposition 4.11. The EI uncertainty functional is of the form (3.16), with L the loss function defined on the decision space D = X × (R ∪ {−∞}), for any f ∈ S and d = (x∗, z∗) ∈ D, by

L(f, d) = { max f − z∗,      if f(x∗) = z∗ and z∗ > −∞;
            max f − min f,   if z∗ = −∞;
            +∞,              otherwise.    (4.17)

Assuming that X ⊂ R^p has a non-empty interior, the loss function (4.17) is not regular, and neither would be any other loss function that could be associated with the EI uncertainty functional.

Proof. Let us first prove that H is of the form (3.16). Let ν ∈ M, ξ ∼ ν and d = (x∗, z∗) ∈ D. Then the average loss is equal to L_ν(d) = E(max ξ − min ξ) if z∗ = −∞ and, using the convention (+∞) · 0 = 0, to

L_ν(d) = E(L(ξ, d))
       = E((max ξ − z∗) 1_{ξ(x∗)=z∗}) + (+∞) · P(ξ(x∗) ≠ z∗)
       = { E(max ξ) − m_ν(x∗),   if σ_ν(x∗) = 0 and m_ν(x∗) = z∗;
           +∞,                   otherwise,    (4.18)

if z∗ > −∞. The last equality follows from the simple observation that the event {ξ(x∗) = z∗} is almost sure if σ_ν(x∗) = 0 and m_ν(x∗) = z∗, and negligible otherwise.

In the case where there exists at least one x ∈ X such that σ_ν(x) = 0, it is clear that E(max ξ) − m_ν(x) < E(max ξ − min ξ) for any such x, and thus the expected loss is minimal for d = (x∗, z∗) such that x∗ ∈ argmax_{x: σ_ν(x)=0} m_ν(x) and z∗ = m_ν(x∗), which yields (4.15). In the case where σ_ν does not vanish, on the other hand, (4.18) is always infinite and thus the expected loss is minimal for any d = (x∗, −∞), which yields (4.16). In both cases, we have proved that H(ν) = min_{d∈D} L_ν(d).

We will now prove that there is no regular loss function L such that H(ν) = min_{d∈D} L_ν(d). To do so, we will show that H cannot be decomposed as H = H₀ + H₁, with H₀(ν) = ∫_S L₀ dν for some L₀ ∈ ∩_{ν∈M} L¹(S, S, ν), and H₁ a P-continuous functional.

Assume, for the sake of contradiction, that H = H₀ + H₁, with H₀(ν) = ∫_S L₀ dν for some L₀ ∈ ∩_{ν∈M} L¹(S, S, ν), and H₁ a P-continuous functional. Then, using the same martingale argument as in the proof of Theorem 3.16, we have H(P^ξ_n) → H(P^ξ_∞) almost surely. Also, again by a martingale argument, E_n(max ξ) → E_∞(max ξ) almost surely, and therefore max_{σ_n(x)=0} m_n(x) → max_{σ_∞(x)=0} m_∞(x) almost surely.

We will now show that this last convergence does not hold for a certain Gaussian process ξ on X, which yields a contradiction. For simplicity, we assume in the following that X = [0, 1], but the same argument could be made on any X ⊂ R^p that has a non-empty interior.

Consider a Gaussian process ξ with mean m(x) = x and covariance k(x, y) = exp(−(x − y)²). Let (X_n) be a deterministic sequence, dense in [0, 1/3]. Then, as follows from the proof of Proposition 1 in [62], we have σ_∞(x) = 0 for all x ∈ [0, 1]. Hence, max_{σ_∞(x)=0} m_∞(x) = max_{x∈[0,1]} ξ(x). Also, since X_k ∈ [0, 1/3] for all k ∈ N, and since ξ is a non-degenerate Gaussian process, we have max_{σ_n(x)=0} m_n(x) ≤ max_{x∈[0,1/3]} m_n(x). This upper bound converges to max_{x∈[0,1/3]} ξ(x) almost surely, and thus max_{x∈[0,1/3]} ξ(x) = max_{x∈[0,1]} ξ(x) almost surely. This last equality cannot hold,


however, because

max_{x∈[0,1/3]} ξ(x) ≤ 1/3 + max_{x∈[0,1/3]} (ξ(x) − x),
max_{x∈[2/3,1]} ξ(x) ≥ 2/3 + max_{x∈[2/3,1]} (ξ(x) − x),

and by symmetry max_{x∈[0,1/3]} (ξ(x) − x) and max_{x∈[2/3,1]} (ξ(x) − x) have the same distribution.

Since the EI uncertainty functional does not derive from a regular loss function, consistency cannot be proved using Corollary 3.19 as in the three previous examples. The following result will thus be proved by a direct application of the more general Theorem 3.12.

Theorem 4.12. For any quasi-SUR sequential design associated with H, as n → ∞, almost surely and in L1, H_n → 0, max m_n → max ξ and M_n → max ξ.

Proof. Since the uncertainty functional H derives from a loss function by Proposition 4.11, it is DoA by Proposition 3.17 and, consequently, has the supermartingale property.

Consider now a quasi-SUR sequential design associated with H. Theorem 3.12 applies and therefore G(P^ξ_n) → 0 almost surely. Observe also that, for all n ≥ 1,

M_{n+1} = max_{σ_{n+1}(x)=0} ξ(x) ≥ max(ξ(X_{n+1}), max_{σ_n(x)=0} ξ(x)) = max(ξ(X_{n+1}), M_n).

Hence G(P^ξ_n) = sup_{x∈X} E_{n,x}(M_{n+1} − M_n) ≥ sup_{x∈X} E_n(max(0, ξ(x) − M_n)), and thus

max_{x∈X} γ(m_n(x) − M_n, σ²_n(x)) → 0  almost surely as n → ∞,    (4.19)

where γ denotes the function defined by γ(a, b) = E(max(0, Z_{a,b})), Z_{a,b} ∼ N(a, b). Recall from Section 3 in Vazquez and Bect [61] that γ is continuous and satisfies

• γ(z, s²) > 0 if s² > 0,
• γ(z, s²) ≥ z > 0 if z > 0.
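(For the reader's convenience, we also recall the standard closed form: for b > 0, γ(a, b) = a Φ(a/√b) + √b φ(a/√b), with Φ and φ the standard normal cdf and pdf, while γ(a, 0) = max(a, 0); this is exactly the quantity evaluated by the expected improvement sketch after Remark 4.10.)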

Recall also from Proposition 2.9 that, almost surely, m_n → m_∞ and σ_n → σ_∞ uniformly on X. Therefore we have, almost surely,

∀x ∈ X, γ(m_∞(x) − M_∞, σ²_∞(x)) = 0,    (4.20)

where M_∞ denotes the almost sure limit of the increasing sequence (M_n), with M_∞ ≤ max ξ < +∞. (To see that (M_n) is increasing, observe that the set of points x ∈ X such that σ_n(x) = 0 is growing with n, since (σ_n(x)) is decreasing for any x.) Considering the properties of γ, it follows from (4.20) that, almost surely, for all x ∈ X, σ_∞(x) = 0 and m_∞(x) − M_∞ ≤ 0. Therefore, almost surely, we have ξ = m_∞ and M_∞ ≥ max m_∞. Since it is clear that M_n ≤ max m_n for all n, we also have M_∞ ≤ max m_∞ in the limit, hence M_∞ = max m_∞ = max ξ almost surely.

We have proved so far that max m_n → max ξ and M_n → max ξ almost surely. Moreover, E_n(max ξ) is a martingale that converges almost surely and in L1 to E_∞(max ξ) = max ξ (see, e.g., Theorem 6.23 in [35]), and therefore H_n = E_n(max ξ) − M_n → 0 almost surely.

We conclude the proof by observing that all three convergence results also hold in the L1 sense by the dominated convergence theorem.


Finally, we remark that Theorem 4.12 improves on the consistency result of [61], since it does not impose the no-empty-ball property on the covariance function k. Hence, Theorem 4.12 also holds for very smooth Gaussian processes, such as Gaussian processes with a Gaussian (a.k.a. squared exponential) covariance function, or Gaussian processes whose sample paths have symmetry properties [27].

Appendix A: Technical results and proofs

A.1. Measurability results

Lemma A.1. Let (E, E) denote a measurable space. Let ϕ: S × E → [0, +∞] denote an S ⊗ E-measurable function. Then the function M × E → [0, +∞], (ν, y) ↦ ∫_S ϕ(f, y) ν(df), is M ⊗ E-measurable.

Proof. The result is clear for any ϕ = 1_{A×B}, with A ∈ S and B ∈ E. Indeed, ∫_S ϕ(f, y) ν(df) = π_A(ν) 1_B(y), where π_A denotes the evaluation map ν ↦ ν(A), and the restriction of π_A to M is M-measurable. It can be extended to any ϕ = 1_Γ, with Γ ∈ S ⊗ E, using a standard monotone class argument, and then to any S ⊗ E-measurable function by linearity and increasing approximation by simple functions.

In the following lemma, the Banach space C(X × X) is endowed with its Borel σ-algebra.

Lemma A.2. The mappings m•: M → S, ν ↦ m_ν, and k•: M → C(X × X), ν ↦ k_ν, are measurable.

Proof. The mapping m• is measurable if, and only if, ν ↦ ϕ(m_ν) is measurable for all ϕ ∈ S′ [see, e.g., 58, Theorem 2.2]. Let ϕ ∈ S′: there exists a unique signed measure µ_ϕ on X such that ϕ(f) = ∫_X f dµ_ϕ. It is then easy to check with Fubini's theorem that ϕ(m_ν) = ∫ ϕ(f) ν(df), and the conclusion follows from Lemma A.1. The measurability of k• is established in a similar way, working on X × X instead of X.

Let Θ ⊂ S × C(X × X) denote the range of Ψ = (m•, k•), and let T denote the trace on Θ of the Borel σ-algebra of S × C(X × X).

Lemma A.3. Ψ is a bi-measurable mapping from (M, M) to (Θ, T).

Proof. The measurability of Ψ follows from Lemma A.2. Since M is generated by the evaluation maps (see Section 2.2), Ψ⁻¹ is measurable if, and only if, (m, k) ↦ [GP(m, k)](A) is measurable for all A ∈ S. This is easily checked for any finite intersection of the form A = ∩_k {f ∈ S | f(x_k) ∈ Γ_k}, where (x_k) ∈ Xⁿ and (Γ_k) ∈ B(R)ⁿ. The result extends to the ball σ-algebra S₀ using a standard monotone class argument, which concludes the proof since S₀ = S [see, e.g., 5].

Proposition A.4. The expected improvement functional (4.15)–(4.16) is M-measurable.

Proof. Let {x_i} denote a countable dense subset of X and set, for all k > 0,

H_k(ν) = ∫_S (max f − min f) ν(df) − sup_i (m_ν(x_i) − ∫_S min f ν(df)) 1_{σ_ν(x_i) ≤ 1/k}.

The mappings ν ↦ ∫_S max f ν(df), ν ↦ ∫_S min f ν(df), (ν, x) ↦ m_ν(x) and (ν, x) ↦ σ²_ν(x) are measurable by Lemma A.1. As a consequence, for any k > 0, the functional H_k is M-measurable. The result follows from the fact that H_k → H pointwise as k → ∞.


A.2. The conditioning operator

Let Z_n = (Z₁, . . . , Z_n) and X_n = (X₁, . . . , X_n). For any (m, k) ∈ Θ, x_n ∈ Xⁿ and z_n ∈ Rⁿ, it is well known that the conditional mean and covariance functions of (ξ(x))_{x∈X} given Z_n = z_n, assuming a deterministic design X_n = x_n (see Section 2.1), are given by

m_n(x; x_n, z_n) = m(x) + k(x, x_n) K(x_n)† (z_n − m(x_n)),    (A.1)
k_n(x, y; x_n) = k(x, y) − k(x, x_n) K(x_n)† k(x_n, y),    (A.2)

where K(x_n)† denotes the Moore-Penrose pseudo-inverse of K(x_n) = (k(x_i, x_j) + τ(x_i)² δ_{i,j})_{1≤i,j≤n}, and k(x_n, ·) and the other notations should be self-explanatory.

Lemma A.5. κ_n: (x_n, z_n, (m, k)) ↦ (m_n(·; x_n, z_n), k_n(·; x_n)) is a measurable mapping from Xⁿ × Rⁿ × Θ to Θ, where Θ is endowed with the σ-algebra T defined in the preceding section.

Proof. First observe that, for any x_n, k_n(·; x_n) is the covariance function of ξ − m_n(·; x_n, Z_n), which is a Gaussian process with continuous sample paths. Thus, (m_n(·; x_n, z_n), k_n(·; x_n)) is indeed an element of Θ. The result then follows from the continuity of (m, x) ↦ m(x), (k, x) ↦ k(x, ·) and (k, x, y) ↦ k(x, y), and the measurability of K ↦ K† [52].

Proof of Proposition 2.6. Let κ̄_n: Xⁿ × Rⁿ × M → M denote the mapping defined by

κ̄_n(x_n, z_n, ν) = GP(m_n(·; x_n, z_n), k_n(·; x_n)),    (A.3)

where ν = GP(m, k) ∈ M. Observe that, using the notations introduced in the previous section, κ̄_n(x_n, z_n, ν) = Ψ⁻¹(κ_n(x_n, z_n, Ψ(ν))): thus, it follows from Lemmas A.3 and A.5 that κ̄_n is measurable. Standard algebraic manipulations then show that

κ̄_{n+m}(x_{n+m}, z_{n+m}, ν) = κ̄_m(x_{n+1:n+m}, z_{n+1:n+m}, κ̄_n(x_n, z_n, ν)),

whence it is easy to prove recursively that P^ξ_n := κ̄_n(X_n, Z_n, P^ξ) satisfies the property E(U P^ξ_n(Γ)) = E(U 1_{ξ∈Γ}) for any sequential design (X_i), any F_n-measurable U of the form U = Π_{i=1}^n ϕ_i(Z_i) and any Γ ∈ S of the form Γ = ∩_{j=1}^J {ξ(x_j) ∈ Γ_j}, with x_j ∈ X, Γ_j ∈ B(R), 1 ≤ j ≤ J. The result extends to any F_n-measurable U and any Γ ∈ S thanks to a monotone class argument, which proves that P^ξ_n is a conditional distribution of ξ given F_n. Proposition 2.6 is thus established with Cond_{x₁,z₁,...,x_n,z_n}: ν ↦ κ̄_n(x_n, z_n, ν).

Proposition A.6. The mapping (x, ν) ↦ J_x(ν) is B(X) ⊗ M-measurable.

Proof. Observe that J_x(ν) can be rewritten as

J_x(ν) = ∫_R H(κ̄₁(x, m_ν(x) + v s_ν(x), ν)) φ(v) dv,    (A.4)

where s²_ν(x) = k_ν(x, x) + τ²(x) and κ̄₁ is defined as in the proof of Proposition 2.6. Using Lemma A.3 and the measurability of κ̄₁, the integrand in the right-hand side of (A.4) is easily seen to be a B(X) ⊗ M ⊗ B(R)-measurable function of (x, ν, v). The result follows from Fubini's theorem.

Remark A.7. As a consequence of Proposition A.6, J_n: x ↦ J_x(P^ξ_n) is an F_n-measurable process for all n, and thus J_n(X) is a well-defined F_n-measurable random variable for any F_n-measurable X-valued random variable X.


A.3. Convergence in M

Proof of Proposition 2.9. Recall from Proposition 2.6 that the conditional distribution of ξ given F_n is of the form P^ξ_n = GP(m_n, k_n). Moreover, ξ is a Bochner-integrable S-valued random element: indeed, it is measurable (see, e.g., [58]) and ‖ξ‖_∞ is integrable (see, e.g., Theorem 2.9 in [2]). The conditional expectation E(ξ | F_n) of ξ given F_n is thus well defined as an S-valued random element (since S = C(X) is a separable Banach space; see, e.g., Theorem 5.1.12 in [57]) and is easily seen to coincide with m_n. As a consequence, it follows from Theorem 6.1.12 in [57] that m_n converges uniformly, almost surely and in L1(Ω, F, P), to m_∞ := E(ξ | F_∞). The limit m_∞ is, by definition of the conditional expectation, an F_∞-measurable random element in S.

Let us now prove that the sequence k_n converges uniformly to a continuous function k_∞. Since P^ξ_n = Cond_{X₁,Z₁,...,X_n,Z_n}(P^ξ) by Proposition 2.6, and since the sequence of conditional covariance functions depends only on the design points X_i (not on the observed values Z_i), we can reduce without loss of generality to the case of a deterministic design (X_i = x_i ∈ X, for all i ∈ N) and consider the associated deterministic sequence (k_n). Let µ = ∑_{i=1}^p µ_i δ_{x_i} denote any finitely supported measure on X, and let σ²_n(µ) = ∑_{i,j=1}^p µ_i µ_j k_n(x_i, x_j) denote the conditional variance of Z = ∑_{i=1}^p µ_i ξ(x_i) given F_n. Because Z and the observations are jointly Gaussian, the sequence (σ²_n(µ))_{n≥1} is decreasing and therefore converges to a limit σ²_∞(µ), for all µ. Thus,

k_n(x, y) = (1/4)(σ²_n(δ_x + δ_y) − σ²_n(δ_x − δ_y)) → (1/4)(σ²_∞(δ_x + δ_y) − σ²_∞(δ_x − δ_y))  as n → ∞,

which proves convergence to a limit k_∞(x, y). Moreover, we have, for any x, y, x′, y′ ∈ X,

|k_n(x, y) − k_n(x′, y′)| ≤ σ_n(δ_x) σ_n(δ_y − δ_{y′}) + σ_n(δ_{y′}) σ_n(δ_x − δ_{x′})    (A.5)
                          ≤ σ₀(δ_x) σ₀(δ_y − δ_{y′}) + σ₀(δ_{y′}) σ₀(δ_x − δ_{x′}).    (A.6)

Letting n go to +∞ in the left-hand side, we conclude that k_∞ is continuous. To see that the convergence k_n → k_∞ is uniform, consider the sequence of functions X² → R, (x, y) ↦ σ²_n(δ_x + δ_y). This is a decreasing sequence of continuous functions, which converges pointwise to the continuous function (x, y) ↦ σ²_∞(δ_x + δ_y). Since X² is compact, the convergence is uniform by Dini's first theorem. The same argument applies to (x, y) ↦ σ²_n(δ_x − δ_y), and therefore to k_n by polarization.

Finally, let Q denote any conditional distribution of ξ given F_∞. We will prove that the F_∞-measurable random measure Q is almost surely a Gaussian measure. Let x ∈ X and let φ_x denote the (random) characteristic function of Q ∘ δ_x⁻¹. It follows from Theorem 6.23 in Kallenberg [35] that, for all u ∈ R, φ_x(u) = E_∞(e^{iuξ(x)}) a.s.= lim_{n→∞} E_n(e^{iuξ(x)}). Since

E_n(e^{iuξ(x)}) = e^{iu m_n(x)} e^{−k_n(x,x)u²/2} → e^{iu m_∞(x)} e^{−k_∞(x,x)u²/2}  almost surely as n → ∞,

we conclude from the continuity of φ_x and Lévy's theorem that Q ∘ δ_x⁻¹ = N(m_∞(x), k_∞(x, x)) almost surely. The argument extends to any image measure of the form Q ∘ h⁻¹, with h = (δ_{y₁}, . . . , δ_{y_m}). Considering first the case where the y_j's are taken in a countable dense subset of X, and then using the continuity of the elements of S, we conclude that there is an almost sure event Ω₀ ∈ F_∞ such that, for ω ∈ Ω₀, (δ_x)_{x∈X} is a Gaussian process defined on the probability space (S, S, Q(ω, ·)), and thus Q(ω, ·) is a Gaussian measure for all ω ∈ Ω₀. Finally, letting

P^ξ_∞(ω, ·) = { Q(ω, ·)    if ω ∈ Ω₀;
                GP(0, 0)   otherwise,


we have constructed an F_∞-measurable random element in M such that P^ξ_n → P^ξ_∞ a.s. for the topology introduced in Definition 2.7, thereby concluding the proof.

Proof of Proposition 2.10. Let ν = GP(m, k) ∈ M and let (x_j, z_j) → (x_∞, z_∞) in X × R. For any j ∈ N ∪ {+∞}, we have Cond_{x_j, z_j}(ν) = GP(m₁(·; x_j, z_j), k₁(·; x_j)), where m₁ and k₁ are given by (A.1)–(A.2). It is then easy to check that m₁(·; x_j, z_j) and k₁(·; x_j) converge uniformly to m₁(·; x_∞, z_∞) and k₁(·; x_∞), respectively, using the facts that k is uniformly continuous over X × X (since k is continuous and X × X is a compact metric space) and that K ↦ K† is continuous at K = k_ν(x_∞, x_∞) + τ²(x_∞) > 0 (the covariance matrix is actually a scalar in this case).

A.4. Existence of SUR and quasi-SUR sequential designs

This section contains general existence results for ε-quasi-SUR sequential designs. Recall that X is assumed, throughout the paper, to be a compact metric space (see Standing assumptions 2.2).

Theorem A.8. Let the assumptions of Theorem 3.16 hold. Then,

a) for any sequential design, the sample paths of J_n are continuous on {x ∈ X : s²_n(x) > 0};

b) for any sequence ε = (ε_n) of strictly positive real numbers, there exists an ε-quasi-SUR sequential design (X_n)_{n≥1} associated with H.

Proof. We will assume without loss of generality that H₀ = 0, since H₀ only adds a constant term (i.e., a term that does not depend on x) to the value of the sampling criterion.

Let us first prove Assertion (a). Since J_n(x) = J_x(P^ξ_n), it is equivalent to prove that the result holds at n = 0 for any P^ξ_0 ∈ M. Assume then that n = 0, fix x ∈ X such that s²₀(x) = k(x, x) + τ²(x) > 0, and let (x_k) denote a sequence in X such that x_k → x. Recall from (3.1) that J₀(x) = J_x(P^ξ_0) = E(H(Cond_{x, Z₁(x)}(P^ξ_0))). Set ν_k = Cond_{x_k, Z₁(x_k)}(P^ξ_0) and ν_∞ = Cond_{x, Z₁(x)}(P^ξ_0). We have ν_k ∈ P(ξ) for all k ∈ N ∪ {+∞}, and ν_k → ν_∞ by Proposition 2.10. It follows that H(ν_k) → H(ν_∞) almost surely since H is P-continuous, and thus J_{x_k}(P^ξ_0) = E(H(ν_k)) → E(H(ν_∞)) = J_x(P^ξ_0) since (H(ν_k)) is uniformly integrable. Assertion (a) is proved.

Consider now the following compact subsets of X:

B_{n,γ}(ω) = {x ∈ X | s_n(ω, x) ≥ γ⁻¹ > 0},    (A.7)
A_{n,γ}(ω) = B_{n,γ}(ω) ∩ {x ∈ X | J_n(ω, x) ≤ inf J_n(ω, ·) + ε_n}.    (A.8)

Let us prove that, on the event {s_n ≢ 0}, the set A_{n,γ}(ω) is non-empty for large values of γ. Assume that s_n(ω, ·) ≢ 0, and recall that J_n(ω, ·) ≤ H_n(ω) by (3.4). If J_n(ω, ·) ≡ H_n, then for any x such that s_n(ω, x) > 0 and any γ ≥ s_n(ω, x)⁻¹ we have x ∈ A_{n,γ}(ω) = B_{n,γ}(ω). If inf_x J_n(ω, x) < H_n, pick a sequence (x_k) such that J_n(ω, x_k) → inf_x J_n(ω, x). For k large enough, J_n(ω, x_k) ≤ inf_x J_n(ω, x) + ε_n and J_n(ω, x_k) < H_n. As a consequence, s_n(ω, x_k) > 0 (this follows from (3.3) and the fact that Cond_{x, m_ν(x)}(ν) = ν if s²_ν(x) = 0), and thus x_k ∈ A_{n,γ}(ω) for any γ ≥ s_n(ω, x_k)⁻¹. In both cases the claim is proved.

Since X is a compact metric space, it is easily proved that ω ↦ A_{n,γ}(ω) is an F_n-measurable random closed set, and thus admits [see, e.g., 42, Theorem 2.13] an F_n-measurable selection X^(γ)_{n+1}, i.e., an X-valued random variable such that X^(γ)_{n+1} ∈ A_{n,γ} on the event {A_{n,γ} ≠ ∅}. Let x̄ denote an arbitrary fixed point in X. Setting

X_{n+1} = { x̄            if s_n ≡ 0;
            X^(k)_{n+1}   if A_{n,k} ≠ ∅ and A_{n,l} = ∅ for all l < k,    (A.9)

provides the desired ε-quasi-SUR strategy and thus completes the proof.

In some situations, it is possible to prove directly the continuity of the sampling criterion J_n on the whole of X (see Section 4.4 for an example), in which case a stronger existence result, which does not even require the supermartingale property, can be formulated as in [61]:

Theorem A.9. Let H denote a measurable uncertainty functional on M such that, for all ν ∈ M, x ↦ J_x(ν) is finite and continuous on X. Then,

a) for any sequential design, the sample paths of J_n are continuous on X;

b) there exists a SUR sequential design (X_n)_{n≥1} associated with H.

Proof. Assertion a) follows trivially from the fact that J_n(x) = J_x(P^ξ_n), and a SUR sequential design is again obtained using the measurable selection theorem for random closed sets.

A.5. Miscellaneous

Lemma A.10. Let U, V and W be real-valued random variables such that

1. W is independent of (U, V),
2. V and W are Gaussian.

If U is orthogonal to L²(V + W), then U is orthogonal to L²(V).

Remark A.11. The reverse implication is also true, but not needed in the paper.

Proof. Assume without loss of generality that U, V and W are centered. Assume further that U is not orthogonal to L²(V). Then there exists a smallest integer k₀ such that cov(U, V^{k₀}) ≠ 0. Indeed, we would otherwise have cov(U, H_k(V)) = 0 for all k, where H_k denotes the k-th Hermite polynomial, and thus U would be orthogonal to L²(V), since (H_k(V))_{k∈N} is an orthonormal basis of L²(V). Using that cov(U, V^k) = 0 for all k < k₀, we have:

cov(U, (V + W)^{k₀}) = ∑_{k=0}^{k₀} (k₀ choose k) E(U V^k) E(W^{k₀−k}) = E(U V^{k₀}) ≠ 0.    (A.10)

Therefore U is not orthogonal to L2(V +W ), which concludes the proof by contraposition.

Acknowledgments

Part of David Ginsbourger's contribution took place within the framework of the “Bayesian set estimation under random field priors” project (Number 146354) funded by the Swiss National Science Foundation. François Bachoc acknowledges support from the “PEPITO” project funded by the French National Agency for Research. The authors would like to thank Luc Pronzato for pointing out the important connection with DeGroot's earlier work on uncertainty functionals, Dario Azzimonti for proofreading, and two anonymous reviewers for their careful reading and numerous suggestions of improvement.


References

[1] Adler, R. J. (1981). The Geometry of Random Fields. Wiley, New York.
[2] Azaïs, J.-M. and Wschebor, M. (2009). Level Sets and Extrema of Random Processes and Fields. Wiley.
[3] Bayarri, M. J., Berger, J. O., Paulo, R., Sacks, J., Cafeo, J. A., Cavendish, J., Lin, C.-H., and Tu, J. (2012). A framework for validation of computer models. Technometrics.
[4] Bect, J., Ginsbourger, D., Li, L., Picheny, V., and Vazquez, E. (2012). Sequential design of computer experiments for the estimation of a probability of failure. Statistics and Computing, 22(3):773–793.
[5] Billingsley, P. (1999). Convergence of Probability Measures. Second edition. John Wiley & Sons.
[6] Binois, M. (2015). Uncertainty Quantification on Pareto Fronts and High-Dimensional Strategies in Bayesian Optimization, with Applications in Multi-Objective Automotive Design. PhD thesis, Ecole Nationale Supérieure des Mines de Saint-Etienne.
[7] Bogachev, V. I. (1998). Gaussian Measures. American Mathematical Society.
[8] Brezis, H. (2010). Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer Science & Business Media.
[9] Bull, A. D. (2011). Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904.
[10] Chevalier, C. (2013). Fast Uncertainty Reduction Strategies Relying on Gaussian Process Models. PhD thesis, University of Bern.
[11] Chevalier, C., Bect, J., Ginsbourger, D., Vazquez, E., Picheny, V., and Richet, Y. (2014). Fast parallel kriging-based stepwise uncertainty reduction with application to the identification of an excursion set. Technometrics, 56(4):455–465.
[12] Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145.
[13] Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. John Wiley & Sons.
[14] Cressie, N. (1993). Statistics for Spatial Data. Wiley, New York.
[15] DeGroot, M. H. (1962). Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404–419.
[16] DeGroot, M. H. (1986). Concepts of Information Based on Utility, pages 265–275. Springer Netherlands, Dordrecht.
[17] DeGroot, M. H. (1984). Changes in utility as information. Theory and Decision, 17(3):287–303.
[18] Diaconis, P. (1988). Bayesian numerical analysis. Statistical Decision Theory and Related Topics IV, 1:163–175.
[19] Emmerich, M., Giannakoglou, K., and Naujoks, B. (2006). Single- and multiobjective optimization assisted by Gaussian random field metamodels. IEEE Transactions on Evolutionary Computation, 10(4).
[20] Feliot, P., Bect, J., and Vazquez, E. (2016). A Bayesian approach to constrained single- and multi-objective optimization. Journal of Global Optimization, 67(1):97–133.
[21] Forrester, A., Sobester, A., and Keane, A. (2008). Engineering Design via Surrogate Modelling: a Practical Guide. Wiley.
[22] Frazier, P. (2009). Knowledge-Gradient Methods for Statistical Learning. PhD thesis, Princeton University.
[23] Frazier, P. I., Powell, W. B., and Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439.
[24] Frazier, P. I., Powell, W. B., and Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613.
[25] Geman, D. and Jedynak, B. (1996). An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(1):1–14.
[26] Ginsbourger, D., Baccou, J., Chevalier, C., Garland, N., Perales, F., and Monerie, Y. (2014). Bayesian adaptive reconstruction of profile optima and optimizers. SIAM/ASA Journal on Uncertainty Quantification, 2(1):490–510.
[27] Ginsbourger, D., Roustant, O., and Durrande, N. (2016). On degeneracy and invariances of random fields paths with applications in Gaussian process modelling. Journal of Statistical Planning and Inference, 170:117–128.
[28] Gramacy, R., Gray, G., Le Digabel, S., Lee, H., Ranjan, P., Wells, G., and Wild, S. (2016). Modeling an augmented Lagrangian for blackbox constrained optimization. Technometrics (with discussion), 58(1):1–11.
[29] Grünewälder, S., Audibert, J.-Y., Opper, M., and Shawe-Taylor, J. (2010). Regret bounds for Gaussian process bandit problems. In International Conference on Artificial Intelligence and Statistics.
[30] Hainy, M., Müller, W. G., and Wynn, H. P. (2014). Learning functions and approximate Bayesian computation design: ABCD. Entropy, 16(8):4353–4374.
[31] Hennig, P., Osborne, M. A., and Girolami, M. (2015). Probabilistic numerics and uncertainty in computations. In Proceedings of the Royal Society A, volume 471, page 20150142. The Royal Society.
[32] Ibragimov, I. and Rozanov, Y. (1978). Gaussian Random Processes. Springer-Verlag, New York.
[33] Johnson, R. (1960). An information theory approach to diagnosis. In Proceedings of the 6th Symposium on Reliability and Quality Control, pages 102–109.
[34] Jones, D., Schonlau, M., and Welch, W. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492.
[35] Kallenberg, O. (2002). Foundations of Modern Probability. Second edition. Springer-Verlag.
[36] King-Smith, P. E., Grigsby, S. S., Vingrys, A. J., Benes, S. C., and Supowit, A. (1994). Efficient and unbiased modifications of the QUEST threshold method: theory, simulations, experimental evaluation and practical implementation. Vision Research, 34(7):885–912.
[37] Koehler, J., Puhalskii, A. A., and Simon, B. (1998). Estimating functions evaluated by simulation: A Bayesian/analytic approach. Annals of Applied Probability, 8(4):1184–1215.
[38] Ledoux, M. and Talagrand, M. (2013). Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media.
[39] Lukić, M. and Beder, J. (2001). Stochastic processes with sample paths in reproducing kernel Hilbert spaces. Transactions of the American Mathematical Society, 353(10):3945–3969.
[40] MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4(4):590–604.
[41] Mockus, J. B., Tiesis, V., and Žilinskas, A. (1978). The application of Bayesian methods for seeking the extremum. In Dixon, L. C. W. and Szegö, G. P., editors, Towards Global Optimization, volume 2, pages 117–129, North Holland, New York.
[42] Molchanov, I. (2006). Theory of Random Sets. Springer Science & Business Media.
[43] O'Geran, J. H., Wynn, H. P., and Zhiglyavsky, A. A. (1993). Mastermind as a test-bed for search algorithms. Chance, 6(1):31–37.
[44] O'Hagan, A. (1991). Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260.
[45] Perlman, M. D. (1974). Jensen's inequality for a convex vector-valued function on an infinite-dimensional space. Journal of Multivariate Analysis, 4(1):52–65.
[46] Picheny, V. (2014). A stepwise uncertainty reduction approach to constrained global optimization. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS).
[47] Picheny, V., Ginsbourger, D., Roustant, O., Haftka, R., and Kim, N.-H. (2010). Adaptive designs of experiments for accurate approximation of target regions. Journal of Mechanical Design, 132(7).
[48] Ranjan, P., Bingham, D., and Michailidis, G. (2008). Sequential experiment design for contour estimation from complex computer codes. Technometrics, 50(4):527–541.
[49] Ritter, K. (2000). Average-Case Analysis of Numerical Problems, volume 1733 of Lecture Notes in Mathematics. Springer Verlag.
[50] Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4:409–423.
[51] Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer, New York.
[52] Schönfeld, P. (1973). A note on the measurability of the pseudo-inverse. Journal of Econometrics, 1(3):313–314.
[53] Scott, W., Frazier, P., and Powell, W. (2011). The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21:996.
[54] Shahriari, B., Swersky, K., Wang, Z., Adams, R., and de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
[55] Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58:3250–3265.
[56] Stein, M. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York.
[57] Stroock, D. W. (2010). Probability Theory: An Analytic View. Second edition. Cambridge University Press.
[58] Vakhania, N., Tarieladze, V., and Chobanyan, S. (1987). Probability Distributions on Banach Spaces, volume 14. Springer Science & Business Media.
[59] van der Vaart, A. W., van Zanten, J. H., et al. (2008). Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, pages 200–222. Institute of Mathematical Statistics.
[60] Vazquez, E. and Bect, J. (2009). A sequential Bayesian algorithm to estimate a probability of failure. In 15th IFAC Symposium on System Identification (SYSID 2009).
[61] Vazquez, E. and Bect, J. (2010a). Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088–3095.
[62] Vazquez, E. and Bect, J. (2010b). Pointwise consistency of the kriging predictor with known mean and covariance functions. In mODa 9 – Advances in Model-Oriented Design and Analysis, pages 221–228. Springer.
[63] Villemonteix, J., Vazquez, E., and Walter, E. (2009). An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534.
[64] Wang, H., Lin, G., and Li, J. (2016). Gaussian process surrogates for failure detection: a Bayesian experimental design approach. Journal of Computational Physics, 313:247–259.
[65] Williams, B. J., Santner, T. J., and Notz, W. I. (2000). Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica, 10:1133–1152.
[66] Yarotsky, D. (2013a). Examples of inconsistency in optimization by expected improvement. Journal of Global Optimization, 56(4):1773–1790.
[67] Yarotsky, D. (2013b). Univariate interpolation by exponential functions and Gaussian RBFs for generic sets of nodes. Journal of Approximation Theory, 166:163–175.
[68] Zuluaga, M., Krause, A., Sergent, G., and Püschel, M. (2013). Active learning for level set estimation. In International Joint Conference on Artificial Intelligence (IJCAI).