

Preferential Bayesian Optimization

Javier González 1  Zhenwen Dai 1  Andreas Damianou 1  Neil D. Lawrence 1 2

Abstract

Bayesian optimization (BO) has emerged during the last few years as an effective approach to optimizing black-box functions where direct queries of the objective are expensive. In this paper we consider the case where direct access to the function is not possible, but information about user preferences is. Such scenarios arise in problems where human preferences are modeled, such as A/B tests or recommender systems. We present a new framework for this scenario that we call Preferential Bayesian Optimization (PBO), which allows us to find the optimum of a latent function that can only be queried through pairwise comparisons, the so-called duels. PBO extends the applicability of standard BO ideas and generalizes previous discrete dueling approaches by modeling the probability of the winner of each duel by means of a Gaussian process model with a Bernoulli likelihood. The latent preference function is used to define a family of acquisition functions that extend usual policies used in BO. We illustrate the benefits of PBO in a variety of experiments, showing that PBO needs drastically fewer comparisons for finding the optimum. According to our experiments, the way of modeling correlations in PBO is key in obtaining this advantage.

1. Introduction

Let g : X → R be a well-behaved black-box function defined on a bounded subset X ⊆ R^q. We are interested in solving the global optimization problem of finding

   x_min = arg min_{x ∈ X} g(x).   (1)

We assume that g is not directly accessible and that queries to g can only be done in pairs of points, or duels, [x, x′] ∈ X × X, from which we obtain binary feedback {0, 1} that represents whether or not x is preferred over x′ (has lower value)¹. We will consider that x is the winner of the duel if the output is {1} and that x′ wins the duel if the output is {0}. The goal here is to find x_min by reducing as much as possible the number of queried duels.

1 Amazon Research Cambridge, UK. 2 University of Sheffield, UK. Correspondence to: Javier González <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Our setup is different from the one typically used in BO, where direct feedback from g in the domain is available (Jones, 2001; Snoek et al., 2012). In our context, the objective is a latent object that is only accessible via indirect observations. Although the scenario described in this work has not received wide attention, there exists a variety of real-world scenarios in which the objective function needs to be optimized via preferential returns. Most cases involve modeling latent human preferences, such as web design via A/B testing, the use of recommender systems (Brusilovsky et al., 2007) or the ranking of game players' skills (Herbrich et al., 2007). In prospect theory, the models used are based on comparisons with some reference point, as it has been demonstrated that humans are better at evaluating differences than absolute magnitudes (Kahneman and Tversky, 1979).

Optimization methods for pairwise preferences have been studied in the armed-bandits context (Yue et al., 2012). Zoghi et al. (2014) propose a new method for the K-armed duelling bandit problem based on the Upper Confidence Bound algorithm. Jamieson et al. (2015) study the problem by allowing noisy comparisons between the duels. Zoghi et al. (2015b) choose actions using contextual information. Dudík et al. (2015) study Copeland dueling bandits, a case in which a Condorcet winner, or an arm that uniformly wins the duels with all the other arms, may not exist. Szörényi et al. (2015) study the Online Rank Elicitation problem in the duelling bandits setting. An analysis of Thompson sampling in duelling bandits is done by Wu et al. (2016). Yue and Joachims (2011) propose a method that does not need transitivity and comparison outcomes to have independent and time-stationary distributions.

Preference learning has also been studied in the context of Gaussian process (GP) models (Chu and Ghahramani, 2005) by using a likelihood able to model preferential returns.

1 In this work we use [x, x′] to represent the vector resulting from concatenating both elements involved in the duel.


Figure 1. Illustration of the key elements of an optimization problem with pairwise preferential returns in a one-dimensional example. Top-left: objective (Forrester) function to minimize; this function is only accessible through pairwise comparisons of inputs x and x′. Right: true preference function π_f([x, x′]), which represents the probability that x will win a duel over x′. Note that, by symmetry, π_f([x, x′]) = 1 − π_f([x′, x]). Bottom-left: the normalised Copeland and soft-Copeland functions, whose maximum is located at the same point as the minimum of the latent objective.

Similarly, Brochu (2010) used a probabilistic model to actively learn preferences in the context of discovering optimal parameters for simple graphics and animation engines. To select new duels, standard acquisition functions like the Expected Improvement (EI) (Mockus, 1977) are extended on top of a GP model with a likelihood for duels. Although this approach is simple and effective in some cases, new duels are selected greedily, which may lead to over-exploitation.

In this work we propose a new approach aiming to combine the good properties of the armed-bandit methods with the advantages of having a probabilistic model able to capture correlations across the points in the domain X. Following the above-mentioned literature in the bandits setting, the key idea is to learn a preference function in the space of duels by using a Gaussian process. This allows us to select the most relevant comparisons non-greedily and improve the state of the art in this domain.

This paper is organized as follows. In Section 2 we introduce the point of view followed in this work to model latent preferences. We define concepts such as the Copeland score function and the Condorcet winner, which form the basis of our approach. Also in Section 2, we show how to learn these objects from data. In Section 3 we generalize the most commonly used acquisition functions to the dueling case. In Section 4 we illustrate the benefits of the proposed framework compared to state-of-the-art methods in the literature. We conclude in Section 5 with a discussion and some future lines of research.

2. Learning latent preferences

The approach followed in this work is inspired by the work of Ailon et al. (2014), in which cardinal bandits are reduced to ordinal ones. Similarly, here we focus on the idea of reducing the problem of finding the optimum of a latent function defined on X to determining a sequence of duels on X × X.

We assume that each duel [x, x′] produces a joint reward f([x, x′]) that is never directly observed. Instead, after each pair is proposed, the obtained feedback is a binary return y ∈ {0, 1} representing which of the two locations is preferred. In this work, we assume that f([x, x′]) = g(x′) − g(x), but other alternatives are possible. Note that the more x is preferred over x′, the bigger the reward.

Since the preferences of humans are often unclear and may conflict, we model preferences as a stochastic process. In particular, the model of preference is a Bernoulli probability function

   p(y = 1 | [x, x′]) = π_f([x, x′])   and   p(y = 0 | [x, x′]) = π_f([x′, x]),

where the preference function π_f : X × X → [0, 1] is obtained by squashing the latent reward f through an inverse link function; it maps each query [x, x′] to the probability of having a preference for the left input x over the right input x′. The inverse link function has the property that π_f([x′, x]) = 1 − π_f([x, x′]). A natural choice for π_f is


the logistic function

   π_f([x, x′]) = σ(f([x, x′])) = 1 / (1 + e^{−f([x, x′])}),   (2)

but others are possible. Note that for any duel [x, x′] in which g(x) ≤ g(x′), it holds that π_f([x, x′]) ≥ 0.5. π_f is therefore a preference function that fully specifies the problem.
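As a minimal illustration of this generative assumption, the following sketch (in Python; the standard Forrester benchmark is used as a stand-in for the latent objective g, and all names are illustrative) draws the binary outcome of a duel through the logistic link of eq. (2).

```python
import numpy as np

def forrester(x):
    # Standard 1-D Forrester benchmark on [0, 1], used here as the latent objective g.
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def duel_outcome(g, x, x_prime, rng):
    # Latent joint reward f([x, x']) = g(x') - g(x): positive when x has the lower value.
    f = g(x_prime) - g(x)
    p_x_wins = 1.0 / (1.0 + np.exp(-f))   # pi_f([x, x']) from eq. (2)
    return rng.binomial(1, p_x_wins)       # y = 1 means x wins the duel

rng = np.random.default_rng(0)
y = duel_outcome(forrester, 0.1, 0.9, rng)  # x = 0.1 duels against x' = 0.9
```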

We introduce here the concept of the normalised Copeland score, already used in the literature on ranking methods (Zoghi et al., 2015a), as

   S(x) = Vol(X)^{-1} ∫_X I{π_f([x, x′]) ≥ 0.5} dx′,

where Vol(X) = ∫_X dx′ is a normalizing constant that bounds S(x) in the interval [0, 1]. If X is a finite set, the Copeland score is simply the proportion of duels that a certain element x will win with probability larger than 0.5. Instead of the Copeland score, in this work we use a soft version of it, in which the probability function π_f is integrated over X without further truncation. Formally, we define the soft-Copeland score as

   C(x) = Vol(X)^{-1} ∫_X π_f([x, x′]) dx′,   (3)

which aims to capture the 'averaged' probability of x being the winner of a duel.

Following the armed-bandits literature, we say that x_c is a Condorcet winner if it is the point with maximal soft-Copeland score. It is straightforward to see that if x_c is a Condorcet winner with respect to the soft-Copeland score, it is a global minimum of g in X: the integral in (3) takes its maximum value for points x ∈ X such that f([x, x′]) = g(x′) − g(x) > 0 for all x′, which only occurs if x_c is a minimum of g. This implies that if, by observing the results of a set of duels, we can learn the preference function π_f, the optimization problem of finding the minimum of g can be addressed by finding the Condorcet winner of the Copeland score. See Figure 1 for an illustration of this property.

2.1. Learning the preference function π_f([x, x′]) with Gaussian processes

Assume that N duels have been performed so far, resulting in a dataset D = {[x_i, x′_i], y_i}_{i=1}^N. Given D, inference over the latent function f and its warped version π_f can be carried out by using Gaussian processes (GPs) for classification (Rasmussen and Williams, 2005).

In a nutshell, a GP is a probability measure over functions such that any finite-dimensional restriction is multivariate Gaussian. Any GP is fully determined by a positive definite covariance operator. In standard regression cases with Gaussian likelihoods, closed forms for the posterior mean and variance are available. In preference learning, like the case we face here, the basic idea behind Gaussian process modeling is to place a GP prior over some latent function f that captures the membership of the data to the two classes, and to squash it through the logistic function to obtain some prior probability π_f. In other words, the model of a GP for classification looks similar to eq. (2), but with the difference that f is a stochastic process, as is π_f. The stochastic latent function f is a nuisance function, as we are not directly interested in its values but instead in particular values of π_f at test locations [x_*, x′_*].

Inference is divided into two steps. First we need to compute the distribution of the latent variable corresponding to a test case, p(f_* | D, [x_*, x′_*], θ), and later use this distribution over the latent f_* to produce a prediction

   π_f([x_*, x′_*]; D, θ) = p(y_* = 1 | D, [x_*, x′_*], θ) = ∫ σ(f_*) p(f_* | D, [x_*, x′_*], θ) df_*,   (4)

where the vector θ contains the hyper-parameters of the model, which can also be marginalized out. In this scenario, GP predictions are not straightforward (in contrast to the regression case), since the posterior distribution is analytically intractable and approximations are required (see (Rasmussen and Williams, 2005) for details). The important message here is, however, that given the locations and results of the duels we can learn the preference function π_f while taking into account the correlations across the duels, which makes the approach very data-efficient compared to bandit scenarios where correlations are ignored.
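As a rough sketch of this modeling step, an off-the-shelf GP classifier can be fitted on the concatenated duel vectors; the snippet below uses scikit-learn's GaussianProcessClassifier (Laplace approximation, logistic link) as a stand-in for the model of this section. Note that a plain RBF kernel on [x, x′] does not enforce the symmetry π_f([x′, x]) = 1 − π_f([x, x′]) of the preference function, and the data shown are made up for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Toy dataset D = {([x_i, x'_i], y_i)}: each row concatenates the two players of a duel.
X_duels = np.array([[0.1, 0.9], [0.3, 0.6], [0.8, 0.2], [0.45, 0.55]])
y = np.array([1, 1, 0, 1])   # y = 1 means the left player won

# GP classification with a logistic link approximates pi_f over the duels space.
model = GaussianProcessClassifier(kernel=RBF(length_scale=0.2)).fit(X_duels, y)

# Predicted pi_f at a test duel [x_*, x'_*]: probability that x_* wins.
p_win = model.predict_proba(np.array([[0.05, 0.95]]))[0, 1]
```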

2.2. Computing the soft-Copeland score and the Condorcet winner

The soft-Copeland function can be obtained by integrating π_f([x, x′]) over X, so it is possible to learn the soft-Copeland function from data by integrating π_f([x, x′]; D). Unfortunately, a closed-form solution for

   Vol(X)^{-1} ∫_X π_f([x, x′]; D, θ) dx′

does not necessarily exist. In this work we use Monte Carlo integration to approximate the Copeland score at any x ∈ X via

   C(x; D, θ) ≈ (1/M) Σ_{k=1}^{M} π_f([x, x_k]; D, θ),   (5)

where x_1, . . . , x_M are a set of landmark points used to perform the integration. For simplicity, in this work we select the landmark points uniformly, although more sophisticated probabilistic approaches can be applied (Briol et al., 2015).
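A minimal sketch of this approximation, reusing the illustrative model from the previous snippet and uniformly drawn landmark points, could look as follows; the Condorcet winner is then the grid point with the largest estimated score.

```python
import numpy as np

def soft_copeland(model, x_grid, landmarks):
    # Eq. (5): C(x) is approximated by averaging pi_f([x, x_k]) over M landmark points x_k.
    scores = np.empty(len(x_grid))
    for i, x in enumerate(x_grid):
        duels = np.column_stack([np.full(len(landmarks), x), landmarks])
        scores[i] = model.predict_proba(duels)[:, 1].mean()
    return scores

x_grid = np.linspace(0.0, 1.0, 33)                     # grid as in the experiments
landmarks = np.random.default_rng(1).uniform(size=50)  # uniform landmark points
C = soft_copeland(model, x_grid, landmarks)            # `model` from the sketch above
x_condorcet = x_grid[np.argmax(C)]                     # current Condorcet winner
```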


Figure 2. Differences between the sources of uncertainty that can be used for the exploration of the duels. The three panels show different elements of a GP used for preferential learning in the context of the optimization of the (latent) Forrester function. The model is learned using the result of 30 duels. Left: expectation of y_*, which coincides with the expectation of σ(f_*) and is denoted π_f([x_*, x′_*]). Center: variance of the output of the duels y_*, computed as π_f([x_*, x′_*])(1 − π_f([x_*, x′_*])). Note that this variance does not necessarily decrease in locations where observations are available. Right: variance of σ(f_*), which decreases in regions where data are available; this makes it appropriate for exploring duels, in contrast to the variance of y_*.

The Condorcet winner can be computed by taking

   x_c = arg max_{x ∈ X} C(x; D, θ),

which can be done using a standard numerical optimizer. x_c is the point that has, on average, the maximum probability of winning most of the duels (given the dataset D) and is therefore the most likely point to be the optimum of g.

3. Sequential Learning of the Condorcet winner

In this section we analyze the case in which n extra duels can be carried out to augment the dataset D before we have to report a solution to (1). This is similar to the set-up in Brochu (2010), where interactive Bayesian optimization is proposed by allowing a human user to sequentially decide the result of a number of duels.

In the sequel, we will denote by D_j the dataset resulting from augmenting D with j new pairwise comparisons. Our goal in this section is to define a sequential policy for querying duels, α([x, x′]; D_j, θ). This policy will enable us to identify as soon as possible the minimum of the latent function g. Note that here, differently from the situation in standard Bayesian optimization, the search space of the acquisition, X × X, is not the same as the domain X of the latent function that we are optimizing. Our best guess about its optimum, however, is the location of the Condorcet winner.

We approach the problem by proposing three dueling acquisition functions: (i) pure exploration (PE), (ii) the Copeland Expected Improvement (CEI) and (iii) dueling-Thompson sampling, which makes explicit use of the generative capabilities of our model. We analyze the three approaches in terms of what the exploration-exploitation balance means in our context. For simplicity of notation, in the sequel we drop the dependency of all quantities on the parameters θ.

3.1. Pure Exploration

The first question that arises when defining a new acquisition for duels is what exploration means in this context. Given a model as described in Section 2.1, the output variables y_* follow a Bernoulli distribution with probability given by the preference function π_f. A straightforward interpretation of pure exploration would be to search for the duel whose outcome is most uncertain (has the highest variance of y_*). The variance of y_* is given by

   V[y_* | [x_*, x′_*], D_j] = π_f([x_*, x′_*]; D_j)(1 − π_f([x_*, x′_*]; D_j)).

However, as preferences are modeled with a Bernoulli model, the variance of y_* does not necessarily reduce with sufficient observations. For example, according to eq. (2), for any two values x_* and x′_* such that g(x_*) ≈ g(x′_*), π_f([x_*, x′_*]; D_j) will tend to be close to 0.5, and therefore it will have maximal variance even if we have already collected several observations in that region of the duels space.

Alternatively, exploration can be carried out by searching for the duel where the GP is most uncertain about the probability of the outcome (has the highest variance of σ(f_*)), which is the result of transforming our epistemic uncertainty about f, modeled by a GP, through the logistic function. The first-order moment of this distribution coincides with the expectation of y_*, but its variance is

   V[σ(f_*)] = ∫ (σ(f_*) − E[σ(f_*)])² p(f_* | D, [x, x′]) df_*
             = ∫ σ(f_*)² p(f_* | D, [x, x′]) df_* − E[σ(f_*)]²,


Figure 3. 100 continuous samples of the Copeland score function (grey) in the Forrester example, generated using Thompson sampling. The three plots show the samples obtained once the model has been trained with different numbers of duels (10, 30 and 150). In black we show the Copeland function computed using the preference function. The more samples are available, the more exploitation is encouraged in the first element of the duel, as the probability of selecting x_next as the true optimum increases.

which explicitly takes into account the uncertainty over f. Hence, pure exploration of the duels space can be carried out by maximizing

   α_PE([x, x′] | D_j) = V[σ(f_*) | [x_*, x′_*], D_j].

Remark that in this case, duels that have already been visited will have a lower chance of being visited again, even in cases in which the objective takes similar values for both players. See Figure 2 for an illustration of this property.

In practice, this acquisition requires computing an intractable integral, which we approximate using Monte Carlo.
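The two variances can be contrasted numerically with the small sketch below, which assumes access to a Gaussian approximation N(μ, s²) of the latent posterior p(f_* | D_j, [x_*, x′_*]) (e.g. from a Laplace or EP approximation); the function names are illustrative.

```python
import numpy as np

def outcome_variance(p):
    # Variance of the duel outcome y_*: pi_f (1 - pi_f). Stays large wherever g(x) ~ g(x'),
    # no matter how many observations have been collected there.
    return p * (1.0 - p)

def warped_latent_variance(mu, s, n_samples=2000, seed=0):
    # Monte Carlo estimate of V[sigma(f_*)] under a Gaussian approximation N(mu, s^2)
    # of the latent posterior; this is the quantity maximized by alpha_PE.
    f = np.random.default_rng(seed).normal(mu, s, size=n_samples)
    return (1.0 / (1.0 + np.exp(-f))).var()

# Example: a duel between near-equal players about which the model is already confident.
print(outcome_variance(0.5))              # ~0.25: still maximal
print(warped_latent_variance(0.0, 0.05))  # ~0: little epistemic uncertainty left
```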

3.2. Copeland Expected Improvement

An alternative way to define an acquisition function is by generalizing the idea of the Expected Improvement (Mockus, 1977). The idea of the EI is to compute, in expectation, the marginal gain with respect to the current best observed output. In our context, as we do not have direct access to the objective, our only way of evaluating the quality of a single point is by computing its Copeland score. To generalize the idea to our context, we need to find the duel able to maximally improve the expected score of the Condorcet winner.

Denote by c*_j the value of the Condorcet winner when j duels have already been run. For any new proposed duel [x, x′], two outcomes {0, 1} are possible, corresponding to the cases in which x or x′ wins the duel. We denote by c*_{x,j} the value of the estimated Condorcet winner resulting from augmenting D with {[x, x′], 1}, and by c*_{x′,j} the equivalent value when augmenting the dataset with {[x, x′], 0}. We define the one-lookahead Copeland Expected Improvement at iteration j as

   α_CEI([x, x′] | D_j) = E[(c − c*_j)_+ | D_j],   (6)

where (·)_+ = max(0, ·) and the expectation is taken over c, the value at the Condorcet winner given the result of the duel. The next duel is selected as the pair that maximizes the CEI. Intuitively, the CEI evaluated at [x, x′] is a weighted sum of the total increase of the best possible value of the Copeland score under the two possible outcomes of the duel. The weights are chosen to be the probabilities of the two outcomes, which are given by π_f. The CEI can be computed in closed form as

   α_CEI([x, x′] | D_j) = π_f([x, x′] | D_j)(c*_{x,j} − c*_j)_+ + π_f([x′, x] | D_j)(c*_{x′,j} − c*_j)_+.

The computation of this acquisition is computationally demanding, as it requires updating the GP classification model for every fantasized output at any point in the domain. As we show in the experimental section, and similarly to what is observed in the literature about the EI, this acquisition tends to be greedy in certain scenarios, leading to over-exploitation (Hernandez-Lobato et al., 2014; Hennig and Schuler, 2012). Although non-myopic generalizations of this acquisition are possible to address this issue, in the same fashion as in (González et al., 2016b), these are intractable.
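The following naive sketch illustrates the one-step-lookahead computation and why it is expensive: for every candidate duel, the preference model has to be refitted under both fantasized outcomes. It assumes a user-supplied `fit` callback that fits the GP classification model to a list of (duel, outcome) pairs and reuses the `soft_copeland` helper from the earlier snippet; all names are illustrative.

```python
import numpy as np

def copeland_ei(duel, data, fit, c_best, x_grid, landmarks):
    # One-step-lookahead CEI (eq. 6): improvement of the best soft-Copeland value under
    # each fantasized outcome, weighted by the model's probability of that outcome.
    x, x_prime = duel
    model = fit(data)
    p_x_wins = model.predict_proba(np.array([[x, x_prime]]))[0, 1]

    gains = []
    for outcome in (1, 0):                                  # x wins / x' wins
        fantasy = data + [((x, x_prime), outcome)]
        c_fantasy = soft_copeland(fit(fantasy), x_grid, landmarks).max()
        gains.append(max(0.0, c_fantasy - c_best))          # (c - c*_j)_+

    return p_x_wins * gains[0] + (1.0 - p_x_wins) * gains[1]
```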

3.3. Dueling-Thompson sampling

As previously detailed, purely explorative approaches do not exploit the available knowledge about the current best location, while CEI is expensive to compute and tends to over-exploit. In this section we propose an alternative acquisition function that is fast to compute and explicitly balances exploration and exploitation. We call this acquisition dueling-Thompson sampling, and it is inspired by Thompson sampling approaches. It follows a two-step policy for duels:

1. Step 1, selecting x: First, generate a sample f from the model using continuous Thompson sampling² and compute the associated soft-Copeland score by integrating over X. The first element of the new duel, x_next, is selected as

   x_next = arg max_{x ∈ X} ∫_X π_f([x, x′]; D_j) dx′.

The term Vol(X)^{-1} in the Copeland score has been dropped here as it does not change the location of the optimum. The goal of this step is to balance exploration and exploitation in the selection of the Condorcet winner, in the same fashion Thompson sampling does: it is likely to select a point close to the current Condorcet winner, but the policy also allows the exploration of other locations, as we base our decision on a stochastic f. Also, the more evaluations are collected, the greedier the selection will be towards the Condorcet winner. See Figure 3 for an illustration of this effect.

2. Step 2, selecting x′: Given x_next, the second element of the duel is selected as the location that maximizes the variance of σ(f_*) in the direction of x_next. More formally, x′_next is selected as

   x′_next = arg max_{x′_* ∈ X} V[σ(f_*) | [x_*, x′_*], D_j, x_* = x_next].

This second step is purely explorative in the direction of x_next, and its goal is to find informative comparisons to run with the current good locations identified in the previous step.

Figure 4. Illustration of the steps used to propose a new duel with the dueling-Thompson acquisition. The duel is computed using the same model as in Figure 2. The white triangle represents the final selected duel. Left: sample from f_* squashed through the logistic function σ. Center: sampled soft-Copeland function, which results from integrating the sample of σ(f_*) on the left across the vertical axis. The first element of the duel, x, is selected as the location of the maximum of the sampled soft-Copeland function (vertical dotted line). Right: the second element of the duel, x′, is selected by maximizing the variance of σ(f_*) marginally given x (maximum across the vertical dotted line).

² Approximate continuous samples from a GP with a shift-invariant kernel can be obtained by using Bochner's theorem (Bochner, 1959). In a nutshell, the idea is to approximate the kernel by means of the inner product of a finite number of Fourier features that are sampled according to the measure associated to the kernel (Rahimi and Recht, 2008; Hernandez-Lobato et al., 2014).

Algorithm 1 The PBO algorithm
Input: dataset D_0 = {[x_i, x′_i], y_i}_{i=1}^N, number of remaining evaluations n, acquisition for duels α([x, x′]).
for j = 0 to n do
  1. Fit a GP with kernel k to D_j and learn π_{f,j}(x).
  2. Compute the acquisition for duels α.
  3. Next duel: [x_{j+1}, x′_{j+1}] = arg max α([x, x′]).
  4. Run the duel [x_{j+1}, x′_{j+1}] and obtain y_{j+1}.
  5. Augment D_{j+1} = D_j ∪ {([x_{j+1}, x′_{j+1}], y_{j+1})}.
end for
Fit a GP with kernel k to D_n.
Returns: report the current Condorcet winner x*_n.
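A direct transcription of Algorithm 1 into Python might look as follows, with `fit`, `acquisition` and `run_duel` assumed to be user-supplied callables and the Condorcet winner reported on a grid using the `soft_copeland` helper sketched earlier; it is a schematic outline rather than a full implementation.

```python
def pbo(initial_data, n, fit, acquisition, run_duel, x_grid, landmarks):
    # Sequentially augment the dataset with duels chosen by the acquisition,
    # then report the Condorcet winner of the final model (Algorithm 1).
    data = list(initial_data)                      # list of (duel, outcome) pairs
    for _ in range(n):
        model = fit(data)                          # 1. fit the GP preference model
        duel = acquisition(model, data)            # 2-3. maximize the duels acquisition
        outcome = run_duel(duel)                   # 4. query the pairwise comparison
        data.append((duel, outcome))               # 5. augment the dataset
    final_model = fit(data)
    scores = soft_copeland(final_model, x_grid, landmarks)
    return x_grid[scores.argmax()]                 # report the Condorcet winner
```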

In summary, the dueling-Thompson sampling approach selects the next duel as

   [x_next, x′_next] = arg max_{[x, x′]} α_DTS([x, x′] | D_j),

where x_next and x′_next are defined above. This policy combines the selection of a point with high chances of being the optimum with a point whose duel outcome against the former is uncertain. Note that this has some interesting connections with uncertainty sampling as presented in (Houlsby et al., 2011). See Figure 4 for a visual illustration of the two steps in a toy example. See Algorithm 1 for a full description of the PBO approach.
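A sketch of the two-step selection is given below. It assumes a `sample_f` callable that returns an approximate continuous posterior sample (for instance built with the random Fourier features mentioned in the footnote) and a `latent_variance` callable that evaluates V[σ(f_*)] at a duel, as in Section 3.1; both are illustrative placeholders.

```python
import numpy as np

def dts_select(sample_f, latent_variance, x_grid, landmarks):
    # Step 1: draw one posterior sample and pick the first element of the duel as the
    # maximizer of the sampled soft-Copeland score (Vol(X)^-1 dropped, as in the text).
    f_sample = sample_f()                                   # f_sample([x, x']) -> latent value

    def sampled_copeland(x):
        latent = np.array([f_sample([x, x_k]) for x_k in landmarks])
        return (1.0 / (1.0 + np.exp(-latent))).mean()

    x_next = max(x_grid, key=sampled_copeland)

    # Step 2: pick the second element as the most uncertain duel along the slice fixed at x_next.
    x_prime_next = max(x_grid, key=lambda x_p: latent_variance([x_next, x_p]))
    return x_next, x_prime_next
```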

3.4. Generalizations to multiple returns scenarios

A natural extension of the PBO set-up detailed above covers cases in which multiple comparisons of inputs are simultaneously allowed. This is equivalent to providing a ranking over a set of points x_1, . . . , x_k. Rankings are trivial to map to pairwise preferences by using the pairwise ordering to obtain the result of the duels. The problem is, therefore, equivalent from a modeling perspective. However, from the point of view of selecting the k locations to rank at each iteration, generalizations of the above-mentioned acquisitions are required. Although this is not the goal of this work, it is interesting to remark that this problem has strong connections with the problem of computing batches in BO (González et al., 2016a).
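As a small illustrative helper (the names are not from the original formulation), a best-first ranking can be unrolled into winning duels as follows.

```python
from itertools import combinations

def ranking_to_duels(ranked_points):
    # A ranking x_(1) > x_(2) > ... (best first) induces one winning duel per ordered pair,
    # so k ranked points yield k(k-1)/2 pairwise comparisons, each with outcome y = 1.
    return [((better, worse), 1) for better, worse in combinations(ranked_points, 2)]

duels = ranking_to_duels([0.2, 0.7, 0.5])
# -> [((0.2, 0.7), 1), ((0.2, 0.5), 1), ((0.7, 0.5), 1)]
```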


Figure 5. Averaged results across the four latent objective functions (Forrester, Six-Hump Camel, Gold-Stein and Levy) and the six compared methods (PBO-PE, PBO-DTS, PBO-CEI, RANDOM, IBO, SPARRING). The CEI is only computed for the Forrester function, as it is intractable when the dimension increases. Results are computed over 20 replicates in which 5 randomly selected initial points are used to initialize the models and are kept the same across the six methods. The horizontal axis of each plot represents the number of evaluations and the vertical axis the value (log scale in the second row) of the true objective function at the best guess (Condorcet winner) at each evaluation. Note that this is only possible because we know the true objective function. The curves are not necessarily monotonically decreasing, as we do not show the current best solution but the current solution at each iteration (the location proposed by each method at each step in a real scenario).


4. Experiments

We present three experiments which validate our approach in terms of performance and illustrate its key properties. The set-up is as follows: we have a non-linear black-box function g(x) whose minimum we seek, as described in equation (1). However, we can only query this function through pairwise comparisons. The outcome of a pairwise comparison is generated as described in Section 2, i.e., the outcome is drawn from a Bernoulli distribution whose sample probability is computed according to equation (2).

We have considered as latent objective functions g(x) the Forrester, 'six-hump camel', 'Gold-Stein' and 'Levy' functions. The Forrester function is 1-dimensional, whereas the rest are defined on domains of dimension 2. The explicit formulation of these objectives and the domains in which they are optimized are available as part of standard optimization benchmarks³. The PBO framework is applicable in the continuous setting. In this section, however, the search for the optimum of the objectives is performed on a grid (of size 33 per dimension in all cases), which has practical advantages: the integral in eq. (5) can easily be treated as a sum and, more importantly, we can compare PBO with bandit methods that are only defined on discrete domains. Each comparison starts with 5 initial (randomly selected) duels and a total budget of 200 duels is run, after which the best location of the optimum should be reported. Further, each algorithm runs 20 times (trials) with different initial duels (the same for all methods)⁴ ⁵. We report the average performance across all trials, which is defined as the value of g (known in all cases) evaluated at the current Condorcet winner x_c, considered to be the best by each algorithm at each one of the 200 steps taken until the end of the budget. Note that each algorithm chooses x_c differently, which leads to different performance at step 0. Therefore, we present plots of #iterations versus g(x_c).

³ https://www.sfu.ca/ssurjano/optimisation.html
⁴ RANDOM runs for 100 trials to give a reliable curve.
⁵ PBO-CEI only runs 5 trials for Forrester as it is very slow.


Figure 6. Comparison of the bandit Sparring algorithm and PBO-DTS on the Six-Hump Camel for a horizon in which the bandit method is run for 4000 iterations. The horizontal line represents the solution proposed by PBO-DTS after 200 duels. As the plot illustrates, modeling correlations across the 900 arms (points in 2D) with a GP drastically improves the speed at which the optimum of the function is found. Note that the bandit method needs to visit each arm at least once before starting to obtain any improvement, while PBO-DTS is able to make a good (initial) guess with the first 5 random duels used in the experiment. Similar results are obtained for the rest of the functions.


We compare six methods. Firstly, the three variants within the PBO framework: PBO with pure exploration (PBO-PE, see Section 3.1), PBO with the Copeland Expected Improvement (PBO-CEI, see Section 3.2) and PBO with dueling-Thompson sampling (PBO-DTS, see Section 3.3). We also compare against a random policy where duels are drawn from a uniform distribution (RANDOM)⁶ and against the interactive Bayesian optimization (IBO) method of Brochu (2010). IBO selects duels by using an extension of Expected Improvement on a GP model that encodes the information of the preferential returns in the likelihood. Finally, we compared against all three cardinal bandit algorithms proposed by Ailon et al. (2014), namely Doubler, MultiSBM and Sparring. Ailon et al. (2014) observe that Sparring has the best performance, and it also outperforms the other two bandit-based algorithms in our experiments. Therefore, we only report the performance of Sparring to keep the plots clean. In a nutshell, Sparring considers two bandit players (agents), one for each element of the duel, which use the Upper Confidence Bound criterion and for which the input grid acts as the set of arms. The decision of which pair of arms to query is made according to the strategies and beliefs of each agent. In this case, correlations in f are not captured.

⁶ x_c is chosen as the location that wins most frequently.

Figure 5 shows the performance of the compared methods, which is consistent across the four plots: IBO shows poor performance, due to the combination of the greedy nature of its acquisition and the poor calibration of the model. The RANDOM policy converges to a sub-optimal result, and the PBO approaches tend to be the superior ones. In particular, we observe that PBO-DTS is consistently the best policy, and it is able to correctly balance exploration and exploitation in the duels space. Contrarily, PBO-CEI, which is only used in the first experiment due to its excessive computational overhead, tends to over-exploit. PBO-PE obtains reasonable results but tends to work worse in larger dimensions, where it is harder to cover the space.

Regarding the bandit methods, they need a much larger number of evaluations to converge compared to methods that model correlations across the arms. They are also heavily affected by an increase of the dimension (number of arms). The results of the Sparring method are shown for the Forrester function but are omitted in the rest of the plots (the number of evaluations used is smaller than the number of arms, and therefore no real learning can happen within the budget). However, in Figure 6 we show the comparison between Sparring and PBO-DTS for a horizon in which the bandit method has almost converged. The gain obtained by modeling correlations among the duels is evident.

5. Conclusions

We have explored a new framework, PBO, for optimizing black-box functions in which only preferential returns are available. The fundamental idea is to model comparisons of pairs of points with a Gaussian process, which leads to the definition of new policies for augmenting the available dataset. We have proposed three acquisitions for duels, PE, CEI and DTS, and explored their connections with existing policies in standard BO. Via simulation, we have demonstrated the superior performance of DTS, both because it finds a good balance between exploration and exploitation in the duels space and because it is computationally tractable. In comparison with other alternatives outside the PBO framework, such as IBO or other bandit methods, DTS shows state-of-the-art performance. There exist several future extensions of our current approach, like tackling the existing limitation on the dimension of the input space, which is doubled with respect to the original dimensionality of the problem. Further theoretical analysis will also be carried out on the proposed acquisitions.

Acknowledgements

The authors would like to thank Mike Croucher for the inspirational ideas that motivated the development of this work during the Gaussian process summer school held in Sheffield in September 2015.


References

Nir Ailon, Zohar Shay Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 856–864, 2014.

Salomon Bochner. Lectures on Fourier Integrals (AM-42), with a supplement on Monotonic Functions, Stieltjes Integrals, and Harmonic Analysis. Translated by Morris Tenenbaum and Harry Pollard. Princeton University Press, 1959. ISBN 9780691079943.

François-Xavier Briol, Chris J. Oates, Mark Girolami, and Michael A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pages 1162–1170, Cambridge, MA, USA, 2015. MIT Press.

Eric Brochu. Interactive Bayesian Optimization: Learning Parameters for Graphics and Animation. PhD thesis, University of British Columbia, Vancouver, Canada, December 2010.

Peter Brusilovsky, Alfred Kobsa, and Wolfgang Nejdl, editors. The Adaptive Web: Methods and Strategies of Web Personalization. Springer-Verlag, Berlin, Heidelberg, 2007. ISBN 978-3-540-72078-2.

Wei Chu and Zoubin Ghahramani. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 137–144, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5.

Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 563–587, 2015.

Javier González, Zhenwen Dai, Philipp Hennig, and Neil D. Lawrence. Batch Bayesian optimization via local penalization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), volume 51 of JMLR Workshop and Conference Proceedings, pages 648–657, 2016a.

Javier González, Michael A. Osborne, and Neil D. Lawrence. GLASSES: Relieving the myopia of Bayesian optimisation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pages 790–799, 2016b.

Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13, 2012.

Ralf Herbrich, Tom Minka, and Thore Graepel. TrueSkill™: A Bayesian skill rating system. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 569–576. MIT Press, 2007.

José Miguel Hernández-Lobato, Matthew W. Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 918–926. Curran Associates, Inc., 2014.

Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. CoRR, abs/1112.5745, 2011.

Kevin G. Jamieson, Sumeet Katariya, Atul Deshpande, and Robert D. Nowak. Sparse dueling bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015.

Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.

Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–291, 1979.

Jonas Mockus. On Bayesian methods for seeking the extremum and their application. In IFIP Congress, pages 195–200, 1977.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.

Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 604–612, 2015.


Huasen Wu, Xin Liu, and R. Srikant. Double Thompson sampling for dueling bandits. CoRR, abs/1604.07101, 2016.

Yisong Yue and Thorsten Joachims. Beat the mean bandit. In ICML, pages 241–248, 2011.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. JCSS Special Issue: Cloud Computing 2011.

Masrour Zoghi, Shimon Whiteson, Rémi Munos, and Maarten de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In ICML 2014: Proceedings of the Thirty-First International Conference on Machine Learning, pages 10–18, June 2014.

Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 307–315. Curran Associates, Inc., 2015a.

Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 307–315, 2015b.