Dynamic Learning Gaussian Process Bandits
Diogo Dubart Norinho
University College London, Computer Science Department
December 10, 2012
Diogo Dubart Norinho (UCL) December 10, 2012 1 / 44
Outline
1 Introduction
2 Gaussian Process (GP)
3 Bandits
4 DLGP
5 DLGP vs. GP bandits
6 Conclusion
Introduction
Gaussian Process (GP)
What is a Gaussian Process? Intuitively...

A GP is a generalisation of a multivariate Gaussian distribution.

So instead of being a distribution over vectors, it is a distribution over functions.

Conceptually, GPs are based on the naive, yet effective, idea that a function f(x) can be regarded as an infinite vector whose values correspond to f(x) evaluated at every possible input x.
What is a Gaussian Process? Intuitively...

A GP is a random function f : X → R, where X is a non-empty index set, such that for any finite set of input points x_1, ..., x_n,

$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix},\; \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \right),$$

with the parameters being the mean function m(x) and the covariance kernel k(x, x′), for all x, x′ ∈ X. This is also known as the consistency property of GPs.
More formally...

Definition (Gaussian Process)
A GP {f(x), x ∈ X} is a family of random variables f(x), all defined on the same probability space. In addition, for any finite subset F ⊂ X, with F := {x_{π_1}, ..., x_{π_n}}, the random vector f := [f(x_{π_1}), ..., f(x_{π_n})]ᵀ has a (possibly degenerate) Gaussian distribution.

The GP is denoted as
$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big).$$
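The finite-dimensional view above can be made concrete by drawing one sample path of a zero-mean GP at a finite grid of inputs. This is a minimal NumPy sketch, assuming a squared-exponential kernel; the helper names (`rbf_kernel`) are ours, not from the slides.

```python
import numpy as np

def rbf_kernel(xs, ys, sigma0=1.0, length=1.0):
    """Squared-exponential kernel k(x, x') = sigma0^2 exp(-(x - x')^2 / (2 l^2))."""
    d = xs[:, None] - ys[None, :]
    return sigma0**2 * np.exp(-0.5 * (d / length)**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)                      # a finite set of input points
K = rbf_kernel(x, x)                               # covariance matrix k(x_i, x_j)
f = rng.multivariate_normal(np.zeros(len(x)), K)   # one draw: a "function" evaluated at x
print(f.shape)
```

Each draw is one plausible function under the prior; drawing several and plotting them is the usual way to visualise what m(x) and k(x, x′) encode.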
Example of a kernel: the ARD kernel
The Automatic Relevance Determination kernel is defined as:

$$k_{\mathrm{ARD}}(x, x') := \sigma_0^2 \exp\left(-\tfrac{1}{2}(x - x')^\top M (x - x')\right), \qquad (1)$$

where M is a positive semidefinite matrix. If M is diagonal, its elements are known as length scales.
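The ARD kernel of eq. (1) is a one-liner in NumPy. A minimal sketch, with a diagonal M as in the figure on the next slide; the function name `k_ard` is ours.

```python
import numpy as np

def k_ard(x, xp, M, sigma0=1.0):
    """ARD kernel (1): sigma0^2 * exp(-0.5 (x - x')^T M (x - x')), M PSD."""
    d = np.asarray(x, float) - np.asarray(xp, float)
    return sigma0**2 * np.exp(-0.5 * d @ M @ d)

M = np.diag([0.4, 1.0])   # diagonal M: one length-scale element per input dimension
print(k_ard([0.0, 0.0], [0.0, 0.0], M))  # maximal at zero distance
print(k_ard([0.0, 0.0], [1.0, 1.0], M))  # decays with distance, faster along dimension y
```

With a diagonal M, dimensions with larger entries make the kernel decay faster, so they matter more to the fit — hence "automatic relevance determination".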
ARD figure

[Figure: surface drawn with the ARD kernel, with length-scale element M_{1,1} = 0.4 along dimension x and M_{2,2} = 1 along dimension y.]
Model Assumption
We assume that each observation f(x_i) is corrupted by Gaussian noise ε_i ∼ N(0, σ²):

$$y_i = f(x_i) + \varepsilon_i. \qquad (2)$$
Predictive distribution
Consider that we have n input values drawn from a GP, gathered in a matrix X_* = (x_1^*, ..., x_n^*)ᵀ, with * indicating that we have not observed f(x_i^*) (with or without measurement noise). We compute the probability distribution of the unseen observations given the observed data (X, y), in other words the predictive probability:

$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}(\bar{\mathbf{f}}_*, \Sigma_*), \text{ where} \qquad (3)$$

$$\bar{\mathbf{f}}_* := \mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,\mathbf{y}, \qquad (4)$$

$$\Sigma_* := K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, X_*). \qquad (5)$$
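Equations (4) and (5) translate directly into a few lines of linear algebra. A minimal sketch assuming a zero-mean GP and an RBF kernel; `gp_predict` and the toy data are ours.

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, sigma_n):
    """Predictive mean (4) and covariance (5) of a zero-mean GP."""
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))  # K(X, X) + sigma_n^2 I
    Ks = kernel(Xs, X)                              # K(X*, X)
    Kss = kernel(Xs, Xs)                            # K(X*, X*)
    mean = Ks @ np.linalg.solve(K, y)               # eq. (4)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)       # eq. (5)
    return mean, cov

def rbf(a, b):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * d**2)

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
Xs = np.array([0.5, 1.5])
mean, cov = gp_predict(X, y, Xs, rbf, sigma_n=0.1)
print(mean)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is the standard numerically stable choice here.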
Hyperparameter Learning: maximising the evidence
The evidence is the probability of observing the data under a model with parameters θ. Since y = f + ε, with f ∼ N(0, K) and ε ∼ N(0, σ²I) independent,

$$P(\mathbf{y} \mid X, \theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \Sigma), \quad \text{with } \Sigma := K(X, X) + \sigma^2 I.$$

We aim at maximising the following:

$$\mathcal{L} := \log P(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}\mathbf{y}^\top \Sigma^{-1}\mathbf{y} - \tfrac{n}{2}\log(2\pi).$$

Here, we will do that by using the conjugate gradient algorithm.
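The log evidence above is cheap to evaluate once Σ is formed. A minimal sketch; for brevity we maximise over a small grid of length scales rather than running conjugate gradient as the slides do, and the names (`log_evidence`, `rbf`) are ours.

```python
import numpy as np

def log_evidence(X, y, kernel, sigma_n):
    """L = -(1/2) log|Sigma| - (1/2) y^T Sigma^{-1} y - (n/2) log(2 pi)."""
    n = len(X)
    Sigma = kernel(X, X) + sigma_n**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Sigma)            # stable log-determinant
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(Sigma, y) - 0.5 * n * np.log(2 * np.pi)

def rbf(a, b, length=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length)**2)

X = np.linspace(0.0, 4.0, 20)
y = np.sin(X)
# Grid search over the length scale (the slides use conjugate gradient instead).
lengths = [0.1, 0.5, 1.0, 2.0]
best = max(lengths, key=lambda l: log_evidence(X, y, lambda a, b: rbf(a, b, l), 0.1))
print(best)
```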
Hyperparameter Learning: leave-one-out cross-validation

$$\mathcal{L}_i := \log P(y_i^* \mid x_i^*, \mathbf{y}_{-i}, X_{-i}, \theta) = -\tfrac{1}{2}\log\sigma_i^2 - \frac{(y_i^* - \mu_i)^2}{2\sigma_i^2} - \tfrac{1}{2}\log(2\pi)$$

The hyperparameters are found by maximising the following:

$$\mathcal{L}_{\mathrm{LOO}}(\mathbf{y}, X, \theta) := \sum_{i=1}^{n} \mathcal{L}_i. \qquad (6)$$
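For GPs, the LOO objective (6) does not require n separate refits: closed-form expressions for μ_i and σ_i² follow from a single inverse of Σ = K + σ²I (as derived in Rasmussen & Williams, ch. 5). A sketch under that assumption; the function name is ours.

```python
import numpy as np

def loo_log_pseudo_likelihood(K, y, sigma_n):
    """L_LOO of eq. (6) in closed form:
    sigma_i^2 = 1 / [Sigma^{-1}]_ii and mu_i = y_i - [Sigma^{-1} y]_i * sigma_i^2."""
    Sigma = K + sigma_n**2 * np.eye(len(y))
    Sinv = np.linalg.inv(Sigma)
    alpha = Sinv @ y
    var_i = 1.0 / np.diag(Sinv)                 # per-point LOO predictive variance
    mu_i = y - alpha * var_i                    # per-point LOO predictive mean
    Li = -0.5 * np.log(var_i) - (y - mu_i)**2 / (2 * var_i) - 0.5 * np.log(2 * np.pi)
    return Li.sum()

X = np.linspace(0.0, 3.0, 10)
y = np.cos(X)
d = X[:, None] - X[None, :]
K = np.exp(-0.5 * d**2)
print(loo_log_pseudo_likelihood(K, y, 0.1))
```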
Bandits
Visual representation of a multi-armed bandit
The Bandit framework

It is a way of trading off exploitation and exploration.

Environment: N slot machines, each having an unknown but fixed mean μ_i and variance σ_i.
Goal: get rich.
1st approach: sample each machine uniformly until you find the slot i* := argmax_{i ∈ {1,...,N}} μ_i. Probably not the best approach.
2nd approach: play more at the machines where you have earned the most so far, while trying the others once in a while, just to be sure you did not underestimate them.
Formally our goal
Let the regret be $R_T := T\mu_{i^*} - \sum_{t=1}^{T} r_t$. Find a strategy such that, on average, the regret tends to zero:

$$\lim_{T\to\infty} \frac{R_T}{T} = 0. \qquad (7)$$
GP bandits and GP-UCB
GP bandits are a framework relying on the assumption that the arms of the bandit are dependent; hence the reward obtained by pulling one arm delivers information about neighbouring arms.

Gaussian Process Upper Confidence Bound (GP-UCB) is an algorithm that applies the principle of optimism in the face of uncertainty to trade off exploration and exploitation.
GP-UCB
The surrogate function at evaluation t:

$$x_t = \operatorname*{argmax}_{x \in D}\left\{\mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)\right\}$$

where:
• D is the objective function's input space,
• A_t := {x_1, ..., x_{t−1}} is the set of pulled arms,
• μ_{t−1} is the predictive mean based on A_t,
• σ_{t−1} is the predictive standard deviation based on A_t,
• $\beta_t = 2\log\left(\frac{|D|\,t^2\pi^2}{6\delta}\right)$, with δ ∈ (0, 1).
GP-UCB algorithm
input: input space D; GP prior μ_0 = 0, σ_0, k_0(·, ·), θ_0
for t = 1, 2, ... do
    Generate a set of m arms U := {u_1, ..., u_m} ⊂ D;
    Choose x_t ← argmax_{u ∈ U} {μ_{t−1}(u) + √β_t σ_{t−1}(u)};
    Sample y_t ← f(x_t) + ε_t;
    μ_t ← GP posterior mean with θ_0 and inputs X_t using (4);
    σ_t² ← GP posterior variance with θ_0 and inputs X_t using (5);
end
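The loop above can be sketched end to end on a 1D toy objective. This is a minimal NumPy version under our own assumptions (RBF kernel with fixed length scale 0.5, |D| taken as the number of candidate arms m, noise σ = 0.1); none of these choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / 0.5)**2)

def gp_posterior(X, y, U, sigma_n=0.1):
    # Predictive mean (4) and variance (5), zero prior mean.
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    Ks = rbf(U, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = np.clip(1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1), 0.0, None)
    return mu, var

f = lambda x: -(x - 2.0)**2          # toy objective with its maximum at x = 2
delta, m = 0.1, 200
X = np.array([0.0])
y = np.array([f(0.0) + 0.1 * rng.standard_normal()])
for t in range(1, 51):
    U = rng.uniform(0.0, 5.0, m)                       # generate m candidate arms
    beta = 2 * np.log(m * t**2 * np.pi**2 / (6 * delta))
    mu, var = gp_posterior(X, y, U)
    xt = U[np.argmax(mu + np.sqrt(beta * var))]        # UCB rule
    X = np.append(X, xt)
    y = np.append(y, f(xt) + 0.1 * rng.standard_normal())
print(X[-1])  # last arm pulled
```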
DLGP
Why learn hyperparameters?
DLGP algorithm

input: input space D; GP prior μ_0 = 0, σ_0, k(·, ·), θ_0 and φ
θ* ← θ_0 and A_0 ← ∅;
for t = 1, 2, ... do
    if t ∈ φ then
        θ_t ← h(θ_0, A_{t−1}, ...)  (select new hyperparameters);
        θ* ← θ_t;
    end
    Generate a set of m arms U := {u_1, ..., u_m} ⊂ D;
    Choose x_t ← argmax_{u ∈ U} {μ_{t−1}(u) + √β_t σ_{t−1}(u)};
    A_t ← A_{t−1} ∪ {x_t}  (add the last pulled arm to the set of pulled arms);
    Sample y_t ← f(x_t) + ε_t;
    μ_t ← GP posterior mean with θ* at inputs X_t using (4);
    σ_t² ← GP posterior variance with θ* at inputs X_t using (5);
end
DLGP algorithm: notation
φ is a set of increasing integers (ideally φ_1 > #(θ)).

θ is the set of all hyperparameters.

A_t := {x_1, ..., x_t} is the set of all pulled arms up to t.
What about overfitting?
When using flexible methods such as GPs, there is always a risk of overfitting the data when computing θ. In that case the model explains the observed data very well but has low predictive power.

To avoid that we can use cross-validation. But cross-validation assumes i.i.d. data, which is not the case here.

So we need to find a method h(θ_0, A_{t−1}, ...) that somehow overcomes this by removing the strong dependency on the dataset.
EM algorithm
E step (at recursion t):

$$r_{ik}^{(t)} := \frac{p_k^{(t-1)}\,\mathcal{N}(x_i \mid \mu_k^{(t-1)}, \Sigma_k^{(t-1)})}{\sum_{k'} p_{k'}^{(t-1)}\,\mathcal{N}(x_i \mid \mu_{k'}^{(t-1)}, \Sigma_{k'}^{(t-1)})},$$

M step (at recursion t):

$$\mu_k^{(t)} = \frac{\sum_{i=1}^{n} r_{ik}^{(t)}\, x_i}{\sum_{i=1}^{n} r_{ik}^{(t)}}, \qquad
\Sigma_k^{(t)} = \frac{\sum_{i=1}^{n} r_{ik}^{(t)}\,(x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^{n} r_{ik}^{(t)}}, \qquad
p_k^{(t)} = \frac{1}{n}\sum_{i=1}^{n} r_{ik}^{(t)}.$$
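One E step and one M step for a Gaussian mixture map directly onto the formulas above. A minimal NumPy sketch with a small diagonal regulariser added to each Σ_k for numerical stability (our addition, not in the slides); all names are ours.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) evaluated row-wise on X."""
    d = len(mu)
    diff = X - mu
    expo = -0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    return np.exp(expo) / np.sqrt((2 * np.pi)**d * np.linalg.det(Sigma))

def em_step(X, p, mus, Sigmas):
    """One E step (responsibilities r_ik) and one M step (mu_k, Sigma_k, p_k)."""
    K, (n, d) = len(p), X.shape
    dens = np.stack([p[k] * gauss_pdf(X, mus[k], Sigmas[k]) for k in range(K)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)          # E step
    nk = r.sum(axis=0)
    mus = [(r[:, k] @ X) / nk[k] for k in range(K)]     # M step: means
    Sigmas = [((r[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / nk[k] + 1e-6 * np.eye(d)
              for k in range(K)]                        # M step: covariances
    p = nk / n                                          # M step: mixture weights
    return p, mus, Sigmas, r

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
p, mus, Sigmas = np.array([0.5, 0.5]), [X[0], X[-1]], [np.eye(2), np.eye(2)]
for _ in range(20):
    p, mus, Sigmas, r = em_step(X, p, mus, Sigmas)
print(np.round(p, 2))
```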
Silhouette coefficient
Let us consider the K-means algorithm, which determines cluster means μ_1, ..., μ_K such that:

$$\min_{\{\mu_1,\ldots,\mu_K\}} \sum_{k=1}^{K}\sum_{i \in C_k} \|x_i - \mu_k\|^2.$$

We find the number of clusters K by using the silhouette coefficient s(i) obtained for each observation i in the sample.
Let $a_i := \frac{1}{|C_i|}\sum_{j \in C_i}\|x_i - x_j\|$ and $b_i := \min_{C \neq C_i}\frac{1}{|C|}\sum_{j \in C}\|x_i - x_j\|$. Then

$$s(i) := \frac{b_i - a_i}{\max(a_i, b_i)},$$

and by construction s(i) ∈ [−1, 1]. In the case where the cluster C_i only contains i, we set s(i) = 0.
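The silhouette definitions above can be computed directly from the pairwise distance matrix. A minimal sketch following the slide's definitions (a_i averages over the whole cluster, self-distance included; singletons get s(i) = 0); the function name is ours.

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette coefficient s(i) for every observation."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:
            continue                                      # singleton cluster: s(i) = 0
        a = D[i, own].mean()                              # a_i, self-distance included
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette(X, labels)
print(s.mean())  # close to 1 for these two well-separated clusters
```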
Algorithm for h(θ0,At−1, ...) (1)
input: A_t, θ_0, w
for k = 2, 3, ... do
    run k-means;
    s̄c(k) ← average silhouette s(i);
    if max{s̄c(k), ..., s̄c(k − w + 1)} < s̄c(k − w) then
        K ← k − w;
        leave loop;
    end
end
Algorithm for h(θ0,At−1, ...) (2)
t ← 0;
while EM not converged do
    t ← t + 1;
    r_{ik}^{(t)} ← E step (for K clusters);
    (μ_k^{(t)}, Σ_k^{(t)}, p_k^{(t)}) ← M step (for K clusters);
end
for i = 1, 2, ..., n do
    P(x_i) = Σ_{k=1}^{K} p_k^{(t)} N(x_i | μ_k^{(t)}, Σ_k^{(t)});
    γ ← U(0, 1);
    f(x_i) ← γ P(x_i);
end
Δ ← only keep the n − h lowest values in {f(x_1), ..., f(x_n)};
perform LOO on Δ and use conjugate gradient to maximise over θ;
DLGP vs. GP bandits
Metrics: Euclidean distance
Euclidean distance of the j-th closest element to the optimum:

$$D_t^{(j)} := \min\left\{\|x_i - x_{\mathrm{opt}}\| : x_i \in A_t^{(j)}\right\}$$

where A_t^{(j)} corresponds to A_t once the j − 1 closest elements to the optimum have been removed. Then D̄_t^{(j)} is the average of D_t^{(j)} over all generated sample paths under the same distribution.
Metrics: Average regret
The average over all sample paths of the individual average regret over time:

$$\bar{R}_t := \frac{1}{mt}\sum_{\ell=1}^{m} R_{t,\ell}, \quad \text{where} \quad R_{t,\ell} := t\mu_{i^*} - \sum_{j=1}^{t} r_j^{\ell},$$

and the additional ℓ in the definition of regret indicates which of the m sample paths it corresponds to. Note that again we average R̄_t over different runs to gain additional robustness.
Methodology
At every step t of every experiment, a set of arms to pull, U_t, is randomly generated and then U_t is presented to both algorithms.

Once they have chosen an arm u_t ∈ U_t, the same random noise value ε_t blurs the observation.

For the two-dimensional function simulations, Gaussian noise N(0, 0.3) was used.

The parameters of DLGP were set as follows: φ_1 = 20, then increasing in steps of 50, i.e. φ_{i+1} = φ_i + 50.
Rastrigin 2D

[Figure: surface plot of f(x, y) and contour plot of the Rastrigin function over [−5, 5]².]
Rastrigin 2D

[Figure: four panels — evaluated points by GP-UCB; evaluated points by DLGP; average convergence speed (distance to optimum vs. number of function evaluations); average regret vs. number of function evaluations — comparing GP-UCB and DLGP.]
Griewank 2D

[Figure: surface plot of f(x, y) and contour plot of the Griewank function over [−100, 100]².]
Griewank 2D

[Figure: four panels — evaluated points by GP-UCB; evaluated points by DLGP; average convergence speed (distance to optimum vs. number of function evaluations); average regret vs. number of function evaluations — comparing GP-UCB and DLGP.]
Ackley’s 2D

[Figure: surface plot of f(x, y) and contour plot of Ackley’s function over [−100, 100]².]
Ackley’s 2D

[Figure: four panels — evaluated points by GP-UCB; evaluated points by DLGP; average convergence speed (distance to optimum vs. number of function evaluations); average regret vs. number of function evaluations — comparing GP-UCB and DLGP.]
Comparison of D_{500}^{(5)} for DLGP and GP-UCB on 2D functions

          DLGP                                            GP-UCB
      Best     Median   Worst    Mean     Std        Best    Median   Worst   Mean    Std
F1    0.0724   0.157    0.256    0.144    0.0757     0.429   0.454    1.27    0.607   0.368
F2    0.725    0.881    1.38     0.954    0.262      6.82    6.96     8.08    7.16    0.515
F3    0.291    0.529    0.773    0.521    0.176      8.74    10.2     10.4    9.75    0.821
Conclusion
Take Home Message
Bandits allow us to optimise easily without highly restrictive assumptions.
GP bandits exploit dependencies in the observations.
DLGP converges up to 20 times faster to the optimum.
Exciting opportunities to further improve DLGP.
Use other surrogate optimisation algorithms.
Learn kernels (not only hyperparameters).
Use DLGP instead of GP bandits.
Thank you.
ANNEXES
The functions in our test suite include Rastrigin (F1), Griewank (F2) and Ackley’s (F3). The search domain for F1 is [−5, 5]^D, while it is [−100, 100]^D for F2 and F3. All of them are multimodal and only F2 is non-separable.

$$F_1(x) = 10D + \sum_{i=1}^{D}\left[x_i^2 - 10\cos(2\pi x_i)\right]$$

$$F_2(x) = \sum_{i=1}^{D}\frac{x_i^2}{4000} - \prod_{i=1}^{D}\cos\frac{x_i}{\sqrt{i}} + 1$$

$$F_3(z) = -20\exp\left(-0.2\sqrt{\frac{1}{D}\sum_{i=1}^{D} z_i^2}\right) - \exp\left(\frac{1}{D}\sum_{i=1}^{D}\cos(2\pi z_i)\right) + 20 + e$$
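These benchmarks (written here in their standard forms, with the −10·cos term of Rastrigin restored) are straightforward to implement; all three have their global minimum 0 at the origin. A minimal NumPy sketch with our own function names:

```python
import numpy as np

def rastrigin(x):                      # F1
    x = np.asarray(x, float)
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def griewank(x):                       # F2
    x = np.asarray(x, float)
    i = np.arange(1, len(x) + 1)
    return np.sum(x**2 / 4000) - np.prod(np.cos(x / np.sqrt(i))) + 1

def ackley(z):                         # F3
    z = np.asarray(z, float)
    d = len(z)
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum(z**2) / d))
            - np.exp(np.sum(np.cos(2 * np.pi * z)) / d) + 20 + np.e)

print(rastrigin([0, 0]), griewank([0, 0]), ackley([0, 0]))  # all ≈ 0 at the optimum
```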