Dynamic Learning Gaussian Process Bandits
Diogo Dubart Norinho
University College London, Computer Science Department
December 10, 2012
Diogo Dubart Norinho (UCL) December 10, 2012 1 / 44
Outline
1 Introduction
2 Gaussian Process (GP)
3 Bandits
4 DLGP
5 DLGP vs. GP bandits
6 Conclusion
Introduction
Gaussian Process (GP)
What is a Gaussian Process? Intuitively...

A GP is a generalisation of a multivariate Gaussian distribution.

So instead of being a distribution over vectors, it is a distribution over functions.

Conceptually, GPs are based on the naive, yet effective, idea that a function f(x) can be regarded as an infinite vector whose values correspond to f(x) evaluated at every possible input x.
What is a Gaussian Process? Intuitively...

A GP is a random function f : X → R, where X is a non-empty index set, such that for any finite set of input points x_1, ..., x_n,

$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix},\; \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \right),$$

with the parameters being the mean function m(x) and the covariance kernel k(x, x′), for all x, x′ ∈ X. This is also known as the consistency property of GPs.
More formally...

Definition (Gaussian Process)
A GP {f(x), x ∈ X} is a family of random variables f(x), all defined on the same probability space. In addition, for any finite subset F ⊂ X, with F := {x_{π_1}, ..., x_{π_n}}, the random vector f := [f(x_{π_1}), ..., f(x_{π_n})]ᵀ has a (possibly degenerate) Gaussian distribution.

The GP is denoted as
$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big).$$
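The finite-dimensional view above can be made concrete by drawing one sample path of a zero-mean GP at a finite grid of inputs. This is a minimal NumPy sketch, assuming a squared-exponential kernel; the helper names (`rbf_kernel`) are ours, not from the slides.

```python
import numpy as np

def rbf_kernel(xs, ys, sigma0=1.0, length=1.0):
    """Squared-exponential kernel k(x, x') = sigma0^2 exp(-(x - x')^2 / (2 l^2))."""
    d = xs[:, None] - ys[None, :]
    return sigma0**2 * np.exp(-0.5 * (d / length)**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)                      # a finite set of input points
K = rbf_kernel(x, x)                               # covariance matrix k(x_i, x_j)
f = rng.multivariate_normal(np.zeros(len(x)), K)   # one draw: a "function" evaluated at x
print(f.shape)
```

Each draw is one plausible function under the prior; drawing several and plotting them is the usual way to visualise what m(x) and k(x, x′) encode.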
Example of a kernel: the ARD kernel
The Automatic Relevance Determination kernel is defined as:

$$k_{\mathrm{ARD}}(x, x') := \sigma_0^2 \exp\left(-\tfrac{1}{2}(x - x')^\top M (x - x')\right), \qquad (1)$$

where M is a positive semidefinite matrix. If M is diagonal, its elements are known as length scales.
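The ARD kernel of eq. (1) is a one-liner in NumPy. A minimal sketch, with a diagonal M as in the figure on the next slide; the function name `k_ard` is ours.

```python
import numpy as np

def k_ard(x, xp, M, sigma0=1.0):
    """ARD kernel (1): sigma0^2 * exp(-0.5 (x - x')^T M (x - x')), M PSD."""
    d = np.asarray(x, float) - np.asarray(xp, float)
    return sigma0**2 * np.exp(-0.5 * d @ M @ d)

M = np.diag([0.4, 1.0])   # diagonal M: one length-scale element per input dimension
print(k_ard([0.0, 0.0], [0.0, 0.0], M))  # maximal at zero distance
print(k_ard([0.0, 0.0], [1.0, 1.0], M))  # decays with distance, faster along dimension y
```

With a diagonal M, dimensions with larger entries make the kernel decay faster, so they matter more to the fit — hence "automatic relevance determination".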
ARD figure

[Figure: surface drawn with the ARD kernel, with length-scale element M_{1,1} = 0.4 along dimension x and M_{2,2} = 1 along dimension y.]
Model Assumption
We assume that each observation f(x_i) is corrupted by Gaussian noise ε_i ∼ N(0, σ²):

$$y_i = f(x_i) + \varepsilon_i. \qquad (2)$$
Predictive distribution
Consider that we have n input values drawn from a GP, gathered in a matrix X_* = (x_1^*, ..., x_n^*)ᵀ, with * indicating that we have not observed f(x_i^*) (with or without measurement noise). We compute the probability distribution of the unseen observations given the observed data (X, y), in other words the predictive probability:

$$\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}(\bar{\mathbf{f}}_*, \Sigma_*), \text{ where} \qquad (3)$$

$$\bar{\mathbf{f}}_* := \mathbb{E}[\mathbf{f}_* \mid X, \mathbf{y}, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,\mathbf{y}, \qquad (4)$$

$$\Sigma_* := K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\,K(X, X_*). \qquad (5)$$
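Equations (4) and (5) translate directly into a few lines of linear algebra. A minimal sketch assuming a zero-mean GP and an RBF kernel; `gp_predict` and the toy data are ours.

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, sigma_n):
    """Predictive mean (4) and covariance (5) of a zero-mean GP."""
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))  # K(X, X) + sigma_n^2 I
    Ks = kernel(Xs, X)                              # K(X*, X)
    Kss = kernel(Xs, Xs)                            # K(X*, X*)
    mean = Ks @ np.linalg.solve(K, y)               # eq. (4)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)       # eq. (5)
    return mean, cov

def rbf(a, b):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * d**2)

X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
Xs = np.array([0.5, 1.5])
mean, cov = gp_predict(X, y, Xs, rbf, sigma_n=0.1)
print(mean)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is the standard numerically stable choice here.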
Hyperparameter Learning: maximising the evidence
The evidence is the probability of observing the data under a model with parameters θ. Since y = f + ε, with f ∼ N(0, K) and ε ∼ N(0, σ²I) independent,

$$P(\mathbf{y} \mid X, \theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \Sigma), \quad \text{with } \Sigma := K(X, X) + \sigma^2 I.$$

We aim at maximising the following:

$$\mathcal{L} := \log P(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2}\log|\Sigma| - \tfrac{1}{2}\mathbf{y}^\top \Sigma^{-1}\mathbf{y} - \tfrac{n}{2}\log(2\pi).$$

Here, we will do that by using the conjugate gradient algorithm.
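The log evidence above is cheap to evaluate once Σ is formed. A minimal sketch; for brevity we maximise over a small grid of length scales rather than running conjugate gradient as the slides do, and the names (`log_evidence`, `rbf`) are ours.

```python
import numpy as np

def log_evidence(X, y, kernel, sigma_n):
    """L = -(1/2) log|Sigma| - (1/2) y^T Sigma^{-1} y - (n/2) log(2 pi)."""
    n = len(X)
    Sigma = kernel(X, X) + sigma_n**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Sigma)            # stable log-determinant
    return -0.5 * logdet - 0.5 * y @ np.linalg.solve(Sigma, y) - 0.5 * n * np.log(2 * np.pi)

def rbf(a, b, length=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length)**2)

X = np.linspace(0.0, 4.0, 20)
y = np.sin(X)
# Grid search over the length scale (the slides use conjugate gradient instead).
lengths = [0.1, 0.5, 1.0, 2.0]
best = max(lengths, key=lambda l: log_evidence(X, y, lambda a, b: rbf(a, b, l), 0.1))
print(best)
```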
Hyperparameter Learning: leave-one-out cross-validation

$$\mathcal{L}_i := \log P(y_i^* \mid x_i^*, \mathbf{y}_{-i}, X_{-i}, \theta) = -\tfrac{1}{2}\log\sigma_i^2 - \frac{(y_i^* - \mu_i)^2}{2\sigma_i^2} - \tfrac{1}{2}\log(2\pi)$$

The hyperparameters are found by maximising the following:

$$\mathcal{L}_{\mathrm{LOO}}(\mathbf{y}, X, \theta) := \sum_{i=1}^{n} \mathcal{L}_i. \qquad (6)$$
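For GPs, the LOO objective (6) does not require n separate refits: closed-form expressions for μ_i and σ_i² follow from a single inverse of Σ = K + σ²I (as derived in Rasmussen & Williams, ch. 5). A sketch under that assumption; the function name is ours.

```python
import numpy as np

def loo_log_pseudo_likelihood(K, y, sigma_n):
    """L_LOO of eq. (6) in closed form:
    sigma_i^2 = 1 / [Sigma^{-1}]_ii and mu_i = y_i - [Sigma^{-1} y]_i * sigma_i^2."""
    Sigma = K + sigma_n**2 * np.eye(len(y))
    Sinv = np.linalg.inv(Sigma)
    alpha = Sinv @ y
    var_i = 1.0 / np.diag(Sinv)                 # per-point LOO predictive variance
    mu_i = y - alpha * var_i                    # per-point LOO predictive mean
    Li = -0.5 * np.log(var_i) - (y - mu_i)**2 / (2 * var_i) - 0.5 * np.log(2 * np.pi)
    return Li.sum()

X = np.linspace(0.0, 3.0, 10)
y = np.cos(X)
d = X[:, None] - X[None, :]
K = np.exp(-0.5 * d**2)
print(loo_log_pseudo_likelihood(K, y, 0.1))
```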
Bandits
Visual representation of a multi-armed bandit
The Bandit framework

It is a way of trading off exploitation and exploration.

Environment: N slot machines, each having an unknown but fixed mean μ_i and variance σ_i.
Goal: get rich.
1st approach: sample each machine uniformly until you find the slot i* := argmax_{i ∈ {1,...,N}} μ_i. Probably not the best approach.
2nd approach: play more at the machines where you have earned the most so far, while trying the others once in a while, just to be sure you did not underestimate them.
Formally our goal
Let the regret be $R_T := T\mu_{i^*} - \sum_{t=1}^{T} r_t$. Find a strategy such that, on average, the regret tends to zero:

$$\lim_{T\to\infty} \frac{R_T}{T} = 0. \qquad (7)$$
GP bandits and GP-UCB
GP bandits are a framework relying on the assumption that the arms of the bandit are dependent; hence the reward obtained by pulling one arm delivers information about neighbouring arms.

Gaussian Process Upper Confidence Bound (GP-UCB) is an algorithm that applies the principle of optimism in the face of uncertainty to trade off exploration and exploitation.
GP-UCB
The surrogate function at evaluation t:

$$x_t = \operatorname*{argmax}_{x \in D}\left\{\mu_{t-1}(x) + \sqrt{\beta_t}\,\sigma_{t-1}(x)\right\}$$

where:
• D is the objective function's input space,
• A_t := {x_1, ..., x_{t−1}} is the set of pulled arms,
• μ_{t−1} is the predictive mean based on A_t,
• σ_{t−1} is the predictive standard deviation based on A_t,
• $\beta_t = 2\log\left(\frac{|D|\,t^2\pi^2}{6\delta}\right)$, with δ ∈ (0, 1).
GP-UCB algorithm
input: input space D; GP prior μ_0 = 0, σ_0, k_0(·, ·), θ_0
for t = 1, 2, ... do
    Generate a set of m arms U := {u_1, ..., u_m} ⊂ D;
    Choose x_t ← argmax_{u ∈ U} {μ_{t−1}(u) + √β_t σ_{t−1}(u)};
    Sample y_t ← f(x_t) + ε_t;
    μ_t ← GP posterior mean with θ_0 and inputs X_t using (4);
    σ_t² ← GP posterior variance with θ_0 and inputs X_t using (5);
end
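The loop above can be sketched end to end on a 1D toy objective. This is a minimal NumPy version under our own assumptions (RBF kernel with fixed length scale 0.5, |D| taken as the number of candidate arms m, noise σ = 0.1); none of these choices come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / 0.5)**2)

def gp_posterior(X, y, U, sigma_n=0.1):
    # Predictive mean (4) and variance (5), zero prior mean.
    K = rbf(X, X) + sigma_n**2 * np.eye(len(X))
    Ks = rbf(U, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = np.clip(1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1), 0.0, None)
    return mu, var

f = lambda x: -(x - 2.0)**2          # toy objective with its maximum at x = 2
delta, m = 0.1, 200
X = np.array([0.0])
y = np.array([f(0.0) + 0.1 * rng.standard_normal()])
for t in range(1, 51):
    U = rng.uniform(0.0, 5.0, m)                       # generate m candidate arms
    beta = 2 * np.log(m * t**2 * np.pi**2 / (6 * delta))
    mu, var = gp_posterior(X, y, U)
    xt = U[np.argmax(mu + np.sqrt(beta * var))]        # UCB rule
    X = np.append(X, xt)
    y = np.append(y, f(xt) + 0.1 * rng.standard_normal())
print(X[-1])  # last arm pulled
```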
DLGP
Why learn hyperparameters?
DLGP algorithm

input: input space D; GP prior μ_0 = 0, σ_0, k(·, ·), θ_0 and φ
θ* ← θ_0 and A_0 ← ∅;
for t = 1, 2, ... do
    if t ∈ φ then
        θ_t ← h(θ_0, A_{t−1}, ...)  (select new hyperparameters);
        θ* ← θ_t;
    end
    Generate a set of m arms U := {u_1, ..., u_m} ⊂ D;
    Choose x_t ← argmax_{u ∈ U} {μ_{t−1}(u) + √β_t σ_{t−1}(u)};
    A_t ← A_{t−1} ∪ {x_t}  (add the last pulled arm to the set of pulled arms);
    Sample y_t ← f(x_t) + ε_t;
    μ_t ← GP posterior mean with θ* at inputs X_t using (4);
    σ_t² ← GP posterior variance with θ* at inputs X_t using (5);
end
DLGP algorithm: notation
φ is a set of increasing integers (ideally φ_1 > #(θ)).

θ is the set of all hyperparameters.

A_t := {x_1, ..., x_t} is the set of all pulled arms up to t.
What about overfitting?
When using flexible methods such as GPs, there is always a risk of overfitting the data when computing θ. In that case the model explains the observed data very well but has low predictive power.

To avoid that we can use cross-validation. But cross-validation assumes i.i.d. data, which is not the case here.

So we need to find a method h(θ_0, A_{t−1}, ...) that somehow overcomes this by removing the strong dependency on the dataset.
EM algorithm
E step (at recursion t):

$$r_{ik}^{(t)} := \frac{p_k^{(t-1)}\,\mathcal{N}(x_i \mid \mu_k^{(t-1)}, \Sigma_k^{(t-1)})}{\sum_{k'} p_{k'}^{(t-1)}\,\mathcal{N}(x_i \mid \mu_{k'}^{(t-1)}, \Sigma_{k'}^{(t-1)})},$$

M step (at recursion t):

$$\mu_k^{(t)} = \frac{\sum_{i=1}^{n} r_{ik}^{(t)}\, x_i}{\sum_{i=1}^{n} r_{ik}^{(t)}}, \qquad
\Sigma_k^{(t)} = \frac{\sum_{i=1}^{n} r_{ik}^{(t)}\,(x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^{n} r_{ik}^{(t)}}, \qquad
p_k^{(t)} = \frac{1}{n}\sum_{i=1}^{n} r_{ik}^{(t)}.$$
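One E step and one M step for a Gaussian mixture map directly onto the formulas above. A minimal NumPy sketch with a small diagonal regulariser added to each Σ_k for numerical stability (our addition, not in the slides); all names are ours.

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma) evaluated row-wise on X."""
    d = len(mu)
    diff = X - mu
    expo = -0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    return np.exp(expo) / np.sqrt((2 * np.pi)**d * np.linalg.det(Sigma))

def em_step(X, p, mus, Sigmas):
    """One E step (responsibilities r_ik) and one M step (mu_k, Sigma_k, p_k)."""
    K, (n, d) = len(p), X.shape
    dens = np.stack([p[k] * gauss_pdf(X, mus[k], Sigmas[k]) for k in range(K)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)          # E step
    nk = r.sum(axis=0)
    mus = [(r[:, k] @ X) / nk[k] for k in range(K)]     # M step: means
    Sigmas = [((r[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / nk[k] + 1e-6 * np.eye(d)
              for k in range(K)]                        # M step: covariances
    p = nk / n                                          # M step: mixture weights
    return p, mus, Sigmas, r

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
p, mus, Sigmas = np.array([0.5, 0.5]), [X[0], X[-1]], [np.eye(2), np.eye(2)]
for _ in range(20):
    p, mus, Sigmas, r = em_step(X, p, mus, Sigmas)
print(np.round(p, 2))
```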
Silhouette coefficient
Let us consider the K-means algorithm, which determines cluster means μ_1, ..., μ_K such that:

$$\min_{\{\mu_1,\ldots,\mu_K\}} \sum_{k=1}^{K}\sum_{i \in C_k} \|x_i - \mu_k\|^2.$$

We find the number of clusters K by using the silhouette coefficient s(i) obtained for each observation i in the sample.
Let $a_i := \frac{1}{|C_i|}\sum_{j \in C_i}\|x_i - x_j\|$ and $b_i := \min_{C \neq C_i}\frac{1}{|C|}\sum_{j \in C}\|x_i - x_j\|$. Then

$$s(i) := \frac{b_i - a_i}{\max(a_i, b_i)},$$

and by construction s(i) ∈ [−1, 1]. In the case where the cluster C_i only contains i, we set s(i) = 0.
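The silhouette definitions above can be computed directly from the pairwise distance matrix. A minimal sketch following the slide's definitions (a_i averages over the whole cluster, self-distance included; singletons get s(i) = 0); the function name is ours.

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette coefficient s(i) for every observation."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() == 1:
            continue                                      # singleton cluster: s(i) = 0
        a = D[i, own].mean()                              # a_i, self-distance included
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette(X, labels)
print(s.mean())  # close to 1 for these two well-separated clusters
```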
Algorithm for h(θ0,At−1, ...) (1)
input: A_t, θ_0, w
for k = 2, 3, ... do
    run k-means;
    s̄c(k) ← average silhouette s(i);
    if max{s̄c(k), ..., s̄c(k − w + 1)} < s̄c(k − w) then
        K ← k − w;
        leave loop;
    end
end
Algorithm for h(θ0,At−1, ...) (2)
t ← 0;
while EM not converged do
    t ← t + 1;
    r_{ik}^{(t)} ← E step (for K clusters);
    (μ_k^{(t)}, Σ_k^{(t)}, p_k^{(t)}) ← M step (for K clusters);
end
for i = 1, 2, ..., n do
    P(x_i) = Σ_{k=1}^{K} p_k^{(t)} N(x_i | μ_k^{(t)}, Σ_k^{(t)});
    γ ← U(0, 1);
    f(x_i) ← γ P(x_i);
end
Δ ← only keep the n − h lowest values in {f(x_1), ..., f(x_n)};
perform LOO on Δ and use conjugate gradient to maximise over θ;
DLGP vs. GP bandits
Metrics: Euclidean distance
Euclidean distance of the j-th closest element to the optimum:

$$D_t^{(j)} := \min\left\{\|x_i - x_{\mathrm{opt}}\| : x_i \in A_t^{(j)}\right\}$$

where A_t^{(j)} corresponds to A_t once the j − 1 closest elements to the optimum have been removed. Then D̄_t^{(j)} is the average of D_t^{(j)} over all generated sample paths under the same distribution.
Metrics: Average regret
The average over all sample paths of the individual average regret over time:

$$\bar{R}_t := \frac{1}{mt}\sum_{\ell=1}^{m} R_{t,\ell}, \quad \text{where} \quad R_{t,\ell} := t\mu_{i^*} - \sum_{j=1}^{t} r_j^{\ell},$$

and the additional ℓ in the definition of regret indicates which of the m sample paths it corresponds to. Note that again we average R̄_t over different runs to gain additional robustness.
Methodology
At every step t of every experiment, a set of arms to pull, U_t, is randomly generated and then U_t is presented to both algorithms.

Once they have chosen an arm u_t ∈ U_t, the same random noise value ε_t blurs the observation.

For the two-dimensional function simulations, Gaussian noise N(0, 0.3) was used.

The parameters of DLGP were set as follows: φ_1 = 20, then increasing in steps of 50, i.e. φ_{i+1} = φ_i + 50.
Rastrigin 2D

[Figure: surface plot of f(x, y) and contour plot of the Rastrigin function over [−5, 5]².]
Rastrigin 2D

[Figure: four panels — evaluated points by GP-UCB; evaluated points by DLGP; average convergence speed (distance to optimum vs. number of function evaluations); average regret vs. number of function evaluations — comparing GP-UCB and DLGP.]
Griewank 2D

[Figure: surface plot of f(x, y) and contour plot of the Griewank function over [−100, 100]².]
Griewank 2D

[Figure: four panels — evaluated points by GP-UCB; evaluated points by DLGP; average convergence speed (distance to optimum vs. number of function evaluations); average regret vs. number of function evaluations — comparing GP-UCB and DLGP.]
Ackley’s 2D

[Figure: surface plot of f(x, y) and contour plot of Ackley’s function over [−100, 100]².]
Ackley’s 2D

[Figure: four panels — evaluated points by GP-UCB; evaluated points by DLGP; average convergence speed (distance to optimum vs. number of function evaluations); average regret vs. number of function evaluations — comparing GP-UCB and DLGP.]
Comparison of D_{500}^{(5)} for DLGP and GP-UCB on 2D functions

          DLGP                                            GP-UCB
      Best     Median   Worst    Mean     Std        Best    Median   Worst   Mean    Std
F1    0.0724   0.157    0.256    0.144    0.0757     0.429   0.454    1.27    0.607   0.368
F2    0.725    0.881    1.38     0.954    0.262      6.82    6.96     8.08    7.16    0.515
F3    0.291    0.529    0.773    0.521    0.176      8.74    10.2     10.4    9.75    0.821
Conclusion
Take Home Message
Bandits allow us to optimise easily without highly restrictive assumptions.
GP bandits exploit dependencies in the observations.
DLGP converges up to 20 times faster to the optimum.
Exciting opportunities to further improve DLGP.
Use other surrogate optimisation algorithms.
Learn kernels (not only hyperparameters).
Use DLGP instead of GP bandits.
Thank you.
ANNEXES
The functions in our test suite include Rastrigin (F1), Griewank (F2) and Ackley’s (F3). The search domain for F1 is [−5, 5]^D, while it is [−100, 100]^D for F2 and F3. All of them are multimodal and only F2 is non-separable.

$$F_1(x) = 10D + \sum_{i=1}^{D}\left[x_i^2 - 10\cos(2\pi x_i)\right]$$

$$F_2(x) = \sum_{i=1}^{D}\frac{x_i^2}{4000} - \prod_{i=1}^{D}\cos\frac{x_i}{\sqrt{i}} + 1$$

$$F_3(z) = -20\exp\left(-0.2\sqrt{\frac{1}{D}\sum_{i=1}^{D} z_i^2}\right) - \exp\left(\frac{1}{D}\sum_{i=1}^{D}\cos(2\pi z_i)\right) + 20 + e$$
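These benchmarks (written here in their standard forms, with the −10·cos term of Rastrigin restored) are straightforward to implement; all three have their global minimum 0 at the origin. A minimal NumPy sketch with our own function names:

```python
import numpy as np

def rastrigin(x):                      # F1
    x = np.asarray(x, float)
    return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

def griewank(x):                       # F2
    x = np.asarray(x, float)
    i = np.arange(1, len(x) + 1)
    return np.sum(x**2 / 4000) - np.prod(np.cos(x / np.sqrt(i))) + 1

def ackley(z):                         # F3
    z = np.asarray(z, float)
    d = len(z)
    return (-20 * np.exp(-0.2 * np.sqrt(np.sum(z**2) / d))
            - np.exp(np.sum(np.cos(2 * np.pi * z)) / d) + 20 + np.e)

print(rastrigin([0, 0]), griewank([0, 0]), ackley([0, 0]))  # all ≈ 0 at the optimum
```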