
REGRET BOUNDS FOR META BAYESIAN OPTIMIZATION WITH AN UNKNOWN GAUSSIAN PROCESS PRIOR

Zi Wang∗, Beomjoon Kim∗, Leslie Pack Kaelbling ({ziw, beomjoon, lpk}@mit.edu; ∗equal contribution)

MAIN CONTRIBUTIONS

• A stand-alone Bayesian optimization module that takes in only a multi-task training data set as input and then actively selects inputs to efficiently optimize a new function.

• Constructive analyses of the regret of this module.

BAYESIAN OPTIMIZATION

• Maximize an expensive black-box function f : X → R with sequential queries x_1, …, x_T and noisy observations of their values y_1, …, y_T.

• Assume a Gaussian process prior f ∼ GP (µ, k).

• Use acquisition functions as the decision criterion for where to query.

• Major problem: the prior is unknown. Existing approaches:

– maximum likelihood;

– hierarchical Bayes.

• Current theoretical results break down if the prior is not given.

• What if we have experience with similar functions? Ex. Optimizing robot grasps in different environments:

OUR MODEL

[Model diagram: generic knowledge of functions is captured by a distribution over functions GP(µ, k); the training functions f_1, f_2, …, f_N and the new function f are all drawn from GP(µ, k); for each training function we observe noisy values (x_1, y_i1), (x_2, y_i2), …, while the values of the new function are unknown. Goal: optimize the new function f with few queries.]

Our evaluation criteria:

• best-sample simple regret r_T = max_{x∈X} f(x) − max_{t∈[T]} f(x_t);

• simple regret R_T = max_{x∈X} f(x) − f(x_{t*}), where t* = argmax_{t∈[T]} y_t.
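The two criteria differ only in which queried point gets credit: r_T scores the best point actually visited, while R_T scores the point one would report given only the noisy observations. A minimal Python sketch (the helper name and the assumption that the true f and its maximum are available for scoring are ours, not the poster's):

```python
import numpy as np

def simple_regrets(f, xs, ys, f_max):
    """Best-sample simple regret r_T and simple regret R_T.

    f     -- true objective, used here only for scoring
    xs    -- queried inputs x_1, ..., x_T
    ys    -- corresponding noisy observations y_1, ..., y_T
    f_max -- max_{x in X} f(x), assumed known for evaluation purposes
    """
    f_vals = np.array([f(x) for x in xs])
    r_T = f_max - f_vals.max()           # best point actually queried
    t_star = int(np.argmax(ys))          # index with highest observed value
    R_T = f_max - f_vals[t_star]         # regret of the reported point
    return r_T, R_T
```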

GAUSSIAN PROCESSES

Given observations D_t = {(x_τ, y_τ)}_{τ=1}^{t} with y_τ ∼ N(f(x_τ), σ²), the posterior GP(µ_t, k_t) satisfies

µ_t(x) = µ(x) + k(x, x_t)(k(x_t) + σ²I)⁻¹(y_t − µ(x_t)),
k_t(x, x′) = k(x, x′) − k(x, x_t)(k(x_t) + σ²I)⁻¹k(x_t, x′),

where y_t = [y_τ]_{τ=1}^{t}, x_t = [x_τ]_{τ=1}^{t}, µ(x) = [µ(x_i)]_{i=1}^{n}, k(x, x′) = [k(x_i, x′_j)]_{i∈[n], j∈[n′]}, and k(x) = k(x, x).
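A minimal numpy sketch of this posterior update on a set of query points (function and variable names are illustrative, not from the authors' code):

```python
import numpy as np

def gp_posterior(mu, k, X_t, y_t, sigma, X_q):
    """Posterior mean and covariance at query points X_q given the
    observations (X_t, y_t), following the formulas above."""
    K_tt = np.array([[k(a, b) for b in X_t] for a in X_t])   # k(x_t)
    K_qt = np.array([[k(a, b) for b in X_t] for a in X_q])   # k(x, x_t)
    K_qq = np.array([[k(a, b) for b in X_q] for a in X_q])   # k(x, x')
    A = K_tt + sigma ** 2 * np.eye(len(X_t))                 # k(x_t) + sigma^2 I
    resid = y_t - np.array([mu(a) for a in X_t])             # y_t - mu(x_t)
    mu_post = np.array([mu(a) for a in X_q]) + K_qt @ np.linalg.solve(A, resid)
    k_post = K_qq - K_qt @ np.linalg.solve(A, K_qt.T)
    return mu_post, k_post
```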

MAIN IDEA: USE PAST EXPERIENCE TO ESTIMATE THE PRIOR OF f

• Offline phase:

– collect meta training data: M evaluations from each of the N functions sampled from the same prior, D_N = {[(x_j, y_ij)]_{j=1}^{M}}_{i=1}^{N}, y_ij ∼ N(f_i(x_j), σ²), f_i ∼ GP(µ, k);

– estimate the prior mean function µ and kernel k from the meta training data.

• Online phase: estimate the posterior mean µ̂_t and covariance k̂_t and use them for BO on a new function f ∼ GP(µ, k) with a total of T iterations.

ESTIMATE THE PRIOR GP(µ, k) AND POSTERIOR GP(µ_t, k_t): DISCRETE AND CONTINUOUS CASES

X is finite: directly estimate the prior and the posterior [1]. Define the observation matrix Y = [Y_i]_{i∈[N]} = [y_ij]_{i∈[N], j∈[M]}. Missing entries? Use matrix completion [2].

µ̂(X) = (1/N) Yᵀ1_N ∼ N(µ(X), (1/N)(k(X) + σ²I)),
k̂(X) = (1/(N−1)) (Y − 1_N µ̂(X)ᵀ)ᵀ(Y − 1_N µ̂(X)ᵀ) ∼ W((1/(N−1))(k(X) + σ²I), N − 1),

µ̂_t(x) = µ̂(x) + k̂(x, x_t) k̂(x_t, x_t)⁻¹ (y_t − µ̂(x_t)),
k̂_t(x, x′) = ((N − 1)/(N − t − 1)) (k̂(x, x′) − k̂(x, x_t) k̂(x_t, x_t)⁻¹ k̂(x_t, x′)).
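A numpy sketch of these estimators for a finite domain with no missing entries (names are ours; the matrix-completion step for missing entries [2] is omitted):

```python
import numpy as np

def estimate_prior_finite(Y):
    """Prior estimate on a finite domain from the N x M observation matrix Y
    (row i holds the noisy values of training function f_i at the M points)."""
    N = Y.shape[0]
    mu_hat = Y.mean(axis=0)                     # (1/N) Y^T 1_N
    centered = Y - mu_hat                       # Y - 1_N mu_hat^T
    k_hat = centered.T @ centered / (N - 1)     # unbiased sample covariance
    return mu_hat, k_hat

def estimate_posterior_finite(mu_hat, k_hat, idx_t, y_t, N):
    """Estimated posterior over all M domain points after observing the new
    function at the points with indices idx_t (noisy values y_t)."""
    t = len(idx_t)
    K_tt = k_hat[np.ix_(idx_t, idx_t)]          # k_hat(x_t, x_t)
    K_at = k_hat[:, idx_t]                      # k_hat(x, x_t) for every x
    mu_t = mu_hat + K_at @ np.linalg.solve(K_tt, y_t - mu_hat[idx_t])
    k_t = (N - 1) / (N - t - 1) * (k_hat - K_at @ np.linalg.solve(K_tt, K_at.T))
    return mu_t, k_t
```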

X ⊂ R^d is compact: using weights to represent the prior

• Assume there exist basis functions Φ = [φ_s]_{s=1}^{K} : X → R^K, a mean parameter u ∈ R^K and a covariance parameter Σ ∈ R^{K×K} such that µ(x) = Φ(x)ᵀu and k(x, x′) = Φ(x)ᵀΣΦ(x′), i.e. f(x) = Φ(x)ᵀW ∼ GP(µ, k) with W ∼ N(u, Σ).

• Assume M ≥ K and that Φ(x) has full row rank. The observations satisfy Y_i = Φ(x)ᵀW_i + ε_i ∼ N(Φ(x)ᵀu, Φ(x)ᵀΣΦ(x) + σ²I), and we estimate the weight vector as Ŵ_i = (Φ(x)ᵀ)⁺Y_i ∼ N(u, Σ + σ²(Φ(x)Φ(x)ᵀ)⁻¹). Let Ŵ = [Ŵ_i]_{i=1}^{N} ∈ R^{N×K}.

• Unbiased GP prior parameter estimators: û = (1/N) Ŵᵀ1_N and Σ̂ = (1/(N−1)) (Ŵ − 1_N ûᵀ)ᵀ(Ŵ − 1_N ûᵀ).

• Unbiased GP posterior estimators: µ̂_t(x) = Φ(x)ᵀû_t and k̂_t(x) = Φ(x)ᵀΣ̂_tΦ(x), where

û_t = û + Σ̂Φ(x_t)(Φ(x_t)ᵀΣ̂Φ(x_t))⁻¹(y_t − Φ(x_t)ᵀû),
Σ̂_t = ((N − 1)/(N − t − 1)) (Σ̂ − Σ̂Φ(x_t)(Φ(x_t)ᵀΣ̂Φ(x_t))⁻¹Φ(x_t)ᵀΣ̂).
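A corresponding numpy sketch of the weight-space estimators (names are illustrative; Phi_X stacks the features Φ(x_j) of the M shared training inputs as columns of a K x M matrix):

```python
import numpy as np

def estimate_prior_basis(Phi_X, Y):
    """Estimate (u, Sigma) from the N x M observation matrix Y, given the
    K x M feature matrix Phi_X of the shared training inputs (M >= K)."""
    N = Y.shape[0]
    W_hat = (np.linalg.pinv(Phi_X.T) @ Y.T).T       # row i is W_hat_i = (Phi^T)^+ Y_i
    u_hat = W_hat.mean(axis=0)                      # (1/N) W_hat^T 1_N
    centered = W_hat - u_hat
    Sigma_hat = centered.T @ centered / (N - 1)     # unbiased sample covariance
    return u_hat, Sigma_hat

def estimate_posterior_basis(u_hat, Sigma_hat, Phi_t, y_t, N):
    """Estimated posterior weight distribution after observing the new
    function at inputs with features Phi_t (K x t) and noisy values y_t."""
    t = Phi_t.shape[1]
    S = Phi_t.T @ Sigma_hat @ Phi_t                 # Phi(x_t)^T Sigma_hat Phi(x_t)
    gain = Sigma_hat @ Phi_t @ np.linalg.inv(S)
    u_t = u_hat + gain @ (y_t - Phi_t.T @ u_hat)
    Sigma_t = (N - 1) / (N - t - 1) * (Sigma_hat - gain @ Phi_t.T @ Sigma_hat)
    return u_t, Sigma_t

# mu_hat_t(x) and k_hat_t(x) at a new input with feature vector phi_x:
#   phi_x @ u_t  and  phi_x @ Sigma_t @ phi_x
```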

Lemma 1. If the size of the training dataset satisfies N ≥ T + 2, then for any input x ∈ X, with probability at least 1 − δ,

|µ̂_t(x) − µ_t(x)|² < a_t (k_t(x) + σ²(x))  and  1 − 2√b_t < k̂_t(x)/(k_t(x) + σ²(x)) < 1 + 2√b_t + 2b_t,

where a_t = 4(N − 2 + t + 2√(t log(4/δ)) + 2 log(4/δ)) / (δN(N − t − 2)) and b_t = (1/(N − t − 1)) log(4/δ). For finite X, σ²(x) = σ²; for compact X, σ²(x) = σ²Φ(x)ᵀ(Φ(x)Φ(x)ᵀ)⁻¹Φ(x).
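Both constants behave roughly like 1/N for fixed t and δ, so the estimated posterior concentrates around the true one as more training functions are collected. A quick numeric check of that behavior (values below are illustrative only):

```python
import numpy as np

def lemma1_constants(N, t, delta):
    """The constants a_t and b_t from Lemma 1."""
    log_term = np.log(4 / delta)
    a_t = 4 * (N - 2 + t + 2 * np.sqrt(t * log_term) + 2 * log_term) \
        / (delta * N * (N - t - 2))
    b_t = log_term / (N - t - 1)
    return a_t, b_t

# Both constants shrink as the number of training functions N grows.
for N in (100, 200, 400, 800):
    print(N, lemma1_constants(N, t=20, delta=0.05))
```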

NEAR-ZERO REGRET BOUNDS FOR BO WITH THE ESTIMATED PRIOR AND POSTERIOR

Acquisition functions:

α^PI_{t−1}(x) = (µ̂_{t−1}(x) − f*) / √k̂_{t−1}(x),   α^GP-UCB_{t−1}(x) = µ̂_{t−1}(x) + ζ_t √k̂_{t−1}(x),

where f* ≥ max_{x∈X} f(x) and

ζ_t = [ √(6(N − 3 + t + 2√(t log(6/δ)) + 2 log(6/δ)) / (δN(N − t − 1))) + √(2 log(3/δ)) ] / √(1 − 2√((1/(N − t)) log(6/δ))).
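A sketch of the two acquisition rules over a candidate set, taking the estimated posterior mean µ̂_{t−1} and variance k̂_{t−1} as arrays (function names are ours):

```python
import numpy as np

def zeta(N, t, delta):
    """Confidence multiplier zeta_t used by the GP-UCB variant above.
    Requires N large enough that the denominator's radicand is positive
    (cf. the N >= 4 log(6/delta) + T + 2 condition of Theorem 2)."""
    log6 = np.log(6 / delta)
    num = np.sqrt(6 * (N - 3 + t + 2 * np.sqrt(t * log6) + 2 * log6)
                  / (delta * N * (N - t - 1))) + np.sqrt(2 * np.log(3 / delta))
    den = np.sqrt(1 - 2 * np.sqrt(log6 / (N - t)))
    return num / den

def acq_pi(mu_prev, k_prev, f_star):
    """PI acquisition: standardized gap to an upper bound f_star >= max f."""
    return (mu_prev - f_star) / np.sqrt(k_prev)

def acq_gp_ucb(mu_prev, k_prev, N, t, delta):
    """GP-UCB acquisition with the meta-BO confidence width zeta_t."""
    return mu_prev + zeta(N, t, delta) * np.sqrt(k_prev)

# The next query is the candidate that maximizes the chosen acquisition value.
```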

With high probability, the simple regret decreases to a constant proportional to the noise level σ as the number of iterations and the number of training functions increase.

Theorem 2. Assume there exists a constant c ≥ max_{x∈X} k(x) and that a training dataset is available whose size is N ≥ 4 log(6/δ) + T + 2. With probability at least 1 − δ, the best-sample simple regret in T iterations of meta BO with either GP-UCB or PI satisfies

r_T^UCB < η_T^UCB(N) λ_T,   r_T^PI < η_T^PI(N) λ_T,   λ_T² = O(ρ_T/T) + σ(x_τ)²,

where η_T^UCB(N) = (m + C_1)(√(1 + m)/√(1 − m) + 1), η_T^PI(N) = (m + C_2)(√(1 + m)/√(1 − m) + 1) + C_3, m = O(√(1/(N − T))), C_1, C_2, C_3 > 0 are constants, τ = argmin_{t∈[T]} k̂_{t−1}(x_t), and ρ_T = max_{A⊆X, |A|=T} (1/2) log |I + σ⁻²k(A)|. σ(·) is defined in the same way as in Lemma 1.

Lemma 3. With probability at least 1 − δ, the simple regret satisfies R_T ≤ r_T + 2(2 log(1/δ))^{1/2} σ.

EXPERIMENTS

Optimizing a grasp (N = 1800, M = 162)

[Two panels plotting the max observed value against the number of evaluations of the test f (Random, Plain-UCB, PEM-BO-UCB, TLSM-BO-UCB) and against the proportion of the prior training dataset (PEM-BO-UCB, Plain-UCB).]

Optimizing a grasp, base pose, and placement (N = 1500, M = 1000)

[Two panels plotting the max observed value against the number of evaluations of the test f (Random, Plain-UCB, PEM-BO-UCB, TLSM-BO-UCB) and against the proportion of the prior training dataset (PEM-BO-UCB, Plain-UCB).]

Optimizing a continuous synthetic function drawn from a GP with a squared exponential kernel (N = 100, M = 1000)

[Two panels plotting the max observed value against the proportion of the prior training dataset (PEM-BO-UCB, Plain-UCB w/ correct form of prior) and against the number of evaluations of the test f (Random, Plain-UCB w/ correct form of prior, PEM-BO-UCB, TLSM-BO).]

• PEM-BO: our point-estimate meta BO approach.

• Plain-BO: plain BO using a squared exponential kernel and maximum likelihood for GP hyperparameter estimation.

• TLSM-BO: transfer learning sequential model-based optimization [5].

• Random: random selection.

SELECTED REFERENCES

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1958.

[2] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[3] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez. Learning to guide task and motion planning using score-space representation. In ICRA, 2017.

[4] R. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer, 1996.

[5] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In AISTATS, 2014.
