
REGRET BOUNDS FOR META BAYESIAN OPTIMIZATION WITH AN UNKNOWN GAUSSIAN PROCESS PRIOR
ZI WANG∗, BEOMJOON KIM∗, LESLIE PACK KAELBLING
ziw, beomjoon, [email protected] (∗ equal contribution)

MAIN CONTRIBUTIONS

• A stand-alone Bayesian optimization module that takes in only a multi-task training data set as input and then actively selects inputs to efficiently optimize a new function.

• Constructive analyses of the regret of this module.

BAYESIAN OPTIMIZATION

• Maximize an expensive blackbox function f : X → R with sequential queries x1, · · · , xT and noisy observations of their values y1, · · · , yT.

• Assume a Gaussian process prior f ∼ GP (µ, k).

• Use acquisition functions as the decision criterion for where to query.

• Major problem: the prior is unknown. Existing approaches:

– maximum likelihood;

– hierarchical Bayes.

• Current theoretical results break down if the prior is not given.

• What if we have experience with similar functions? Example: optimizing robot grasps in different environments.

OUR MODEL

[Figure: the training functions f1, f2, ..., fN and the new test function f are all drawn from the same prior GP(μ, k), which encodes generic knowledge of functions as a distribution over functions. For each training function fi we have observations of function values (x1, yi1), (x2, yi2), ..., while the values of the new function f are unknown. Goal: optimize f with few queries.]

Our evaluation criteria (both sketched in code below):

• best-sample simple regret $r_T = \max_{x\in X} f(x) - \max_{t\in[T]} f(x_t)$;

• simple regret $R_T = \max_{x\in X} f(x) - f(x_{t^*})$, where $t^* = \arg\max_{t\in[T]} y_t$.
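As a concrete illustration of the two criteria, here is a small sketch; it assumes the true maximum `f_max` is known, which holds only in synthetic evaluations, and the names are illustrative.

```python
import numpy as np

def regrets(f_vals, y_vals, f_max):
    """Best-sample simple regret r_T and simple regret R_T after T queries.

    f_vals: true values f(x_1), ..., f(x_T) at the queried points
    y_vals: the corresponding noisy observations y_1, ..., y_T
    f_max:  max_x f(x), available only for synthetic benchmarks
    """
    r_T = f_max - np.max(f_vals)       # best-sample simple regret
    t_star = int(np.argmax(y_vals))    # index of the best observed value
    R_T = f_max - f_vals[t_star]       # simple regret
    return r_T, R_T
```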

GAUSSIAN PROCESSES

Given observations $D_t = \{(x_\tau, y_\tau)\}_{\tau=1}^{t}$, $y_\tau \sim \mathcal{N}(f(x_\tau), \sigma^2)$, the posterior $GP(\mu_t, k_t)$ satisfies

$\mu_t(x) = \mu(x) + k(x, \mathbf{x}_t)\,(k(\mathbf{x}_t) + \sigma^2 I)^{-1}(\mathbf{y}_t - \mu(\mathbf{x}_t)),$

$k_t(x, x') = k(x, x') - k(x, \mathbf{x}_t)\,(k(\mathbf{x}_t) + \sigma^2 I)^{-1} k(\mathbf{x}_t, x'),$

where $\mathbf{y}_t = [y_\tau]_{\tau=1}^{t}$, $\mathbf{x}_t = [x_\tau]_{\tau=1}^{t}$, $\mu(\mathbf{x}) = [\mu(x_i)]_{i=1}^{n}$, $k(\mathbf{x}, \mathbf{x}') = [k(x_i, x'_j)]_{i\in[n],\, j\in[n']}$, and $k(\mathbf{x}) = k(\mathbf{x}, \mathbf{x})$.
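As a concrete reference, a minimal numpy sketch of these posterior formulas; the squared exponential kernel and the zero prior mean are illustrative assumptions (the poster's method estimates the prior instead of assuming it).

```python
import numpy as np

def sqexp_kernel(A, B, lengthscale=1.0):
    """Squared exponential kernel matrix between row-wise point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_t, y_t, X_star, kern=sqexp_kernel,
                 mu=lambda X: np.zeros(len(X)), sigma=0.1):
    """Posterior mean mu_t and covariance k_t at test points X_star,
    given queried points X_t with noisy values y_t (the formulas in this box)."""
    K = kern(X_t, X_t) + sigma**2 * np.eye(len(X_t))
    K_star = kern(X_star, X_t)
    alpha = np.linalg.solve(K, y_t - mu(X_t))
    mu_t = mu(X_star) + K_star @ alpha
    k_t = kern(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mu_t, k_t
```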

MAIN IDEA: USE THE PAST EXPERIENCE TO ESTIMATE THE PRIOR OF f

• Offline phase:

– collect meta training data: M evaluations from each of the N functions sampled from the same prior, $\bar{D}_N = \{[(\bar{x}_j, \bar{y}_{ij})]_{j=1}^{M}\}_{i=1}^{N}$, $\bar{y}_{ij} \sim \mathcal{N}(f_i(\bar{x}_j), \sigma^2)$, $f_i \sim GP(\mu, k)$;

– estimate the prior mean function $\hat{\mu}$ and kernel $\hat{k}$ from the meta training data.

• Online phase: estimate the posterior mean $\hat{\mu}_t$ and covariance $\hat{k}_t$ and use them for BO on a new function $f \sim GP(\mu, k)$ with a total of T iterations.

ESTIMATE THE PRIOR $GP(\hat{\mu}, \hat{k})$ AND POSTERIOR $GP(\hat{\mu}_t, \hat{k}_t)$: DISCRETE AND CONTINUOUS CASES

X is finite: directly estimate the prior and the posterior [1]. Define the observation matrix as $Y = [Y_i]_{i\in[N]} = [\bar{y}_{ij}]_{i\in[N],\, j\in[M]}$. Missing entries? Use matrix completion [2].

$\hat{\mu}(X) = \frac{1}{N} Y^{\mathrm{T}} \mathbf{1}_N \sim \mathcal{N}\big(\mu(X),\ \tfrac{1}{N}(k(X) + \sigma^2 I)\big), \qquad \hat{k}(X) = \frac{1}{N-1}(Y - \mathbf{1}_N \hat{\mu}(X)^{\mathrm{T}})^{\mathrm{T}}(Y - \mathbf{1}_N \hat{\mu}(X)^{\mathrm{T}}) \sim \mathcal{W}\big(\tfrac{1}{N-1}(k(X) + \sigma^2 I),\ N-1\big).$

$\hat{\mu}_t(x) = \hat{\mu}(x) + \hat{k}(x, \mathbf{x}_t)\,\hat{k}(\mathbf{x}_t, \mathbf{x}_t)^{-1}(\mathbf{y}_t - \hat{\mu}(\mathbf{x}_t)), \qquad \hat{k}_t(x, x') = \frac{N-1}{N-t-1}\big(\hat{k}(x, x') - \hat{k}(x, \mathbf{x}_t)\,\hat{k}(\mathbf{x}_t, \mathbf{x}_t)^{-1}\hat{k}(\mathbf{x}_t, x')\big).$
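A minimal numpy sketch of the finite-X estimators above, assuming a fully observed N × M observation matrix Y over a fixed discrete candidate set (no matrix completion); function and variable names are illustrative, not from the paper.

```python
import numpy as np

def estimate_prior(Y):
    """Y: (N, M) matrix, row i = noisy values of training function f_i on the M candidates.
    Returns unbiased estimates mu_hat (M,) and k_hat (M, M) of the mean and of k(X) + sigma^2 I."""
    N = Y.shape[0]
    mu_hat = Y.mean(axis=0)            # (1/N) Y^T 1_N
    R = Y - mu_hat                     # centered rows
    k_hat = R.T @ R / (N - 1)          # unbiased sample covariance
    return mu_hat, k_hat

def estimate_posterior(mu_hat, k_hat, idx, y_obs, N):
    """Conditional (posterior) estimates after observing the test function at candidate
    indices idx with values y_obs, including the (N-1)/(N-t-1) debiasing factor."""
    t = len(idx)
    K_tt_inv = np.linalg.solve(k_hat[np.ix_(idx, idx)], np.eye(t))
    K_xt = k_hat[:, idx]
    mu_t = mu_hat + K_xt @ K_tt_inv @ (y_obs - mu_hat[idx])
    k_t = (N - 1) / (N - t - 1) * (k_hat - K_xt @ K_tt_inv @ K_xt.T)
    return mu_t, k_t
```

In the online phase, these conditional estimates play the role of the GP posterior when maximizing an acquisition function (see the regret-bounds box below).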

$X \subset \mathbb{R}^d$ is compact: using weights to represent the prior

• Assume there exist basis functions $\Phi = [\phi_s]_{s=1}^{K} : X \to \mathbb{R}^K$, a mean parameter $u \in \mathbb{R}^K$ and a covariance parameter $\Sigma \in \mathbb{R}^{K\times K}$ such that $\mu(x) = \Phi(x)^{\mathrm{T}} u$ and $k(x, x') = \Phi(x)^{\mathrm{T}} \Sigma \Phi(x')$, i.e. $f = \Phi(x)^{\mathrm{T}} W \sim GP(\mu, k)$, $W \sim \mathcal{N}(u, \Sigma)$.

• Assume $M \ge K$ and $\Phi(\bar{\mathbf{x}})$ has full row rank. The observation $Y_i = \Phi(\bar{\mathbf{x}})^{\mathrm{T}} W_i + \epsilon_i \sim \mathcal{N}\big(\Phi(\bar{\mathbf{x}})^{\mathrm{T}} u,\ \Phi(\bar{\mathbf{x}})^{\mathrm{T}} \Sigma \Phi(\bar{\mathbf{x}}) + \sigma^2 I\big)$, and we estimate the weight vector as $\hat{W}_i = (\Phi(\bar{\mathbf{x}})^{\mathrm{T}})^{+} Y_i \sim \mathcal{N}\big(u,\ \Sigma + \sigma^2(\Phi(\bar{\mathbf{x}})\Phi(\bar{\mathbf{x}})^{\mathrm{T}})^{-1}\big)$. Let $\mathbf{W} = [\hat{W}_i]_{i=1}^{N} \in \mathbb{R}^{N\times K}$.

• Unbiased GP prior parameter estimator: $\hat{u} = \frac{1}{N}\mathbf{W}^{\mathrm{T}} \mathbf{1}_N$ and $\hat{\Sigma} = \frac{1}{N-1}(\mathbf{W} - \mathbf{1}_N \hat{u}^{\mathrm{T}})^{\mathrm{T}}(\mathbf{W} - \mathbf{1}_N \hat{u}^{\mathrm{T}})$.

• Unbiased GP posterior estimator: $\hat{\mu}_t(x) = \Phi(x)^{\mathrm{T}} \hat{u}_t$ and $\hat{k}_t(x) = \Phi(x)^{\mathrm{T}} \hat{\Sigma}_t \Phi(x)$, where

$\hat{u}_t = \hat{u} + \hat{\Sigma}\Phi(\mathbf{x}_t)\big(\Phi(\mathbf{x}_t)^{\mathrm{T}} \hat{\Sigma} \Phi(\mathbf{x}_t)\big)^{-1}(\mathbf{y}_t - \Phi(\mathbf{x}_t)^{\mathrm{T}} \hat{u}), \qquad \hat{\Sigma}_t = \frac{N-1}{N-t-1}\Big(\hat{\Sigma} - \hat{\Sigma}\Phi(\mathbf{x}_t)\big(\Phi(\mathbf{x}_t)^{\mathrm{T}} \hat{\Sigma} \Phi(\mathbf{x}_t)\big)^{-1}\Phi(\mathbf{x}_t)^{\mathrm{T}} \hat{\Sigma}\Big).$

(A numerical sketch of these estimators follows below.)
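A minimal sketch of the basis-function estimators for compact X, assuming the basis Φ is given (e.g. random features); function and variable names are illustrative.

```python
import numpy as np

def estimate_prior_weights(Phi_bar, Y):
    """Phi_bar: (K, M) basis evaluated at the M shared training inputs (full row rank).
    Y: (N, M) noisy observations of the N training functions.
    Returns unbiased estimates u_hat (K,) and Sigma_hat (K, K)."""
    W_hat = Y @ np.linalg.pinv(Phi_bar)        # row i = ((Phi_bar^T)^+ Y_i)^T
    N = W_hat.shape[0]
    u_hat = W_hat.mean(axis=0)
    R = W_hat - u_hat
    Sigma_hat = R.T @ R / (N - 1)
    return u_hat, Sigma_hat

def estimate_posterior_weights(u_hat, Sigma_hat, Phi_t, y_t, N):
    """Phi_t: (K, t) basis at the points queried so far; y_t: (t,) their observed values.
    Returns u_hat_t, Sigma_hat_t; the estimated posterior at x is then
    mu_t(x) = Phi(x)^T u_hat_t and k_t(x) = Phi(x)^T Sigma_hat_t Phi(x)."""
    t = Phi_t.shape[1]
    S = Phi_t.T @ Sigma_hat @ Phi_t            # (t, t)
    G = Sigma_hat @ Phi_t @ np.linalg.inv(S)   # (K, t) gain
    u_hat_t = u_hat + G @ (y_t - Phi_t.T @ u_hat)
    Sigma_hat_t = (N - 1) / (N - t - 1) * (Sigma_hat - G @ Phi_t.T @ Sigma_hat)
    return u_hat_t, Sigma_hat_t
```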

Lemma 1. If the size of the training dataset satisfies $N \ge T + 2$, then for any input $x \in X$, with probability at least $1 - \delta$,

$|\hat{\mu}_t(x) - \mu_t(x)|^2 < a_t\,(k_t(x) + \bar{\sigma}^2(x))$ and $1 - 2\sqrt{b_t} < \hat{k}_t(x)/(k_t(x) + \bar{\sigma}^2(x)) < 1 + 2\sqrt{b_t} + 2 b_t$,

where $a_t = \frac{4\big(N - 2 + t + 2\sqrt{t \log(4/\delta)} + 2\log(4/\delta)\big)}{\delta N (N - t - 2)}$ and $b_t = \frac{1}{N - t - 1}\log\frac{4}{\delta}$. For finite X, $\bar{\sigma}^2(x) = \sigma^2$; for compact X, $\bar{\sigma}^2(x) = \sigma^2\, \Phi(x)^{\mathrm{T}}\big(\Phi(\bar{\mathbf{x}})\Phi(\bar{\mathbf{x}})^{\mathrm{T}}\big)^{-1}\Phi(x)$.
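For intuition about how quickly the confidence terms in Lemma 1 shrink as the number of training functions N grows, a tiny illustrative helper that evaluates $a_t$ and $b_t$:

```python
import numpy as np

def lemma1_constants(N, t, delta=0.05):
    """Evaluate the Lemma 1 scaling terms a_t and b_t (requires N > t + 2)."""
    log4d = np.log(4.0 / delta)
    a_t = 4.0 * (N - 2 + t + 2 * np.sqrt(t * log4d) + 2 * log4d) / (delta * N * (N - t - 2))
    b_t = log4d / (N - t - 1)
    return a_t, b_t

# Example: with N = 2000 training functions and t = 100 queries,
# both terms are small, so the estimated posterior is close to the true one.
print(lemma1_constants(N=2000, t=100))
```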

NEAR ZERO REGRET BOUNDS FOR BO WITH THE ESTIMATED PRIOR AND POSTERIOR

Acquisition functions: $\alpha^{\text{PI}}_{t-1}(x) = \dfrac{\hat{\mu}_{t-1}(x) - \hat{f}^*}{\hat{k}_{t-1}(x)^{\frac12}}, \qquad \alpha^{\text{GP-UCB}}_{t-1}(x) = \hat{\mu}_{t-1}(x) + \zeta_t\, \hat{k}_{t-1}(x)^{\frac12},$

where $\hat{f}^* \ge \max_{x\in X} f(x)$ and

$\zeta_t = \left(\dfrac{6\big(N - 3 + t + 2\sqrt{t \log\frac{6}{\delta}} + 2\log\frac{6}{\delta}\big)}{\delta N (N - t - 1)}\right)^{\frac12} + \dfrac{\big(2\log\frac{3}{\delta}\big)^{\frac12}}{\Big(1 - 2\big(\tfrac{1}{N-t}\log\frac{6}{\delta}\big)^{\frac12}\Big)^{\frac12}}.$
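A minimal sketch of query selection with these acquisition functions over a discrete candidate set; `mu_hat_t` and `k_hat_t` are arrays of the estimated posterior mean and variance at the candidates (e.g. from the estimators sketched earlier), and `zeta` / `f_star_hat` are supplied by the user. Names are illustrative.

```python
import numpy as np

def next_query_ucb(mu_hat_t, k_hat_t, zeta):
    """Index of the candidate maximizing the GP-UCB acquisition mu + zeta * sqrt(k)."""
    return int(np.argmax(mu_hat_t + zeta * np.sqrt(np.maximum(k_hat_t, 1e-12))))

def next_query_pi(mu_hat_t, k_hat_t, f_star_hat):
    """Index of the candidate maximizing the PI acquisition (mu - f*) / sqrt(k)."""
    return int(np.argmax((mu_hat_t - f_star_hat) / np.sqrt(np.maximum(k_hat_t, 1e-12))))
```

In the online phase these selections alternate with the posterior updates above: estimate $(\hat{\mu}_t, \hat{k}_t)$, query the argmax of the acquisition, observe the value, and repeat for T iterations.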

With high probability, the simple regret decreases to a constant proportional to the noise level σ as the number of iterations and the number of training functions increase.

Theorem 2. Assume there exists a constant $c \ge \max_{x\in X} k(x)$ and a training dataset is available whose size is $N \ge 4\log\frac{6}{\delta} + T + 2$. With probability at least $1 - \delta$, the best-sample simple regret in T iterations of meta BO with either GP-UCB or PI satisfies

$r^{\text{UCB}}_T < \eta^{\text{UCB}}_T(N)\, \lambda_T, \qquad r^{\text{PI}}_T < \eta^{\text{PI}}_T(N)\, \lambda_T, \qquad \lambda_T^2 = O(\rho_T/T) + \bar{\sigma}(x_\tau)^2,$

where $\eta^{\text{UCB}}_T(N) = (m + C_1)\big(\tfrac{\sqrt{1+m}}{\sqrt{1-m}} + 1\big)$, $\eta^{\text{PI}}_T(N) = (m + C_2)\big(\tfrac{\sqrt{1+m}}{\sqrt{1-m}} + 1\big) + C_3$, $m = O\big(\sqrt{\tfrac{1}{N-T}}\big)$, $C_1, C_2, C_3 > 0$ are constants, $\tau = \arg\min_{t\in[T]} k_{t-1}(x_t)$, and $\rho_T = \max_{A\subseteq X,\, |A|=T} \frac12 \log|I + \sigma^{-2} k(A)|$. $\bar{\sigma}$ is defined in the same way as in Lemma 1.
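The quantity $\rho_T$ is the maximum information gain of T observations. A small illustrative helper that evaluates the log-determinant quantity for a chosen subset A of queried points (the maximum over all size-T subsets, which defines $\rho_T$, is combinatorial and not computed here):

```python
import numpy as np

def info_gain(K_A, sigma):
    """0.5 * log det(I + sigma^{-2} K_A) for the kernel matrix K_A of a chosen subset A."""
    T = K_A.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(T) + K_A / sigma**2)
    return 0.5 * logdet
```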

Lemma 3. With probability at least $1 - \delta$, the simple regret satisfies $R_T \le r_T + 2\big(2\log\frac{1}{\delta}\big)^{\frac12}\sigma$.

EXPERIMENTS

Optimizing a grasp (N = 1800, M = 162)

[Figure: max observed value vs. number of evaluations of the test f (comparing Random, Plain-UCB, PEM-BO-UCB, TLSM-BO-UCB), and vs. the proportion of the prior training dataset used (PEM-BO-UCB vs. Plain-UCB).]

Optimizing a grasp, base pose, and placement (N = 1500, M = 1000)

[Figure: the same two comparisons as above: max observed value vs. number of evaluations of the test f, and vs. the proportion of the prior training dataset.]

Optimizing a continuous synthetic function drawn from a GP with a squared exponential kernel (N = 100, M = 1000)

[Figure: max observed value vs. number of evaluations of the test f (Random, Plain-UCB w/ correct form of prior, PEM-BO-UCB, TLSM-BO), and vs. the proportion of the prior training dataset (PEM-BO-UCB vs. Plain-UCB w/ correct form of prior).]

• PEM-BO: our point estimate meta BO approach.

• Plain-BO: plain BO using a squared exponential kernel and maximum likelihood for GP hyperparameter estimation.

• TLSM-BO: transfer learning sequential model-based optimization [5].

• Random: random selection.

SELECTED REFERENCES

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1958.

[2] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[3] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez. Learning to guide task and motion planning using score-space representation. In ICRA, 2017.

[4] R. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer, 1996.

[5] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In AISTATS, 2014.