

REGRET BOUNDS FOR META BAYESIAN OPTIMIZATION WITH AN UNKNOWN GAUSSIAN PROCESS PRIOR
Zi Wang*, Beomjoon Kim*, Leslie Pack Kaelbling
ziw, beomjoon, [email protected] (*equal contribution)

MAIN CONTRIBUTIONS

• A stand-alone Bayesian optimization module that takes in only a multi-task training data set as input and then actively selects inputs to efficiently optimize a new function.

• Constructive analyses of the regret of this module.

BAYESIAN OPTIMIZATION

• Maximize an expensive blackbox function f : X → R with sequential queries x_1, …, x_T and noisy observations of their values y_1, …, y_T.

• Assume a Gaussian process prior f ∼ GP (µ, k).

• Use acquisition functions as the decision criterion for where to query.

• Major problem: the prior is unknown. Existing approaches:

– maximum likelihood;

– hierarchical Bayes.

• Current theoretical results break down if the prior is not given.

• What if we have experience with similar functions? Ex. Optimizing robot grasps in different environments:

OUR MODEL

[Diagram: N training functions f_1, f_2, …, f_N and the new test function f are all drawn from the same unknown prior GP(μ, k), a distribution over functions. For each training function f_i we have observations of function values (x_1, y_i1), (x_2, y_i2), …, while the values of the new function f are still unknown.]

f_1, f_2, …, f_N, f ∼ GP(μ, k). Goal: optimize f with few queries.

Our evaluation criteria:

• best-sample simple regret r_T = max_{x∈X} f(x) − max_{t∈[T]} f(x_t);

• simple regret R_T = max_{x∈X} f(x) − f(x_{t*}), t* = arg max_{t∈[T]} y_t.
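On a finite domain the two criteria can be computed directly. The sketch below is illustrative, not from the poster: `f_values`, `queried_idx`, and `y_observed` are assumed names for the true values over X, the queried indices x_1..x_T, and the noisy observations y_1..y_T.

```python
import numpy as np

def regrets(f_values, queried_idx, y_observed):
    """Compute both regret notions on a finite domain X."""
    f_star = f_values.max()
    f_queried = f_values[queried_idx]
    # best-sample simple regret: gap to the best *true* value among the queries
    r_T = f_star - f_queried.max()
    # simple regret: gap at the query whose *observed* value was best
    t_star = np.argmax(y_observed)
    R_T = f_star - f_queried[t_star]
    return r_T, R_T
```

Note that noise can make the two differ: the query with the best observed y_t need not be the query with the best true f(x_t), which is exactly what Lemma 3 below quantifies.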

GAUSSIAN PROCESSES

Given observations D_t = {(x_τ, y_τ)}_{τ=1}^t, y_τ ∼ N(f(x_τ), σ²), the posterior GP(μ_t, k_t) satisfies

μ_t(x) = μ(x) + k(x, x_t)(k(x_t) + σ²I)⁻¹(y_t − μ(x_t)),

k_t(x, x′) = k(x, x′) − k(x, x_t)(k(x_t) + σ²I)⁻¹ k(x_t, x′),

where y_t = [y_τ]_{τ=1}^t and x_t = [x_τ]_{τ=1}^t, μ(x) = [μ(x_i)]_{i=1}^n, k(x, x′) = [k(x_i, x′_j)]_{i∈[n], j∈[n′]}, and k(x) = k(x, x).
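The posterior update above translates directly into NumPy. Everything here is a minimal illustration: the squared exponential kernel is only one possible choice of k, and the helper names are not from the poster.

```python
import numpy as np

def sqexp(A, B, ell=1.0):
    """Squared exponential kernel (an illustrative choice of k)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(mu, k, X_t, y_t, x, sigma2):
    """Posterior mean and variance at a single point x, transcribing
      mu_t(x) = mu(x) + k(x, x_t)(k(x_t) + sigma^2 I)^-1 (y_t - mu(x_t)),
      k_t(x)  = k(x)  - k(x, x_t)(k(x_t) + sigma^2 I)^-1 k(x_t, x).
    """
    K = k(X_t, X_t) + sigma2 * np.eye(len(X_t))
    kx = k(x[None, :], X_t)[0]                      # vector k(x, x_t)
    mean = mu(x) + kx @ np.linalg.solve(K, y_t - np.array([mu(xi) for xi in X_t]))
    var = k(x[None, :], x[None, :])[0, 0] - kx @ np.linalg.solve(K, kx)
    return mean, var
```

With near-zero noise the posterior mean interpolates the observations, and far from the data the variance returns to the prior variance k(x, x).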

MAIN IDEA: USE THE PAST EXPERIENCE TO ESTIMATE THE PRIOR OF f

• Offline phase:

– collect meta training data: M evaluations from each of the N functions sampled from the same prior, D̄_N = {[(x̄_j, ȳ_ij)]_{j=1}^M}_{i=1}^N, ȳ_ij ∼ N(f_i(x̄_j), σ²), f_i ∼ GP(μ, k);

– estimate the prior mean function μ̂ and kernel k̂ from the meta training data.

• Online phase: estimate the posterior mean μ̂_t and covariance k̂_t and use them for BO on a new function f ∼ GP(μ, k) with a total of T iterations.
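A minimal end-to-end sketch of the two phases on a finite domain, under simplifying assumptions that are mine, not the poster's: noise-free test observations, a fixed UCB coefficient, and no (N−1)/(N−t−1) correction on the conditioned covariance.

```python
import numpy as np

def meta_bo(Y_train, f_test, T, zeta=2.0):
    """Finite-domain meta BO sketch: estimate the prior from the meta
    training matrix Y_train (N functions x |X| inputs), then run T
    rounds of GP-UCB on f_test with the estimated prior."""
    N, n = Y_train.shape
    mu = Y_train.mean(axis=0)                 # offline: estimated prior mean
    R = Y_train - mu
    K = R.T @ R / (N - 1)                     # offline: estimated prior covariance
    queried, y = [], []
    for t in range(T):                        # online: UCB with the estimate
        ucb = mu + zeta * np.sqrt(np.clip(np.diag(K), 0.0, None))
        x = int(np.argmax(ucb))
        queried.append(x)
        y.append(f_test[x])
        # condition the estimated GP on the new observation
        kx = K[:, [x]]
        denom = K[x, x] + 1e-9
        mu = mu + (kx[:, 0] / denom) * (f_test[x] - mu[x])
        K = K - kx @ kx.T / denom
    return queried, y
```

The point of the module is visible even in this toy: no kernel family or hyperparameters are supplied, only the multi-task data matrix.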

ESTIMATE THE PRIOR GP(μ̂, k̂) AND POSTERIOR GP(μ̂_t, k̂_t): DISCRETE AND CONTINUOUS CASES

X is finite: directly estimate the prior and the posterior [1]. Define the observation matrix as Y = [Y_i]_{i∈[N]} = [ȳ_ij]_{i∈[N], j∈[M]}. Missing entries? Use matrix completion [2].

μ̂(X) = (1/N) Yᵀ1_N ∼ N(μ(X), (1/N)(k(X) + σ²I)),

k̂(X) = (1/(N−1)) (Y − 1_N μ̂(X)ᵀ)ᵀ(Y − 1_N μ̂(X)ᵀ) ∼ W((1/(N−1))(k(X) + σ²I), N − 1),

μ̂_t(x) = μ̂(x) + k̂(x, x_t) k̂(x_t, x_t)⁻¹ (y_t − μ̂(x_t)),

k̂_t(x, x′) = ((N − 1)/(N − t − 1)) (k̂(x, x′) − k̂(x, x_t) k̂(x_t, x_t)⁻¹ k̂(x_t, x′)).
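For the finite case, the unbiased prior estimators are just the sample mean and sample covariance across the N rows of Y. A sketch, with the noise variance σ² folded into the estimated covariance as in the expressions above:

```python
import numpy as np

def estimate_prior(Y):
    """Unbiased estimates of the prior mean and (noise-inflated)
    covariance on a finite domain from the meta-training matrix Y,
    shape (N functions, M inputs)."""
    N = Y.shape[0]
    mu_hat = Y.mean(axis=0)           # (1/N) Y^T 1_N
    R = Y - mu_hat                    # rows centered by the estimated mean
    k_hat = R.T @ R / (N - 1)         # sample covariance, estimates k(X) + sigma^2 I
    return mu_hat, k_hat
```

Because the estimator sees only noisy function values, it recovers k(X) + σ²I rather than k(X) itself; that extra σ² is exactly the σ̄²(x) term appearing in Lemma 1.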

X ⊂ R^d is compact: use weights to represent the prior

• Assume there exist basis functions Φ = [φ_s]_{s=1}^K : X → R^K, a mean parameter u ∈ R^K and a covariance parameter Σ ∈ R^{K×K} such that μ(x) = Φ(x)ᵀu and k(x, x′) = Φ(x)ᵀΣΦ(x′), i.e. f = Φ(x)ᵀW ∼ GP(μ, k), W ∼ N(u, Σ).

• Assume M ≥ K and Φ(x̄) has full row rank. The observation Y_i = Φ(x̄)ᵀW_i + ε_i ∼ N(Φ(x̄)ᵀu, Φ(x̄)ᵀΣΦ(x̄) + σ²I), and we estimate the weight vector as Ŵ_i = (Φ(x̄)ᵀ)⁺ Y_i ∼ N(u, Σ + σ²(Φ(x̄)Φ(x̄)ᵀ)⁻¹). Let W = [Ŵ_i]_{i=1}^N ∈ R^{N×K}.

• Unbiased GP prior parameter estimators: û = (1/N) Wᵀ1_N and Σ̂ = (1/(N−1)) (W − 1_N ûᵀ)ᵀ(W − 1_N ûᵀ).

• Unbiased GP posterior estimators: μ̂_t(x) = Φ(x)ᵀû_t and k̂_t(x) = Φ(x)ᵀΣ̂_tΦ(x), where

û_t = û + Σ̂Φ(x_t)(Φ(x_t)ᵀΣ̂Φ(x_t))⁻¹(y_t − Φ(x_t)ᵀû),

Σ̂_t = ((N − 1)/(N − t − 1)) (Σ̂ − Σ̂Φ(x_t)(Φ(x_t)ᵀΣ̂Φ(x_t))⁻¹Φ(x_t)ᵀΣ̂).
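The weight-space estimators reduce to one pseudoinverse plus the same sample statistics as the finite case. A sketch under the stated assumptions (M ≥ K, full-row-rank features); `Phi_bar` and `estimate_weight_prior` are illustrative names:

```python
import numpy as np

def estimate_weight_prior(Phi_bar, Y):
    """Estimate the weight-space prior parameters (u, Sigma).

    Phi_bar: K x M feature matrix Phi(x_bar) at the M shared training
    inputs (assumed full row rank). Y: N x M observation matrix.
    """
    # W_hat_i = (Phi(x_bar)^T)^+ Y_i, vectorized over all N functions
    W = Y @ np.linalg.pinv(Phi_bar.T).T       # N x K estimated weight vectors
    N = W.shape[0]
    u_hat = W.mean(axis=0)                    # (1/N) W^T 1_N
    C = W - u_hat
    Sigma_hat = C.T @ C / (N - 1)             # estimates Sigma plus the noise term
    return u_hat, Sigma_hat
```

As in the finite case, Σ̂ absorbs the observation noise: it estimates Σ + σ²(Φ(x̄)Φ(x̄)ᵀ)⁻¹, which is where the compact-domain σ̄²(x) in Lemma 1 comes from.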

Lemma 1. If the size of the training dataset satisfies N ≥ T + 2, then for any input x ∈ X, with probability at least 1 − δ,

|μ̂_t(x) − μ_t(x)|² < a_t (k_t(x) + σ̄²(x)) and 1 − 2√b_t < k̂_t(x)/(k_t(x) + σ̄²(x)) < 1 + 2√b_t + 2b_t,

where a_t = 4(N − 2 + t + 2√(t log(4/δ)) + 2 log(4/δ)) / (δN(N − t − 2)) and b_t = (1/(N − t − 1)) log(4/δ). For finite X, σ̄²(x) = σ²; for compact X, σ̄²(x) = σ²Φ(x)ᵀ(Φ(x̄)Φ(x̄)ᵀ)⁻¹Φ(x).

NEAR ZERO REGRET BOUNDS FOR BO WITH THE ESTIMATED PRIOR AND POSTERIOR

Acquisition functions:

α^PI_{t−1}(x) = (μ̂_{t−1}(x) − f̂*) / k̂_{t−1}(x)^{1/2},   α^GP-UCB_{t−1}(x) = μ̂_{t−1}(x) + ζ_t k̂_{t−1}(x)^{1/2},

where f̂* ≥ max_{x∈X} f(x) and

ζ_t = [(6(N − 3 + t + 2√(t log(6/δ)) + 2 log(6/δ)) / (δN(N − t − 1)))^{1/2} + (2 log(3/δ))^{1/2}] / (1 − 2((1/(N − t)) log(6/δ))^{1/2})^{1/2}.
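On a finite candidate set, both acquisition rules reduce to a few vectorized lines. The `zeta` function below is my transcription of the poster's flattened ζ_t formula, so treat it as an assumption; the helper names are illustrative.

```python
import numpy as np

def zeta(t, N, delta):
    """Transcription of zeta_t from the poster; needs N large relative to t."""
    num = np.sqrt(6 * (N - 3 + t + 2 * np.sqrt(t * np.log(6 / delta))
                       + 2 * np.log(6 / delta)) / (delta * N * (N - t - 1)))
    num += np.sqrt(2 * np.log(3 / delta))
    den = np.sqrt(1 - 2 * np.sqrt(np.log(6 / delta) / (N - t)))
    return num / den

def acq_pi(mu_hat, k_hat, f_hat_star):
    """PI on the estimated posterior: (mu_hat - f*) / k_hat^(1/2)."""
    return (mu_hat - f_hat_star) / np.sqrt(k_hat)

def acq_ucb(mu_hat, k_hat, zeta_t):
    """GP-UCB on the estimated posterior: mu_hat + zeta_t * k_hat^(1/2)."""
    return mu_hat + zeta_t * np.sqrt(k_hat)

def next_query(mu_hat, k_hat, zeta_t):
    """Next input on a finite candidate set, by maximizing the UCB score."""
    return int(np.argmax(acq_ucb(mu_hat, k_hat, zeta_t)))
```

Note the trade-off both rules encode: a candidate wins either by high estimated mean or by high estimated variance, with ζ_t (or the gap to f̂*) setting the exchange rate.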

With high probability, the simple regret decreases to a constant proportional to the noise level σ as the number of iterations and training functions increase.

Theorem 2. Assume there exists a constant c ≥ max_{x∈X} k(x) and a training dataset is available whose size is N ≥ 4 log(6/δ) + T + 2. With probability at least 1 − δ, the best-sample simple regret in T iterations of meta BO with either GP-UCB or PI satisfies

r^UCB_T < η^UCB_T(N) λ_T,   r^PI_T < η^PI_T(N) λ_T,   λ²_T = O(ρ_T/T) + σ̄(x_τ)²,

where η^UCB_T(N) = (m + C₁)(√(1+m)/√(1−m) + 1), η^PI_T(N) = (m + C₂)(√(1+m)/√(1−m) + 1) + C₃, m = O(√(1/(N−T))), C₁, C₂, C₃ > 0 are constants, τ = arg min_{t∈[T]} k_{t−1}(x_t), and ρ_T = max_{A⊆X, |A|=T} (1/2) log|I + σ⁻²k(A)|. σ̄ is defined in the same way as in Lemma 1.

Lemma 3. With probability at least 1 − δ, the simple regret R_T ≤ r_T + 2(2 log(1/δ))^{1/2} σ.

EXPERIMENTS

Optimizing a grasp (N = 1800, M = 162)

[Plots: max observed value vs. number of evaluations of the test f, comparing Random, Plain-UCB, PEM-BO-UCB, and TLSM-BO-UCB; and max observed value vs. proportion of the prior training dataset, comparing PEM-BO-UCB and Plain-UCB.]

Optimizing a grasp, base pose, and placement (N = 1500, M = 1000)

[Plots: the same two comparisons as above.]

Optimizing a continuous synthetic function drawn from a GP with a squared exponential kernel (N = 100, M = 1000)

[Plots: max observed value vs. proportion of the prior training dataset, comparing PEM-BO-UCB and Plain-UCB w/ correct form of prior; and max observed value vs. number of evaluations of the test f, comparing Random, Plain-UCB w/ correct form of prior, PEM-BO-UCB, and TLSM-BO.]

• PEM-BO: our point-estimate meta BO approach.

• Plain-BO: plain BO using a squared exponential kernel and maximum likelihood for GP hyperparameter estimation.

• TLSM-BO: transfer learning sequential model-based optimization [5].

• Random: random selection.

SELECTED REFERENCES

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 1958.

[2] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[3] B. Kim, L. P. Kaelbling, and T. Lozano-Pérez. Learning to guide task and motion planning using score-space representation. In ICRA, 2017.

[4] R. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics 118. Springer, 1996.

[5] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In AISTATS, 2014.