A Tutorial on Bayesian Optimization for Machine Learning

Ryan P. Adams
School of Engineering and Applied Sciences, Harvard University
http://hips.seas.harvard.edu

(Transcript of slides; from rgrosse/courses/csc411_f18/tutorials/tut8.)
Machine Learning Meta-Challenges
‣ Increasing model complexity. More flexible models have more parameters.
‣ More sophisticated fitting procedures. Non-convex optimization has many knobs to turn.
‣ Less accessible to non-experts. Complicated techniques are harder to apply.
‣ Less reproducible results. Too many important implementation details are missing.
Example: Deep Neural Networks
‣ Resurgent interest in large neural networks.
‣ When well tuned, very successful for visual object identification, speech recognition, computational biology, ...
‣ Big investments by Google, Facebook, Microsoft, etc.
‣ Many choices: number of layers, weight regularization, layer size, which nonlinearity, batch size, learning rate schedule, stopping conditions.
Search for Good Hyperparameters?
‣ Define an objective function. Most often we care about generalization performance; use cross-validation to measure hyperparameter quality.
‣ How do people currently search? Black magic: grid search, random search, grad student descent.
‣ Painful! Requires many training cycles, and evaluations may be noisy.
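The baseline practice above can be sketched in a few lines. Everything here is hypothetical for illustration: `objective` stands in for an expensive cross-validated training run, and the hyperparameter ranges are made up.

```python
import random

def objective(params):
    """Hypothetical stand-in for the cross-validated error of a model
    trained with these hyperparameters (one expensive training cycle)."""
    lr, reg = params["lr"], params["reg"]
    return (lr - 0.01) ** 2 + (reg - 0.001) ** 2  # toy surrogate, not a real model

def random_search(n_trials, seed=0):
    """Baseline 'current practice': sample hyperparameters at random."""
    rng = random.Random(seed)
    best, best_val = None, float("inf")
    for _ in range(n_trials):
        params = {"lr": 10 ** rng.uniform(-4, 0),    # log-uniform learning rate
                  "reg": 10 ** rng.uniform(-6, -1)}  # log-uniform regularizer
        val = objective(params)  # each call would be a full training run
        if val < best_val:
            best, best_val = params, val
    return best, best_val
```

Each trial costs a full training cycle, which is exactly the pain Bayesian optimization tries to reduce.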
Can We Do Better? Bayesian Optimization
‣ Build a probabilistic model for the objective. Include hierarchical structure about units, etc.
‣ Compute the posterior predictive distribution. Integrate out all the possible true functions; we use Gaussian process regression.
‣ Optimize a cheap proxy function instead. The model is much cheaper than the true objective.
The main insight: make the proxy function exploit uncertainty to balance exploration against exploitation.
The Bayesian Optimization Idea
[Figure: GP posterior over the objective on [0, 1], with the current best observation marked.]
The Bayesian Optimization Idea
[Figure: GP posterior with the current best observation marked.]
Where should we evaluate next in order to improve the most?
The Bayesian Optimization Idea
[Figure: GP posterior with the current best observation marked.]
One idea: “expected improvement”
Today’s Topics
‣ Review of Gaussian process priors
‣ Bayesian optimization basics
‣ Managing covariances and kernel parameters
‣ Accounting for the cost of evaluation
‣ Parallelizing training
‣ Sharing information across related problems
‣ Better models for nonstationary functions
‣ Random projections for high-dimensional problems
‣ Accounting for constraints
‣ Leveraging partially-completed training runs
Gaussian Processes as Function Models
Nonparametric prior on functions, specified in terms of a positive definite kernel.
[Figure: sample functions drawn from GPs in one and two dimensions.]
Kernels can also be defined on structured objects: strings (Haussler 1999), e.g. “the quick brown fox jumps over the lazy dog”; permutations (Kondor 2008); proteins (Borgwardt 2007).
Gaussian Processes
‣ A Gaussian process (GP) is a distribution on functions.
‣ Allows tractable Bayesian modeling of functions without specifying a particular finite basis.
‣ Input space (where we’re optimizing): X
‣ Model scalar functions: f : X → R
‣ Positive definite covariance function: C : X × X → R
‣ Mean function: m : X → R
Gaussian Processes
Any finite set of N points {x_n}_{n=1}^N in X induces a homologous N-dimensional Gaussian distribution on R^N, taken to be the distribution on {y_n = f(x_n)}_{n=1}^N.
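This finite-dimensional view is easy to check numerically: pick a grid of inputs, build the covariance matrix, and draw from the induced multivariate Gaussian. A minimal NumPy sketch, assuming a squared-exponential kernel, zero mean, and made-up hyperparameters:

```python
import numpy as np

def se_kernel(xa, xb, amp=1.0, ls=0.3):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = xa[:, None] - xb[None, :]
    return amp * np.exp(-0.5 * (d / ls) ** 2)

# Any finite set of inputs induces a multivariate Gaussian on the outputs.
x = np.linspace(-3.0, 3.0, 100)
K = se_kernel(x, x) + 1e-10 * np.eye(len(x))  # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)  # three draws of f(x)
```

Each row of `samples` is one function from the prior, evaluated on the grid.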
Gaussian Processes
‣ Due to the Gaussian form, many useful questions about finite data have closed-form answers.
‣ Marginal likelihood:

  ln p(y | X, θ) = −(N/2) ln 2π − (1/2) ln |K_θ| − (1/2) yᵀ K_θ⁻¹ y

‣ Predictive distribution at test points {x_m}_{m=1}^M:

  y_test ∼ N(m, Σ),  where  m = k_θᵀ K_θ⁻¹ y  and  Σ = κ_θ − k_θᵀ K_θ⁻¹ k_θ

‣ We compute these matrices from the covariance:

  [K_θ]_{n,n′} = C(x_n, x_{n′}; θ)   [k_θ]_{n,m} = C(x_n, x_m; θ)   [κ_θ]_{m,m′} = C(x_m, x_{m′}; θ)
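The predictive equations translate directly into code. A minimal sketch, assuming a zero mean function and a squared-exponential covariance with fixed, made-up hyperparameters:

```python
import numpy as np

def se_kernel(xa, xb, amp=1.0, ls=0.5):
    """Squared-exponential covariance C(x, x'; theta) for 1-D inputs."""
    d = xa[:, None] - xb[None, :]
    return amp * np.exp(-0.5 * (d / ls) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-6):
    """Closed-form GP predictive mean and covariance at the test points."""
    K = se_kernel(x_train, x_train) + noise * np.eye(len(x_train))  # K_theta
    k = se_kernel(x_train, x_test)                                  # k_theta
    kappa = se_kernel(x_test, x_test)                               # kappa_theta
    mean = k.T @ np.linalg.solve(K, y_train)       # m = k^T K^{-1} y
    cov = kappa - k.T @ np.linalg.solve(K, k)      # Sigma = kappa - k^T K^{-1} k
    return mean, cov
```

With near-zero noise, predicting back at the training inputs reproduces the training targets, as the formulas require.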
Examples of GP Covariances

Squared-Exponential:
  C(x, x′) = σ² exp{ −(1/2) Σ_{d=1}^{D} ((x_d − x′_d) / ℓ_d)² }

Matérn:
  C(r) = (2^{1−ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)

“Neural Network”:
  C(x, x′) = (2/π) sin⁻¹{ 2 xᵀ Σ x′ / √((1 + 2 xᵀ Σ x)(1 + 2 x′ᵀ Σ x′)) }

Periodic:
  C(x, x′) = exp{ −2 sin²((x − x′)/2) / ℓ² }
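Two of these covariances are easy to write down directly; for ν = 5/2 the Matérn Bessel function collapses to a closed form, so no special functions are needed. A sketch (the amplitude and length-scale defaults are arbitrary):

```python
import numpy as np

def matern52(r, amp=1.0, ls=1.0):
    """Matérn covariance at distance r with nu = 5/2 (closed form)."""
    s = np.sqrt(5.0) * np.abs(r) / ls
    return amp * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def periodic(x, xp, ls=1.0):
    """Periodic covariance: exp(-2 sin^2((x - x')/2) / ls^2)."""
    return np.exp(-2.0 * np.sin(0.5 * (x - xp)) ** 2 / ls ** 2)
```

Both equal the amplitude at zero distance and decay with separation, as a valid covariance must.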
GPs Provide Closed-Form Predictions
[Figure: GP predictive mean and uncertainty bands conditioned on observed data.]
Using Uncertainty in Optimization
‣ Find the minimum: x* = argmin_{x∈X} f(x)
‣ We can evaluate the objective pointwise, but we have no easy functional form or gradients.
‣ After performing some evaluations, the GP gives us easy closed-form marginal means and variances.
‣ Exploration: seek places with high variance.
‣ Exploitation: seek places with low mean.
‣ The acquisition function balances these in our proxy optimization to determine the next evaluation.
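Putting the pieces together, the whole loop fits on a page: fit a GP to the evaluations so far, score a grid of candidates with expected improvement, and evaluate the objective where the proxy looks most promising. This is an illustrative toy (1-D on [0, 1], grid search over the proxy, fixed made-up kernel hyperparameters), not a production implementation:

```python
import numpy as np
from scipy.stats import norm

def se_kernel(xa, xb, ls=0.2):
    """Squared-exponential covariance with unit amplitude (a made-up default)."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / ls) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-5):
    """Marginal predictive mean and standard deviation on a grid."""
    K = se_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k = se_kernel(x_obs, x_grid)
    mu = k.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(k * np.linalg.solve(K, k), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def bayes_opt(f, n_iters=10, seed=0):
    """Minimize f on [0, 1] by repeatedly maximizing expected improvement."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 1.0, 200)
    x_obs = rng.uniform(0.0, 1.0, size=2)            # two random initial evaluations
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(n_iters):
        mu, sigma = gp_posterior(x_obs, y_obs, x_grid)
        g = (y_obs.min() - mu) / sigma
        ei = sigma * (g * norm.cdf(g) + norm.pdf(g))  # expected improvement
        x_next = x_grid[np.argmax(ei)]                # optimize the cheap proxy
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, f(x_next))           # one expensive evaluation
    return x_obs[np.argmin(y_obs)], y_obs.min()
```

Note that all the expense is in the calls to `f`; everything else is cheap linear algebra on the model.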
Closed-Form Acquisition Functions
‣ The GP posterior gives a predictive mean function μ(x) and a predictive marginal variance function σ²(x). Define

  γ(x) = (f(x_best) − μ(x)) / σ(x)

‣ Probability of Improvement (Kushner 1964):

  a_PI(x) = Φ(γ(x))

‣ Expected Improvement (Mockus 1978):

  a_EI(x) = σ(x) (γ(x) Φ(γ(x)) + N(γ(x); 0, 1))

‣ GP Upper Confidence Bound (Srinivas et al. 2010):

  a_LCB(x) = μ(x) − κ σ(x)
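Given the predictive mean and standard deviation, each of these acquisition functions is a one-liner. A sketch using SciPy for the standard normal CDF and density (Φ and N(·; 0, 1)):

```python
import numpy as np
from scipy.stats import norm

def gamma(mu, sigma, f_best):
    """gamma(x) = (f(x_best) - mu(x)) / sigma(x), for minimization."""
    return (f_best - mu) / sigma

def a_pi(mu, sigma, f_best):
    """Probability of improvement (Kushner 1964)."""
    return norm.cdf(gamma(mu, sigma, f_best))

def a_ei(mu, sigma, f_best):
    """Expected improvement (Mockus 1978)."""
    g = gamma(mu, sigma, f_best)
    return sigma * (g * norm.cdf(g) + norm.pdf(g))

def a_lcb(mu, sigma, kappa=2.0):
    """GP lower confidence bound; kappa trades exploration vs. exploitation."""
    return mu - kappa * sigma
```

All three accept scalars or NumPy arrays, so they can score a whole grid of candidates at once.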
Probability of Improvement
[Figure: probability-of-improvement acquisition plotted beneath the GP posterior.]
Expected Improvement
[Figure: expected-improvement acquisition plotted beneath the GP posterior.]
GP Upper (Lower) Confidence Bound
[Figure: confidence-bound acquisition plotted beneath the GP posterior.]
Distribution Over Minimum (Entropy Search)
[Figure: induced distribution over the location of the minimum.]
Illustrating Bayesian Optimization
[Figure sequence: successive iterations of the procedure.]
Why Doesn’t Everyone Use This?
These ideas have been around for decades. Why isn’t Bayesian optimization in broader use?
‣ Fragility and poor default choices. Getting the function model wrong can be catastrophic.
‣ There hasn’t been standard software available. It’s a bit tricky to build such a system from scratch.
‣ Experiments are run sequentially. We want to take advantage of cluster computing.
‣ Limited scalability in dimensions and evaluations. We want to solve big problems.
Fragility and Poor Default Choices
‣ Covariance function selection. This turns out to be crucial to good performance. The default choice for regression is far too smooth; instead, use an adaptive Matérn 5/2 kernel.
‣ Gaussian process hyperparameters. The typical empirical Bayes approach can fail horribly; instead, use Markov chain Monte Carlo integration. Slice sampling means no additional parameters!
Ironic problem: Bayesian optimization has its own hyperparameters!
Covariance Function Choice

  C(r) = (2^{1−ν} / Γ(ν)) (√(2ν) r / ℓ)^ν K_ν(√(2ν) r / ℓ)

[Figure: four panels of GP samples drawn with Matérn covariances of increasing smoothness: ν = 1/2, ν = 3/2, ν = 5/2, ν = ∞.]
Choosing Covariance Functions
Structured SVM for protein motif finding, Miller et al. (2012).
[Figure: classification error vs. number of function evaluations for Matérn 5/2 ARD, ExpQuad ISO, ExpQuad ARD, and Matérn 3/2 ARD covariances.]
Snoek, Larochelle & RPA, NIPS 2012
MCMC for GP Hyperparameters
‣ Covariance hyperparameters are often optimized rather than marginalized, typically in the name of convenience and efficiency.
‣ Slice sampling of hyperparameters (e.g., Murray and Adams 2010) is comparably fast and easy, and it accounts for uncertainty in length scale, mean, and amplitude.
‣ Integrated acquisition function:

  a(x) = ∫ a(x; θ) p(θ | {x_n, y_n}_{n=1}^N) dθ
       ≈ (1/K) Σ_{k=1}^{K} a(x; θ^(k)),   θ^(k) ∼ p(θ | {x_n, y_n}_{n=1}^N)

‣ For a theoretical discussion of the implications of inferring hyperparameters with BayesOpt, see recent work by Wang and de Freitas (http://arxiv.org/abs/1406.7758).
Snoek, Larochelle & RPA, NIPS 2012
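Given posterior samples of the hyperparameters (however they were drawn, e.g. by slice sampling), the integrated acquisition is just a Monte Carlo average. A sketch in which `posterior_fn` is a hypothetical callable returning the GP predictive mean and standard deviation under one hyperparameter setting:

```python
import numpy as np
from scipy.stats import norm

def a_ei(mu, sigma, f_best):
    """Expected improvement for a single hyperparameter setting."""
    g = (f_best - mu) / sigma
    return sigma * (g * norm.cdf(g) + norm.pdf(g))

def integrated_acquisition(x_grid, posterior_fn, theta_samples, f_best):
    """Average the acquisition over samples theta^(k) ~ p(theta | data)."""
    total = np.zeros_like(x_grid)
    for theta in theta_samples:
        mu, sigma = posterior_fn(x_grid, theta)  # GP predictive under theta
        total += a_ei(mu, sigma, f_best)
    return total / len(theta_samples)
```

The next point is then chosen by maximizing this averaged surface rather than the acquisition under any single hyperparameter setting.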
Integrating Out GP Hyperparameters
[Figure: posterior samples with three different length scales; the length-scale-specific expected improvement for each; and the integrated expected improvement.]
MCMC for Hyperparameters
Logistic regression for handwritten digit recognition.
[Figure: classification error vs. function evaluations for GP EI MCMC, GP EI Conventional, and the Tree Parzen Algorithm.]
Snoek, Larochelle & RPA, NIPS 2012
![Page 45: A Tutorial on Bayesian Optimization for Machine Learning ...rgrosse/courses/csc411_f18/tutorials/tut8... · School of Engineering and Applied Sciences! Harvard University A Tutorial](https://reader030.vdocument.in/reader030/viewer/2022040611/5ed4130d8d46b66d2263690e/html5/thumbnails/45.jpg)