
  • Course in Bayesian Optimization

    Javier González (most of the slides today: Neil Lawrence)

    University of Sheffield, Sheffield, UK

    27th October 2015

  • Many thanks to

    - Mauricio Álvarez.
    - Cristian Guarnizo.
    - Neil Lawrence, University of Sheffield.
    - Zhenwen Dai, University of Sheffield.
    - Machine Learning group, University of Sheffield.
    - Philipp Hennig, Max Planck Institute.
    - Michael Osborne, University of Oxford.

  • Outline of the Course

    - Lecture 1: Uncertainty and Gaussian Processes.

    - Lecture 2: Introduction to Bayesian (probabilistic) optimization.

    - Lecture 3: Advanced topics in Bayesian Optimization.

  • Outline of the Course

    - Lab 1: Introduction to GPy.

    - Lab 2: Introduction to GPyOpt.

    - Lab 3: Advanced GPyOpt.

    - Day 4: Projects + presentations.

  • Points of the day

    What is machine learning?

    What is uncertainty? What types are there?

    How does uncertainty play a role in the learning process?

    Gaussian processes as models to handle uncertainty.


  • What is to learn?

  • The human learning process?

    Cancerous tumor?

  • What is Machine Learning?

    data + model = prediction

    - data: observations, which could be actively or passively acquired (meta-data).

    - model: assumptions, based on previous experience (other data! transfer learning etc.), or beliefs about the regularities of the universe. Inductive bias.

    - prediction: an action to be taken, a categorization, or a quality score.


  • Historical Perspective

    - A data-driven approach to Artificial Intelligence.
    - Inspired by attempts to model the brain (the connectionists).
    - A community that transcended traditional boundaries (psychology, statistical physics, signal processing).
    - Led to an approach that dominates in the modern data-rich world.

  • Two Dominant Approaches

    - Machine Learning as Optimization:
      - Formulate your learning problem as an optimization problem.
      - Typically intractable, so minimize a relaxed version of the cost function.
      - Prove characteristics of the resulting solution.

    - Machine Learning as Probabilistic Modelling:
      - Formulate your learning problem as a probabilistic model.
      - Relate variables through probability distributions.
      - If Bayesian, treat parameters with probability distributions.
      - Required integrals are often intractable: use approximations (MCMC, variational etc.).


  • Modelling Assumptions

    - Modelling assumptions are included either as:
      - a regularizer (optimization), or
      - in the probability distribution (probabilistic approach).

    - Typical assumptions: sparsity, smoothness.


  • Applications of Machine Learning

    Handwriting Recognition: recognising handwritten characters. For example LeNet, http://bit.ly/d26fwK.

    Friend Identification: suggesting friends on social networks, https://www.facebook.com/help/501283333222485.

    Ranking: learning the relative skills of online game players, e.g. the TrueSkill system, http://research.microsoft.com/en-us/projects/trueskill/, and http://www.netflixprize.com/.

    Internet Search: for example Ad Click-Through Rate prediction, http://bit.ly/a7XLH4.

    News Personalisation: for example Zite, http://www.zite.com/.

    Game Play Learning: for example, learning to play Go, http://bit.ly/cV77zM.

  • Learning is Optimization

  • y = mx + c

  • [Figure: data points plotted as y against x (both axes 0-5) with candidate lines y = mx + c; the intercept c and slope m are annotated on some slides. Successive slides show different candidate fits to the data.]

  • y = mx + c

    point 1: x = 1, y = 3

    3 = m + c

    point 2: x = 3, y = 1

    1 = 3m + c

    point 3: x = 2, y = 2.5

    2.5 = 2m + c

  • y = mx + c + ε

    point 1: x = 1, y = 3

    3 = m + c + ε_1

    point 2: x = 3, y = 1

    1 = 3m + c + ε_2

    point 3: x = 2, y = 2.5

    2.5 = 2m + c + ε_3

  • Regression Revisited

    - We introduce an error function of the form

      E(m, c) = ∑_{i=1}^{n} (y_i − m x_i − c)²

    - Minimize the error function with respect to m and c.

  • Mathematical Interpretation

    - What is the mathematical interpretation?
      - There is a cost function.
      - It expresses the mismatch between your prediction and reality.

      E(m, c) = ∑_{i=1}^{n} (y_i − m x_i − c)²

    - This is known as the sum-of-squares error.

  • Learning is Optimization

    - Learning is minimization of the cost function.
    - At the minima the gradient is zero.
    - Coordinate descent: find the gradient in each coordinate and set it to zero.

    For the slope m:

    dE(m)/dm = −2 ∑_{i=1}^{n} x_i (y_i − m x_i − c)

    0 = −2 ∑_{i=1}^{n} x_i y_i + 2 ∑_{i=1}^{n} m x_i² + 2 ∑_{i=1}^{n} c x_i

    m = (∑_{i=1}^{n} (y_i − c) x_i) / (∑_{i=1}^{n} x_i²)

    For the intercept c:

    dE(c)/dc = −2 ∑_{i=1}^{n} (y_i − m x_i − c)

    0 = −2 ∑_{i=1}^{n} y_i + 2 ∑_{i=1}^{n} m x_i + 2nc

    c = (∑_{i=1}^{n} (y_i − m x_i)) / n

  • Fixed Point Updates

    Worked example.

    c* = (∑_{i=1}^{n} (y_i − m* x_i)) / n,

    m* = (∑_{i=1}^{n} x_i (y_i − c*)) / (∑_{i=1}^{n} x_i²),

    (σ²)* = (∑_{i=1}^{n} (y_i − m* x_i − c*)²) / n
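    These fixed-point updates are easy to check numerically. Below is a minimal sketch (not from the slides; the toy data and variable names are assumptions) that alternates the updates for m and c on the three points used earlier and then computes the noise variance.

```python
import numpy as np

# Toy data from the earlier slides: (1, 3), (3, 1), (2, 2.5).
x = np.array([1.0, 3.0, 2.0])
y = np.array([3.0, 1.0, 2.5])

m, c = 0.0, 0.0  # arbitrary starting point
for _ in range(100):
    # Fixed-point update for the slope, holding c fixed.
    m = np.sum((y - c) * x) / np.sum(x ** 2)
    # Fixed-point update for the intercept, holding m fixed.
    c = np.sum(y - m * x) / len(x)

# Noise variance at the converged solution.
sigma2 = np.sum((y - m * x - c) ** 2) / len(x)
print(m, c, sigma2)  # matches the direct least-squares solution
```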

  • Coordinate Descent

    [Figure: contour plot of the error surface E(m, c), with axes c and m. Successive slides trace the coordinate-descent path over iterations 1 to 30, zooming into a small region around the minimum as the iterates converge.]

  • Important Concepts Not Covered

    - Optimization methods:
      - Second-order methods, conjugate gradient, quasi-Newton and Newton.
      - Effective heuristics such as momentum, CMA, etc.

    - Local vs global solutions (Bayesian optimization!).

  • Learning is probabilistic modelling

  • Machine Learning and Probability

    - The world is an uncertain place.

      Epistemic uncertainty: uncertainty arising through lack of knowledge. (What colour socks is that person wearing?)

      Aleatoric uncertainty: uncertainty arising through an underlying stochastic system. (Where will a sheet of paper fall if I drop it?)


  • Probability: A Framework to Characterise Uncertainty

    - We need a framework to characterise the uncertainty.
    - In this course we make use of probability theory to characterise uncertainty.


  • Richard Price

    - Welsh philosopher and essay writer.
    - Edited Thomas Bayes's essay, which contained the foundations of Bayesian philosophy.

    Figure: Richard Price, 1723–1791. (source Wikipedia)

  • Laplace

    - French mathematician and astronomer.

    Figure: Pierre-Simon Laplace, 1749–1827. (source Wikipedia)

  • Probabilistic Interpretation

    - Quadratic error functions can be seen as Gaussian noise models [1, 2].

    - Imagine we are seeing data given by

      y(x_i) = m x_i + c + ε,

      where ε is Gaussian noise with standard deviation σ,

      ε ∼ N(0, σ²).

  • Noise Corrupted Mapping

    - This implies that

      y_i ∼ N(m x_i + c, σ²),

    - which we also write as

      p(y_i | w, σ) = N(y_i | m x_i + c, σ²).

  • Gaussian Likelihood

    - If the noise is sampled independently for each data point from the same density, we have

      p(y | m, c, σ²) = ∏_{i=1}^{n} N(y_i | m x_i + c, σ²)

    - This is an i.i.d. assumption about the noise.

    - Writing out the functional form we have

      p(y | m, c, σ²) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(y_i − m x_i − c)² / (2σ²))

      ∝ ∏_{i=1}^{n} exp(−(y_i − m x_i − c)² / (2σ²))

      ∝ exp(−∑_{i=1}^{n} (y_i − m x_i − c)² / (2σ²))

  • Gaussian Log Likelihood

    - Taking the logarithm of the likelihood above gives

      log p(y | m, c, σ²) = −(1/(2σ²)) ∑_{i=1}^{n} (y_i − m x_i − c)² + const

    - or, as a negative log likelihood,

      −log p(y | m, c, σ²) = (1/(2σ²)) ∑_{i=1}^{n} (y_i − m x_i − c)² + const

      = (1/(2σ²)) E(m, c) + const

  • Probabilistic Interpretation of the Error Function

    - The probabilistic interpretation of the error function is the negative log likelihood.

    - Minimizing the error function is equivalent to maximizing the log likelihood.

    - Maximizing the log likelihood is equivalent to maximizing the likelihood, because log is monotonic.

    - Probabilistic interpretation: minimizing the error function is equivalent to maximum likelihood with respect to the parameters.

  • Sample Based Approximation implies i.i.d

    - The log likelihood is

      L(θ) = log P(y | θ)

    - If the likelihood is independent over the individual data points,

      P(y | θ) = ∏_{i=1}^{n} P(y_i | θ)

    - This is equivalent to the assumption that the data is independent and identically distributed (i.i.d.).

    - Now the log likelihood is

      L(θ) = ∑_{i=1}^{n} log P(y_i | θ)

    - We take the negative log likelihood to recover the sum-of-squares error.

  • Bayesian perspective

  • Underdetermined System

    What about two unknowns and one observation?

    y_1 = m x_1 + c

    [Figure: a single data point, y against x, with candidate lines through it.]

    We can compute m given c:

    m = (y_1 − c) / x_1

    Sampling values of c gives different solutions, for example:

    c = 1.75 ⟹ m = 1.25
    c = −0.777 ⟹ m = 3.78
    c = −4.01 ⟹ m = 7.01
    c = −0.718 ⟹ m = 3.72
    c = 2.45 ⟹ m = 0.545
    c = −0.657 ⟹ m = 3.66
    c = −3.13 ⟹ m = 6.13
    c = −1.47 ⟹ m = 4.47

    Assume c ∼ N(0, 4); we find a distribution of solutions.
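    A minimal sketch (assumed, not from the slides) of this construction: sample the intercept from its prior and solve for the slope, giving a distribution over lines that all pass through the single observation.

```python
import numpy as np

x1, y1 = 1.0, 3.0                     # a single observation
rng = np.random.default_rng(0)

c = rng.normal(0.0, 2.0, size=1000)   # c ~ N(0, 4): standard deviation sqrt(4) = 2
m = (y1 - c) / x1                     # each sampled c determines the slope exactly

# Every (m, c) pair fits the data perfectly; together they form a
# distribution of solutions to the underdetermined system.
print(m.mean(), m.std())
```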

  • Bayesian Approach

    - The likelihood for the regression example has the form

      p(y | w, σ²) = ∏_{i=1}^{n} N(y_i | w^⊤ φ_i, σ²).

    - The suggestion was to maximize this likelihood with respect to w.

    - This can be done with gradient-based optimization of the log likelihood.

    - Alternative approach: integration across w.

    - Consider the expected value of the likelihood under a range of potential w's.

    - This is known as the Bayesian approach.

  • Note on the Term Bayesian

    - We will use Bayes' rule to invert probabilities in the Bayesian approach.

    - Bayesian is not named after Bayes' rule (a very common confusion).

    - The term Bayesian refers to the treatment of the parameters as stochastic variables.

    - This approach was proposed by Laplace and Bayes independently.

    - For early statisticians this was very controversial (Fisher et al.).

  • Bayesian Controversy

    - The Bayesian controversy relates to treating epistemic uncertainty as aleatoric uncertainty.

    - Another analogy:
      - Before a football match, the uncertainty about the result is aleatoric.
      - If I watch a recorded match without knowing the result, the uncertainty is epistemic.

  • Simple Bayesian Inference

    posterior = (likelihood × prior) / marginal likelihood

    - Four components:
      1. Prior distribution: represents belief about parameter values before seeing data.
      2. Likelihood: gives the relation between parameters and data.
      3. Posterior distribution: represents updated belief about parameters after data is observed.
      4. Marginal likelihood: represents an assessment of the quality of the model. Can be compared with other models (likelihood/prior combinations). Ratios of marginal likelihoods are known as Bayes factors.

  • Recap

    - We can see the learning process as an optimization problem.

    - We can also see the learning process as probabilistic modelling (which also depends on parameters that need to be optimised).

    - The Bayesian framework allows us to handle 'epistemic' uncertainty in systems.

    - Examples only for linear functions, so far.

  • Generalizations: What if we want to use a more expressive model (a non-linear function)?

    - ML as optimization: regularization in RKHSs (kernel methods).

    - Bayesian perspective: Gaussian processes.

    Both approaches are related, in the same way that standard regression and Bayesian regression are.

    More important for us: Gaussian processes.

  • Basis Functions: Nonlinear Regression

    - Problem with linear regression: x may not be linearly related to y.

    - Potential solution: create a feature space. Define φ(x), where φ(·) is a nonlinear function of x.

    - The model for the target is a linear combination of these nonlinear functions:

      f(x) = ∑_{j=1}^{K} w_j φ_j(x)    (1)

  • Quadratic Basis

    - Basis functions can be global. E.g. the quadratic basis:

      [1, x, x²]

    [Figure: the three quadratic basis functions φ(x) = 1, φ(x) = x and φ(x) = x², plotted over x ∈ [−1, 1].]

  • Functions Derived from Quadratic Basis

    f(x) = w_1 + w_2 x + w_3 x²

    [Figure: three random functions from the quadratic basis, with weights (w_1, w_2, w_3) = (0.87466, −0.38835, −2.0058), (−0.35908, 1.2274, −0.32825) and (−1.5638, −0.73577, 1.6861).]

  • Radial Basis Functions

    - Or they can be local. E.g. the radial (or Gaussian) basis

      φ_j(x) = exp(−(x − µ_j)² / ℓ²)

    [Figure: three radial basis functions φ_1(x) = e^{−2(x+1)²}, φ_2(x) = e^{−2x²} and φ_3(x) = e^{−2(x−1)²}.]

  • Functions Derived from Radial Basis

    f(x) = w_1 e^{−2(x+1)²} + w_2 e^{−2x²} + w_3 e^{−2(x−1)²}

    [Figure: three random functions from the radial basis, with weights (w_1, w_2, w_3) = (−0.47518, −0.18924, −1.8183), (0.50596, −0.046315, 0.26813) and (0.07179, 1.3591, 0.50604).]
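    A short sketch (assumed; the centres and lengthscale match the figures, everything else is illustrative) that evaluates the three radial basis functions and draws a random function from the basis.

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.0])    # basis centres from the figures
ell2 = 0.5                         # (x - mu)^2 / ell^2 with ell^2 = 0.5 gives e^{-2 (x - mu)^2}

def phi(x):
    """Radial basis: one column per centre mu_j."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / ell2)

x = np.linspace(-3.0, 3.0, 200)
Phi = phi(x)                       # design matrix, shape (200, 3)

rng = np.random.default_rng(1)
w = rng.normal(size=3)             # random weights
f = Phi @ w                        # one random function f(x) = sum_j w_j phi_j(x)
```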

  • Probabilistic Model with Basis Functions

    - Define a general function:

      f(x_i) = w^⊤ φ(x_i)

    - Corrupt it with independent noise:

      y(x_i) = f(x_i) + ε_i,  with  ε_i ∼ N(0, σ²)

    - This implies the following likelihood:

      p(y | w, σ²) = ∏_{i=1}^{n} N(y_i | w^⊤ φ(x_i), σ²)

  • Multivariate Regression Likelihood

    - Noise-corrupted data point:

      y_i = w^⊤ x_{i,:} + ε_i

    - Multivariate regression likelihood:

      p(y | X, w) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) ∑_{i=1}^{n} (y_i − w^⊤ x_{i,:})²)

    - Now use a multivariate Gaussian prior:

      p(w) = (1/(2πα)^{p/2}) exp(−(1/(2α)) w^⊤ w)

  • Posterior Density

    - Once again we want to know the posterior:

      p(w | y, X) ∝ p(y | X, w) p(w)

    - We can compute it by completing the square:

      log p(w | y, X) = −(1/(2σ²)) ∑_{i=1}^{n} y_i² + (1/σ²) ∑_{i=1}^{n} y_i x_{i,:}^⊤ w
                        −(1/(2σ²)) ∑_{i=1}^{n} w^⊤ x_{i,:} x_{i,:}^⊤ w − (1/(2α)) w^⊤ w + const.

  • Computing the Posterior

    - By inspection we extract the inverse covariance. In matrix form:

      log p(w | y, X) = −(1/(2σ²)) y^⊤ y + (1/σ²) y^⊤ X w − (1/2) w^⊤ [σ^{−2} X^⊤ X + α^{−1} I] w + const.

    - Completing the square allows us to compute the mean.

  • Making Predictions

    - Giving a Gaussian density

      p(w | y, X) = N(w | µ_w, C_w)

      C_w = [σ^{−2} X^⊤ X + α^{−1} I]^{−1}

      µ_w = C_w σ^{−2} X^⊤ y

    - The posterior is combined with the 'test data' likelihood to make future predictions:

      p(y_* | x_*, X, y) = ∫ p(y_* | x_*, w) p(w | X, y) dw
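    A minimal numpy sketch (assumed, following the formulas above) of the posterior over weights and the resulting predictive mean and variance; the toy data and test point are illustrative.

```python
import numpy as np

def posterior(X, y, sigma2, alpha):
    """Gaussian posterior over the weights: p(w | y, X) = N(w | mu_w, C_w)."""
    p = X.shape[1]
    Cw = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
    mu_w = Cw @ X.T @ y / sigma2
    return mu_w, Cw

def predict(x_star, mu_w, Cw, sigma2):
    """Predictive mean and variance at a test input x_star."""
    mean = x_star @ mu_w
    var = x_star @ Cw @ x_star + sigma2   # weight uncertainty plus noise
    return mean, var

# Example: fit y = mx + c using the design matrix with columns [x, 1].
x = np.array([1.0, 3.0, 2.0]); y = np.array([3.0, 1.0, 2.5])
X = np.stack([x, np.ones_like(x)], axis=1)
mu_w, Cw = posterior(X, y, sigma2=0.01, alpha=1.0)
print(predict(np.array([2.5, 1.0]), mu_w, Cw, sigma2=0.01))
```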

  • Bayesian vs Maximum Likelihood

    - Note the similarity between the posterior mean

      µ_w = (σ^{−2} X^⊤ X + α^{−1} I)^{−1} σ^{−2} X^⊤ y

    - and the maximum likelihood solution

      ŵ = (X^⊤ X)^{−1} X^⊤ y

  • Marginal Likelihood

    - In some sense the real model is now the marginal likelihood.

    - Marginalization of w follows the sum rule:

      p(y | X) = ∫ p(y | X, w) p(w) dw

      giving

      p(y | X) = N(y | 0, α X X^⊤ + σ² I)

    - Often the integral is intractable.

    - This leads to variational approximations, MCMC (Michael Betancourt, Mark Girolami) and the Laplace approximation (Håvard Rue).

    - For the case of Gaussians it's trivial!

  • Marginal Likelihood

    - We can compute the marginal likelihood as:

      p(y | X, α, σ) = N(y | 0, α X X^⊤ + σ² I)

    - Or, if we use a basis set, we have

      p(y | X, α, σ) = N(y | 0, α Φ Φ^⊤ + σ² I)

    - This Gaussian is no longer i.i.d. across the data, and this is where things get interesting.

  • Marginal Likelihood

    - The marginal likelihood can also be computed explicitly; it has the form:

      p(y | X, σ², α) = (1/((2π)^{n/2} |K|^{1/2})) exp(−(1/2) y^⊤ K^{−1} y),  where K = α Φ Φ^⊤ + σ² I.

    - So it is a zero-mean n-dimensional Gaussian with covariance matrix K.

  • Sampling a Function

    Multivariate Gaussians

    - We will consider a Gaussian with a particular structure of covariance matrix.

    - Generate a single sample from this 25-dimensional Gaussian distribution, f = [f_1, f_2, ..., f_25].

    - We will plot these points against their index.

  • Gaussian Distribution Sample

    [Figure: (a) a sample from a 25-dimensional correlated Gaussian, with the values f_i plotted against their index i; (b) a colormap showing the correlations between dimensions, with values from 0 to 1. Repeated over several slides for different samples.]
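    A sketch of the experiment (assumed; the slides do not give the exact covariance, so an exponentiated-quadratic structure over the indices is used for illustration):

```python
import numpy as np

n = 25
t = np.linspace(-1.0, 1.0, n)                  # one input location per index
ell = 0.3                                      # assumed lengthscale

# Structured covariance: nearby indices are strongly correlated.
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * ell ** 2))

rng = np.random.default_rng(2)
L = np.linalg.cholesky(K + 1e-10 * np.eye(n))  # small jitter for stability
f = L @ rng.normal(size=n)                     # one sample f = [f_1, ..., f_25]
```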

  • Computing the Expected Output

    - Given the posterior for the parameters, how can we compute the expected output at a given location?

    - The output of the model at location x_i is given by

      f(x_i; w) = φ_i^⊤ w

    - We want the expected output under the posterior density p(w | y, X, σ², α).

    - The mean of the mapping function is given by

      ⟨f(x_i; w)⟩_{p(w | y, X, σ², α)} = φ_i^⊤ ⟨w⟩ = φ_i^⊤ µ_w

  • Variance of Expected Output

    - The variance of the model at location x_i is given by

      var(f(x_i; w)) = ⟨(f(x_i; w))²⟩ − ⟨f(x_i; w)⟩²
                     = φ_i^⊤ ⟨w w^⊤⟩ φ_i − φ_i^⊤ ⟨w⟩ ⟨w⟩^⊤ φ_i
                     = φ_i^⊤ C_w φ_i,

      where all these expectations are taken under the posterior density p(w | y, X, σ², α).

  • Book

  • Olympic Marathon Data

    - Gold medal times for the Olympic Marathon since 1896.

    - Marathons before 1924 didn't have a standardised distance.

    - Present results using pace per km.

    - In 1904 the Marathon was badly organised, leading to very slow times.

    Image from Wikimedia Commons, http://bit.ly/16kMKHQ

  • Olympic Marathon Data

    [Figure: Olympic marathon data; pace y in min/km (roughly 3 to 5) against year x (1900 to 2020).]

  • Olympics Data analysis

    - Use the Bayesian approach on the Olympics data with polynomials.

    - Choose a prior w ∼ N(0, αI) with α = 1.

    - Choose noise variance σ² = 0.01.

  • Sampling the Prior

    - It is always useful to perform a 'sanity check' and sample from the prior before observing the data.

    - Since y = Φw + ε, we just need to sample

      w ∼ N(0, αI)

      ε ∼ N(0, σ²)

      with α = 1 and σ² = 0.01.
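    A sketch of this sanity check (assumed; the polynomial order and input rescaling are illustrative): draw w from the prior, form Φw and add noise.

```python
import numpy as np

def poly_design(x, order):
    """Polynomial basis: columns 1, x, x^2, ..., x^order."""
    return np.stack([x ** k for k in range(order + 1)], axis=1)

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 50)     # rescaled inputs (e.g. years mapped to [-1, 1])
Phi = poly_design(x, order=3)

alpha, sigma2 = 1.0, 0.01
w = rng.normal(0.0, np.sqrt(alpha), size=Phi.shape[1])        # w ~ N(0, alpha I)
y = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), size=len(x))   # y = Phi w + eps
```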

  • Recall: Validation Set for Maximum Likelihood

    [Figure pair per polynomial order: left, the maximum likelihood fit to the marathon data; right, model error against polynomial order 0-7.]

    Order   Training error   Validation error   σ²       σ
    0       −1.8774          −0.13132           0.302    0.549
    1       −15.325          2.5863             0.0733   0.271
    2       −17.579          −8.4831            0.0578   0.240
    3       −18.064          11.27              0.0549   0.234
    4       −18.245          232.92             0.0539   0.232
    5       −20.471          9898.1             0.0426   0.207
    6       −22.881          67775              0.0331   0.182

  • Validation Set

    [Figure pair per polynomial order: left, the Bayesian fit (posterior mean) to the marathon data; right, model error against polynomial order 0-7.]

    Order   Training error   Validation error   σ²       σ
    0       29.757           −0.29243           0.302    0.550
    1       14.942           4.4027             0.0762   0.276
    2       9.7206           −8.6623            0.0580   0.241
    3       10.416           −6.4726            0.0555   0.236
    4       11.34            −8.431             0.0555   0.236
    5       11.986           −10.483            0.0551   0.235
    6       12.369           −3.3823            0.0537   0.232

  • Regularized Mean

    - The validation fit here is based on the mean solution for w only.

    - For the Bayesian solution

      µ_w = [σ^{−2} Φ^⊤ Φ + α^{−1} I]^{−1} σ^{−2} Φ^⊤ y

      instead of

      w* = [Φ^⊤ Φ]^{−1} Φ^⊤ y

    - The two are equivalent when α → ∞.

    - That is equivalent to a prior for w with infinite variance.

    - In other cases αI regularizes the system (keeps the parameters smaller).

  • Sampling the Posterior

    - Now check samples by extracting w from the posterior.

    - Now for y = Φw + ε we need

      w ∼ N(µ_w, C_w)

      with C_w = [σ^{−2} Φ^⊤ Φ + α^{−1} I]^{−1} and µ_w = C_w σ^{−2} Φ^⊤ y,

      ε ∼ N(0, σ²)

      with α = 1 and σ² = 0.01.
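    Continuing the prior-sampling sketch (assumed), posterior samples use the same generative equation with w drawn from N(µ_w, C_w) instead of the prior:

```python
import numpy as np

def posterior_samples(Phi, y, alpha, sigma2, n_samples, rng):
    """Draw w from N(mu_w, C_w) and return the sampled functions Phi @ w."""
    p = Phi.shape[1]
    Cw = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(p) / alpha)
    mu_w = Cw @ Phi.T @ y / sigma2
    W = rng.multivariate_normal(mu_w, Cw, size=n_samples)  # each row is one w
    return W @ Phi.T                                       # functions evaluated at the inputs
```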

  • Direct Construction of Covariance Matrix

    Use matrix notation to write the function,

    f(x_i; w) = ∑_{k=1}^{m} w_k φ_k(x_i)

    which, computed at the training data, gives a vector

    f = Φw.

    w ∼ N(0, αI)

    w and f are only related by an inner product.

    Φ ∈ ℜ^{n×m} is the design matrix (row i contains the m basis functions evaluated at x_i).

  • Expectations

    - We have

      ⟨f⟩ = Φ ⟨w⟩.

    - The prior mean of w was zero, giving

      ⟨f⟩ = 0.

    - The prior covariance of f is

      K = ⟨f f^⊤⟩ − ⟨f⟩ ⟨f⟩^⊤

      with

      ⟨f f^⊤⟩ = Φ ⟨w w^⊤⟩ Φ^⊤,

      giving

      K = α Φ Φ^⊤.

    We use ⟨·⟩ to denote expectations under prior distributions.

  • Covariance between Two Points

    - The prior covariance between two points x_i and x_j is

      k(x_i, x_j) = α φ(x_i)^⊤ φ(x_j),

      or in sum notation

      k(x_i, x_j) = α ∑_{k=1}^{m} φ_k(x_i) φ_k(x_j)

    - For the radial basis used above this gives

      k(x_i, x_j) = α ∑_{k=1}^{m} exp(−(|x_i − µ_k|² + |x_j − µ_k|²) / (2ℓ²)).
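    A sketch (assumed; the centres and lengthscale reuse the earlier radial basis) that builds this finite-basis covariance and confirms it equals α Φ Φ^⊤, a valid (positive semi-definite) covariance matrix:

```python
import numpy as np

mu = np.array([-1.0, 0.0, 1.0])    # basis centres
ell2 = 0.5

def phi(x):
    """Radial basis functions evaluated at the inputs, one column per centre."""
    return np.exp(-((x[:, None] - mu[None, :]) ** 2) / ell2)

alpha = 1.0
x = np.linspace(-2.0, 2.0, 10)
Phi = phi(x)

# Entry (i, j) is k(x_i, x_j) = alpha * phi(x_i)^T phi(x_j).
K = alpha * Phi @ Phi.T
assert np.allclose(K, K.T)                       # symmetric
assert np.all(np.linalg.eigvalsh(K) > -1e-10)    # positive semi-definite
```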

  • Covariance Functions and Mercer Kernels

    - Mercer kernels and covariance functions are similar.

    - The kernel perspective does not make a probabilistic interpretation of the covariance function.

    - Algorithms can be simpler, but the probabilistic interpretation is crucial for kernel parameter optimization.

  • More on Mercer Kernels

    Let X be a metric space and K : X × X → ℜ a continuous and symmetric function. If we assume that K is positive definite, that is, for any set x = {x_1, ..., x_n} ⊂ X the n × n matrix K with components

    K_{ij} = K(x_i, x_j)

    is positive semi-definite, then K is a Mercer kernel.

  • More on Mercer Kernels

    Mercer's Theorem (1909): Let K : X × X → ℝ be a Mercer kernel, let λ_j be the j-th eigenvalue of the associated integral operator L_K, and let {φ_j}_{j≥1} be the corresponding eigenfunctions. Then, for all x, x′ ∈ X,

    K(x, x′) = ∑_{j=1}^{∞} λ_j φ_j(x) φ_j(x′),

    where the convergence is absolute (for each (x, x′) ∈ X × X) and uniform (on X × X).

    By using a kernel directly we are implicitly using a basis of functions (possibly with infinitely many elements: Bayesian non-parametrics).
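    Mercer's expansion can be checked numerically on a grid; the following sketch (illustrative, using an exponentiated quadratic kernel) eigendecomposes the Gram matrix, the discrete analogue of L_K, and shows that a short truncation of the expansion already reconstructs K:

        import numpy as np

        x = np.linspace(-1, 1, 100)
        ell = 0.3                                   # illustrative length scale
        K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))

        # Discrete Mercer expansion: K = sum_j lambda_j phi_j phi_j^T
        lam, phi = np.linalg.eigh(K)                # eigh: K is symmetric PSD
        lam, phi = lam[::-1], phi[:, ::-1]          # decreasing eigenvalue order

        K10 = (phi[:, :10] * lam[:10]) @ phi[:, :10].T
        print(np.max(np.abs(K - K10)))              # tiny: the spectrum decays fast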

  • Prediction with Correlated Gaussians

    ▶ Prediction of f_∗ from f requires the multivariate conditional density.

    ▶ The multivariate conditional density is also Gaussian,

    p(f_∗|f) = N(f_∗|μ, Σ),

    μ = K_∗,f K_f,f⁻¹ f,

    Σ = K_∗,∗ − K_∗,f K_f,f⁻¹ K_f,∗.

    ▶ Here the covariance of the joint density is given by

    K = [ K_f,f   K_f,∗ ]
        [ K_∗,f   K_∗,∗ ]
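    These formulas map directly onto code; below is a minimal noise-free sketch (the data and kernel choices are illustrative) computing μ and Σ for a grid of test inputs:

        import numpy as np

        def k(X, Xp, alpha=1.0, ell=0.5):           # exponentiated quadratic kernel
            return alpha * np.exp(-(X[:, None] - Xp[None, :])**2 / (2 * ell**2))

        X = np.array([-1.5, -0.5, 0.3, 1.1])        # training inputs
        f = np.sin(3 * X)                           # observed function values
        Xs = np.linspace(-2, 2, 100)                # test inputs

        Kff = k(X, X) + 1e-10 * np.eye(len(X))      # jitter for numerical stability
        Ksf = k(Xs, X)

        mu = Ksf @ np.linalg.solve(Kff, f)                     # K_*,f K_f,f^-1 f
        Sigma = k(Xs, Xs) - Ksf @ np.linalg.solve(Kff, Ksf.T)  # K_*,* - K_*,f K_f,f^-1 K_f,*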

  • Constructing Covariance Functions

    ▶ The sum of two covariance functions is also a covariance function:

    k(x, x′) = k_1(x, x′) + k_2(x, x′)

  • Constructing Covariance Functions

    ▶ The product of two covariance functions is also a covariance function:

    k(x, x′) = k_1(x, x′) k_2(x, x′)

  • Multiply by Deterministic Function

    ▶ If f(x) is a Gaussian process and g(x) is a deterministic function, then h(x) = f(x) g(x) has covariance

    k_h(x, x′) = g(x) k_f(x, x′) g(x′),

    where k_h is the covariance for h(·) and k_f is the covariance for f(·).
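    A short sketch of these three constructions (illustrative kernels; the PSD check inspects the eigenvalues of the resulting Gram matrices):

        import numpy as np

        def k_eq(X, Xp, ell=0.5):                   # exponentiated quadratic
            return np.exp(-(X[:, None] - Xp[None, :])**2 / (2 * ell**2))

        def k_lin(X, Xp):                           # linear covariance
            return np.outer(X, Xp)

        g = lambda x: 1.0 + x**2                    # a deterministic function

        X = np.linspace(-2, 2, 40)
        K_sum = k_eq(X, X) + k_lin(X, X)                        # sum
        K_prod = k_eq(X, X) * k_lin(X, X)                       # elementwise product
        K_h = g(X)[:, None] * k_eq(X, X) * g(X)[None, :]        # g(x) k_f(x,x') g(x')

        # each construction is still a valid covariance: no negative eigenvalues
        for K in (K_sum, K_prod, K_h):
            assert np.linalg.eigvalsh(K).min() > -1e-8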

  • Covariance Functions

    MLP Covariance Function

    k(x, x′) = α asin( (w x⊤x′ + b) / (√(w x⊤x + b + 1) √(w x′⊤x′ + b + 1)) )

    ▶ Based on the infinite neural network model.

    ▶ Plotted with w = 40, b = 4.


  • Covariance Functions

    Linear Covariance Function

    k(x, x′) = α x⊤x′

    ▶ Bayesian linear regression.

    ▶ Plotted with α = 1.


  • Covariance Functions
    Where did this covariance matrix come from?

    Ornstein-Uhlenbeck (stationary Gauss-Markov) covariance function

    k(x, x′) = α exp( −|x − x′| / (2ℓ²) )

    ▶ In one dimension, arises from a stochastic differential equation: Brownian motion in a parabolic tube.

    ▶ In higher dimensions, arises from a Fourier filter of the form 1/(π(1 + x²)).


  • Covariance Functions
    Where did this covariance matrix come from?

    Markov Process

    k(t, t′) = α min(t, t′)

    ▶ The covariance matrix is built using the inputs to the function, t.


  • Covariance Functions
    Where did this covariance matrix come from?

    Matérn 5/2 Covariance Function

    k(x, x′) = α (1 + √5 r + (5/3) r²) exp(−√5 r),  where r = ‖x − x′‖₂ / ℓ

    ▶ Matérn 5/2 is a twice-differentiable covariance.

    ▶ The Matérn family is constructed with Student-t filters in Fourier space.


  • Covariance Functions

    RBF Basis Functions

    k(x, x′) = α φ(x)⊤ φ(x′),

    φ_k(x) = exp( −‖x − μ_k‖²₂ / ℓ² ),

    with basis centres μ = (−1, 0, 1).


  • Covariance Functions
    Where did this covariance matrix come from?

    Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian)

    k(x, x′) = α exp( −‖x − x′‖²₂ / (2ℓ²) )

    ▶ The covariance matrix is built using the inputs to the function, x.

    ▶ For the example above it was based on Euclidean distance.

    ▶ The covariance function is also known as a kernel.

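    For reference, one-dimensional implementations of the covariance functions above (a sketch; names and default parameters are illustrative, and the OU form follows the slide's parameterization):

        import numpy as np

        def k_mlp(x, xp, alpha=1.0, w=40.0, b=4.0):     # MLP (arcsine) covariance
            num = w * np.outer(x, xp) + b
            den = np.sqrt(w * x**2 + b + 1)[:, None] * np.sqrt(w * xp**2 + b + 1)[None, :]
            return alpha * np.arcsin(num / den)

        def k_lin(x, xp, alpha=1.0):                    # linear (Bayesian linear regression)
            return alpha * np.outer(x, xp)

        def k_ou(x, xp, alpha=1.0, ell=1.0):            # Ornstein-Uhlenbeck
            return alpha * np.exp(-np.abs(x[:, None] - xp[None, :]) / (2 * ell**2))

        def k_brownian(t, tp, alpha=1.0):               # Markov process, for t, t' >= 0
            return alpha * np.minimum(t[:, None], tp[None, :])

        def k_matern52(x, xp, alpha=1.0, ell=1.0):      # Matern 5/2
            r = np.abs(x[:, None] - xp[None, :]) / ell
            return alpha * (1 + np.sqrt(5) * r + (5.0 / 3) * r**2) * np.exp(-np.sqrt(5) * r)

        def k_eq(x, xp, alpha=1.0, ell=1.0):            # exponentiated quadratic (RBF)
            return alpha * np.exp(-(x[:, None] - xp[None, :])**2 / (2 * ell**2))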

  • Gaussian Process Interpolation

    [Figure: animation frames of a Gaussian process interpolating a set of observations; vertical axis f(x), horizontal axis x.]

    Figure: Real example: BACCO (see e.g. [3]). Interpolation through outputs from slow computer simulations (e.g. atmospheric carbon levels).

  • Gaussian Noise

    ▶ Gaussian noise model,

    p(y_i | f_i) = N(y_i | f_i, σ²),

    where σ² is the variance of the noise.

    ▶ Equivalent to a covariance function of the form

    k(x_i, x_j) = δ_{i,j} σ²,

    where δ_{i,j} is the Kronecker delta function.

    ▶ The additive nature of Gaussians means we can simply add this term to existing covariance matrices.
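    In code, the noise model is exactly this addition of σ² to the diagonal of the kernel matrix before solving (a sketch with illustrative data):

        import numpy as np

        def k_eq(X, Xp, alpha=1.0, ell=0.5):
            return alpha * np.exp(-(X[:, None] - Xp[None, :])**2 / (2 * ell**2))

        X = np.linspace(-2, 2, 20)
        y = np.sin(3 * X) + 0.1 * np.random.randn(len(X))   # noisy observations
        Xs = np.linspace(-2, 2, 100)

        sigma2 = 0.01                                       # noise variance
        Ky = k_eq(X, X) + sigma2 * np.eye(len(X))           # k(x_i, x_j) + sigma^2 delta_ij
        Ksf = k_eq(Xs, X)

        mu = Ksf @ np.linalg.solve(Ky, y)                        # predictive mean
        Sigma = k_eq(Xs, Xs) - Ksf @ np.linalg.solve(Ky, Ksf.T)  # predictive covariance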

  • Gaussian Process Regression

    [Figure: animation frames of Gaussian process regression on noisy observations; vertical axis y(x), horizontal axis x.]

    Figure: Examples include WiFi localization and the C14 calibration curve.
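    The same regression takes a few lines in GPy, the library introduced in Lab 1 (a sketch; the data and kernel settings are illustrative):

        import numpy as np
        import GPy

        X = np.random.uniform(-2, 2, (20, 1))            # GPy expects 2-D arrays
        Y = np.sin(3 * X) + 0.1 * np.random.randn(20, 1)

        kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=1.0)
        m = GPy.models.GPRegression(X, Y, kernel)        # Gaussian noise is included
        m.optimize()                                     # fits theta; see the next slides

        Xs = np.linspace(-2, 2, 100)[:, None]
        mean, var = m.predict(Xs)                        # predictive mean and variance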

  • ▶ Learning with Gaussian processes allows us to characterize the uncertainty in the problem.

    ▶ We can choose a basis of functions, or directly select a kernel (equivalent, but it is better to choose the kernel: non-parametric models).

    ▶ Given a covariance (prior), how do we select the right parameters?

    ▶ Back to optimization...

  • Learning Covariance Parameters
    Can we determine covariance parameters from the data?

    N(y|0, K) = (1 / ((2π)^{n/2} |K|^{1/2})) exp( −y⊤K⁻¹y / 2 )

    Taking the logarithm,

    log N(y|0, K) = −(1/2) log|K| − (1/2) y⊤K⁻¹y − (n/2) log 2π,

    and dropping the constant gives the negative log-likelihood to minimize,

    E(θ) = (1/2) log|K| + (1/2) y⊤K⁻¹y.

    The parameters are inside the covariance function (matrix):

    k_{i,j} = k(x_i, x_j; θ)
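    A minimal sketch of this objective in code, minimizing E(θ) over the length scale of an exponentiated quadratic kernel with a 1-D search (illustrative; practical implementations use gradients and also optimize the signal and noise variances):

        import numpy as np
        from scipy.optimize import minimize_scalar

        X = np.linspace(-2, 2, 30)
        y = np.sin(3 * X) + 0.1 * np.random.randn(len(X))

        def E(log_ell, alpha=1.0, sigma2=0.01):
            """E(theta) = (1/2) log|K| + (1/2) y^T K^-1 y, constant dropped."""
            ell = np.exp(log_ell)                        # optimize on the log scale
            K = alpha * np.exp(-(X[:, None] - X[None, :])**2 / (2 * ell**2))
            K += sigma2 * np.eye(len(X))
            _, logdet = np.linalg.slogdet(K)
            return 0.5 * logdet + 0.5 * y @ np.linalg.solve(K, y)

        res = minimize_scalar(E)                         # 1-D search over log ell
        print("selected length scale:", np.exp(res.x))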

  • Learning Covariance Parameters
    Can we determine length scales and noise levels from the data?

    [Figure: animation frames showing the data fit y(x) versus x alongside the objective E(θ) = (1/2) log|K| + (1/2) y⊤K⁻¹y as a function of the length scale ℓ, from 10⁻¹ to 10¹; the minimum of E(θ) picks out the appropriate length scale.]

  • Limitations of Gaussian Processes

    ▶ Inference is O(n³) due to the matrix inverse (in practice, use the Cholesky decomposition, as in the sketch below).

    ▶ Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images).

    ▶ The widely used exponentiated quadratic (RBF) covariance can be too smooth in practice (but there are many alternatives!).
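    On the first point, the standard trick is one Cholesky factorization instead of an explicit inverse; a sketch:

        import numpy as np
        from scipy.linalg import cho_factor, cho_solve

        n = 500
        X = np.linspace(-2, 2, n)
        K = np.exp(-(X[:, None] - X[None, :])**2 / 2) + 1e-2 * np.eye(n)
        y = np.sin(3 * X)

        # Factor once (O(n^3)); never form K^-1 explicitly
        c, low = cho_factor(K)
        alpha = cho_solve((c, low), y)           # solves K alpha = y stably
        logdet = 2 * np.sum(np.log(np.diag(c)))  # log|K| from the Cholesky diagonal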

  • Conclusions

    ▶ Machine learning has focussed on prediction.

    ▶ Two main approaches: optimize an objective, or model probabilistically.

    ▶ Both approaches require optimizing parameters.

    ▶ Gaussian processes: fundamental models for dealing with uncertainty in complex scenarios.

  • Tomorrow

    ▶ Global optimization.

    ▶ Parameter tuning in machine learning as a global optimization problem.

    ▶ Can we automate the parameter choice of machine learning algorithms?

    ▶ Yes! Bayesian optimization.

  • References

    [1] Carl Friedrich Gauss. Theoria Motus Corporum Coelestium. Perthes et Besser, Hamburg, 1809.

    [2] Pierre Simon Laplace. Mémoire sur la probabilité des causes par les évènemens. In Mémoires de mathématique et de physique, presentés à l'Académie Royale des Sciences, par divers savans, & lûs dans ses assemblées 6, pages 621–656, 1774. Translated in [5].

    [3] Jeremy Oakley and Anthony O'Hagan. Bayesian inference for the uncertainty distribution of computer model outputs. Biometrika, 89(4):769–784, 2002.

    [4] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

    [5] Stephen M. Stigler. Laplace's 1774 memoir on inverse probability. Statistical Science, 1:359–378, 1986.
