
Data Analysis and Probabilistic Inference

Gaussian Processes
Recommended reading: Rasmussen/Williams, Chapters 1, 2, 4, 5; Deisenroth & Ng (2015) [3]

Marc Deisenroth, Department of Computing, Imperial College London

February 22, 2017

http://www.gaussianprocess.org/


Problem Setting

[Figure: noisy observations of an unknown function; axes $x$ and $f(x)$]

Objective

For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, find a distribution over functions $p(f)$ that explains the data

Probabilistic regression problem


Recap from CO-496: Bayesian Linear Regression

- Linear regression model:
  $f(x) = \phi(x)^\top w$, $w \sim \mathcal{N}(0, \Sigma_p)$
  $y = f(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$
- Integrating out the parameters when predicting leads to a distribution over functions:
  $p(f(x_*) \mid x_*, X, y) = \int p(f(x_*) \mid x_*, w)\, p(w \mid X, y)\, \mathrm{d}w = \mathcal{N}\bigl(\mu(x_*), \sigma^2(x_*)\bigr)$
  $\mu(x_*) = \phi_*^\top \Sigma_p \Phi\, (K + \sigma_n^2 I)^{-1} y$
  $\sigma^2(x_*) = \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi\, (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*$
  $K = \Phi^\top \Sigma_p \Phi$

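A minimal NumPy sketch of the predictive equations above, for readers who want to trace the recap in code. The function and variable names (`blr_predict`, `Phi`, `phi_star`) are placeholders of mine, not from the slides; $\Phi$ is taken to be $D \times N$ with columns $\phi(x_i)$, matching the formulas.

```python
import numpy as np

def blr_predict(Phi, y, phi_star, Sigma_p, sigma_n):
    """Predictive mean/variance of Bayesian linear regression (see equations above).

    Phi:      (D, N) feature matrix of the training inputs (columns phi(x_i))
    y:        (N,)   noisy targets
    phi_star: (D,)   feature vector of the test input
    Sigma_p:  (D, D) prior covariance of the weights, w ~ N(0, Sigma_p)
    sigma_n:  noise standard deviation
    """
    N = y.shape[0]
    K = Phi.T @ Sigma_p @ Phi                     # K = Phi^T Sigma_p Phi
    A = K + sigma_n**2 * np.eye(N)
    v = np.linalg.solve(A, y)                     # (K + sigma_n^2 I)^{-1} y
    cross = phi_star @ Sigma_p @ Phi              # phi_*^T Sigma_p Phi
    mean = cross @ v
    var = phi_star @ Sigma_p @ phi_star - cross @ np.linalg.solve(A, cross)
    return mean, var
```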

Sampling from the Prior over Functions

Consider a linear regression setting

$y = a + bx + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, $\qquad p(a, b) = \mathcal{N}(0, I)$

[Figure: samples of $(a, b)$ drawn from the prior $p(a, b)$, and the corresponding straight lines $y = a + bx$]

demo: sampling from prior, sampling from posterior (a minimal sketch follows after the posterior slide)


Sampling from the Posterior over Functions

Consider a linear regression setting

$y = a + bx + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, $\qquad p(a, b) = \mathcal{N}(0, I)$

[Figure: samples of $(a, b)$ drawn from the posterior $p(a, b \mid X, y)$, and the corresponding straight lines]

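The "demo: sampling from prior, sampling from posterior" note refers to exactly this kind of experiment. Below is a minimal NumPy sketch under assumed toy data; the data-generating line, noise level, sample counts, and all names are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_n = 1.0

# Toy data from an (assumed) ground-truth line y = 1.0 + 0.7 x + noise
x = rng.uniform(-10, 10, size=20)
y = 1.0 + 0.7 * x + sigma_n * rng.standard_normal(20)

Phi = np.stack([np.ones_like(x), x], axis=1)      # features [1, x] for w = (a, b)

# Prior p(a, b) = N(0, I): each parameter sample is a straight line
prior_samples = rng.standard_normal((5, 2))

# Posterior p(a, b | X, y) for the Gaussian prior and Gaussian likelihood
S_inv = np.eye(2) + Phi.T @ Phi / sigma_n**2      # posterior precision
S = np.linalg.inv(S_inv)                          # posterior covariance
m = S @ Phi.T @ y / sigma_n**2                    # posterior mean
post_samples = rng.multivariate_normal(m, S, size=5)

# Evaluate the sampled lines on a grid (these are what the slide's figures plot)
xs = np.linspace(-10, 10, 100)
prior_lines = prior_samples @ np.stack([np.ones_like(xs), xs])   # (5, 100)
post_lines = post_samples @ np.stack([np.ones_like(xs), xs])
```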

Fitting Nonlinear Functions

- Fit nonlinear functions using (Bayesian) linear regression: a linear combination of nonlinear features
- Example: radial-basis-function (RBF) network
  $f(x) = \sum_{i=1}^{n} w_i \phi_i(x)$, $\quad w_i \sim \mathcal{N}(0, \sigma_p^2)$
  where $\phi_i(x) = \exp\bigl(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\bigr)$ for given "centers" $\mu_i$


Illustration: Fitting a Radial Basis Function Network

$\phi_i(x) = \exp\bigl(-\tfrac{1}{2}(x - \mu_i)^\top (x - \mu_i)\bigr)$

[Figure: the data set with the Gaussian-shaped basis functions overlaid; axes $x$ and $f(x)$]

- Place Gaussian-shaped basis functions $\phi_i$ at 25 input locations $\mu_i$, linearly spaced in the interval $[-5, 3]$

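A short sketch of this construction: build the 25 basis functions on $[-5, 3]$ as described, form the feature (design) matrix, and draw functions from the RBF prior as shown on the next slide. The helper names and the particular training grid are assumptions for illustration.

```python
import numpy as np

def rbf_features(x, centers):
    """Feature matrix Phi with Phi[i, j] = phi_j(x_i) = exp(-0.5 * (x_i - mu_j)^2)."""
    d = x[:, None] - centers[None, :]          # pairwise differences
    return np.exp(-0.5 * d**2)

centers = np.linspace(-5, 3, 25)               # 25 centers in [-5, 3], as on the slide
x_grid = np.linspace(-5, 5, 200)               # illustrative evaluation grid
Phi = rbf_features(x_grid, centers)            # (200, 25) design matrix

# Samples from the RBF prior f(x) = sum_i w_i phi_i(x) with p(w) = N(0, I)
w = np.random.default_rng(0).standard_normal((25, 3))
f_prior_samples = Phi @ w                      # three prior function samples on the grid
```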

Samples from the RBF Prior

$f(x) = \sum_{i=1}^{n} w_i \phi_i(x)$, $\quad p(w) = \mathcal{N}(0, I)$

[Figure: functions sampled from the RBF prior; axes $x$ and $f(x)$]


Samples from the RBF Posterior

$f(x) = \sum_{i=1}^{n} w_i \phi_i(x)$, $\quad p(w \mid X, y) = \mathcal{N}(m_N, S_N)$

[Figure: functions sampled from the RBF posterior; axes $x$ and $f(x)$]


RBF Posterior

[Figure: the RBF posterior over functions; axes $x$ and $f(x)$]


Limitations

[Figure: RBF posterior; to the right of the basis-function centers the model cannot represent any variability; axes $x$ and $f(x)$]

- Feature engineering
- Finite number of features:
  - Above: without basis functions on the right, we cannot express any variability of the function there
  - Ideally: add more (infinitely many) basis functions

Approach

- Instead of sampling parameters, which induce a distribution over functions, sample functions directly
  Make assumptions on the distribution of functions
- Intuition: a function is an infinitely long vector of function values
  Make assumptions on the distribution of function values


Gaussian Process

- We will place a distribution $p(f)$ on functions $f$
- Informally, a function can be considered an infinitely long vector of function values $f = [f_1, f_2, f_3, \dots]$
- A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.

Definition: A Gaussian process (GP) is a collection of random variables $f_1, f_2, \dots$, any finite number of which is Gaussian distributed.

- A Gaussian distribution is specified by a mean vector $\mu$ and a covariance matrix $\Sigma$
- A Gaussian process is specified by a mean function $m(\cdot)$ and a covariance function (kernel) $k(\cdot, \cdot)$


Covariance Function

- The covariance function (kernel) is symmetric and positive semi-definite
- It allows us to compute covariances between (unknown) function values by just looking at the corresponding inputs:
  $\mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$


GP Regression as a Bayesian Inference Problem

Objective

For a set of observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, find a (posterior) distribution over functions $p(f \mid X, y)$ that explains the data.

Training data: $X, y$. Bayes' theorem yields

$p(f \mid X, y) = \dfrac{p(y \mid f, X)\, p(f)}{p(y \mid X)}$

Prior: $p(f) = \mathcal{GP}(m, k)$ — specify mean function $m$ and kernel $k$
Likelihood (noise model): $p(y \mid f, X) = \mathcal{N}(f(X), \sigma_n^2 I)$
Marginal likelihood (evidence): $p(y \mid X) = \int p(y \mid f(X))\, p(f \mid X)\, \mathrm{d}f$
Posterior: $p(f \mid y, X) = \mathcal{GP}(m_{\text{post}}, k_{\text{post}})$


Prior over Functions

- Treat a function as a long vector of function values: $f = [f_1, f_2, \dots]$
  Look at a distribution over function values $f_i = f(x_i)$
- Consider a finite number $N$ of function values $f$ and all other (infinitely many) function values $\tilde{f}$. Informally:
  $p(f, \tilde{f}) = \mathcal{N}\left( \begin{bmatrix} \mu_f \\ \mu_{\tilde{f}} \end{bmatrix}, \begin{bmatrix} \Sigma_{ff} & \Sigma_{f\tilde{f}} \\ \Sigma_{\tilde{f}f} & \Sigma_{\tilde{f}\tilde{f}} \end{bmatrix} \right)$
  where $\Sigma_{\tilde{f}\tilde{f}} \in \mathbb{R}^{m \times m}$ and $\Sigma_{f\tilde{f}} \in \mathbb{R}^{N \times m}$, $m \to \infty$.
- $\Sigma_{ff}^{(i,j)} = \mathrm{Cov}[f(x_i), f(x_j)] = k(x_i, x_j)$
- Key property: the marginal remains finite
  $p(f) = \int p(f, \tilde{f})\, \mathrm{d}\tilde{f} = \mathcal{N}(\mu_f, \Sigma_{ff})$


Training and Test Marginal

- In practice, we always have finite training and test inputs $x_{\text{train}}, x_{\text{test}}$.
- Define $f_* := f_{\text{test}}$, $f := f_{\text{train}}$.
- Then, we obtain the finite marginal
  $p(f, f_*) = \int p(f, f_*, f_{\text{other}})\, \mathrm{d}f_{\text{other}} = \mathcal{N}\left( \begin{bmatrix} \mu_f \\ \mu_* \end{bmatrix}, \begin{bmatrix} \Sigma_{ff} & \Sigma_{f*} \\ \Sigma_{*f} & \Sigma_{**} \end{bmatrix} \right)$


GP Regression as a Bayesian Inference Problem (ctd.)

Posterior over functions (with training data $X, y$):

$p(f \mid X, y) = \dfrac{p(y \mid f, X)\, p(f \mid X)}{p(y \mid X)}$

Using the properties of Gaussians, we obtain

$p(y \mid f, X)\, p(f \mid X) = \mathcal{N}(y \mid f(X), \sigma_n^2 I)\, \mathcal{N}(f(X) \mid m(X), K)$
$= Z\, \mathcal{N}\bigl(f(X) \mid \underbrace{m(X) + K(K + \sigma_n^2 I)^{-1}(y - m(X))}_{\text{posterior mean}},\; \underbrace{K - K(K + \sigma_n^2 I)^{-1}K}_{\text{posterior covariance}}\bigr)$

$K = k(X, X)$

Marginal likelihood:

$Z = p(y \mid X) = \int p(y \mid f, X)\, p(f \mid X)\, \mathrm{d}f = \mathcal{N}(y \mid m(X), K + \sigma_n^2 I)$


GP Predictions (1)

$y = f(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$

- Objective: find $p(f(X_*) \mid X, y)$ for training data $X, y$ and test inputs $X_*$.
- GP prior: $p(f \mid X) = \mathcal{N}(m(X), K)$
- Gaussian likelihood: $p(y \mid f(X)) = \mathcal{N}(f(X), \sigma_n^2 I)$
- With $f \sim \mathcal{GP}$ it follows that $f, f_*$ are jointly Gaussian distributed:
  $p(f, f_* \mid X, X_*) = \mathcal{N}\left( \begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix} \right)$
- Due to the Gaussian likelihood, we also get ($f$ is unobserved)
  $p(y, f_* \mid X, X_*) = \mathcal{N}\left( \begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix} \right)$


GP Predictions (2)

Prior:

$p(y, f_* \mid X, X_*) = \mathcal{N}\left( \begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix}, \begin{bmatrix} K + \sigma_n^2 I & k(X, X_*) \\ k(X_*, X) & k(X_*, X_*) \end{bmatrix} \right)$

The posterior predictive distribution $p(f_* \mid X, y, X_*)$ at test inputs $X_*$ is obtained by Gaussian conditioning:

$p(f_* \mid X, y, X_*) = \mathcal{N}\bigl(\mathbb{E}[f_* \mid X, y, X_*], \mathbb{V}[f_* \mid X, y, X_*]\bigr)$
$\mathbb{E}[f_* \mid X, y, X_*] = m_{\text{post}}(X_*) = \underbrace{m(X_*)}_{\text{prior mean}} + k(X_*, X)(K + \sigma_n^2 I)^{-1}(y - m(X))$
$\mathbb{V}[f_* \mid X, y, X_*] = k_{\text{post}}(X_*, X_*) = \underbrace{k(X_*, X_*)}_{\text{prior variance}} - k(X_*, X)(K + \sigma_n^2 I)^{-1}k(X, X_*)$

From now on: set the prior mean function $m \equiv 0$ (a prediction sketch follows after this slide).

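A minimal NumPy/SciPy sketch of these predictive equations with zero prior mean, using a Cholesky factorization for the linear solves. The kernel is passed in as a function; all names (`gp_predict`, `kernel`, `X_star`) are placeholders of mine, not the slides' notation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, X_star, kernel, sigma_n):
    """GP posterior mean and covariance at test inputs X_star (zero prior mean).

    X: (N, D) training inputs, y: (N,) targets, X_star: (M, D) test inputs,
    kernel(A, B) -> covariance matrix of shape (len(A), len(B)), sigma_n: noise std.
    """
    K = kernel(X, X)                               # K = k(X, X)
    K_s = kernel(X, X_star)                        # k(X, X_*)
    K_ss = kernel(X_star, X_star)                  # k(X_*, X_*)
    # All solves against (K + sigma_n^2 I) reuse one Cholesky factorization
    c, low = cho_factor(K + sigma_n**2 * np.eye(len(X)))
    alpha = cho_solve((c, low), y)                 # (K + sigma_n^2 I)^{-1} y
    mean = K_s.T @ alpha                           # posterior mean
    cov = K_ss - K_s.T @ cho_solve((c, low), K_s)  # posterior covariance
    return mean, cov
```

With a kernel such as the Gaussian covariance function defined a few slides below, this reproduces the posterior mean and marginal variance plotted in the following illustration.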

Illustration: Inference with Gaussian Processes

[Figure: prior belief about the function; axes $x$ and $f(x)$]

Prior belief about the function. Predictive (marginal) mean and variance:

$\mathbb{E}[f(x_*) \mid x_*, \emptyset] = m(x_*) = 0$
$\mathbb{V}[f(x_*) \mid x_*, \emptyset] = \sigma^2(x_*) = k(x_*, x_*)$


Illustration: Inference with Gaussian Processes

[Figure: posterior belief about the function after observing data; axes $x$ and $f(x)$]

Posterior belief about the function. Predictive (marginal) mean and variance:

$\mathbb{E}[f(x_*) \mid x_*, X, y] = m(x_*) = k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1} y$
$\mathbb{V}[f(x_*) \mid x_*, X, y] = \sigma^2(x_*) = k(x_*, x_*) - k(X, x_*)^\top (K + \sigma_\varepsilon^2 I)^{-1} k(X, x_*)$


Covariance Function

- A Gaussian process is fully specified by a mean function $m$ and a kernel/covariance function $k$
- The covariance function (kernel) is symmetric and positive semi-definite
- The covariance function encodes high-level structural assumptions about the latent function $f$ (e.g., smoothness, differentiability, periodicity)


Gaussian Covariance Function

$k_{\text{Gauss}}(x_i, x_j) = \sigma_f^2 \exp\bigl(-(x_i - x_j)^\top (x_i - x_j) / \ell^2\bigr)$

- $\sigma_f$: amplitude of the latent function
- $\ell$: length scale — how far do we have to move in input space before the function value changes significantly? Smoothness parameter
- Assumption on the latent function: smooth (infinitely differentiable)

Length-Scales

Length scales determine how wiggly the function is and how much information we can transfer to other function values

[Figure: the data set with GP fits illustrating the effect of the length scale; axes $x$ and $f(x)$]


Matérn Covariance Function

$k_{\text{Mat},3/2}(x_i, x_j) = \sigma_f^2 \left(1 + \frac{\sqrt{3}\,\|x_i - x_j\|}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,\|x_i - x_j\|}{\ell}\right)$

- $\sigma_f$: amplitude of the latent function
- $\ell$: length scale — how far do we have to move in input space before the function value changes significantly?
- Assumption on the latent function: once differentiable


Periodic Covariance Function

$k_{\text{per}}(x_i, x_j) = \sigma_f^2 \exp\left(-\frac{2\sin^2\bigl(\kappa(x_i - x_j)\bigr)}{\ell^2}\right) = k_{\text{Gauss}}\bigl(u(x_i), u(x_j)\bigr), \qquad u(x) = \begin{bmatrix} \cos(\kappa x) \\ \sin(\kappa x) \end{bmatrix}$

$\kappa$: periodicity parameter

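A small sketch implementing the three covariance functions as reconstructed above, for scalar inputs. Note the denominator conventions vary across references (some texts use $2\ell^2$ in the Gaussian kernel); the forms below follow the slides' notation, and all function names are mine.

```python
import numpy as np

def k_gauss(xi, xj, sigma_f=1.0, ell=1.0):
    """Gaussian covariance: sigma_f^2 * exp(-(xi - xj)^2 / ell^2)."""
    return sigma_f**2 * np.exp(-((xi - xj)**2) / ell**2)

def k_matern32(xi, xj, sigma_f=1.0, ell=1.0):
    """Matérn 3/2 covariance: once-differentiable sample paths."""
    r = np.abs(xi - xj)
    return sigma_f**2 * (1.0 + np.sqrt(3.0) * r / ell) * np.exp(-np.sqrt(3.0) * r / ell)

def k_periodic(xi, xj, sigma_f=1.0, ell=1.0, kappa=1.0):
    """Periodic covariance: sigma_f^2 * exp(-2 sin^2(kappa (xi - xj)) / ell^2)."""
    return sigma_f**2 * np.exp(-2.0 * np.sin(kappa * (xi - xj))**2 / ell**2)

def cov_matrix(x, kernel):
    """Covariance matrix K[i, j] = k(x_i, x_j) for a 1-D input array x."""
    return kernel(x[:, None], x[None, :])

x = np.linspace(-5, 5, 6)
K = cov_matrix(x, k_gauss)        # symmetric, positive semi-definite
```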

Meta-Parameters of a GP

The GP possesses a set of hyper-parameters:

- Parameters of the mean function
- Hyper-parameters of the covariance function (e.g., length scales and signal variance)
- Likelihood parameters (e.g., noise variance $\sigma_n^2$)

Train a GP to find a good set of hyper-parameters.
Model selection to find good mean and covariance functions (can also be automated: Automatic Statistician, Lloyd et al., 2014)


Gaussian Process Training: Hyper-Parameters

GP training: find good GP hyper-parameters $\theta$ (kernel and mean-function parameters)

[Graphical model with variables $x_i$, $f$, $y_i$ (plate over $N$), hyper-parameters $\theta$, and noise parameter $\sigma_n$]

- Place a prior $p(\theta)$ on the hyper-parameters
- Posterior over hyper-parameters:
  $p(\theta \mid X, y) = \dfrac{p(\theta)\, p(y \mid X, \theta)}{p(y \mid X)}, \qquad p(y \mid X, \theta) = \int p(y \mid f(X))\, p(f \mid X, \theta)\, \mathrm{d}f$
- Choose hyper-parameters $\theta^*$ such that
  $\theta^* \in \arg\max_\theta\; \log p(\theta) + \log p(y \mid X, \theta)$
  Maximize the marginal likelihood if $p(\theta) = \mathcal{U}$ (uniform prior)


Training via Marginal Likelihood Maximization

GP training: maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters, where the unwieldy $f$ has been integrated out). Also called maximum likelihood type II.

Marginal likelihood:

$p(y \mid X, \theta) = \int p(y \mid f(X))\, p(f \mid X, \theta)\, \mathrm{d}f = \int \mathcal{N}(y \mid f(X), \sigma_n^2 I)\, \mathcal{N}(f(X) \mid 0, K)\, \mathrm{d}f = \mathcal{N}(y \mid 0, K + \sigma_n^2 I)$

Learning the GP hyper-parameters:

$\theta^* \in \arg\max_\theta\; \log p(y \mid X, \theta)$
$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2} \log |K_\theta| + \text{const}, \qquad K_\theta := K + \sigma_n^2 I$


Training via Marginal Likelihood Maximization

Log-marginal likelihood:

$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top K_\theta^{-1} y - \tfrac{1}{2} \log |K_\theta| + \text{const}, \qquad K_\theta := K + \sigma_n^2 I$

- Automatic trade-off between data fit and model complexity
- Gradient-based optimization of the hyper-parameters $\theta$:
  $\dfrac{\partial \log p(y \mid X, \theta)}{\partial \theta_i} = \tfrac{1}{2} y^\top K_\theta^{-1} \dfrac{\partial K_\theta}{\partial \theta_i} K_\theta^{-1} y - \tfrac{1}{2} \mathrm{tr}\Bigl(K_\theta^{-1} \dfrac{\partial K_\theta}{\partial \theta_i}\Bigr) = \tfrac{1}{2} \mathrm{tr}\Bigl((\alpha \alpha^\top - K_\theta^{-1}) \dfrac{\partial K_\theta}{\partial \theta_i}\Bigr), \qquad \alpha := K_\theta^{-1} y$

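A minimal sketch of these formulas for a single hyper-parameter: the log-marginal likelihood and its gradient with respect to the length scale of a Gaussian kernel (zero mean assumed). The kernel choice, derivative, and all names are my assumptions for illustration, not the slides' code.

```python
import numpy as np

def log_marginal_likelihood_and_grad(X, y, ell, sigma_f, sigma_n):
    """log p(y | X, theta) and d/d ell for a Gaussian kernel K = sigma_f^2 exp(-d^2/ell^2)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)   # squared distances
    K = sigma_f**2 * np.exp(-d2 / ell**2)
    K_theta = K + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # alpha = K_theta^{-1} y
    lml = (-0.5 * y @ alpha
           - np.sum(np.log(np.diag(L)))                        # -1/2 log |K_theta|
           - 0.5 * len(y) * np.log(2 * np.pi))
    dK = K * (2.0 * d2 / ell**3)                               # dK_theta / d ell for this kernel
    K_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(len(X))))
    grad_ell = 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK)
    return lml, grad_ell
```

In practice $\theta$ is optimized in log-space with a gradient-based optimizer; a restart sketch for the non-convex objective follows after the "Marginal Likelihood and Parameter Learning" slide below.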

Example: Training Data

[Figure: training data; axes $x$ and $y$]


Example: Marginal Likelihood Contour

[Figure: log-marginal likelihood contours over log-noise and log-length-scale, N = 20]


Example: Exploring the Modes (1)

[Figure: GP fit to the training data at one mode of the log-marginal likelihood; axes $x$ and $y$]


Example: Exploring the Modes (2)

[Figure: GP fit to the training data at another mode of the log-marginal likelihood; axes $x$ and $y$]


Marginal Likelihood (1)–(9)

[Figure series: log-marginal likelihood contours over log-noise and log-length-scale as the training-set size grows, N = 2, 3, 5, 10, 15, 20, 50, 100, 200]

Marginal Likelihood and Parameter Learning

- The marginal likelihood is non-convex
- In particular in the very-small-data regime, a GP can end up in three different modes when optimizing the hyper-parameters:
  - Overfitting (unlikely, but possible)
  - Underfitting (everything is considered noise)
  - Good fit
- Re-start the hyper-parameter optimization from random initializations to mitigate the problem (see the sketch below)
- With increasing data set size the GP typically ends up in the "good-fit" mode. Overfitting (indicator: small length scales and small noise variance) is very unlikely.
- Ideally, we would integrate the hyper-parameters out. Why can we not do this easily?

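One way to implement the random-restart advice above, sketched with SciPy on assumed toy data; the data, kernel, parameterization, number of restarts, and bounds are all illustrative choices of mine.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.uniform(-10, 10, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)        # toy data (illustrative only)

def nlml(log_params):
    """Negative log marginal likelihood; params = (log ell, log sigma_f, log sigma_n)."""
    ell, sigma_f, sigma_n = np.exp(log_params)
    d2 = (X - X.T)**2
    K = sigma_f**2 * np.exp(-d2 / ell**2) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

# Random restarts: optimize from several random initializations, keep the best local optimum
results = [minimize(nlml, rng.normal(size=3), method="L-BFGS-B", bounds=[(-5, 5)] * 3)
           for _ in range(5)]
best = min(results, key=lambda r: r.fun)
ell_opt, sigma_f_opt, sigma_n_opt = np.exp(best.x)
```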

Model Selection—Mean Function and Kernel

- Assume we have a finite set of models $M_i$, each one specifying a mean function $m_i$ and a kernel $k_i$. How do we find the best one?
- Some options:
  - BIC, AIC (see CO-496)
  - Compare marginal likelihood values (assuming a uniform prior on the set of models)


Example

[Figure: the same data set fitted with four different kernels (constant, linear, Matérn, Gaussian); axes $x$ and $f(x)$]

- Four different kernels (mean function fixed to $m \equiv 0$)
- MAP hyper-parameters for each kernel
- Log-marginal likelihood values for each (optimized) model:
  Constant kernel: LML = -1.1073
  Linear kernel: LML = -1.0065
  Matérn kernel: LML = -0.8625
  Gaussian kernel: LML = -0.69308

Application Areas

[Figure: heat map over angle (rad) and angular velocity (rad/s)]

- Reinforcement learning and robotics: model value functions and/or dynamics with GPs
- Bayesian optimization (experimental design): model unknown utility functions with GPs
- Geostatistics: spatial modeling (e.g., landscapes, resources)
- Sensor networks
- Time-series modeling and forecasting


Limitations of Gaussian Processes

Computational and memory complexity. Training set size: $N$.

- Training scales in $O(N^3)$
- Prediction (variances) scales in $O(N^2)$
- Memory requirement: $O(ND + N^2)$

Practical limit: $N \approx 10{,}000$


Tips and Tricks for Practitioners

- To set initial hyper-parameters, use domain knowledge if possible.
- Standardize the input data and set the initial length scales to $\ell \approx 0.5$.
- Standardize the targets $y$ and set the initial signal variance to $\sigma_f \approx 1$.
- Often useful: set the initial noise level relatively high (e.g., $\sigma_n \approx 0.5 \times \sigma_f$), even if you think your data have low noise. The optimization surface for the other parameters will be easier to move in.
- When optimizing hyper-parameters, random restarts or other tricks to avoid local optima are advised.
- Mitigate numerical instability (Cholesky decomposition of $K + \sigma_n^2 I$) by penalizing high signal-to-noise ratios $\sigma_f / \sigma_n$.

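A small sketch turning the initialization tips above into code; the function name and return structure are mine, and the particular constants are the heuristics from the slide, not universal rules.

```python
import numpy as np

def initial_hyperparameters(X, y):
    """Standardize data and pick initial GP hyper-parameters following the tips above."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized inputs
    y_std = (y - y.mean()) / y.std()               # standardized targets
    ell0 = 0.5                                     # initial length scale for standardized inputs
    sigma_f0 = 1.0                                 # initial signal std for standardized targets
    sigma_n0 = 0.5 * sigma_f0                      # deliberately high initial noise level
    return (X_std, y_std), (ell0, sigma_f0, sigma_n0)
```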

Appendix


The Gaussian Distribution

$p(x \mid \mu, \Sigma) = (2\pi)^{-\frac{D}{2}} |\Sigma|^{-\frac{1}{2}} \exp\bigl(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\bigr)$

- Mean vector $\mu$: average of the data
- Covariance matrix $\Sigma$: spread of the data

[Figures: a one-dimensional Gaussian density $p(x)$ with mean and 95% confidence bound, and a two-dimensional Gaussian over $(x_1, x_2)$ with mean and 95% confidence bound]


Sampling from a Multivariate Gaussian

Objective

Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$. However, we only have access to a random number generator that can sample $x$ from $\mathcal{N}(0, I)$...

Exploit that affine transformations $y = Ax + b$ of a Gaussian random variable $x$ remain Gaussian:

- Mean: $\mathbb{E}_x[Ax + b] = A\,\mathbb{E}_x[x] + b$
- Covariance: $\mathbb{V}_x[Ax + b] = A\,\mathbb{V}_x[x]\, A^\top$

1. Find conditions on $A, b$ to match the mean of $y$
2. Find conditions on $A, b$ to match the covariance of $y$


Sampling from a Multivariate Gaussian (2)

Objective

Generate a random sample $y \sim \mathcal{N}(\mu, \Sigma)$ from a $D$-dimensional joint Gaussian with covariance matrix $\Sigma$ and mean vector $\mu$.

x = randn(D,1);           % sample x ~ N(0, I)
y = chol(Sigma)'*x + mu;  % scale x and add offset

Here chol(Sigma) is the Cholesky factor $L$, such that $L^\top L = \Sigma$. Therefore, the mean and covariance of $y$ are

$\mathbb{E}[y] = \mathbb{E}[L^\top x + \mu] = L^\top \mathbb{E}[x] + \mu = \mu$
$\mathrm{Cov}[y] = \mathbb{E}[(y - \mu)(y - \mu)^\top] = \mathbb{E}[L^\top x x^\top L] = L^\top \mathbb{E}[x x^\top] L = L^\top L = \Sigma$

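The same recipe as a NumPy sketch, for readers not using MATLAB. Note that `np.linalg.cholesky` returns the lower-triangular factor $L$ with $LL^\top = \Sigma$, so no transpose is needed; the function name and example numbers are mine.

```python
import numpy as np

def sample_mvn(mu, Sigma, n_samples=1, rng=np.random.default_rng()):
    """Draw samples y ~ N(mu, Sigma) via the Cholesky factor of Sigma."""
    D = len(mu)
    L = np.linalg.cholesky(Sigma)                  # lower triangular, L @ L.T == Sigma
    x = rng.standard_normal((D, n_samples))        # x ~ N(0, I)
    return (mu[:, None] + L @ x).T                 # scale and shift: each row is one sample

# Example: three samples from a 2-D Gaussian
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
samples = sample_mvn(mu, Sigma, n_samples=3)
```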

Conditional

[Figure: contours of the joint $p(x, y)$, an observed value of $y$, and the resulting conditional $p(x \mid y)$]

$p(x, y) = \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \right)$

$p(x \mid y) = \mathcal{N}(\mu_{x|y}, \Sigma_{x|y})$
$\mu_{x|y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y)$
$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}$

The conditional $p(x \mid y)$ is also Gaussian — computationally convenient.

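A minimal sketch of these conditioning formulas (names assumed); this is exactly the operation behind the GP predictive equations earlier in the deck.

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y_obs):
    """Return mean and covariance of p(x | y = y_obs) for a joint Gaussian p(x, y)."""
    gain = np.linalg.solve(S_yy, S_xy.T).T         # Sigma_xy Sigma_yy^{-1}
    mu_cond = mu_x + gain @ (y_obs - mu_y)
    S_cond = S_xx - gain @ S_xy.T                  # Sigma_xx - Sigma_xy Sigma_yy^{-1} Sigma_yx
    return mu_cond, S_cond
```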

Marginal

[Figure: contours of the joint $p(x, y)$ and the marginal $p(x)$]

$p(x, y) = \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} \right)$

Marginal distribution:

$p(x) = \int p(x, y)\, \mathrm{d}y = \mathcal{N}(\mu_x, \Sigma_{xx})$

- The marginal of a joint Gaussian distribution is Gaussian
- Intuitively: ignore (integrate out) everything you are not interested in


The Gaussian Distribution in the Limit

Consider the joint Gaussian distribution $p(x, \tilde{x})$, where $x \in \mathbb{R}^D$ and $\tilde{x} \in \mathbb{R}^k$, $k \to \infty$, are random variables. Then

$p(x, \tilde{x}) = \mathcal{N}\left( \begin{bmatrix} \mu_x \\ \mu_{\tilde{x}} \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{x\tilde{x}} \\ \Sigma_{\tilde{x}x} & \Sigma_{\tilde{x}\tilde{x}} \end{bmatrix} \right)$

where $\Sigma_{\tilde{x}\tilde{x}} \in \mathbb{R}^{k \times k}$ and $\Sigma_{x\tilde{x}} \in \mathbb{R}^{D \times k}$, $k \to \infty$. However, the marginal remains finite,

$p(x) = \int p(x, \tilde{x})\, \mathrm{d}\tilde{x} = \mathcal{N}(\mu_x, \Sigma_{xx})$

where we integrate out an infinite number of random variables $\tilde{x}_i$.


Marginal and Conditional in the Limit

- In practice, we consider finite training and test data $x_{\text{train}}, x_{\text{test}}$
- Then, $x = \{x_{\text{train}}, x_{\text{test}}, x_{\text{other}}\}$ ($x_{\text{other}}$ plays the role of $\tilde{x}$ from the previous slide)

$p(x) = \mathcal{N}\left( \begin{bmatrix} \mu_{\text{train}} \\ \mu_{\text{test}} \\ \mu_{\text{other}} \end{bmatrix}, \begin{bmatrix} \Sigma_{\text{train}} & \Sigma_{\text{train,test}} & \Sigma_{\text{train,other}} \\ \Sigma_{\text{test,train}} & \Sigma_{\text{test}} & \Sigma_{\text{test,other}} \\ \Sigma_{\text{other,train}} & \Sigma_{\text{other,test}} & \Sigma_{\text{other}} \end{bmatrix} \right)$

$p(x_{\text{train}}, x_{\text{test}}) = \int p(x_{\text{train}}, x_{\text{test}}, x_{\text{other}})\, \mathrm{d}x_{\text{other}}$

$p(x_{\text{test}} \mid x_{\text{train}}) = \mathcal{N}(\mu_*, \Sigma_*)$
$\mu_* = \mu_{\text{test}} + \Sigma_{\text{test,train}} \Sigma_{\text{train}}^{-1} (x_{\text{train}} - \mu_{\text{train}})$
$\Sigma_* = \Sigma_{\text{test}} - \Sigma_{\text{test,train}} \Sigma_{\text{train}}^{-1} \Sigma_{\text{train,test}}$


Gaussian Process Training: Hierarchical Inference

- Level-1 inference (posterior on $f$):
  $p(f \mid X, y, \theta) = \dfrac{p(y \mid X, f)\, p(f \mid X, \theta)}{p(y \mid X, \theta)}, \qquad p(y \mid X, \theta) = \int p(y \mid f, X)\, p(f \mid X, \theta)\, \mathrm{d}f$
- Level-2 inference (posterior on $\theta$):
  $p(\theta \mid X, y) = \dfrac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)}$

[Graphical model with variables $x_i$, $f$, $y_i$ (plate over $N$), hyper-parameters $\theta$, and noise parameter $\sigma_n$]


GP as the Limit of an Infinite RBF Network

Consider the universal function approximator

$f(x) = \sum_{i \in \mathbb{Z}} \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \gamma_n \exp\left(-\frac{\bigl(x - (i + \tfrac{n}{N})\bigr)^2}{\lambda^2}\right), \qquad x \in \mathbb{R}, \; \lambda \in \mathbb{R}_+$

with $\gamma_n \sim \mathcal{N}(0, 1)$ (random weights): Gaussian-shaped basis functions (with variance $\lambda^2/2$) everywhere on the real axis.

$f(x) = \sum_{i \in \mathbb{Z}} \int_i^{i+1} \gamma(s) \exp\left(-\frac{(x - s)^2}{\lambda^2}\right) \mathrm{d}s = \int_{-\infty}^{\infty} \gamma(s) \exp\left(-\frac{(x - s)^2}{\lambda^2}\right) \mathrm{d}s$

- Mean: $\mathbb{E}[f(x)] = 0$
- Covariance: $\mathrm{Cov}[f(x), f(x')] = \theta_1^2 \exp\left(-\frac{(x - x')^2}{2\lambda^2}\right)$ for suitable $\theta_1^2$

This is a GP with mean 0 and Gaussian covariance function.


References I

[1] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.
[2] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems, pages 2618–2626, 2012.
[3] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[4] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, March 2009.
[5] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.
[6] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, pages 3156–3164. Curran Associates, Inc., 2013.
[7] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and A. Girard. Gaussian Process Model Based Predictive Control. In Proceedings of the 2004 American Control Conference (ACC 2004), pages 2214–2219, Boston, MA, USA, June–July 2004.
[8] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235–284, February 2008.
[9] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages 1–11, 2014.
[10] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the International Conference on Information Processing in Sensor Networks, pages 109–120. IEEE Computer Society, 2008.
[11] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2):1939–1960, 2005.
[12] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
[13] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling. Philosophical Transactions of the Royal Society (Part A), 371(1984), February 2013.
