Introduction to GPR



    Introduction to Gaussian Process Regression

    Hanna M. Wallach

    [email protected]

    January 25, 2005


    Outline

    Regression: weight-space view

    Regression: function-space view (Gaussian processes)

    Weight-space and function-space correspondence

    Making predictions

    Model selection: hyperparameters


    Supervised Learning: Regression (1)

    [Figure: underlying function and noisy data, with training data marked; axes: input x vs. output f(x).]

    Assume an underlying process which generates “clean” data.

    Goal: recover underlying process from noisy observed data.


    Supervised Learning: Regression (2)

    Training data are D = {x^(i), y^(i) | i = 1, ..., n}.

    Each input is a vector x of dimension d.

    Each target is a real-valued scalar y = f(x) + noise.

    Collect inputs in a d × n matrix, X, and targets in a vector, y:

    D = {X, y}.

    Wish to infer f* for an unseen input x*, using P(f*|x*, D).


    Gaussian Process Models: Inference in Function Space

    [Figure: samples from the prior (left) and samples from the posterior (right); axes: input x vs. output f(x).]

    A Gaussian process defines a distribution over functions.

    Inference takes place directly in function space.


    Part I

    Regression: The Weight-Space View


    Bayesian Linear Regression (1)

    [Figure: training data; axes: input x vs. output f(x).]

    Assuming noise ε ∼ N(0, σ²), the linear regression model is:

    f(x|w) = xᵀw,   y = f + ε.


    Bayesian Linear Regression (2)

    Likelihood of parameters is:

    P(y|X, w) = N(Xᵀw, σ²I).

    Assume a Gaussian prior over parameters:

    P(w) = N(0, Σ_p).

    Apply Bayes’ theorem to obtain posterior:

    P(w|y, X) ∝ P(y|X, w) P(w).


    Bayesian Linear Regression (3)

    Posterior distribution over w is:

    P(w|y, X) = N((1/σ²) A⁻¹Xy, A⁻¹),   where A = Σ_p⁻¹ + (1/σ²) XXᵀ.

    Predictive distribution is:

    P(f*|x*, X, y) = ∫ f(x*|w) P(w|X, y) dw

                   = N((1/σ²) x*ᵀA⁻¹Xy, x*ᵀA⁻¹x*).
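    As a concrete illustration, here is a minimal NumPy sketch of these two formulas; the function and variable names are mine, not from the slides.

        import numpy as np

        def blr_posterior_predictive(X, y, x_star, Sigma_p, sigma2):
            # X: d x n training inputs, y: length-n targets, x_star: length-d test input,
            # Sigma_p: d x d prior covariance, sigma2: noise variance.
            A = np.linalg.inv(Sigma_p) + (X @ X.T) / sigma2   # A = Sigma_p^{-1} + (1/sigma^2) X X^T
            A_inv = np.linalg.inv(A)
            w_mean = A_inv @ X @ y / sigma2                   # posterior mean of w
            f_mean = x_star @ w_mean                          # predictive mean: (1/sigma^2) x*^T A^{-1} X y
            f_var = x_star @ A_inv @ x_star                   # predictive variance: x*^T A^{-1} x*
            return w_mean, f_mean, f_var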


    Increasing Expressiveness

    Use a set of basis functions Φ(x) to project a d-dimensional input x into an m-dimensional feature space:

    e.g. Φ(x) = (1, x, x², ...)

    P(f*|x*, X, y) can be expressed in terms of inner products in feature space:

    Can now use the kernel trick (see the sketch below).

    How many basis functions should we use?
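    A small sketch of the idea, assuming a polynomial basis for scalar inputs (the names phi and k below are illustrative): under the prior, the covariance between function values reduces to an inner product Φ(x)ᵀ Σ_p Φ(x′) in feature space, which is what the kernel trick exploits.

        import numpy as np

        def phi(x, m=4):
            # Polynomial basis: Phi(x) = (1, x, x^2, ..., x^(m-1)) for a scalar input x.
            return np.array([x**j for j in range(m)])

        def k(x1, x2, Sigma_p):
            # Kernel induced by the basis: inner product Phi(x1)^T Sigma_p Phi(x2) in feature space.
            return phi(x1) @ Sigma_p @ phi(x2)

        Sigma_p = np.eye(4)
        print(k(0.5, -0.2, Sigma_p))   # prior covariance between f(0.5) and f(-0.2)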


    Part II

    Regression: The Function-Space View


    Gaussian Processes: Definition

    A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

    Consistency: if the GP specifies (y^(1), y^(2)) ∼ N(µ, Σ), then it must also specify y^(1) ∼ N(µ₁, Σ₁₁).

    A GP is completely specified by a mean function and a positive definite covariance function.


    Gaussian Processes: A Distribution over Functions

    e.g.  Choose mean function zero, and covariance function:

    K_pq = Cov(f(x^(p)), f(x^(q))) = K(x^(p), x^(q))

    For any set of inputs x^(1), ..., x^(n) we may compute K, which defines a joint distribution over function values:

    f(x^(1)), ..., f(x^(n)) ∼ N(0, K).

    Therefore a GP specifies a distribution over functions.


    Gaussian Processes: Simple Example

    Can obtain a GP from the Bayesian linear regression model:

    f(x) = xᵀw with w ∼ N(0, Σ_p).

    Mean function is given by:

    E[f(x)] = xᵀE[w] = 0.

    Covariance function is given by:

    E[f(x) f(x′)] = xᵀE[wwᵀ]x′ = xᵀΣ_p x′.


    The Covariance Function

    Specifies the covariance between pairs of random variables.

    e.g.  Squared exponential covariance function:

    Cov(f(x^(p)), f(x^(q))) = K(x^(p), x^(q)) = exp(−½ |x^(p) − x^(q)|²).

    [Figure: K(x^(p) = 5, x^(q)) as a function of x^(q).]
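    A minimal NumPy sketch of this covariance function (the name sq_exp_kernel is mine); it is reused in the later sketches.

        import numpy as np

        def sq_exp_kernel(X1, X2):
            # Squared exponential covariance: K[p, q] = exp(-0.5 * |x^(p) - x^(q)|^2).
            # X1: n1 x d array of inputs, X2: n2 x d array of inputs.
            sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
            return np.exp(-0.5 * sq_dists)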


    Gaussian Process Prior

    Given a set of inputs x^(1), ..., x^(n) we may draw samples f(x^(1)), ..., f(x^(n)) from the GP prior:

    f(x^(1)), ..., f(x^(n)) ∼ N(0, K).

    Four samples:

    [Figure: four samples from the prior; axes: input x vs. output f(x).]
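    A sketch of drawing such samples with NumPy, using sq_exp_kernel from above (the small jitter term is my addition, for numerical stability).

        import numpy as np

        x = np.linspace(0, 1, 100).reshape(-1, 1)           # inputs x^(1), ..., x^(n)
        K = sq_exp_kernel(x, x) + 1e-9 * np.eye(len(x))     # n x n covariance matrix, with jitter
        prior_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=4)   # four draws from N(0, K)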


    Posterior: Noise-Free Observations (1)

    Given noise-free training data:

    D = {x^(i), f^(i) | i = 1, ..., n} = {X, f}.

    Want to make predictions f* at test points X*.

    According to the GP prior, the joint distribution of f and f* is:

    [f; f*] ∼ N( 0, [K(X, X), K(X, X*); K(X*, X), K(X*, X*)] ).


    Posterior: Noise-Free Observations (2)

    Condition {X*, f*} on D = {X, f} to obtain the posterior.

    Restrict the prior to contain only functions which agree with D.

    The posterior, P(f*|X*, X, f), is Gaussian with:

    µ = K(X*, X) K(X, X)⁻¹ f,   and

    Σ = K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*).
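    A sketch of these two formulas, again using sq_exp_kernel from above (solving linear systems rather than forming the inverse explicitly is my choice).

        import numpy as np

        def gp_posterior_noise_free(X, f, X_star):
            # X: n x d training inputs, f: length-n noise-free observations, X_star: m x d test inputs.
            K = sq_exp_kernel(X, X)                            # K(X, X)
            K_s = sq_exp_kernel(X_star, X)                     # K(X*, X)
            K_ss = sq_exp_kernel(X_star, X_star)               # K(X*, X*)
            mu = K_s @ np.linalg.solve(K, f)                   # K(X*, X) K(X, X)^{-1} f
            Sigma = K_ss - K_s @ np.linalg.solve(K, K_s.T)     # K(X*, X*) - K(X*, X) K(X, X)^{-1} K(X, X*)
            return mu, Sigma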


    Posterior: Noise-Free Observations (3)

    [Figure: samples from the posterior; axes: input x vs. output f(x).]

    Samples all agree with the observations D = {X, f}.

    Greatest variance is in regions with few training points.


    Prediction: Noisy Observations

    Typically we have noisy observations:

    D = {X, y},   where y = f + ε.

    Assume additive noise ε ∼ N(0, σ²I).

    Conditioning on D = {X, y} gives a Gaussian with:

    µ = K(X*, X)[K(X, X) + σ²I]⁻¹ y,   and

    Σ = K(X*, X*) − K(X*, X)[K(X, X) + σ²I]⁻¹ K(X, X*).
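    A sketch of the noisy case, identical to the noise-free version except for the σ²I term (sigma2 below is the assumed noise variance).

        import numpy as np

        def gp_predict(X, y, X_star, sigma2):
            # GP predictive mean and covariance at X_star, given noisy targets y = f + eps.
            Ky = sq_exp_kernel(X, X) + sigma2 * np.eye(len(X))   # K(X, X) + sigma^2 I
            K_s = sq_exp_kernel(X_star, X)                       # K(X*, X)
            K_ss = sq_exp_kernel(X_star, X_star)                 # K(X*, X*)
            mu = K_s @ np.linalg.solve(Ky, y)
            Sigma = K_ss - K_s @ np.linalg.solve(Ky, K_s.T)
            return mu, Sigma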


    Model Selection: Hyperparameters

    e.g.  the ARD covariance function:

    k(x^(p), x^(q)) = exp( −(x^(p) − x^(q))² / (2θ²) ).

    How best to choose  θ?

    [Figure: samples from the posterior for θ = 0.1, θ = 0.3, and θ = 0.5; axes: input x vs. output f(x).]
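    A sketch of this covariance function, a lengthscale-θ variant of sq_exp_kernel above (the parameter name theta is mine).

        import numpy as np

        def ard_kernel(X1, X2, theta):
            # k(x^(p), x^(q)) = exp(-(x^(p) - x^(q))^2 / (2 theta^2)); theta is the lengthscale.
            sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
            return np.exp(-0.5 * sq_dists / theta ** 2)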


    Model Selection: Optimizing Marginal Likelihood (1)

    In the absence of a strong prior P(θ), the posterior for hyperparameter θ is proportional to the marginal likelihood:

    P(θ|X, y) ∝ P(y|X, θ).

    Choose θ to optimize the marginal log-likelihood:

    log P(y|X, θ) = −½ log |K(X, X) + σ²I| − ½ yᵀ(K(X, X) + σ²I)⁻¹ y − (n/2) log 2π.
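    A sketch of evaluating this expression and choosing θ by a simple grid search, using ard_kernel from above (slogdet and the grid are my choices; the slides do not specify an optimizer).

        import numpy as np

        def log_marginal_likelihood(X, y, theta, sigma2):
            # log P(y | X, theta) for the zero-mean GP with ARD kernel and noise variance sigma2.
            n = len(y)
            Ky = ard_kernel(X, X, theta) + sigma2 * np.eye(n)    # K(X, X) + sigma^2 I
            _, logdet = np.linalg.slogdet(Ky)
            return -0.5 * logdet - 0.5 * y @ np.linalg.solve(Ky, y) - 0.5 * n * np.log(2 * np.pi)

        # e.g. pick theta from a grid (X, y, sigma2 assumed given):
        # thetas = np.linspace(0.05, 1.0, 50)
        # theta_ml = max(thetas, key=lambda t: log_marginal_likelihood(X, y, t, sigma2))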


    Model Selection: Optimizing Marginal Likelihood (2)

    θ_ML = 0.3255:

    [Figure: samples from the posterior, θ = 0.3255; axes: input x vs. output f(x).]

    Using θ_ML is an approximation to the true Bayesian method of integrating over all θ values weighted by their posterior.


    References

    1. Carl Edward Rasmussen. Gaussian Processes in Machine Learning. Machine Learning Summer School, Tübingen, 2003. http://www.kyb.tuebingen.mpg.de/~carl/mlss03/

    2. Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. Forthcoming.

    3. Carl Edward Rasmussen. The Gaussian Process Website. http://www.gatsby.ucl.ac.uk/~edward/gp/
