Introduction to GPR



    Introduction to Gaussian Process Regression

    Hanna M. Wallach

    [email protected]

    January 25, 2005


    Outline

    Regression: weight-space view

    Regression: function-space view (Gaussian processes)

    Weight-space and function-space correspondence

    Making predictions

    Model selection: hyperparameters


    Supervised Learning: Regression (1)

    [Figure: underlying function and noisy data, with training data marked; axes: input x vs. output f(x).]

    Assume an underlying process which generates “clean” data.

    Goal: recover underlying process from noisy observed data.


    Supervised Learning: Regression (2)

    Training data are D = {x^(i), y^(i) | i = 1, ..., n}.

    Each input is a vector x of dimension d.

    Each target is a real-valued scalar y = f(x) + noise.

    Collect inputs in a d × n matrix, X, and targets in a vector, y:

    D = {X, y}.

    Wish to infer f* for an unseen input x*, using P(f*|x*, D).


    Gaussian Process Models: Inference in Function Space

    [Figure: samples from the prior (left) and samples from the posterior (right); axes: input x vs. output f(x).]

    A Gaussian process defines a distribution over functions.

    Inference takes place directly in function space.


    Part I

    Regression: The Weight-Space View


    Bayesian Linear Regression (1)

    [Figure: training data; axes: input x vs. output f(x).]

    Assuming noise ε ∼ N(0, σ²), the linear regression model is:

    f(x|w) = xᵀw,   y = f + ε.


    Bayesian Linear Regression (2)

    Likelihood of parameters is:

    P(y|X, w) = N(Xᵀw, σ²I).

    Assume a Gaussian prior over parameters:

    P(w) = N(0, Σ_p).

    Apply Bayes’ theorem to obtain posterior:

    P(w|y, X) ∝ P(y|X, w) P(w).


    Bayesian Linear Regression (3)

    Posterior distribution over w is:

    P(w|y, X) = N((1/σ²) A⁻¹Xy, A⁻¹),   where A = Σ_p⁻¹ + (1/σ²) XXᵀ.

    Predictive distribution is:

    P(f*|x*, X, y) = ∫ f(x*|w) P(w|X, y) dw

                   = N((1/σ²) x*ᵀA⁻¹Xy, x*ᵀA⁻¹x*).
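    As a concrete illustration, here is a minimal NumPy sketch of these two formulas; the function and variable names are mine, not from the slides.

        import numpy as np

        def blr_posterior_predictive(X, y, x_star, Sigma_p, sigma2):
            # X: d x n training inputs, y: length-n targets, x_star: length-d test input,
            # Sigma_p: d x d prior covariance, sigma2: noise variance.
            A = np.linalg.inv(Sigma_p) + (X @ X.T) / sigma2   # A = Sigma_p^{-1} + (1/sigma^2) X X^T
            A_inv = np.linalg.inv(A)
            w_mean = A_inv @ X @ y / sigma2                   # posterior mean of w
            f_mean = x_star @ w_mean                          # predictive mean: (1/sigma^2) x*^T A^{-1} X y
            f_var = x_star @ A_inv @ x_star                   # predictive variance: x*^T A^{-1} x*
            return w_mean, f_mean, f_var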


    Increasing Expressiveness

    Use a set of basis functions Φ(x) to project a d-dimensional input x into an m-dimensional feature space:

    e.g. Φ(x) = (1, x, x², ...)

    P(f*|x*, X, y) can be expressed in terms of inner products in feature space:

    Can now use the kernel trick (see the sketch below).

    How many basis functions should we use?
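    A small sketch of the idea, assuming a polynomial basis for scalar inputs (the names phi and k below are illustrative): under the prior, the covariance between function values reduces to an inner product Φ(x)ᵀ Σ_p Φ(x′) in feature space, which is what the kernel trick exploits.

        import numpy as np

        def phi(x, m=4):
            # Polynomial basis: Phi(x) = (1, x, x^2, ..., x^(m-1)) for a scalar input x.
            return np.array([x**j for j in range(m)])

        def k(x1, x2, Sigma_p):
            # Kernel induced by the basis: inner product Phi(x1)^T Sigma_p Phi(x2) in feature space.
            return phi(x1) @ Sigma_p @ phi(x2)

        Sigma_p = np.eye(4)
        print(k(0.5, -0.2, Sigma_p))   # prior covariance between f(0.5) and f(-0.2)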


    Part II

    Regression: The Function-Space View


    Gaussian Processes: Definition

    A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

    Consistency: if the GP specifies (y^(1), y^(2)) ∼ N(µ, Σ), then it must also specify y^(1) ∼ N(µ₁, Σ₁₁).

    A GP is completely specified by a mean function and a positive definite covariance function.


    Gaussian Processes: A Distribution over Functions

    e.g.  Choose mean function zero, and covariance function:

    K_pq = Cov(f(x^(p)), f(x^(q))) = K(x^(p), x^(q))

    For any set of inputs x^(1), ..., x^(n) we may compute K, which defines a joint distribution over function values:

    f(x^(1)), ..., f(x^(n)) ∼ N(0, K).

    Therefore a GP specifies a distribution over functions.


    Gaussian Processes: Simple Example

    Can obtain a GP from the Bayesian linear regression model:

    f(x) = xᵀw with w ∼ N(0, Σ_p).

    Mean function is given by:

    E[f(x)] = xᵀE[w] = 0.

    Covariance function is given by:

    E[f(x) f(x′)] = xᵀE[wwᵀ]x′ = xᵀΣ_p x′.


    The Covariance Function

    Specifies the covariance between pairs of random variables.

    e.g.  Squared exponential covariance function:

    Cov(f(x^(p)), f(x^(q))) = K(x^(p), x^(q)) = exp(−½ |x^(p) − x^(q)|²).

    [Figure: K(x^(p) = 5, x^(q)) as a function of x^(q).]
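    A minimal NumPy sketch of this covariance function (the name sq_exp_kernel is mine); it is reused in the later sketches.

        import numpy as np

        def sq_exp_kernel(X1, X2):
            # Squared exponential covariance: K[p, q] = exp(-0.5 * |x^(p) - x^(q)|^2).
            # X1: n1 x d array of inputs, X2: n2 x d array of inputs.
            sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
            return np.exp(-0.5 * sq_dists)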


    Gaussian Process Prior

    Given a set of inputs x^(1), ..., x^(n) we may draw samples f(x^(1)), ..., f(x^(n)) from the GP prior:

    f(x^(1)), ..., f(x^(n)) ∼ N(0, K).

    Four samples:

    [Figure: four samples from the prior; axes: input x vs. output f(x).]
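    A sketch of drawing such samples with NumPy, using sq_exp_kernel from above (the small jitter term is my addition, for numerical stability).

        import numpy as np

        x = np.linspace(0, 1, 100).reshape(-1, 1)           # inputs x^(1), ..., x^(n)
        K = sq_exp_kernel(x, x) + 1e-9 * np.eye(len(x))     # n x n covariance matrix, with jitter
        prior_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=4)   # four draws from N(0, K)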


    Posterior: Noise-Free Observations (1)

    Given noise-free training data:

    D = {x^(i), f^(i) | i = 1, ..., n} = {X, f}.

    Want to make predictions f* at test points X*.

    According to the GP prior, the joint distribution of f and f* is:

    [f; f*] ∼ N( 0, [K(X, X), K(X, X*); K(X*, X), K(X*, X*)] ).


    Posterior: Noise-Free Observations (2)

    Condition {X*, f*} on D = {X, f} to obtain the posterior.

    Restrict the prior to contain only functions which agree with D.

    The posterior, P(f*|X*, X, f), is Gaussian with:

    µ = K(X*, X) K(X, X)⁻¹ f,   and

    Σ = K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*).
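    A sketch of these two formulas, again using sq_exp_kernel from above (solving linear systems rather than forming the inverse explicitly is my choice).

        import numpy as np

        def gp_posterior_noise_free(X, f, X_star):
            # X: n x d training inputs, f: length-n noise-free observations, X_star: m x d test inputs.
            K = sq_exp_kernel(X, X)                            # K(X, X)
            K_s = sq_exp_kernel(X_star, X)                     # K(X*, X)
            K_ss = sq_exp_kernel(X_star, X_star)               # K(X*, X*)
            mu = K_s @ np.linalg.solve(K, f)                   # K(X*, X) K(X, X)^{-1} f
            Sigma = K_ss - K_s @ np.linalg.solve(K, K_s.T)     # K(X*, X*) - K(X*, X) K(X, X)^{-1} K(X, X*)
            return mu, Sigma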


    Posterior: Noise-Free Observations (3)

    [Figure: samples from the posterior; axes: input x vs. output f(x).]

    Samples all agree with the observations D = {X, f}.

    Greatest variance is in regions with few training points.


    Prediction: Noisy Observations

    Typically we have noisy observations:

    D = {X, y},   where y = f + ε.

    Assume additive noise ε ∼ N(0, σ²I).

    Conditioning on D = {X, y} gives a Gaussian with:

    µ = K(X*, X)[K(X, X) + σ²I]⁻¹ y,   and

    Σ = K(X*, X*) − K(X*, X)[K(X, X) + σ²I]⁻¹ K(X, X*).
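    A sketch of the noisy case, identical to the noise-free version except for the σ²I term (sigma2 below is the assumed noise variance).

        import numpy as np

        def gp_predict(X, y, X_star, sigma2):
            # GP predictive mean and covariance at X_star, given noisy targets y = f + eps.
            Ky = sq_exp_kernel(X, X) + sigma2 * np.eye(len(X))   # K(X, X) + sigma^2 I
            K_s = sq_exp_kernel(X_star, X)                       # K(X*, X)
            K_ss = sq_exp_kernel(X_star, X_star)                 # K(X*, X*)
            mu = K_s @ np.linalg.solve(Ky, y)
            Sigma = K_ss - K_s @ np.linalg.solve(Ky, K_s.T)
            return mu, Sigma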


    Model Selection: Hyperparameters

    e.g.  the ARD covariance function:

    k(x^(p), x^(q)) = exp( −(x^(p) − x^(q))² / (2θ²) ).

    How best to choose  θ?

    [Figure: samples from the posterior for θ = 0.1, θ = 0.3, and θ = 0.5; axes: input x vs. output f(x).]
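    A sketch of this covariance function, a lengthscale-θ variant of sq_exp_kernel above (the parameter name theta is mine).

        import numpy as np

        def ard_kernel(X1, X2, theta):
            # k(x^(p), x^(q)) = exp(-(x^(p) - x^(q))^2 / (2 theta^2)); theta is the lengthscale.
            sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
            return np.exp(-0.5 * sq_dists / theta ** 2)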


    Model Selection: Optimizing Marginal Likelihood (1)

    In the absence of a strong prior P(θ), the posterior for hyperparameter θ is proportional to the marginal likelihood:

    P(θ|X, y) ∝ P(y|X, θ).

    Choose θ to optimize the marginal log-likelihood:

    log P(y|X, θ) = −½ log |K(X, X) + σ²I| − ½ yᵀ(K(X, X) + σ²I)⁻¹ y − (n/2) log 2π.
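    A sketch of evaluating this expression and choosing θ by a simple grid search, using ard_kernel from above (slogdet and the grid are my choices; the slides do not specify an optimizer).

        import numpy as np

        def log_marginal_likelihood(X, y, theta, sigma2):
            # log P(y | X, theta) for the zero-mean GP with ARD kernel and noise variance sigma2.
            n = len(y)
            Ky = ard_kernel(X, X, theta) + sigma2 * np.eye(n)    # K(X, X) + sigma^2 I
            _, logdet = np.linalg.slogdet(Ky)
            return -0.5 * logdet - 0.5 * y @ np.linalg.solve(Ky, y) - 0.5 * n * np.log(2 * np.pi)

        # e.g. pick theta from a grid (X, y, sigma2 assumed given):
        # thetas = np.linspace(0.05, 1.0, 50)
        # theta_ml = max(thetas, key=lambda t: log_marginal_likelihood(X, y, t, sigma2))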


    Model Selection: Optimizing Marginal Likelihood (2)

    θ_ML = 0.3255:

    [Figure: samples from the posterior, θ = 0.3255; axes: input x vs. output f(x).]

    Using θ_ML is an approximation to the true Bayesian method of integrating over all θ values weighted by their posterior.


    References

    1. Carl Edward Rasmussen. Gaussian Processes in Machine Learning. Machine Learning Summer School, Tübingen, 2003. http://www.kyb.tuebingen.mpg.de/~carl/mlss03/

    2. Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. Forthcoming.

    3. Carl Edward Rasmussen. The Gaussian Process Website. http://www.gatsby.ucl.ac.uk/~edward/gp/
