
NUS School of Computing Summer School

Gaussian Process Methods in Machine Learning

Jonathan Scarlett ([email protected])

Lecture 1: Gaussian Processes, Kernels, and Regression

August 2018


License Information

These slides are an edited version of those for EE-620 Advanced Topics in Machine Learning at EPFL (taught by Prof. Volkan Cevher, LIONS group), with the following license information:

• This work is released under a Creative Commons License with the following terms:
• Attribution
  - The licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.
• Non-Commercial
  - The licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes – unless they get the licensor’s permission.
• Share Alike
  - The licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor’s work.
• Full Text of the License

Outline of Lectures

• Lecture 0: Bayesian Modeling and Regression

• Lecture 1: Gaussian Processes, Kernels, and Regression

• Lecture 2: Optimization with Gaussian Processes

• Lecture 3: Advanced Bayesian Optimization Methods

Outline: This Lecture

• This lecture:
  1. Gaussian processes
  2. Kernels
  3. Posterior updates
  4. GP regression

Two Paths to the Same Predictor

Figure: Two paths to the same predictor. One path goes Linear Model → Least Squares → Regularize → Kernel Trick; the other goes Gaussian Process Model → Bayesian Update. Both arrive at the prediction

    y(x) = k(x)^T (K + σ²I)^{-1} y

A Bayesian Approach to Regression

• Recall: We wish to find an accurate mapping from input variables x to output variables y based on samples {(x_t, y_t)}_{t=1}^T.

• A Bayesian approach:
  - Model the (x, y) relationship as being of the form

      y = f(x) + z    (1)

    where z is random noise
  - Place a prior distribution on f to model smoothness

• Note: A distribution on f is a distribution over functions (recall stochastic processes)
  - For any point x, the function value f(x) is a random variable
  - For multiple points (x_1, x_2, x_3), the triplet (f(x_1), f(x_2), f(x_3)) has a joint distribution (and similarly for collections of 4, 5, 6, ... function values)
  - Intuition: If nearby points are more highly correlated, then f should be smooth

Approach:

• Use a model y = f(x) + z with a random function f

• Use Bayesian inference to predict unseen values given the observed ones

Smoothness: Enter Gaussian Processes

• We will study a versatile class of random functions called Gaussian processes (GPs)

• Preview: A GP is a random function f such that any collection of points f(x_1), ..., f(x_N) is jointly Gaussian

• First, let’s revise some basics of multivariate Gaussian distributions

Multivariate Gaussian Basics (I)

• Univariate Gaussian probability density function (PDF) for Z ∼ N(µ, σ²):

    p(z) = (1 / √(2πσ²)) exp( −(z − µ)² / (2σ²) )

• Multivariate Gaussian probability density function (PDF) for Z ∼ N(µ, Σ) in R^d:

    p(z) = (1 / √((2π)^d det Σ)) exp( −(1/2) (z − µ)^T Σ^{-1} (z − µ) )

• Basic properties:
  - If Z_1 ∼ N(µ_1, Σ_1) and Z_2 ∼ N(µ_2, Σ_2) are independent, then Z_1 + Z_2 ∼ N(µ_1 + µ_2, Σ_1 + Σ_2)
  - If Z ∼ N(µ, σ²), then cZ ∼ N(cµ, c²σ²) for constant c
  - If Z ∼ N(µ, Σ), then AZ ∼ N(Aµ, AΣA^T) for matrix A
  - If Z ∼ N(µ, Σ), then Z_S ∼ N(µ_S, Σ_SS) for any index set S
  - Jointly Gaussian RVs are independent ⇐⇒ they are uncorrelated
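A quick empirical check of the linear-transformation property above (an illustrative NumPy snippet, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 3.0]])

Z = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of Z ~ N(mu, Sigma)
AZ = Z @ A.T                                           # corresponding samples of AZ

print(AZ.mean(axis=0))           # ≈ A µ = [-1, -6]
print(np.cov(AZ, rowvar=False))  # ≈ A Σ A^T = [[4, 4.5], [4.5, 9]]
```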

Multivariate Gaussian Basics (II)

• For multivariate Gaussians, µ determines the location and Σ determines the shape of the density function

• Examples (independent on left, highly correlated on right):

Figure: Examples of 2D Gaussian PDFs (taken from http://www.philender.com/courses/multivariate/notes2/norm1.html)

Multivariate Gaussian Basics (III)

• Particularly important this lecture: Conditioning

• Let’s split Z ∼ N(µ, Σ) into Z^T = [Z_1^T  Z_2^T], and accordingly

    µ = [µ_1; µ_2],    Σ = [Σ_11  Σ_12; Σ_21  Σ_22].

• Key conditioning property:

    (Z_1 | Z_2 = z_2) ∼ N(µ′, Σ′),

  where

    µ′ = µ_1 + Σ_12 Σ_22^{-1} (z_2 − µ_2)
    Σ′ = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.

• Unique fact: The conditional covariance Σ′ has no dependence on z_2
  - Does not hold for general (non-Gaussian) random variables
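To make the formula concrete, here is a minimal NumPy sketch of the conditioning computation (the function name, index convention, and the 2-D example values are illustrative, not from the slides):

```python
import numpy as np

def gaussian_condition(mu, Sigma, obs_idx, z2):
    """Mean and covariance of Z1 | Z2 = z2 for Z ~ N(mu, Sigma).

    obs_idx indexes the observed block Z2; the remaining entries form Z1.
    """
    n = len(mu)
    free_idx = [i for i in range(n) if i not in set(obs_idx)]
    mu1, mu2 = mu[free_idx], mu[obs_idx]
    S11 = Sigma[np.ix_(free_idx, free_idx)]
    S12 = Sigma[np.ix_(free_idx, obs_idx)]
    S21 = Sigma[np.ix_(obs_idx, free_idx)]
    S22 = Sigma[np.ix_(obs_idx, obs_idx)]
    A = S12 @ np.linalg.inv(S22)            # Σ12 Σ22^{-1}
    mu_cond = mu1 + A @ (z2 - mu2)          # µ' = µ1 + Σ12 Σ22^{-1} (z2 − µ2)
    Sigma_cond = S11 - A @ S21              # Σ' = Σ11 − Σ12 Σ22^{-1} Σ21
    return mu_cond, Sigma_cond

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
m, S = gaussian_condition(mu, Sigma, obs_idx=[1], z2=np.array([1.5]))
print(m, S)   # mean ≈ 1.2, variance ≈ 0.36
```

This is exactly the computation that reappears later as the GP posterior update, with the observed block playing the role of the noisy measurements.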

Smoothness: Enter Gaussian Processes

• Gaussian process:
  - A versatile class of random functions f
  - Defining feature: The joint distribution of any finite set of function values is jointly Gaussian
  - Written as

      f(·) ∼ GP(µ(·), k(·, ·)),

    where:
    - µ(x): Mean of f(x)
    - k(x, x′): Covariance between f(x) and f(x′)
    - k(x, x): Variance of f(x)
  - If the domain is finite, f is just a multivariate Gaussian. However, here the domain may be continuous (e.g., x ∈ [0, 1]^d).

• Recall: A Gaussian random variable is fully specified by its mean and variance. The analogy is that a GP is fully specified by µ and k.

• For simplicity, we will focus on the zero-mean case (often without loss of generality)
  - The general case is no more difficult – just extra notation!

Examples of GPs

• Different covariance functions (kernels) can lead to very different behavior:

Figure: Functions sampled from GP(0, kSE) and from GP(0, kMatérn)
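As a minimal illustration of how such samples are generated, the following NumPy sketch draws functions from a zero-mean GP prior with a squared exponential kernel (the helper name k_se, the grid, and the length-scale are illustrative choices, not the lecture's code):

```python
import numpy as np

def k_se(x1, x2, ell=0.2):
    """Squared exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-d**2 / (2 * ell**2))

x = np.linspace(0.0, 1.0, 200)           # evaluation grid
K = k_se(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
L = np.linalg.cholesky(K)                # K = L L^T

rng = np.random.default_rng(0)
samples = L @ rng.standard_normal((len(x), 3))   # 3 draws from GP(0, k_SE) on the grid
# Each column is one sampled function; smaller `ell` gives more rapidly varying samples.
```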

Exercise

• Question 1. I have written k(x, x′) for Cov[f(x), f(x′)] using the same notation as a kernel that measures similarity. What does the covariance between function values have to do with similarity between inputs?

• Question 2. Suppose f(x) = θ^T x for i.i.d. Gaussian θ, i.e., θ ∼ N(0, I). What kernel does this correspond to?

• Question 3. Consider the 1D case. Explain why a covariance function decaying with distance, say k(x, x′) = e^{−(x−x′)²}, should produce a smooth function.

(Note: The covariance of Z_1 and Z_2 is equal to E[(Z_1 − µ_1)(Z_2 − µ_2)], which also implies Cov[Z, Z] = Var[Z])

Applications of GPs (I)

• GPs are well-suited to (but not restricted to) modeling “smoothly-varying” data

• Example 1. Modeling spatial environmental data [Gonzalez et al., 2007]

Applications of GPs (II)

• Example 2. Modeling performance as a function of parameters [Sangbae Kim, MIT]

Applications of GPs (III)

• Example 3. Modeling time series data [Ploysuwan and Chaisricharoen, 2017]

The Bayesian Mechanics (I)

Figure: Mean predictions ± 3 standard deviations

• Processing noisy observations in the Bayesian setting: y_t = f(x_t) + z_t, where z_t ∼ N(0, σ²)
  - Given y_t = [y_1, ..., y_t] at points x_{1:t}, the posterior is also a GP:

      µ_{t+1}(x) = k_t(x)^T (K_t + σ² I_t)^{-1} y_t
      σ²_{t+1}(x, x′) = k(x, x′) − k_t(x)^T (K_t + σ² I_t)^{-1} k_t(x′),

    where K_t = [k(x, x′)]_{x, x′ ∈ x_{1:t}} and k_t(x) = [k(x_i, x)]_{i=1}^t
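To connect these formulas to computation, here is a minimal NumPy sketch of the posterior update for 1-D inputs (function names, the toy data, and the noise level are illustrative, not the lecture's code):

```python
import numpy as np

def k_se(x1, x2, ell=0.2):
    """Squared exponential kernel on 1-D inputs."""
    return np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def gp_posterior(x_train, y_train, x_test, kernel=k_se, noise_var=0.01):
    """Posterior mean and variance of a zero-mean GP at the test points."""
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))   # K_t + σ² I_t
    k_star = kernel(x_train, x_test)                                  # k_t(x) for each test x
    K_inv_y = np.linalg.solve(K, y_train)
    K_inv_kstar = np.linalg.solve(K, k_star)
    mean = k_star.T @ K_inv_y                                         # µ_{t+1}(x)
    var = 1.0 - np.sum(k_star * K_inv_kstar, axis=0)                  # σ²_{t+1}(x, x); k(x, x) = 1 for SE
    return mean, var

rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 8)
y_tr = np.sin(2 * np.pi * x_tr) + 0.1 * rng.standard_normal(8)   # hypothetical noisy observations
x_te = np.linspace(0, 1, 100)
mu, var = gp_posterior(x_tr, y_tr, x_te)
# Plotting mu ± 3*np.sqrt(var) reproduces bands like those in the figure above.
```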

The Bayesian Mechanics (II)

• The more points we observe, the more the true function is limited to those exhibiting certain behavior (with high probability). Illustration from [Schulz et al., 2018]:

Derivation of Posterior Distribution

• Outline of posterior mean/covariance derivation:
  - Suppose we sample X = (x_1, ..., x_t) and observe y = (y_1, ..., y_t). For any x′ with value f′ = f(x′), we have the joint distribution

      [f′; y] ∼ N( 0, [ k(x′, x′)   k(x′, X) ;  k(X, x′)   k(X, X) + σ² I_t ] )    (2)

    where:
    - k(x′, X) is a vector with i-th entry k(x′, x_i) (and k(X, x′) is its transpose)
    - k(X, X) is a matrix with (i, j)-th entry k(x_i, x_j)
    - I_t is the identity matrix (this term comes from the N(0, σ²) sampling noise)
    - The mean of 0 is because we are focusing on zero-mean GPs
  - We get the formulas on the previous slides by computing the conditional distribution of f′ given y using the conditional Gaussian formula shown earlier
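• Filling in that last step explicitly: apply the conditioning formula from earlier with Z_1 = f′, Z_2 = y, Σ_11 = k(x′, x′), Σ_12 = k(x′, X), Σ_22 = k(X, X) + σ² I_t, and zero means, giving

      E[f′ | y] = k(x′, X) (k(X, X) + σ² I_t)^{-1} y
      Var[f′ | y] = k(x′, x′) − k(x′, X) (k(X, X) + σ² I_t)^{-1} k(X, x′),

  which are exactly µ_{t+1} and σ²_{t+1} with K_t = k(X, X) and k_t(x′) = k(X, x′).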

Note:

In a GP model, the posterior distribution of unobserved points conditioned on observed points has a simple closed-form expression

Using GPs for Regression

• Once we have computed the posterior from the n samples, the idea is simple:

Predict any new point x to have corresponding function value µ_n(x)

• Moreover, σ²_n(x) gives an estimate of the uncertainty in this prediction

Advantages and Disadvantages

• Key advantages of GPs:
  - Explicit estimates of uncertainty in the prediction
  - Wide variety of kernels available for rich modeling of functions
  - Often highly effective even when limited data is available

• Disadvantages:
  - Computing the (exact) posterior takes O(n³) time
  - Choosing a good kernel can be difficult
  - Difficulties in scaling to a large input dimension d

Kernels for GPs

• If the input x only takes one of finitely many values, “manually” specifying the covariance between all possible pairs might be possible.

• Otherwise, one chooses a kernel function k(x, x′) that specifies the covariance corresponding to two points
  - The choice of kernel is application-dependent
  - One can also learn a good kernel from the data (more later!)

Stationarity

• An important class of kernels is the class of stationary kernels:

    k(x, x′) = k(τ),    τ = x − x′

• In other words, the kernel depends on x, x′ only through their difference
  - Special case: Setting x = x′, we find that the variance is the same at every point

• Intuition: The local statistics in one part of the domain are the same as the localstatistics in another part of the domain (translation invariance)

Commonly-Used Examples (I)

• Common examples:

  - Squared exponential (SE):

      k_SE(x, x′) = exp( −‖x − x′‖² / (2ℓ²) )

    - ℓ is a length-scale (roughly, the scale over which the function varies)
    - Also known as Gaussian kernel or Radial basis function (RBF) kernel

  - Matérn:

      k_Matérn(x, x′) = (2^{1−ν} / Γ(ν)) ( √(2ν) ‖x − x′‖ / ℓ )^ν K_ν( √(2ν) ‖x − x′‖ / ℓ ),

    - Γ is the Gamma function and K_ν is the modified Bessel function (of the second kind)
    - ν is a smoothness parameter (higher = more smooth)
    - ν → ∞ recovers SE kernel
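A small NumPy/SciPy sketch of these two kernels as functions of the distance r = ‖x − x′‖ (the helper names and default parameters are illustrative; scipy.special supplies kv and gamma):

```python
import numpy as np
from scipy.special import gamma, kv   # kv = modified Bessel function of the second kind

def k_se(r, ell=1.0):
    """Squared exponential kernel evaluated at distance r."""
    return np.exp(-np.asarray(r, dtype=float)**2 / (2 * ell**2))

def k_matern(r, ell=1.0, nu=2.5):
    """Matérn kernel evaluated at distance r."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    out = np.ones_like(r)                        # limit at r = 0 is 1 (the variance)
    nz = r > 0
    scaled = np.sqrt(2 * nu) * r[nz] / ell
    out[nz] = (2**(1 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    return out

r = np.linspace(0, 3, 7)
print(k_se(r))        # decays smoothly with distance
print(k_matern(r))    # similar shape; larger nu brings it closer to k_se
```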

Commonly-Used Examples (II)

• ...and many more!

• A brief guide (and the source of these images): http://www.cs.toronto.edu/~duvenaud/cookbook/index.html

Kernels in More General Machine Learning

• Notes on kernels in machine learning:
  - Many machine learning algorithms depend on the data x_1, ..., x_n only through the inner products 〈x_i, x_j〉
    - Example 1: Ridge regression
    - Example 2: Dual form of Support Vector Machine (SVM)
    - Example 3: Nearest-neighbor methods
  - We know that moving to feature spaces can help, so we could map each x_i → φ(x_i) and apply the algorithm using 〈φ(x_i), φ(x_j)〉
  - A kernel function k(x_i, x_j) can be thought of as an inner product in a possibly implicit feature space (a small explicit example follows below)
    - No need to explicitly map to feature space at all!
    - The implicit space may be infinite-dimensional (e.g., SE and Matérn), so we could not explicitly map to it even if we wanted to
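As a tiny concrete instance of "kernel = inner product in a feature space" (my own example, not from the slides): for x ∈ R², the degree-2 polynomial kernel k(x, x′) = (x^T x′)² equals 〈φ(x), φ(x′)〉 with the explicit map φ(x) = (x_1², √2 x_1 x_2, x_2²). A quick numerical check:

```python
import numpy as np

def k_poly2(x, xp):
    """Degree-2 polynomial kernel (no offset)."""
    return float(x @ xp) ** 2

def phi(x):
    """Explicit feature map realizing k_poly2 on 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k_poly2(x, xp), phi(x) @ phi(xp))   # both equal (1*3 + 2*(-1))^2 = 1.0
```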

Importance of the Kernel Choice

• Different kernel choices can behave very differently (and sometimes very badly):

Figure: GP regression fits with ℓ too small, ℓ too large, and a good choice of ℓ

Notes:

• Choosing a suitable GP kernel is crucial
• (just like choosing a Bayesian prior more generally)
• (or choosing a model even more generally)

Choosing a Kernel

• The choice of kernel can have a considerable impact on the performance

• Principled approach: Learn a kernel from some training data (x_i, y_i) with i = 1, ..., n_T

• Maximum-likelihood:
  - Fix a class of kernels and optimize its parameter(s) (e.g., ℓ)
  - Specifically, search for

      arg max_ℓ  P_ℓ(y_1, ..., y_{n_T})

    where P_ℓ is the probability density for the observations with inputs (x_1, ..., x_{n_T}) and parameter ℓ
  - Exact maximization may be difficult; alternatives include
    - Grid search (a sketch follows below)
    - Random search
    - Iterative optimization
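A minimal sketch of the grid-search option, assuming a zero-mean GP with SE kernel and Gaussian noise: the log marginal likelihood is log P_ℓ(y) = −(1/2) y^T (K_ℓ + σ²I)^{-1} y − (1/2) log det(K_ℓ + σ²I) − (n/2) log 2π, and we simply evaluate it over a grid of candidate length-scales (illustrative code and data, not the lecture's):

```python
import numpy as np

def log_marginal_likelihood(x, y, ell, noise_var=0.01):
    """log P_ell(y) for a zero-mean GP with SE kernel and Gaussian noise."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * ell**2)) + noise_var * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # (K_ell + σ²I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                    # −(1/2) log det via Cholesky
            - 0.5 * len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)   # hypothetical training data
grid = np.logspace(-2, 0, 30)                               # candidate length-scales
best_ell = grid[np.argmax([log_marginal_likelihood(x, y, l) for l in grid])]
print(best_ell)
```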

Classification with GPs I

• GPs can also be used for classification:
  - Input vector x ∈ R^d, binary label y ∈ {1, −1}
  - Example 1: x represents an email, y indicates whether it is spam
  - Example 2: x describes medical test outcomes, y indicates whether the patient has some disease
  - Example 3: x is a robotics parameter configuration, y indicates whether the robot successfully performed a task

• The setup is not too different from regression, but we have discrete rather than continuous observations

Classification with GPs II

• Richer modeling than basic linear classification:

Figure: A nonlinear decision boundary separating points classified as +1 from points classified as −1

Classification with GPs III

• Basic idea:
  - Model the true classification rule as being of the form

      y = {  1    if f(x) ≥ 0
            −1    if f(x) < 0.

  - As with regression, use data {(x_t, y_t)}_{t=1}^n to form a posterior distribution
  - Predict according to the posterior mean

• Key difference: Now the posterior cannot be computed exactly (we only get to observe the sign of f, not the function value)
  - However, Monte Carlo type approximations can be done (a naive sketch follows below)
  - See for example [Chu and Ghahramani, ICML 2005] for more details
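A deliberately naive Monte Carlo sketch for the noiseless sign model above: draw functions from the GP prior jointly at the training and test points, keep only the draws whose signs match all observed labels (rejection sampling), and estimate P(f(x*) ≥ 0 | data) from the accepted draws. This is purely illustrative (toy data, SE kernel, brute-force sampling); practical methods use the likelihood approximations discussed in [Chu and Ghahramani, ICML 2005].

```python
import numpy as np

def k_se(x1, x2, ell=0.3):
    return np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

x_tr = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])   # hypothetical 1-D inputs
y_tr = np.array([1, 1, 1, -1, -1, -1])            # labels flip around x = 0.5
x_te = np.array([0.3, 0.5, 0.7])

x_all = np.concatenate([x_tr, x_te])
K = k_se(x_all, x_all) + 1e-8 * np.eye(len(x_all))
L = np.linalg.cholesky(K)

rng = np.random.default_rng(3)
f = L @ rng.standard_normal((len(x_all), 200_000))               # prior samples of f at all points
ok = np.all(np.sign(f[:len(x_tr)]) == y_tr[:, None], axis=0)     # draws consistent with all labels
p_plus = np.mean(f[len(x_tr):, ok] >= 0, axis=1)                 # estimate of P(f(x*) >= 0 | data)
print(p_plus)   # roughly high, ~0.5, low for the three test points
```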

Further Reading

• Recent tutorial on GP regression: A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions (Schulz et al., 2018)

• Popular GP book: Gaussian Processes for Machine Learning (Rasmussen, 2006)

References

[1] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. J. Mach. Learn. Res., 6(Jul):1019–1041, 2005.

[2] Juan Pablo Gonzalez, Simon E. Cook, Thomas Oberthür, Andy Jarvis, J. Andrew Bagnell, and M. Bernardine Dias. Creating low-cost soil maps for tropical agriculture using Gaussian processes. 2007.

[3] Tuchsanai Ploysuwan and Roungsan Chaisricharoen. Gaussian process kernel crossover for automated forex trading system. In Int. Conf. EE., Telecomms., and Inf. Tech. (ECTI-CON), pages 802–805, 2017.

[4] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.

[5] Eric Schulz, Maarten Speekenbrink, and Andreas Krause. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. J. Math. Psychology, 85:1–16, 2018.

[6] Alex J. Smola and Bernhard Schölkopf. Learning with kernels. GMD-Forschungszentrum Informationstechnik, 1998.
