
NUS School of Computing Summer School

Gaussian Process Methods in Machine Learning

Jonathan Scarlett ([email protected])

Lecture 1: Gaussian Processes, Kernels, and Regression

August 2018


License Information

These slides are an edited version of those for EE-620 Advanced Topics in Machine Learning at EPFL (taught by Prof. Volkan Cevher, LIONS group), with the following license information:

• This work is released under a Creative Commons License with the following terms:
• Attribution
  - The licensor permits others to copy, distribute, display, and perform the work. In return, licensees must give the original authors credit.
• Non-Commercial
  - The licensor permits others to copy, distribute, display, and perform the work. In return, licensees may not use the work for commercial purposes – unless they get the licensor’s permission.
• Share Alike
  - The licensor permits others to distribute derivative works only under a license identical to the one that governs the licensor’s work.
• Full Text of the License

Outline of Lectures

• Lecture 0: Bayesian Modeling and Regression

• Lecture 1: Gaussian Processes, Kernels, and Regression

• Lecture 2: Optimization with Gaussian Processes

• Lecture 3: Advanced Bayesian Optimization Methods

Outline: This Lecture

• This lecture:
  1. Gaussian processes
  2. Kernels
  3. Posterior updates
  4. GP regression

Two Paths to the Same Predictor

Figure: Two paths to the same predictor. One path goes Linear Model → Least Squares → Regularize → Kernel Trick; the other goes Gaussian Process Model → Bayesian Update. Both arrive at the prediction

    y(x) = k(x)^T (K + σ²I)^{-1} y

A Bayesian Approach to Regression

• Recall: We wish to find an accurate mapping from input variables x to output variables y based on samples {(x_t, y_t)}_{t=1}^T.

• A Bayesian approach:
  - Model the (x, y) relationship as being of the form

      y = f(x) + z    (1)

    where z is random noise
  - Place a prior distribution on f to model smoothness

• Note: A distribution on f is a distribution over functions (recall stochastic processes)
  - For any point x, the function value f(x) is a random variable
  - For multiple points (x_1, x_2, x_3), the triplet (f(x_1), f(x_2), f(x_3)) has a joint distribution (and similarly for collections of 4, 5, 6, ... function values)
  - Intuition: If nearby points are more highly correlated, then f should be smooth

Approach:

• Use a model y = f(x) + z with a random function f

• Use Bayesian inference to predict unseen values given the observed ones

Smoothness: Enter Gaussian Processes

• We will study a versatile class of random functions called Gaussian processes (GPs)

• Preview: A GP is a random function f such that any collection of points f(x_1), ..., f(x_N) is jointly Gaussian

• First, let’s revise some basics of multivariate Gaussian distributions

Multivariate Gaussian Basics (I)

• Univariate Gaussian probability density function (PDF) for Z ∼ N(µ, σ²):

    p(z) = (1 / √(2πσ²)) exp( −(z − µ)² / (2σ²) )

• Multivariate Gaussian probability density function (PDF) for Z ∼ N(µ, Σ) in R^d:

    p(z) = (1 / √((2π)^d det Σ)) exp( −(1/2) (z − µ)^T Σ^{-1} (z − µ) )

• Basic properties:
  - If Z_1 ∼ N(µ_1, Σ_1) and Z_2 ∼ N(µ_2, Σ_2) are independent, then Z_1 + Z_2 ∼ N(µ_1 + µ_2, Σ_1 + Σ_2)
  - If Z ∼ N(µ, σ²), then cZ ∼ N(cµ, c²σ²) for constant c
  - If Z ∼ N(µ, Σ), then AZ ∼ N(Aµ, AΣA^T) for matrix A
  - If Z ∼ N(µ, Σ), then Z_S ∼ N(µ_S, Σ_SS) for any index set S
  - Jointly Gaussian RVs are independent ⇐⇒ they are uncorrelated
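A quick empirical check of the linear-transformation property above (an illustrative NumPy snippet, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 3.0]])

Z = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of Z ~ N(mu, Sigma)
AZ = Z @ A.T                                           # corresponding samples of AZ

print(AZ.mean(axis=0))           # ≈ A µ = [-1, -6]
print(np.cov(AZ, rowvar=False))  # ≈ A Σ A^T = [[4, 4.5], [4.5, 9]]
```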

Multivariate Gaussian Basics (II)

• For multivariate Gaussians, µ determines the location and Σ determines the shape of the density function

• Examples (independent on left, highly correlated on right):

Figure: Examples of 2D Gaussian PDFs (taken from http://www.philender.com/courses/multivariate/notes2/norm1.html)

Multivariate Gaussian Basics (III)

• Particularly important this lecture: Conditioning

• Let’s split Z ∼ N(µ, Σ) into Z^T = [Z_1^T  Z_2^T], and accordingly

    µ = [µ_1; µ_2],    Σ = [Σ_11  Σ_12; Σ_21  Σ_22].

• Key conditioning property:

    (Z_1 | Z_2 = z_2) ∼ N(µ′, Σ′),

  where

    µ′ = µ_1 + Σ_12 Σ_22^{-1} (z_2 − µ_2)
    Σ′ = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.

• Unique fact: The conditional covariance Σ′ has no dependence on z_2
  - Does not hold for general (non-Gaussian) random variables
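To make the formula concrete, here is a minimal NumPy sketch of the conditioning computation (the function name, index convention, and the 2-D example values are illustrative, not from the slides):

```python
import numpy as np

def gaussian_condition(mu, Sigma, obs_idx, z2):
    """Mean and covariance of Z1 | Z2 = z2 for Z ~ N(mu, Sigma).

    obs_idx indexes the observed block Z2; the remaining entries form Z1.
    """
    n = len(mu)
    free_idx = [i for i in range(n) if i not in set(obs_idx)]
    mu1, mu2 = mu[free_idx], mu[obs_idx]
    S11 = Sigma[np.ix_(free_idx, free_idx)]
    S12 = Sigma[np.ix_(free_idx, obs_idx)]
    S21 = Sigma[np.ix_(obs_idx, free_idx)]
    S22 = Sigma[np.ix_(obs_idx, obs_idx)]
    A = S12 @ np.linalg.inv(S22)            # Σ12 Σ22^{-1}
    mu_cond = mu1 + A @ (z2 - mu2)          # µ' = µ1 + Σ12 Σ22^{-1} (z2 − µ2)
    Sigma_cond = S11 - A @ S21              # Σ' = Σ11 − Σ12 Σ22^{-1} Σ21
    return mu_cond, Sigma_cond

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
m, S = gaussian_condition(mu, Sigma, obs_idx=[1], z2=np.array([1.5]))
print(m, S)   # mean ≈ 1.2, variance ≈ 0.36
```

This is exactly the computation that reappears later as the GP posterior update, with the observed block playing the role of the noisy measurements.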

Smoothness: Enter Gaussian Processes

• Gaussian process:
  - A versatile class of random functions f
  - Defining feature: The joint distribution of any finite set of function values is jointly Gaussian
  - Written as

      f(·) ∼ GP(µ(·), k(·, ·)),

    where:
    - µ(x): Mean of f(x)
    - k(x, x′): Covariance between f(x) and f(x′)
    - k(x, x): Variance of f(x)
  - If the domain is finite, f is just a multivariate Gaussian. However, here the domain may be continuous (e.g., x ∈ [0, 1]^d).

• Recall: A Gaussian random variable is fully specified by its mean and variance. The analogy is that a GP is fully specified by µ and k.

• For simplicity, we will focus on the zero-mean case (often without loss of generality)
  - The general case is no more difficult – just extra notation!

Examples of GPs

• Different covariance functions (kernels) can lead to very different behavior:

Figure: Functions sampled from GP(0, kSE) and from GP(0, kMatérn)
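As a minimal illustration of how such samples are generated, the following NumPy sketch draws functions from a zero-mean GP prior with a squared exponential kernel (the helper name k_se, the grid, and the length-scale are illustrative choices, not the lecture's code):

```python
import numpy as np

def k_se(x1, x2, ell=0.2):
    """Squared exponential kernel on 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-d**2 / (2 * ell**2))

x = np.linspace(0.0, 1.0, 200)           # evaluation grid
K = k_se(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability
L = np.linalg.cholesky(K)                # K = L L^T

rng = np.random.default_rng(0)
samples = L @ rng.standard_normal((len(x), 3))   # 3 draws from GP(0, k_SE) on the grid
# Each column is one sampled function; smaller `ell` gives more rapidly varying samples.
```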

Exercise

• Question 1. I have written k(x, x′) for Cov[f(x), f(x′)] using the same notation as a kernel that measures similarity. What does the covariance between function values have to do with similarity between inputs?

• Question 2. Suppose f(x) = θ^T x for i.i.d. Gaussian θ, i.e., θ ∼ N(0, I). What kernel does this correspond to?

• Question 3. Consider the 1D case. Explain why a covariance function decaying with distance, say k(x, x′) = e^{−(x−x′)²}, should produce a smooth function.

(Note: The covariance of Z_1 and Z_2 is equal to E[(Z_1 − µ_1)(Z_2 − µ_2)], which also implies Cov[Z, Z] = Var[Z])

Applications of GPs (I)

• GPs are well-suited to (but not restricted to) modeling “smoothly-varying” data

• Example 1. Modeling spatial environmental data [Gonzalez et al., 2007]

Applications of GPs (II)

• Example 2. Modeling performance as a function of parameters [Sangbae Kim, MIT]

Applications of GPs (III)

• Example 3. Modeling time series data [Ploysuwan and Chaisricharoen, 2017]

The Bayesian Mechanics (I)

Figure: Mean predictions ± 3 standard deviations

• Processing noisy observations in the Bayesian setting: y_t = f(x_t) + z_t, where z_t ∼ N(0, σ²)
  - Given y_t = [y_1, ..., y_t] at points x_{1:t}, the posterior is also a GP:

      µ_{t+1}(x) = k_t(x)^T (K_t + σ² I_t)^{-1} y_t
      σ²_{t+1}(x, x′) = k(x, x′) − k_t(x)^T (K_t + σ² I_t)^{-1} k_t(x′),

    where K_t = [k(x, x′)]_{x, x′ ∈ x_{1:t}} and k_t(x) = [k(x_i, x)]_{i=1}^t
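To connect these formulas to computation, here is a minimal NumPy sketch of the posterior update for 1-D inputs (function names, the toy data, and the noise level are illustrative, not the lecture's code):

```python
import numpy as np

def k_se(x1, x2, ell=0.2):
    """Squared exponential kernel on 1-D inputs."""
    return np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def gp_posterior(x_train, y_train, x_test, kernel=k_se, noise_var=0.01):
    """Posterior mean and variance of a zero-mean GP at the test points."""
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))   # K_t + σ² I_t
    k_star = kernel(x_train, x_test)                                  # k_t(x) for each test x
    K_inv_y = np.linalg.solve(K, y_train)
    K_inv_kstar = np.linalg.solve(K, k_star)
    mean = k_star.T @ K_inv_y                                         # µ_{t+1}(x)
    var = 1.0 - np.sum(k_star * K_inv_kstar, axis=0)                  # σ²_{t+1}(x, x); k(x, x) = 1 for SE
    return mean, var

rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 8)
y_tr = np.sin(2 * np.pi * x_tr) + 0.1 * rng.standard_normal(8)   # hypothetical noisy observations
x_te = np.linspace(0, 1, 100)
mu, var = gp_posterior(x_tr, y_tr, x_te)
# Plotting mu ± 3*np.sqrt(var) reproduces bands like those in the figure above.
```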

The Bayesian Mechanics (II)

• The more points we observe, the more the true function is limited to those exhibiting certain behavior (with high probability). Illustration from [Schulz et al., 2018]:

Derivation of Posterior Distribution

• Outline of posterior mean/covariance derivation:
  - Suppose we sample X = (x_1, ..., x_t) and observe y = (y_1, ..., y_t). For any x′ with value f′ = f(x′), we have the joint distribution

      [f′; y] ∼ N( 0, [ k(x′, x′)   k(x′, X) ;  k(X, x′)   k(X, X) + σ² I_t ] )    (2)

    where:
    - k(x′, X) is a vector with i-th entry k(x′, x_i) (and k(X, x′) is its transpose)
    - k(X, X) is a matrix with (i, j)-th entry k(x_i, x_j)
    - I_t is the identity matrix (this term comes from the N(0, σ²) sampling noise)
    - The mean of 0 is because we are focusing on zero-mean GPs
  - We get the formulas on the previous slides by computing the conditional distribution of f′ given y using the conditional Gaussian formula shown earlier
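• Filling in that last step explicitly: apply the conditioning formula from earlier with Z_1 = f′, Z_2 = y, Σ_11 = k(x′, x′), Σ_12 = k(x′, X), Σ_22 = k(X, X) + σ² I_t, and zero means, giving

      E[f′ | y] = k(x′, X) (k(X, X) + σ² I_t)^{-1} y
      Var[f′ | y] = k(x′, x′) − k(x′, X) (k(X, X) + σ² I_t)^{-1} k(X, x′),

  which are exactly µ_{t+1} and σ²_{t+1} with K_t = k(X, X) and k_t(x′) = k(X, x′).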

Note:

In a GP model, the posterior distribution of unobserved points conditioned on observed points has a simple closed-form expression

Using GPs for Regression

• Once we have computed the posterior from the n samples, the idea is simple:

Predict any new point x to have corresponding function value µ_n(x)

• Moreover, σ²_n(x) gives an estimate of the uncertainty in this prediction

Advantages and Disadvantages

• Key advantages of GPs:
  - Explicit estimates of uncertainty in the prediction
  - Wide variety of kernels available for rich modeling of functions
  - Often highly effective even when limited data is available

• Disadvantages:
  - Computing the (exact) posterior takes O(n³) time
  - Choosing a good kernel can be difficult
  - Difficulties in scaling to a large input dimension d

Kernels for GPs

• If the input x only takes one of finitely many values, “manually” specifying the covariance between all possible pairs might be possible.

• Otherwise, one chooses a kernel function k(x, x′) that specifies the covariance corresponding to two points
  - The choice of kernel is application-dependent
  - One can also learn a good kernel from the data (more later!)

Stationarity

• An important class of kernels is the class of stationary kernels:

    k(x, x′) = k(τ),    τ = x − x′

• In other words, the kernel depends on x, x′ only through their difference
  - Special case: Setting x = x′, we find that the variance is the same at every point

• Intuition: The local statistics in one part of the domain are the same as the localstatistics in another part of the domain (translation invariance)

Commonly-Used Examples (I)

• Common examples:

  - Squared exponential (SE):

      k_SE(x, x′) = exp( −‖x − x′‖² / (2ℓ²) )

    - ℓ is a length-scale (roughly, the scale over which the function varies)
    - Also known as Gaussian kernel or Radial basis function (RBF) kernel

  - Matérn:

      k_Matérn(x, x′) = (2^{1−ν} / Γ(ν)) ( √(2ν) ‖x − x′‖ / ℓ )^ν K_ν( √(2ν) ‖x − x′‖ / ℓ ),

    - Γ is the Gamma function and K_ν is the modified Bessel function (of the second kind)
    - ν is a smoothness parameter (higher = more smooth)
    - ν → ∞ recovers SE kernel
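A small NumPy/SciPy sketch of these two kernels as functions of the distance r = ‖x − x′‖ (the helper names and default parameters are illustrative; scipy.special supplies kv and gamma):

```python
import numpy as np
from scipy.special import gamma, kv   # kv = modified Bessel function of the second kind

def k_se(r, ell=1.0):
    """Squared exponential kernel evaluated at distance r."""
    return np.exp(-np.asarray(r, dtype=float)**2 / (2 * ell**2))

def k_matern(r, ell=1.0, nu=2.5):
    """Matérn kernel evaluated at distance r."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    out = np.ones_like(r)                        # limit at r = 0 is 1 (the variance)
    nz = r > 0
    scaled = np.sqrt(2 * nu) * r[nz] / ell
    out[nz] = (2**(1 - nu) / gamma(nu)) * scaled**nu * kv(nu, scaled)
    return out

r = np.linspace(0, 3, 7)
print(k_se(r))        # decays smoothly with distance
print(k_matern(r))    # similar shape; larger nu brings it closer to k_se
```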

Commonly-Used Examples (II)

• ...and many more!

• A brief guide (and the source of these images): http://www.cs.toronto.edu/~duvenaud/cookbook/index.html

Kernels in More General Machine Learning

• Notes on kernels in machine learning:
  - Many machine learning algorithms depend on the data x_1, ..., x_n only through the inner products 〈x_i, x_j〉
    - Example 1: Ridge regression
    - Example 2: Dual form of Support Vector Machine (SVM)
    - Example 3: Nearest-neighbor methods
  - We know that moving to feature spaces can help, so we could map each x_i → φ(x_i) and apply the algorithm using 〈φ(x_i), φ(x_j)〉
  - A kernel function k(x_i, x_j) can be thought of as an inner product in a possibly implicit feature space (a small explicit example follows below)
    - No need to explicitly map to feature space at all!
    - The implicit space may be infinite-dimensional (e.g., SE and Matérn), so we could not explicitly map to it even if we wanted to
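As a tiny concrete instance of "kernel = inner product in a feature space" (my own example, not from the slides): for x ∈ R², the degree-2 polynomial kernel k(x, x′) = (x^T x′)² equals 〈φ(x), φ(x′)〉 with the explicit map φ(x) = (x_1², √2 x_1 x_2, x_2²). A quick numerical check:

```python
import numpy as np

def k_poly2(x, xp):
    """Degree-2 polynomial kernel (no offset)."""
    return float(x @ xp) ** 2

def phi(x):
    """Explicit feature map realizing k_poly2 on 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k_poly2(x, xp), phi(x) @ phi(xp))   # both equal (1*3 + 2*(-1))^2 = 1.0
```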

Importance of the Kernel Choice

• Different kernel choices can behave very differently (and sometimes very badly):

Figure: GP regression fits with ℓ too small, ℓ too large, and a good choice of ℓ

Notes:

• Choosing a suitable GP kernel is crucial
• (just like choosing a Bayesian prior more generally)
• (or choosing a model even more generally)

Choosing a Kernel

• The choice of kernel can have a considerable impact on the performance

• Principled approach: Learn a kernel from some training data (x_i, y_i) with i = 1, ..., n_T

• Maximum-likelihood:
  - Fix a class of kernels and optimize its parameter(s) (e.g., ℓ)
  - Specifically, search for

      arg max_ℓ  P_ℓ(y_1, ..., y_{n_T})

    where P_ℓ is the probability density for the observations with inputs (x_1, ..., x_{n_T}) and parameter ℓ
  - Exact maximization may be difficult; alternatives include
    - Grid search (a sketch follows below)
    - Random search
    - Iterative optimization
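A minimal sketch of the grid-search option, assuming a zero-mean GP with SE kernel and Gaussian noise: the log marginal likelihood is log P_ℓ(y) = −(1/2) y^T (K_ℓ + σ²I)^{-1} y − (1/2) log det(K_ℓ + σ²I) − (n/2) log 2π, and we simply evaluate it over a grid of candidate length-scales (illustrative code and data, not the lecture's):

```python
import numpy as np

def log_marginal_likelihood(x, y, ell, noise_var=0.01):
    """log P_ell(y) for a zero-mean GP with SE kernel and Gaussian noise."""
    d = x[:, None] - x[None, :]
    K = np.exp(-d**2 / (2 * ell**2)) + noise_var * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # (K_ell + σ²I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                    # −(1/2) log det via Cholesky
            - 0.5 * len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)   # hypothetical training data
grid = np.logspace(-2, 0, 30)                               # candidate length-scales
best_ell = grid[np.argmax([log_marginal_likelihood(x, y, l) for l in grid])]
print(best_ell)
```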

Classification with GPs I

• GPs can also be used for classification:
  - Input vector x ∈ R^d, binary label y ∈ {1, −1}
  - Example 1: x represents an email, y indicates whether it is spam
  - Example 2: x describes medical test outcomes, y indicates whether the patient has some disease
  - Example 3: x is a robotics parameter configuration, y indicates whether the robot successfully performed a task

• The setup is not too different from regression, but we have discrete rather than continuous observations

Classification with GPs II

• Richer modeling than basic linear classification:

Figure: A nonlinear decision boundary separating points classified as +1 from points classified as −1

Classification with GPs III

• Basic idea:
  - Model the true classification rule as being of the form

      y = {  1    if f(x) ≥ 0
            −1    if f(x) < 0.

  - As with regression, use data {(x_t, y_t)}_{t=1}^n to form a posterior distribution
  - Predict according to the posterior mean

• Key difference: Now the posterior cannot be computed exactly (we only get to observe the sign of f, not the function value)
  - However, Monte Carlo type approximations can be done (a naive sketch follows below)
  - See for example [Chu and Ghahramani, ICML 2005] for more details
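A deliberately naive Monte Carlo sketch for the noiseless sign model above: draw functions from the GP prior jointly at the training and test points, keep only the draws whose signs match all observed labels (rejection sampling), and estimate P(f(x*) ≥ 0 | data) from the accepted draws. This is purely illustrative (toy data, SE kernel, brute-force sampling); practical methods use the likelihood approximations discussed in [Chu and Ghahramani, ICML 2005].

```python
import numpy as np

def k_se(x1, x2, ell=0.3):
    return np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

x_tr = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])   # hypothetical 1-D inputs
y_tr = np.array([1, 1, 1, -1, -1, -1])            # labels flip around x = 0.5
x_te = np.array([0.3, 0.5, 0.7])

x_all = np.concatenate([x_tr, x_te])
K = k_se(x_all, x_all) + 1e-8 * np.eye(len(x_all))
L = np.linalg.cholesky(K)

rng = np.random.default_rng(3)
f = L @ rng.standard_normal((len(x_all), 200_000))               # prior samples of f at all points
ok = np.all(np.sign(f[:len(x_tr)]) == y_tr[:, None], axis=0)     # draws consistent with all labels
p_plus = np.mean(f[len(x_tr):, ok] >= 0, axis=1)                 # estimate of P(f(x*) >= 0 | data)
print(p_plus)   # roughly high, ~0.5, low for the three test points
```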

Further Reading

• Recent tutorial on GP regression: A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions (Schulz et al., 2018)

• Popular GP book: Gaussian Processes for Machine Learning (Rasmussen, 2006)

References

[1] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. J. Mach. Learn. Res., 6(Jul):1019–1041, 2005.

[2] Juan Pablo Gonzalez, Simon E. Cook, Thomas Oberthür, Andy Jarvis, J. Andrew Bagnell, and M. Bernardine Dias. Creating low-cost soil maps for tropical agriculture using Gaussian processes. 2007.

[3] Tuchsanai Ploysuwan and Roungsan Chaisricharoen. Gaussian process kernel crossover for automated forex trading system. In Int. Conf. EE., Telecomms., and Inf. Tech. (ECTI-CON), pages 802–805, 2017.

[4] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.

[5] Eric Schulz, Maarten Speekenbrink, and Andreas Krause. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. J. Math. Psychology, 85:1–16, 2018.

[6] Alex J. Smola and Bernhard Schölkopf. Learning with kernels. GMD-Forschungszentrum Informationstechnik, 1998.
