
Introduction to Gaussian Processes - Regression

DSA2019 Addis Ababa

Charles I. Saidu 1, Michael Mayhew 2

African University of Science and Technology, Nigeria and Baze University, Nigeria 1

Inflammatix, USA 2

isaidu at aust dot edu dot ng 1

mmayhew at inflammatix dot com 2

June 5, 2019

Charles I. Saidu 1, Michael Mayhew 2 (AUST/Baze University, Nigeria) Gaussian Processes June 5, 2019 1 / 35


Introduction to Gaussian Processes (GP) - Regression

1 Data Modelling (pre-introduction to Gaussian process regression): the regression problem

2 Parametric Models : Motivation

3 Non-Parametric Models, and the Gaussian Process

4 Gaussian Processes

5 Additional Resources


Data Modelling: Regression

Let's say we have data D = {X, y}

We are interested in finding the function f such that y = f(x) + e describes the behaviour of the data D, where e is an error (noise) term


Data Modelling: Regression

Possible approaches towards finding f will be:

Parametric approach: the neural network family

Non-parametric approach: KNN, SVMs, Gaussian processes


Data Modelling: Regression - Parametric Models

We have data D = {X, y}

Parametric approach:

Choose a function class f(x), or a mapping, with a FIXED NUMBER of parameters $\Theta$

Learn the PARAMETERS $\Theta^*$ of the model f(x)

Parameter Estimation

Maximum Likelihood Estimation (MLE): find the single value $\Theta^*$ that maximizes the likelihood of the data

Maximum A Posteriori (MAP) estimation: learn a point estimate $\Theta^*$ (the mode of the posterior distribution) using a prior over $\Theta$. More robust to overfitting than MLE

Full Bayesian methods: learn a distribution over plausible parameters $\Theta$, usually with approximation methods such as variational methods, MCMC, or the Laplace approximation


Data Modelling: Graphical Intuition - Motivation: Parametric models (Point Estimates vs Distribution)

Consider the data scatter plot below

How would you fit a model f(x) to this data?

[Figure: scatter plot of the data, y against x]


Data Modelling: Graphical Intuition - Motivation: Linear Parametric models (Point Estimates vs Distribution)

Linear model $f(x) = \theta_1 x + \theta_2$, fitted using MLE or MAP

Optimal parameters $\Theta^* = \{\theta_1, \theta_2\}$

NOTE: Fixed number of parameters (2 parameters in this example)

[Figure: linear fit to the data, y against x]
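The linear fit on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the noise is taken to be Gaussian (so MLE reduces to least squares), the MAP prior is taken to be a zero-mean Gaussian on the parameters (so MAP reduces to ridge regression), and the synthetic data, the noise level `sigma`, and the prior scale `tau` are made up for the example rather than taken from the slides.

```python
import numpy as np

def design_matrix(x):
    """Columns [x, 1] so that f(x) = theta1 * x + theta2."""
    return np.column_stack([x, np.ones_like(x)])

def fit_mle(x, y):
    """Least-squares fit, which is the MLE under i.i.d. Gaussian noise."""
    theta, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)
    return theta

def fit_map(x, y, sigma=0.5, tau=1.0):
    """MAP estimate with assumed prior theta ~ N(0, tau^2 I) (ridge regression)."""
    A = design_matrix(x)
    lam = sigma**2 / tau**2
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 30)
y = 2.0 * x - 1.0 + rng.normal(0.0, 0.5, size=x.shape)  # hypothetical data

theta_mle = fit_mle(x, y)   # two numbers: slope and intercept
theta_map = fit_map(x, y)   # same two parameters, shrunk towards zero
print("MLE:", theta_mle)
print("MAP:", theta_map)
```

Whatever the size of the dataset, both estimates consist of exactly two numbers, which is the point the slide makes about parametric models.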


Data Modelling: Graphical Intuition - Motivation: Linear Parametric models

Linear model $f(x) = \theta_1 x + \theta_2$

Full Bayesian: a distribution over the parameters $\Theta$

Parameters $\Theta = \{\theta_1, \theta_2\}$

NOTE: Still a fixed number of parameters (2 parameters in this example)

[Figure: Bayesian linear fit to the data, y against x]
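For the linear model, the full Bayesian treatment has a closed form when both the noise and the prior are Gaussian. The sketch below assumes noise $e \sim N(0, \sigma^2)$ and a prior $\Theta \sim N(0, \tau^2 I)$; these choices, the function name `posterior_over_parameters`, and the synthetic data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def posterior_over_parameters(x, y, sigma=0.5, tau=1.0):
    """Closed-form Gaussian posterior for f(x) = theta1 * x + theta2.

    Assumes noise e ~ N(0, sigma^2) and prior theta ~ N(0, tau^2 I).
    Returns the posterior mean, shape (2,), and covariance, shape (2, 2).
    """
    A = np.column_stack([x, np.ones_like(x)])
    precision = A.T @ A / sigma**2 + np.eye(2) / tau**2
    cov = np.linalg.inv(precision)
    mean = cov @ A.T @ y / sigma**2
    return mean, cov

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 30)
y = 2.0 * x - 1.0 + rng.normal(0.0, 0.5, size=x.shape)  # hypothetical data

mean, cov = posterior_over_parameters(x, y)
# Each sample is one plausible (theta1, theta2) pair, i.e. one plausible line.
theta_samples = rng.multivariate_normal(mean, cov, size=5)
print(theta_samples)
```

The result is a distribution rather than a point, but it is still a distribution over the same two parameters.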


Data Modelling: Graphical Intuition - Motivation: Quadratic Parametric models - MLE

Quadratic model $f(x) = \theta_1 x^2 + \theta_2 x + \theta_3$, fitted using MLE or MAP

Optimal parameters $\Theta^* = \{\theta_1, \theta_2, \theta_3\}$

NOTE: Fixed number of parameters (3 parameters in this example)

[Figure: quadratic fit to the data, y against x]


Data Modelling: Graphical Intuition - Motivation: Quadratic Parametric models - Full Bayesian

Quadratic model $f(x) = \theta_1 x^2 + \theta_2 x + \theta_3$

Full Bayesian: a distribution over the parameters $\Theta$

Parameters $\Theta = \{\theta_1, \theta_2, \theta_3\}$

NOTE: Still a fixed number of parameters (3 parameters in this example)

[Figure: Bayesian quadratic fit to the data, y against x]


Data Modelling: Graphical Intuition - Motivation: Cubic Parametric models - MLE

Cubic model $f(x) = \theta_1 x^3 + \theta_2 x^2 + \theta_3 x + \theta_4$, fitted using MLE or MAP

Optimal parameters $\Theta^* = \{\theta_1, \theta_2, \theta_3, \theta_4\}$

NOTE: Fixed number of parameters (4 parameters in this example)

[Figure: cubic fit to the data, y against x]
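The progression from linear to quadratic to cubic models can be reproduced with `np.polyfit`, which performs a least-squares fit (the MLE under Gaussian noise). The data below are synthetic and purely illustrative; the point is that each model class fixes its parameter count before any data are seen.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2.0, 2.0, 40)
y = x**3 - 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)  # hypothetical data

fits, errors = {}, {}
for degree in (1, 2, 3):
    theta = np.polyfit(x, y, deg=degree)   # least-squares (MLE) coefficients
    fits[degree] = theta
    errors[degree] = np.mean((y - np.polyval(theta, x)) ** 2)
    print(f"degree={degree}  parameters={theta.size}  "
          f"training MSE={errors[degree]:.3f}")
```

A polynomial of degree d always has d + 1 parameters, so moving to a richer model means choosing a new, larger, but still fixed parameter vector by hand.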


Data Modelling: Graphical Intuition - Motivation: Cubic Parametric models - Full Bayesian

Cubic model $f(x) = \theta_1 x^3 + \theta_2 x^2 + \theta_3 x + \theta_4$

Full Bayesian: a distribution over the parameters $\Theta$

Parameters $\Theta = \{\theta_1, \theta_2, \theta_3, \theta_4\}$

NOTE: Still a fixed number of parameters (4 parameters in this example)

[Figure: Bayesian cubic fit to the data, y against x]


Data Modeling: PUNCH-LINE

What if we don't want to specify the number of parameters upfront in our model?

Also, what if we want to consider a distribution over plausible functions that describe our data, such that the complexity/parameters of these functions scale with the data?

We might also want our model to be able to handle missing data, i.e. to be a generative model

How???


Non Parametric models

What is a Non-parametric model?

No! It does NOT mean the model has no parameters

It simply means the model's number of parameters is NOT fixed or determined upfront, unlike the parametric models in the previous examples

When you hear nonparametric, think of models whose parameters scale with the amount/complexity of the data
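A k-nearest-neighbour regressor is perhaps the simplest way to see parameters scaling with the data. The class below is a hypothetical toy written for this note (the name `KNNRegressor` and its methods are not from any library): fitting only stores the training set, so the stored values play the role of parameters and their number grows with every new observation.

```python
import numpy as np

class KNNRegressor:
    """Tiny k-nearest-neighbour regressor for 1-D inputs (illustrative only)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, x, y):
        # "Training" just stores the data: the stored points act as the
        # model's parameters, so their number grows with the dataset.
        self.x_train = np.asarray(x, dtype=float)
        self.y_train = np.asarray(y, dtype=float)
        return self

    def predict(self, x_new):
        x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
        # Pairwise distances, shape (n_new, n_train).
        dist = np.abs(x_new[:, None] - self.x_train[None, :])
        nearest = np.argsort(dist, axis=1)[:, : self.k]
        return self.y_train[nearest].mean(axis=1)

    @property
    def n_stored_values(self):
        return self.x_train.size + self.y_train.size

x = np.linspace(0.0, 1.0, 11)
model = KNNRegressor(k=1).fit(x, x**2)
prediction = model.predict([0.5])
print(model.n_stored_values, prediction)
```

Doubling the training set doubles `n_stored_values`, whereas the polynomial models earlier keep the same handful of coefficients however much data arrives.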


Non Parametric models - GPs

So we want a model whose parameters scale with the data/complexity

We also want to model plausible functions f(x) that describe our data

Consequently, what we want is a distribution over these functions

Sounds Cool!!!


Non Parametric models - GPs

But How??


Non Parametric models - Gaussian Processes

Let's define a vector of function values evaluated at n points $x_i \in X$ as $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))$

Let's also take smoothness of f to mean that function values $(f(x_i), f(x_{i+1}))$ at points that are close together in input space are highly correlated

Figure: Smoothness Assumption. Source: Neil Lawrence

Non Parametric models - Gaussian Processes

Definition: Gaussian Process:

Gaussian processes (GPs) assume that neighbouring points $x_i, x_{i+1}$ are correlated and that the function values $f_i, f_{i+1}$ are jointly multivariate Gaussian distributed

Hence, GPs are parameterized by a mean function $\mu(x)$ and a covariance function, or kernel, $K(x_i, x_{i+1})$

$$p(f_i, f_{i+1}) = \mathcal{N}(\mu, K) \tag{1}$$

$$\mu = \begin{bmatrix} \mu(x_i) \\ \mu(x_{i+1}) \end{bmatrix}, \quad K = \begin{bmatrix} K(x_i, x_i) & K(x_i, x_{i+1}) \\ K(x_{i+1}, x_i) & K(x_{i+1}, x_{i+1}) \end{bmatrix} \tag{2}$$


Non Parametric models - Gaussian Processes

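Similarly, $p(\mathbf{f}) = p(f(x_1), f(x_2), \ldots, f(x_n))$ is also multivariate Gaussian, given by

$$p(\mathbf{f}) = \mathcal{N}(\mu, K) \tag{3}$$

where

$$\mu = \begin{bmatrix} \mu(x_1) \\ \mu(x_2) \\ \vdots \\ \mu(x_n) \end{bmatrix}, \quad K = \begin{bmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{bmatrix} \tag{4}$$

Note:

The kernel function K generates the covariance matrix $\Sigma$

$\Sigma$ must be a positive semi-definite matrix, so K must be a positive semi-definite function

Note also that $\mathbf{f}$ becomes infinite-dimensional as n tends to infinity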


Brief Note on the Multivariate Normal

Multivariate Normal - Statistics' Swiss army knife

$X \mid \mu, \Sigma \sim \mathrm{MVNorm}(\mu, \Sigma)$

A highly useful joint distribution for continuous, vector-valued observations

Parameterized by a mean vector $\mu$ and a covariance matrix $\Sigma$


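Properties of the Multivariate Normal

Theorem

Suppose $x = (x_1, x_2)$ is jointly Gaussian with parameters

[Equation image: joint mean and covariance of x]

Then the marginals are also Gaussian, given by

[Equation image: marginal distributions]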


Properties of the Multivariate Normal

Theorem (continued)

The posterior conditional is also Gaussian, given by

[Equation image: conditional distribution]
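The equations of this theorem did not survive in the transcript. For reference, the standard textbook statement for a partitioned Gaussian is given below; the block notation ($\mu_1$, $\Sigma_{12}$, and so on) is the usual convention and is assumed here rather than copied from the slides.

```latex
% Joint distribution of the two blocks
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
  \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\right)

% Marginals
p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11}), \qquad
p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})

% Conditional (posterior) of x_1 given x_2
p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})

\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)

\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
```

Taking $x_1$ to be the function values at the test points and $x_2$ the noisy training targets, with zero means, $\Sigma_{12} = K_*^T$ and $\Sigma_{22} = K_y$, the conditional formulas reduce to the GP predictive mean and covariance used later in the deck.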


Non Parametric models - Gaussian Processes

GPs HAVE parameters: they are parameterized by a mean function $\mu$ and a class of kernel function $K(x_i, x_j)$

However, the parameters scale with the complexity/data

An example of a kernel function is

$$K(x_i, x_j \mid \Theta) = \theta_0 \exp\left[-\frac{\lVert x_i - x_j \rVert^2}{2\ell^2}\right] \tag{5}$$

Hyperparameters: $\Theta = [\theta_0, \ell]$, the parameter vector

$\ell$ is the lengthscale and $\theta_0$ is known as the amplitude
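The kernel in equation (5) translates almost directly into NumPy. The sketch below assumes 1-D inputs; the function name `rbf_kernel` and its argument names are chosen for this note. It also checks the two properties a covariance matrix needs: symmetry and non-negative eigenvalues.

```python
import numpy as np

def rbf_kernel(xa, xb, theta0=1.0, lengthscale=1.0):
    """Squared-exponential kernel of equation (5) for 1-D inputs.

    xa has shape (n,), xb has shape (m,); returns an (n, m) matrix with
    entries theta0 * exp(-(xa_i - xb_j)^2 / (2 * lengthscale^2)).
    """
    xa = np.asarray(xa, dtype=float)
    xb = np.asarray(xb, dtype=float)
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return theta0 * np.exp(-sq_dist / (2.0 * lengthscale**2))

x = np.linspace(0.0, 5.0, 6)
K = rbf_kernel(x, x, theta0=2.0, lengthscale=1.0)

print(np.round(K[0], 3))                     # covariance of f(x_1) with each f(x_j)
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
```

The first row of `K` decays as the points move apart, which is the smoothness assumption in numerical form: nearby inputs have strongly correlated function values, distant ones nearly independent values.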


Non Parametric models - Gaussian Processes

Some kernel functions

Figure: Effect of choosing different kernels on the prior function distribution. Source: Wikipedia
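The figure itself is not reproduced here, but a related effect is easy to sketch: drawing functions from a zero-mean GP prior under different kernel settings. As a simplifying assumption, the example varies only the lengthscale of the squared-exponential kernel rather than comparing different kernel families, and the `roughness` measure is an ad hoc summary invented for this note.

```python
import numpy as np

def rbf_kernel(xa, xb, theta0=1.0, lengthscale=1.0):
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return theta0 * np.exp(-sq_dist / (2.0 * lengthscale**2))

def sample_prior(x, lengthscale, n_samples=3, jitter=1e-6, seed=0):
    """Draw functions f ~ N(0, K) evaluated on the grid x."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(x, x, lengthscale=lengthscale)
    # Jitter on the diagonal keeps the Cholesky factorisation stable.
    L = np.linalg.cholesky(K + jitter * np.eye(x.size))
    return (L @ rng.standard_normal((x.size, n_samples))).T

def roughness(samples):
    """Mean absolute change between neighbouring grid points."""
    return np.abs(np.diff(samples, axis=1)).mean()

x = np.linspace(0.0, 5.0, 50)
wiggly = sample_prior(x, lengthscale=0.3)
smooth = sample_prior(x, lengthscale=2.0)

print("roughness, lengthscale 0.3:", roughness(wiggly))
print("roughness, lengthscale 2.0:", roughness(smooth))
```

Each row of `wiggly` or `smooth` is one function drawn from the prior; plotting the rows against `x` gives the kind of picture the figure caption describes.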


Non Parametric models - Gaussian Processes

Once we decide on our kernel function, Gaussian processes can be used for Bayesian regression:

$$p(\mathbf{f} \mid D) = \frac{p(D \mid \mathbf{f})\, p(\mathbf{f})}{p(D)} \tag{6}$$

where

$p(\mathbf{f})$ is our prior over the functions, before observing any data

$p(D \mid \mathbf{f})$ is the likelihood of the data D given the functions

$p(\mathbf{f} \mid D)$ is our posterior after observing the data D


Non Parametric models - Recap Bayes theorem


Non Parametric models - Where does the likelihood come from?

All probability modeling starts with a preliminary analysis or visual inspection of the data

This is called Exploratory Data Analysis (EDA)

It motivates the choice/formulation of the likelihood

NOTE

Carrying out EDA doesn't violate the spirit of prior specification unless the prior is engineered to look exactly like what's in the data

This is why we tend to

elicit priors from third-party experts

use flat, non-informative priors


GPs - Regression Prediction

We have training data D = {X, y}

We want to predict $y_*$ given test points $X_*$

Our model is

$$y_n = f_n + e_n$$

$$f \sim \mathcal{GP}(0, K)$$

Then, in theory, we can make predictions by combining the likelihood and the posterior as

$$p(y_* \mid X_*, D) = \int p(y_* \mid X_*, f, D)\, p(f \mid D)\, df \tag{7}$$


Non-Parametric Models - Gaussian Processes - Regression Prediction

If we assume Gaussian noise: $y_n = f_n + e_n$, where $e_n \sim \mathcal{N}(0, \sigma^2)$

The likelihood is Gaussian: IID samples

The predictive distribution then has an analytical Gaussian solution:

Gaussian Process

$$p(y_* \mid X_*, D) = \mathcal{N}(\mu_*, \Sigma_*) \tag{8}$$

$$\mu_* = K_*^T K_y^{-1} y \tag{9}$$

$$\Sigma_* = K_{**} - K_*^T K_y^{-1} K_* \tag{10}$$


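Non-Parametric Models - Gaussian Processes - Regression Prediction

Gaussian Process

$$p(y_* \mid X_*, D) = \mathcal{N}(\mu_*, \Sigma_*) \tag{11}$$

$$\mu_* = K_*^T K_y^{-1} y \tag{12}$$

$$\Sigma_* = K_{**} - K_*^T K_y^{-1} K_* \tag{13}$$

Where

$K_y = K + \sigma^2 I$

$K$ is the kernel covariance matrix of the training points $x_1, x_2, \ldots, x_n$

$K_*$ is the covariance between the training points $x_1, x_2, \ldots, x_n$ and the test points $X_*$

$K_{**}$ is the covariance among the test points $X_*$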
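The predictive equations can be coded in a few lines of NumPy. The sketch below is one possible from-scratch version, not the authors' notebook: it assumes 1-D inputs, a zero-mean prior, the squared-exponential kernel, and a known noise standard deviation `sigma`; the function names and the synthetic sine data are made up for illustration. It uses a Cholesky factor of $K_y$ rather than an explicit inverse, which is a common numerical choice.

```python
import numpy as np

def rbf_kernel(xa, xb, theta0=1.0, lengthscale=1.0):
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return theta0 * np.exp(-sq_dist / (2.0 * lengthscale**2))

def gp_predict(x_train, y_train, x_test, sigma=0.1, **kernel_args):
    """Posterior mean and covariance of f at x_test (zero-mean GP prior)."""
    K = rbf_kernel(x_train, x_train, **kernel_args)
    K_s = rbf_kernel(x_train, x_test, **kernel_args)    # K_*,  shape (n, m)
    K_ss = rbf_kernel(x_test, x_test, **kernel_args)    # K_**, shape (m, m)
    K_y = K + sigma**2 * np.eye(x_train.size)

    # Solve with a Cholesky factor instead of forming K_y^{-1} explicitly.
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K_y^{-1} y
    v = np.linalg.solve(L, K_s)                                # L^{-1} K_*

    mean = K_s.T @ alpha          # mu_*    = K_*^T K_y^{-1} y
    cov = K_ss - v.T @ v          # Sigma_* = K_** - K_*^T K_y^{-1} K_*
    return mean, cov

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 5.0, 20)
y_train = np.sin(x_train) + rng.normal(0.0, 0.1, size=x_train.shape)  # hypothetical data
x_test = np.linspace(0.0, 5.0, 50)

mean, cov = gp_predict(x_train, y_train, x_test, sigma=0.1)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # pointwise error bars
print(mean[:5], std[:5])
```

As written, `cov` describes the latent function at the test points; adding `sigma**2` to its diagonal would give the covariance of new noisy observations.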


Non-Parametric Models - Gaussian Processes - Regression Prediction

How do we choose hyperparameters?

By optimization, typically maximizing the log marginal likelihood of the data with respect to the hyperparameters
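One common objective for this optimization is the log marginal likelihood $\log p(y \mid X, \Theta) = -\tfrac{1}{2} y^T K_y^{-1} y - \tfrac{1}{2} \log |K_y| - \tfrac{n}{2} \log 2\pi$. The sketch below evaluates it on a small grid of candidate lengthscales; the grid, the fixed amplitude and noise level, and the synthetic data are all assumptions made for illustration, and a grid search stands in for the gradient-based optimizers that are normally used.

```python
import numpy as np

def rbf_kernel(xa, xb, theta0=1.0, lengthscale=1.0):
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return theta0 * np.exp(-sq_dist / (2.0 * lengthscale**2))

def log_marginal_likelihood(x, y, lengthscale, theta0=1.0, sigma=0.1):
    """log p(y | X, hyperparameters) for a zero-mean GP with Gaussian noise."""
    K_y = rbf_kernel(x, x, theta0, lengthscale) + sigma**2 * np.eye(x.size)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_y^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                  # equals -0.5 * log|K_y|
            - 0.5 * x.size * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 25)
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)        # hypothetical data

candidates = np.array([0.05, 0.2, 0.5, 1.0, 2.0, 5.0])
scores = np.array([log_marginal_likelihood(x, y, ell) for ell in candidates])
best_lengthscale = candidates[np.argmax(scores)]
print("best lengthscale on this grid:", best_lengthscale)
```

The same function can be handed to any numerical optimizer, and the amplitude and noise level can be searched jointly with the lengthscale in the same way.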

What about NON-Gaussian likelihood functions?

For a NON-Gaussian likelihood, the posterior does NOT have an analytical form, so there are NO closed-form summarising statistics. Hence, we obtain the posterior via

Sampling

Analytic approximations


Why Gaussian Processes

What are they good for

Good for time series data

Directly captures model uncertainty

Work very well on not-so-large datasets

Can encode prior information in the model

Handle model complexity quite well, since complexity scales with the data

Some limitations

Not so great for large datasets: exact inference costs $O(n^3)$ time and $O(n^2)$ memory. However, parallelization, sparse GPs and other techniques try to address this

May not be your number one go-to option for classification problems


GPs: Demos - GPs from Scratch Intro, Using the GPy Library

A notebook on coding GPs from the equations above with Python and NumPy is included (just the intuition)

Practical hands-on using GPy coming up!


Additional resources

C. E. Rasmussen and C. K. I. Williams (2006) Gaussian Processes for Machine Learning

Lecture Notes, Neil Lawrence - http://inverseprobability.com

Lecture Notes, Lehel Csato - http://www.cs.ubbcluj.ro/~csatol/

Lecture Notes and Video, Nando de Freitas - http://www.cs.ox.ac.uk/people/nando.defreitas/

Gaussian processes website - http://www.gaussianprocess.org/


Thank you: Questions?
