
Statistical Machine Learning

© 2020 Ong & Walder & Webers

Data61 | CSIRO
The Australian National University

Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University

Canberra, Semester One, 2020.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")


Part IV

Linear Regression 2


Linear Regression

Basis functions
Maximum Likelihood with Gaussian Noise
Regularisation
Bias variance decomposition


Training and Testing: (Non-Bayesian) Point Estimate

(Diagram: in the training phase, a model with adjustable parameters w is fit to the training data x and training targets t, fixing the most appropriate w*; in the test phase, the model with fixed parameters w* predicts the test target t for new test data x.)


Bayesian Regression

Bayes' Theorem:

$$
p(\mathbf{w} \mid \mathbf{t}) = \frac{p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{t})},
\qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalisation}},
$$

where we leave out the conditioning on x (always assumed) and on β, which is assumed to be constant. The i.i.d. regression likelihood for additive Gaussian noise is

$$
\begin{aligned}
p(\mathbf{t} \mid \mathbf{w}) &= \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}) \\
&= \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(x_n), \beta^{-1}) \\
&= \text{const} \times \exp\left\{ -\frac{\beta}{2} (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^\top (\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}) \right\} \\
&= \mathcal{N}(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})
\end{aligned}
$$
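As a concrete illustration (not part of the original slides), a minimal sketch of evaluating this log likelihood in code, assuming a polynomial basis and synthetic data; all names and constants here are chosen for the example only.

```python
import numpy as np

def design_matrix(x, degree=3):
    """Polynomial basis: row n is phi(x_n) = (1, x_n, x_n^2, ..., x_n^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def log_likelihood(w, x, t, beta):
    """log p(t | w) = log N(t | Phi w, beta^-1 I) for i.i.d. Gaussian noise."""
    Phi = design_matrix(x)
    resid = t - Phi @ w
    N = len(t)
    return 0.5 * N * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid @ resid

# Synthetic data from an assumed true w (illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
w_true = np.array([0.5, -1.0, 0.0, 2.0])
beta = 25.0  # noise precision: standard deviation 0.2
t = design_matrix(x) @ w_true + rng.normal(0.0, beta ** -0.5, size=20)

print(log_likelihood(w_true, x, t, beta))      # higher than for a wrong w
print(log_likelihood(np.zeros(4), x, t, beta))
```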


How to choose a prior?

The choice of prior affords an intuitive control over our inductive bias.
All inference schemes have such biases, which often arise more opaquely than the prior in Bayes' rule.
Can we find a prior for the given likelihood which

makes sense for the problem at hand
allows us to find a posterior in a 'nice' form?

An answer to the second question:

Definition (Conjugate Prior)

A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).
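For instance, pairing a Bernoulli likelihood with a Beta prior (first row of the tables on the next slide) gives a Beta posterior, so the update is mere parameter arithmetic. A minimal sketch; the prior pseudo-counts and observations are assumed for illustration:

```python
# Bernoulli likelihood with a Beta(a, b) prior: observing k successes in
# n trials gives the posterior Beta(a + k, b + n - k) -- the same family
# as the prior.
a, b = 2.0, 2.0                    # assumed prior pseudo-counts
data = [1, 0, 1, 1, 0, 1, 1, 1]    # assumed Bernoulli observations
k, n = sum(data), len(data)
a_post, b_post = a + k, b + n - k
print(f"posterior: Beta({a_post}, {b_post}), "
      f"mean = {a_post / (a_post + b_post):.3f}")
```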


Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

Likelihood    Conjugate Prior
Bernoulli     Beta
Binomial      Beta
Poisson       Gamma
Multinomial   Dirichlet

Table: Continuous likelihood distributions

Likelihood            Conjugate Prior
Uniform               Pareto
Exponential           Gamma
Normal                Normal (mean parameter)
Multivariate normal   Multivariate normal (mean parameter)


Conjugate Prior to a Gaussian Distribution

Example: if the likelihood function is Gaussian, choosing a Gaussian prior for the mean will ensure that the posterior distribution is also Gaussian. Given a marginal distribution for x and a conditional Gaussian distribution for y given x in the form

$$
p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})
$$

$$
p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})
$$

we get

$$
p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top)
$$

$$
p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^\top \mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma})
$$

where $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^\top \mathbf{L} \mathbf{A})^{-1}$.

Note that the covariance Σ does not involve y.
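These identities are easy to sanity-check numerically. A hedged sketch comparing Monte Carlo moments of y against the stated closed form; the matrices and vectors below are small assumed examples:

```python
import numpy as np

rng = np.random.default_rng(1)

mu  = np.array([1.0, -0.5])                 # assumed mean of x
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])    # assumed precision of x
A   = np.array([[1.0, 2.0], [0.0, 1.0]])    # assumed linear map
b   = np.array([0.5, 0.5])
L   = 4.0 * np.eye(2)                       # assumed precision of y given x

n = 200_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
eps = rng.multivariate_normal(np.zeros(2), np.linalg.inv(L), size=n)
y = x @ A.T + b + eps

print(y.mean(axis=0), "vs", A @ mu + b)     # E[y] = A mu + b
print(np.cov(y.T))                          # empirical cov[y] ...
print(np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)  # ... vs closed form
```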


Conjugate Prior to a Gaussian Distribution (intuition)

Given

$$
p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})
$$

$$
p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})
\;\Leftrightarrow\;
\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} + \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{L}^{-1}),
$$

we have $\mathbb{E}[\mathbf{y}] = \mathbb{E}[\mathbf{A}\mathbf{x} + \mathbf{b}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$, and by the easily proven Bienaymé formula for the variance of a sum of uncorrelated variables,

$$
\operatorname{cov}[\mathbf{y}]
= \underbrace{\operatorname{cov}[\mathbf{A}\mathbf{x} + \mathbf{b}]}_{= \mathbf{A}\operatorname{cov}[\mathbf{x}]\mathbf{A}^\top = \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top}
+ \underbrace{\operatorname{cov}[\boldsymbol{\varepsilon}]}_{= \mathbf{L}^{-1}}.
$$

So y is Gaussian with

$$
p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top).
$$

Then letting $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^\top \mathbf{L} \mathbf{A})^{-1}$ and

$$
p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^\top \mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma})
\;\Leftrightarrow\;
\mathbf{x} = \boldsymbol{\Sigma}\{\mathbf{A}^\top \mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\} + \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})
$$

yields the correct moments for x, since

$$
\mathbb{E}[\mathbf{x}]
= \mathbb{E}[\boldsymbol{\Sigma}\{\mathbf{A}^\top \mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}]
= \boldsymbol{\Sigma}\{\mathbf{A}^\top \mathbf{L}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}
= (\boldsymbol{\Lambda} + \mathbf{A}^\top \mathbf{L}\mathbf{A})^{-1}\{\mathbf{A}^\top \mathbf{L}\mathbf{A} + \boldsymbol{\Lambda}\}\boldsymbol{\mu}
= \boldsymbol{\mu},
$$

and it is similar (but tedious; don't do it) to recover $\operatorname{cov}[\mathbf{x}] = \boldsymbol{\Lambda}^{-1}$.


Bayesian Regression

Choose a Gaussian prior with mean m₀ and covariance S₀:

$$
p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)
$$

Same likelihood as before (here written in vector form):

$$
p(\mathbf{t} \mid \mathbf{w}, \beta) = \mathcal{N}(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})
$$

Given N data pairs (xₙ, tₙ), the posterior is

$$
p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)
$$

where

$$
\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^\top\mathbf{t}),
\qquad
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}
$$

(derive this with the identities on the previous slides)
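A minimal sketch of this update in code (the straight-line data below are an assumption for illustration, anticipating the example used later in this part):

```python
import numpy as np

def posterior_update(Phi, t, beta, m0, S0):
    """Gaussian posterior p(w | t) = N(w | mN, SN) for the prior
    N(w | m0, S0) and likelihood N(t | Phi w, beta^-1 I)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
Phi = np.column_stack([np.ones_like(x), x])     # basis (1, x)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, size=30)
mN, SN = posterior_update(Phi, t, beta=25.0,
                          m0=np.zeros(2), S0=np.eye(2) / 2.0)
print(mN)   # close to the generating parameters (-0.3, 0.5)
```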


Bayesian Regression: Zero Mean, Isotropic Prior

For simplicity we proceed with m₀ = 0 and S₀ = α⁻¹I, so

$$
p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})
$$

The posterior becomes p(w | t) = N(w | m_N, S_N) with

$$
\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t},
\qquad
\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}
$$

For α ≪ β we get

$$
\mathbf{m}_N \to \mathbf{w}_{\mathrm{ML}} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{t}
$$

The log of the posterior is the sum of the log likelihood and the log of the prior:

$$
\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w})^\top(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}) - \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const}
$$


Bayesian Regression

The log of the posterior is the sum of the log likelihood and the log of the prior:

$$
\ln p(\mathbf{w} \mid \mathbf{t}) =
\underbrace{-\frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2}_{\text{sum-of-squares error}}
\;\underbrace{-\,\frac{\alpha}{2}\|\mathbf{w}\|^2}_{\text{regulariser}}
\; + \; \text{const.}
$$

The maximum a posteriori estimator

$$
\mathbf{w}_{\mathrm{MAP}} = \operatorname*{arg\,max}_{\mathbf{w}} \; p(\mathbf{w} \mid \mathbf{t})
$$

corresponds to minimising the sum-of-squares error function with quadratic regularisation coefficient λ = α/β.
The posterior is Gaussian, so mode = mean: w_MAP = m_N.
For α ≪ β we recover unregularised least squares (equivalently, MAP approaches maximum likelihood), for example in the case of

an infinitely broad prior, α → 0
an infinitely precise likelihood, β → ∞
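A quick numerical check of this correspondence on assumed random data: the posterior mean m_N coincides with the quadratically regularised least-squares solution with λ = α/β.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 4))      # assumed design matrix
t = rng.normal(size=50)             # assumed targets
alpha, beta = 2.0, 25.0

# Posterior mean under the zero-mean isotropic prior.
SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularised least-squares (ridge) solution with lambda = alpha / beta.
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)
print(np.allclose(mN, w_ridge))     # True: MAP = regularised least squares
```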


Bayesian Inference in General: Sequential Update of Belief

If we have not yet seen any data point (N = 0), the posterior is equal to the prior.
Sequential arrival of data points: the posterior given the data observed so far acts as the prior for the future data.
This nicely fits a sequential learning framework.


Sequential Update of the Posterior

Example of a linear basis function model: single input x, single output t.
Linear model: y(x, w) = w₀ + w₁x.
True data distribution sampling procedure:

1. Choose xₙ from the uniform distribution U(x | −1, +1).
2. Calculate f(xₙ, a) = a₀ + a₁xₙ, where a₀ = −0.3 and a₁ = 0.5.
3. Add Gaussian noise with standard deviation σ = 0.2:

$$
t_n \sim \mathcal{N}(t \mid f(x_n, \mathbf{a}), 0.04)
$$

Set the precision of the (Gaussian) prior over w to α = 2.0.
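A minimal sketch of this sampling procedure together with the resulting sequential posterior update (the number of observed points and the random seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
a0, a1 = -0.3, 0.5            # true parameters from the slide
sigma, alpha = 0.2, 2.0       # noise std and prior precision from the slide
beta = 1.0 / sigma**2         # noise precision

# Prior N(w | 0, alpha^-1 I); after each point, the posterior becomes
# the prior for the next update.
m, S = np.zeros(2), np.eye(2) / alpha
for _ in range(20):
    x = rng.uniform(-1.0, 1.0)
    t = a0 + a1 * x + rng.normal(0.0, sigma)
    phi = np.array([1.0, x])                  # basis (1, x)
    S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))
    m = S_new @ (np.linalg.inv(S) @ m + beta * phi * t)
    S = S_new
print(m)    # approaches (a0, a1) = (-0.3, 0.5)
```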


Sequential Update of the Posterior

(Figure omitted: posterior density over (w₀, w₁) and regression lines sampled from it, updated as data points arrive.)


Sequential Update of the Posterior

(Figure omitted: continuation of the sequential update as more data points arrive.)


Predictive Distribution

In the training phase, training data $\mathbf{x}$ and targets $\mathbf{t}$ are provided.
In the test phase, a new data value x is given and the corresponding target value t is asked for.
Bayesian approach: find the probability of the test target t given the test data x, the training data $\mathbf{x}$, and the training targets $\mathbf{t}$:

$$
p(t \mid x, \mathbf{x}, \mathbf{t})
$$

This is the predictive distribution (c.f. the posterior distribution, which is over the parameters).


How to calculate the Predictive Distribution?

Introduce the model parameters w via the sum rule:

$$
p(t \mid x, \mathbf{x}, \mathbf{t})
= \int p(t, \mathbf{w} \mid x, \mathbf{x}, \mathbf{t})\, d\mathbf{w}
= \int p(t \mid \mathbf{w}, x, \mathbf{x}, \mathbf{t})\, p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t})\, d\mathbf{w}
$$

The test target t depends only on the test data x and the model parameters w, not on the training data and training targets:

$$
p(t \mid \mathbf{w}, x, \mathbf{x}, \mathbf{t}) = p(t \mid \mathbf{w}, x)
$$

The model parameters w are learned from the training data $\mathbf{x}$ and the training targets $\mathbf{t}$ only:

$$
p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t}) = p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})
$$

Predictive distribution:

$$
p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid \mathbf{w}, x)\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w}
$$
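Even when this integral has no closed form, it can be approximated by sampling from the posterior. A hedged sketch, assuming the Gaussian posterior N(w | m_N, S_N) derived earlier; the example moments at the end are made up for illustration:

```python
import numpy as np

def predictive_mc(phi_x, mN, SN, beta, n_samples=100_000, seed=0):
    """Monte Carlo draws from p(t | x, training data): sample w from the
    posterior, then t from p(t | w, x) = N(t | w^T phi(x), 1/beta)."""
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal(mN, SN, size=n_samples)
    return w @ phi_x + rng.normal(0.0, beta ** -0.5, size=n_samples)

samples = predictive_mc(np.array([1.0, 0.3]),       # assumed phi(x)
                        mN=np.array([-0.3, 0.5]),   # assumed posterior mean
                        SN=0.01 * np.eye(2),        # assumed posterior cov
                        beta=25.0)
print(samples.mean(), samples.var())
```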


Proof of the Predictive Distribution

The predictive distribution is

$$
p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid \mathbf{w}, x, \mathbf{x}, \mathbf{t})\, p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t})\, d\mathbf{w}
$$

because

$$
\begin{aligned}
\int p(t \mid \mathbf{w}, x, \mathbf{x}, \mathbf{t})\, p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t})\, d\mathbf{w}
&= \int \frac{p(t, \mathbf{w}, x, \mathbf{x}, \mathbf{t})}{p(\mathbf{w}, x, \mathbf{x}, \mathbf{t})}\,
        \frac{p(\mathbf{w}, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}\, d\mathbf{w} \\
&= \int \frac{p(t, \mathbf{w}, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}\, d\mathbf{w}
 = \frac{p(t, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}
 = p(t \mid x, \mathbf{x}, \mathbf{t}),
\end{aligned}
$$

or simply

$$
\int p(t \mid \mathbf{w}, x, \mathbf{x}, \mathbf{t})\, p(\mathbf{w} \mid x, \mathbf{x}, \mathbf{t})\, d\mathbf{w}
= \int p(t, \mathbf{w} \mid x, \mathbf{x}, \mathbf{t})\, d\mathbf{w}
= p(t \mid x, \mathbf{x}, \mathbf{t}).
$$


Predictive Distribution with Simplified Prior

Find the predictive distribution

$$
p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}
$$

(remember: conditioning on x is often suppressed to simplify the notation). Now we know (neglecting as usual to notate the conditioning on x)

$$
p(t \mid \mathbf{w}, \beta) = \mathcal{N}(t \mid \mathbf{w}^\top \boldsymbol{\phi}(x), \beta^{-1})
$$

and the posterior was

$$
p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)
$$

where

$$
\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t},
\qquad
\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}
$$


Predictive Distribution with Simplified Prior

If we do the integral (it turns out to be the convolution of the two Gaussians), we get for the predictive distribution

$$
p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^\top \boldsymbol{\phi}(x), \sigma_N^2(x))
$$

where the variance σ²_N(x) is given by

$$
\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^\top \mathbf{S}_N \boldsymbol{\phi}(x).
$$

This is more easily shown using a similar approach to the earlier "intuition" slide, again with the Bienaymé formula, now using

$$
t = \mathbf{w}^\top \boldsymbol{\phi}(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \beta^{-1}).
$$

However, this is a linear-Gaussian specific trick; in general we need to integrate out the parameters.
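A minimal sketch of these closed-form predictive moments; with the same assumed inputs as the Monte Carlo sketch a few slides back, the two agree:

```python
import numpy as np

def predictive_moments(phi_x, mN, SN, beta):
    """Closed-form predictive N(t | mN^T phi(x), sigma_N^2(x)) with
    sigma_N^2(x) = 1/beta + phi(x)^T SN phi(x)."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x
    return mean, var

mean, var = predictive_moments(np.array([1.0, 0.3]),
                               mN=np.array([-0.3, 0.5]),
                               SN=0.01 * np.eye(2),
                               beta=25.0)
print(mean, var)   # matches the Monte Carlo estimates
```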


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 1.

(Figure omitted: x from 0 to 1, t from −1 to 1.)

Mean of the predictive distribution (red) and regions of one standard deviation from the mean (red shaded).


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 2.

(Figure omitted: x from 0 to 1, t from −1 to 1.)

Mean of the predictive distribution (red) and regions of one standard deviation from the mean (red shaded).


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 4.

(Figure omitted: x from 0 to 1, t from −1 to 1.)

Mean of the predictive distribution (red) and regions of one standard deviation from the mean (red shaded).


Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 25.

(Figure omitted: x from 0 to 1, t from −1 to 1.)

Mean of the predictive distribution (red) and regions of one standard deviation from the mean (red shaded).


Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 1.

(Two panels omitted: x from 0 to 1, t from −1 to 1.)


Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 2.

(Two panels omitted: x from 0 to 1, t from −1 to 1.)


Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 4.

(Two panels omitted: x from 0 to 1, t from −1 to 1.)


Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 25.

(Two panels omitted: x from 0 to 1, t from −1 to 1.)


Limitations of Linear Basis Function Models

Basis functions φⱼ(x) are fixed before the training data set is observed.
Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D.
But typical data sets have two nice properties which can be exploited if the basis functions are not fixed:

Data lie close to a nonlinear manifold whose intrinsic dimension is much smaller than D. We need algorithms which place basis functions only where data are (e.g. kernel methods / Gaussian processes).
Target variables may depend only on a few significant directions within the data manifold. We need algorithms which can exploit this property (e.g. linear methods or shallow neural networks).


Curse of Dimensionality

Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space. There are no surprises as long as n is finite.
If we add more structure to a vector space (e.g. an inner product or a metric), our intuition gained from the 3-dimensional world around us may be wrong.
Example: sphere of radius r = 1. What is the fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε?
Volume scales like r^D, therefore the formula for the volume of a sphere is V_D(r) = K_D r^D, and

$$
\frac{V_D(1) - V_D(1 - \varepsilon)}{V_D(1)} = 1 - (1 - \varepsilon)^D
$$
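The formula is easy to tabulate; the sketch below (values of D and ε chosen for illustration) shows how quickly the fraction approaches 1 as D grows:

```python
# Fraction of a unit ball's volume within eps of its surface: 1 - (1 - eps)^D.
for D in (1, 2, 5, 20, 100):
    row = "  ".join(f"eps={eps:.2f}: {1 - (1 - eps)**D:.4f}"
                    for eps in (0.01, 0.10, 0.50))
    print(f"D = {D:3d}  {row}")
```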


Curse of Dimensionality

Fraction of the volume of the sphere in a D-dimensional space which lies between radius r = 1 and r = 1 − ε:

$$
\frac{V_D(1) - V_D(1 - \varepsilon)}{V_D(1)} = 1 - (1 - \varepsilon)^D
$$

(Figure omitted: volume fraction versus ε for D = 1, 2, 5, 20; for large D the fraction approaches 1 even for small ε.)


Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution, for various values of the dimensionality D.

(Figure omitted: p(r) versus r for D = 1, 2, 20; for larger D the density peaks further from the origin.)


Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution, for various values of the dimensionality D.
Example: D = 2; assume μ = 0, Σ = I:

$$
\mathcal{N}(\mathbf{x} \mid \mathbf{0}, \mathbf{I}) = \frac{1}{2\pi} \exp\left\{-\frac{1}{2}\mathbf{x}^\top\mathbf{x}\right\} = \frac{1}{2\pi} \exp\left\{-\frac{1}{2}(x_1^2 + x_2^2)\right\}
$$

Coordinate transformation:

$$
x_1 = r\cos\phi, \qquad x_2 = r\sin\phi
$$

Probability density in the new coordinates:

$$
p(r, \phi \mid \mathbf{0}, \mathbf{I}) = \mathcal{N}(\mathbf{x}(r, \phi) \mid \mathbf{0}, \mathbf{I})\, |J|
$$

where |J| = r is the determinant of the Jacobian of the given coordinate transformation, so

$$
p(r, \phi \mid \mathbf{0}, \mathbf{I}) = \frac{1}{2\pi}\, r \exp\left\{-\frac{1}{2}r^2\right\}
$$


Curse of Dimensionality

Probability density with respect to radius r of a Gaussian distribution for D = 2 (and μ = 0, Σ = I):

$$
p(r, \phi \mid \mathbf{0}, \mathbf{I}) = \frac{1}{2\pi}\, r \exp\left\{-\frac{1}{2}r^2\right\}
$$

Integrate over all angles φ:

$$
p(r \mid \mathbf{0}, \mathbf{I}) = \int_0^{2\pi} \frac{1}{2\pi}\, r \exp\left\{-\frac{1}{2}r^2\right\} d\phi = r \exp\left\{-\frac{1}{2}r^2\right\}
$$

(Figure omitted: p(r) versus r for D = 1, 2, 20.)
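A quick Monte Carlo sanity check of this D = 2 radial density (sample size and seed are assumptions): the radii of standard Gaussian samples should have mean √(π/2) ≈ 1.253, the mean of p(r) = r exp(−r²/2).

```python
import numpy as np

rng = np.random.default_rng(5)
r = np.linalg.norm(rng.normal(size=(1_000_000, 2)), axis=1)

# p(r) = r exp(-r^2 / 2) is the Rayleigh(1) density; its mean is sqrt(pi/2).
print(r.mean(), np.sqrt(np.pi / 2))   # both ~ 1.2533
```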


Summary: Linear Regression

Basis functions
Maximum likelihood with Gaussian noise
Regularisation
Bias variance decomposition
Conjugate prior
Bayesian linear regression
Sequential update of the posterior
Predictive distribution
Curse of dimensionality