TRANSCRIPT
Gaussian Processes
Marc Deisenroth
Centre for Artificial Intelligence
Department of Computer Science
University College London
@mpd37 · m.deisenroth@ucl.ac.uk · https://deisenroth.cc
AIMS Rwanda and AIMS Ghana
March/April 2020
Reference
http://www.gaussianprocess.org/
Marc Deisenroth (UCL) Gaussian Processes March/April 2020 2
Model Selection
Influence of Prior on Posterior
Generalization error measured by log-predictive density (lpd)
lpd = log p(y* | x*, X, y, ℓ)

for different length-scales ℓ and different datasets.

Shorter length-scale → more flexible model → faster increase in uncertainty away from data → bad generalization properties
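As a concrete illustration of this effect, the log-predictive density can be evaluated for a zero-mean GP. The RBF kernel, the noise variance, and the toy sine data below are assumptions for the sketch, not taken from the slides:

```python
import numpy as np

def rbf(X1, X2, lengthscale, signal_var=1.0):
    """Squared-exponential (RBF) kernel matrix between two sets of 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def log_pred_density(x_star, y_star, X, y, lengthscale, noise_var=0.01):
    """lpd = log p(y* | x*, X, y, l) for a zero-mean GP with RBF kernel."""
    K = rbf(X, X, lengthscale) + noise_var * np.eye(len(X))
    k_star = rbf(np.array([x_star]), X, lengthscale)[0]
    k_ss = rbf(np.array([x_star]), np.array([x_star]), lengthscale)[0, 0]
    # posterior predictive mean and variance at x*
    mean = k_star @ np.linalg.solve(K, y)
    var = k_ss - k_star @ np.linalg.solve(K, k_star) + noise_var
    # log density of the observed y* under N(mean, var)
    return -0.5 * (np.log(2 * np.pi * var) + (y_star - mean) ** 2 / var)

# toy data from a smooth function
X = np.linspace(-3, 3, 10)
y = np.sin(X)
for ell in (0.1, 1.0):
    lpd = log_pred_density(0.5, np.sin(0.5), X, y, lengthscale=ell)
    print(f"lengthscale={ell}: lpd={lpd:.2f}")
```

With the very short length-scale, the posterior mean collapses towards the prior between data points, so a well-matched length-scale attains a higher lpd at the test input.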
Model Selection
Make predictions equipped with uncertainty
Choice of prior (e.g., length-scale) influences predictions
Different tasks require different priors
How do we select a good prior?
Model Selection in GPs
- Choose hyper-parameters of the GP
- Choose a good mean function and kernel
Hyper-Parameters of a GP
The GP possesses a set of hyper-parameters:
Parameters of the mean function
Parameters of the covariance function (e.g., length-scales and signal variance)

Likelihood parameters (e.g., noise variance σ_n²)

Train a GP to find a good set of hyper-parameters

Higher-level model selection to find good mean and covariance functions (can also be automated: Automatic Statistician (Lloyd et al., 2014))
[Handwritten notes (whiteboard): level-2 hyper-parameters θ of the Gaussian process; use the sum rule / Bayes' theorem with a prior on the noisy observed function values to integrate out f, p(y | θ) = ∫ p(y | f) p(f | θ) df — this is the marginal likelihood → Maximum likelihood Type-II.]
Gaussian Process Training: Hyper-Parameters
GP Training: Find good hyper-parameters θ (kernel/mean-function parameters, noise variance σ_n²)

[Graphical model: observations y_i generated from the latent f at inputs x_i with noise σ_n, plate over i = 1, …, N]

Place a prior p(θ) on hyper-parameters

Posterior over hyper-parameters:

p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X)

p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df
Gaussian Process Training: Hyper-Parameters
Posterior over hyper-parameters:

p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X)

p(y | X, θ) = ∫ p(y | f(X)) p(f(X) | θ) df

Choose hyper-parameters θ*, such that

θ* ∈ argmax_θ  log p(θ) + log p(y | X, θ)

This maximizes the marginal likelihood if p(θ) is uniform.
[Handwritten notes: the marginal likelihood is also called the evidence or marginalized likelihood; the likelihood is marginalized by integrating out f (and, at the next level, θ).]
Training via Marginal Likelihood Maximization
GP Training: Maximize the evidence/marginal likelihood (probability of the data given the hyper-parameters, where the unwieldy f has been integrated out). Also called Maximum Likelihood Type-II.

Marginal likelihood (with a prior mean function m(·) ≡ 0):

p(y | X, θ) = ∫ p(y | f(X)) p(f(X) | θ) df

            = ∫ N(y | f(X), σ_n² I) N(f(X) | 0, K) df

            = N(y | 0, K + σ_n² I)

Learning the GP hyper-parameters:

θ* ∈ argmax_θ  log p(y | X, θ)
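Since the marginal likelihood is the Gaussian N(y | 0, K + σ_n² I), it can be evaluated directly. A minimal sketch, assuming an RBF kernel and toy sine data (both my choices, not from the slides), using a Cholesky factor for numerical stability:

```python
import numpy as np

def rbf(X1, X2, lengthscale, signal_var):
    """Squared-exponential (RBF) kernel matrix."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def log_marginal_likelihood(X, y, lengthscale, signal_var, noise_var):
    """log N(y | 0, K + sigma_n^2 I), computed via a Cholesky factor."""
    n = len(X)
    K = rbf(X, X, lengthscale, signal_var) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + s^2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))   # = 0.5 * log|K + s^2 I|
            - 0.5 * n * np.log(2 * np.pi))

X = np.linspace(-3, 3, 10)
y = np.sin(X)
for ell in (0.1, 1.0):
    print(f"lengthscale={ell}: LML={log_marginal_likelihood(X, y, ell, 1.0, 0.01):.2f}")
```

On this smooth toy data, the moderate length-scale yields a clearly higher marginal likelihood than the very short one.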
Gradient-based Optimization
Log-marginal likelihood:

log p(y | X, θ) = -½ yᵀ K_θ⁻¹ y - ½ log |K_θ| + const,   K_θ := K + σ_n² I

Gradient-based optimization to get hyper-parameters θ*:

∂ log p(y | X, θ) / ∂θ_i = ½ yᵀ K_θ⁻¹ (∂K_θ/∂θ_i) K_θ⁻¹ y - ½ tr(K_θ⁻¹ ∂K_θ/∂θ_i)

                         = ½ tr((α αᵀ - K_θ⁻¹) ∂K_θ/∂θ_i),   α := K_θ⁻¹ y
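The trace formula above can be checked against finite differences. This sketch assumes an RBF kernel with unit signal variance and fixed noise, so θ_i is the length-scale ℓ and ∂K/∂ℓ has entries K_ij · d_ij² / ℓ³:

```python
import numpy as np

def lml_and_grad_lengthscale(X, y, lengthscale, noise_var=0.01):
    """Log marginal likelihood and its gradient w.r.t. the RBF length-scale."""
    n = len(X)
    d2 = (X[:, None] - X[None, :]) ** 2
    K = np.exp(-0.5 * d2 / lengthscale**2)
    K_theta = K + noise_var * np.eye(n)
    K_inv = np.linalg.inv(K_theta)
    alpha = K_inv @ y
    lml = (-0.5 * y @ alpha
           - 0.5 * np.linalg.slogdet(K_theta)[1]
           - 0.5 * n * np.log(2 * np.pi))
    dK = K * d2 / lengthscale**3                      # elementwise dK/dl
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK)
    return lml, grad

X = np.linspace(-3, 3, 8)
y = np.sin(X)
lml, g = lml_and_grad_lengthscale(X, y, 1.0)

# central finite-difference check of the analytic gradient
eps = 1e-5
lml_p, _ = lml_and_grad_lengthscale(X, y, 1.0 + eps)
lml_m, _ = lml_and_grad_lengthscale(X, y, 1.0 - eps)
print(abs(g - (lml_p - lml_m) / (2 * eps)))  # should be close to zero
```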
Marginal Likelihood Illustration

"ELBO" in the figure refers to the log-marginal likelihood

Data-fit term gets worse, but marginal likelihood increases

(Thanks to Mark van der Wilk)
Inspecting the Marginal Likelihood
Log-marginal likelihood:

log p(y | X, θ) = -½ yᵀ K_θ⁻¹ y - ½ log |K_θ| + const,   K_θ := K + σ_n² I

The quadratic term measures whether the observation y is within the variation allowed by the prior

The determinant is the product of the variances of the prior (volume of the prior); volume ≈ richness of the model class

Marginal likelihood: automatic trade-off between data fit and model complexity
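The trade-off can be made visible by printing the two terms separately. A sketch under the same assumptions as before (RBF kernel, unit signal variance, fixed noise, toy sine data):

```python
import numpy as np

def lml_terms(X, y, lengthscale, noise_var=0.01):
    """Return the (data_fit, complexity) terms of the log marginal likelihood."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / lengthscale**2)
    K_theta = K + noise_var * np.eye(len(X))
    data_fit = -0.5 * y @ np.linalg.solve(K_theta, y)      # quadratic term
    complexity = -0.5 * np.linalg.slogdet(K_theta)[1]      # determinant term
    return data_fit, complexity

X = np.linspace(-3, 3, 10)
y = np.sin(X)
for ell in (0.1, 1.0, 10.0):
    fit, comp = lml_terms(X, y, ell)
    print(f"l={ell:5.1f}  data-fit={fit:8.2f}  complexity={comp:8.2f}")
```

The short length-scale (rich model class, large prior volume) scores well on data fit but is penalized by the determinant term relative to an overly long one; the sum is maximized in between.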
Marginal Likelihood Surface
Several plausible hyper-parameters (local optima)
What do you expect to happen in each local optimum?
Demo
https://drafts.distill.pub/gp/
Marginal Likelihood and Parameter Learning
The marginal likelihood is non-convex
Especially in the very-small-data regime, a GP can end up in three different situations when optimizing the hyper-parameters:

- Short length-scales, low noise (highly nonlinear mean function with little noise)
- Long length-scales, high noise (everything is considered noise)
- Hybrid

Re-start hyper-parameter optimization from random initialization to mitigate the problem

With increasing dataset size, the GP typically ends up in the "hybrid" mode; other modes are unlikely

Ideally, we would integrate the hyper-parameters out. No closed-form solution → Markov chain Monte Carlo
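Random re-starts can be sketched in a few lines with `scipy.optimize.minimize`. The setup below is an assumption for illustration: a 1-D search over the log length-scale of an RBF kernel with the noise variance held fixed, keeping the best of several local optima:

```python
import numpy as np
from scipy.optimize import minimize

def neg_lml(log_ell, X, y, noise_var=0.01):
    """Negative log marginal likelihood as a function of the log length-scale."""
    ell = np.exp(log_ell[0])
    K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2)
    K += noise_var * np.eye(len(X))
    return 0.5 * y @ np.linalg.solve(K, y) + 0.5 * np.linalg.slogdet(K)[1]

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 10)
y = np.sin(X)

# restart local optimization from several random initializations,
# then keep the solution with the lowest negative LML
results = [minimize(neg_lml, x0=[rng.uniform(-2, 2)], args=(X, y))
           for _ in range(8)]
best = min(results, key=lambda r: r.fun)
print("best length-scale:", np.exp(best.x[0]))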
Why Does the Marginal Likelihood Work?

Overall goal: good generalization performance on unseen test data.
Minimizing the training error (e.g., via maximum likelihood) is not a good idea → overfitting.
Just adding uncertainty does not help either if the model is wrong, but it does make predictions more cautious.
The marginal likelihood seems to find a good balance between fitting the data and finding a simple model (Occam's razor).
Why does the marginal likelihood lead to models that generalize well?
Marginal Likelihood: Incremental Prediction
“Probability of the training data” given the parameters
General factorization (ignoring inputs X):
p(y | θ) = p(y_1, ..., y_N | θ)
         = p(y_1 | θ) p(y_2 | y_1, θ) p(y_3 | y_1, y_2, θ) · ... · p(y_N | y_1, ..., y_{N-1}, θ)
         = p(y_1 | θ) ∏_{n=2}^{N} p(y_n | y_1, ..., y_{n-1}, θ)
Marginal Likelihood: Incremental Prediction (2)
p(y | θ) = p(y_1 | θ) ∏_{n=2}^{N} p(y_n | y_1, ..., y_{n-1}, θ)
If we think of this as a sequence model (where data arrive sequentially), the marginal likelihood predicts the nth training observation given all "previous" observations.
Predict training data y_n that has not yet been accounted for (we only condition on y_1, ..., y_{n-1}) → treat the next data point as test data.
Intuition: if the model continuously predicted well on all N previous points, it will probably do well next time.
→ Proxy for the generalization error on unseen test data.
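This factorization can be checked numerically. The sketch below (illustrative data, assumed RBF kernel, fixed hyper-parameters) evaluates the log-marginal likelihood once as a joint multivariate Gaussian and once as the sum of one-step-ahead predictive log densities obtained by Gaussian conditioning; by the chain rule of probability the two agree exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
X = rng.uniform(-2, 2, size=(N, 1))
y = np.cos(X[:, 0]) + 0.1 * rng.standard_normal(N)

# Kernel matrix with noise: K = k(X, X) + sigma_n^2 I (assumed RBF kernel).
K = np.exp(-0.5 * (X - X.T) ** 2) + 0.1**2 * np.eye(N)

# Joint evaluation: log N(y | 0, K).
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
joint = -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)

# Sequential evaluation: sum_n log p(y_n | y_1, ..., y_{n-1}).
seq = 0.0
for n in range(N):
    if n == 0:
        mu, var = 0.0, K[0, 0]
    else:
        w = np.linalg.solve(K[:n, :n], K[:n, n])   # Gaussian conditioning
        mu, var = w @ y[:n], K[n, n] - w @ K[:n, n]
    seq += -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[n] - mu) ** 2 / var

print(joint, seq)  # identical up to numerical round-off
```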
Marginal Likelihood: Illustration
p(y | θ) = p(y_1, ..., y_N | θ) = p(y_1 | θ) ∏_{n=2}^{N} p(y_n | y_1, ..., y_{n-1}, θ)
Marginal Likelihood Computation in Action²
Short length-scale
²Thanks to Mark van der Wilk
Marginal Likelihood Computation in Action³
Long length-scale
³Thanks to Mark van der Wilk
Marginal Likelihood Computation in Action⁴
Optimal length-scale
⁴Thanks to Mark van der Wilk
Marginal Likelihood Evolution
Short length-scale: consistently overestimates the variance. No high density, even with observations inside the error bars.
Long length-scale: consistently underestimates the variance. Low density because observations fall outside the error bars.
Optimal length-scale: trades off both behaviors reasonably well.
Model Selection—Mean Function and Kernel
Assume we have a finite set of models M_i, each specifying a mean function m_i and a kernel k_i. How do we find the best one?

Some options:
Cross-validation
Bayesian Information Criterion, Akaike Information Criterion
Compare marginal likelihood values (assuming a uniform prior on the set of models)
Example

[Figure: GP posterior fits of f(x) on x ∈ [-4, 4] for four different kernels]
Constant kernel, LML = -1.1073
Linear kernel, LML = -1.0065
Matérn kernel, LML = -0.8625
Gaussian kernel, LML = -0.69308

Four different kernels (mean function fixed to m ≡ 0)
MAP hyper-parameters for each kernel
Log-marginal likelihood values for each (optimized) model
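A minimal sketch of this kind of comparison (toy data and hand-picked, unoptimized hyper-parameters, so the numbers will not match the LML values on the slide): evaluate the log-marginal likelihood under each candidate kernel and pick the largest.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 15
X = rng.uniform(-4, 4, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

def log_marginal_likelihood(K, y, noise_std=0.1):
    Ky = K + noise_std**2 * np.eye(len(y))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

kernels = {
    "constant": np.ones((N, N)),                 # k(x, x') = 1
    "linear": X @ X.T,                           # k(x, x') = x x'
    "gaussian": np.exp(-0.5 * (X - X.T) ** 2),   # RBF with length-scale 1
}
scores = {name: log_marginal_likelihood(K, y) for name, K in kernels.items()}
print(scores, "-> best:", max(scores, key=scores.get))
```

For data with smooth nonlinear structure, the Gaussian (RBF) kernel should score highest, mirroring the ranking on the slide.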
GP Training: Illustration⁵

Prior: f(x) = θ_s f_smooth(x) + θ_p f_periodic(x), with smooth and periodic GP priors, respectively.
The amount of periodicity vs. smoothness is chosen automatically by selecting the hyper-parameters θ_s, θ_p.
The marginal likelihood learns how to generalize, not just how to fit the data.
⁵Thanks to Mark van der Wilk
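Such an additive prior can be sketched as a weighted sum of kernels (hypothetical hyper-parameter values; `theta_s`, `theta_p` play the role of θ_s, θ_p above): the covariance of f is θ_s² · k_smooth + θ_p² · k_periodic, and a sum of valid kernels is again a valid kernel.

```python
import numpy as np

def k_smooth(a, b, ell=1.0):
    # Squared-exponential (RBF) kernel.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def k_periodic(a, b, period=2.0, ell=1.0):
    # Standard periodic (exp-sine-squared) kernel.
    d = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell**2)

def k_sum(a, b, theta_s=1.0, theta_p=0.5):
    # Weighted sum of the two components; the weights control how much
    # of the prior variance is smooth vs. periodic.
    return theta_s**2 * k_smooth(a, b) + theta_p**2 * k_periodic(a, b)

x = np.linspace(-3, 3, 50)
K = k_sum(x, x)
# Symmetric and positive semi-definite -> a valid covariance matrix.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > -1e-8)
```

In practice θ_s and θ_p would be set by maximizing the marginal likelihood, exactly as for the other hyper-parameters.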
Limitations and Guidelines
Limitations of Gaussian Processes
Computational and memory complexity (training set size N):
Training scales in O(N³)
Prediction (variances) scales in O(N²)
Memory requirement: O(ND + N²)
Practical limit: N ≈ 10,000

Some solution approaches:
Sparse GPs with inducing variables (e.g., Snelson & Ghahramani, 2006; Quiñonero-Candela & Rasmussen, 2005; Titsias, 2009; Hensman et al., 2013; Matthews et al., 2016)
Combinations of local GP expert models (e.g., Tresp, 2000; Cao & Fleet, 2014; Deisenroth & Ng, 2015)
Variational Fourier features (Hensman et al., 2018)
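The costs above can be traced to individual linear-algebra operations. A minimal sketch (toy data, assumed RBF kernel): the Cholesky factorization of the N × N kernel matrix dominates training, while each predictive variance needs one triangular solve.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(3)
N = 200
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

def rbf(A, B):
    return np.exp(-0.5 * (A - B.T) ** 2)

K = rbf(X, X) + 0.1**2 * np.eye(N)   # N x N kernel matrix: O(N^2) memory
L = np.linalg.cholesky(K)            # training: O(N^3)
alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True), lower=False)

x_star = np.array([[0.5]])                   # single test input
k_star = rbf(X, x_star)[:, 0]
mean = k_star @ alpha                        # O(N) per test point
v = solve_triangular(L, k_star, lower=True)  # O(N^2) per test point
var = 1.0 + 0.1**2 - v @ v                   # predictive variance
print(mean, var)
```

Doubling N multiplies the training cost by roughly eight, which is what makes the sparse and distributed approximations above attractive beyond N ≈ 10,000.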
Tips and Tricks for Practitioners
To set initial hyper-parameters, use domain knowledge.
Standardize the input data and set the initial length-scales to ℓ ≈ 0.5.
Standardize the targets y and set the initial signal variance to σ_f ≈ 1.
Often useful: set the initial noise level relatively high (e.g., σ_n ≈ 0.5 × σ_f), even if you think your data have low noise. The optimization surface for the other parameters will be easier to move in.
When optimizing the hyper-parameters, use random restarts or other tricks to avoid local optima.
Mitigate numerical instability (Cholesky decomposition of K + σ_n² I) by penalizing high signal-to-noise ratios σ_f/σ_n.
https://drafts.distill.pub/gp
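The initialization recipe can be sketched as follows (hypothetical raw data on an arbitrary scale; the constants 0.5 and 1 are the ones suggested above):

```python
import numpy as np

rng = np.random.default_rng(4)
X_raw = rng.uniform(0, 100, size=(30, 2))                 # inputs, arbitrary scale
y_raw = 5.0 + 0.2 * X_raw[:, 0] + rng.standard_normal(30)  # targets, arbitrary scale

# Standardize inputs and targets to zero mean, unit standard deviation.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y = (y_raw - y_raw.mean()) / y_raw.std()

init = {
    "length_scales": 0.5 * np.ones(X.shape[1]),  # one per input dimension
    "signal_std": 1.0,                           # matches unit target scale
    "noise_std": 0.5 * 1.0,                      # deliberately high initial noise
}
print(init)
```

With standardized data, these initial values put the optimizer in a sensible region regardless of the original units.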
Application Areas
Application Areas
[Figure: plot over angle in rad (x-axis) and angular velocity in rad/s (y-axis)]
Reinforcement learning and robotics: model value functions and/or dynamics with GPs
Bayesian optimization (experimental design): model unknown utility functions with GPs
Geostatistics: spatial modeling (e.g., landscapes, resources)
Sensor networks: time-series modeling and forecasting
Summary
[Figure: GP regression example, f(x) plotted over x]

Gaussian processes are the gold standard for regression.
Closely related to Bayesian linear regression.
Computations boil down to manipulating multivariate Gaussian distributions.
The marginal likelihood objective automatically trades off data fit and model complexity.
References I
[1] G. Bertone, M. P. Deisenroth, J. S. Kim, S. Liem, R. R. de Austri, and M. Welling. Accelerating the BSM Interpretation of LHC Data with Machine Learning. arXiv preprint arXiv:1611.02704, 2016.
[2] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian Processes for Regression. In Proceedings of the International Joint Conference on Neural Networks, 2016.
[3] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions. http://arxiv.org/abs/1410.7827, 2014.
[4] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.
[5] M. Cutler and J. P. How. Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In Proceedings of the International Conference on Robotics and Automation, 2015.
[6] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[7] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of Robotics: Science and Systems, 2011.
[8] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing, 72(7–9):1508–1524, Mar. 2009.
[9] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.
[10] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC. In Advances in Neural Information Processing Systems, 2013.
[11] N. HajiGhassemi and M. P. Deisenroth. Approximate Inference for Long-Term Forecasting with Periodic Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2014.
[12] J. Hensman, N. Durrande, and A. Solin. Variational Fourier Features for Gaussian Processes. Journal of Machine Learning Research, pages 1–52, 2018.
References II
[13] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 2013.
[14] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235–284, Feb. 2008.
[15] M. C. H. Lee, H. Salimbeni, M. P. Deisenroth, and B. Glocker. Patch Kernels for Gaussian Processes in High-Dimensional Imaging Problems. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.
[16] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages 1–11, 2014.
[17] D. J. C. MacKay. Introduction to Gaussian Processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, volume 168, pages 133–165. Springer, 1998.
[18] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani. On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes. In Proceedings of the International Conference on Artificial
Intelligence and Statistics, 2016.
[19] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the International Conference on Information Processing in Sensor Networks, pages 109–120. IEEE Computer Society, 2008.
[20] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2):1939–1960, 2005.
[21] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[22] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling. Philosophical Transactions of the Royal Society (Part A), 371(1984), Feb. 2013.
References III
[23] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2002.
[24] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. The MIT Press, Cambridge, MA, USA, 2006.
[25] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the
International Conference on Artificial Intelligence and Statistics, 2009.
[26] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741, 2000.
[27] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep Kernel Learning. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2016.