Sparse Gaussian processes using pseudo-inputs
TRANSCRIPT
Sparse Gaussian processes using pseudo-inputs
Ed Snelson ([email protected])
Gatsby Computational Neuroscience Unit, UCL
NIPS 2005, 6th December
Work done with Zoubin Ghahramani
Nonlinear regression
Consider the problem of nonlinear regression:
You want to learn a function f with error bars from data D = {X,y}
[figure: data points with a fitted function and error bars; axes x, y]
A Gaussian process is a prior over functions p(f) which can be used
for Bayesian regression:
p(f|D) = p(f) p(D|f) / p(D)
Gaussian process (GP) priors
GP: consistent Gaussian prior on any set of function values f = {fn}, given corresponding inputs X = {xn}, n = 1, …, N
[figure: one sample function drawn from the prior; axes x, f]
prior: p(f|X) = N(0, KN)
Covariance: Knn′ = K(xn, xn′; θ), hyperparameters θ

Knn′ = v exp( −½ ∑d=1…D ( (xn(d) − xn′(d)) / rd )² )
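This is the squared-exponential (ARD) covariance with per-dimension lengthscales rd. As a concrete illustration, a minimal numpy sketch (the function name and argument layout are my own, not from the talk):

```python
import numpy as np

def se_ard_kernel(X1, X2, v, r):
    """Squared-exponential (ARD) covariance K(x, x'; theta).

    X1: (N1, D) and X2: (N2, D) inputs
    v:  signal variance (overall amplitude)
    r:  (D,) per-dimension lengthscales r_d
    Returns the (N1, N2) covariance matrix.
    """
    # Rescale each dimension by its lengthscale, then compute all
    # pairwise squared Euclidean distances.
    Z1, Z2 = X1 / r, X2 / r
    sq = (np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :]
          - 2.0 * Z1 @ Z2.T)
    return v * np.exp(-0.5 * np.maximum(sq, 0.0))  # clip tiny negative dists
```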
GP regression
Gaussian observation noise: yn = fn + εn, where εn ∼ N(0, σ²)
[figure: sample data; axes x, y]
marginal likelihood: p(y|X) = N(0, KN + σ²I)
[figure: predictive distribution at a test input x∗; axes x, y]
predictive distribution: p(y∗|x∗, X, y) = N(µ∗, σ∗²)
µ∗ = K∗N (KN + σ²I)⁻¹ y
σ∗² = K∗∗ − K∗N (KN + σ²I)⁻¹ KN∗ + σ²
Problem: O(N³) computation
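A hedged numpy sketch of these predictive equations (my own helper; `kernel` is any covariance function shaped like the earlier sketch, e.g. `lambda A, B: se_ard_kernel(A, B, v, r)`). The Cholesky factorization of the N×N matrix is the O(N³) step the rest of the talk attacks:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X, y, Xstar, kernel, noise_var):
    """Full GP predictive mean/variance; O(N^3) in the training set size."""
    KN  = kernel(X, X)                     # (N, N)
    KsN = kernel(Xstar, X)                 # (T, N)
    kss = np.diag(kernel(Xstar, Xstar))    # (T,) prior variances K**
    # Factorize KN + sigma^2 I once -- this is the N^3 bottleneck.
    L = cho_factor(KN + noise_var * np.eye(len(X)))
    mu  = KsN @ cho_solve(L, y)            # K*N (KN + s2 I)^-1 y
    var = kss - np.sum(KsN * cho_solve(L, KsN.T).T, axis=1) + noise_var
    return mu, var
```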
Overview
This talk contains 2 key ideas:
1. A new sparse Gaussian process approximation based on a small
set of M ‘pseudo-inputs’ (M ≪ N). This reduces computational
complexity to O(M²N)
2. A gradient-based learning procedure for finding the pseudo-inputs
and hyperparameters of the Gaussian process, in one joint
optimization
Two stage generative model
[figure: M pseudo-inputs X̄ with sampled pseudo function values f̄; axes x, f]
pseudo-input prior: p(f̄|X̄) = N(0, KM)
1. Choose any set of M (pseudo-) inputs X̄
2. Draw corresponding function values f̄ from the prior
Two stage generative model
[figure: function values f drawn conditioned on the pseudo points; axes x, f]
3. Draw f conditioned on f̄
conditional: p(f|f̄) = N(µ, Σ)
µ = KNM KM⁻¹ f̄
Σ = KN − KNM KM⁻¹ KMN
• This two-stage procedure defines exactly the same GP prior (a one-line check follows below)
• We have not gained anything yet, but it inspires a sparse
approximation ...
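To see why the two stages recover the original prior, marginalize f̄ with the standard linear-Gaussian identity (a one-line check, not shown on the slide):

```latex
\int \mathcal{N}\big(\mathbf{f} \mid K_{NM}K_M^{-1}\bar{\mathbf{f}},\;
        K_N - K_{NM}K_M^{-1}K_{MN}\big)\,
     \mathcal{N}\big(\bar{\mathbf{f}} \mid \mathbf{0},\, K_M\big)\,
     d\bar{\mathbf{f}}
  = \mathcal{N}\big(\mathbf{0},\;
        K_{NM}K_M^{-1}K_{MN} + (K_N - K_{NM}K_M^{-1}K_{MN})\big)
  = \mathcal{N}(\mathbf{0},\, K_N)
```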
Factorized approximation
[figure: factorized approximation given the pseudo points; axes x, f]
single point conditional: p(fn|f̄) = N(µn, λn)
µn = KnM KM⁻¹ f̄
λn = Knn − KnM KM⁻¹ KMn
Approximate: p(f|f̄) ≈ ∏n p(fn|f̄) = N(µ, Λ), Λ = diag(λ)

Minimum KL: min over {qn} of KL[ p(f|f̄) ‖ ∏n qn(fn) ]
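A short numpy sketch of these factorized moments (my own helper; `kernel` as in the earlier sketches):

```python
import numpy as np

def factorized_conditional(X, Xbar, fbar, kernel):
    """Mean and diagonal variance of prod_n p(f_n | fbar)."""
    KM  = kernel(Xbar, Xbar)            # (M, M)
    KNM = kernel(X, Xbar)               # (N, M)
    # N x N matrix formed only for its diagonal here, for brevity;
    # a real implementation computes the diagonal K_nn directly.
    Knn = np.diag(kernel(X, X))
    A = np.linalg.solve(KM, KNM.T).T    # KNM KM^-1, shape (N, M)
    mu  = A @ fbar                      # mu_n = K_nM KM^-1 fbar
    lam = Knn - np.sum(A * KNM, axis=1) # lambda_n = K_nn - K_nM KM^-1 K_Mn
    return mu, lam
```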
Sparse pseudo-input Gaussian processes (SPGP)
Integrate out f̄ to obtain the SPGP prior: p(f) = ∫ df̄ [∏n p(fn|f̄)] p(f̄)

GP prior: N(0, KN) ≈ SPGP prior: p(f) = N(0, KNM KM⁻¹ KMN + Λ)
[figure: covariance cartoon: full N×N matrix ≈ low-rank term + diagonal]
• SPGP covariance inverted in O(M²N) ⇒ sparse (see the sketch below)
• SPGP = GP with non-stationary covariance parameterized by X̄
• Given data {X, y} with noise σ², predictive mean and variance
can be computed in O(M) and O(M²) per test case respectively
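The low-rank-plus-diagonal structure is what buys the O(M²N) cost: the Woodbury identity reduces the N×N inverse to an M×M one. A minimal sketch under the same assumptions as the earlier snippets (the helper name and `jitter` argument are mine):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def spgp_solve(X, Xbar, kernel, noise_var, y, jitter=1e-6):
    """Solve (KNM KM^-1 KMN + Lambda + sigma^2 I) alpha = y in O(M^2 N),
    never forming the N x N covariance explicitly."""
    M = len(Xbar)
    KM  = kernel(Xbar, Xbar) + jitter * np.eye(M)  # (M, M)
    KNM = kernel(X, Xbar)                          # (N, M)
    # N x N formed only for its diagonal, for brevity (see earlier note).
    Knn = np.diag(kernel(X, X))
    LM = cho_factor(KM)
    # Diagonal correction Lambda, then add the noise variance.
    lam = Knn - np.sum(KNM * cho_solve(LM, KNM.T).T, axis=1)
    D = lam + noise_var
    # Woodbury: (Q + D)^-1 = D^-1 - D^-1 KNM B^-1 KMN D^-1,
    # with B = KM + KMN D^-1 KNM, an M x M matrix.
    B = KM + KNM.T @ (KNM / D[:, None])
    LB = cho_factor(B)
    alpha = y / D - (KNM / D[:, None]) @ cho_solve(LB, KNM.T @ (y / D))
    return alpha
```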
How to find pseudo-inputs?
Pseudo-inputs are like extra hyperparameters: we jointly maximize the
marginal likelihood w.r.t. (X̄, θ, σ²)

p(y|X, X̄, θ, σ²) = N(0, KNM KM⁻¹ KMN + Λ + σ²I)
Key advantages over many related sparse methods¹:
1. Pseudo-inputs not constrained to a subset of the data (‘active set’) ⇒
improved accuracy and flexibility
2. Joint optimization avoids discontinuities that arise when active
set selection is interleaved with hyperparameter learning
¹ Tresp (2000), Smola & Bartlett (2001), Csató & Opper (2002), Seeger et al. (2003)
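For illustration, a hedged sketch of the objective itself: the SPGP negative log marginal likelihood, computed in O(M²N) via the matrix determinant lemma and Woodbury. It reuses the `se_ard_kernel` sketch from earlier; the log-parameterization and function name are my own, and the talk's actual implementation used analytic gradients with CG/L-BFGS rather than this plain value function:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def spgp_neg_log_marginal(Xbar, log_v, log_r, log_s2, X, y, jitter=1e-6):
    """-log N(y | 0, KNM KM^-1 KMN + Lambda + sigma^2 I) in O(M^2 N).
    Minimize jointly over pseudo-inputs Xbar and hyperparameters."""
    v, r, s2 = np.exp(log_v), np.exp(log_r), np.exp(log_s2)
    kern = lambda A, B: se_ard_kernel(A, B, v, r)  # sketch from earlier
    N, M = len(X), len(Xbar)
    KM  = kern(Xbar, Xbar) + jitter * np.eye(M)
    KNM = kern(X, Xbar)
    LM = cho_factor(KM)
    # Lambda diagonal: K_nn - K_nM KM^-1 K_Mn; K_nn = v for this kernel.
    lam = v - np.sum(KNM * cho_solve(LM, KNM.T).T, axis=1)
    D = lam + s2                            # diag(Lambda + sigma^2 I)
    B = KM + KNM.T @ (KNM / D[:, None])     # M x M Woodbury matrix
    LB = cho_factor(B)
    # log det(Q + D) = log det(B) - log det(KM) + sum(log D)
    logdet = (2 * np.sum(np.log(np.diag(LB[0])))
              - 2 * np.sum(np.log(np.diag(LM[0])))
              + np.sum(np.log(D)))
    # quadratic form y^T (Q + D)^-1 y via Woodbury
    yD = y / D
    quad = y @ yD - yD @ KNM @ cho_solve(LB, KNM.T @ yD)
    return 0.5 * (logdet + quad + N * np.log(2 * np.pi))
```

One could then pack (X̄, log v, log r, log σ²) into a single vector and hand the objective to, say, scipy.optimize.minimize with method='L-BFGS-B'.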
1D demo
[figure: 1D demo; data y vs x, pseudo-input positions X̄, and bar indicators for the amplitude, lengthscale, and noise hyperparameters]
Initialize adversarially: amplitude and lengthscale too big, noise too small, pseudo-inputs bunched up
1D demo
[figure: 1D demo after optimization; pseudo-inputs spread across the data and hyperparameter bars at sensible values]
Pseudo-inputs and hyperparameters optimized
Selected results (more at the poster)
kin40k¹ — SPGP vs random
[plot: test mean squared error (log scale, 10⁻² to 10⁻¹) vs size of ‘active set’ M (0 to 1200); kin40k: 10000 train, 30000 test, 9 attributes]
horizontal line – full GP on a subset of size 2000 (∗)
black – random subset – hyperparameters obtained from ∗
red squares – SPGP – pseudo-inputs optimized, hyperparameters obtained from ∗
blue circles – SPGP – pseudo-inputs and hyperparameters optimized
¹ as tested in Seeger et al. (2003)
kin40k — SPGP vs info-gain
[plot: test mean squared error (log scale, 10⁻² to 10⁻¹) vs size of ‘active set’ M (0 to 1200); kin40k: 10000 train, 30000 test, 9 attributes]
horizontal line – full GP on a subset of size 2000 (∗)
black – info-gain¹ – hyperparameters obtained from ∗
red squares – SPGP – pseudo-inputs optimized, hyperparameters obtained from ∗
blue circles – SPGP – pseudo-inputs and hyperparameters optimized
¹ Seeger et al. (2003)
kin40k — SPGP vs Smo-Bart
[plot: test mean squared error (log scale, 10⁻² to 10⁻¹) vs size of ‘active set’ M (0 to 1200); kin40k: 10000 train, 30000 test, 9 attributes]
horizontal line – full GP on a subset of size 2000 (∗)
black – Smo-Bart¹ – hyperparameters obtained from ∗
red squares – SPGP – pseudo-inputs optimized, hyperparameters obtained from ∗
blue circles – SPGP – pseudo-inputs and hyperparameters optimized
¹ Smola & Bartlett (2001)
Local maxima and overfitting?
• Many local maxima, but can initialize pseudo-inputs on a random
subset of the data. Hyperparameter initialization is more tricky
• Many parameters: MD + |θ| + 1 instead of |θ| + 1. Overfitting?
(D = input space dimension, M = no. of pseudo-inputs)
• Consider M = N and X̄ = X
– Here KMN = KM = KN, Λ = 0 ⇒ SPGP collapses to the full GP (a numerical check follows below)
• However interaction with hyperparameter learning can lead to
overfitting behaviour
• For a full Bayesian treatment: sample pseudo-inputs and
hyperparameters from p(X̄, θ, σ²|X, y) instead of optimizing
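The collapse to the full GP is easy to verify numerically; a tiny sanity test using the `se_ard_kernel` sketch from earlier (the jitter and tolerances are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
v, r = 1.0, np.array([1.0])

KN = se_ard_kernel(X, X, v, r)            # full GP prior covariance
# With Xbar = X: KNM = KM = KN, so the low-rank term reproduces KN
# and the diagonal correction Lambda vanishes (up to jitter).
Q = KN @ np.linalg.solve(KN + 1e-6 * np.eye(50), KN)
lam = np.diag(KN) - np.diag(Q)

assert np.allclose(Q + np.diag(lam), KN, atol=1e-4)
print("SPGP prior covariance matches the full GP when Xbar = X")
```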
Modeling non-stationarity
[figure: standard GP fit vs SPGP fit with optimized pseudo-inputs X̄ on the same data; axes x, y]
Extra flexibility of SPGP allows some non-stationary effects to be
modeled
Limitations and possible extensions
• Large pseudo-set size M and/or high-dimensional input space D
means the optimization can become impractically large
– possible solution: learn a projection of the input space into a
lower-dimensional space [work in progress]
– may also prevent overfitting
• We used CG or L-BFGS but many optimization schemes available:
– Optimize subsets of variables iteratively (chunking)
– Stochastic gradient descent
– Hybrid: pick some points randomly, optimize others
– EM algorithm
• Extension to classification and other likelihood functions
Conclusions
• New method for sparse GP regression
• Significant decrease in test error, especially for very sparse solutions
• The added flexibility of moving pseudo-inputs, which are not constrained
to lie on data points, leads to better solutions
• Hyperparameters jointly learned with pseudo-inputs in a single
smooth optimization
• Matlab code, draft paper, and this talk available
(www.gatsby.ucl.ac.uk/~snelson)
Acknowledgements
Sheffield GP Round-table
Carl Rasmussen, Joaquin Quinonero-Candela, Peter Sollich, Matthias Seeger,
Neil Lawrence, Chris Williams, Tom Minka, Sam Roweis
PASCAL European Network of Excellence
Relation of SPGP to PLV¹
SPGP
Approximate conditional: p(f|f̄) ≈ ∏n p(fn|f̄) = N(µ, Λ)
– minimum-KL fully factorized approximation
Marginal likelihood: N(0, KNM KM⁻¹ KMN + Λ + σ²I)
– marginal variances match the full GP everywhere
Pseudo-inputs: not constrained to data – optimized by gradient ascent
on the marginal likelihood, together with the hyperparameters

PLV
Approximate conditional: p(f|f̄) ≈ N(µ, 0)
– uncertainty not taken into account – a deterministic approximation
Marginal likelihood: N(0, KNM KM⁻¹ KMN + σ²I)
– marginal variances decay to σ² away from ‘active set’ points
Active set: chosen as a subset of the data using a greedy info-gain
criterion; active set selection and hyperparameter learning interleaved
¹ Seeger et al. (2003)
PLV with pseudo-inputs
[figure: three predictive-distribution panels (a), (b), (c); axes x, y]
Predictive distributions for: (a) full GP, (b) gradient ascent on SPGP
likelihood, (c) gradient ascent on PLV likelihood.
Initial pseudo point positions — red crosses
Final pseudo point positions — blue crosses