Feature representation with deep Gaussian processes
TRANSCRIPT
Feature representation with Deep Gaussian processes
Andreas Damianou
Department of Neuro- and Computer Science, University of Sheffield, UK
Feature Extraction with Gaussian Processes Workshop, Sheffield, 18/09/2014
Outline
Part 1: A general view
  • Deep modelling and deep GPs
Part 2: Structure in the latent space (priors for the features)
  • Dynamics
  • Autoencoders
Part 3: Deep Gaussian processes
  • Bayesian regularization
  • Inducing points
  • Structure: ARD and MRD (multi-view)
  • Examples
Summary
Feature representation + feature relationships
I Good data representations facilitate learning / inference.
I Deep structures facilitate learning associated with abstract information (Bengio, 2009).
Hierarchy of features
[Lee et al. 09, “Convolutional DBNs for Scalable Unsupervised Learning of Hierarchical Representations”]
Hierarchy of features
[Figure: hierarchy of features, from local features in the lower layers to abstract concepts in the higher layers]
[Damianou and Lawrence 2013, “Deep Gaussian Processes”]
Deep learning (directed graph)
Y = f(f(· · · f(X))), Hi = fi(Hi−1)
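Read as code, the composition above is just repeated function application, layer by layer. A minimal sketch (the placeholder nonlinearities below merely stand in for the layer mappings; they are not GPs):

import numpy as np

def deep_map(X, layers):
    # H_i = f_i(H_{i-1}), with H_0 = X and Y = H_L
    H = X
    for f in layers:
        H = f(H)
    return H

# toy usage: three arbitrary nonlinear layers
Y = deep_map(np.linspace(-1, 1, 5)[:, None],
             [np.tanh, lambda H: H ** 2, np.sin])
print(Y)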
Deep Gaussian processes - Big Picture
Deep GP:
I Directed graphical model
I Non-parametric, non-linear mappings f
I Mappings f marginalised out analytically
I Likelihood is a non-linear function of the inputs
I Continuous variables
I NOT a GP!
Challenges:
I Marginalise out H
I No sampling: analytic approximation of objective
Solution:
I Variational approximation
I This also gives access to the model evidence
Hierarchical GP-LVM
I Hidden layers are not marginalised out.
I This leads to some difficulties.
[Lawrence and Moore, 2007]
Sampling from a deep GP
[Figure: samples propagated through a deep GP; observed input layer, unobserved (hidden) layers connected by GP mappings f, output layer Y]
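A minimal sketch of this sampling process (plain NumPy, squared-exponential kernels; layer count and lengthscales are illustrative choices, not taken from the talk): each layer's function is drawn from a GP whose inputs are the previous layer's output.

import numpy as np

def rbf(A, B, lengthscale=1.0):
    # squared-exponential kernel between 1-D column inputs
    d2 = (A - B.T) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_deep_gp(x, n_layers=2, lengthscale=0.5, rng=np.random.default_rng(0)):
    # Propagate a sample through the layers: h_0 = x, h_l ~ GP(0, k(h_{l-1}, h_{l-1}))
    h = x
    for _ in range(n_layers):
        K = rbf(h, h, lengthscale) + 1e-8 * np.eye(len(h))
        h = rng.multivariate_normal(np.zeros(len(h)), K)[:, None]
    return h

x = np.linspace(-2, 2, 200)[:, None]
sample = sample_deep_gp(x)   # a strongly warped, non-Gaussian-looking function of x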
Outline
Part 1: A general view
  • Deep modelling and deep GPs
Part 2: Structure in the latent space (priors for the features)
  • Dynamics
  • Autoencoders
Part 3: Deep Gaussian processes
  • Bayesian regularization
  • Inducing points
  • Structure: ARD and MRD (multi-view)
  • Examples
Summary
Dynamics
[Figure: GP-LVM graphical model; latent H maps to observed Y, with a temporal input t driving H]
I If Y is a multivariate time-series, then H also has to be one.
I Place a temporal GP prior on the latent space:
h = h(t) ∼ GP(0, kh(t, t))
f = f(h) ∼ GP(0, kf(h, h))
y = f(h) + ε
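A minimal sketch of this generative process (NumPy, one latent and one output dimension; kernel choices and the noise level are illustrative): h is drawn from a GP over time, then y is a GP draw over h plus noise.

import numpy as np

def rbf(A, B, lengthscale=1.0):
    return np.exp(-0.5 * (A - B.T) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 300)[:, None]

# h = h(t) ~ GP(0, k_h(t, t)): smooth latent trajectory over time
Kh = rbf(t, t, lengthscale=2.0) + 1e-8 * np.eye(len(t))
h = rng.multivariate_normal(np.zeros(len(t)), Kh)[:, None]

# y = f(h) + eps, with f ~ GP(0, k_f(h, h))
Kf = rbf(h, h, lengthscale=0.5) + 1e-8 * np.eye(len(h))
f = rng.multivariate_normal(np.zeros(len(h)), Kf)
y = f + 0.05 * rng.standard_normal(len(h))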
Dynamics
I Dynamics are encoded in the covariance matrix K = k(t, t).
I We can consider special forms for K.
[Figures: special forms of K for modelling individual sequences (left) and periodic data (right)]
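For the periodic case, a covariance of the standard periodic form could be used for K. A sketch (parameter names and values are illustrative, not from the talk):

import numpy as np

def periodic_kernel(t1, t2, period=1.0, lengthscale=1.0, variance=1.0):
    # Standard periodic covariance: points a whole period apart are fully
    # correlated, so latent trajectories drawn from it repeat with that period.
    d = np.abs(t1 - t2.T)
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

t = np.linspace(0, 3, 5)[:, None]
K = periodic_kernel(t, t, period=1.0)
print(np.round(K, 2))   # entries exactly one period apart equal the diagonal value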
I https://www.youtube.com/watch?v=i9TEoYxaBxQ (missa)
I https://www.youtube.com/watch?v=mUY1XHPnoCU (dog)
I https://www.youtube.com/watch?v=fHDWloJtgk8 (mocap)
Autoencoder
[Figure: auto-encoder graphical model, Y mapped to latent H and back to Y]
GP-LVM:
[Figure: latent space recovered by the GP-LVM]
Non-parametric auto-encoder:
[Figure: latent space recovered by the non-parametric auto-encoder]
Outline
Part 1: A general view
  • Deep modelling and deep GPs
Part 2: Structure in the latent space (priors for the features)
  • Dynamics
  • Autoencoders
Part 3: Deep Gaussian processes
  • Bayesian regularization
  • Inducing points
  • Structure: ARD and MRD (multi-view)
  • Examples
Summary
Sampling from a deep GP
[Figure: samples propagated through a deep GP; observed input layer, unobserved (hidden) layers connected by GP mappings f, output layer Y]
Modelling a step function - Posterior samples
[Figure: posterior samples when modelling a step function]
I GP (top), 2- and 4-layer deep GP (middle, bottom).
I Posterior is bimodal.
Image by J. Hensman
MAP optimisation?
[Figure: graphical model x → h2 → h1 → y, with a GP mapping f between each pair of layers]
I Joint = p(y|h1) p(h1|h2) p(h2|x)
I MAP optimization is extremely problematic because:
  • Dimensionality of the h's has to be decided a priori
  • Prone to overfitting if the h's are treated as parameters
  • Deep structures are not supported by the model's objective but have to be forced [Lawrence & Moore '07]
Regularization solution: approximate Bayesian framework
I Analytic variational bound F ≤ log p(y|x)
  • Extend the variational free energy approach of sparse GPs (Titsias 2009) / variational compression tricks
  • Approximately marginalise out h
I Automatic structure discovery (nodes, connections, layers)
• Use the Automatic / Manifold Relevance Determination trick
I ...
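For orientation, the generic form of such a bound (the standard variational free energy; the deep-GP-specific inducing-point construction follows on the next slides) is
F = E_q[log p(y|h)] − KL(q(h) ‖ p(h|x)) ≤ log p(y|x),
where q(h) approximates the posterior over the hidden layers; maximising F with respect to q tightens the bound and gives the approximation to the model evidence.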
Direct marginalisation of h is intractable (O o)
I New objective: p(y|x) = ∫_{h1} p(y|h1) ∫_{h2} p(h1|h2) p(h2|x)
I p(h1|x) = ∫_{h2,f2} p(h1|f2) p(f2|h2) p(h2|x), where p(f2|h2) contains k(h2, h2)^{-1}, so h2 cannot be integrated out analytically
I Augment with inducing inputs h̃2 and inducing outputs u2:
  p(h1|x, h̃2) = ∫_{h2,f2,u2} p(h1|f2) p(f2|u2, h2) p(u2|h̃2) p(h2|x)
I Lower-bound with a variational distribution Q:
  log p(h1|x, h̃2) ≥ ∫_{h2,f2,u2} Q log [ p(h1|f2) p(f2|u2, h2) p(u2|h̃2) p(h2|x) / Q ]
I Choosing Q = p(f2|u2, h2) q(u2) q(h2) cancels the difficult conditional p(f2|u2, h2):
  log p(h1|x, h̃2) ≥ ∫_{h2,f2,u2} Q log [ p(h1|f2) p(u2|h̃2) p(h2|x) / (q(u2) q(h2)) ]
I p(u2|h̃2) contains k(h̃2, h̃2)^{-1}, which is tractable because the inducing inputs h̃2 are free variational parameters
Inducing points: sparseness, tractability and Big Data
[Figure: latent inputs h1, ..., hN with function values f1, ..., fN, augmented by inducing inputs h̃(i) with inducing outputs u(i)]
I Inducing points originally introduced for faster (sparse) GPs
I Our manipulation allows us to compress information from the inputs of every layer (variational compression)
I This induces tractability
I Viewing them as global variables ⇒ extension to Big Data [Hensman et al., UAI 2013]
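A minimal sketch of the compression idea (plain NumPy; this is a DTC/projected-process style prediction rather than the full variational bound used in the deep GP, and all names and values are illustrative): everything the model knows about the N data points flows through the M inducing inputs, so only an M x M matrix is ever inverted.

import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    # squared-exponential kernel between row-vector sets A (n x d) and B (m x d)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sparse_gp_predict(X, y, Z, Xstar, noise=0.1):
    # All information about f is routed through the inducing inputs Z,
    # so only k(Z, Z) (M x M) is inverted, not k(X, X) (N x N).
    Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
    Kxz = rbf(X, Z)
    Ksz = rbf(Xstar, Z)
    # mean of q(u) from the standard sparse-GP equations
    Sigma = Kzz + Kxz.T @ Kxz / noise ** 2
    mu_u = Kzz @ np.linalg.solve(Sigma, Kxz.T @ y) / noise ** 2
    # predictive mean at the test inputs
    return Ksz @ np.linalg.solve(Kzz, mu_u)

# toy usage: N = 500 points compressed through M = 15 inducing inputs
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 500)[:, None]
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(500)
Z = np.linspace(-3, 3, 15)[:, None]
print(sparse_gp_predict(X, y, Z, np.array([[0.0], [1.0]])))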
Automatic dimensionality detection
I Achieved by employing automatic relevance determination (ARD) priors for the mapping f.
I f ∼ GP(0, kf ) with:
kf(xi, xj) = σ² exp( −(1/2) Σ_{q=1..Q} wq (xi,q − xj,q)² )
I Example:
[Figure: ARD example]
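A minimal sketch of this ARD covariance (plain NumPy; function and parameter names are illustrative): a weight wq driven to zero removes dimension q from the kernel and hence from the model.

import numpy as np

def ard_rbf(X1, X2, variance=1.0, weights=None):
    # ARD squared-exponential: k(x_i, x_j) = sigma^2 exp(-0.5 * sum_q w_q (x_iq - x_jq)^2)
    # A weight w_q near zero switches dimension q off.
    Q = X1.shape[1]
    w = np.ones(Q) if weights is None else np.asarray(weights)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2 * w).sum(-1)
    return variance * np.exp(-0.5 * d2)

# toy usage: the second input dimension is made irrelevant (w_2 = 0)
X = np.random.default_rng(1).standard_normal((5, 2))
K = ard_rbf(X, X, variance=1.0, weights=[1.0, 0.0])
print(K.shape)  # (5, 5); identical to a 1-D kernel on the first column only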
Deep GP: MNIST example
http://staffwww.dcs.sheffield.ac.uk/people/A.Damianou/research/index.html#DeepGPs
MNIST: The first layer
1 layer GP-LVM:
[Figure: ARD weights of the latent dimensions for the 1-layer GP-LVM]
5 layer deep GP (showing 1st layer):
[Figure: ARD weights of the first-layer latent dimensions of the deep GP]
Manifold Relevance Determination
I Observations come in two different views: Y and Z.
I The latent space is segmented into parts private to Y, private to Z, and shared between Y and Z.
I Used for data consolidation and discovering commonalities.
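An illustrative sketch of how that segmentation can be read off from the per-view ARD weights (the threshold, names and toy weights below are made up for illustration, not part of MRD itself):

import numpy as np

def segment_latent_space(w_Y, w_Z, threshold=1e-2):
    # Given ARD relevance weights for each view, label every latent dimension
    # as shared, private to Y, private to Z, or switched off.
    w_Y, w_Z = np.asarray(w_Y), np.asarray(w_Z)
    on_Y, on_Z = w_Y > threshold, w_Z > threshold
    return {
        "shared": np.where(on_Y & on_Z)[0],
        "private_Y": np.where(on_Y & ~on_Z)[0],
        "private_Z": np.where(~on_Y & on_Z)[0],
        "off": np.where(~on_Y & ~on_Z)[0],
    }

# toy usage with made-up weights for a 6-dimensional latent space
print(segment_latent_space([0.9, 0.8, 0.0, 0.5, 0.0, 0.0],
                           [0.7, 0.0, 0.6, 0.4, 0.0, 0.0]))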
MRD weights
Deep GPs: Another multi-view example
[Figure: multi-view example results]
Automatic structure discovery
Tools:
I ARD: Eliminate unnecessary nodes/connections
I MRD: Conditional independencies
I Approximating evidence: Number of layers (?)
Deep GP variants
[Figure: graphical models of the deep GP variants: Deep GP (supervised), Deep GP (unsupervised), Multi-view, Warped GP, Temporal, Autoencoder]
Summary
I A deep GP is not a GP.
I Sampling is straightforward; regularization and training need to be worked out.
I The solution is a special treatment of auxiliary variables.
I Many variants: multi-view, temporal, autoencoders ...
I Future: how to make it scalable?
I Future: how does it compare to / complement more traditional deep models?
Thanks
Thanks to Neil Lawrence, James Hensman, Michalis Titsias, Carl Henrik Ek.
References:
• N. D. Lawrence (2006), "The Gaussian process latent variable model", Technical Report CS-06-03, Department of Computer Science, The University of Sheffield.
• N. D. Lawrence (2006), "Probabilistic dimensional reduction with the Gaussian process latent variable model" (talk).
• C. E. Rasmussen (2008), "Learning with Gaussian Processes", Max Planck Institute for Biological Cybernetics, published 5 Feb. 2008 (Videolectures.net).
• C. E. Rasmussen and C. K. I. Williams (2006), "Gaussian Processes for Machine Learning", MIT Press, Cambridge, MA. ISBN 026218253X.
• M. K. Titsias (2009), "Variational learning of inducing variables in sparse Gaussian processes", AISTATS 2009.
• A. C. Damianou, M. K. Titsias and N. D. Lawrence (2011), "Variational Gaussian process dynamical systems", NIPS 2011.
• A. C. Damianou, C. H. Ek, M. K. Titsias and N. D. Lawrence (2012), "Manifold Relevance Determination", ICML 2012.
• A. C. Damianou and N. D. Lawrence (2013), "Deep Gaussian processes", AISTATS 2013.
• J. Hensman (2013), "Gaussian processes for Big Data", UAI 2013.