
Deep Gaussian Processes

Neil D. Lawrence

11th July 2015, Deep Learning@ICML

Outline

Background

Massively Missing Data

Deep Gaussian Processes

Samples and Results

Ouroboros

Separation of Model and Algorithm

- Machine Learning:

  data + model = prediction

- Model encodes our beliefs about the regularities of the universe.
- I'm using simple and complex in the sense of how easy to understand.
- Neural network: complex model, simple algorithm.
- Gaussian process: simple model, complex algorithm.

Is the Model Complex?

- More details in this blog post: http://inverseprobability.com/2015/02/28/questions-on-deep-gaussian-processes/
- This talk: it's about the model.
- Talk later today at Large Scale Kernel Machines is about part of the algorithm.

GaussianFace

(Lu and Tang, 2014)

- First system to surpass human performance on the cropped Labeled Faces in the Wild (LFW) data. http://tinyurl.com/nkt9a38
- Lots of feature engineering, followed by a Discriminative GP-LVM.

[Figure 4 (Lu and Tang, 2014): ROC curves on LFW, true positive rate against false positive rate. Methods compared: High dimensional LBP (95.17%) [Chen et al. 2013]; Fisher Vector Faces (93.03%) [Simonyan et al. 2013]; TL Joint Bayesian (96.33%) [Cao et al. 2013]; Human, cropped (97.53%) [Kumar et al. 2009]; DeepFace-ensemble (97.35%) [Taigman et al. 2014]; ConvNet-RBM (92.52%) [Sun et al. 2013]; GaussianFace-FE + GaussianFace-BC (98.52%). The GaussianFace method achieves the best performance, beating human-level performance.]

papers comparing human and computer-based face verification performance (Tang and Wang 2004; O’Toole et al. 2007; Phillips and O’Toole 2014). It has been shown that the best current face verification algorithms perform better than humans in the good and moderate conditions. So, it is really not that difficult to beat human performance in some specific scenarios.

As pointed out by (O’Toole et al. 2012; Sinha et al. 2005), humans and computer-based algorithms have different strategies in face verification. Indeed, by contrast to performance with unfamiliar faces, human face verification abilities for familiar faces are relatively robust to changes in viewing parameters such as illumination and pose. For example, Bruce (Bruce 1982) found human recognition memory for unfamiliar faces dropped substantially when there were changes in viewing parameters. Besides, humans can take advantage of non-face configural information from the combination of the face and body (e.g., neck, shoulders). It has also been examined in (Kumar et al. 2009), where the human performance drops from 99.20% (tested using the original LFW images) to 97.53% (tested using the cropped LFW images). Hence, the experiments comparing human and computer performance may not show human face verification skill at their best, because humans were asked to match the cropped faces of people previously unfamiliar to them. To the contrary, those experiments can fully show the performance of computer-based face verification algorithms. First, the algorithms can exploit information from enough training images with variations in all viewing parameters to improve face verification performance, which is similar to information humans acquire in developing face verification skills and in becoming familiar with individuals. Second, the algorithms might exploit useful, but subtle, image-based detailed information that gives them a slight, but consistent, advantage over humans.

Therefore, surpassing the human-level performance may only be symbolically significant. In reality, a lot of challenges still lie ahead. To compete successfully with humans, more factors, such as the robustness to familiar faces and the usage of non-face information, need to be considered in developing future face verification algorithms.

Figure 5: The two rows present examples of matched and mismatched pairs respectively from LFW that were incorrectly classified by the GaussianFace model.

Conclusion and Future Work

This paper presents a principled Multi-Task Learning approach based on Discriminative Gaussian Process Latent Variable Model, named GaussianFace, for face verification by including a computationally more efficient equivalent form of KFDA and the multi-task learning constraint to the DGPLVM model. We use Gaussian Processes approximation and anchor graphs to speed up the inference and prediction of our model. Based on the GaussianFace model, we propose two different approaches for face verification. Extensive experiments on challenging datasets validate the efficacy of our model. The GaussianFace model finally surpassed human-level face verification accuracy, thanks to exploiting additional data from multiple source-domains to improve the generalization performance of face verification in the target-domain and adapting automatically to complex face variations.

Although several techniques such as the Laplace approximation and anchor graph are introduced to speed up the process of inference and prediction in our GaussianFace model, it still takes a long time to train our model for the high performance. In addition, large memory is also necessary. Therefore, for specific application, one needs to balance the three dimensions: memory, running time, and performance. Generally speaking, higher performance requires more memory and more running time. In the future, the issue of running time can be further addressed by the distributed parallel algorithm or the GPU implementation of large matrix inversion. To address the issue of memory, some online algorithms for training need to be developed. Another more intuitive method is to seek a more efficient sparse representation for the large covariance matrix.


Latent Variable

- The core component in GaussianFace is dimensionality reduction.
- Model data, y, with a vector-valued function that maps from x to y,

  yi = f(xi) + ε

- Treat the unknown latent dimension x as a nuisance parameter and place a prior, p(x), over it to integrate it out.
- In GaussianFace the latent variable model uses a Gaussian process to handle f (a rough sketch follows below).
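As a rough illustration of this kind of latent variable model, here is a minimal sketch using the GPy library mentioned later in the talk (https://github.com/SheffieldML/GPy). The data matrix Y and the two dimensional latent space are placeholders, and the exact constructor arguments may differ between GPy versions.

    # Sketch only: a Bayesian GP-LVM, where a GP maps a learned low
    # dimensional latent x to the observed y and the latent positions
    # are (approximately) integrated out variationally.
    import numpy as np
    import GPy

    Y = np.random.randn(100, 12)            # placeholder data: 100 points, 12 dimensions

    model = GPy.models.BayesianGPLVM(Y, input_dim=2, num_inducing=20,
                                     kernel=GPy.kern.RBF(2, ARD=True))
    model.optimize(messages=False)

    latent_means = model.X.mean             # posterior means of the latent x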

Difficulty for Probabilistic Approaches

- Propagate a probability distribution through a non-linear mapping.
- Normalisation of the distribution becomes intractable.

[Figure: A three dimensional manifold formed by mapping yj = fj(x) from a two dimensional space, (x1, x2), to a three dimensional space.]

Difficulty for Probabilistic Approaches

[Figure: A string in two dimensions, formed by mapping from a one dimensional space, x, to a two dimensional space, [y1, y2], using non-linear functions f1(·) and f2(·).]

Difficulty for Probabilistic Approaches

[Figure: A Gaussian distribution p(x) propagated through a non-linear mapping, yi = f(xi) + εi, with ε ∼ N(0, 0.2²) and f(·) built from an RBF basis with 100 centres between -4 and 4 and lengthscale ℓ = 0.1. The new distribution over y (right) is multimodal and difficult to normalize.]
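To make the picture concrete, here is a small numerical sketch of the same idea (my own illustration, not the code behind the figure): a Gaussian density over x pushed through a random RBF-basis function gives a distribution over y that is clearly non-Gaussian and often multimodal.

    # Sketch: propagate samples of x ~ N(0, 1) through a random RBF-basis
    # function f(x) = sum_k w_k exp(-(x - c_k)^2 / (2 l^2)), then add noise.
    import numpy as np

    rng = np.random.default_rng(1)
    centres = np.linspace(-4, 4, 100)             # 100 RBF centres between -4 and 4
    lengthscale = 0.1
    weights = rng.standard_normal(100)

    def f(x):
        phi = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / lengthscale**2)
        return phi @ weights

    x = rng.standard_normal(10000)                # p(x): a Gaussian
    y = f(x) + 0.2 * rng.standard_normal(10000)   # y = f(x) + eps, eps ~ N(0, 0.2^2)

    hist, edges = np.histogram(y, bins=50, density=True)  # p(y): typically multimodal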

Example: Latent Doodle Space

(Baxter and Anjyo, 2006)

http://vimeo.com/3235882

Generalization with much less Data than Dimensions

- The powerful uncertainty handling of GPs leads to surprising properties.
- Non-linear models can be used where there are fewer data points than dimensions without overfitting.

Spiral of Progression

Massive Missing Data

- If missing at random it can be marginalized.
- As data sets become very large (39 million in EMIS) data becomes extremely sparse.
- Imputation becomes impractical.

Imputation

- Expectation Maximization (EM) is the gold standard imputation algorithm.
- Exact EM optimizes the log likelihood.
- Approximate EM optimizes a lower bound on the log likelihood,
  - e.g. variational approximations (VIBES, Infer.net).
- Convergence is guaranteed to a local maximum of the log likelihood.

Expectation Maximization

Require: An initial guess for missing data
repeat
    Update model parameters (M-step)
    Update guess of missing data (E-step)
until convergence

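For concreteness, a minimal sketch of this loop for a multivariate Gaussian model with data missing at random (my own illustration of the generic recipe above, not code from the talk; the function and variable names are hypothetical):

    import numpy as np

    def em_gaussian_impute(Y, n_iter=50):
        """Y: (n, p) array with np.nan marking missing entries."""
        Y = Y.copy()
        missing = np.isnan(Y)
        # Initial guess for the missing data: column means of the observed entries
        col_means = np.nanmean(Y, axis=0)
        Y[missing] = np.take(col_means, np.where(missing)[1])
        for _ in range(n_iter):
            # M-step: update model parameters (mean and covariance)
            mu = Y.mean(axis=0)
            Sigma = np.cov(Y, rowvar=False) + 1e-6 * np.eye(Y.shape[1])
            # E-step: update the guess of the missing data with its conditional mean
            for i in range(Y.shape[0]):
                m = missing[i]
                if not m.any():
                    continue
                o = ~m
                S_oo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
                Y[i, m] = mu[m] + Sigma[np.ix_(m, o)] @ S_oo_inv @ (Y[i, o] - mu[o])
        return Y, mu, Sigma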

Imputation is Impractical

- In very sparse data imputation is impractical.
- EMIS: 39 million patients, thousands of tests.
- For most people, most tests are missing.
- M-step becomes confused by poor imputation.

Direct Marginalization is the Answer

- Perhaps we need joint distribution of two test outcomes,

  p(y1, y2)

- Obtained through marginalizing over all missing data,

  p(y1, y2) = ∫ p(y1, y2, y3, . . . , yp) dy3 . . . dyp

- Where y3, . . . , yp contains:
  1. all tests not applied to this patient
  2. all tests not yet invented!!

Magical Marginalization in Gaussians

Multi-variate Gaussians

- Given a 10 dimensional multivariate Gaussian, y ∼ N(0, C).
- Generate a single correlated sample y = [y1, y2, . . . , y10].
- How do we find the marginal distribution of y1, y2?

Gaussian Marginalization Property

[Figure: A sample from a 10 dimensional correlated Gaussian distribution. (a) The 10 dimensional sample, yi plotted against index i. (b) Colormap showing the covariance between dimensions.]

Gaussian Marginalization Property

[Figure: (a) The same 10 dimensional sample. (b) The covariance between y1 and y2:]

  4.1     3.1111
  3.1111  2.5198

Gaussian Marginalization Property

[Figure: (a) The same 10 dimensional sample. (b) The correlation between y1 and y2:]

  1        0.96793
  0.96793  1
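The point of these slides is that no integration is ever performed: the marginal of any subset of a Gaussian is read straight off the mean vector and covariance matrix. A small sketch (my own, with an arbitrary RBF-style covariance rather than the one used for the figures):

    import numpy as np

    rng = np.random.default_rng(0)

    # Build a 10x10 covariance from an exponentiated quadratic (RBF) kernel
    t = np.linspace(0, 9, 10)[:, None]
    C = 4.0 * np.exp(-0.5 * (t - t.T) ** 2 / 2.0 ** 2)

    # A single correlated sample y = [y1, ..., y10]
    y = rng.multivariate_normal(np.zeros(10), C)

    # Marginal of (y1, y2): no integral needed, just read off the 2x2 block
    C_12 = C[:2, :2]
    corr = C_12[0, 1] / np.sqrt(C_12[0, 0] * C_12[1, 1])
    print(C_12, corr)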

Avoid Imputation: Marginalize Directly

- Our approach: Avoid Imputation, Marginalize Directly (sketched below).
- Explored in context of Collaborative Filtering.
- Similar challenges:
  - many users (patients),
  - many items (tests),
  - sparse data.
- Implicitly marginalizes over all future tests too.

Work with Raquel Urtasun (Lawrence and Urtasun, 2009) and ongoing work with Max Zwießele and Nicolo Fusi.
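A rough sketch of what "marginalize directly" means in practice (my own illustration under a simple Gaussian model, not the collaborative filtering algorithm of Lawrence and Urtasun (2009)): the likelihood of each row is evaluated only on the dimensions that were actually observed, using the marginalization property above.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_lik_observed(Y, C):
        """Y: (n, p) with np.nan for missing entries; C: (p, p) model covariance."""
        total = 0.0
        for row in Y:
            o = ~np.isnan(row)                   # which tests were observed
            total += multivariate_normal.logpdf(
                row[o], mean=np.zeros(int(o.sum())), cov=C[np.ix_(o, o)])
        return total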

Marginalization in Bipartite Undirected Graph

[Diagram: a bipartite undirected graph with latent variables x1, . . . , x5 connected to observed variables y1, . . . , y10; an additional layer of latent variables is then added above the first.]

For massive missing data, how many additional latent variables?

Methods that Interrelate Covariates

- Need a class of models that interrelates data, but allows for variable p.
- Common assumption: high dimensional data lies on a low dimensional manifold.
- Want to retain the marginalization property of Gaussians but deal with non-Gaussian data!

Gaussian Processes: Extremely Short Overview

[Figure: Gaussian process samples plotted over inputs between 0 and 10, shown across several slides (plots omitted).]

Gaussian Processes and Kernels

- The kernel, K, acts as the covariance of the Gaussian.
- Predictions in Gaussian processes consist of the posterior mean function and covariance (see the sketch below).
- For a non-degenerate covariance matrix, K, the model is non-parametric.
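A minimal sketch of these ingredients with the GPy software referenced below (version details vary; the data here are synthetic placeholders):

    import numpy as np
    import GPy

    X = np.random.uniform(0.0, 10.0, (20, 1))
    Y = np.sin(X) + 0.1 * np.random.randn(20, 1)

    kernel = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=1.0)  # the kernel K
    model = GPy.models.GPRegression(X, Y, kernel)
    model.optimize()                         # fit hyperparameters by maximum likelihood

    X_test = np.linspace(0.0, 10.0, 100)[:, None]
    mean, var = model.predict(X_test)        # posterior mean and variance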

Gaussian Process Summer School

http://gpss.cc
14th-17th September 2015

Sheffield, UK

What’s the Algorithm?

- For some algorithmic details:
  1. Attend a Gaussian process summer school (http://gpss.cc)
  2. Come to my talk at the Large Scale Kernel Machines Workshop (11:15 am, Pasteur Room)
- We make our software available on GitHub: https://github.com/SheffieldML/GPy

Original Motivation for Deep

- Assuming a low dimensional embedding is one way of handling high dimensional data.
- Another is to assume conditional independencies (sparse graph structure).
- This inspires a layered hierarchy.

Hierarchical GP-LVM

(Lawrence and Moore, 2007)

Structure Learning

- Then we used MAP inference for learning.
- This prevents learning of structures.
- Our ‘modern’ approaches use variational approximations.
- Allows for learning the structure of the model (number of latent variables in each layer).
- Allows for learning the right depth of the model (number of layers).

Mathematically

- Composite multivariate function

  g(x) = f5(f4(f3(f2(f1(x)))))

Why Deep?

- Gaussian processes give priors over functions.
- Elegant properties:
  - e.g. derivatives of the process are also Gaussian distributed (if they exist).
  - For particular covariance functions they are ‘universal approximators’, i.e. all functions can have support under the prior.
- Gaussian derivatives might ring alarm bells.
  - E.g. a priori they don’t believe in function ‘jumps’.

Process Composition

- From a process perspective: process composition.
- A (new?) way of constructing more complex processes based on simpler components (illustrated below).

Note: To retain Kolmogorov consistency introduce IBP priors over latent variables in each layer (Zhenwen Dai).
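An illustrative sketch of process composition (my own, not the talk's code): draw a function from a GP prior at each layer and feed the output of one layer in as the input of the next.

    import numpy as np

    def rbf_cov(X, variance=1.0, lengthscale=1.0):
        d2 = (X[:, None, :] - X[None, :, :]) ** 2
        return variance * np.exp(-0.5 * d2.sum(-1) / lengthscale ** 2)

    def sample_composite(X, n_layers=5, jitter=1e-8, seed=0):
        """Draw g(x) = f5(f4(f3(f2(f1(x))))) with each layer sampled from a GP prior."""
        rng = np.random.default_rng(seed)
        H = X
        for _ in range(n_layers):
            K = rbf_cov(H) + jitter * np.eye(len(H))
            # one GP draw per layer; its output becomes the next layer's input
            H = rng.multivariate_normal(np.zeros(len(H)), K)[:, None]
        return H.ravel()

    X = np.linspace(-2.0, 2.0, 200)[:, None]
    g = sample_composite(X, n_layers=5)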

Analysis of Deep GPs

- Duvenaud et al. (2014) show that the derivative distribution of the process becomes more heavy-tailed as the number of layers increases.
- Gal and Ghahramani (2015) show that dropout is a variational approximation to a deep Gaussian process.

Outline

Background

Massively Missing Data

Deep Gaussian Processes

Samples and Results

Structures for Extracting Information from Data

[Diagram (Damianou and Lawrence, 2013): a deep hierarchy with the data space y at the bottom and latent layers x1 (Latent layer 1) up to x4 (Latent layer 4) above it.]

Deep Gaussian Processes

Andreas C. Damianou, Neil D. Lawrence
Dept. of Computer Science & Sheffield Institute for Translational Neuroscience, University of Sheffield, UK

Abstract

In this paper we introduce deep Gaussian process (GP) models. Deep GPs are a deep belief network based on Gaussian process mappings. The data is modeled as the output of a multivariate GP. The inputs to that Gaussian process are then governed by another GP. A single layer model is equivalent to a standard GP or the GP latent variable model (GP-LVM). We perform inference in the model by approximate variational marginalization. This results in a strict lower bound on the marginal likelihood of the model which we use for model selection (number of layers and nodes per layer). Deep belief networks are typically applied to relatively large data sets using stochastic gradient descent for optimization. Our fully Bayesian treatment allows for the application of deep models even when data is scarce. Model selection by our variational bound shows that a five layer hierarchy is justified even when modelling a digit data set containing only 150 examples.

1 Introduction

Probabilistic modelling with neural network architectures constitute a well studied area of machine learning. The recent advances in the domain of deep learning [Hinton and Osindero, 2006, Bengio et al., 2012] have brought this kind of models again in popularity. Empirically, deep models seem to have structural advantages that can improve the quality of learning in complicated data sets associated with abstract information [Bengio, 2009]. Most deep algorithms require a large amount of data to perform learning, however, we know that humans are able to perform inductive reasoning (equivalent to concept generalization) with only a few examples [Tenenbaum et al., 2006]. This provokes


the question as to whether deep structures and the learning of abstract structure can be undertaken in smaller data sets. For smaller data sets, questions of generalization arise: to demonstrate such structures are justified it is useful to have an objective measure of the model’s applicability.

The traditional approach to deep learning is based around binary latent variables and the restricted Boltzmann machine (RBM) [Hinton, 2010]. Deep hierarchies are constructed by stacking these models and various approximate inference techniques (such as contrastive divergence) are used for estimating model parameters. A significant amount of work has then to be done with annealed importance sampling if even the likelihood¹ of a data set under the RBM model is to be estimated [Salakhutdinov and Murray, 2008]. When deeper hierarchies are considered, the estimate is only of a lower bound on the data likelihood. Fitting such models to smaller data sets and using Bayesian approaches to deal with the complexity seems completely futile when faced with these intractabilities.

The emergence of the Boltzmann machine (BM) at the core of one of the most interesting approaches to modern machine learning is very much a case of the field going back to the future: BMs rose to prominence in the early 1980s, but the practical implications associated with their training led to their neglect until families of algorithms were developed for the RBM model with its reintroduction as a product of experts in the late nineties [Hinton, 1999].

The computational intractabilities of Boltzmann machines led to other families of methods, in particular kernel methods such as the support vector machine (SVM), to be considered for the domain of data classification. Almost contemporaneously to the SVM, Gaussian process (GP) models [Rasmussen and Williams, 2006] were introduced as a fully probabilistic substitute for the multilayer perceptron (MLP), inspired by the observation [Neal, 1996] that, under certain conditions, a GP is an MLP with infinite units in the hidden layer. MLPs also relate to deep learning models: deep learning algorithms have been used to pretrain autoencoders for dimensionality reduction [Hinton and Salakhut-

¹We use emphasis to clarify we are referring to the model likelihood, not the marginal likelihood required in Bayesian model selection.

Deep Models

[Diagram: a deep model with observed data y1, . . . , y8 in the data space, six latent variables in each of Latent layer 1 and Latent layer 2, and four latent variables in each of Latent layer 3 and Latent layer 4.]

Deep Models

[Diagram: the same hierarchy drawn with vector-valued nodes: the data space y with latent layers x1 (Latent layer 1) through x4 (Latent layer 4).]

Deep Models

[Diagram: the hierarchy read as feature abstraction: data space y, low level features (x1), combination of low level features (x2), more combination (x3), abstract features (x4).]

Deep Gaussian Processes

Damianou and Lawrence (2013)

- Deep architectures allow abstraction of features (Bengio, 2009; Hinton and Osindero, 2006; Salakhutdinov and Murray, 2008).
- We use a variational approach to stack GP models.

Motion Capture

- ‘High five’ data.
- Model learns structure between two interacting subjects.

Deep hierarchies – motion capture


Digits Data Set

- Are deep hierarchies justified for small data sets?
- We can lower bound the evidence for different depths.
- For 150 6s, 0s and 1s from MNIST we found at least 5 layers are required.

Deep hierarchies – MNIST


Deep Health

[Diagram: a ‘Deep Health’ model. Observations such as survival analysis, gene expression, clinical measurements and treatment, clinical notes, social network and music data, X-ray and biopsy connect through layers of latent variables to a latent representation of disease stratification, with genotype (G), environment (E) and epigenotype (EG) entering the model.]

References I

W. V. Baxter and K.-I. Anjyo. Latent doodle space. In EUROGRAPHICS, volume 25, pages 477–485, Vienna, Austria, September 4-8 2006.

Y. Bengio. Learning Deep Architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, Jan. 2009. ISSN 1935-8237. [DOI].

A. Damianou and N. D. Lawrence. Deep Gaussian processes. In C. Carvalho and P. Ravikumar, editors, Proceedings of the Sixteenth International Workshop on Artificial Intelligence and Statistics, volume 31, AZ, USA, 2013. JMLR W&CP 31. [PDF].

D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani. Avoiding pathologies in very deep networks. In S. Kaski and J. Corander, editors, Proceedings of the Seventeenth International Workshop on Artificial Intelligence and Statistics, volume 33, Iceland, 2014. JMLR W&CP 33.

Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142, 2015.

G. E. Hinton and S. Osindero. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

References II

N. D. Lawrence and A. J. Moore. Hierarchical Gaussian process latent variable models. In Z. Ghahramani, editor, Proceedings of the International Conference in Machine Learning, volume 24, pages 481–488. Omnipress, 2007. [Google Books]. [PDF].

N. D. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian processes. In L. Bottou and M. Littman, editors, Proceedings of the International Conference in Machine Learning, volume 26, San Francisco, CA, 2009. Morgan Kaufmann. [PDF].

C. Lu and X. Tang. Surpassing human-level face verification performance on LFW with GaussianFace. Technical report, 2014.

R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In S. Roweis and A. McCallum, editors, Proceedings of the International Conference in Machine Learning, volume 25, pages 872–879. Omnipress, 2008.