
Chained Gaussian Processes

Alan D. Saul, Department of Computer Science, University of Sheffield
James Hensman, CHICAS, Faculty of Health and Medicine, Lancaster University
Aki Vehtari, Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University
Neil D. Lawrence, Department of Computer Science, University of Sheffield

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.

Abstract

Gaussian process models are flexible, Bayesian non-parametric approaches to regression. Properties of multivariate Gaussians mean that they can be combined linearly in the manner of additive models and via a link function (like in generalized linear models) to handle non-Gaussian data. However, the link function formalism is restrictive: link functions are always invertible and must convert a parameter of interest to a linear combination of the underlying processes. There are many likelihoods and models where a non-linear combination is more appropriate. We term these more general models Chained Gaussian Processes: the transformation of the GPs to the likelihood parameters will not generally be invertible, and that implies that linearisation would only be possible with multiple (localized) links, i.e. a chain. We develop an approximate inference procedure for Chained GPs that is scalable and applicable to any factorized likelihood. We demonstrate the approximation on a range of likelihood functions.

1 Introduction

Gaussian process models are flexible distributions that can provide priors over non-linear functions. They rely on properties of the multivariate Gaussian for their tractability and their non-parametric nature. In particular, the sum of two functions drawn from a Gaussian process is also given by a Gaussian process. Mathematically, if $f \sim \mathcal{N}(\mu_f, k_f)$ and $g \sim \mathcal{N}(\mu_g, k_g)$ and we define $y = g + f$, then properties of the multivariate Gaussian give us that $y \sim \mathcal{N}(\mu_f + \mu_g, k_f + k_g)$, where $\mu_f$ and $\mu_g$ are deterministic functions of a single input, $k_f$ and $k_g$ are deterministic, positive semi-definite functions of two inputs, and $y$, $g$ and $f$ are stochastic processes.

This elementary property of the Gaussian process is the foundation of much of its power. It makes additive models trivial, and means we can easily combine any process with Gaussian noise. Naturally, it can be applied recursively, and covariance functions can be designed to reflect the underlying structure of the problem at hand (e.g. Hensman et al. [2013b] uses additive structure to account for variation in replicate behavior in gene expression).

In practice observations are often non-Gaussian. In response, statistics has developed the field of generalized linear models [Nelder and Wedderburn, 1972]. In a generalized linear model a link function is used to connect the mean function of the Gaussian process with the mean function of another distribution of interest. For example, the log link can be used to relate the rate in a Poisson distribution to our GP, $\log \lambda = f + g$. Or for classification the logistic distribution can be used to represent the mean probability of a positive outcome, $\log \frac{p}{1-p} = f + g$.

Writing models in terms of the link function captures the linear nature of the underlying model, but it is somewhat against the probabilistic approach to modeling, where we consider the generative model of our data. While there is nothing wrong with this relationship mathematically, when we consider the generative model we never apply the link function directly; we consider the inverse link, or transformation, function. For the log link this turns into the exponential, $\lambda = \exp(f + g)$. Writing the model in this form emphasizes the importance that the transformation function has on the generative model (see e.g. work on warped GPs [Snelson et al., 2004]). The log link implies a multiplicative combination of $f$ and $g$, $\lambda = \exp(f + g) = \exp(f)\exp(g)$, but in some cases we might wish to consider an additive model, $\lambda = \exp(f) + \exp(g)$. Such a model no longer falls within the class of generalized linear models as there is no link function that renders the two underlying processes additive Gaussian. In this paper we address this issue and use variational approximations to develop a framework for non-linear combinations of the latent processes. Because these models cannot be written in the form of a single link function we call this approach “Chained Gaussian Processes”.
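As a small illustration of the difference between the two generative models, the following sketch (ours, not from the paper; the kernel, lengthscale and grid are arbitrary choices) draws two latent functions from GP priors and forms a Poisson rate both ways: through the log link, $\lambda = \exp(f+g)$, and through the chained, additive combination, $\lambda = \exp(f) + \exp(g)$.

```python
# Illustrative sketch: the same two GP draws combined multiplicatively (log link)
# and additively (chained); kernel choice and parameters are arbitrary.
import numpy as np

def rbf(X, Z, variance=1.0, lengthscale=0.2):
    """Exponentiated quadratic covariance between two sets of 1-D inputs."""
    d = X[:, None] - Z[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
K = rbf(x, x) + 1e-8 * np.eye(x.size)              # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(x.size), K)   # f ~ GP(0, k)
g = rng.multivariate_normal(np.zeros(x.size), K)   # g ~ GP(0, k)

lam_glm = np.exp(f + g)                 # multiplicative combination implied by the log link
lam_chained = np.exp(f) + np.exp(g)     # additive combination: no single link function exists
y_glm, y_chained = rng.poisson(lam_glm), rng.poisson(lam_chained)
```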

2 Background

Assume we have access to a training dataset of $n$ input-output observations $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $y_i$ is assumed to be a noisy realisation of an underlying latent function $f = f(\mathbf{x})$, i.e. $y_i = f(\mathbf{x}_i) + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(\mu, \sigma^2)$ for a Gaussian likelihood, $\mathbf{x}_i \in \mathbb{R}^q$ and $y_i \in \mathbb{R}$. Normally the mean of the likelihood is assumed to be input dependent and given a GP prior, $\mu = f_i = f(\mathbf{x}_i)$ where $f(\mathbf{x}) \sim \mathcal{GP}(\mu_f, k_f(\mathbf{x}, \mathbf{x}'))$, and $\sigma$ is fixed at an optimal point. In this case the integrals required to infer a posterior, $p(\mathbf{f}|\mathbf{y})$, are tractable.

One extension of this model is the heteroscedastic GP regression model [Goldberg et al., 1998, Lazaro-Gredilla and Titsias, 2011], where the noise variance is dependent on the input. The noise variance can be assigned a log GP prior, $y_i \sim \mathcal{N}\big(f(\mathbf{x}_i), e^{g(\mathbf{x}_i)}\big)$, where $g(\mathbf{x}) \sim \mathcal{GP}(\mu_g, k_g(\mathbf{x}, \mathbf{x}'))$, i.e. a log link function is used. Unfortunately this generalization of the original Gaussian process model is not analytically tractable and requires an approximation to be made. Suggested approximations include MCMC [Goldberg et al., 1998], variational inference [Lazaro-Gredilla and Titsias, 2011], the Laplace approximation [Vanhatalo et al., 2013] and expectation propagation (EP) [Hernandez-Lobato et al., 2014].

Another generalization of the standard GP is to vary the scale of the process as a function of the inputs. Adams and Stegle [2008] suggest a log GP prior for the scale of the process, giving rise to non-parametric non-stationarity in the model. Turner and Sahani [2011] took a related approach to develop probabilistic amplitude demodulation; here the amplitude (or scale) of the process was given by a Gaussian process with a link function given by $\sigma = \log(\exp(f) - 1)$. Finally, Tolvanen et al. [2014] assign both the noise variance and the scale a log GP prior.

Both of these variations on Gaussian process regression combine processes in a non-linear way within a Gaussian likelihood, but the idea may be further generalized to systems that deal with non-Gaussian observation noise.

In this paper we describe a general approach to combining processes in a non-linear way. We assume that the likelihood factorizes across the data, but is a general non-linear function of $b$ input dependent latent functions. Our main focus will be examples of likelihoods with $b = 2$ latent functions, $f(\mathbf{x}_i)$ and $g(\mathbf{x}_i)$, such that the likelihood is $p(y_i|f(\mathbf{x}_i), g(\mathbf{x}_i))$, though the ideas can all be generalized to $b > 2$. Previous work in this domain includes the use of the Laplace approximation [Vanhatalo et al., 2013]; however, this method scales poorly, $\mathcal{O}(b^3 n^3)$, and so is not applicable to datasets of moderate size.

To render the model tractable we extend recent advances in large scale variational inference approaches to GPs [Hensman et al., 2013a]. With non-Gaussian likelihoods restrictions on the latent function values may differ, and a non-linear transformation of the latent function may be required. The inference approach builds on work by Hensman et al. [2015], which in turn builds on the variational inference method proposed by Opper and Archambeau [2009].

In other work, mixtures of Gaussian latent functions have also been applied for non-Gaussian likelihoods [Nguyen and Bonilla, 2014]; we expect such mixture distributions would also be applicable to our case. More recently this approach was extended to provide scalability [Dezfouli and Bonilla, 2015] using sparse methods similar to those used in this work.

3 Chained Gaussian Processes

Our approach to approximate inference in chained GPs builds on previous work in inducing point methods for sparse approximations of GPs [Snelson and Ghahramani, 2006, Titsias, 2009, Hensman et al., 2015, 2013a]. Inducing point methods introduce $m$ 'pseudo inputs', known as inducing inputs, at locations $Z = \{\mathbf{z}_i\}_{i=1}^{m}$. The corresponding function values are given by $u_i = f(\mathbf{z}_i)$. These inducing points do not affect the marginal of $\mathbf{f}$ because
$$p(\mathbf{f}|X, Z) = \int p(\mathbf{f}|\mathbf{u}, X)\, p(\mathbf{u}|Z)\, d\mathbf{u},$$
where $p(\mathbf{u}|Z) = \mathcal{N}(\mathbf{u}|\mathbf{0}, K_{uu})$ and $p(\mathbf{f}|\mathbf{u}, X) = \mathcal{N}\big(\mathbf{f}\,|\,K_{fu}K_{uu}^{-1}\mathbf{u},\; K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}\big)$. The cross-covariances, given by $K_{fu} = k_f(X, Z)$ where $X$ is the locations of $\mathbf{f}$, define the relationship between the inducing variables and the latent function of interest, $\mathbf{f}$. The marginal likelihood is $p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|\mathbf{u})\, p(\mathbf{u})\, d\mathbf{f}\, d\mathbf{u}$.

To avoid $\mathcal{O}(n^3)$ computational complexity, Titsias [2009] invokes Jensen's inequality to obtain a lower bound on the marginal likelihood $\log p(\mathbf{y})$, an approach known as variational compression. This approximation also forms the basis of our approach for non-Gaussian models.
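The following sketch (illustrative only; the kernel and input locations are arbitrary) spells out the inducing point conditional used above and checks numerically that marginalising $\mathbf{u}$ recovers the prior covariance $K_{ff}$, i.e. that the inducing points do not affect the marginal of $\mathbf{f}$.

```python
# Sketch of p(f | u, X) = N(K_fu K_uu^{-1} u, K_ff - K_fu K_uu^{-1} K_uf) and a check
# that marginalising u recovers the prior covariance K_ff.
import numpy as np

def rbf(X, Z, variance=1.0, lengthscale=0.2):
    d = X[:, None] - Z[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 50))     # training input locations
Z = np.linspace(0, 1, 10)              # m inducing input locations

Kuu = rbf(Z, Z) + 1e-8 * np.eye(Z.size)
Kfu = rbf(X, Z)
Kff = rbf(X, X)

A = Kfu @ np.linalg.inv(Kuu)                         # K_fu K_uu^{-1}
u = rng.multivariate_normal(np.zeros(Z.size), Kuu)   # a draw u ~ p(u | Z)
mean_f_given_u = A @ u
cov_f_given_u = Kff - A @ Kfu.T

# E_u[cov(f|u)] + cov_u(E[f|u]) = (K_ff - A K_uf) + A K_uu A^T = K_ff
assert np.allclose(cov_f_given_u + A @ Kuu @ A.T, Kff, atol=1e-6)
```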

3.1 Variational Bound

For non-Gaussian likelihoods, even with a single latent process the marginal likelihood, $p(\mathbf{y})$, is not tractable, but it can be lower bounded variationally. We assume that the latent functions, $\mathbf{f} = f(\mathbf{x})$ and $\mathbf{g} = g(\mathbf{x})$, are a priori independent,
$$p(\mathbf{f}, \mathbf{g}|\mathbf{u}_f, \mathbf{u}_g) = p(\mathbf{f}|\mathbf{u}_f)\, p(\mathbf{g}|\mathbf{u}_g). \tag{1}$$

The derivation of the variational lower bound then follows a similar form to Hensman et al. [2015], with the extension to multiple latent functions. We begin by writing down our log marginal likelihood,
$$\log p(\mathbf{y}) = \log \int p(\mathbf{y}|\mathbf{f}, \mathbf{g})\, p(\mathbf{f}, \mathbf{g}|\mathbf{u}_f, \mathbf{u}_g)\, p(\mathbf{u}_f)\, p(\mathbf{u}_g)\, d\mathbf{f}\, d\mathbf{g}\, d\mathbf{u}_f\, d\mathbf{u}_g$$

then introduce a variational approximation to the posterior,
$$p(\mathbf{f}, \mathbf{g}, \mathbf{u}_g, \mathbf{u}_f|\mathbf{y}) \approx p(\mathbf{f}|\mathbf{u}_f)\, p(\mathbf{g}|\mathbf{u}_g)\, q(\mathbf{u}_f)\, q(\mathbf{u}_g), \tag{2}$$
where we have made the additional assumption that the latent functions factorize in the variational posterior.

Using Jensen's inequality and the factorization of the latent functions (1), a variational lower bound can then be obtained for the log marginal likelihood,
$$\begin{aligned}
\log p(\mathbf{y}) &= \log \int p(\mathbf{y}|\mathbf{f}, \mathbf{g})\, p(\mathbf{f}|\mathbf{u}_f)\, p(\mathbf{g}|\mathbf{u}_g)\, p(\mathbf{u}_f)\, p(\mathbf{u}_g)\, d\mathbf{f}\, d\mathbf{g}\, d\mathbf{u}_f\, d\mathbf{u}_g \\
&\geq \int q(\mathbf{f})\, q(\mathbf{g}) \log p(\mathbf{y}|\mathbf{f}, \mathbf{g})\, d\mathbf{f}\, d\mathbf{g} - \mathrm{KL}\big(q(\mathbf{u}_f)\,\|\,p(\mathbf{u}_f)\big) - \mathrm{KL}\big(q(\mathbf{u}_g)\,\|\,p(\mathbf{u}_g)\big),
\end{aligned} \tag{3}$$

where $q(\mathbf{f}) = \int p(\mathbf{f}|\mathbf{u}_f)\, q(\mathbf{u}_f)\, d\mathbf{u}_f$ and $q(\mathbf{g}) = \int p(\mathbf{g}|\mathbf{u}_g)\, q(\mathbf{u}_g)\, d\mathbf{u}_g$, and $\mathrm{KL}(p(a)\,\|\,p(b))$ denotes the KL divergence between the two distributions. For Gaussian process priors on the latent functions we recover
$$p(\mathbf{f}|\mathbf{u}_f) = \mathcal{N}\big(\mathbf{f}\,|\,K_{fu_f}K_{u_f u_f}^{-1}\mathbf{u}_f,\; K_{ff} - Q_{ff}\big),$$
$$p(\mathbf{g}|\mathbf{u}_g) = \mathcal{N}\big(\mathbf{g}\,|\,K_{gu_g}K_{u_g u_g}^{-1}\mathbf{u}_g,\; K_{gg} - Q_{gg}\big),$$
where $Q_{ff} = K_{fu_f}K_{u_f u_f}^{-1}K_{u_f f}$ and $Q_{gg} = K_{gu_g}K_{u_g u_g}^{-1}K_{u_g g}$. Note that the covariances for $\mathbf{f}$ and $\mathbf{g}$ can differ, though their inducing input locations, $Z$, are shared.

We take $q(\mathbf{u}_f)$ and $q(\mathbf{u}_g)$ to be Gaussian distributions with variational parameters, $q(\mathbf{u}_f) = \mathcal{N}(\mathbf{u}_f|\boldsymbol{\mu}_f, S_f)$ and $q(\mathbf{u}_g) = \mathcal{N}(\mathbf{u}_g|\boldsymbol{\mu}_g, S_g)$. Using the properties of multivariate Gaussians this results in tractable integrals for $q(\mathbf{f})$ and $q(\mathbf{g})$,
$$q(\mathbf{f}) = \mathcal{N}\big(\mathbf{f}\,|\,K_{fu_f}K_{u_f u_f}^{-1}\boldsymbol{\mu}_f,\; K_{ff} + \hat{Q}_{ff}\big), \tag{4}$$
$$q(\mathbf{g}) = \mathcal{N}\big(\mathbf{g}\,|\,K_{gu_g}K_{u_g u_g}^{-1}\boldsymbol{\mu}_g,\; K_{gg} + \hat{Q}_{gg}\big), \tag{5}$$
where $\hat{Q}_{ff} = K_{fu_f}K_{u_f u_f}^{-1}(S_f - K_{u_f u_f})K_{u_f u_f}^{-1}K_{u_f f}$ and $\hat{Q}_{gg} = K_{gu_g}K_{u_g u_g}^{-1}(S_g - K_{u_g u_g})K_{u_g u_g}^{-1}K_{u_g g}$.

The KL terms in (3) and their derivatives can be computed in closed form and are inexpensive, as they are divergences between Gaussians. However, an intractable integral, $\int q(\mathbf{f})\, q(\mathbf{g}) \log p(\mathbf{y}|\mathbf{f}, \mathbf{g})\, d\mathbf{f}\, d\mathbf{g}$, still remains. Since the likelihood factorizes,
$$p(\mathbf{y}|\mathbf{f}, \mathbf{g}) = \prod_{i=1}^{n} p(y_i|f_i, g_i),$$
the problematic integral in (3) also factorizes across data points, allowing us to use stochastic variational inference [Hensman et al., 2013a, Hoffman et al., 2013],
$$\int q(\mathbf{f})\, q(\mathbf{g}) \log p(\mathbf{y}|\mathbf{f}, \mathbf{g})\, d\mathbf{f}\, d\mathbf{g} = \int q(\mathbf{f})\, q(\mathbf{g}) \log \prod_{i=1}^{n} p(y_i|f_i, g_i)\, d\mathbf{f}\, d\mathbf{g} = \sum_{i=1}^{n} \int q(f_i)\, q(g_i) \log p(y_i|f_i, g_i)\, df_i\, dg_i. \tag{6}$$

We are then left with $n$ $b$-dimensional Gaussian integrals over the log-likelihood,
$$\log p(\mathbf{y}) \geq \sum_{i=1}^{n} \int q(f_i)\, q(g_i) \log p(y_i|f_i, g_i)\, df_i\, dg_i - \mathrm{KL}\big(q(\mathbf{u}_f)\,\|\,p(\mathbf{u}_f)\big) - \mathrm{KL}\big(q(\mathbf{u}_g)\,\|\,p(\mathbf{u}_g)\big). \tag{7}$$

The bound will also hold for any additional number of latent functions by assuming they all factorize in the variational posterior.

The bound decomposes into a sum over the data; as such the $n$ input points can be visited in mini-batches, and the gradients and log-likelihood of each mini-batch can be subsequently summed. This operation can also be parallelized [Gal et al., 2014]. Alternatively, a single mini-batch can be visited to obtain a stochastic gradient for use in a stochastic optimization [Hensman et al., 2013a, Hoffman et al., 2013]. This provides the ability to scale to huge datasets.
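A minimal sketch of the resulting stochastic estimate (our illustration; `expected_loglik` is a placeholder for whichever per-point expectation in Eq. (6) is being used): scaling a random mini-batch by $n/|B|$ gives an unbiased estimate of the data term in Eq. (7).

```python
# Sketch: unbiased mini-batch estimate of the data term in Eq. (7).
# expected_loglik(i) is a placeholder for E_{q(f_i) q(g_i)}[log p(y_i | f_i, g_i)].
import numpy as np

def stochastic_data_term(expected_loglik, n, batch_size, rng):
    """Unbiased estimate of sum_i E[log p(y_i | f_i, g_i)] from one random mini-batch."""
    batch = rng.choice(n, size=batch_size, replace=False)
    return (n / batch_size) * sum(expected_loglik(i) for i in batch)

# e.g. stochastic_data_term(expected_loglik, n=10_000, batch_size=100, rng=np.random.default_rng(0))
```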

If the likelihood is Gaussian these integrals are analytic [Lazaro-Gredilla and Titsias, 2011], though the noise variance must be constrained positive via a transformation of the latent function, e.g. an exponent. In this case,
$$\int q(f_i)\, q(g_i) \log p(y_i|f_i, g_i)\, df_i\, dg_i = \int \mathcal{N}(f_i|m_{fi}, v_{fi})\, \mathcal{N}(g_i|m_{gi}, v_{gi}) \log \mathcal{N}(y_i|f_i, e^{g_i})\, df_i\, dg_i = \log \mathcal{N}\Big(y_i\,\Big|\,m_{fi},\, e^{m_{gi} - \frac{v_{gi}}{2}}\Big) - \frac{v_{gi}}{4} - \frac{v_{fi}\, e^{-m_{gi} + \frac{v_{gi}}{2}}}{2},$$
where we define $\mathbf{m}_f = K_{fu_f}K_{u_f u_f}^{-1}\boldsymbol{\mu}_f$, $\mathbf{v}_f = K_{ff} + \hat{Q}_{ff}$, $\mathbf{m}_g = K_{gu_g}K_{u_g u_g}^{-1}\boldsymbol{\mu}_g$ and $\mathbf{v}_g = K_{gg} + \hat{Q}_{gg}$; $v_{fi}$ denotes the $i$th diagonal element of $\mathbf{v}_f$. It may be possible in this Gaussian case to find the optimal $q(\mathbf{f})$ such that the bound collapses to that of Lazaro-Gredilla and Titsias [2011], however this would not allow for stochastic optimization. Here we arrive at a sparse extension, where a Gaussian distribution is assumed for the posterior over $\mathbf{f}$, whereas previously $q(\mathbf{f})$ had been collapsed out and could take any form. This sparse extension provides the ability to scale to much larger datasets whilst maintaining a similar variational lower bound.
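The closed form above is easy to check numerically; the sketch below (with arbitrary made-up values for $y_i$, $m_{fi}$, $v_{fi}$, $m_{gi}$, $v_{gi}$) compares it against a simple Monte Carlo estimate of the same expectation.

```python
# Numerical check of the closed-form expectation E_{q(f)q(g)}[log N(y | f, e^g)].
import numpy as np

y, m_f, v_f, m_g, v_g = 0.3, 0.1, 0.5, -1.0, 0.2   # arbitrary values
rng = np.random.default_rng(2)

def log_gauss(y, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

closed_form = (log_gauss(y, m_f, np.exp(m_g - 0.5 * v_g))
               - 0.25 * v_g
               - 0.5 * v_f * np.exp(-m_g + 0.5 * v_g))

f = m_f + np.sqrt(v_f) * rng.standard_normal(2_000_000)   # f ~ q(f_i)
g = m_g + np.sqrt(v_g) * rng.standard_normal(2_000_000)   # g ~ q(g_i)
monte_carlo = log_gauss(y, f, np.exp(g)).mean()

print(closed_form, monte_carlo)   # the two values agree closely
```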


The model is, however, not restricted to heteroscedastic Gaussian likelihoods. If the integral (6) and its gradients can be computed in an unbiased way, any factorizing likelihood can be used. This can be seen as a chained Gaussian process: there is no single link function that allows the specification of this model under the modelling assumptions of a generalised linear model. An example that will be revisited in the experiments is the beta distribution, $y_i \sim \mathcal{B}(\alpha, \beta)$ where $\alpha, \beta \in \mathbb{R}^+$ and observations $y_i \in (0, 1)$, $\mathbf{x}_i \in \mathbb{R}^q$. Since $\alpha$ and $\beta$ must remain positive, they can be assigned log GP priors,
$$y_i \sim \mathcal{B}\big(\alpha = e^{f(\mathbf{x}_i)},\; \beta = e^{g(\mathbf{x}_i)}\big), \tag{8}$$
where $f(\mathbf{x}) \sim \mathcal{GP}(\mu_f, k_f(\mathbf{x}, \mathbf{x}'))$ and $g(\mathbf{x}) \sim \mathcal{GP}(\mu_g, k_g(\mathbf{x}, \mathbf{x}'))$. This allows the shape of the beta distribution to change over time; see the Supplementary Material and Section 4.2.2 for example plots.

Using the variational bound above, all that is required is to perform a series of $n$ two-dimensional quadratures, for both the log-likelihood and its gradients; this is a relatively simple task and computationally feasible for modest batch sizes. From this example the power and adaptability of the method should be apparent.
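As an illustration of one such quadrature (a sketch under our own choices of grid size and moments, not the authors' implementation), the function below estimates $\mathbb{E}_{q(f_i)q(g_i)}[\log p(y_i|f_i,g_i)]$ with a two-dimensional Gauss-Hermite product rule and applies it to the beta likelihood of Eq. (8).

```python
# Sketch of one of the n two-dimensional quadratures in Eq. (6), applied to the beta
# likelihood of Eq. (8); the degree and the example moments are illustrative.
import numpy as np
from scipy.stats import beta

def expected_loglik_2d(loglik, m_f, v_f, m_g, v_g, degree=20):
    """Gauss-Hermite estimate of E[loglik(f, g)] for f ~ N(m_f, v_f), g ~ N(m_g, v_g)."""
    nodes, weights = np.polynomial.hermite.hermgauss(degree)
    f = m_f + np.sqrt(2.0 * v_f) * nodes        # change of variables for each Gaussian
    g = m_g + np.sqrt(2.0 * v_g) * nodes
    W = np.outer(weights, weights) / np.pi      # 2-D product rule; weights normalise to 1
    F, G = np.meshgrid(f, g, indexing="ij")
    return np.sum(W * loglik(F, G))

y_i = 0.7
beta_loglik = lambda F, G: beta.logpdf(y_i, np.exp(F), np.exp(G))  # alpha = e^f, beta = e^g
print(expected_loglik_2d(beta_loglik, m_f=0.2, v_f=0.1, m_g=-0.3, v_g=0.2))
```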

A major strength of this method is that performing this integral is the only requirement to implement a new noise model, similar to [Nguyen and Bonilla, 2014, Hensman et al., 2015]. Further, since a stochastic optimizer is used, the gradients do not need to be exact. Our implementations can use off-the-shelf stochastic optimizers, such as Adagrad [Duchi et al., 2011] or RMSProp [Tieleman and Hinton, 2012]. Moreover, for many likelihoods some portion of these integrals is analytically tractable, reducing the variance introduced by numerical integration. See the supplementary material for an investigation.

3.2 Posterior and Predictive Distributions

Following from (2), it is clear that when the variational lower bound has been maximised with respect to the variational parameters, $p(\mathbf{u}_f|\mathbf{y}) \approx q(\mathbf{u}_f)$ and $p(\mathbf{u}_g|\mathbf{y}) \approx q(\mathbf{u}_g)$. The posterior for a test value $f_*$ under this approximation is
$$p(f_*|\mathbf{y}) = \int p(f_*|\mathbf{x}_*, \mathbf{f})\, p(\mathbf{f}|\mathbf{u}_f)\, p(\mathbf{u}_f|\mathbf{y})\, d\mathbf{f}\, d\mathbf{u}_f = \int p(f_*|\mathbf{u}_f)\, p(\mathbf{u}_f|\mathbf{y})\, d\mathbf{u}_f \approx \int p(f_*|\mathbf{u}_f)\, q(\mathbf{u}_f)\, d\mathbf{u}_f = q(f_*),$$
where $q(f_*)$ and $q(g_*)$ take a form similar to (4).

Finally, treating each prediction point independently, the predictive distribution for each test pair $\{(\mathbf{x}_i^*, y_i^*)\}_{i=1}^{n_*}$ follows as
$$p(y_i^*|\mathbf{y}, \mathbf{x}_i^*) = \int p(y_i^*|f_i^*, g_i^*)\, q(f_i^*)\, q(g_i^*)\, df_i^*\, dg_i^*.$$

This integral is analytically intractable in the general case, but again it can be computed using a series of two-dimensional quadratures or simple Monte Carlo sampling.
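A sketch of the Monte Carlo option follows (illustrative only; the moments of $q(f_i^*)$, $q(g_i^*)$ and the likelihood are placeholders): draw from the two Gaussian marginals and average the likelihood.

```python
# Sketch: predictive density p(y* | y, x*) by simple Monte Carlo over q(f*) and q(g*).
import numpy as np
from scipy.stats import beta

def predictive_density(y_star, m_f, v_f, m_g, v_g, likelihood_pdf, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    f = m_f + np.sqrt(v_f) * rng.standard_normal(n_samples)   # f* ~ q(f*)
    g = m_g + np.sqrt(v_g) * rng.standard_normal(n_samples)   # g* ~ q(g*)
    return likelihood_pdf(y_star, f, g).mean()

# Example with the beta likelihood of Eq. (8):
beta_pdf = lambda y, f, g: beta.pdf(y, np.exp(f), np.exp(g))
print(predictive_density(0.7, m_f=0.2, v_f=0.1, m_g=-0.3, v_g=0.2, likelihood_pdf=beta_pdf))
```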

4 Experiments

To evaluate the effectiveness of our chained GP approximations we consider a range of real and synthetic datasets. The performance measure used throughout is the negative log predictive density (NLPD) on held out data (Table 1).¹ The results for mean absolute error (MAE) (Supplementary Material) show comparable results between methods. 5-fold cross-validation is used throughout. The non-linear optimization of (hyper-)parameters is subject to local minima; as such, multiple runs were performed on each fold with a range of parameter initialisations, and the solution obtaining the highest log-likelihood on the training data of each fold was retained. Automatic relevance determination exponentiated quadratic kernels are used throughout, allowing one lengthscale per input dimension, in addition to a bias kernel.² In all experiments 100 inducing points were used and their locations were optimized with respect to the lower bound of the log marginal likelihood, following Titsias [2009].

4.1 Heteroscedastic Gaussian

In our introduction we related our ideas to heteroscedastic GPs. We will first use our approximation to show how the addition of input dependent noise to a Gaussian process regression model affects performance, compared with a sparse Gaussian process model [Titsias, 2009]. Performance is shown to improve as more data are provided, as would be expected, making it clear that both models can scale with data, though the new model is more flexible when handling the distribution's tails. A sparse Gaussian process with Gaussian likelihood is chosen as the baseline in these experiments, as a non-sparse Gaussian process cannot scale to the size of all the experiments.

The Elevators1000 experiment uses a subset of 1,000 points of the Elevators dataset. On this data the heteroscedastic model (chained GP) offers considerable improvement in terms of negative log predictive density (NLPD) over the sparse GP (Table 1). Our second experiment with the Gaussian likelihood, Elevators10000, examines scaling of the model; here a subset of 10,000 data points of the Elevators dataset is used.

¹ Data used in the experiments can be downloaded via the pods package: https://github.com/sods/ods

² Code is publicly available at: https://github.com/SheffieldML/ChainedGP


Data            G            CHG          Lt           Vt           CHt
elevators1000   0.39 ± 0.13  0.10 ± 0.01  NA           NA           NA
elevators10000  0.07 ± 0.01  0.03 ± 0.02  NA           NA           NA
motorCorrupt    2.04 ± 0.06  1.79 ± 0.05  1.73 ± 0.05  2.52 ± 0.09  1.70 ± 0.05
boston          0.27 ± 0.02  0.09 ± 0.01  0.23 ± 0.02  0.19 ± 0.02  0.09 ± 0.02

Table 1: NLPD results over 5 cross-validation folds with 10 replicates each. Models shown in comparison are sparse Gaussian (G), chained heteroscedastic Gaussian (CHG), Student-t Laplace approximation (Lt), Student-t VB approximation (Vt), and chained heteroscedastic Student-t (CHt).


Figure 1: (a) NLPD on the corrupt motorcycle dataset. (b) NLPD on the Boston housing dataset. For NLPD, lower is better. Models shown in comparison are sparse Gaussian (G), Student-t Laplace approximation (Lt), Student-t VB approximation (Vt), chained heteroscedastic Gaussian (CHG), and chained heteroscedastic Student-t (CHt). Boxplots show the variation over 5 folds.

Performance improves as expected. Previous heteroscedastic Gaussian process models cannot scale to this setting; the chained GP implements the heteroscedastic model and scales.

4.2 Non-Gaussian Heteroscedastic Likelihoods

One of the major strengths of the approximation, beyond pure scalability, is the ability to use more general non-Gaussian likelihoods. In this section we investigate this flexibility by performing inference with non-standard likelihoods. This allows models to be specified that correspond to the modeller's beliefs about the data in a flexible way.

We first investigate an extension of the Student-t likelihood that endows it with an input-dependent scale parameter. This is straightforward in the chained GP paradigm.

The corrupt motorcycle dataset is an artificial modification of the benchmark motorcycle dataset [Silverman, 1985] and shows the model's capabilities more clearly. The original motorcycle dataset has had 25 of its data points randomly corrupted with Gaussian noise of $\mathcal{N}(0, 3)$, simulating spurious accelerometer readings. We hope that our method will be robust and will ignore such outlying values. An input-dependent mean, $\mu$, is set alongside an input-dependent scale, $\sigma$, which must be positive. A constant degrees of freedom parameter, $\nu$, is initialized to 4.0 and then optimized to its MAP solution:
$$y_i \sim \mathcal{S}t\big(\mu = f(\mathbf{x}_i),\; \sigma^2 = e^{g(\mathbf{x}_i)},\; \nu\big), \tag{9}$$

where $f(\mathbf{x}) \sim \mathcal{GP}(\mu_f, k_f(\mathbf{x}, \mathbf{x}'))$ and $g(\mathbf{x}) \sim \mathcal{GP}(\mu_g, k_g(\mathbf{x}, \mathbf{x}'))$. This provides a heteroscedastic extension to the Student-t likelihood. We compare the model with a Gaussian process with a homogeneous Student-t likelihood, approximated variationally [Hensman et al., 2015] and with the Laplace approximation. Figure 2 shows the improved quality of the error bars with the chained heteroscedastic Student-t model. Learning a model with heavy tails allows outliers to be ignored, and so its input-dependent variance can collapse around just the points close to the underlying function, which in this case is known to be well modelled by a heteroscedastic Gaussian process [Lazaro-Gredilla and Titsias, 2011, Goldberg et al., 1998]. It is also interesting to note the heteroscedastic Gaussian's performance: although not able to completely ignore outliers, the model has learnt a very short lengthscale. This renders the prior over the scale parameter independent across the data, meaning that the resulting likelihood is more akin to a scale-mixture of Gaussians (which endows appropriate robustness characteristics). The main difference is that the scale-mixture is based on a log-Gaussian prior, as opposed to the Student-t, which is based on an inverse Gamma.
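In code, the only likelihood-specific piece this model needs is the per-point log density of Eq. (9), which can then be fed to a two-dimensional quadrature such as the sketch in Section 3.1 (our parameterization below assumes $\sigma^2 = e^g$, i.e. scale $e^{g/2}$).

```python
# Sketch: per-point log-likelihood of the heteroscedastic Student-t model, Eq. (9).
import numpy as np
from scipy.stats import t as student_t

def het_student_t_loglik(y_i, F, G, nu=4.0):
    """log St(y_i | mu=F, sigma^2=exp(G), nu), elementwise over quadrature grids F and G."""
    return student_t.logpdf(y_i, df=nu, loc=F, scale=np.exp(0.5 * G))
```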

Figure 1 shows the NLPD on the corrupt motorcycle dataset and the Boston housing dataset. The Boston housing dataset gives the median house prices throughout the Boston area, quantified by 506 data points with 13 explanatory input variables [Kuß, 2006]. We find that the chained heteroscedastic Gaussian process model already outperforms the Student-t model on this dataset, and the additional ability to use heavier tails in the chained Student-t is not needed. This ability to regress back to an already powerful model is a useful property of the chained Student-t model.

4.2.1 Survival Analysis

Survival analysis focuses on the analysis of time-to-event data. This data arises frequently in clinical trials, though it is also commonly found in failure tests within engineering. In these settings it is common to observe censoring.


[Figure 2 panels: Standard Gaussian Process; Heteroscedastic Gaussian; Heteroscedastic Student-t.]

Figure 2: Corrupted motorcycle dataset, fitted with a Gaussian process model with a Gaussian likelihood, a Gaussian process with input dependent noise (heteroscedastic) with a Gaussian likelihood, and a Gaussian process with a Student-t likelihood with an input dependent scale parameter. The mean is shown as a solid line and the variance as dotted lines.

[Figure 3: stacked panels over 05/02/15 to 05/08/15 showing tweet sentiment (with the exit poll marked) and the induced mean and variance.]

Figure 3: Twitter sentiment from the UK general election modelled using a heteroscedastic beta distribution. The timing of the exit poll is marked and is followed by a night of tweets as election counts come in. Other night-time periods have a reduced volume of tweets and a corresponding increase in sentiment variance. Ticks on the x-axis indicate midnight.

Censoring occurs when an event is only observed to exist between two times, but no further information is available. For right-censoring, the most common type of censoring, the event time $T \in [t, \infty)$.

A common model for analysing this type of data is the accelerated failure time model. This posits that the distribution of when a random event, $T$, may occur is multiplicatively affected by some function of the covariates, $f(\mathbf{x})$, thus accelerating or retarding time, akin to the notion of dog years. In a generalized linear model we may write this as $\log T = \log T_0 + \log f(\mathbf{x})$, where $T$ is the random variable for the failure time of the individual with covariates $\mathbf{x}$, and $T_0$ follows a parametric distribution describing a non-accelerated failure time.

To account for censoring the cumulative distribution needs to be computable, and event times are restricted to be positive. A common parametric distribution for $T_0$ that fulfills these restrictions is the log-logistic distribution, with the median being some function of the covariates, $f(\mathbf{x})$. This however is restrictive, as the shape of the failure time distribution is assumed to be the same for all patients. We relax this assumption by allowing the shape parameter of the log-logistic distribution to vary in response to the input,
$$y_i \sim \mathcal{LL}\big(\alpha = e^{f(\mathbf{x}_i)},\; \beta = e^{g(\mathbf{x}_i)}\big),$$
where $f(\mathbf{x}) \sim \mathcal{GP}(\mu_f, k_f(\mathbf{x}, \mathbf{x}'))$ and $g(\mathbf{x}) \sim \mathcal{GP}(\mu_g, k_g(\mathbf{x}, \mathbf{x}'))$. This allows both skewed unimodal and exponentially shaped distributions for the failure time, depending on the individual, as shown in Figure 4. Again there is no associated link function in this case, and the model can be treated as a chained-survival model.
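A sketch of the corresponding likelihood term (our assumed parameterization, with $\alpha = e^f$ as the scale/median and $\beta = e^g$ as the shape; the exact form used in Appendix A.5 may differ): observed events contribute the log density of the log-logistic, while right-censored events contribute the log survival function.

```python
# Sketch: log-logistic likelihood with right censoring, alpha = exp(f), beta = exp(g).
import numpy as np

def log_logistic_loglik(t, censored, F, G):
    """t: event or censoring time; censored: True if right-censored;
    F, G: latent values, with alpha = exp(F) (scale) and beta = exp(G) (shape)."""
    alpha, beta = np.exp(F), np.exp(G)
    z = (t / alpha) ** beta
    log_surv = -np.log1p(z)                       # log S(t) = -log(1 + (t/alpha)^beta)
    log_pdf = (np.log(beta) - np.log(alpha)
               + (beta - 1.0) * (np.log(t) - np.log(alpha))
               + 2.0 * log_surv)                  # log f(t) for the log-logistic
    return log_surv if censored else log_pdf
```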

Table 2 shows the model's performance on real and synthetic datasets. The leukemia dataset [Henderson et al., 2002] contains censored event times for 1043 leukemia patients and is known to have non-linear responses to certain covariates [Gelman et al., 2013]. We find little advantage from using the chained-survival model on this dataset.


Figure 4: Resulting model on the synthetic survival dataset, showing the variation of the median survival time and the shape of the log-logistic distribution in response to differing covariate information. Background colour shows the chained-survival predictions; coloured dots show ground truth. Lower figures show the associated failure time distributions and hazards for two different synthetic patients. Shapes can be both unimodal and exponential.

Data   G           LSurv       VSurv       CHSurv
leuk   4.03 ± .08  1.57 ± .01  1.57 ± .01  1.56 ± .01
Surv   5.45 ± .06  2.52 ± .02  2.52 ± .02  2.16 ± .02

Table 2: NLPD results over 5 cross-validation folds with 10 replicates each. Models shown in comparison are sparse Gaussian (G), survival Laplace approximation (LSurv), survival VB approximation (VSurv), and chained heteroscedastic survival (CHSurv).

As usual, however, the model is robust, such that performance is not degraded in this case. We additionally show the results on a synthetic dataset where the shape parameter is known to vary in response to the input; in this case an increase in performance is seen. See Appendix A.5 for more details on the model and the synthetic dataset.

4.2.2 Twitter Sentiment Analysis in the UK Election

The final experiment shows the adaptability of the model even further, on a novel dataset and with a novel heteroscedastic model. We consider sentiment in the UK general election, focussing on tweets tagged as supporters of the Labour party. We used a sentiment analysis tagging system³ to evaluate the positiveness of 105,396 tweets containing hashtags relating to the major political parties, over the run in to the UK 2015 general election.

³ Available from https://www.twinword.com/

We are interested in modeling the distribution of positive sentiment as a function of time. The sentiment value is constrained to be between zero and one, and we do not believe the distribution of tweets through time to be necessarily unimodal. A natural likelihood to use in this case is the beta likelihood. This allows us to accommodate bathtub-shaped distributions, indicating that tweets are either extremely positive or extremely negative. We then allow the distribution over tweets to be heterogeneous throughout time by using Gaussian process models for each parameter of the beta distribution, $y_i \sim \mathcal{B}(\alpha = e^{f(\mathbf{x}_i)}, \beta = e^{g(\mathbf{x}_i)})$, where $f(\mathbf{x}) \sim \mathcal{GP}(\mu_f, k_f(\mathbf{x}, \mathbf{x}'))$ and $g(\mathbf{x}) \sim \mathcal{GP}(\mu_g, k_g(\mathbf{x}, \mathbf{x}'))$.

The upper section of Figure 3 shows the data and the probability of each sentiment value throughout time. The lower part shows the corresponding mean and variance functions induced by the above parameterization. The 2015 general election was particularly interesting: polls throughout the campaign showed it to be a close race between the two major parties, Conservative and Labour, but at the end of polling an exit poll was released that predicted an outright win for the Conservatives. This exit poll proved accurate and is associated with a corresponding dip in the sentiment of the tweets. Other interesting aspects of the analysis include the reduction in the number of tweets during the night and the corresponding increase in the variance of our estimates.

4.2.3 Decomposition of Poisson Processes

The intensity, $\lambda(\mathbf{x})$, of a Poisson process can be modelled as the product of two positive latent functions, $\exp(f(\mathbf{x}))$ and $\exp(g(\mathbf{x}))$, as a generalised linear model,
$$\log(\lambda) = f(\mathbf{x}) + g(\mathbf{x}),$$
$$y \sim \text{Poisson}\big(\lambda = \exp(f + g) = \exp(f(\mathbf{x}))\exp(g(\mathbf{x}))\big),$$
using a log link function.

Instead, imagine we form a new process by combining two different underlying Poisson processes through addition. The superposition property of Poisson processes means that the resulting process is also Poisson, with rate given by the sum of the underlying rates.

To model this via a Gaussian process we have to assume that the intensity of the resulting Poisson process, $\lambda(\mathbf{x})$, is a sum of two positive functions, $\exp(f(\mathbf{x}))$ and $\exp(g(\mathbf{x}))$,
$$y \sim \text{Poisson}\big(\lambda = \exp(f(\mathbf{x})) + \exp(g(\mathbf{x}))\big). \tag{10}$$
There is no link function representation for this model; it takes the form of a chained GP.
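The likelihood term that the approximation needs for this model is again a single per-point log density; a sketch (illustrative naming) follows.

```python
# Sketch: per-point log-likelihood for the chained Poisson model of Eq. (10).
import numpy as np
from scipy.special import gammaln

def chained_poisson_loglik(y_i, F, G):
    """log Poisson(y_i | lambda = exp(F) + exp(G)), elementwise over quadrature grids."""
    lam = np.exp(F) + np.exp(G)
    return y_i * np.log(lam) - lam - gammaln(y_i + 1.0)
```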

Focusing purely on the generative model of the data, the lack of a link function does not present an issue. Figure 5 shows a simple demonstration of the idea on a simulated dataset.

Using an additive model for the rate, rather than a multiplicative model, for counting processes has been discussed previously in the context of linear models for survival analysis, with promising results [Lin and Ying, 1995].


Figure 5: Even with 350 data points we can start to see the differentiation of the addition of a long lengthscale positive process and a short lengthscale positive process. Red crosses denote observations, dotted lines are the true latent functions generating the data using Eq. (10), and the solid lines and associated error bars are the approximate posterior predictions, $q(f_*)$, $q(g_*)$, of the latent processes.


To illustrate the model on real data we considered homicide data in Chicago. Taking data from http://homicides.redeyechicago.com/ (see also Linderman and Adams [2014]), we aggregated the data into three-month periods by zip code. We considered an additive Poisson process with a particular structure for the covariance functions, constructing a rate of the form
$$\Lambda(\mathbf{x}, t) = \lambda_1(\mathbf{x})\mu_1(t) + \lambda_2(\mathbf{x})\mu_2(t),$$

where $\lambda_1(\mathbf{x}) = \exp(f_1(\mathbf{x}))$, $\lambda_2(\mathbf{x}) = \exp(g_1(\mathbf{x}))$, $\mu_1(t) = \exp(f_2(t))$ and $\mu_2(t) = \exp(g_2(t))$, with $f_1(\mathbf{x})$, $g_1(\mathbf{x})$ spatial GPs and $f_2(t)$, $g_2(t)$ temporal GPs. The overall rate decomposes into two separable rate functions, but the overall rate function is not separable: we have a sum of separable [Alvarez et al., 2012] rate functions. This structure allows us to decompose the homicide map into separate spatial maps that each evolve at different time rates. We selected one spatial map with a lengthscale of 0.04 and one spatial map with a lengthscale of 0.09. The timescales and variances of the temporal rate functions were optimized by maximum likelihood. The results are shown in Figure 6. The long lengthscale process hardly fluctuates across time, whereas the short lengthscale process, which represents more localized homicide activity, fluctuates across the seasons with scaled increases of around 1.25 deaths per month per zip code. This decomposition is possible and interpretable due to the structured underlying nature of the GPs inside the chained model.
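On a discretised grid the sum-of-separable structure is just a sum of two outer products; the sketch below (with random draws standing in for the fitted spatial and temporal GPs, and arbitrary grid sizes) shows how the rate matrix is assembled.

```python
# Sketch: sum-of-separable rate Lambda(x, t) = lambda_1(x) mu_1(t) + lambda_2(x) mu_2(t)
# on a grid; f1, g1, f2, g2 are illustrative stand-ins for the fitted GPs.
import numpy as np

rng = np.random.default_rng(3)
n_space, n_time = 60, 24                       # e.g. zip codes x three-month periods
f1, g1 = rng.normal(size=(2, n_space))         # spatial latent functions
f2, g2 = rng.normal(size=(2, n_time))          # temporal latent functions

Lambda = (np.outer(np.exp(f1), np.exp(f2))     # lambda_1(x) mu_1(t)
          + np.outer(np.exp(g1), np.exp(g2)))  # + lambda_2(x) mu_2(t)
counts = rng.poisson(Lambda)                   # one realisation of the count data
```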

Figure 6: Homicide rate maps for Chicago. The short lengthscale spatial process, $\lambda_1(\mathbf{x})$ (above left), is multiplied in the model by a temporal process, $\mu_1(t)$ (below left), which fluctuates with the passing seasons. Contours of the spatial processes are plotted as deaths per month per zip code area. Error bars on the temporal processes are at the 5th and 95th percentiles. The longer lengthscale spatial process, $\lambda_2(\mathbf{x})$ (above right), has been modelled with little to no temporal fluctuation, $\mu_2(t)$ (below right).

5 Conclusions

We have introduced “Chained Gaussian Process” models. They allow us to make predictions which are based on a non-linear combination of underlying latent functions. This gives a far more flexible formalism than the generalized linear models that are classically applied in this domain.

Chained Gaussian processes are a general formalism and are therefore intractable in the base case. We derived an approximation framework that is applicable to any factorized likelihood. For the cases we considered, involving two latent functions, the approximation made use of two-dimensional Gauss-Hermite quadrature. We speculated that when the idea is extended to higher numbers of latent functions it may be necessary to resort to Monte Carlo sampling.

Our approximation is highly scalable through the use of stochastic variational inference. This enables the full range of standard stochastic optimizers to be applied in the framework.

Acknowledgments

AS was supported by a University of Sheffield Faculty Scholarship; JH was supported by an MRC fellowship. The authors also thank Amazon for donated AWS compute time and the anonymous reviewers for their comments.


References

Ryan Prescott Adams and Oliver Stegle. Gaussian process product models for nonparametric nonstationarity. In Proceedings of the 25th International Conference on Machine Learning, pages 1–8, 2008.

Mauricio Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012. doi: 10.1561/2200000036.

Amir Dezfouli and Edwin V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1414–1422. Curran Associates, Inc., 2015.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011. ISSN 1532-4435.

Yarin Gal, Mark van der Wilk, and Carl E. Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27, Cambridge, MA, 2014.

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis, Third Edition. CRC Press, November 2013. ISBN 9781439840955.

Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10, pages 493–499, Cambridge, MA, 1998. MIT Press.

R. Henderson, S. Shimakura, and D. Gorst. Modeling spatial variation in leukemia survival data. Journal of the American Statistical Association, 97:965–972, 2002.

James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Ann Nicholson and Padhraic Smyth, editors, Uncertainty in Artificial Intelligence, volume 29. AUAI Press, 2013a.

James Hensman, Neil D. Lawrence, and Magnus Rattray. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC Bioinformatics, 14(252), 2013b. doi: 10.1186/1471-2105-14-252.

James Hensman, Alexander G. D. G. Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In 18th International Conference on Artificial Intelligence and Statistics, pages 1–9, San Diego, California, USA, May 2015.

Daniel Hernandez-Lobato, Viktoriia Sharmanska, Kristian Kersting, Christoph H. Lampert, and Novi Quadrianto. Mind the nuisance: Gaussian process classification using privileged noise. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 837–845. Curran Associates, Inc., 2014.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

Pasi Jylanki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12:3227–3257, November 2011. ISSN 1532-4435.

D. P. Kingma and M. Welling. Stochastic gradient VB and the variational auto-encoder. In 2nd International Conference on Learning Representations (ICLR), Banff, 2014.

Malte Kuß. Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis, TU Darmstadt, April 2006.

Miguel Lazaro-Gredilla and Michalis Titsias. Variational heteroscedastic Gaussian process regression. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 841–848, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.

D. Y. Lin and Zhiliang Ying. Semiparametric analysis of general additive-multiplicative hazard models for counting processes. The Annals of Statistics, 23(5):1712–1734, 1995. doi: 10.1214/aos/1176324320.

Scott Linderman and Ryan Adams. Discovering latent network structure in point process data. In ICML, 2014.

John Nelder and Robert Wedderburn. Generalized linear models. Journal of the Royal Statistical Society, A, 135(3), 1972. doi: 10.2307/2344614.

Trung V. Nguyen and Edwin V. Bonilla. Automated variational inference for Gaussian process models. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1404–1412. Curran Associates, Inc., 2014.

Manfred Opper and Cedric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic back-propagation and variational inference in deep latent Gaussian models. Technical report, 2014.

B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting (with discussion). Journal of the Royal Statistical Society, B, 47(1):1–52, 1985.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Yair Weiss, Bernhard Scholkopf, and John C. Platt, editors, Advances in Neural Information Processing Systems, volume 18, Cambridge, MA, 2006. MIT Press.

Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In Sebastian Thrun, Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Information Processing Systems, volume 16, Cambridge, MA, 2004. MIT Press.

T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelfth International Workshop on Artificial Intelligence and Statistics, volume 5, pages 567–574, Clearwater Beach, FL, 16-18 April 2009. JMLR W&CP 5.

Ville Tolvanen, Pasi Jylanki, and Aki Vehtari. Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop, 2014.


Richard E. Turner and Maneesh Sahani. Demodulation as probabilistic inference. IEEE Transactions on Audio, Speech, and Language Processing, 19:2398–2411, 2011.

Jarno Vanhatalo, Jaakko Riihimaki, Jouni Hartikainen, Pasi Jylanki, Ville Tolvanen, and Aki Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14(1):1175–1179, 2013. http://mloss.org/software/view/451/.
