arXiv:1401.4082v1 [stat.ML] 16 Jan 2014

Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models

Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra
{danilo, shakir, daan}@deepmind.com

DeepMind Technologies, London

Abstract

We marry ideas from deep neural networks and approximate Bayesian inference to derive a generalised class of deep, directed generative models, endowed with a new algorithm for scalable inference and learning. Our algorithm introduces a recognition model to represent approximate posterior distributions, which acts as a stochastic encoder of the data. We develop stochastic back-propagation (rules for back-propagation through stochastic variables) and use this to develop an algorithm that allows for joint optimisation of the parameters of both the generative and recognition models. We demonstrate on several real-world data sets that the model generates realistic samples, provides accurate imputations of missing data and is a useful tool for high-dimensional data visualisation.

1. Introduction

There is an immense effort in machine learning and statistics to develop accurate and scalable probabilistic models of data. Such models are called upon whenever we are faced with tasks requiring probabilistic reasoning, such as prediction, missing data imputation and uncertainty estimation; or for those tasks requiring the ability to quickly generate samples from the model, allowing for confabulation and simulation-based analysis used in many scientific fields such as genetics, robotics and control.

Recent efforts to develop generative models have focused on directed models, since samples are easily obtained by ancestral sampling from the generative process. Directed models such as belief networks and similar latent variable models (Dayan et al., 1995; Frey, 1996; Saul et al., 1996; Bartholomew & Knott, 1999; Uria et al., 2013; Gregor et al., 2013) can be easily sampled from, but in most cases, efficient inference algorithms have remained elusive. These efforts, combined with the demand for accurate probabilistic inferences and fast simulation, lead us to seek generative models that are i) deep, since hierarchical structure allows us to capture higher moments of the data, ii) non-linear, allowing for complex structure in the data to be modelled, iii) able to generate fantasy data quickly from the inferred model, and iv) tractable and scalable to large amounts of data.

ArXiv preprint. Copyright 2014 by the author(s).

We meet these desiderata by introducing a class of deep, directed generative models with Gaussian latent variables at each layer. To allow for efficient and tractable inference, we also introduce a corresponding recognition model, which can be interpreted as a stochastic encoder of the data, and represents an approximate posterior distribution. For the generative model, we derive the objective function for optimisation using variational principles; for the recognition model, we specify its structure and regularisation by exploiting recent advances in deep learning. Using this construction, we can train the entire model by a modified gradient back-propagation, which allows for optimisation of the parameters of both the generative and recognition models jointly.

We build upon the large body of prior work (contextualised in section 5) and make the following contributions:

• We marry ideas from deep neural networks and probabilistic latent variable modelling to derive a general class of deep, non-linear latent Gaussian models (section 2).

• We present a new approach for scalable variational inference that allows for joint optimisation of both variational and model parameters by exploiting the properties of latent Gaussian distributions and gradient back-propagation (sections 3 and 4).



• We provide a comprehensive and systematic evaluation of the model demonstrating its applicability to problems in simulation, visualisation, prediction and missing data imputation (section 6).

2. Deep Latent Gaussian Models

Deep latent Gaussian models (DLGMs) are a general class of deep directed graphical models that consist of Gaussian latent variables at each layer of a processing hierarchy. The model consists of $L$ layers of latent variables. To generate a sample from the model, we begin at the top-most layer ($L$) by drawing from a Gaussian distribution. The activation $h_l$ at any lower layer is formed by a non-linear transformation of the layer above, $h_{l+1}$, perturbed by Gaussian noise. We then generate observations $v$ by sampling from the likelihood using the activation $h_1$. This process is described graphically in figure 1.

This generative process is described by the following equations:

$$\xi_l \sim \mathcal{N}(\xi_l \mid 0, I), \quad l = 1, \ldots, L \tag{1}$$

$$h_L = G_L \xi_L \tag{2}$$

$$h_l = T_l(h_{l+1}) + G_l \xi_l, \quad l = 1, \ldots, L-1 \tag{3}$$

$$v \sim \pi(v \mid T_0(h_1)) \tag{4}$$

where $\xi_l$ are Gaussian variables. The transformations $T_l$ are represented by multi-layer perceptrons (MLPs) and $G_l$ are matrices. At the visible layer, the data is generated from any appropriate distribution $\pi(v \mid \cdot)$ whose parameters are specified by a transformation of the first latent layer. This allows us to deal with data that may be of any type, such as binary, categorical, counts, real-values, or a heterogeneous combination of these data types. We typically make use of distributions in the exponential family, such as the Bernoulli distribution for binary data, but we are not restricted to this set of likelihood functions.
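The generative process in equations (1) to (4) can be sketched as ancestral sampling. The following is a minimal illustration rather than the authors' implementation: the helper `make_mlp`, the layer sizes, the random weights and the choice of a Bernoulli likelihood with a logistic link are all assumptions made for the example.

```python
import numpy as np

def make_mlp(rng, d_in, d_hidden, d_out):
    """One-hidden-layer ReLU MLP; a hypothetical stand-in for a transformation T_l."""
    W1 = rng.normal(0.0, 0.3, size=(d_hidden, d_in))
    W2 = rng.normal(0.0, 0.3, size=(d_out, d_hidden))
    b1, b2 = np.zeros(d_hidden), np.zeros(d_out)
    return lambda h: W2 @ np.maximum(W1 @ h + b1, 0.0) + b2

def sample_dlgm(rng, T, G):
    """Draw one observation by ancestral sampling.

    T = [T_0, ..., T_{L-1}] are callables, G = [G_1, ..., G_L] are matrices.
    """
    L = len(G)
    xi = [rng.standard_normal(G_l.shape[1]) for G_l in G]   # eq. (1)
    h = G[L - 1] @ xi[L - 1]                                 # eq. (2): h_L
    for l in range(L - 1, 0, -1):                            # eq. (3): l = L-1, ..., 1
        h = T[l](h) + G[l - 1] @ xi[l - 1]
    probs = 1.0 / (1.0 + np.exp(-T[0](h)))                   # Bernoulli means from T_0(h_1)
    v = (probs > rng.random(probs.shape)).astype(int)        # eq. (4), assumed Bernoulli
    return v, probs

rng = np.random.default_rng(0)
K1, K2, D = 4, 3, 6                      # layer sizes chosen only for illustration
T = [make_mlp(rng, K1, 8, D), make_mlp(rng, K2, 8, K1)]
G = [0.5 * np.eye(K1), np.eye(K2)]
v, probs = sample_dlgm(rng, T, G)
```

Sampling proceeds strictly top-down, so the cost of one draw is one pass through the transformations $T_{L-1}, \ldots, T_0$.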

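The joint probability distribution of this model can be expressed in two equivalent ways:

$$p(v, h) = p(v \mid h_1)\, p(h_L)\, p(\theta^g) \prod_{l=1}^{L-1} p_l(h_l \mid h_{l+1}) \tag{5}$$

$$p(v, \xi) = p(v \mid h_1(\xi_{1 \ldots L}))\, p(\theta^g) \prod_{l=1}^{L} \mathcal{N}(\xi_l \mid 0, I) \tag{6}$$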

A graphical model corresponding to these two views is shown in figure 1(a),(b). The conditional distributions $p(h_l \mid h_{l+1})$ are implicitly defined by equation (3) and are Gaussian with mean $\mu_l = T_l(h_{l+1})$ and covariance $S_l = G_l G_l^\top$. Equation (6) makes explicit that this generative model works by applying a highly non-linear transformation to a spherical Gaussian distribution such that the transformed distribution tries to match the empirical distribution. We discuss this in more detail in appendix A. Throughout the paper we refer to the set of parameters in $T_l$ and $G_l$ by $\theta^g$. We adopt a weak Gaussian prior over $\theta^g$, $p(\theta^g) = \mathcal{N}(\theta^g \mid 0, \kappa I)$.

Figure 1. (a) Graphical model for deep latent Gaussian models. (b) Graphical model corresponding to the equivalent representation (6). (c) The computational diagram indicating the connectivity of the generative and recognition models. Black arrows indicate the forward pass of sampling from the recognition and generative models, and red arrows indicate the backward pass for gradient computation. The dashed arrows indicate points where stochastic rather than deterministic back-propagation is used.

This specification for deep latent Gaussian models generalises a number of well known models. When we have only one layer of latent variables and use a linear mapping $T(\cdot)$, we recover factor analysis (Bartholomew & Knott, 1999); more general mappings allow for a non-linear factor analysis. When the mappings are of the form $T_l(h) = A_l f(h) + b_l$, for simple element-wise non-linearities $f$ such as the probit function or the rectified linearity, we recover the non-linear Gaussian belief network (Frey & Hinton, 1999). We describe the relationship to other existing models in section 5.

Given this specification, our key task is to develop a method for tractable inference. A number of approaches are known and currently used, in particular algorithms based on mean-field variational EM (Beal, 2003), the wake-sleep algorithm (Dayan, 2000), policy-gradient methods such as REINFORCE (Williams, 1992), or stochastic gradient methods (Hoffman et al., 2012). We develop an alternative to these inference algorithms that overcomes many of their limitations, and that is both scalable and efficient.

3. Stochastic Back-propagation

The key tool we develop to allow for efficient inference is the set of rules to allow for gradient computations in models with random variables. We develop these identities and refer to them as the rules for stochastic back-propagation.

Gradient descent methods in latent variable models typically require computations of the form $\nabla_\theta \mathbb{E}_{q_\theta}[f(\xi)]$, where $\theta$ is a set of parameters of a distribution $q_\theta(\xi)$ and $f$ is a loss function, which we assume to be integrable and smooth. Of special interest here are cases where the distribution $q$ is a $K$-dimensional Gaussian $\mathcal{N}(\xi \mid \mu, C)$ with parameters $\theta = \{\mu, C\}$. In this setting, gradients can be computed using the Gaussian gradient identities:

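$$\nabla_{\mu_i} \mathbb{E}_{\mathcal{N}(\mu, C)}[f(\xi)] = \mathbb{E}_{\mathcal{N}(\mu, C)}[\nabla_{\xi_i} f(\xi)] \tag{7}$$

$$\nabla_{C_{ij}} \mathbb{E}_{\mathcal{N}(\mu, C)}[f(\xi)] = \frac{1}{2} \mathbb{E}_{\mathcal{N}(\mu, C)}\left[\nabla^2_{\xi_i, \xi_j} f(\xi)\right] \tag{8}$$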

which are due to the theorems by Bonnet (1964) and Price (1958), respectively. This allows us to write a general rule for Gaussian gradient computation.
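Identities (7) and (8) can be checked numerically in one dimension. The sketch below uses the test function $f(\xi) = \xi^3$, chosen here only because $\mathbb{E}_{\mathcal{N}(\mu, c)}[\xi^3] = \mu^3 + 3\mu c$ has closed-form derivatives; the values of $\mu$, $c$ and the sample size are arbitrary assumptions.

```python
import numpy as np

def f_grad(xi):
    """First derivative of the test function f(xi) = xi**3."""
    return 3.0 * xi**2

def f_hess(xi):
    """Second derivative of the test function f(xi) = xi**3."""
    return 6.0 * xi

mu, c = 0.7, 0.5                                   # mean and variance of the Gaussian
rng = np.random.default_rng(0)
xi = mu + np.sqrt(c) * rng.standard_normal(200_000)

# Monte Carlo estimates of the right-hand sides of (7) and (8).
bonnet_mc = f_grad(xi).mean()
price_mc = 0.5 * f_hess(xi).mean()

# Derivatives of the closed form mu**3 + 3*mu*c with respect to mu and c.
bonnet_exact = 3.0 * mu**2 + 3.0 * c
price_exact = 3.0 * mu
```

With this many samples the two Monte Carlo estimates should agree with the closed-form derivatives to roughly two decimal places.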

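Gaussian back-propagation (GBP). Combining equations (7) and (8) and using the chain rule we can derive the full gradient

$$\nabla_\theta \mathbb{E}_{\mathcal{N}(\mu, C)}[f(\xi)] = \mathbb{E}_{\mathcal{N}(\mu, C)}\left[g^\top \frac{\partial \mu}{\partial \theta} + \frac{1}{2} \mathrm{Tr}\left(H \frac{\partial C}{\partial \theta}\right)\right], \tag{9}$$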

where $g$ and $H$ are the gradient and the Hessian of the function $f(\xi)$, respectively. Equation (9) can be interpreted as a modified back-propagation rule for Gaussian distributions that takes into account the gradients through the mean and covariance and that reduces to the standard back-propagation rule when $C$ is constant. Unfortunately this rule requires knowledge of the Hessian matrix of $f(\xi)$, which has an algorithmic complexity $O(K^3)$. For inference in DLGMs, we later introduce an unbiased though higher variance estimator that requires only quadratic complexity.

We can also derive general back-propagation rules for $q$-distributions other than the Gaussian. Such rules can be derived in two ways:

1. Using the product rule for integrals. For many exponential family distributions, it is possible to use the product rule for integrals to express the gradient with respect to the parameters (mean or natural) as an expectation of gradients with respect to the random variables themselves:

$$\nabla_\theta \mathbb{E}_{p(\xi \mid \theta)}[f(\xi)] = -\mathbb{E}_{p(\xi \mid \theta)}\left[\nabla_\xi [B(\xi) f(\xi)]\right] \tag{10}$$

We can thus reduce the problem to searching for a suitable transformation function $B(\xi)$ that allows us to compute the required derivatives directly. This approach can be used to derive rules for many other distributions, e.g., the Gaussian, inverse Gamma, log-Normal and Wald distributions. We discuss this in more detail in appendix B.

2. Using suitable co-ordinate transformations. We can also derive stochastic back-propagation rules for any distribution that can be written as a smooth, invertible transformation of a canonical base distribution. For example, any Gaussian distribution $\mathcal{N}(\mu, C)$ can be obtained as a transformation of a spherical Gaussian $\varepsilon \sim \mathcal{N}(0, I)$, using the transformation $y = \mu + R\varepsilon$ with $C = RR^\top$. This co-ordinate transformation allows the gradient of the expectation with respect to $R$ to be written as:

$$\nabla_R \mathbb{E}_{\mathcal{N}(\mu, C)}[f(\xi)] = \nabla_R \mathbb{E}_{\mathcal{N}(0, I)}[f(\mu + R\varepsilon)] = \mathbb{E}_{\mathcal{N}(0, I)}[\varepsilon g^\top], \tag{11}$$

where $g$ is the gradient of $f$ evaluated at $\mu + R\varepsilon$. This identity provides a lower-cost alternative to Price's theorem (8). The estimator (11) will in general have higher variance than the estimators based on (8). A short analysis of the variance of these estimators is given in section 5 and in appendix C.
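A small numerical check of the co-ordinate transformation estimator is sketched below for a quadratic test function $f(\xi) = \frac{1}{2}\xi^\top A \xi$ with symmetric $A$, for which $\mathbb{E}[f] = \frac{1}{2}\mu^\top A \mu + \frac{1}{2}\mathrm{Tr}(A R R^\top)$ and the derivative with respect to $R$ is $AR$. The matrices and sample size are assumptions for illustration. As a layout choice of this sketch, entry $(i, j)$ of the estimate pairs $g_i$ with $\varepsilon_j$, the index convention used in equation (20), so that the array has the same layout as $R$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])          # symmetric, so grad f(xi) = A xi
mu = np.array([0.5, -0.2, 0.1])
R = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.8, 0.0],
              [-0.2, 0.1, 0.5]])         # any R with C = R R^T

eps = rng.standard_normal((200_000, 3))  # eps ~ N(0, I), one row per sample
xi = mu + eps @ R.T                      # rows are xi = mu + R eps
g = xi @ A                               # rows are g = A xi (A is symmetric)

grad_R_mc = g.T @ eps / eps.shape[0]     # entry (i, j) averages g_i * eps_j
grad_R_exact = A @ R                     # derivative of 0.5 * Tr(A R R^T) w.r.t. R
```

Only first derivatives of $f$ are needed, which is what makes this route cheaper than evaluating a Hessian.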

This approach can be more general than the first approach, and such transformations are well known for many distributions, especially those with a self-similarity property or location-scale formulations, such as the Gaussian, Student's t-distribution, the class of stable distributions, and the class of generalised extreme value distributions. We provide further discussion in appendix B.

4. Scalable Inference in DLGMs

4.1. Free Energy Objective

To perform inference in DLGMs we integrate out the effect of the latent variables, requiring us to compute the integrated or marginal likelihood. In general, this will be an intractable integration and instead we optimise a lower bound on the marginal likelihood. We introduce an approximate posterior distribution $q$ and apply Jensen's inequality, following the variational principle (Zemel, 1993; Beal, 2003), to obtain:

$$\begin{aligned}
\log p(V) &= \log \int p(V \mid \xi, \theta^g)\, p(\xi, \theta^g)\, d\xi \\
&= \log \int \frac{q(\xi)}{q(\xi)}\, p(V \mid \xi, \theta^g)\, p(\xi, \theta^g)\, d\xi \\
&\ge \mathcal{L}(V) = -D_{KL}[q(\xi) \,\|\, p(\xi)] + \mathbb{E}_q[\log p(V \mid \xi, \theta^g)\, p(\theta^g)].
\end{aligned} \tag{12}$$

We use the variational free energy, defined as $\mathcal{F}(V) = -\mathcal{L}(V)$, as our objective function. This objective consists of two terms: the first is the KL-divergence between the variational distribution and the prior distribution (which acts as a regulariser), and the second is a reconstruction error. We refer to the approximate posterior distribution $q(\xi)$ as a recognition model, and its design is important for the success of such variational methods.

We use a variational distribution $q(\xi)$ that factorises across the $L$ layers, but not necessarily within a layer:

$$q(\xi \mid V, \theta^r) = \prod_{n=1}^{N} \prod_{l=1}^{L} \mathcal{N}\left(\xi_{n,l} \mid \mu_l(v_n), C_l(v_n)\right), \tag{13}$$

where $\mu_l(v)$ and $C_l(v)$ are generic maps represented by MLPs corresponding to the mean vector and covariance matrix of the units in layer $l$ respectively. We refer to the parameters of the recognition model $q(\xi \mid v)$ as $\theta^r$.

For a Gaussian prior and a Gaussian recognition model, the KL term in the free energy (12) can be computed analytically and the free energy becomes:

$$\mathcal{F}(V) = -\sum_n \mathbb{E}_q[\log p(v_n \mid h(\xi_n))] + \frac{1}{2\kappa}\|\theta^g\|^2 + \frac{1}{2} \sum_{n,l} \left[\|\mu_{n,l}\|^2 + \mathrm{Tr}(C_{n,l}) - \log|C_{n,l}| - 1\right], \tag{14}$$

where $\mathrm{Tr}(C)$ and $|C|$ indicate the trace and the determinant of the covariance matrix $C$, respectively.
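The analytic KL term for one layer and one data point can be sketched as follows. The function names are hypothetical. For a $K$-dimensional layer the closed-form Gaussian KL has the additive constant $K$ (one unit per latent dimension), which is how the code writes it; the constant has no effect on gradients.

```python
import numpy as np

def kl_to_standard_normal(mu, C):
    """KL[N(mu, C) || N(0, I)] for a full covariance matrix C."""
    K = mu.shape[0]
    _, logdet = np.linalg.slogdet(C)     # numerically stable log |C|
    return 0.5 * (mu @ mu + np.trace(C) - logdet - K)

def kl_to_standard_normal_diag(mu, d):
    """Same quantity for C = diag(d), computed in O(K)."""
    return 0.5 * np.sum(mu**2 + d - np.log(d) - 1.0)

mu = np.array([1.0, -0.5])
d = np.array([0.5, 2.0])                 # variances; log(0.5) + log(2) = 0
kl_dense = kl_to_standard_normal(mu, np.diag(d))
kl_diag = kl_to_standard_normal_diag(mu, d)
```

Both functions return the same value for a diagonal covariance, and the value is zero when the recognition distribution equals the prior.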

4.2. Parameterising the Recognition Covariance

There are a number of approaches for parameterising the covariance matrix of the recognition model. Maintaining a full covariance matrix $C$ in equation (14) would entail an algorithmic complexity $O(LK^3)$, where $K$ is the number of latent variables per layer and $L$ is the number of latent layers.

The simplest approach is to use a diagonal covariance matrix $C = \mathrm{diag}(d)$, where $d$ is a $K$-dimensional vector. This approach is appealing since it allows for linear-time computation and sampling, but only allows for axis-aligned posterior distributions.

We can improve upon the diagonal approximation by considering a structured covariance parameterised by two independent vectors $d$ and $u$. We parameterise the precision matrix $C^{-1}$ (Magdon-Ismail & Purnell, 2010):

$$C^{-1} = D + uu^\top, \tag{15}$$

where $D = \mathrm{diag}(d)$. This representation allows for arbitrary rotations of the Gaussian distribution along one principal direction, with relatively few additional parameters.

By application of the matrix inversion lemma, we obtain the covariance matrix in terms of $d$ and $u$ as:

$$C = D^{-1} - \eta D^{-1} u u^\top D^{-1}, \qquad \eta = \frac{1}{u^\top D^{-1} u + 1}, \qquad \log|C| = \log \eta - \log|D|. \tag{16}$$

This allows both the trace $\mathrm{Tr}(C)$ and the $\log|C|$ needed in the computation of the Gaussian KL to be computed in $O(KL)$ time.

In addition, we can factorise the covariance matrix as the product of two matrices, $C = RR^\top$. $R$ is a square matrix of the same size as $C$ and can be computed directly in terms of the vectors $d$ and $u$. One solution for $R$ is:

$$R = D^{-\frac{1}{2}} - \left[\frac{1 - \sqrt{\eta}}{u^\top D^{-1} u}\right] D^{-1} u u^\top D^{-\frac{1}{2}}. \tag{17}$$

Due to this structure, the product of $R$ with an arbitrary vector can be computed in $O(K)$ without having to compute $R$ explicitly. This allows us to also sample efficiently from this low-rank Gaussian, since any Gaussian random variable $\xi$ with mean $\mu$ and covariance matrix $C = RR^\top$ can be written as $\xi = \mu + R\varepsilon$, where $\varepsilon$ is a standard Gaussian variate.
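The structured covariance can be sketched with vector operations only. The helper names `rank1_quantities` and `apply_R` are hypothetical. The sketch follows the matrix inversion lemma for $C^{-1} = D + uu^\top$, in which the rank-one correction to $D^{-1}$ is subtracted, and the factor $R$ of equation (17); the dense matrices at the end are built only to check the $O(K)$ routines.

```python
import numpy as np

def rank1_quantities(d, u):
    """Return u^T D^{-1} u, eta, Tr(C) and log|C| for C^{-1} = diag(d) + u u^T."""
    d_inv = 1.0 / d
    s = np.sum(u * u * d_inv)                                  # u^T D^{-1} u
    eta = 1.0 / (s + 1.0)
    trace_C = np.sum(d_inv) - eta * np.sum((u * d_inv) ** 2)   # Tr(D^-1) - eta u^T D^-2 u
    logdet_C = np.log(eta) - np.sum(np.log(d))
    return s, eta, trace_C, logdet_C

def apply_R(d, u, eps):
    """Compute R @ eps for the factor R of eq. (17) without forming R."""
    s, eta, _, _ = rank1_quantities(d, u)
    coef = (1.0 - np.sqrt(eta)) / s
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return d_inv_sqrt * eps - coef * (u / d) * np.dot(u * d_inv_sqrt, eps)

rng = np.random.default_rng(0)
K = 5
d = rng.uniform(0.5, 2.0, size=K)        # positive diagonal of the precision
u = rng.standard_normal(K)
_, _, trace_C, logdet_C = rank1_quantities(d, u)

# Dense reference quantities, used only for verification.
C_dense = np.linalg.inv(np.diag(d) + np.outer(u, u))
R_dense = np.column_stack([apply_R(d, u, e) for e in np.eye(K)])
```

A sample from the recognition distribution is then `mu + apply_R(d, u, eps)` with `eps` a standard Gaussian vector.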

4.3. Gradients of the Free Energy

We wish to minimise the free energy $\mathcal{F}(V)$ (14) using stochastic gradient descent methods. For this, we need efficient estimators of the gradients of all terms in (14) with respect to both the parameters of the generative model $\theta^g$ and the recognition model $\theta^r$.

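The gradients with respect to the $j$th generative parameter $\theta^g_j$ can be computed using:

$$\nabla_{\theta^g_j} \mathcal{F}(V) = -\mathbb{E}_q\left[\nabla_{\theta^g_j} \log p(V \mid h)\right] + \frac{1}{\kappa} \theta^g_j. \tag{18}$$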

An unbiased estimator of $\nabla_{\theta^g_j} \mathcal{F}(V)$ is obtained by approximating (18) with a small number of samples from $q$ (or even a single sample).

To obtain gradients with respect to the recognition parameters $\theta^r$, we use the rules for Gaussian back-propagation developed in section 3. To address the complexity of the Hessian in the general rule (9), we use the co-ordinate transformation for the Gaussian to write the gradient with respect to the matrix $R$ instead of the covariance $C$ (recalling $C = RR^\top$) derived in equation (11), where derivatives are computed for the function $f(\xi) = \log p(v \mid h(\xi))$.

The gradients of $\mathcal{F}(v)$ with respect to the variational mean $\mu_l(v)$ and the matrices $R_l(v)$ are given by

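$$\nabla_{\mu_l} \mathcal{F}(v) = -\mathbb{E}_q\left[\nabla_{\xi_l} \log p(v \mid h(\xi))\right] + \mu_l \tag{19}$$

$$\nabla_{R_{l,i,j}} \mathcal{F}(v) = -\frac{1}{2} \mathbb{E}_q\left[\varepsilon_{l,j} \nabla_{\xi_{l,i}} \log p(v \mid h(\xi))\right] + \frac{1}{2} \nabla_{R_{l,i,j}} \left[\mathrm{Tr}\, C_{n,l} - \log|C_{n,l}|\right], \tag{20}$$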

where the gradients $\nabla_{R_{l,i,j}} \left[\mathrm{Tr}\, C_{n,l} - \log|C_{n,l}|\right]$ are computed by back-propagation.

Unbiased estimators of the gradients (19) and (20) are then obtained jointly by sampling from the recognition model $\xi \sim q(\xi \mid v)$ (bottom-up pass) and updating the values of the generative model layers using equation (3) (top-down pass).
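As an illustration of the sampled estimator for (19), the sketch below uses a toy single-layer model that is not from the paper: a hypothetical linear decoder $h(\xi) = W\xi$ with a unit-variance Gaussian likelihood, for which the expectation in (19) is available in closed form and the Monte Carlo estimate can be checked. All sizes and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 3
W = rng.normal(0.0, 0.5, size=(D, K))    # hypothetical linear decoder h(xi) = W xi
v = rng.standard_normal(D)               # one observation
mu = np.array([0.3, -0.1, 0.2])          # recognition mean for this observation
R = 0.5 * np.eye(K)                      # recognition factor, C = R R^T

def grad_log_lik(xi):
    """Gradient of log N(v | W xi, I) with respect to xi, one row per sample."""
    return (v - xi @ W.T) @ W

eps = rng.standard_normal((100_000, K))
xi = mu + eps @ R.T                      # samples from q(xi | v)

grad_mu_mc = -grad_log_lik(xi).mean(axis=0) + mu    # sampled form of eq. (19)
grad_mu_exact = -W.T @ (v - W @ mu) + mu            # closed form for this toy model
```

Averaging over fewer samples, down to a single one, leaves the estimate unbiased but noisier.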

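Finally the gradients $\nabla_{\theta^r_j} \mathcal{F}(v)$, obtained from equations (19) and (20), are:

$$\nabla_{\theta^r} \mathcal{F}(v) = \nabla_\mu \mathcal{F}(v)^\top \frac{\partial \mu}{\partial \theta^r} + \mathrm{Tr}\left(\nabla_R \mathcal{F}(v) \frac{\partial R}{\partial \theta^r}\right). \tag{21}$$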

We can now use the gradients (18) and (21) to descend the free-energy surface with respect to both the generative and recognition parameters in a single optimisation step.

4.4. Algorithm Summary and Complexity

Figure 1(c) shows the flow of computation in the model. Our algorithm proceeds by performing a forward pass (black arrows), consisting of a bottom-up (recognition) phase and a top-down (generation) phase, which updates the hidden activations of the recognition model and the parameters of any Gaussian distributions; and a backward pass (red arrows), in which gradients are computed using the appropriate back-propagation rule for deterministic and stochastic layers.

We take a gradient descent step using:

$$\Delta\theta^{g,r} = -\Gamma^{g,r} \nabla_{\theta^{g,r}} \mathcal{F}(V), \tag{22}$$

where $\Gamma^{g,r}$ are diagonal pre-conditioning matrices computed using the RMSprop heuristic¹. The learning procedure is summarised in algorithm 1.

The computational complexity for producing $S$ samples from the generative model is $O(SLK^2)$, where $K$ is the average number of neurons per layer and $L$ is the number of layers (counting together deterministic and stochastic layers). The computational complexity per training sample during training is $O(LK^2)$, which is the same as that of an auto-encoder with a decoder and encoder of the same size as the generative and recognition models, respectively.

Algorithm 1 Learning in DLGMs

while hasNotConverged() do
    $V \leftarrow$ getMiniBatch()
    $\xi_n \sim q(\xi_n \mid v_n)$ (bottom-up pass) eq. (13)
    $h \leftarrow h(\xi)$ (top-down pass) eq. (3)
    updateGradients() eqs (18), (19), (20)
    $\theta^{g,r} \leftarrow \theta^{g,r} + \Delta\theta^{g,r}$ eq. (22)
end while

¹ Described by G. Hinton, RMSprop: Divide the gradient by a running average of its recent magnitude, in Coursera online course: Neural networks for machine learning, lecture 6e, 2012.
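The pre-conditioned step (22) can be sketched with a commonly used form of the RMSprop heuristic, in which the diagonal of $\Gamma$ is the learning rate divided by the root of a running average of squared gradients. The learning rate, decay constant and the toy objective $\frac{1}{2}\|\theta\|^2$ are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One update theta <- theta + delta, with delta = -Gamma * grad as in eq. (22)."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad**2   # running average of grad^2
    gamma = lr / (np.sqrt(avg_sq) + eps)                # diagonal pre-conditioner
    return theta - gamma * grad, avg_sq

theta = np.array([3.0, -2.0])
avg_sq = np.zeros_like(theta)
for _ in range(2000):
    grad = theta                         # gradient of the toy objective 0.5 * ||theta||^2
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq)
```

In a DLGM the same step would be applied to the concatenated generative and recognition parameters, with `grad` replaced by the sampled gradients of the free energy.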

5. Related Work

Directed Graphical Models. There is a broad existing literature on directed graphical models with depth. Our model is directly related to non-linear Gaussian belief networks (NLGBN) (Frey & Hinton, 1999), which also use latent Gaussian distributions throughout a multi-layer hierarchy, but whose means are simple non-linear transformations of higher layers. Sigmoid belief networks (SBN) (Saul et al., 1996) are closely related and are models with Bernoulli variables at every layer. Both these models are trained by a mean-field variational EM algorithm. More recently, Gregor et al. (2013) described Deep Auto-regressive Networks (DARN), which also form a directed graphical model that uses auto-regressive Bernoulli distributions at each layer. The Gaussian process latent variable model (GPLVM) (Lawrence, 2005) is the non-parametric analogue of our model, and employs Gaussian process priors over the non-linear functions between each layer. Some of the best results using directed models are provided by the neural auto-regressive density estimator (NADE) (Larochelle & Murray, 2011; Uria et al., 2013), which uses function approximation to model conditional distributions within a directed acyclic graph. We will also see this performance on the benchmark tests we present, but major drawbacks exist, such as its inability to allow for deep representations and difficulties in extending to locally connected models (e.g., through the use of convolutional layers) that allow for high-dimensional scaling.

Relation to denoising auto-encoders. Denoising auto-encoders (DAE) (Vincent et al., 2010) introduce a random corruption to the encoder network and attempt to minimize the expected reconstruction error under this corruption noise with additional regularisation terms. Bengio et al. (2013) show how these models can be used as generative models, but since the generative process underlying the denoising auto-encoder is unknown, simulation from the model requires a slow Markov chain sampling procedure. In the model we describe, the recognition distribution $q(\xi \mid v)$ can be interpreted as a stochastic encoder in the DAE setting. We can readily see the correspondence between the expression for the free energy (12) and the reconstruction error and regularization terms used in denoising auto-encoders (c.f. equation (4) of Bengio et al. (2013)).

It is now possible to see denoising auto-encoders in the framework for variational learning in latent variable models. The key difference is that the form of the encoding 'corruption' and regularisation terms used in our model has been derived from the variational principle to provide a strict bound on the data log-likelihood of a known directed generative model, which allows for easy generation of samples.

Alternative Gradient Estimates. Other approaches for gradient estimation are typically used in the literature. The most general approaches are policy-gradient methods such as REINFORCE (Williams, 1992) that are simple to implement and applicable to both discrete and continuous models. This estimator can be written as:

$$\nabla_\theta \mathbb{E}_p[f(\xi)] = \mathbb{E}_p[(f(\xi) - b) \nabla_\theta \log p(\xi \mid \theta)], \tag{23}$$

where $b$ is a baseline typically chosen to reduce the variance of the estimator; we use draws $\xi \sim p(\xi \mid \theta)$.

Unfortunately the variance of (23) scales poorly with the number of random variables (Dayan et al., 1995). To see this limitation, consider functions of the form $f(\xi) = \sum_{i=1}^{D} f(\xi_i)$, where each individual term has a bounded variance, i.e., $\mathrm{Var}[f(\xi_i)] \le \kappa\ \forall i$ for some $\kappa > 0$, and consider independent or weakly correlated random variables. Given these assumptions we have the following bounds for the variances of (7) and REINFORCE (23):

$$\text{GBP:} \quad \mathrm{Var}[\nabla_\xi f(\xi)] \le \kappa$$

$$\text{REINFORCE:} \quad \mathrm{Var}\left[\frac{(\xi - \mu)}{\sigma^2} \left(f(\xi) - \mathbb{E}[f(\xi)]\right)\right] \le D\kappa.$$

Thus, the REINFORCE estimator has the undesirable property that its variance scales linearly with the number of independent random variables in the target function, while the variance of GBP is bounded by a constant.
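The scaling behaviour can be illustrated empirically. The sketch below is an assumption-laden toy experiment rather than a reproduction of the analysis: it takes $f(\xi) = \sum_i \xi_i^2$ with independent $\xi_i \sim \mathcal{N}(1, 1)$, estimates the gradient with respect to the first mean, and uses the exact value of $\mathbb{E}[f(\xi)]$ as the REINFORCE baseline.

```python
import numpy as np

def estimator_variances(D, n_samples=20_000, seed=0):
    """Sample variances of the GBP and REINFORCE estimators of d E[f] / d mu_1."""
    rng = np.random.default_rng(seed)
    mu, sigma = 1.0, 1.0
    xi = mu + sigma * rng.standard_normal((n_samples, D))
    f = np.sum(xi**2, axis=1)                      # f(xi) = sum_i xi_i^2
    baseline = D * (mu**2 + sigma**2)              # exact E[f(xi)]
    gbp = 2.0 * xi[:, 0]                           # gradient of f w.r.t. xi_1, as in eq. (7)
    reinforce = (xi[:, 0] - mu) / sigma**2 * (f - baseline)   # eq. (23) with b = E[f]
    return gbp.var(), reinforce.var()

results = {D: estimator_variances(D) for D in (1, 10, 100)}
for D, (var_gbp, var_reinforce) in results.items():
    print(f"D={D:4d}  GBP variance={var_gbp:8.2f}  REINFORCE variance={var_reinforce:8.2f}")
```

For this toy function the GBP variance stays near $4\sigma^2$ for every $D$, while the REINFORCE variance grows roughly in proportion to $D$.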

Any correlation between the different terms in $f(\xi)$ would reduce the absolute value of the variance (5). The assumption of weakly correlated terms is relevant for variational learning in larger generative models where variables tend to depend strongly only on other variables in their Markov blanket (i.e. only neighbouring nodes in the graphical model tend to display strong correlations). Moreover, due to independence assumptions and structure in the variational distribution, the resulting free energies are often summations over weakly correlated or independent terms, further supporting this view. We provide a second view on the issue in appendix C.

Another very general alternative to REINFORCE is the wake-sleep algorithm (Dayan et al., 1995). The wake-sleep algorithm fails to optimise a single consistent objective function, and there is thus no guarantee that it leads to a decrease of the free energy (12), although it has been shown to work well in some small practical examples. In figure 2(c) we provide a comparison of the performance achieved by our model and the wake-sleep algorithm.

Stochastic back-propagation in other contexts. The key theorems for the Gaussian stochastic back-propagation were first exploited by Opper & Archambeau (2009) for variational learning in Gaussian process regression, and subsequently by Graves (2011) for learning the parameters of large neural networks.

Opper & Archambeau (2009) make use of an unconditional variational approximation, which typically requires several iterations for every visible pattern $v$ in order to converge to a local minimum of the free energy. In contrast, we use a parametric recognition model (eq. (13)) which can produce a one-shot sample from the approximate posterior distribution (similar to the procedure in Dayan et al. (1995); Dayan (2000)). Using a conditional, parametric recognition model also provides a cheap probabilistic encoding of the data set and provides a framework for treating generative models and classes of auto-encoders on the same principled ground (as described above). Recently, Kingma & Welling (2013) also make the connection between stochastic back-propagation, generative auto-encoders and variational inference that we describe here. This work was developed simultaneously with ours and provides an additional perspective on the use and derivation of stochastic back-propagation rules.

6. Results

Generative models, such as the deep latent Gaussian model that we focus on, have a number of applications in simulation, prediction, data visualisation, missing data imputation and other forms of probabilistic reasoning. We describe the testing methodology we use and present results on a number of these tasks.

6.1. Testing Methodology

We use training data of various types including binary and count-based data sets. In all cases, we train using mini-batches, which requires the introduction of scaling terms in the free energy objective function (14) in order to maintain the correct scale between the prior over the parameters and the remaining terms (Ahn et al., 2012; Welling & Teh, 2011). We make use of the objective:

$$\mathcal{F}(V) = -\lambda \sum_n \mathbb{E}_q[\log p(v_n \mid h(\xi_n))] + \frac{1}{2\kappa}\|\theta^g\|^2 + \frac{\lambda}{2} \sum_{n,l} \left[\|\mu_{n,l}\|^2 + \mathrm{Tr}(C_{n,l}) - \log|C_{n,l}| - 1\right], \tag{24}$$

where $n$ is an index over observations in the mini-batch and $\lambda$ is equal to the ratio between the data-set and the mini-batch size. At each iteration, a random mini-batch of size 200 observations is chosen.
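The role of the scaling factor $\lambda$ can be sketched with placeholder per-observation terms. The function name and the random numbers standing in for the reconstruction and KL terms are hypothetical; the point illustrated is that averaging the scaled objective (24) over a partition of the data into equal mini-batches recovers the full-data objective (14), so the prior term keeps its correct relative weight.

```python
import numpy as np

def scaled_free_energy(recon_ll, kl_terms, theta_g, n_data, kappa):
    """Eq. (24) for one mini-batch.

    recon_ll[n] stands for E_q[log p(v_n | h(xi_n))]; kl_terms[n] stands for the
    bracketed KL expression of eq. (24) summed over layers for observation n.
    """
    lam = n_data / recon_ll.shape[0]               # data-set size / mini-batch size
    prior = np.sum(theta_g**2) / (2.0 * kappa)
    return -lam * np.sum(recon_ll) + prior + 0.5 * lam * np.sum(kl_terms)

rng = np.random.default_rng(0)
N, B, kappa = 1000, 200, 1e6
recon = rng.normal(-90.0, 5.0, size=N)             # placeholder per-observation terms
kl = rng.uniform(1.0, 3.0, size=N)
theta_g = rng.standard_normal(50)

full = scaled_free_energy(recon, kl, theta_g, N, kappa)   # lambda = 1, i.e. eq. (14)
batches = [scaled_free_energy(recon[i:i + B], kl[i:i + B], theta_g, N, kappa)
           for i in range(0, N, B)]
```

With randomly drawn mini-batches the same equality holds in expectation rather than exactly.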

All parameters of the model were initialized using samples from a Gaussian distribution with mean zero and variance $1 \times 10^{6}$; the prior variance of the parameters was $\kappa = 1 \times 10^{6}$. We compute the marginal likelihood on the test data by importance sampling using samples from the recognition model; we describe our estimator in appendix D.

Since the recognition model is parameterised by a deep neural network, careful regularisation is needed to ensure that it provides useful inferences for unseen data. We regularise by introducing additional noise to the recognition model, specifically, bit-flip or drop-out noise at the input layer and small additional Gaussian noise to samples generated by the recognition model. We also use rectified non-linearities for any hidden layers.

6.2. Analysing the Approximate Posterior

The specification of the recognition model affects how well we are able to capture the statistics of the data and the accuracy and efficiency of the learning.

We examine the quality of the approximate posterior distribution learnt by the recognition model in comparison to the true posterior distribution by training a deep latent Gaussian model on the MNIST data set of handwritten images. The images are of size 28 × 28, and we use the binarised data set from Larochelle & Murray (2011).

We use sampling to evaluate the true posterior distribution for a number of MNIST digits under the diagonal and the structured covariance parameterisation of the recognition model, described in section 4.2. We visualise the posterior distribution for a model with two Gaussian latent variables in figure 2. The true posterior distribution is shown by the grey regions and was computed by importance sampling with a large number of particles aligned in a grid between -5 and 5. In figure 2(a) we see that many posterior distributions are elliptical or spherical in shape and thus, it is reasonable to assume that they can be well described by a Gaussian approximation. Samples from the prior (shown in green) are spread widely over the space and very few samples fall in the region of significant posterior mass, explaining the inefficiency of estimation methods that rely on samples from the prior. Samples from the recognition model (shown in blue) are concentrated on the posterior mass, indicating that the recognition model has learnt the correct posterior statistics and should lead to efficient learning.

In figure 2(a) we see that samples from the recognition model are aligned to the axis and cannot capture any correlation in the posterior. Using the low-rank covariance model, we see in figure 2(b) that we are able to capture the posterior correlation. Not all posteriors are Gaussian in shape, but the recognition model places mass in the best location possible to provide a reasonable approximation. This aspect emphasises the importance of posterior approximation, and is one we continue to investigate. The performance in terms of test log-likelihood on the MNIST data is shown in figure 2(c) for four algorithms on the same model architecture. We compare factor analysis (FA), the wake-sleep algorithm, and our approach using a diagonal and low-rank covariance.

6.3. Simulation and prediction

We evaluate the performance of a three-layer latent Gaussian model on the MNIST data set. The model consists of two deterministic layers with 200 hidden units and a stochastic layer of 200 latent variables. We use mini-batches of 200 observations and train the model using stochastic back-propagation. Samples from this model are shown in figure 3. We also compare the test log-likelihood to a large number of existing approaches in table 1. We used exactly the data set used in Uria et al. (2013) and quote the log-likelihoods in the lower part of the table from this work. These results show that we are able to compete with some of the best models currently available. The generated digits also match the true data well and visually appear as good as some of the best visualisations from these competing approaches.

We also analysed the performance of our model on

these look like hacks. what are their inferential interpretations?

what does this mean?


(a) Diagonal covariance (b) Low-rank covariance

LowR Diag Wake−Sleep FA84

88

92

96

100

104

Tes

t neg

. mar

gina

l lik

elih

ood

(c) Performance

Figure 2. (a),(b) Analysis of the exact posterior vs. our recognition model for 6 MNIST digits. The grey regions show thetrue posterior distribution, green markers indicate samples from the prior distribution, blue markers are samples from therecognition model, and the red marker indicates the MAP estimate. Within each image we show four views of the sameposterior, zooming in on the region of high posterior mass. (c) Comparison of test log likelihoods for different inferencealgorithms.

Figure 3. Performance on the MNIST dataset. Top left: Samples from the training data. Top right: Samples from thelearned model. Bottom right: sampled pixel probabilities

Table 1. Comparison of negative log-probabilities on thetest set for the binarised MNIST data.Model − ln p(v)

Factor Analysis 106.00NLGBN (Frey & Hinton, 1999) 95.80Wake-Sleep (Dayan, 2000) 91.3DLGM diagonal covariance 87.30DLGM rank-one covariance 86.60

Results below from Uria et al. (2013)

MoBernoullis K=10 168.95MoBernoullis K=500 137.64RBM (500 h, 25 CD steps) approx. 86.34DBN 2hl approx. 84.55NADE 1hl (fixed order) 88.86NADE 1hl (fixed order, RLU, minibatch) 88.33EoNADE 1hl (2 orderings) 90.69EoNADE 1hl (128 orderings) 87.71EoNADE 2hl (2 orderings) 87.96EoNADE 2hl (128 orderings) 85.10

three high-dimensional real image data sets. The NORB object recognition data set consists of 24,300 images that are 96 × 96 pixels. We use a model consisting of one deterministic layer of 400 hidden units and one stochastic layer of 100 latent variables. Samples produced from this model are shown in figure 4(a). The CIFAR10 natural images data set consists of 50,000 RGB images that are 32 × 32 pixels, which we split into 8 × 8 patches. We use the same model as used for the MNIST experiment and show samples from the model in figure 4(b). The Frey faces data set consists of almost 2,000 images of different facial expressions² of size 28 × 20 pixels. In all cases we see that the model has learnt the different image statistics and produces recognisable samples. Results of this kind are promising and suggest that performance can be improved by scaling the model to more realistic scenarios using locally connected and convolutional architectures.

6.4. Missing Data Imputation and Denoising

The generative models we describe are often used for problems in missing data imputation that form the core of applications in recommender systems, bioinformatics and experimental design. We show the ability of the model to impute missing data using the MNIST data set in figure 5. We test the imputation ability under two different missingness types (Little & Rubin, 1987): Missing-at-random (MAR), where we consider 60% and 80% of the pixels to be missing randomly, and Not Missing-at-random (NMAR), where we consider a square region of the image to be missing. The model produces very good completions in both test cases. There is uncertainty in the identity of the image. This is expected and is reflected in the errors in these completions as the resampling procedure is run, and it further demonstrates the ability of the model to capture the diversity of the underlying data. We do not integrate over the missing values in our imputation procedure, but use a procedure that simulates a Markov chain that we show converges to the true marginal distribution. The procedure to sample from the missing pixels given the observed pixels is explained in appendix E.

² Images of Brendan Frey. Data from http://www.cs.nyu.edu/~roweis/data.html

[Figure 4, panels: (a) NORB; (b) CIFAR; (c) Frey]

Figure 4. (a) Performance on the NORB dataset. Left: Samples from the training data. Right: Sampled pixel means from the model. (b) Performance on CIFAR10 patches. Left: Samples from the training data. Right: Sampled pixel means from the model. (c) Frey faces data. Left: data samples. Right: model samples.

Figure 5. Imputation results on MNIST digits. The first column shows the true data. Column 2 shows pixel locations set as missing in grey. The remaining columns show imputations and denoising of the images for 15 iterations, starting left to right. Top: 60% missingness. Middle: 80% missingness. Bottom: 5x5 patch missing.
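The two missingness patterns used in the imputation experiment can be sketched as follows. This is only an illustration: the function names, the uniform choice of exactly round(fraction × pixels) missing pixels, and the random placement of the square patch are assumptions, since the text does not specify how the masks were drawn.

```python
import random


def mar_mask(height, width, missing_fraction, rng):
    """Pixels (row, col) chosen uniformly at random to be missing (MAR case)."""
    pixels = [(r, c) for r in range(height) for c in range(width)]
    n_missing = round(missing_fraction * len(pixels))
    return set(rng.sample(pixels, n_missing))


def patch_mask(height, width, patch, rng):
    """Pixels of one square patch at a random location (NMAR case, assumed placement)."""
    top = rng.randrange(height - patch + 1)
    left = rng.randrange(width - patch + 1)
    return {(top + r, left + c) for r in range(patch) for c in range(patch)}


def apply_mask(image, mask, fill=None):
    """Return a copy of a list-of-rows image with masked pixels replaced by `fill`."""
    return [
        [fill if (r, c) in mask else value for c, value in enumerate(row)]
        for r, row in enumerate(image)
    ]
```

For a 28 × 28 MNIST-sized image, `mar_mask(28, 28, 0.6, random.Random(0))` marks 470 pixels as missing and `patch_mask(28, 28, 5, random.Random(0))` marks 25.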

Figure 6. Two dimensional embedding of the MNIST dataset. Each colour corresponds to one of the digit classes.

6.5. Data Visualisation

Latent variable models such as DLGMs are often used for visualisation of high-dimensional data sets. We project the MNIST data set to a 2-dimensional latent space and use this 2-D embedding as a visualisation of the data. A 2-dimensional embedding of the MNIST data set is shown in figure 6. The classes separate into different regions, indicating that such a tool can be useful in gaining insight into the structure of high-dimensional data sets.

7. Discussion

Our algorithm generalises to a large class of models with continuous latent variables, which include Gaussian, non-negative or sparsity-promoting latent variables. For models with discrete latent variables (e.g., sigmoid belief networks), policy-gradient approaches that improve upon the REINFORCE approach remain the most general, but intelligent design is needed to control the gradient-variance in high-dimensional settings.

These models are typically used with a large number of latent variables. In this setting, and under the appropriate conditions, the required expectations for inference could be well approximated by Gaussian integrals. Thus, we believe that our approach is applicable even in the high-dimensional discrete setting: we can apply the gradient estimators derived in section 4 as an approximation in high-dimensional discrete latent variable models, and potentially obtain new learning rules for these models.

An additional feature of our approach is that it can easily be combined with convolutional architectures and computation using GPUs to allow for scaling to the large-data settings we are now routinely faced with. More investigation is required to fully explore the impact of the structure of the recognition model and to allow more accurate covariance estimation. This is particularly important in the high-dimensional setting where we lack intuition regarding the characteristics of the posterior distribution. This and other extensions form the basis of much future work.

8. Conclusion

We have developed a class of general-purpose and flexible generative models with Gaussian latent variables at each layer. Our approach introduces a recognition model, which can be seen as a stochastic encoding of the data, to allow for efficient and tractable inference. We derived a lower bound on the marginal likelihood for the generative model and specified the structure and regularisation of the recognition model by exploiting recent advances in deep learning. By developing modified rules for back-propagation through stochastic layers, we derived an efficient inference algorithm that allows for joint optimisation of all parameters, i.e. parameters of the generative and recognition models. We have demonstrated on several real-world data sets that the model generates realistic samples, provides accurate imputations of missing data and can be a useful tool for high-dimensional data visualisation.


References

Ahn, S., Balan, A. K., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.

Bartholomew, D. J. and Knott, M. Latent variable models and factor analysis, volume 7 of Kendall's library of statistics. Arnold, 2nd edition, 1999.

Beal, M. J. Variational Algorithms for approximate Bayesian inference. PhD thesis, University of Cambridge, 2003.

Bengio, Y., Yao, L., Alain, G., and Vincent, P. Generalized denoising auto-encoders as generative models. pp. 1–9, 2013.

Bonnet, G. Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. Annales des Télécommunications, 19(9-10):203–220, 1964.

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. The Helmholtz machine. Neural Computation, 7(5):889–904, September 1995.

Dayan, P. Helmholtz machines and wake-sleep learning. Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA, 44(0), 2000.

Frey, B. J. Variational inference for continuous sigmoidal Bayesian networks. 6th International Workshop on Artificial Intelligence and Statistics, 1996.

Frey, B. J. and Hinton, G. E. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193–213, January 1999.

Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pp. 2348–2356. 2011.

Gregor, K., Mnih, A., and Wierstra, D. Deep autoregressive networks. ArXiv preprint. arXiv:1310.8499, October 2013.

Hoffman, M., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. arXiv preprint arXiv:1206.7051, 2012.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 15:29–37, 2011.

Lawrence, N. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783–1816, 2005.

Little, R. J. and Rubin, D. B. Statistical analysis with missing data, volume 539. Wiley New York, 1987.

Magdon-Ismail, M. and Purnell, J. T. Approximating the covariance matrix of GMMs with low-rank perturbations. pp. 300–307, 2010.

Minka, T. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.

Neal, R. M. Probabilistic inference using Markov Chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.

Opper, M. and Archambeau, C. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–92, March 2009.

Price, R. A useful theorem for nonlinear devices having Gaussian inputs. IEEE Transactions on Information Theory, 4(2):69–72, 1958.

Rue, H., Martino, S., and Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.

Saul, L. K., Jaakkola, T., and Jordan, M. I. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research (JAIR), 4:61–76, 1996.

Uria, B., Murray, I., and Larochelle, H. A deep and tractable density estimator. ArXiv preprint. arXiv:1310.1757v1, October 2013.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

Zemel, R. S. A minimum description length framework for unsupervised learning. January 1993.


A. Additional Model Details

In equation (6) we showed an alternative form of the joint log likelihood that makes explicit that the generative model works by applying a highly non-linear transformation to a spherical Gaussian distribution $\mathcal{N}(\xi)$ such that the transformed distribution best matches the empirical distribution. We provide more details on this view here for clarity.

From the model description (3), (4), we can interpret the variables $h_l$ as deterministic functions of the noise variables $\xi_l$. This can be formally introduced as a coordinate transformation of the probability density in equation (5): we perform a change of coordinates $h_l \to \xi_l$. The density of the transformed variables $\xi$ can be expressed in terms of the density (5) times the determinant of the Jacobian of the transformation, $p(\xi) = p(h(\xi)) \left| \frac{\partial h}{\partial \xi} \right|$. Since the coordinate transformation is linear we have $\left| \frac{\partial h}{\partial \xi} \right| = |G_l|$ and the distribution of $\xi$ is:

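$$
\begin{aligned}
p(\xi) &= p(h_L)\,|G_L| \prod_{l=1}^{L-1} |G_l|\, p_l(h_l \mid h_{l+1}) = \prod_{l=1}^{L} |G_l|\,|S_l|^{-\frac{1}{2}}\, \mathcal{N}(\xi_l) \\
&= \prod_{l=1}^{L} |G_l|\,|G_l G_l^\top|^{-\frac{1}{2}}\, \mathcal{N}(\xi_l) = \prod_{l=1}^{L} \mathcal{N}(\xi_l),
\end{aligned}
\tag{25}
$$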

where $\mathcal{N}(\xi)$ is a Gaussian with mean zero and covariance equal to the identity matrix. Using this view, we rewrite the joint probability as in (6).
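The last equality in (25) rests on a short determinant identity. The step is spelled out below under the assumptions, suggested by (25) itself, that each $G_l$ is square and invertible, that $S_l = G_l G_l^\top$, and that $|\cdot|$ denotes the absolute value of the determinant:

$$
|G_l|\,|S_l|^{-\frac{1}{2}} = |G_l|\,|G_l G_l^\top|^{-\frac{1}{2}} = |G_l| \left( |G_l|\,|G_l^\top| \right)^{-\frac{1}{2}} = |G_l| \left( |G_l|^2 \right)^{-\frac{1}{2}} = 1 .
$$

Each factor $|G_l|\,|S_l|^{-\frac{1}{2}}$ therefore cancels, leaving only the product of standard Gaussians.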

A simple recognition model that can be used consists of a single deterministic layer and a stochastic Gaussian layer with the rank-one covariance structure, and is constructed as:

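$$
\begin{aligned}
q(\xi \mid v) &= \mathcal{N}\!\left(\xi \mid \mu, \left(\mathrm{diag}(d) + u u^\top\right)^{-1}\right) && (26) \\
\mu &= W_\mu z + b_\mu && (27) \\
d &= W_d z + b_d; \quad u = W_u z + b_u && (28) \\
z &= f(W_v v + b_v) && (29)
\end{aligned}
$$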

where the function $f$ is a rectified non-linearity (but other non-linearities such as tanh can be used).
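One practical consequence of the structure in (26) is that the covariance can be formed without a general matrix inverse. The sketch below uses the Sherman-Morrison identity, $(D + uu^\top)^{-1} = D^{-1} - \frac{D^{-1} u u^\top D^{-1}}{1 + u^\top D^{-1} u}$ with $D = \mathrm{diag}(d)$. It is an illustration rather than the authors' implementation: the function names are hypothetical, and it assumes every entry of $d$ is strictly positive, which (28) alone does not guarantee.

```python
def rank_one_covariance(d, u):
    """Covariance (diag(d) + u u^T)^{-1} via the Sherman-Morrison identity.

    Assumes every d[i] > 0 so that the precision matrix is positive definite.
    """
    n = len(d)
    w = [u[i] / d[i] for i in range(n)]                # D^{-1} u
    denom = 1.0 + sum(u[i] * w[i] for i in range(n))   # 1 + u^T D^{-1} u
    return [
        [(1.0 / d[i] if i == j else 0.0) - w[i] * w[j] / denom for j in range(n)]
        for i in range(n)
    ]


def rank_one_precision(d, u):
    """Precision matrix diag(d) + u u^T, built explicitly for checking."""
    n = len(d)
    return [
        [(d[i] if i == j else 0.0) + u[i] * u[j] for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two square lists-of-rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
```

Multiplying `rank_one_precision(d, u)` by `rank_one_covariance(d, u)` should give the identity matrix up to rounding, at a cost that is linear in the number of latent variables per covariance entry.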

B. Deriving Stochastic Back-propagation Rules

In section 3 we described two ways in which to derive stochastic back-propagation rules. We show specific examples and provide some more discussion in this section.

B.1. Using the Product Rule for Integrals

We can derive rules for stochastic back-propagation for many distributions by finding an appropriate non-linear function that allows us to express the gradient with respect to the parameters of the distribution as a gradient with respect to the random variable directly. The approach we described in the main text was:

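$$
\begin{aligned}
\nabla_\theta \mathbb{E}_p[f(x)] &= \int \nabla_\theta p(x \mid \theta) f(x)\, dx = \int \nabla_x p(x \mid \theta) B(x) f(x)\, dx \\
&= \left[ B(x) f(x) p(x \mid \theta) \right]_{\mathrm{supp}(x)} - \int p(x \mid \theta) \nabla_x [B(x) f(x)]\, dx \\
&= -\mathbb{E}_{p(x \mid \theta)}\left[ \nabla_x [B(x) f(x)] \right]
\end{aligned}
\tag{30}
$$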

where we have introduced the non-linear function $B(x)$ to allow the transformation of the gradients and have applied the product rule for integrals (the rule for integration by parts) to rewrite the integral in two parts in the second line. The subscript $\mathrm{supp}(x)$ indicates that the term is evaluated at the boundaries of the support. To use this approach, we require that the density we are analysing be zero at the boundaries of the support, which ensures that the first term in the second line is zero.

As an alternative, we can also write this differently and find a non-linear function of the form:

$$
\nabla_\theta \mathbb{E}_p[f(x)] = -\mathbb{E}_{p(x \mid \theta)}\left[ B(x) \nabla_x f(x) \right]. \tag{31}
$$

Consider general exponential family distributions of the form:

$$
p(x \mid \theta) = h(x) \exp\left( \eta(\theta)^\top \phi(x) - A(\theta) \right) \tag{32}
$$

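where $h(x)$ is the base measure, $\theta$ is the set of mean parameters of the distribution, $\eta$ is the set of natural parameters, $\phi(x)$ is the vector of sufficient statistics, and $A(\theta)$ is the log-partition function. We can express the non-linear function in (30) using these quantities as:

$$
B(x) = \frac{\nabla_\theta \eta(\theta)\, \phi(x) - \nabla_\theta A(\theta)}{\nabla_x \log[h(x)] + \eta(\theta)^\top \nabla_x \phi(x)}. \tag{33}
$$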

This can be derived for a number of distributions such as the Gaussian, inverse Gamma, Log-Normal, Wald (inverse Gaussian) and other distributions. We show some of these below:

The $B(x)$ corresponding to the second formulation can also be derived and may be useful in certain situations, requiring the solution of a first order differential equation. This approach of searching for non-linear transformations leads us to the second approach for deriving stochastic back-propagation rules.


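Family        θ                                   B(x)

Gaussian      $(\mu, \sigma^2)^\top$              $\left( -1, \;\; \dfrac{(x-\mu-\sigma)(x-\mu+\sigma)}{2\sigma^2 (x-\mu)} \right)^\top$

Inv. Gamma    $(\alpha, \beta)^\top$              $\left( \dfrac{x^2 \left( -\ln x - \Psi(\alpha) + \ln \beta \right)}{-x(\alpha+1) + \beta}, \;\; \left( \dfrac{x^2}{-x(\alpha+1) + \beta} \right) \left( -\dfrac{1}{x} + \dfrac{\alpha}{\beta} \right) \right)^\top$

Log-Normal    $(\mu, \sigma^2)^\top$              $\left( -1, \;\; \dfrac{(\ln x-\mu-\sigma)(\ln x-\mu+\sigma)}{2\sigma^2 (\ln x-\mu)} \right)^\top$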
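The Gaussian mean entry of the table, $B(x) = -1$, turns (30) into $\nabla_\mu \mathbb{E}[f(x)] = \mathbb{E}[\nabla_x f(x)]$. The sketch below checks this numerically for the test function $f(x) = \sin x$, for which $\mathbb{E}[\sin x] = \sin(\mu)\, e^{-\sigma^2/2}$ is available in closed form. The choice of test function, the function names and the sample size are assumptions made for illustration only.

```python
import math
import random


def pathwise_mean_gradient(grad_f, mu, sigma, n_samples, rng):
    """Monte Carlo estimate of d/dmu E[f(x)], x ~ N(mu, sigma^2), as the mean of f'(x)."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        total += grad_f(x)
    return total / n_samples


def exact_mean_gradient_sin(mu, sigma):
    """Closed-form d/dmu of E[sin x] = sin(mu) * exp(-sigma^2 / 2)."""
    return math.cos(mu) * math.exp(-0.5 * sigma ** 2)
```

Calling `pathwise_mean_gradient(math.cos, 0.3, 0.8, 200000, random.Random(0))` should agree with `exact_mean_gradient_sin(0.3, 0.8)` to within Monte Carlo error.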

B.2. Using Alternative Co-ordinate Transformations

There are many distributions outside the exponential family that we would like to consider using. A simpler approach is to search for a co-ordinate transformation that allows us to separate the deterministic and stochastic parts of the distribution. We described the case of the Gaussian in section 3. Other distributions also have this property. As an example, consider the Levy distribution (which is a special case of the inverse Gamma considered above). Due to the self-similarity property of this distribution, if we draw X from a Levy distribution with known parameters, X ∼ Levy(µ, c), we can obtain any other Levy distribution by rescaling and shifting this base distribution: kX + b ∼ Levy(kµ + b, kc).

Many other distributions have this property, allowing stochastic back-propagation rules to be determined for distributions such as the Student's t-distribution, the Logistic distribution, the class of stable distributions and the class of generalised extreme value distributions (GEV). Examples of co-ordinate transformations T(·) and the resulting distributions are shown below for variates X drawn from the standard distribution listed in the first column.

Std Distr.        T(·)                               Gen. Distr.

GEV(µ, σ, 0)      $mX + b$                           GEV(mµ + b, mσ, 0)
Exp(1)            $\mu + \beta \ln(1 + \exp(-X))$    Logistic(µ, β)
Exp(1)            $\lambda X^{\frac{1}{k}}$          Weibull(λ, k)
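As an illustration of the last row of the table, the sketch below draws $X \sim \mathrm{Exp}(1)$, maps it through $T(X) = \lambda X^{1/k}$, and differentiates through the transformation to estimate $\nabla_\lambda \mathbb{E}[f(Y)]$. For $f(y) = y$ the exact answer is $\Gamma(1 + 1/k)$, since a Weibull(λ, k) variate has mean $\lambda\, \Gamma(1 + 1/k)$. The function names and the test function are assumptions made for this example.

```python
import math
import random


def weibull_from_exponential(x, lam, k):
    """Map a standard exponential draw x to a Weibull(lam, k) draw."""
    return lam * x ** (1.0 / k)


def pathwise_scale_gradient(grad_f, lam, k, n_samples, rng):
    """Estimate d/dlam E[f(Y)] with Y = lam * X**(1/k), X ~ Exp(1).

    Uses the chain rule through the transformation: dY/dlam = X**(1/k).
    """
    total = 0.0
    for _ in range(n_samples):
        x = rng.expovariate(1.0)
        y = weibull_from_exponential(x, lam, k)
        total += grad_f(y) * x ** (1.0 / k)
    return total / n_samples
```

With `grad_f = lambda y: 1.0`, `lam = 3.0` and `k = 2.0`, the estimate should be close to `math.gamma(1.5)`.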

C. Univariate variance analysis

In analysing the variance properties of many estimators, we showed the general scaling of likelihood ratio approaches in section 5. As an example to further emphasise the high-variance nature of these alternative approaches, we present a short analysis in the univariate case.

Consider a random variable $\xi$ with density $p(\xi) = \mathcal{N}(\xi \mid \mu, \sigma^2)$ and a simple quadratic function of the form

$$
f(\xi) = c\,\frac{\xi^2}{2}. \tag{34}
$$

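For this function we immediately obtain the following variances:

$$
\begin{aligned}
\mathrm{Var}[\nabla_\xi f(\xi)] &= c^2 \sigma^2 && (35) \\
\mathrm{Var}[\nabla_\xi^2 f(\xi)] &= 0 && (36) \\
\mathrm{Var}\left[ \frac{(\xi - \mu)}{\sigma} \nabla_\xi f(\xi) \right] &= 2 c^2 \sigma^2 + \mu^2 c^2 && (37) \\
\mathrm{Var}\left[ \frac{(\xi - \mu)}{\sigma^2} \left( f(\xi) - \mathbb{E}[f(\xi)] \right) \right] &= 2 c^2 \mu^2 + \frac{5}{2} c^2 \sigma^2 && (38)
\end{aligned}
$$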

Equations (35), (36) and (37) correspond to the variance of the estimators based on (7), (8) and (11) respectively, whereas equation (38) corresponds to the variance of the REINFORCE algorithm for the gradient with respect to µ.

From these relations we see that, for any parameter configuration, the variance of the REINFORCE estimator is strictly larger than the variance of the estimator based on (7). Additionally, the ratio between the variances of the former and latter estimators is lower-bounded by 5/2. We can also see that the variance of the estimator based on (8) is zero for this specific function whereas the variance of the estimator based on (11) is not.
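The closed forms (35), (37) and (38) can be checked by simulation. The sketch below draws samples of $\xi$, evaluates the three single-sample quantities, and compares their empirical variances with the formulas. The function names, the parameter values in the usage note and the sample size are illustrative assumptions.

```python
import random


def sample_variance(values):
    """Unbiased sample variance of a list of floats."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def estimator_variances(mu, sigma, c, n_samples, rng):
    """Empirical variances of the quantities in (35), (37) and (38) for f = c xi^2 / 2."""
    expected_f = 0.5 * c * (mu ** 2 + sigma ** 2)  # E[c xi^2 / 2] under N(mu, sigma^2)
    grad, scaled_grad, reinforce = [], [], []
    for _ in range(n_samples):
        xi = rng.gauss(mu, sigma)
        f = 0.5 * c * xi ** 2
        df = c * xi
        grad.append(df)                                              # (35)
        scaled_grad.append((xi - mu) / sigma * df)                   # (37)
        reinforce.append((xi - mu) / sigma ** 2 * (f - expected_f))  # (38)
    return (sample_variance(grad), sample_variance(scaled_grad),
            sample_variance(reinforce))


def closed_form_variances(mu, sigma, c):
    """Right-hand sides of (35), (37) and (38)."""
    return (c ** 2 * sigma ** 2,
            2 * c ** 2 * sigma ** 2 + mu ** 2 * c ** 2,
            2 * c ** 2 * mu ** 2 + 2.5 * c ** 2 * sigma ** 2)
```

For µ = 1, σ = 0.5 and c = 2 the closed forms evaluate to 1, 6 and 10.5, so the REINFORCE variance exceeds that of the gradient-based estimator by a factor of $5/2 + 2\mu^2/\sigma^2 = 10.5$.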

The high-variance nature of the policy gradient approaches is undesirable and is magnified in multivariate settings, leading to the variance estimates that scale with the dimensionality of the latent variables that were discussed in section 5.

D. Estimating the Marginal Likelihood

We compute the marginal likelihood by importance sampling, generating S samples from the recognition model and using the following estimator:

$$
p(v) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{p(v \mid h(\xi^{(s)}))\, p(\xi^{(s)})}{q(\xi^{(s)} \mid v)}; \qquad \xi^{(s)} \sim q(\xi \mid v) \tag{39}
$$
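The estimator (39) can be sketched on a toy model where the exact answer is known. The example below assumes a one-dimensional linear-Gaussian model, $\xi \sim \mathcal{N}(0, 1)$ and $v \mid \xi \sim \mathcal{N}(w\xi, s^2)$, for which $p(v) = \mathcal{N}(v \mid 0, w^2 + s^2)$, and a Gaussian proposal standing in for the recognition model. The toy model, the function names and the log-sum-exp accumulation are assumptions of this illustration, not details from the paper.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_marginal_estimate(v, w, s, q_mean, q_std, n_samples, rng):
    """Importance-sampling estimate of log p(v) with proposal q = N(q_mean, q_std^2)."""
    log_weights = []
    for _ in range(n_samples):
        xi = rng.gauss(q_mean, q_std)
        log_weights.append(
            log_normal_pdf(v, w * xi, s)           # log p(v | xi)
            + log_normal_pdf(xi, 0.0, 1.0)         # log p(xi)
            - log_normal_pdf(xi, q_mean, q_std)    # log q(xi | v)
        )
    # Average the weights in log space for numerical stability.
    m = max(log_weights)
    return m + math.log(sum(math.exp(lw - m) for lw in log_weights) / n_samples)


def exact_log_marginal(v, w, s):
    """Exact log p(v) for the toy linear-Gaussian model."""
    return log_normal_pdf(v, 0.0, math.sqrt(w * w + s * s))
```

The closer the proposal is to the true posterior, the lower the variance of the estimate; with the exact posterior as proposal every weight equals $p(v)$.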

E. Missing Data Imputation

Image completion can be approximately achieved by a simple iterative procedure which consists of (i) initialising the non-observed pixels with random values; (ii) sampling from the recognition distribution given the resulting image; (iii) reconstructing the image given the sample from the recognition model; (iv) iterating the procedure.

We denote the missing and observed entries in an observation as $v_m$ and $v_o$, respectively. The imputation procedure can be written formally as a Markov chain with transition kernel $T^q(v_m \to v'_m)$ given by

$$
T^q(v_m \to v'_m) = \int p(v' \mid \xi)\, q(\xi \mid v)\, d\xi, \tag{40}
$$

where $v = (v_m, v_o)$ and $v' = (v'_m, v_o)$.

Provided that the recognition model $q(\xi \mid v)$ constitutes a good approximation of the true posterior $p(\xi \mid v)$, (40) can be seen as an approximation of the kernel

$$
T(v_m \to v'_m) = \int p(v' \mid \xi)\, p(\xi \mid v)\, d\xi. \tag{41}
$$

The kernel (41) has two important properties: (i) it has as eigen-distribution the model's marginal $p(v_m \mid v_o)$; (ii) $T(v \to v') > 0 \;\; \forall v, v'$. Property (i) can be derived by applying the kernel (41) to the marginal $p(v_m \mid v_o)$ and noting that it is a fixed point. Property (ii) is an immediate consequence of the smoothness of the model.

We apply the fundamental theorem for Markov chains (Neal, 1993, p. 38) and conclude that, given the above properties, a Markov chain generated by (41) is guaranteed to generate samples from the correct marginal $p(v_m \mid v_o)$.
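The chain can be illustrated on a toy model in which the exact posterior is available, so that the kernel used is (41) rather than its approximation (40). The sketch assumes a scalar latent $\xi \sim \mathcal{N}(0, 1)$ and two scalar observations, $v_o \mid \xi \sim \mathcal{N}(w_o \xi, s^2)$ and $v_m \mid \xi \sim \mathcal{N}(w_m \xi, s^2)$. The model, the names and the parameter values are assumptions chosen so that the target conditional mean $\mathbb{E}[v_m \mid v_o] = \frac{w_m w_o}{w_o^2 + s^2} v_o$ is known.

```python
import random


def impute_chain(v_obs, w_obs, w_mis, s, n_steps, rng):
    """Run the imputation chain and return the sequence of imputed values of v_m."""
    precision = 1.0 + (w_obs ** 2 + w_mis ** 2) / s ** 2   # posterior precision of xi
    post_std = precision ** -0.5
    v_mis = rng.gauss(0.0, 1.0)                            # (i) random initialisation
    samples = []
    for _ in range(n_steps):
        post_mean = (w_obs * v_obs + w_mis * v_mis) / s ** 2 / precision
        xi = rng.gauss(post_mean, post_std)                # (ii) sample xi given v
        v_mis = rng.gauss(w_mis * xi, s)                   # (iii) resample missing entry
        samples.append(v_mis)                              # (iv) iterate
    return samples


def exact_conditional_mean(v_obs, w_obs, w_mis, s):
    """E[v_m | v_o] for the toy linear-Gaussian model."""
    return w_mis * w_obs / (w_obs ** 2 + s ** 2) * v_obs
```

With `v_obs = 1.0`, `w_obs = w_mis = 1.0` and `s = 0.5`, the long-run average of the imputed values should approach `exact_conditional_mean(1.0, 1.0, 1.0, 0.5)`, which is 0.8. The observed entry is never resampled, matching $v' = (v'_m, v_o)$.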

In practice, the stationary distribution of the completed pixels will not be exactly the model's marginal $p(v_m \mid v_o)$, since we can only use the approximated kernel (40). We can nevertheless provide a bound on the $L_1$ norm of the difference between the resulting stationary marginal and the target marginal $p(v_m \mid v_o)$ by the following proposition.

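Proposition E.1 ($L_1$ bound on marginal error). If the recognition model $q(\xi \mid v)$ is such that for all $\xi$

$$
\exists\, \varepsilon > 0 \;\; \text{s.t.} \;\; \int \left| \frac{q(\xi \mid v)\, p(v)}{p(\xi)} - p(v \mid \xi) \right| dv \le \varepsilon \tag{42}
$$

then the marginal $p(v)$ is a weak fixed point of the kernel (40) in the following sense:

$$
\int \left| \int \left[ T^q(v_m \to v'_m) - T(v \to v') \right] p(v)\, dv \right| dv' < \varepsilon. \tag{43}
$$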

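Proof.

$$
\begin{aligned}
& \int \left| \int \left[ T^q(v_m \to v'_m) - T(v \to v') \right] p(v)\, dv \right| dv' \\
&= \int \left| \int p(v' \mid \xi)\, p(v) \left[ q(\xi \mid v) - p(\xi \mid v) \right] dv\, d\xi \right| dv' \\
&= \int \left| \int p(v' \mid \xi)\, p(v) \left[ q(\xi \mid v) - p(\xi \mid v) \right] \frac{p(v)}{p(\xi)} \frac{p(\xi)}{p(v)}\, dv\, d\xi \right| dv' \\
&= \int \left| \int p(v' \mid \xi)\, p(\xi) \left[ \frac{q(\xi \mid v)\, p(v)}{p(\xi)} - p(v \mid \xi) \right] dv\, d\xi \right| dv' \\
&\le \int \int p(v' \mid \xi)\, p(\xi) \int \left| \frac{q(\xi \mid v)\, p(v)}{p(\xi)} - p(v \mid \xi) \right| dv\, d\xi\, dv' \\
&\le \varepsilon,
\end{aligned}
$$

where we apply the condition (42) to obtain the last statement. That is, if the recognition model is sufficiently close to the true posterior to guarantee that (42) holds for some acceptable error $\varepsilon$, then (43) guarantees that the fixed point of the Markov chain induced by (40) is no further than $\varepsilon$ away from the true marginal with respect to the $L_1$ norm.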

F. Variational Bayes for Deep Directed Models

In the main text we focussed on the variational problem of specifying an approximate posterior on the latent variables only. It is natural to consider the variational Bayes problem in which we specify an approximate posterior for both the latent variables and model parameters.

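Following the same construction and considering a Gaussian approximate distribution on the model parameters $\theta^g$, the free energy becomes:

$$
\begin{aligned}
\mathcal{F}(V) = {} & - \sum_n \underbrace{\mathbb{E}_q\left[ \log p(v_n \mid h(\xi_n)) \right]}_{\text{reconstruction error}} \\
& + \frac{1}{2} \sum_{n,l} \underbrace{\left[ \|\mu_{n,l}\|^2 + \mathrm{Tr}\, C_{n,l} - \log |C_{n,l}| - 1 \right]}_{\text{latent regularization term}} \\
& + \frac{1}{2} \sum_j \underbrace{\left[ \frac{m_j^2}{\kappa} + \frac{\tau_j}{\kappa} + \log \kappa - \log \tau_j - 1 \right]}_{\text{parameter regularization term}},
\end{aligned}
\tag{44}
$$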

which now includes an additional term for the cost of using parameters and their regularisation. We must now compute the additional set of gradients with respect to the parameter's mean $m_j$ and variance $\tau_j$:

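$$
\begin{aligned}
\nabla_{m_j} \mathcal{F}(v) &= -\mathbb{E}_q\left[ \nabla_{\theta^g_j} \log p(v \mid h(\xi)) \right] + m_j && (45) \\
\nabla_{\tau_j} \mathcal{F}(v) &= -\frac{1}{2} \mathbb{E}_q\left[ \frac{\theta_j - m_j}{\tau_j} \nabla_{\theta^g_j} \log p(v \mid h(\xi)) \right] + \frac{1}{2\kappa} - \frac{1}{2\tau_j} && (46)
\end{aligned}
$$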
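The contribution of the parameter regularisation term of (44) to these gradients can be checked directly. The sketch below evaluates that term for a single parameter and returns its derivatives with respect to $m_j$ and $\tau_j$. Differentiating the term as written in (44) gives $m_j/\kappa$ and $\frac{1}{2\kappa} - \frac{1}{2\tau_j}$; the first of these matches the $m_j$ term of (45) when $\kappa = 1$, which is an observation of this example rather than a statement from the text. The function names are hypothetical.

```python
import math


def parameter_regulariser(m, tau, kappa):
    """Parameter regularisation term of (44) for a single parameter.

    This is the KL divergence from N(m, tau) to the prior N(0, kappa).
    """
    return 0.5 * (m * m / kappa + tau / kappa + math.log(kappa) - math.log(tau) - 1.0)


def regulariser_gradients(m, tau, kappa):
    """Analytic derivatives of the regulariser with respect to m and tau."""
    grad_m = m / kappa
    grad_tau = 0.5 / kappa - 0.5 / tau
    return grad_m, grad_tau
```

Both derivatives vanish at $m_j = 0$, $\tau_j = \kappa$, where the approximate posterior equals the prior and the regulariser itself is zero.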

G. Additional Related Work

There are a number of other aspects of related work which, due to space reasons, were not included in the main text but are worth noting.

General Classes of Latent Gaussian Models. The model class we describe here builds upon other widely-used models with latent Gaussian distributions and provides an inference mechanism for use within this diverse model class. Recognising the latent Gaussian structure shows the connection to models such as generalised linear regression, Gaussian process regression, factor analysis, probabilistic principal components analysis, stochastic volatility models, and other latent Gaussian graphical models (such as log-Gaussian Cox processes and models for covariance selection).

Alternative Inference Approaches. The most common approach for inference in deep directed models has been variational EM. The alternating minimisation often suffers from slow convergence and expensive parameter learning steps, and typically requires the specification of additional local bounds to allow for efficient computation of the required expectations. The wake-sleep algorithm (Dayan, 2000) is a further alternative, but fails to optimise a single consistent objective function and has had limited success in the past. For latent Gaussian models, other alternative inference algorithms have been proposed: the Integrated Nested Laplace Approximation (INLA) has been popular, but it is limited to models with very few hyperparameters (Rue et al., 2009). Alternative variational algorithms such as expectation propagation (EP) (Minka, 2001) can also be used, but are numerically difficult to implement and have performance similar to approaches based on the variational lower bound.
