abstract arxiv:1602.06725v2 [cs.lg] 1 jun 2016 · recently,burda et al.(2016) have derived a...

Variational Inference for Monte Carlo Objectives

Andriy Mnih [email protected] J. Rezende [email protected]

Google DeepMind

AbstractRecent progress in deep latent variable modelshas largely been driven by the development offlexible and scalable variational inference meth-ods. Variational training of this type involvesmaximizing a lower bound on the log-likelihood,using samples from the variational posterior tocompute the required gradients. Recently, Burdaet al. (2016) have derived a tighter lower boundusing a multi-sample importance sampling esti-mate of the likelihood and showed that optimiz-ing it yields models that use more of their ca-pacity and achieve higher likelihoods. This de-velopment showed the importance of such multi-sample objectives and explained the success ofseveral related approaches.

We extend the multi-sample approach to dis-crete latent variables and analyze the difficultyencountered when estimating the gradients in-volved. We then develop the first unbiased gra-dient estimator designed for importance-sampledobjectives and evaluate it at training generativeand structured output prediction models. The re-sulting estimator, which is based on low-varianceper-sample learning signals, is both simpler andmore effective than the NVIL estimator (Mnih& Gregor, 2014) proposed for the single-samplevariational objective, and is competitive with thecurrently used biased estimators.

1. IntroductionDirected latent variable models parameterized using neu-ral networks have recently enjoyed a surge in popularitydue to the recent advances in variational inference meth-ods that made it possible to train such models efficiently.These methods (Kingma & Welling, 2014; Rezende et al.,

Proceedings of the 33 rd International Conference on MachineLearning, New York, NY, USA, 2016. JMLR: W&CP volume48. Copyright 2016 by the author(s).

2014; Mnih & Gregor, 2014) approximate the intractableposterior of the model with a variational posterior parame-terized using a neural network and maximize a lower boundon the intractable marginal log-likelihood, estimating therequired gradients using samples from the variational pos-terior. This approach implements an efficient feedforwardapproximation to the expensive iterative process requiredby traditional variational inference methods for each datapoint.

One important weakness of variational methods is thattraining a powerful model using an insufficiently expres-sive variational posterior can cause the model to use onlya small fraction of its capacity. The most direct route toaddressing this issue is to develop more expressive but stilltractable variational posteriors as was done in (Salimanset al., 2015; Rezende & Mohamed, 2015; Gregor et al.,2015).

However, the crippling effect of an excessively simple pos-terior on the model can alternatively be seen as a conse-quence of the form of the lower bound optimized by thevariational methods (Burda et al., 2016). As the bound isbased on a single-sample estimate of the marginal likeli-hood of the observation, it heavily penalizes samples thatexplain the observation poorly and thus produce low esti-mates of the likelihood. As result, the variational posteriorlearns to cover only the high-probability areas of the trueposterior, which in turn assumes a simpler shape which iseasier to approximate by the variational posterior. A sim-ple way to minimize this effect is to average over multiplesamples when computing the marginal likelihood estimate.The resulting lower bound on the log-likelihood gets tighteras the number of samples increases (Burda et al., 2016),converging to the true value in the limit of infinitely manysamples. We will refer to such objectives derived from like-lihood estimates computed by averaging over independentsamples as Monte Carlo objectives. When using an objec-tive that averages over multiple samples, the distributionfor generating samples no longer explicitly represents thevariational posterior and instead is thought of as a proposaldistribution due to connections to importance sampling.

Multi-sample objectives of this type have been used for

arX

iv:1

602.

0672

5v2

[cs

.LG

] 1

Jun

201

6


generative modelling (Bornschein & Bengio, 2015; Burdaet al., 2016), structured output prediction (Raiko et al.,2015), and models with hard attention (Ba et al., 2015).As a multi-sample objective is a better proxy for the log-likelihood than a single-sample one, models trained usingmulti-sample objectives are likely to achieve better log-likelihoods. This has been empirically demonstrated in thecontext of generative models by Burda et al. (2016) andBornschein & Bengio (2015), who also showed that usingmore samples in the objective increased the number of la-tent variables used in the deeper layers.

Unfortunately, unless all the latent variables in the modelare continuous, learning the proposal distribution with amulti-sample objective is difficult as the gradient estima-tor obtained by differentiating the objective has very highvariance. As a result, with the exception of Burda et al.(2016), who used an alternative estimator available for con-tinuous latent variables, none of the above methods updatethe parameters of the proposal distribution by following thegradient of the multi-sample objective. Thus, updates forthe proposal distribution and the model parameters in thesemethods are not optimizing the same objective function,which can lead to suboptimal performance and even pre-vent convergence.

In this paper we develop a new unbiased gradient estimatorfor multi-sample objectives that replaces the single learn-ing signal of the naive estimator with much lower varianceper-sample learning signals. Unlike the NVIL estimator(Mnih & Gregor, 2014) designed for single-sample varia-tional objectives, our estimator does not require learningany additional parameters for variance reduction. We ex-pect that the availability of an effective unbiased gradientestimator will make it easier to integrate models with dis-crete latent variables into larger systems that can be trainedend-to-end.

2. Multi-sample stochastic lower bounds2.1. Estimating the likelihood

Suppose we would like to fit an intractable latent vari-able model P (x, h) to data. As the intractability of infer-ence rules out using maximum likelihood estimation, wewill proceed by maximizing a lower bound on the log-likelihood. One general way to derive such a lower boundis to start with an unbiased estimator I of the marginal like-lihood P (x) and then transform it. We will consider es-timators of the form I(h1:K) = 1

K

∑Ki=1 f(x, hi) where

h1, ..., hK are independent samples from some distributionQ(h|x) which can potentially depend on the observation x.Before showing how to transform such an estimator into abound, let us consider some possible choices for the likeli-hood estimator.

Perhaps the simplest estimator of this form can be con-structed by sampling hi’s from the prior P (h) and aver-aging the resulting conditional likelihoods:

I(h1:K) =1

K

∑K

i=1P (x|hi) with hi ∼ P (h). (1)

While this estimator is unbiased, it can have very high vari-ance in models where most latent configurations do not ex-plain a given observation well. For such models, the es-timator will greatly underestimate the likelihood for mostsets of K independent samples and substantially overes-timate it for a small number of such sets. This is a conse-quence of not taking into account the observation we wouldlike the latent variables to explain when sampling them.

We can incorporate the information about the observationwe are estimating the likelihood for by sampling the latentsfrom a proposal distribution Q(h|x) conditional on the ob-servation x and using importance sampling:

I(h1:K) =1

K

∑K

i=1

P (x, hi)

Q(hi|x)(2)

with h1:K ∼ Q(h1:K |x) ≡∏Ki=1Q(hi|x). In addition

to also being unbiased, the variance of this estimator canbe much lower than that of the preceding one because itcan assign high probability to the latent configurations withhigh joint probability with the given observation. In fact,if we were able to use the true posterior as the proposaldistribution, the estimator would have zero variance. Whilethis is infeasible for the models we are considering, this factsuggests that making the proposal distribution close to theposterior is a sensible strategy.

2.2. Lower-bounding the log-likelihood

Having chosen an estimator I for the likelihood, we can ob-tain an estimator L of a lower bound on the log-likelihoodsimply by taking the logarithm of I . We can justify this byapplying Jensen’s inequality:

EQ(h1:K |x)

[log I(h1:K)

]≤ logEQ(h1:K |x)

[I(h1:K)

]= logP (x),

where the equality follows from the fact that since I isunbiased, EQ(h1:K |x)[I(h1:K)] = P (x). Therefore, wecan think of L(h1:K) = log I(h1:K) as a stochastic lowerbound on the log-likelihood (Burda et al., 2016).

We note that this approach is not specific to the to estima-tors from Section 2.1 and can be used with any unbiasedlikelihood estimator based on random sampling. Thus itmight be possible to obtain better lower bounds by usingmethods from the importance sampling literature such ascontrol variates and adaptive importance sampling.


Despite the potential pitfalls described above, estimatorsinvolving sampling from the prior have been used success-fully for training models for structured output prediction(Tang & Salakhutdinov, 2013; Dauphin & Grangier, 2016)and models with hard attention (Mnih et al., 2014; Zaremba& Sutskever, 2015).

The multi-sample (K > 1) version of the above estima-tor has recently been used for variational training of latentvariable models (Burda et al., 2016; Bornschein & Ben-gio, 2015) as well as models with hard attention (Ba et al.,2015). The single-sample version of the estimator yieldsthe classical variational lower bound (Jordan et al., 1999)

L(x) = EQ(h|x)

[log

P (x, h)

Q(h|x)

], (3)

which is used as the objective in much of the recent workon training generative models (Kingma & Welling, 2014;Rezende et al., 2014; Mnih & Gregor, 2014).

The advantage of using a multi-sample stochastic lowerbound is that increasing the number of samples K is guar-anteed to make the bound tighter (Burda et al., 2016), thusmaking it a better proxy for the log-likelihood. Intuitively,averaging over the K samples inside the log, removes theburden of every sample having to explain the observationwell, which leads to the proposal distribution being con-siderably less concentrated than the variational posterior,which is its single-sample counterpart. Training models byoptimizing a multi-sample objective can be seen as a gen-eralization of variational training that does not explicitlyrepresent the variational posterior.

2.3. Objective

Thus we will be interested in training models by maximiz-ing objectives of the form

LK(x) =EQ(h1:K |x)

[log

1

K

∑K

i=1f(x, hi)

], (4)

which can be seen as lower bounds on the log-likelihood.This class of objectives is a rich one, including the onesused in variational inference, generative modelling, struc-tured prediction, and hard attention.

2.4. Gradient analysis

In this section we will analyze the gradient of the objectivew.r.t. the parameters of the model and the proposal distri-bution and explain why developing an effective unbiasedestimator for the gradient is difficult in general. In the spe-cial case of continuous latent variables an alternative ap-proach to gradient estimation based on reparameterization(Kingma & Welling, 2014; Burda et al., 2016) is likely to bepreferable to the more general approach we follow in thispaper, which is applicable to all types of latent variables.

As shown in the supplementary material, differentiatingLK(x) w.r.t. the parameters θ of Q and f gives

∇θLK(x) =EQ(h1:K |x)

[∑jL(h1:K)∇θ logQ(hj |x)

]+

EQ(h1:K |x)

[∑jwj∇θ log f(x, hj)

], (5)

where wj ≡ f(x,hj)∑Ki=1 f(x,h

i).

As our objective LK(x) is an expectation of the stochas-tic lower bound L(h1:K) w.r.t. to the proposal distribu-tion, it can depend on any given parameter through the pro-posal distribution, through the value of the stochastic lowerbound as a function of a set of K samples, or both. Intu-itively, the first and the second terms in Eq. 5 capture theeffect of θ on LK(x) though its effect on the proposal dis-tribution and the value of the stochastic lower bound as afunction of a set of samples respectively.

Let us inspect these two terms, both of which are linearcombinations of the gradients corresponding to theK sam-ples. The second term is well-behaved and is easy to esti-mate because weights {wj} are non-negative and sum to 1,ensuring that the norm of the linear combination of the gra-dients is at most as large as the norm of the largest of theKgradients. In mixture modelling terms, we can think of wj

as the responsibility of sample j for the observation x —a measure of how well sample j explains the observationcompared to the other K − 1 samples.

The first term however is considerably more problematicfor two reasons. First, the gradients for all K samplesare multiplied by the same scalar L(h1:K), which can bethought of as the learning signal for the proposal distri-bution (Mnih & Gregor, 2014). As a result, the gradientfor a sample that explains the observation well is not givenany more weight than the gradient for a sample in the sameset of K that explains the observation poorly. This meansthat the first term does not implement credit assignmentwithin each set ofK samples, unlike the second term whichachieves that by weighting the gradients using the respon-sibilities. Thus the learning signal for each sample hi willhave high variance, making learning slow.1

Another important source of variance when estimating thefirst term is the magnitude of the learning signal. Unlike theresponsibilities used in the second term, which are between0 and 1, the learning signal can have potentially unboundedmagnitude, which means that the norm of the first term canbecome much larger than the norm of any of the individualsample gradients. This issue can be especially pronouncedearly in training, when all samples from the proposal Q ex-plain the data poorly, resulting in a very small I(h1:K) and

1 Despite not performing credit assignment within sets of Ksamples, the first term does perform correct credit assignment inexpectation over such sets.


thus a very negative learning signal. Thus unless specialmeasures are taken, the first term in the gradient will over-whelm the second term and make the overall estimate verynoisy.

2.5. Gradient estimation

The difficulties described in the previous section affectonly the gradient for the parameters the sampling distri-bution depends on. For all other parameters ψ the firstterm is identically zero, which leaves only the second, well-behaved term. As a result, the following naive Monte Carloestimator based on a single set of K samples works well:

∇ψLK(x) '∑j w

j∇ψ log f(x, hj), (6)

where hi ∼ Q(h|x). While it is possible to reduce thevariance of the estimator by averaging over multiple setsof samples, in this paper we follow the common practice ofusing a single set and relying on averaging over the trainingcases in a minibatch to reduce the variance to a reasonablelevel instead. We will now turn out attention to the morechallenging problem of estimating gradients for parametersthat affect the proposal distribution.

2.5.1. NAIVE

We will start with the simplest estimator, also based onnaive Monte Carlo:

∇θLK(x) '∑j L(h1:K)∇θ logQ(hj |x)

+∑j w

j∇θ log f(x, hj), (7)

with hi ∼ Q(h|x). This estimator does not attempt to elim-inate either of the two sources of variance described in Sec-tion 2.4 and we include it here for completeness only.

2.5.2. WITH BASELINES (NVIL)

One simple way to reduce the variance due the large mag-nitude of the learning signal is to reduce its magnitude bysubtracting a quantity, called a baseline, correlated with thelearning signal but not dependent on the latent variables.This transformation of the learning signal leaves the gra-dient estimator unbiased because it amounts to subtractinga term which has the expectation of 0 under the proposaldistribution. In our use of baselines, we will follow theNeural Variational Inference and Learning (NVIL, Mnih& Gregor, 2014) method for training generative models,which is based on optimizing the classical variational lowerbound (Eq. 3). The main idea behind the NVIL estima-tor is to reduce the magnitude of the learning signal forthe parameters of the variational distribution (which is thesingle-sample counterpart of our proposal distribution) bysubtracting two baselines from it: a constant baseline b andan input-dependent one b(x).

The following estimator is a straightforward adaptation ofthe same idea to multi-sample objectives:

∇θLK(x) '∑j(L(h1:K)− b(x)− b)∇θ logQ(hj |x)

+∑j w

j∇θ log f(x, hj), (8)

with hi ∼ Q(h|x). The constant baseline b tracks themean of the learning signal, while the input-dependent oneis fit to minimize the squared residual of the learning signalL(h1:K)− b(x)− b, with the goal of capturing the effect ofthe observation on the magnitude of the learning signal. Weimplement the input dependent baseline using a one-hiddenlayer neural network.

While introducing baselines can addresses the estimatorvariance due to the large magnitude of the learning signal,it has no effect on the variance resulting from having thesame learning signal for all samples in a set of K.

2.5.3. PER-SAMPLE LEARNING SIGNALS

We can reduce the effect of the second source of varianceby defining a different local learning signal for each samplein a way that minimizes its dependence on the other sam-ples in the set. This can be accomplished by using a sepa-rate baseline for each sample that depends on the value ofall other samples and eliminates much of the variance dueto them. We will now show that this approach does not biasthe resulting estimator.

Let h−j denote the set of K − 1 samples obtained by leav-ing out sample j from the original set. Since the samples ina set are independent, evaluating the expectations with re-spect to them in any order produces the same result. Thusthe contribution of sample j to the first term in Eq. 5 can beexpressed as

EQ(h1:K |x)

[L(h1:K)∇θ logQ(hj |x)

]=

EQ(h−j |x)

[EQ(hj |x)

[L(h1:K)∇θ logQ(hj |x)

∣∣∣h−j]] .Since in the inner-most expectation all samples except forhj are conditioned on, adding any function of them to thelearning signal for hj has no effect on the value of the ex-pectation. Thus we can define a baseline that depends onh−j in addition to x. We would like this baseline to be asclose to L(h1:K) as possible without using the value of hj .

Inspecting the global learning signal L(h1:K) (Eq. 4) sug-gests that we can obtain an effective baseline for the learn-ing signal for sample j by replacing f(x, hj) in it by somequantity close to it but independent of hj . We could, for ex-ample, use some mapping f(x) trained to predict f(x, hi)from the observation x. This gives rise to the followinglocal learning signal for sample j:

L(hj |h−j) = L(h1:K)− log1

K

(∑i 6=j

f(x, hi) + f(x)).


For K = 1, this estimator becomes essentially equivalentto the NVIL estimator, with log f(x) corresponding to theinput-dependent baseline b(x).

We can avoid having to learn an additional mappingby taking advantage of the fact that we have morethan one sample in a set. Since the samples in aset are IID, so are the corresponding values f(x, hi),which means that we can get a reasonable estimatef(x, h−j) by combining the f(x, hi) values for allthe other samples in the set using averaging of somesort. We experimented with using the arithmetic mean(f(x, h−j) = 1

K−1∑i 6=j f(x, hi)) and the geometric

mean (f(x, h−j) = exp(

1K−1

∑i 6=j log f(x, hi)

)) and

found that the geometric mean worked slightly better. Theresulting local learning signals can be written as

L(hj |h−j) = (9)

L(h1:K)− log1

K

(∑i 6=j

f(x, hi) + f(x, h−j)).

This approach to variance reduction, unlike the one aboveor NVIL, does not require learning any additional param-eters for performing variance reduction. Moreover, as thetotal cost of computing the per-sample learning signals inEq. 9 is of the same order as that of computing of the globallearning signal, this approach allows us to implement effec-tive variance reduction in the multi-sample case essentiallyat no cost. This approach relies on having more than onesample for the same observation, however, and so is notapplicable in the single-sample setting.

The final estimator has the form

∇θLK(x) '∑

jL(hj |h−j)∇θ logQ(hj |x)

+∑

jwj∇θ log f(x, hj). (10)

We will refer to this estimator as the VIMCO (VariationalInference for Monte Carlo Objectives) estimator. The pseu-docode for computing it is provided in the supplementarymaterial. This estimator is a black-box one, in the sensethat it can be easily applied to any model for which wecan compute the complete log-likelihood logP (x, h) andits parameter gradients exactly. As such, it can be seen asas an alternative to Black Box Variational Inference (Ran-ganath et al., 2014) and NVIL, specialized for multi-sampleobjectives.

3. Structured output predictionStructured output prediction (SOP) is a type of supervisedlearning with high-dimensional outputs with rich structuresuch as images or text. The particular emphasis of SOP ison capturing the dependencies between the output variablesin addition to capturing their dependence on the inputs.

Here we will take the approach of viewing SOP as con-ditional probabilistic modelling with latent variables (Tang& Salakhutdinov, 2013; Sohn et al., 2015).

To stay consistent with the terminology for generative mod-els we used so far, we will refer to inputs as contextsand to outputs as observations Thus, given a set of con-text/observation pairs (c, x), we would like to fit a latentvariable model P (x, h|c) to capture the dependencies be-tween the contexts and the observations, as well as thosebetween the observed dimensions. Typically such a modelfactorizes as P (x, h|c) = P (x|h, c)P (h|c), with both theconditional likelihood and the prior terms being conditionalon the context. Thus, this is essentially the same settingas for generative modelling, with the only difference beingthat every distribution now also conditions on the context c,which makes it straightforward to apply the estimators wepresented.

However, historically such models have been trained us-ing samples from the prior P (h|c), with the gradients com-puted using either importance sampling (Tang & Salakhut-dinov, 2013) or heuristic rules for backpropagating throughbinary units (Raiko et al., 2015). Since using the prior asthe proposal distribution does not allow it to use the infor-mation about the observation, such methods tend to requirea large number of samples to perform well. Though vari-ational training has been applied recently to SOP modelswith continuous latent variables (Sohn et al., 2015), we arenot aware of any work that uses a learned proposal distri-bution conditional on the observations to train SOP modelswith multi-sample objectives. We will explore the effec-tiveness of using this approach in Section 5.2.

4. Related workMulti-sample objectives: The idea of using a multi-sample objective for latent variable models was proposedby Raiko et al. (2015), who thought of it not as a lowerbound on the log-likelihood but an objective in its ownright. They evaluated several gradient estimators at op-timizing it for training structured prediction models andshowed that a simple biased estimator emulating backprop-agation performed best. Tang & Salakhutdinov (2013) pro-posed an estimator based on importance sampling for anEM-like bound on the log-likelihood using samples fromthe prior. This is also a biased estimator as it relies on self-normalized importance sampling to approximate the poste-rior using a set of weighted samples. Burda et al. (2016)pointed out that the multi-sample objective of Raiko et al.(2015) was a tighter lower bound on the log-likelihood thanthe single-sample variational lower bound and presented amethod for training variational autoencoders by optimizingthis multi-sample objective. Their method relies on an un-biased gradient estimator which can be used only for mod-


els with continuous latent variables.

Reweighted Wake Sleep: Though the Reweighted WakeSleep algorithm (RWS, Bornschein & Bengio, 2015) fortraining generative models has been derived from the per-spective of approximating the log-likelihood gradients us-ing importance sampling, it is closely related to the boundoptimization approach we follow in this paper. Burda et al.(2016) have shown that the RWS gradient estimator for themodel parameters is identical to the one given by Eq. 6,which means that the RWS model parameter update aimsto maximize the lower bound on the log-likelihood basedon the multi-sample importance sampling estimator fromEq. 2. RWS performs two types of updates for the proposaldistribution parameters, the first of which, called the wakeupdate, is based on the same weights {wj} as the modelparameter update

∆θ ∝∑j w

j∇θ logQ(hj |x) (11)

and is motivated as a (biased) estimator of the gradientKL(P (h|x)||Q(h|x)). Its bias decreases with the increas-ing number of samples, vanishing in the limit of infinitelymany samples.

The second update, called the sleep update, having the form

∆θ ∝ ∇θ logQ(h|x), (12)

is based on a sample (x, h) from the model and comes fromthe original Wake-Sleep algorithm (Hinton et al., 1995).The wake update tends to work better than the sleep up-date, and using the two updates together works even bet-ter (Bornschein & Bengio, 2015). As neither of these up-dates appears to be related to the lower bound optimized themodel parameter update, RWS does not seem to optimize awell-defined objective, a feature it shares with the originalWake-Sleep algorithm. Despite this theoretical weaknessRWS works well in practice, outperforming original Wake-Sleep and NVIL, which are single-sample algorithms, us-ing as few as 5 samples per observation.

Black Box Methods: As our approach does not assumeanything about the structure of the model or the distribu-tion(s) of its latent variables, it can be seen as a black boxmethod for multi-sample objectives. A number of blackbox methods have been developed for the classical vari-ational objective, usually based around unbiased gradientestimators for the proposal distribution. Black Box Varia-tional Inference (BBVI Ranganath et al., 2014) and NVIL(Mnih & Gregor, 2014) are two such methods.

VIMCO shares some similarities with the black boxmethod of the local expectations (LE) of Titsias & Lázaro-Gredilla (2015). The LE method provides a relatively lowvariance unbiased estimator based on local learning signalsderived from computing an exact expectation w.r.t. each

variable in the model. Both methods work well withoutbaselines and involve considering multiple values for la-tent variables. Unlike VIMCO, the LE method optimizes asingle-sample objective and requires computing exact ex-pectations for each variable, which makes it much morecomputationally expensive.

5. ResultsWe evaluate the effectiveness of the proposed approachat training models for generative modelling and structuredoutput prediction. We chose these two tasks because theyinvolve models with hundreds of latent variables, whichposes formidable challenges when estimating the gradientsfor the proposal distributions. In both cases we compare theperformance of the VIMCO estimator to that of the NVILestimator as well as to an effective biased estimator fromthe literature. We experiment with varying the number ofsamples in the objective to see how that affects the perfor-mance of the resulting models when using different estima-tors. The details of the training procedure are given in thesupplementary material.

5.1. Generative modelling

We start by applying the proposed estimator to traininggenerative models, concentrating on sigmoid belief net-works (SBN) (Neal, 1992) which consist of layers of binarylatent variables. SBNs have been used to evaluate a num-ber of variational training methods for models with discretelatent variables (Mnih & Gregor, 2014; Bornschein & Ben-gio, 2015; Gu et al., 2016).

Our first comparison is on the MNIST dataset of 28 ×28 images of handwritten digits, using the binarizationof Salakhutdinov & Murray (2008) and the standard50000/10000/10000 split into the training, validation, andtest sets. We use an SBN with three hidden layers of 200binary latent variables (200-200-200-768) as the generativemodel. The proposal distribution is parameterized as anSBN with the same architecture but going in the oppositedirection, from the observation to the deepest hidden layer(768-200-200-200).

As our primary goal is here is to see how well the VIMCOestimator performs at optimizing the multisample objec-tive, we train the above model using each of the VIMCO,NVIL, and RWS estimators to optimize the lower bound(Eq. 4) based on 2, 5, 10, and 50 samples (K). To match thecomputational complexity of the other two estimators, weused only the better-performing wake update for the pro-posal distribution in RWS. We also trained the model byoptimizing the classical variational objective (K = 1) us-ing NVIL to serve as a single-sample baseline. In all cases,the model parameter gradients were estimated using Eq. 6.


−120

−110

−100

−90

0e+00 1e+06 2e+06 3e+06 4e+06Steps

Bou

nd (

nats

)

VIMCO(2)VIMCO(5)VIMCO(10)VIMCO(50)NVIL(1)NVIL(2)NVIL(5)NVIL(10)NVIL(50)

−120

−110

−100

−90

0e+00 1e+06 2e+06 3e+06 4e+06Steps

Bou

nd (

nats

)

RWS(2)RWS(5)RWS(10)RWS(50)VIMCO(2)VIMCO(5)VIMCO(10)VIMCO(50)

Figure 1. Generative modelling: Comparison of multi-sample objective on the validation set for the SBNs trained on MNIST usingVIMCO and those trained using (Left) NVIL and (Right) Reweighted Wake Sleep. The number in brackets specifies the number ofsamples used in the training objective.

Figure 1 shows the evolution of the training objective onthe validation set as training proceeds. From the left plot,which compares the models trained using VIMCO to thosetrained NVIL, it is apparent that VIMCO is far more effec-tive than NVIL at optimizing the multi-sample objectiveand benefits much more from using more samples. NVILperformance improves slightly when using a modest num-ber of samples before starting to degrade upon reachingK = 10. The right plot shows the comparison betweenVIMCO and RWS. The two methods perform similarly,with VIMCO performing better when using 2 samples andRWS learning slightly faster when using more samples.

Having selected the best model for each method/number ofsamples combination based on its validation score, we esti-mated its negative log-likelihood on the test set using 1000proposal samples for each data point. The results in Table 1show that VIMCO and NVIL perform slightly better thanRWS for 2 samples. However, as the number of samples in-creases, VIMCO and RWS performance steadily improveswhile NVIL performance stays virtually the same untilreachingK = 50, when it becomes markedly worse. Over-all, RWS and VIMCO perform similarly, though VIMCOseems to have a slight edge over RWS for all numbers ofsamples we considered.

We also investigated the effectiveness of VIMCO andNVIL variance reduction techniques more directly, bymonitoring the magnitude of their learning signals duringtraining. While VIMCO and NVIL performed compara-bly when using the 2-sample objective, VIMCO benefitedmuch more from using more samples. For the 10-sampleobjective, the average magnitude of the VIMCO learningsignal was 3 times lower than that of NVIL. More detailsare given in the supplementary material.

Table 1. Estimates of the negative log-likelihood (in nats) for gen-erative modelling on MNIST. The model is an SBN with threelatent layers of 200 binary units.

NUMBER OF TRAINING ALG.SAMPLES VIMCO NVIL RWS

1 — 95.2 —2 93.5 93.6 94.65 92.8 93.7 93.4

10 92.6 93.4 93.050 91.9 96.2 92.5

5.2. Structured output prediction

In the second set of experiments we evaluated the proposedestimator at training structured output prediction models.We chose a task that has been used as a benchmark forevaluating gradient estimators for models with binary latentvariables by Raiko et al. (2015) and Gu et al. (2016), whichinvolves predicting the lower half of an MNIST digit fromits top half. We trained two SBN models, one with two andone with three layers of 200 binary latent variables betweenthe 392-dimensional (14× 28) input and output layers. Weuse the same binarized MNIST dataset for this task as forthe generative modelling experiments in Section 5.1.

We consider two different kinds of proposal distributionsfor training the models. In the first case, we follow thestandard practice for training structured output predictionmodels and use the model prior as the proposal distribution.However, as the prior does not have access to the observa-tion information which is available during training, most ofthe resulting samples are unlikely to explain the observa-tion well, potentially leading to inefficient use of samplesand unnecessarily noisy learning signal. Hence, in the sec-ond case we learn a separate proposal distribution that takesboth the context and the observation halves of the image asinput. We parameterize the proposal distribution using anSBN with the same structure as the prior except that the last


−90

−80

−70

−60

−50

−40

0e+00 1e+06 2e+06 3e+06 4e+06Steps

Bou

nd (

nats

)

−90

−80

−70

−60

−50

−40

0e+00 1e+06 2e+06 3e+06 4e+06Steps

Bou

nd (

nats

)

VIMCO(2)VIMCO(5)VIMCO(20)VIMCO(50)NVIL(1)NVIL(2)NVIL(5)NVIL(20)NVIL(50)

Figure 2. Structured output prediction: Comparison of multi-sample objective on the validation set for a 3-hidden-layer SBN trainedwith VIMCO against those trained with NVIL using sampling from (Left) the prior and (Right) the learned proposal distribution. Thenumber in brackets specifies the number of samples used in the training objective.

layer of latent variables in addition to being conditioned onthe preceding layer is also conditioned on the observation.

We train the models with VIMCO and NVIL using 2, 5,20, and 50 sample objectives. As in the previous exper-iment, we also train single-sample baseline models usingboth types of proposals with NVIL (and K = 1). Fig-ure 2 shows the resulting multi-sample bound values forthe three-layer models on the validation set as a function ofthe number of parameter updates. The left plot, containingthe results for models trained by sampling from the prior,shows that model performance improves dramatically asthe number of samples is increased. Though NVIL with1 or 2 samples, performs better than VIMCO with 2 sam-ples, as the number of samples increases their roles reverse,with VIMCO making much faster progress than NVIL for20 and 50 samples. The fact that increasing the number ofsamples has such an effect on model performance stronglysuggests that samples generated from the prior rarely ex-plain the observation well.

The right plot on Figure 2 shows the result of training with alearned proposal distribution. It is clear that using a learnedproposal leads to drastic improvement for all method /number of samples combinations. In fact, the worst re-sult obtained using a learned proposal distribution is betterthan the best result obtained by sampling from the prior.In terms of relative performance, the story here is similarto that from the generative modelling experiment: VIMCOperforms better than NVIL and benefits much more fromincreasing the number of samples. The gap between themethods is considerably smaller here, likely due to the taskbeing easier. Inspecting the conditional digit completionssampled from the models shows that the models trained us-ing a learned proposal distribution capture multimodalityinherent in the task very well. We show conditional com-pletions from a three-layer model trained using VIMCO

with 20 samples in the supplementary material.

Finally, to compare to the results of Raiko et al. (2015), wefollowed their evaluation protocol and estimated the nega-tive log-likelihoods for the trained models using 100 sam-ples. Their best result on this task was 53.8 nats, obtainedusing a 2-layer SBN trained using a biased estimator emu-lating backprop to optimize the 20-sample objective. WithVIMCO training, the same model achieves 56.5 nats us-ing the prior as the proposal and 46.1 nats with a learnedproposal, which is the first sub-50 nat result on this task.

6. DiscussionIn this paper we introduced VIMCO, the first unbiasedgeneral gradient estimator designed specifically for multi-sample objectives that generalize the classical variationallower bound. By taking advantage of the structure of theobjective function, it implements simple and effective vari-ance reduction at no extra computational cost, eliminatingthe need for the learned baselines relied on by other generalunbiased estimators such as NVIL.

We demonstrated the effectiveness of VIMCO by applyingit to variational training of generative and structured outputprediction models. It consistently outperformed NVIL andwas competitive with the currently used biased estimators.

While classical variational methods can perform poorlywhen using an insufficiently expressive variational poste-rior, multi-sample objectives provide a graceful way oftrading computation for quality of fit simply by increasingthe number of samples used inside the objective. Com-bining such objectives with black box variational inferencemethods could make the latter substantially more effective.We thus hope that the proposed approach will increase theappeal and applicability of black box variational inference.


ACKNOWLEDGEMENTS

We thank Alex Graves, Guillaume Desjardins, KorayKavukcuoglu, Volodymyr Mnih, Hugo Larochelle, andMélanie Rey for their helpful comments.

ReferencesBa, Jimmy, Salakhutdinov, Ruslan R, Grosse, Roger B, and

Frey, Brendan J. Learning wake-sleep recurrent attentionmodels. In Advances in Neural Information ProcessingSystems, pp. 2575–2583, 2015.

Bornschein, Jörg and Bengio, Yoshua. Reweighted wake-sleep. ICLR, 2015.

Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan.Importance weighted autoencoders. ICLR, 2016.

Dauphin, Yann N and Grangier, David. Predicting distribu-tions with linearizing belief networks. ICLR, 2016.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende,Danilo Jimenez, and Wierstra, Daan. DRAW: A recur-rent neural network for image generation. In Proceed-ings of the 32nd International Conference on MachineLearning, pp. 1462–1471, 2015.

Gu, Shixiang, Levine, Sergey, Sutskever, Ilya, and Mnih,Andriy. MuProp: Unbiased backpropagation forstochastic neural networks. ICLR, 2016.

Hinton, Geoffrey E, Dayan, Peter, Frey, Brendan J, andNeal, Radford M. The "wake-sleep" algorithm for un-supervised neural networks. Science, 268(5214):1158–1161, 1995.

Jordan, Michael I., Ghahramani, Zoubin, Jaakkola,Tommi S., and Saul, Lawrence K. An introductionto variational methods for graphical models. MachineLearning, 37(2):183–233, 1999.

Kingma, Diederik and Ba, Jimmy. Adam: A method forstochastic optimization. ICLR, 2015.

Kingma, Diederik P and Welling, Max. Auto-encodingvariational bayes. ICLR, 2014.

Mnih, Andriy and Gregor, Karol. Neural variational infer-ence and learning in belief networks. In Proceedings ofthe 31st International Conference on Machine Learning,2014.

Mnih, Volodymyr, Heess, Nicolas, and Graves, Alex. Re-current models of visual attention. In Advances in NeuralInformation Processing Systems, pp. 2204–2212, 2014.

Neal, Radford M. Connectionist learning of belief net-works. Artificial intelligence, 56(1):71–113, 1992.

Raiko, Tapani, Berglund, Mathias, Alain, Guillaume, andDinh, Laurent. Techniques for learning binary stochasticfeedforward neural networks. ICLR, 2015.

Ranganath, Rajesh, Gerrish, Sean, and Blei, David M.Black box variational inference. AISTATS, 2014.

Rezende, Danilo Jimenez and Mohamed, Shakir. Varia-tional inference with normalizing flows. In Proceedingsof the 32nd International Conference on Machine Learn-ing, pp. 1530–1538, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra,Daan. Stochastic backpropagation and approximate in-ference in deep generative models. In Proceedings ofthe 31st International Conference on Machine Learning,2014.

Salakhutdinov, Ruslan and Murray, Iain. On the quantita-tive analysis of Deep Belief Networks. In Proceedingsof the 25th Annual International Conference on MachineLearning (ICML 2008), 2008.

Salimans, Tim, Kingma, Diederik P., and Welling, Max.Markov chain monte carlo and variational inference:Bridging the gap. In Proceedings of the 32nd Interna-tional Conference on Machine Learning, pp. 1218–1226,2015.

Sohn, Kihyuk, Lee, Honglak, and Yan, Xinchen. Learningstructured output representation using deep conditionalgenerative models. In Advances in Neural InformationProcessing Systems, pp. 3465–3473, 2015.

Tang, Yichuan and Salakhutdinov, Ruslan R. Learningstochastic feedforward neural networks. In Advancesin Neural Information Processing Systems, pp. 530–538,2013.

Titsias, Michalis and Lázaro-Gredilla, Miguel. Local ex-pectation gradients for black box variational inference.In Advances in Neural Information Processing Systems,pp. 2620–2628, 2015.

Zaremba, Wojciech and Sutskever, Ilya. Reinforce-ment learning neural Turing machines. arXiv preprintarXiv:1505.00521, 2015.


A. Algorithm for computing VIMCOgradientsAlgorithm 1 provides an outline of our implementation ofVIMCO gradient computation for a single training case.This version uses the geometric mean to estimate f(x, hj)from the other K − 1 terms. The computations are per-formed in the log domain for better numerical stability.

Algorithm 1 Compute gradient estimates for the model andproposal distribution parameters for a single observationRequire: x , K ≥ 2

for i = 1 to K dohi ∼ Q(h|x)l[i] = log f(x, hi)

end for{Compute the multi-sample stochastic bound}L = LogSumExp(l)− logK{Precompute the sum of log f}s = Sum(l){Compute the baseline for each sample}for i = 1 to K do

{Save the current log f for future use and replace it}{with the average of the other K-1 log f terms}temp = l[i]l[i] = (s− l[i])/(K − 1)L−i = LogSumExp(l)− logKl[i] = temp {Restore the saved value}

end forw = SoftMax(l) {Compute the importance weights}∇θ = 0,∇ψ = 0{Sum the gradient contributions from the K samples}for i = 1 to K do

{Proposal distribution gradient contributions}∇θ = ∇θ + (L− L−i)∇θ logQ(hi|x)∇θ = ∇θ + w[i]∇θ log f(x, hi){Model gradient contribution}∇ψ = ∇ψ + w[i]∇ψ log f(x, hi)

end for

B. Details of the experimental protocolAll models were trained using the Adam optimizer(Kingma & Ba, 2015) with minibatches of size 24. Theinput to the proposal distribution/inference network wascentered by subtracting the mean. For each trainingmethod/number of samples combination we trained themodel several times using different learning rates, savingthe model with the best validation score achieved duringeach training run. The plots and the scores shown in thepaper were obtained from the saved model with the highestvalidation score. For generative training, we considered thelearning rates of {3× 10−4, 1× 10−3, 3× 10−3}. For thestructured output prediction experiments, the learning rates

0.0

2.5

5.0

7.5

10.0

0 500000 1000000 1500000Steps

RM

S

VIMCO(2)VIMCO(5)VIMCO(10)NVIL(1)NVIL(2)NVIL(5)NVIL(10)

Figure 3. The magnitude (root mean square) of the learning sig-nal for VIMCO and NVIL as a function of the number of samplesused in the objective and the number of parameter updates.

were {3×10−4, 1×10−3, 3×10−3} for VIMCO and RWSand {1× 10−4, 3× 10−4, 1× 10−3} for NVIL.

Our NVIL implementation used both constant and input-dependent baselines as well as variance normalization. Theinput-dependent baseline for NVIL was a neural networkwith one hidden layer of 100 tanh units. VIMCO used thegeometric mean for computing the per-sample learning sig-nals.

C. Effect of variance reduction on thelearning signalAs explained in Sections 2.4 and 2.5, the magnitude of thelearning signal used for learning the proposal distributionparameters is closely related to the variance of the resultinggradient estimator. Both VIMCO and NVIL aim to reducethe estimator variance by subtracting a baseline from theoriginal learning signal L(h1:K) in order to reduce its mag-nitude, while keeping the estimator unbiased. We exam-ined the effectiveness of these two approaches by plotting asmoothed estimate of the magnitude of the resulting learn-ing signal (L(hj |h−j) for VIMCO and L(h1:K)− b(x)− bfor NVIL) as a function of the number of parameter up-dates when training the SBN on MNIST in Section 5.1.The magnitude of the learning signal was estimated by tak-ing the square root of the mean of the squared signal valuesfor each minibatch. The results for different numbers ofsamples shown in Figure 3 suggest that while VIMCO andNVIL are equally effective at reducing variance when usinga 2-sample objective, VIMCO becomes much more effec-tive than NVIL when using more than 2 samples. For 10samples, the average magnitude of the learning signal forVIMCO is about 3 times lower than for NVIL, which sug-gests almost an order of magnitude lower variance of thegradient estimates.


D. Gradient derivation for the multi-sampleobjectiveIn this section we will derive the gradient for the multi-sample objective

LK(x) =EQ(h1:K |x)

[L(h1:K)

]=EQ(h1:K |x)

[log I(h1:K)

]=EQ(h1:K |x)

[log

1

K

K∑i=1

f(x, hi)

].

We start by using the product rule:

∇θLK(x) =∇θEQ(h1:K |x)

[L(h1:K)

]=∇θ

∑h1:K

Q(h1:K |x)L(h1:K)

=∑h1:K

[L(h1:K)∇θQ(h1:K |x)+

Q(h1:K |x)∇θL(h1:K)]. (13)

Using the identity ∇θg(x) = g(x)∇θ log g(x), we can ex-press the gradient of Q(h1:K |x) as

∇θQ(h1:K |x) =Q(h1:K |x)∇θ logQ(h1:K |x)

=Q(h1:K |x)∇θ log

K∏j=1

Q(hj |x)

=Q(h1:K |x)

K∑j=1

∇θ logQ(hj |x). (14)

We use the chain rule along with the same identity to com-pute the gradient of L(h1:K):

∇θL(h1:K) =∇θ log1

K

K∑j=1

f(x, hj)

=1∑K

i=1 f(x, hi)

K∑j=1

∇θf(x, hj)

=1∑K

i=1 f(x, hi)

K∑j=1

f(x, hj)∇θ log f(x, hj)

=

K∑j=1

wj∇θ log f(x, hj) (15)

where wj ≡ f(x,hj)∑Ki=1 f(x,h

i). Substituting Eq. 14 and Eq. 15

into Eq. 13 we obtain

∇θLK(x) =∑h1:K

(L(h1:K)Q(h1:K |x)

K∑j=1

∇θ logQ(hj |x)+

Q(h1:K |x)

K∑j=1

wj∇θ log f(x, hj)),

=∑h1:K

Q(h1:K |x)L(h1:K)

K∑j=1

∇θ logQ(hj |x)+

∑h1:K

Q(h1:K |x)

K∑j=1

wj∇θ log f(x, hj),

=EQ(h1:K |x)

∑j

L(h1:K)∇θ logQ(hj |x)

+

EQ(h1:K |x)

∑j

wj∇θ log f(x, hj)

. (16)

E. Structured output prediction: digitcompletionsFigure 4 shows multiple completions for the same set oftop digit image halves generated using a three-layer (200-200-200) SBN trained using VIMCO with the 20-sampleobjective. The completions were obtained by computingobservation probabilities based on a single sample from theprior. The variability of the completions shows how themodel captured the multimodality of the data distribution.


Figure 4. Structured output prediction: Conditional completions generated by sampling from a three-layer SBN trained using VIMCOwith the 20-sample objective. The top row shows the original full digit images. The remaining rows combine the top half from theoriginal image with the bottom half generated from the model.

abstract arxiv:1602.06725v2 [cs.lg] 1 jun 2016 · recently,burda et al.(2016) have derived a...

Documents