


Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing, pages 52–62, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Piecewise Latent Variables for Neural Variational Text Processing

Iulian V. Serban1∗ and Alexander G. Ororbia II2∗ and Joelle Pineau3 and Aaron Courville1

1 Department of Computer Science and Operations Research, Universite de Montreal
2 College of Information Sciences & Technology, Penn State University
3 School of Computer Science, McGill University

iulian [DOT] vlad [DOT] serban [AT] umontreal [DOT] ca
ago109 [AT] psu [DOT] edu
jpineau [AT] cs [DOT] mcgill [DOT] ca
aaron [DOT] courville [AT] umontreal [DOT] ca

Abstract

Advances in neural variational inference have facilitated the learning of powerful directed graphical models with continuous latent variables, such as variational autoencoders. The hope is that such models will learn to represent rich, multi-modal latent factors in real-world data, such as natural language text. However, current models often assume simplistic priors on the latent variables — such as the uni-modal Gaussian distribution — which are incapable of representing complex latent factors efficiently. To overcome this restriction, we propose the simple, but highly flexible, piecewise constant distribution. This distribution has the capacity to represent an exponential number of modes of a latent target distribution, while remaining mathematically tractable. Our results demonstrate that incorporating this new latent distribution into different models yields substantial improvements in natural language processing tasks such as document modeling and natural language generation for dialogue.

1 Introduction

The development of the variational autoencoder framework (Kingma and Welling, 2014; Rezende et al., 2014) has paved the way for learning large-scale, directed latent variable models. This has led to significant progress in a diverse set of machine learning applications, ranging from computer vision (Gregor et al., 2015; Larsen et al., 2016) to natural language processing tasks (Mnih and Gregor, 2014; Miao et al., 2016; Bowman et al., 2015; Serban et al., 2017b).

* The first two authors contributed equally.

It is hoped that this framework will enable the learning of generative processes of real-world data — including text, audio and images — by disentangling and representing the underlying latent factors in the data. However, latent factors in real-world data are often highly complex. For example, topics in newswire text and responses in conversational dialogue often possess latent factors that follow non-linear (non-smooth), multi-modal distributions (i.e. distributions with multiple local maxima).

Nevertheless, the majority of current models assume a simple prior in the form of a multivariate Gaussian distribution in order to maintain mathematical and computational tractability. This is often a highly restrictive and unrealistic assumption to impose on the structure of the latent variables. First, it imposes a strong uni-modal structure on the latent variable space; latent variable samples from the generating model (prior distribution) all cluster around a single mean. Second, it forces the latent variables to follow a perfectly symmetric distribution with constant kurtosis; this makes it difficult to represent asymmetric or rarely occurring factors. Such constraints on the latent variables increase pressure on the down-stream generative model, which in turn is forced to carefully partition the probability mass for each latent factor throughout its intermediate layers. For complex, multi-modal distributions — such as the distribution over topics in a text corpus, or natural language responses in a dialogue system — the uni-modal Gaussian prior inhibits the model's ability to extract and represent important latent structure in the data. In order to learn more expressive latent variable models, we therefore need more flexible, yet tractable, priors.

In this paper, we introduce a simple, flexible prior distribution based on the piecewise constant distribution. We derive an analytical, tractable form that is applicable to the variational autoencoder framework and propose a differentiable parametrization for it. We then evaluate the effectiveness of the distribution when utilized both as a prior and as an approximate posterior across variational architectures in two natural language processing tasks: document modeling and natural language generation for dialogue. We show that the piecewise constant distribution is able to capture elements of a target distribution that cannot be captured by simpler priors — such as the uni-modal Gaussian. We demonstrate state-of-the-art results on three document modeling tasks, and show improvements on a dialogue natural language generation task. Finally, we illustrate qualitatively how the piecewise constant distribution represents multi-modal latent structure in the data.

2 Related Work

The idea of using an artificial neural network to approximate an inference model dates back to the early work of Hinton and colleagues (Hinton and Zemel, 1994; Hinton et al., 1995; Dayan and Hinton, 1996). Researchers later proposed Markov chain Monte Carlo methods (MCMC) (Neal, 1992), which do not scale well and mix slowly, as well as variational approaches which require a tractable, factored distribution to approximate the true posterior distribution (Jordan et al., 1999). Others have since proposed using feed-forward inference models to initialize the mean-field inference algorithm for training Boltzmann architectures (Salakhutdinov and Larochelle, 2010; Ororbia II et al., 2015). Recently, the variational autoencoder framework (VAE) was proposed by Kingma and Welling (2014) and Rezende et al. (2014), closely related to the method proposed by Mnih and Gregor (2014). This framework allows the joint training of an inference network and a directed generative model, maximizing a variational lower-bound on the data log-likelihood and facilitating exact sampling of the variational posterior. Our work extends this framework.

With respect to document modeling, neural architectures have been shown to outperform well-established topic models such as Latent Dirichlet Allocation (LDA) (Hofmann, 1999; Blei et al., 2003). Researchers have successfully proposed several models involving discrete latent variables (Salakhutdinov and Hinton, 2009; Hinton and Salakhutdinov, 2009; Srivastava et al., 2013; Larochelle and Lauly, 2012; Uria et al., 2014; Lauly et al., 2016; Bornschein and Bengio, 2015; Mnih and Gregor, 2014). The success of such discrete latent variable models — which are able to partition probability mass into separate regions — serves as one of our main motivations for investigating models with more flexible continuous latent variables for document modeling. More recently, Miao et al. (2016) proposed to use continuous latent variables for document modeling.

Researchers have also investigated latent variable models for dialogue modeling and dialogue natural language generation (Bangalore et al., 2008; Crook et al., 2009; Zhai and Williams, 2014). The success of discrete latent variable models in this task also motivates our investigation of more flexible continuous latent variables. Closely related to our proposed approach is the Variational Hierarchical Recurrent Encoder-Decoder (VHRED, described below) (Serban et al., 2017b), a neural architecture with latent multivariate Gaussian variables.

Researchers have explored more flexible distributions for the latent variables in VAEs, such as autoregressive distributions, hierarchical probabilistic models and approximations based on MCMC sampling (Rezende et al., 2014; Rezende and Mohamed, 2015; Kingma et al., 2016; Ranganath et al., 2016; Maaløe et al., 2016; Salimans et al., 2015; Burda et al., 2016; Chen et al., 2017; Ruiz et al., 2016). These are all complementary to our approach; it is possible to combine them with the piecewise constant latent variables. In parallel to our work, multiple research groups have also proposed VAEs with discrete latent variables (Maddison et al., 2017; Jang et al., 2017; Rolfe, 2017; Johnson et al., 2016). This is a promising line of research; however, these approaches often require approximations which may be inaccurate when applied to larger scale tasks, such as document modeling or natural language generation. Finally, discrete latent variables may be inappropriate for certain natural language processing tasks.

3 Neural Variational Models

We start by introducing the neural variational learning framework. We focus on modeling discrete output variables (e.g. words) in the context of natural language processing applications. However, the framework can easily be adapted to handle continuous output variables.

3.1 Neural Variational Learning

Let w1, . . . , wN be a sequence of N tokens (words) conditioned on a continuous latent variable z. Further, let c be an additional observed variable which conditions both z and w1, . . . , wN. Then, the distribution over words is:

P_\theta(w_1, \ldots, w_N \mid c) = \int \prod_{n=1}^{N} P_\theta(w_n \mid w_{<n}, z, c) \, P_\theta(z \mid c) \, dz,

where θ are the model parameters. The model first generates the higher-level, continuous latent variable z conditioned on c. Given z and c, it then generates the word sequence w1, . . . , wN. For unsupervised modeling of documents, c is excluded and the words are assumed to be independent of each other when conditioned on z:

P_\theta(w_1, \ldots, w_N) = \int \prod_{n=1}^{N} P_\theta(w_n \mid z) \, P_\theta(z) \, dz.

Model parameters can be learned using the variational lower-bound (Kingma and Welling, 2014):

\log P_\theta(w_1, \ldots, w_N \mid c) \geq \mathbb{E}_{z \sim Q_\psi(z \mid w_1, \ldots, w_N, c)}\!\left[ \sum_{n=1}^{N} \log P_\theta(w_n \mid w_{<n}, z, c) \right] - \mathrm{KL}\!\left[ Q_\psi(z \mid w_1, \ldots, w_N, c) \,\|\, P_\theta(z \mid c) \right], \qquad (1)

where we note that Qψ(z|w1, . . . , wN, c) is the approximation to the intractable, true posterior Pθ(z|w1, . . . , wN, c). Q is called the encoder, or sometimes the recognition model or inference model, and it is parametrized by ψ. The distribution Pθ(z|c) is the prior model for z, where the only available information is c. The VAE framework further employs the re-parametrization trick, which allows one to move the derivative of the lower-bound inside the expectation. To accomplish this, z is parametrized as a transformation of a fixed, parameter-free random distribution z = fθ(ε), where ε is drawn from a random distribution. Here, f is a transformation of ε, parametrized by θ, such that fθ(ε) ∼ Pθ(z|c). For example, ε might be drawn from a standard Gaussian distribution and f might be defined as fθ(ε) = µ + σε, where µ and σ are in the parameter set θ. In this case, z is able to represent any Gaussian with mean µ and variance σ².
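As a minimal sketch of this re-parametrization for the Gaussian case (illustrative only, not the released implementation; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian(mu, sigma):
    # Re-parametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so z is a deterministic, differentiable function of (mu, sigma)
    # given the parameter-free noise eps.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

z = sample_gaussian(mu=np.zeros(4), sigma=np.ones(4))  # one latent sample
```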

Model parameters are learned by maximizing the variational lower-bound in eq. (1) using gradient descent, where the expectation is computed using samples from the approximate posterior.

The majority of work on VAEs proposes to parametrize z as a multivariate Gaussian distribution. However, this unrealistic assumption may critically hurt the expressiveness of the latent variable model. See Appendix A for a detailed discussion. This motivates the proposed piecewise constant latent variable distribution.

3.2 Piecewise Constant Distribution

We propose to learn latent variables by parametrizing z using a piecewise constant probability density function (PDF). This should allow z to represent complex aspects of the data distribution in latent variable space, such as non-smooth regions of probability mass and multiple modes.

Let n ∈ N be the number of piecewise constant components. We assume z is drawn from the PDF:

P(z) = \frac{1}{K} \sum_{i=1}^{n} \mathbb{1}\!\left( \frac{i-1}{n} \leq z \leq \frac{i}{n} \right) a_i, \qquad (2)

where 1(x) is the indicator function, which is one when x is true and otherwise zero. The distribution parameters are a_i > 0, for i = 1, . . . , n. The normalization constant is:

K = \sum_{i=1}^{n} K_i, \quad \text{where } K_0 = 0, \; K_i = \frac{a_i}{n} \text{ for } i = 1, \ldots, n.

It is straightforward to show that a piecewise constant distribution with more than two pieces (n > 2) is capable of representing a bi-modal distribution. When n > 2, a vector z of piecewise constant variables can represent a probability density with 2^{|z|} modes. Figure 1 illustrates how these variables help model complex, multi-modal distributions.

Figure 1: Joint density plot of a pair of Gaussian and piecewise constant variables. The horizontal axis corresponds to z1, which is a univariate Gaussian variable. The vertical axis corresponds to z2, which is a piecewise constant variable.

In order to compute the variational bound, we need to draw samples from the piecewise constant distribution using its inverse cumulative distribution function (CDF). Further, we need to compute the KL divergence between the prior and posterior. The inverse CDF and KL divergence quantities are both derived in Appendix B. During training we must compute derivatives of the variational bound in eq. (1). These expressions involve derivatives of indicator functions, which have derivatives of zero everywhere except for the changing points, where the derivative is undefined. However, the probability of sampling the value exactly at its changing point is effectively zero. Thus, we fix these derivatives to zero. Similar approximations are used in training networks with rectified linear units.
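The inverse-CDF sampler itself is simple. The sketch below is our own illustration of its general form for the density in eq. (2), given parameter-free uniform noise u; the exact closed-form expressions used in the paper are those derived in Appendix B, which is not reproduced here.

```python
import numpy as np

def sample_piecewise(a, u):
    """Inverse-CDF sample on [0, 1] from the piecewise constant density
    of eq. (2).

    a : array of shape (n,), positive piece weights a_1..a_n.
    u : a Uniform(0, 1) draw (the parameter-free noise).
    """
    n = a.shape[0]
    probs = a / a.sum()               # probability mass of each piece
    cdf = np.cumsum(probs)
    k = np.searchsorted(cdf, u)       # index of the piece containing u
    left_mass = cdf[k] - probs[k]     # CDF value at the left edge of piece k
    # Move linearly through piece k, then rescale to its width 1/n.
    return (k + (u - left_mass) / probs[k]) / n

rng = np.random.default_rng(0)
a = np.array([0.2, 2.0, 0.5])         # three pieces; the second dominates
samples = [sample_piecewise(a, rng.uniform()) for _ in range(5)]
```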

4 Latent Variable Parametrizations

In this section, we develop the parametrization of both the Gaussian variable and our proposed piecewise constant latent variable.

Let x be the current output sequence, which the model must generate (e.g. w1, . . . , wN). Let c be the observed conditioning information. If the task contains additional conditioning information, this will be embedded by c. For example, for dialogue natural language generation c represents an embedding of the dialogue history, while for document modeling c = ∅.

4.1 Gaussian Parametrization

Let µ^prior and σ^{2,prior} be the prior mean and variance, and let µ^post and σ^{2,post} be the approximate posterior mean and variance. For Gaussian latent variables, the prior distribution mean and variances are encoded using linear transformations of a hidden state. In particular, the prior distribution covariance is encoded as a diagonal covariance matrix using a softplus function:

\mu^{\text{prior}} = H^{\text{prior}}_{\mu} \text{Enc}(c) + b^{\text{prior}}_{\mu},
\sigma^{2,\text{prior}} = \text{diag}\!\left( \log\!\left(1 + \exp\!\left(H^{\text{prior}}_{\sigma} \text{Enc}(c) + b^{\text{prior}}_{\sigma}\right)\right) \right),

where Enc(c) is an embedding of the conditioning information c (e.g. for dialogue natural language generation this might, for example, be produced by an LSTM encoder applied to the dialogue history), which is shared across all latent variable dimensions. The matrices H^prior_µ, H^prior_σ and vectors b^prior_µ, b^prior_σ are learnable parameters. For the posterior distribution, previous work has shown it is better to parametrize the posterior distribution as a linear interpolation of the prior distribution mean and variance and a new estimate of the mean and variance based on the observation x (Fraccaro et al., 2016). The interpolation is controlled by a gating mechanism, allowing the model to turn on/off latent dimensions:

\mu^{\text{post}} = (1 - \alpha_{\mu})\, \mu^{\text{prior}} + \alpha_{\mu} \left( H^{\text{post}}_{\mu} \text{Enc}(c, x) + b^{\text{post}}_{\mu} \right),
\sigma^{2,\text{post}} = (1 - \alpha_{\sigma})\, \sigma^{2,\text{prior}} + \alpha_{\sigma}\, \text{diag}\!\left( \log\!\left(1 + \exp\!\left(H^{\text{post}}_{\sigma} \text{Enc}(c, x) + b^{\text{post}}_{\sigma}\right)\right) \right),

where Enc(c, x) is an embedding of both c and x. The matrices H^post_µ, H^post_σ and the vectors b^post_µ, b^post_σ, α_µ, α_σ are parameters to be learned. The interpolation mechanism is controlled by α_µ and α_σ, which are initialized to zero (i.e. initialized such that the posterior is equal to the prior).

4.2 Piecewise Constant Parametrization

We parametrize the piecewise prior parameters using an exponential function applied to a linear transformation of the conditioning information:

a^{\text{prior}}_i = \exp\!\left( H^{\text{prior}}_{a,i} \text{Enc}(c) + b^{\text{prior}}_{a,i} \right), \quad i = 1, \ldots, n,

where matrix H^prior_a and vector b^prior_a are learnable. As before, we define the posterior parameters as a function of both c and x:

a^{\text{post}}_i = \exp\!\left( H^{\text{post}}_{a,i} \text{Enc}(c, x) + b^{\text{post}}_{a,i} \right), \quad i = 1, \ldots, n,

where H^post_a and b^post_a are parameters.

5 Variational Text Modeling

We now introduce two classes of VAEs. The models are extended by incorporating the Gaussian and piecewise latent variable parametrizations.

5.1 Document Model

The neural variational document model (NVDM) has previously been proposed for document modeling (Mnih and Gregor, 2014; Miao et al., 2016), where the latent variables are Gaussian. Since the original NVDM uses Gaussian latent variables, we will refer to it as G-NVDM. We propose two novel models building on G-NVDM. The first model we propose uses piecewise constant latent variables instead of Gaussian latent variables. We refer to this model as P-NVDM. The second model we propose uses a combination of Gaussian and piecewise constant latent variables. The model samples the Gaussian and piecewise constant latent variables independently and then concatenates them together into one vector. We refer to this model as H-NVDM.

Let V be the vocabulary of document words. Let W represent a document matrix, where row w_i is the 1-of-|V| binary encoding of the i'th word in the document. Each model has an encoder component Enc(W), which compresses a document vector into a continuous distributed representation upon which the approximate posterior is built. For document modeling, word order information is not taken into account and no additional conditioning information is available. Therefore, each model uses a bag-of-words encoder, defined as a multi-layer perceptron (MLP) Enc(c = ∅, x) = Enc(x). Based on preliminary experiments, we choose the encoder to be a two-layered MLP with parametrized rectified linear activation functions (we omit these parameters for simplicity). For the approximate posterior, each model has the parameter matrix W^post_a and vector b^post_a for the piecewise latent variables, and the parameter matrices W^post_µ, W^post_σ and vectors b^post_µ, b^post_σ for the Gaussian means and variances. For the prior, each model has parameter vector b^prior_a for the piecewise latent variables, and vectors b^prior_µ, b^prior_σ for the Gaussian means and variances. We initialize the bias parameters to zero in order to start with centered Gaussian and piecewise constant priors. The encoder will adapt these priors as learning progresses, using the gating mechanism to turn on/off latent dimensions.

Let z be the vector of latent variables sampled according to the approximate posterior distribution. Given z, the decoder Dec(w, z) outputs a distribution over words in the document:

\text{Dec}(w, z) = \frac{\exp\!\left( -w^{T} R z + b_w \right)}{\sum_{w'} \exp\!\left( -w'^{T} R z + b_{w'} \right)},

where R is a parameter matrix and b is a parameter vector corresponding to the bias for each word to be learned. This output probability distribution is combined with the KL divergences to compute the lower-bound in eq. (1). See Appendix C.
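A small numpy sketch of this decoder (illustrative; it assumes one-hot word vectors, so that w^T R z simply selects one entry of Rz, and subtracts the maximum logit purely for numerical stability):

```python
import numpy as np

def decoder_word_probs(R, b, z):
    """Distribution over the vocabulary given latent vector z.

    R : (|V|, d) parameter matrix, b : (|V|,) bias vector, z : (d,).
    """
    logits = -(R @ z) + b
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def doc_log_likelihood(word_ids, R, b, z):
    # Bag-of-words reconstruction term: sum of log-probabilities of the
    # observed words, conditioned on the sampled z.
    probs = decoder_word_probs(R, b, z)
    return np.log(probs[word_ids]).sum()
```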

Our baseline model G-NVDM is an improvement over the original NVDM proposed by Mnih and Gregor (2014) and Miao et al. (2016). We learn the prior mean and variance, while these were fixed to a standard Gaussian in previous work. This increases the flexibility of the model and makes optimization easier. In addition, we use a gating mechanism for the approximate posterior of the Gaussian variables. This gating mechanism allows the model to turn off latent variables (i.e. fix the approximate posterior to equal the prior for specific latent variables) when computing the final posterior parameters. Furthermore, Miao et al. (2016) alternated between optimizing the approximate posterior parameters and the generative model parameters, while we optimize all parameters simultaneously.

5.2 Dialogue Model

The variational hierarchical recurrent encoder-decoder (VHRED) model has previously been proposed for dialogue modeling and natural language generation (Serban et al., 2017b, 2016a). The model decomposes dialogues using a two-level hierarchy: sequences of utterances (e.g. sentences), and sub-sequences of tokens (e.g. words). Let wn be the n'th utterance in a dialogue with N utterances. Let wn,m be the m'th word in the n'th utterance from vocabulary V given as a 1-of-|V| binary encoding. Let Mn be the number of words in the n'th utterance. For each utterance n = 1, . . . , N, the model generates a latent variable zn. Conditioned on this latent variable, the model then generates the next utterance:

P_\theta(w_1, z_1, \ldots, w_N, z_N) = \prod_{n=1}^{N} P_\theta(z_n \mid w_{<n}) \times \prod_{m=1}^{M_n} P_\theta(w_{n,m} \mid w_{n,<m}, w_{<n}, z_n),

where θ are the model parameters. VHRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. The encoder RNN computes an embedding for each utterance. This embedding is fed into the context RNN, which computes a hidden state summarizing the dialogue context before utterance n: h^con_{n−1}. This state represents the additional conditioning information, which is used to compute the prior distribution over zn:

P_\theta(z_n \mid w_{<n}) = f^{\text{prior}}_{\theta}(z_n; h^{\text{con}}_{n-1}),

where f^prior is a PDF parametrized by both θ and h^con_{n−1}. A sample is drawn from this distribution: zn ∼ Pθ(zn|w<n). This sample is given as input to the decoder RNN, which then computes the output probabilities of the words in the next utterance.


The model is trained by maximizing the variational lower-bound, which factorizes into independent terms for each sub-sequence (utterance):

\log P_\theta(w_1, \ldots, w_N) \geq \sum_{n=1}^{N} \Big( - \mathrm{KL}\!\left[ Q_\psi(z_n \mid w_1, \ldots, w_n) \,\|\, P_\theta(z_n \mid w_{<n}) \right] + \mathbb{E}_{Q_\psi(z_n \mid w_1, \ldots, w_n)}\!\left[ \log P_\theta(w_n \mid z_n, w_{<n}) \right] \Big),

where the distribution Qψ is the approximate posterior distribution with parameters ψ, computed similarly to the prior distribution but further conditioned on the encoder RNN hidden state of the next utterance.
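As a sketch of how this bound is assembled per utterance (illustrative numpy, assuming diagonal Gaussian prior and posterior for concreteness; the piecewise KL term has its own closed form, derived in Appendix B):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    # Closed-form KL[q || p] between two diagonal Gaussians q and p.
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def vhred_lower_bound(utterance_terms):
    """Sum of per-utterance bound terms.

    utterance_terms : list of (log_lik, kl) pairs, one per utterance,
    where log_lik estimates E_q[log P(w_n | z_n, w_<n)] with a sample
    from the approximate posterior and kl is the matching KL term.
    """
    return sum(log_lik - kl for log_lik, kl in utterance_terms)
```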

The original VHRED model (Serban et al., 2017b) used Gaussian latent variables. We refer to this model as G-VHRED. The first model we propose uses piecewise constant latent variables instead of Gaussian latent variables. We refer to this model as P-VHRED. The second model we propose takes advantage of the representation power of both Gaussian and piecewise constant latent variables. This model samples both a Gaussian latent variable z^gaussian_n and a piecewise latent variable z^piecewise_n independently conditioned on the context RNN hidden state:

P_\theta(z^{\text{gaussian}}_n \mid w_{<n}) = f^{\text{prior, gaussian}}_{\theta}(z^{\text{gaussian}}_n; h^{\text{con}}_{n-1}),
P_\theta(z^{\text{piecewise}}_n \mid w_{<n}) = f^{\text{prior, piecewise}}_{\theta}(z^{\text{piecewise}}_n; h^{\text{con}}_{n-1}),

where f^{prior, gaussian} and f^{prior, piecewise} are PDFs parametrized by independent subsets of parameters θ. We refer to this model as H-VHRED.

6 Experiments

We evaluate the proposed models on two types of natural language processing tasks: document modeling and dialogue natural language generation. All models are trained with back-propagation using the variational lower-bound on the log-likelihood or the exact log-likelihood. We use the first-order gradient descent optimizer Adam (Kingma and Ba, 2015) with gradient clipping (Pascanu et al., 2012).¹

Model      20-NG   RCV1   CADE
LDA         1058     --     --
docNADE      896     --     --
NVDM         836     --     --
G-NVDM       651    905    339
H-NVDM-3     607    865    258
H-NVDM-5     566    833    294

Table 1: Test perplexities on three document modeling tasks: 20 News-Groups (20-NG), Reuters corpus (RCV1) and CADE12 (CADE). Perplexities were calculated using 10 samples to estimate the variational lower-bound. The H-NVDM models perform best across all three datasets.

6.1 Document Modeling

Tasks: We use three different datasets for document modeling experiments. First, we use the 20 News-Groups (20-NG) dataset (Hinton and Salakhutdinov, 2009). Second, we use the Reuters corpus (RCV1-V2), using a version that contained a selected 5,000 term vocabulary. As in previous work (Hinton and Salakhutdinov, 2009; Larochelle and Lauly, 2012), we transform the original word frequencies using the equation log(1 + TF), where TF is the original word frequency. Third, to test our document models on text from a non-English language, we use the Brazilian Portuguese CADE12 dataset (Cardoso-Cachopo, 2007). For all datasets, we track the validation bound on a subset of 100 vectors randomly drawn from each training corpus.

Training: All models were trained using mini-batches with 100 examples each. A learning rate of 0.002 was used. Model selection and early stopping were conducted using the validation lower-bound, estimated using five stochastic samples per validation example. Inference networks used 100 units in each hidden layer for 20-NG and CADE, and 100 for RCV1. We experimented with both 50 and 100 latent random variables for each class of models, and found that 50 latent variables performed best on the validation set. For H-NVDM we vary the number of components used in the PDF, investigating the effect that 3 and 5 pieces had on the final quality of the model. The number of hidden units was chosen via preliminary experimentation with smaller models.

¹ Code and scripts are available at https://github.com/ago109/piecewise-nvdm-emnlp-2017 and https://github.com/julianser/hred-latent-piecewise.


G-NVDM       H-NVDM-3    H-NVDM-5
environment  project     science
project      gov         built
flight       major       high
lab          based       technology
mission      earth       world
launch       include     form
field        science     scale
working      nasa        sun
build        systems     special
gov          technical   area

Table 2: Word query similarity test on 20 News-Groups: for the query "space", we retrieve the top 10 nearest words in word embedding space based on Euclidean distance. H-NVDM-5 associates multiple meanings to the query, while G-NVDM only associates the most frequent meaning.

On 20-NG, we use the same set-up as Hinton and Salakhutdinov (2009) and therefore report the perplexities of a topic model (LDA; Hinton and Salakhutdinov, 2009), the document neural auto-regressive estimator (docNADE; Larochelle and Lauly, 2012), and a neural variational document model with a fixed standard Gaussian prior (NVDM, lowest reported perplexity; Miao et al., 2016).

Results: In Table 1, we report the test document perplexity:

\exp\left( -\frac{1}{D} \sum_{n} \frac{1}{L_n} \log P_\theta(x_n) \right),

where D is the number of test documents and L_n is the length of document n. We use the variational lower-bound as an approximation based on 10 samples, as was done by Mnih and Gregor (2014). First, we note that the best baseline model (i.e. the NVDM) is more competitive when both the prior and posterior models are learnt together (i.e. the G-NVDM), as opposed to the fixed prior of Miao et al. (2016). Next, we observe that integrating our proposed piecewise variables yields even better results in our document modeling experiments, substantially improving over the baselines. More importantly, in the 20-NG and Reuters datasets, increasing the number of pieces from 3 to 5 further reduces perplexity. Thus, we have achieved a new state-of-the-art perplexity on the 20 News-Groups task and — to the best of our knowledge — better perplexities on the CADE12 and RCV1 tasks compared to using a state-of-the-art model like the G-NVDM.

Figure 2: Latent variable approximate posterior means t-SNE visualization on 20-NG for G-NVDM and H-NVDM-5. Colors correspond to the topic labels assigned to each document.

We also evaluated the converged models using a non-parametric inference procedure, where a separate approximate posterior is learned for each test example in order to tighten the variational lower-bound. H-NVDM also performed best in this evaluation across all three datasets, which confirms that the performance improvement is due to the piecewise components. See the appendix for details.

In Table 2, we examine the top ten highest ranked words given the query term "space", using the decoder parameter matrix. The piecewise variables appear to have a significant effect on what is uncovered by the model. In the case of "space", the hybrid with 5 pieces seems to value two senses of the word — one related to "outer space" (e.g., "sun", "world", etc.) and another related to the dimensions of depth, height, and width within which things may exist and move (e.g., "area", "form", "scale", etc.). On the other hand, G-NVDM appears to only capture the "outer space" sense of the word. More examples are in the appendix.


Model     Activity   Entity
HRED          4.77     2.43
G-VHRED       9.24     2.49
P-VHRED       5        2.49
H-VHRED       8.41     3.72

Table 3: Ubuntu evaluation using F1 metrics w.r.t. activities and entities. G-VHRED, P-VHRED and H-VHRED all outperform the baseline HRED. G-VHRED performs best w.r.t. activities and H-VHRED performs best w.r.t. entities.

Finally, we visualized the means of the approximate posterior latent variables on 20-NG through a t-SNE projection. As shown in Figure 2, both G-NVDM and H-NVDM-5 learn representations which disentangle the topic clusters on 20-NG. However, G-NVDM appears to have more dispersed clusters and more outliers (i.e. data points in the periphery) compared to H-NVDM-5. Although it is difficult to draw conclusions based on these plots, these findings could potentially be explained by the Gaussian latent variables fitting the latent factors poorly.

6.2 Dialogue Modeling

Task: We evaluate VHRED on a natural language generation task, where the goal is to generate responses in a dialogue. This is a difficult problem, which has been extensively studied in the recent literature (Ritter et al., 2011; Lowe et al., 2015; Sordoni et al., 2015; Li et al., 2016; Serban et al., 2016a,b). Dialogue response generation has recently gained a significant amount of attention from industry, with high-profile projects such as Google SmartReply (Kannan et al., 2016) and Microsoft Xiaoice (Markoff and Mozur, 2015). Even more recently, Amazon has announced the Alexa Prize Challenge for the research community with the goal of developing a natural and engaging chatbot system (Farber, 2016).

We evaluate on the technical support response generation task for the Ubuntu operating system. We use the well-known Ubuntu Dialogue Corpus (Lowe et al., 2015, 2017), which consists of about 1/2 million natural language dialogues extracted from the #Ubuntu Internet Relay Chat (IRC) channel. The technical problems discussed span a wide range of software-related and hardware-related issues. Given a dialogue history — such as a conversation between a user and a technical support assistant — the model must generate the next appropriate response in the dialogue. For example, when it is the turn of the technical support assistant, the model must generate an appropriate response helping the user resolve their problem.

We evaluate the models using the activity- and entity-based metrics designed specifically for the Ubuntu domain (Serban et al., 2017a). These metrics compare the activities and entities in the model-generated responses with those of the reference responses; activities are verbs referring to high-level actions (e.g. download, install, unzip) and entities are nouns referring to technical objects (e.g. Firefox, GNOME). The more activities and entities a model response overlaps with the reference response (e.g. an expert response), the more likely the response will lead to a solution.
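The general shape of such an overlap metric is a set-based F1. The sketch below is our own illustration of that form; the exact extraction of activities and entities follows Serban et al. (2017a), not this code.

```python
def f1_overlap(predicted, reference):
    # F1 between the set of activities (or entities) in a generated
    # response and in the reference response.
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Example: reference activities {"install", "download"}, model response
# containing only {"install"}: precision = 1.0, recall = 0.5, F1 ≈ 0.67.
```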

Training: The models were trained to maximize the log-likelihood of training examples using a learning rate of 0.0002 and mini-batches of size 80. We use a variant of truncated back-propagation. We terminate the training procedure for each model using early stopping, estimated using one stochastic sample per validation example. We evaluate the models by generating dialogue responses: conditioned on a dialogue context, we fix the model latent variables to their median values and then generate the response using a beam search with size 5. We select model hyper-parameters based on the validation set using the F1 activity metric, as described earlier.

It is often difficult to train generative models for language with stochastic latent variables (Bowman et al., 2015; Serban et al., 2017b). For the latent variable models, we therefore experiment with reweighing the KL divergence terms in the variational lower-bound with values 0.25, 0.50, 0.75 and 1.0. In addition to this, we linearly increase the KL divergence weights starting from zero to their final value over the first 75,000 training batches. Finally, we weaken the decoder RNN by randomly replacing words inputted to the decoder RNN with the unknown token with 25% probability. These steps are important for effectively training the models, and the latter two have been used in previous work by Bowman et al. (2015) and Serban et al. (2017b).
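Concretely, the annealing schedule and word dropout can be sketched as follows (illustrative helpers, not the released training code):

```python
import numpy as np

def kl_weight(step, final_weight=1.0, warmup_steps=75000):
    # Linear KL annealing: the weight on the KL term grows from 0 to its
    # final value over the first `warmup_steps` training batches.
    return final_weight * min(1.0, step / warmup_steps)

def drop_words(token_ids, unk_id, drop_prob=0.25, rng=None):
    # Weaken the decoder RNN by replacing its input tokens with the
    # unknown token with probability drop_prob.
    rng = rng or np.random.default_rng()
    mask = rng.random(len(token_ids)) < drop_prob
    return [unk_id if m else t for t, m in zip(token_ids, mask)]
```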

HRED (Baseline): We compare to the HRED model (Serban et al., 2016a): a sequence-to-sequence model, shown to outperform other established models on this task, such as the LSTM RNN language model (Serban et al., 2017a). The HRED model's encoder RNN uses a bidirectional GRU RNN encoder, where the forward and backward RNNs each have 1000 hidden units. The context RNN is a GRU encoder with 1000 hidden units, and the decoder RNN is an LSTM decoder with 2000 hidden units.² The encoder and context RNNs both use layer normalization (Ba et al., 2016).³ We also experiment with an additional rectified linear layer applied on the inputs to the decoder RNN. As with other hyper-parameters, we choose whether to include this additional layer based on the validation set performance. HRED, as well as all other models, uses a word embedding dimensionality of size 400.

G-VHRED: We compare to G-VHRED, which is VHRED with Gaussian latent variables (Serban et al., 2017b). G-VHRED uses the same hyper-parameters for the encoder, context and decoder RNNs as the HRED model. The model has 100 Gaussian latent variables per utterance.

P-VHRED: The first model we propose is P-VHRED, which is the VHRED model with piecewise constant latent variables. We use n = 3 pieces for each latent variable. P-VHRED also uses the same hyper-parameters for the encoder, context and decoder RNNs as the HRED model. Similar to G-VHRED, P-VHRED has 100 piecewise constant latent variables per utterance.

H-VHRED: The second model we propose is H-VHRED, which has 100 piecewise constant (with n = 3 pieces per variable) and 100 Gaussian latent variables per utterance. H-VHRED also uses the same hyper-parameters for the encoder, context and decoder RNNs as HRED.

Results: The results are given in Table 3. All latent variable models outperform HRED w.r.t. both activities and entities. This strongly suggests that the high-level concepts represented by the latent variables help generate meaningful, goal-directed responses. Furthermore, each type of latent variable appears to help with different aspects of the generation task. G-VHRED performs best w.r.t. activities (e.g. download, install and so on), which occur frequently in the dataset.

² Since training lasted between 1-3 weeks for each model, we had to fix the number of hidden units during preliminary experiments on the training and validation datasets.

³ We did not apply layer normalization to the decoder RNN, because several of our colleagues have found that this may hurt the performance of generative language models.

This suggests that the Gaussian latent variables learn useful latent representations for frequent actions. On the other hand, H-VHRED performs best w.r.t. entities (e.g. Firefox, GNOME), which are often much rarer and mutually exclusive in the dataset. This suggests that the combination of Gaussian and piecewise latent variables helps learn useful representations for entities, which could not be learned by Gaussian latent variables alone. We further conducted a qualitative analysis of the model responses, which supports these conclusions. See Appendix G.⁴

7 Conclusions

In this paper, we have sought to learn rich and flexible multi-modal representations of latent variables for complex natural language processing tasks. We have proposed the piecewise constant distribution for the variational autoencoder framework. We have derived closed-form expressions for the quantities required in the autoencoder framework, and proposed an efficient, differentiable implementation of the distribution. We have incorporated the proposed piecewise constant distribution into two model classes — NVDM and VHRED — and evaluated the proposed models on document modeling and dialogue modeling tasks. We have achieved state-of-the-art results on three document modeling tasks, and have demonstrated substantial improvements on a dialogue modeling task. Overall, the results highlight the benefits of incorporating the flexible, multi-modal piecewise constant distribution into variational autoencoders. Future work should explore other natural language processing tasks, where the data is likely to arise from complex, multi-modal latent factors.

Acknowledgments

The authors acknowledge NSERC, Canada Research Chairs, CIFAR, IBM Research, Nuance Foundation and Microsoft Maluuba for funding. Alexander G. Ororbia II was funded by a NACME-Sloan scholarship. The authors thank Hugo Larochelle for sharing the News-Group 20 dataset. The authors thank Laurent Charlin, Sungjin Ahn, and Ryan Lowe for constructive feedback. This research was enabled in part by support provided by Calcul Quebec (www.calculquebec.ca) and Compute Canada (www.computecanada.ca).

⁴ Results on a Twitter dataset are given in the appendix.


References

J. L. Ba, J. R. Kiros, and G. E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
S. Bangalore, G. Di Fabbrizio, and A. Stent. 2008. Learning the structure of task-driven human–human dialogs. IEEE Transactions on Audio, Speech, and Language Processing, 16(7):1249–1259.
D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet allocation. JAIR, 3:993–1022.
J. Bornschein and Y. Bengio. 2015. Reweighted wake-sleep. In ICLR.
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. 2015. Generating sentences from a continuous space. In Conference on Computational Natural Language Learning.
Y. Burda, R. Grosse, and R. Salakhutdinov. 2016. Importance weighted autoencoders. In ICLR.
A. Cardoso-Cachopo. 2007. Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.
X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. 2017. Variational lossy autoencoder. In ICLR.
N. Crook, R. Granell, and S. Pulman. 2009. Unsupervised classification of dialogue acts using a dirichlet process mixture model. In Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 341–348.
P. Dayan and G. E. Hinton. 1996. Varieties of Helmholtz machine. Neural Networks, 9(8):1385–1403.
L. Devroye. 1986. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pages 260–265. ACM.
M. Farber. 2016. Amazon's 'Alexa Prize' Will Give College Students Up To $2.5M To Create A Socialbot. Fortune.
M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther. 2016. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207.
K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. 2015. DRAW: A recurrent neural network for image generation. In ICLR.
G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. 1995. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161.
G. E. Hinton and R. Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In NIPS, pages 1607–1614.
G. E. Hinton and R. S. Zemel. 1994. Autoencoders, minimum description length and helmholtz free energy. In NIPS, pages 3–10.
T. Hofmann. 1999. Probabilistic latent semantic indexing. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM.
E. Jang, S. Gu, and B. Poole. 2017. Categorical reparameterization with gumbel-softmax. In ICLR.
M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. 2016. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, pages 2946–2954.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
A. Kannan, K. Kurach, et al. 2016. Smart Reply: Automated Response Suggestion for Email. In KDD.
D. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
D. P. Kingma, T. Salimans, and M. Welling. 2016. Improving variational inference with inverse autoregressive flow. In NIPS, pages 4736–4744.
D. P. Kingma and M. Welling. 2014. Auto-encoding variational Bayes. In ICLR.
H. Larochelle and S. Lauly. 2012. A neural autoregressive topic model. In NIPS, pages 2708–2716.
A. B. Lindbo Larsen, S. K. Sønderby, and O. Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In ICML, pages 1558–1566.
S. Lauly, Y. Zheng, A. Allauzen, and H. Larochelle. 2016. Document neural autoregressive distribution estimation. arXiv preprint arXiv:1603.05962.
J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2016. A diversity-promoting objective function for neural conversation models. In The North American Chapter of the Association for Computational Linguistics (NAACL), pages 110–119.
Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Special Interest Group on Discourse and Dialogue (SIGDIAL).
Ryan T. Lowe, Nissan Pow, Iulian V. Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus. Dialogue & Discourse, 8(1).
Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. 2016. Auxiliary deep generative models. In ICML, pages 1445–1453.
C. J. Maddison, A. Mnih, and Y. W. Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR.
J. Markoff and P. Mozur. 2015. For Sympathetic Ear, More Chinese Turn to Smartphone Program. New York Times.
Y. Miao, L. Yu, and P. Blunsom. 2016. Neural variational inference for text processing. In ICML, pages 1727–1736.
A. Mnih and K. Gregor. 2014. Neural variational inference and learning in belief networks. In ICML, pages 1791–1799.
R. M. Neal. 1992. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113.
A. G. Ororbia II, C. L. Giles, and D. Reitter. 2015. Online semi-supervised learning with deep hybrid boltzmann machines and denoising autoencoders. arXiv preprint arXiv:1511.06964.
R. Pascanu, T. Mikolov, and Y. Bengio. 2012. On the difficulty of training recurrent neural networks. ICML, 28:1310–1318.
Rajesh Ranganath, Dustin Tran, and David Blei. 2016. Hierarchical variational models. In ICML, pages 324–333.
D. J. Rezende and S. Mohamed. 2015. Variational inference with normalizing flows. In ICML, pages 1530–1538.
D. J. Rezende, S. Mohamed, and D. Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286.
A. Ritter, C. Cherry, and W. B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593.
J. T. Rolfe. 2017. Discrete variational autoencoders. In ICLR.
Francisco R. Ruiz, Michalis Titsias RC AUEB, and David Blei. 2016. The generalized reparameterization gradient. In NIPS, pages 460–468.
R. Salakhutdinov and G. E. Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978.
R. Salakhutdinov and H. Larochelle. 2010. Efficient learning of deep boltzmann machines. In AISTATS, pages 693–700.
T. Salimans, D. P. Kingma, and M. Welling. 2015. Markov chain monte carlo and variational inference: Bridging the gap. In ICML, pages 1218–1226.
R. Sennrich, B. Haddow, and A. Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL).
I. V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, and A. Courville. 2017a. Multiresolution recurrent neural networks: An application to dialogue response generation. In Thirty-First AAAI Conference (AAAI).
I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. 2016a. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference (AAAI).
I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference (AAAI).
Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016b. Generative deep neural networks for dialogue: A short review. In NIPS, Let's Discuss: Learning Methods for Dialogue Workshop.
A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2015), pages 196–205.
N. Srivastava, R. R. Salakhutdinov, and G. E. Hinton. 2013. Modeling documents with deep boltzmann machines. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), pages 616–624.
B. Uria, I. Murray, and H. Larochelle. 2014. A deep and tractable density estimator. In ICML, pages 467–475.
K. Zhai and J. D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In Association for Computational Linguistics (ACL), pages 36–46.
