
Stochastic WaveNet: A Generative Latent Variable Model for Sequential Data

Guokun Lai 1 Bohan Li 1 Guoqing Zheng 1 Yiming Yang 1

Abstract

How to model the distribution of sequential data, including but not limited to speech and human motions, is an important ongoing research problem. It has been demonstrated that model capacity can be significantly enhanced by introducing stochastic latent variables into the hidden states of recurrent neural networks. Simultaneously, WaveNet, equipped with dilated convolutions, achieves astonishing empirical performance in the natural speech generation task. In this paper, we combine the ideas of stochastic latent variables and dilated convolutions, and propose a new architecture to model sequential data, termed Stochastic WaveNet, in which stochastic latent variables are injected into the WaveNet structure. We argue that Stochastic WaveNet enjoys powerful distribution modeling capacity as well as the advantage of parallel training from dilated convolutions. In order to efficiently infer the posterior distribution of the latent variables, a novel inference network structure is designed based on the characteristics of the WaveNet architecture. Stochastic WaveNet obtains state-of-the-art performance on benchmark natural speech modeling datasets, and it also generates high-quality human handwriting samples.

1. Introduction

Learning to capture the complex distribution of sequential data is an important machine learning problem and has been extensively studied in recent years. Autoregressive neural network models, including recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014), PixelCNN (Oord et al., 2016) and WaveNet (Van Den Oord et al., 2016), have shown strong empirical performance in modeling natural language, images and human speech.

1 Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. Correspondence to: Guokun Lai <[email protected]>.

All these methods aim at learning a deterministic mapping from the data input to the output. Recently, evidence has been found (Fabius and van Amersfoort, 2014; Gan et al., 2015; Gu et al., 2015; Goyal et al., 2017; Shabanian et al., 2017) that probabilistic modeling with neural networks can benefit from uncertainty introduced into their hidden states, namely by including stochastic latent variables in the network architecture. Without such uncertainty in the hidden states, RNN, PixelCNN and WaveNet parameterize the randomness only in the final layer, by shaping an output distribution from a specific distribution family. Hence the output distribution (which is often assumed to be Gaussian for continuous data) is unimodal or a mixture of unimodal distributions given the input data, which may be insufficient to capture the complex true data distribution and to describe the complex correlations among different output dimensions (Boulanger-Lewandowski et al., 2012). Even for the non-parametrized discrete output distribution modeled by the softmax function, a phenomenon referred to as the softmax bottleneck (Yang et al., 2017a) still limits the family of output distributions. By injecting stochastic latent variables into the hidden states and transforming their uncertainty to the outputs through non-linear layers, a stochastic neural network is equipped with the ability to model the data with a much richer family of distributions.

Motivated by this, numerous variants of RNN-based stochastic neural networks have been proposed. STORN (Bayer and Osendorfer, 2014) was the first to integrate stochastic latent variables into an RNN's hidden states. In VRNN (Chung et al., 2015), the prior of the stochastic latent variables is assumed to be a function of the historical data and stochastic latent variables, which allows it to capture temporal dependencies. SRNN (Fraccaro et al., 2016) and Z-forcing (Goyal et al., 2017) offer more powerful versions with augmented inference networks which better capture the correlation between the stochastic latent variables and the whole observed sequence. Some training tricks introduced in (Goyal et al., 2017; Shabanian et al., 2017) ease the training process of stochastic recurrent neural networks and lead to better empirical performance. By introducing stochasticity into the hidden states, these RNN-based models achieved significant improvements over vanilla RNN models in log-likelihood evaluations on multiple benchmark datasets from various domains (Goyal et al., 2017; Shabanian et al., 2017).

In parallel with RNNs, WaveNet (Van Den Oord et al., 2016) provides another powerful way of modeling sequential data with dilated convolutions, especially in the natural speech generation task. While RNN-based models must be trained in a sequential manner, training a WaveNet can be easily parallelized. Furthermore, the parallel WaveNet proposed in (Oord et al., 2017) is able to generate new sequences in parallel. WaveNet, or dilated convolutions, has also been adopted as the encoder or decoder in the VAE framework and produces reasonable results in text (Semeniuta et al., 2017; Yang et al., 2017b) and music (Engel et al., 2017) generation tasks.

In light of the advantage of introducing stochastic latent variables into RNN-based models, it is natural to ask whether this benefit carries over to WaveNet-based models. To this end, in this paper we propose Stochastic WaveNet, which associates stochastic latent variables with every hidden state in the WaveNet architecture. Compared with the vanilla WaveNet, Stochastic WaveNet is able to capture a richer family of data distributions via the added stochastic latent variables. It also inherits the ease of parallel training with dilated convolutions from the WaveNet architecture. Because of the added stochastic latent variables, an inference network is also designed and trained jointly with Stochastic WaveNet to maximize the data log-likelihood. We believe that after model training, the multi-layer structure of the latent variables leads them to reflect both hierarchical and sequential structures of the data. This hypothesis is validated empirically by controlling the number of layers of stochastic latent variables.

The rest of this paper is organized as follows: we briefly review the background in Section 2. The proposed model and optimization algorithm are introduced in Section 3. We evaluate and analyze the proposed model on multiple benchmark datasets in Section 4. Finally, the summary of this paper is included in Section 5.

2. Preliminary

2.1. Notation

We first define the mathematical symbols used in the rest of this paper. We denote a set of vectors by a bold symbol, such as x, which may be indexed by one or two subscripts, such as x_i or x_{i,j}. f(·) represents a general function that transforms an input vector into an output vector, and f_θ(·) is a neural network function parametrized by θ. For a sequential data sample x, T represents its length.

2.2. Autoregressive Neural Network

An autoregressive network model is designed to model the joint distribution of high-dimensional data with sequential structure, by factorizing the joint distribution of a data sample as

p(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})   (1)

where x = \{x_1, x_2, \cdots, x_T\}, x_t \in \mathbb{R}^d, t indexes the temporal time stamps, and θ represents the model parameters. The autoregressive model can then compute the likelihood of a sample and generate a new data sample in a sequential manner.
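As a concrete illustration of the factorization in Eq. (1) (this sketch is ours, not part of the original paper), the snippet below evaluates the log-likelihood of one sequence by summing per-step conditional log-densities; `cond_model` is a hypothetical stand-in for whatever autoregressive network (RNN, WaveNet, etc.) maps the prefix x_{<t} to the parameters of a diagonal-Gaussian output distribution.

```python
import math
import torch

def sequence_log_likelihood(x, cond_model):
    """Log-likelihood of one sequence under the factorization of Eq. (1).

    x          -- tensor of shape (T, d): one sequential sample.
    cond_model -- hypothetical callable mapping the prefix x[:t] to the
                  (mean, log_var) of a diagonal Gaussian p(x_t | x_{<t}).
    """
    log_lik = x.new_zeros(())
    for t in range(x.shape[0]):
        mean, log_var = cond_model(x[:t])      # condition only on x_{<t}
        # diagonal-Gaussian log-density of the observed frame x_t
        log_p_t = -0.5 * ((x[t] - mean) ** 2 / log_var.exp()
                          + log_var + math.log(2 * math.pi)).sum()
        log_lik = log_lik + log_p_t
    return log_lik
```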

In order to capture richer stochasticity of the sequential generation process, stochastic latent variables for each time stamp have been introduced, giving what is referred to as a stochastic neural network (Chung et al., 2015; Fraccaro et al., 2016; Goyal et al., 2017). The joint distribution of the data together with the latent variables is then factorized as

p(x, z) = \prod_{t=1}^{T} p_\theta(x_t, z_t \mid x_{<t}, z_{<t}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, z_{\le t}) \, p_\theta(z_t \mid x_{<t}, z_{<t})   (2)

where z = \{z_1, z_2, \cdots, z_T\}, z_t \in \mathbb{R}^{d'}, has the same sequence length as the data sample, and d' is its dimension for one time stamp. z is also generated sequentially, i.e., the prior of z_t is a conditional distribution given x_{<t} and z_{<t}.
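To make the sequential generation process of Eq. (2) concrete, here is a minimal ancestral-sampling sketch (our illustration, not the paper's code); `prior_net` and `emission_net` are hypothetical modules that return the mean and log-variance of diagonal Gaussians for z_t and x_t, respectively.

```python
import torch

@torch.no_grad()
def generate(prior_net, emission_net, T, d, d_z):
    """Ancestral sampling following the factorization in Eq. (2)."""
    xs, zs = [], []
    for t in range(T):
        x_hist = torch.stack(xs) if xs else torch.zeros(0, d)
        z_hist = torch.stack(zs) if zs else torch.zeros(0, d_z)
        # sample z_t ~ p_theta(z_t | x_{<t}, z_{<t})
        mu_z, log_var_z = prior_net(x_hist, z_hist)
        z_t = mu_z + (0.5 * log_var_z).exp() * torch.randn(d_z)
        # sample x_t ~ p_theta(x_t | x_{<t}, z_{<=t})
        mu_x, log_var_x = emission_net(x_hist, torch.cat([z_hist, z_t[None]]))
        x_t = mu_x + (0.5 * log_var_x).exp() * torch.randn(d)
        xs.append(x_t)
        zs.append(z_t)
    return torch.stack(xs), torch.stack(zs)
```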

2.3. WaveNet

WaveNet (Van Den Oord et al., 2016) is a convolutional autoregressive neural network which adopts dilated causal convolutions (Yu and Koltun, 2015) to extract the sequential dependencies in the data distribution. Different from recurrent neural networks, dilated convolution layers can be computed in parallel during the training process, which makes WaveNet much faster than RNNs at modeling sequential data. A typical WaveNet structure is visualized in Figure 1. Besides this computational advantage, WaveNet has shown state-of-the-art results in the speech generation task (Oord et al., 2017).

Figure 1. Visualization of a WaveNet structure from (Van Den Oord et al., 2016): an input layer followed by hidden layers with dilations 1, 2 and 4, and an output layer with dilation 8.
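For readers who want a concrete picture of dilated causal convolutions, below is a minimal PyTorch-style sketch of a WaveNet-like stack with dilations 1, 2, 4, ... (a simplified illustration under our own assumptions; the real WaveNet additionally uses gated activations, residual blocks and skip connections, which are omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution that is causal: the output at time t only sees inputs <= t."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, T)
        x = F.pad(x, (self.dilation, 0))       # left-pad so no future information leaks in
        return self.conv(x)

class TinyWaveNet(nn.Module):
    """Stack of dilated causal convolutions with dilations 1, 2, 4, ... (cf. Figure 1)."""
    def __init__(self, channels=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [CausalDilatedConv1d(channels, dilation=2 ** l) for l in range(n_layers)]
        )

    def forward(self, x):                      # x: (batch, channels, T)
        for layer in self.layers:
            x = torch.relu(layer(x)) + x       # simple residual connection
        return x                               # receptive field grows exponentially with depth
```

Because every layer is an ordinary convolution over the whole sequence, all time steps are processed in parallel during training, which is the source of WaveNet's training-speed advantage over RNNs.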

3. Stochastic WaveNet

In this section, we introduce a sequential generative model, Stochastic WaveNet, which couples stochastic latent variables with the multi-layer dilated convolution structure. We first introduce the generation process of Stochastic WaveNet, and then describe the variational inference method.

3.1. Generative Model

Similar to stochastic recurrent neural networks, we inject a stochastic latent variable into each WaveNet hidden node in the generation process, as illustrated in Figure 2a. More specifically, for a sequential data sample x with length T, we introduce a set of stochastic latent variables z = \{z_{t,l} \mid 1 \le t \le T, 1 \le l \le L\}, z_{t,l} \in \mathbb{R}^{d'}, where L is the number of layers of the WaveNet architecture. The generation process can then be described as

p(x, z) = \prod_{t=1}^{T} p_\theta(x_t, z_{t,1:L} \mid x_{<t}, z_{<t,1:L}) = \prod_{t=1}^{T} \Big[ p_\theta(x_t \mid z_{\le t,1:L}, x_{<t}) \prod_{l=1}^{L} p_\theta(z_{t,l} \mid z_{t,<l}, x_{<t}, z_{<t,1:L}) \Big]   (3)

The generation process can be interpreted as follows. At each time stamp t, we sample the stochastic latent variables z_{t,l} from prior distributions which are conditioned on the lower-level latent variables and the historical records, namely the data samples x_{<t} and latent variables z_{<t}. Then we sample the new data sample x_t according to all sampled latent variables and the historical records. Through this process, new sequential data samples are generated recursively.

In Stochastic WaveNet, the prior distribution p_\theta(z_{t,l} \mid x_{<t}, z_{t,<l}, z_{<t,1:L}) = \mathcal{N}(z_{t,l}; \mu_{t,l}, v_{t,l}) is defined as a Gaussian distribution with a diagonal covariance matrix. The sequential and hierarchical dependencies among the latent variables are modeled by the WaveNet architecture. In order to summarize all historical information, we introduce two stochastic hidden variables h and d, which are calculated as

h_{t,l} = f_{\theta_1}(d_{t-2^l,\, l-1}, d_{t,\, l-1}), \qquad d_{t,l} = f_{\theta_2}(h_{t,l}, z_{t,l})   (4)

where f_{\theta_1} mimics the design of the dilated convolution in WaveNet, and f_{\theta_2} is a fully connected layer that summarizes the hidden state and the sampled latent variable. Different from the vanilla WaveNet, the hidden states h are stochastic because of the random samples z. We parameterize the mean and variance of the prior distributions by the hidden representations h, that is, \mu_{t,l} = f_{\theta_3}(h_{t,l}) and \log v_{t,l} = f_{\theta_4}(h_{t,l}). Similarly, we parameterize the emission probability p_\theta(x_t \mid x_{<t}, z_{t,1:L}, z_{<t,1:L}) as a neural network function of the hidden representations.
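The sketch below is our own paraphrase of one generative layer following Eq. (4); the module names `f_theta1` through `f_theta4` mirror the paper's notation, but the concrete layer types (plain linear layers with tanh activations) are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StochasticWaveNetLayer(nn.Module):
    """One generative layer l of Stochastic WaveNet, following Eq. (4) (illustrative sketch)."""
    def __init__(self, hidden_dim, z_dim, dilation):
        super().__init__()
        self.dilation = dilation                                  # 2**l for layer l
        self.f_theta1 = nn.Linear(2 * hidden_dim, hidden_dim)     # stands in for the dilated conv
        self.f_theta2 = nn.Linear(hidden_dim + z_dim, hidden_dim)
        self.f_theta3 = nn.Linear(hidden_dim, z_dim)              # prior mean head
        self.f_theta4 = nn.Linear(hidden_dim, z_dim)              # prior log-variance head

    def forward(self, d_below, t):
        """d_below: (T, hidden_dim) states d_{.,l-1}; returns h_{t,l}, d_{t,l}, z_{t,l}."""
        # h_{t,l} = f_theta1(d_{t-2^l, l-1}, d_{t, l-1}); use zeros before the sequence start
        past = d_below[t - self.dilation] if t - self.dilation >= 0 \
            else torch.zeros_like(d_below[t])
        h_tl = torch.tanh(self.f_theta1(torch.cat([past, d_below[t]])))
        # prior p(z_{t,l} | .) = N(mu, diag(v)); draw a reparameterized sample
        mu, log_v = self.f_theta3(h_tl), self.f_theta4(h_tl)
        z_tl = mu + (0.5 * log_v).exp() * torch.randn_like(mu)
        # d_{t,l} = f_theta2(h_{t,l}, z_{t,l})
        d_tl = torch.tanh(self.f_theta2(torch.cat([h_tl, z_tl])))
        return h_tl, d_tl, z_tl
```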

3.2. Variational Inference for Stochastic WaveNet

Instead of directly maximizing the log-likelihood of a sequential sample x, we optimize its variational evidence lower bound (ELBO) (Jordan et al., 1999). Exact posterior inference for the stochastic variables z of Stochastic WaveNet is intractable. Hence, we describe a variational inference method for Stochastic WaveNet that utilizes the reparameterization trick introduced in (Kingma and Welling, 2013). Firstly, we write the ELBO as

\log p(x) \ge \int q_\phi(z \mid x) \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \, dz = \mathbb{E}_{q_\phi(z \mid x)} \Big[ \sum_{t=1}^{T} \log p_\theta(x_t \mid z_{\le t,1:L}, x_{<t}) \Big] - D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = \mathcal{L}(x),
where p_\theta(z \mid x) = \prod_{t=1}^{T} \prod_{l=1}^{L} p_\theta(z_{t,l} \mid z_{t,<l}, x_{<t}, z_{<t,1:L})   (5)

The second equality is derived by substituting Eq. 3 into the first line, and \mathcal{L}(x) denotes the loss function for the sample x. Another problem that needs to be addressed here is how to define the posterior distribution q_\phi(z \mid x). In order to maximize the ELBO, we factorize the posterior as

q_\phi(z \mid x) = \prod_{t=1}^{T} \prod_{l=1}^{L} q_\phi(z_{t,l} \mid x, z_{t,<l}, z_{<t,1:L})   (6)

Here the posterior distribution of z_{t,l} is conditioned on the stochastic latent variables sampled before it and the entire observed sequence x. By utilizing the future data x_{\ge t}, we can better maximize the first term in \mathcal{L}(x), the reconstruction term. In contrast, the prior distribution of z_{t,l} is only conditioned on x_{<t}, so encoding information from x_{\ge t} may increase the mismatch between the prior and posterior distributions, namely enlarging the KL term in the loss function.
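As an illustration of the two terms of \mathcal{L}(x) in Eq. (5) (our sketch, using the standard closed-form KL between diagonal Gaussians, not code from the paper), the per-(t, l) contributions can be computed as follows; the reconstruction term uses a reparameterized posterior sample, while the KL term compares the posterior q_\phi(z_{t,l} \mid \cdot) against the prior p_\theta(z_{t,l} \mid \cdot).

```python
import math

def gaussian_kl(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL( N(mu_q, diag(exp(log_var_q))) || N(mu_p, diag(exp(log_var_p))) )."""
    return 0.5 * (log_var_p - log_var_q
                  + (log_var_q.exp() + (mu_q - mu_p) ** 2) / log_var_p.exp()
                  - 1.0).sum()

def gaussian_log_prob(x, mu, log_var):
    """Diagonal-Gaussian log-density, used for the reconstruction term."""
    return -0.5 * ((x - mu) ** 2 / log_var.exp() + log_var + math.log(2 * math.pi)).sum()

def elbo_contribution(x_t, recon_mu, recon_log_var, mu_q, log_var_q, mu_p, log_var_p):
    """One (t, l) contribution to L(x): reconstruction term minus KL term."""
    recon = gaussian_log_prob(x_t, recon_mu, recon_log_var)
    kl = gaussian_kl(mu_q, log_var_q, mu_p, log_var_p)
    return recon - kl
```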

Exploring the dependency structure in WaveNet. By analyzing the dependencies among the outputs and hidden states of WaveNet, however, we find that the stochastic latent variables at time stamp t, z_{t,1:L}, do not influence all of the posterior outputs. Hence the inference network only requires the partial posterior outputs that each latent variable actually influences in order to maximize the reconstruction term of the loss function. Denote the set of outputs influenced by z_{t,l} as s(l, t). The posterior distribution can then be modified as

q_\phi(z \mid x) = \prod_{t=1}^{T} \prod_{l=1}^{L} q_\phi(z_{t,l} \mid x_{<t}, x_{s(l,t)}, z_{t,<l}, z_{<t,1:L})   (7)

The modified posterior distribution removes unnecessary conditioning variables, which makes the optimization more efficient. To summarize the information from the posterior outputs x_{s(l,t)}, we design a reversed WaveNet architecture to compute hidden features b, illustrated in Figure 2b, where b_{t,l} is formulated as

b_{t,l} = f(x_{s(l,t)}) = f_{\phi_1}(b_{t,\, l+1}, b_{t+2^{l+1},\, l+1})   (8)

where we define b_{t,L} = f_{\phi_2}(x_t, x_{t+2^{L+1}}), and f_{\phi_1} and f_{\phi_2} are dilated convolution layers whose structure is a reversed version of f_{\theta_1} in Eq. 4. Finally, we infer the posterior distribution q_\phi(z_{t,l} \mid x, z_{t,<l}, z_{<t,1:L}) = \mathcal{N}(z_{t,l}; \mu'_{t,l}, v'_{t,l}) from b and h, with \mu'_{t,l} = f_{\phi_3}(h_{t,l}, b_{t,l}) and \log v'_{t,l} = f_{\phi_4}(h_{t,l}, b_{t,l}). Here, we reuse the stochastic hidden states h from the generative model in order to reduce the number of model parameters.
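Below is a sketch of the reversed-WaveNet inference features of Eq. (8), under our own simplifying assumptions (linear layers with tanh in place of the actual dilated convolutions, zero padding past the end of the sequence, and illustrative future offsets); information flows backward in time and from the top layer downward, so b_{t,l} only summarizes the future outputs x_{s(l,t)} that z_{t,l} can influence.

```python
import torch
import torch.nn as nn

class ReversedWaveNetInference(nn.Module):
    """Illustrative sketch of the inference features b of Eq. (8) and the posterior heads."""
    def __init__(self, input_dim, hidden_dim, z_dim, n_layers):
        super().__init__()
        self.n_layers = n_layers
        self.f_phi2 = nn.Linear(2 * input_dim, hidden_dim)            # top layer L, reads x directly
        self.f_phi1 = nn.ModuleList(
            [nn.Linear(2 * hidden_dim, hidden_dim) for _ in range(n_layers - 1)]
        )
        self.f_phi3 = nn.Linear(2 * hidden_dim, z_dim)                # posterior mean from (h, b)
        self.f_phi4 = nn.Linear(2 * hidden_dim, z_dim)                # posterior log-variance from (h, b)

    def forward(self, x, h):
        """x: (T, input_dim); h: (n_layers, T, hidden_dim) reused from the generative model."""
        T = x.shape[0]

        def future(seq, offset):
            # the sequence shifted 'offset' steps into the future, zero-padded past the end
            padded = torch.cat([seq, torch.zeros(offset, seq.shape[1])])
            return padded[offset:offset + T]

        # top layer: b_{t,L} combines x_t with a future frame (reversed dilation pattern)
        b = torch.tanh(self.f_phi2(torch.cat([x, future(x, 2 ** self.n_layers)], dim=-1)))
        bs = [b]
        # lower layers: b_{t,l} = f_phi1(b_{t,l+1}, b_{t+2^{l+1}, l+1})
        for l in range(self.n_layers - 1, 0, -1):
            b = torch.tanh(self.f_phi1[l - 1](
                torch.cat([bs[-1], future(bs[-1], 2 ** (l + 1))], dim=-1)))
            bs.append(b)
        b_all = torch.stack(bs[::-1])                                 # (n_layers, T, hidden_dim)
        # posterior parameters mu', log v' from (h, b), as in f_phi3 / f_phi4
        hb = torch.cat([h, b_all], dim=-1)
        return self.f_phi3(hb), self.f_phi4(hb)
```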

KL Annealing Trick. It is well known that deep neural networks with multiple layers of stochastic latent variables are difficult to train; one important reason is that, in the early stages of training, the KL term in the loss function limits the capacity of the stochastic latent variables to compress the data information. KL annealing is a common trick to alleviate this issue. The objective function is redefined as

\mathcal{L}_\lambda(x) = \mathbb{E}_{q_\phi(z \mid x)} \Big[ \sum_{t=1}^{T} \log p_\theta(x \mid z) \Big] - \lambda \, D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))   (9)

During the training process, λ is annealed from 0 to 1. In previous works, researchers usually adopt a linear annealing strategy (Fraccaro et al., 2016; Goyal et al., 2017). In our experiments, we find that this still increases λ too fast for Stochastic WaveNet. We instead propose a cosine annealing strategy, namely λ follows the function \lambda_\alpha = 1 - \cos(\alpha), where α scans from 0 to π/2.
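The cosine schedule takes only a few lines; the sketch below (ours, with a hypothetical `total_annealing_steps` hyperparameter) maps training progress to \alpha \in [0, \pi/2] and returns \lambda_\alpha = 1 - \cos(\alpha), which rises more slowly than a linear schedule at the start of training.

```python
import math

def kl_annealing_weight(step, total_annealing_steps):
    """Cosine KL-annealing weight: lambda = 1 - cos(alpha), with alpha swept from 0 to pi/2."""
    progress = min(step / total_annealing_steps, 1.0)   # fraction of the annealing phase completed
    alpha = progress * math.pi / 2.0
    return 1.0 - math.cos(alpha)

# Example usage for the objective in Eq. (9):
#   loss = -(reconstruction_term - kl_annealing_weight(step, total_annealing_steps) * kl_term)
```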

Figure 2. A toy example of Stochastic WaveNet with two layers. (a) is the generative model and (b) is the inference model. Both the solid and dashed lines represent neural network functions. The h in the inference model is identical to that in the generative model. The z in the generative model are sampled from the prior distributions, while those in the inference model are sampled from the posterior.

4. Experiment

In this section, we evaluate the proposed Stochastic WaveNet on several benchmark datasets from various domains, including natural speech, human handwriting and human motion modeling tasks. We show that Stochastic WaveNet, or SWaveNet in short, achieves state-of-the-art results, and we visualize the generated samples for the human handwriting domain. The experiment code is publicly available at https://github.com/laiguokun/SWaveNet.

Baselines. The following sequential generative models proposed in recent years are treated as the baseline methods:

• RNN: The original recurrent neural network with the LSTM cell.

• VRNN: The generative model with the recurrent structure proposed in (Chung et al., 2015). It first formulates the prior distribution of z_t as a conditional probability given the historical data x_{<t} and latent variables z_{<t}.

• SRNN: Proposed in (Fraccaro et al., 2016); it augments the inference network with a backward RNN to better optimize the ELBO.

• Z-forcing: Proposed in (Goyal et al., 2017); its architecture is similar to SRNN, and it eases the training of the stochastic latent variables by adding an auxiliary cost which forces the model to use the stochastic latent variables z to reconstruct the future data x_{>t}.

• WaveNet: Proposed in (Van Den Oord et al., 2016); it produces state-of-the-art results in the speech generation task.

We evaluate the different models by comparing the log-likelihood on the test set (RNN, WaveNet) or its lower bound (VRNN, SRNN, Z-forcing and our method). For a fair comparison, a multivariate Gaussian distribution with a diagonal covariance matrix is used as the output distribution for each time stamp in all experiments. The Adam optimizer (Kingma and Ba, 2014) is used for all models, and the learning rate is scheduled by cosine annealing. Following the experiment setting in (Fraccaro et al., 2016), we use one sample to approximate the variational evidence lower bound in order to reduce the computation cost.

4.1. Natural Speech Modeling

In the natural speech modeling task, we train the model to fit the log-likelihood function of the raw audio signals, following the experiment setting in (Fraccaro et al., 2016; Goyal et al., 2017). The raw signals, which correspond to real-valued amplitudes, are represented as a sequence of 200-dimensional frames; each frame consists of 200 consecutive samples. The preprocessing and dataset segmentation are identical to (Fraccaro et al., 2016; Goyal et al., 2017). We evaluate the proposed model on the following benchmark datasets:

• Blizzard (Prahallad et al., 2013): The Blizzard Challenge 2013, a text-to-speech dataset containing 300 hours of English from a single female speaker.

• TIMIT (https://catalog.ldc.upenn.edu/ldc93s1): The TIMIT raw audio dataset, which contains 6,300 English sentences read by 630 speakers.

For the Blizzard dataset, we report the average log-likelihood over half-second segments of the test set. For the TIMIT dataset, we report the average log-likelihood over each sequence of the test set, following the setting in (Fraccaro et al., 2016; Goyal et al., 2017). In this task, we use a 5-layer SWaveNet architecture with 1024 hidden dimensions for Blizzard and 512 for TIMIT, and the dimension of the stochastic latent variables is 100 for both datasets.

The experiment results are shown in Table 1. The proposed model produces the best result on both datasets.


Method                   Blizzard    TIMIT
RNN                      7413        26643
VRNN                     ≥ 9392      ≥ 28982
SRNN                     ≥ 11991     ≥ 60550
Z-forcing (+kla)         ≥ 14226     ≥ 68903
Z-forcing (+kla, aux)*   ≥ 15024     ≥ 70469
WaveNet                  -5777       26074
SWaveNet                 ≥ 15708     ≥ 72463
                         (±274)      (±639)

Table 1. Test set log-likelihoods on the natural speech modeling task. The first group contains RNN-based models; the second group contains WaveNet-based models. Best results are highlighted in bold. * denotes that the training objective is equipped with an auxiliary term which the other methods do not have. For SWaveNet, we report the mean (± standard deviation) over 10 different runs.

Since the performance gap is not large, we also report the variance of the proposed model's performance by rerunning the model with 10 random seeds, which shows consistent performance. Compared with the WaveNet model, the one without stochastic hidden states, SWaveNet obtains a significant performance boost. At the same time, SWaveNet still enjoys the advantage of parallel training compared with RNN-based stochastic models. One common concern about SWaveNet is that it may require a larger hidden dimension for the stochastic latent variables than RNN-based models due to its multi-layer structure. However, the total dimension of the stochastic latent variables for one time stamp of SWaveNet is 500, which is twice the number used in the SRNN and Z-forcing papers (Fraccaro et al., 2016; Goyal et al., 2017). We further discuss the relationship between the number of stochastic layers and the model performance in Section 4.3.

4.2. Handwriting and Human Motion Generation

Next, we evaluate the proposed model by visualizing generated samples from the trained model. The domain we choose is human handwriting, where writing traces are described by a sequence of sample points. The following dataset is used to train the generative model:

IAM-OnDB (Liwicki and Bunke, 2005): A human handwriting dataset that contains 13,040 handwriting lines written by 500 writers. The writing trajectories are represented as sequences of 3-dimensional frames. Each frame is composed of two real-valued numbers, the (x, y) coordinates of the sample point, and a binary number indicating whether the pen is touching the paper. The data preprocessing and division are the same as in (Graves, 2013; Chung et al., 2015).

The quantitative results are reported in Table 2.

Figure 3. Generated handwriting samples: (a) samples from the ground-truth data; (b), (c) and (d) samples from RNN, VRNN and SWaveNet, respectively. Each line is one handwriting sample.

Method                       IAM-OnDB
RNN (Chung et al., 2015)     1016
VRNN (Chung et al., 2015)    ≥ 1334
WaveNet                      1021
SWaveNet                     ≥ 1301

Table 2. Log-likelihood results on the IAM-OnDB dataset. The best result is highlighted in bold.

SWaveNet achieves a result comparable to the best baseline, and still shows a significant improvement over the vanilla WaveNet architecture. In Figure 3, we plot the ground-truth samples and the ones randomly generated from different models. Compared with RNN and VRNN, SWaveNet produces clearer results: it is easier to distinguish the boundaries of the characters, and we can observe that more of them resemble English characters, such as "is" in the fourth line and "her" in the last line.

4.3. Influence of Stochastic Latent Variables

The most prominent distinction between SWaveNet and RNN-based stochastic neural networks is that SWaveNet utilizes the dilated convolution layers to model multiple layers of stochastic latent variables rather than the single layer of latent variables in the RNN models. Here, we perform an empirical study of the number of stochastic layers in the SWaveNet model to demonstrate the effectiveness of the multi-layer design of stochastic latent variables. The experiment is designed as follows. We retain the total number of layers and only change the number of stochastic layers, namely the layers that contain stochastic latent variables. More specifically, for a SWaveNet with L layers and S stochastic layers (S ≤ L), we eliminate the stochastic latent variables in the bottom part, which is \{z_{1:T,l};\, l < L - S\} in Eq. 4. Then, for each time stamp, when the model has D-dimensional stochastic variables in total, each stochastic layer has \lfloor D/S \rfloor-dimensional stochastic variables. In this experiment, we set D = 500.

The experiment results are plotted in Figure 4.

Figure 4. The influence of the number of stochastic layers of SWaveNet on (a) Blizzard, (b) TIMIT and (c) IAM-OnDB.

From the plots, we find that SWaveNet achieves better performance with multiple stochastic layers. This demonstrates that it is helpful to encode the stochastic latent variables with a hierarchical structure. In the experiments on Blizzard and IAM-OnDB, we also observe that the performance decreases when the number of stochastic layers becomes too large, because too many stochastic layers leave too few latent dimensions per layer to capture useful information at the different hierarchy levels.

We also study how the model performance is influenced by the number of stochastic latent variables. Similar to the previous experiment, we only tune the total number of stochastic latent variables and keep the remaining settings unchanged, using 4 stochastic layers. The results are plotted in Figure 5. They demonstrate that Stochastic WaveNet benefits from even a small number of stochastic latent variables.

Figure 5. The influence of the number of stochastic latent variables of SWaveNet on (a) Blizzard, (b) TIMIT and (c) IAM-OnDB.

5. Conclusion

In this paper, we present a novel generative latent variable model for sequential data, named Stochastic WaveNet, which injects stochastic latent variables into the hidden states of WaveNet. A new inference network structure is designed based on the characteristics of the WaveNet architecture. Empirical results show state-of-the-art performance on various domains by leveraging the additional stochastic latent variables. Simultaneously, the training process is greatly accelerated by parallel computation compared with RNN-based models. For future work, a potential research direction is to adapt the advanced training strategies (Goyal et al., 2017; Shabanian et al., 2017) designed for sequential stochastic neural networks to Stochastic WaveNet.

References

Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. (2015). A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988.

Engel, J., Resnick, C., Roberts, A., Dieleman, S., Eck, D., Simonyan, K., and Norouzi, M. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders. arXiv preprint arXiv:1704.01279.

Fabius, O. and van Amersfoort, J. R. (2014). Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.

Fraccaro, M., Sønderby, S. K., Paquet, U., and Winther, O. (2016). Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207.

Gan, Z., Li, C., Henao, R., Carlson, D. E., and Carin, L. (2015). Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pages 2467–2475.

Goyal, A., Sordoni, A., Côté, M.-A., Ke, N. R., and Bengio, Y. (2017). Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, pages 6716–6726.

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Gu, S., Ghahramani, Z., and Turner, R. E. (2015). Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems, pages 2629–2637.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Liwicki, M. and Bunke, H. (2005). IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 956–961. IEEE.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759.

Oord, A. v. d., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.

Prahallad, K., Vadapalli, A., Elluru, N., Mantena, G., Pulugundla, B., Bhaskararao, P., Murthy, H., King, S., Karaiskos, V., and Black, A. (2013). The Blizzard Challenge 2013 – Indian language task. In Blizzard Challenge Workshop, volume 2013.

Semeniuta, S., Severyn, A., and Barth, E. (2017). A hybrid convolutional variational autoencoder for text generation. arXiv preprint arXiv:1702.02390.

Shabanian, S., Arpit, D., Trischler, A., and Bengio, Y. (2017). Variational bi-LSTMs. arXiv preprint arXiv:1711.05717.

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2017a). Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

Yang, Z., Hu, Z., Salakhutdinov, R., and Berg-Kirkpatrick, T. (2017b). Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139.

Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.