
Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction

Chunting Zhou, Graham Neubig
Language Technologies Institute

Carnegie Mellon University
ctzhou,[email protected]

Abstract

Labeled sequence transduction is a task of transforming one sequence into another sequence that satisfies desiderata specified by a set of labels. In this paper we propose multi-space variational encoder-decoders, a new model for labeled sequence transduction with semi-supervised learning. The generative model can use neural networks to handle both discrete and continuous latent variables to exploit various features of data. Experiments show that our model not only provides a powerful supervised framework but can also effectively take advantage of unlabeled data. On the SIGMORPHON morphological inflection benchmark, our model outperforms single-model state-of-the-art results by a large margin for the majority of languages.1

1 Introduction

This paper proposes a model for labeled sequence transduction tasks: tasks where we are given an input sequence and a set of labels, from which we are expected to generate an output sequence that reflects the content of the input sequence and the desiderata specified by the labels. Several examples of these tasks exist in prior work: using labels to moderate politeness in machine translation results (Sennrich et al., 2016), modifying the output language of a machine translation system (Johnson et al., 2016), or controlling the length of a summary in summarization (Kikuchi et al., 2016). In particular, however, we are motivated by the task of morphological reinflection (Cotterell et al., 2016), which we will use as an example in our description and as a test bed for our models.

1 An implementation of our model is available at https://github.com/violet-zct/MSVED-morph-reinflection.

Figure 1: Standard supervised labeled sequence transduction, and our proposed semi-supervised method. (In the figure, the model transduces "playing" with the labels pos=Verb, tense=Past into "played"; semi-supervised learning additionally exploits unlabeled forms such as "plays".)


In morphologically rich languages, different affixes (i.e. prefixes, infixes, suffixes) can be combined with the lemma to reflect various syntactic and semantic features of a word. The ability to accurately analyze and generate morphological forms is crucial to creating applications such as machine translation (Chahuneau et al., 2013; Toutanova et al., 2008) or information retrieval (Darwish and Oard, 2007) in these languages. As shown in Fig. 1, re-inflection of an inflected form given target linguistic labels is a challenging sub-task of handling morphology as a whole, in which we take as input an inflected form (in the example, "playing") and labels representing the desired form (pos=Verb, tense=Past), and must generate the desired form ("played").

Approaches to this task include those utilizing hand-crafted linguistic rules and heuristics (Taji et al., 2016), as well as learning-based approaches using alignment and extracted transduction rules (Durrett and DeNero, 2013; Alegria and Etxeberria, 2016; Nicolai et al., 2016). There have also been methods proposed using neural sequence-to-sequence models (Faruqui et al., 2016; Kann et al., 2016; Östling, 2016), and currently ensembles of attentional encoder-decoder models (Kann and Schütze, 2016a,b) have achieved state-of-the-art results on this task. One feature of these neural models, however, is that they are trained in a largely supervised fashion (top of Fig. 1), using data explicitly labeled with the input sequence and labels, along with the output representation.


Needless to say, the ability to obtain this annotated data for many languages is limited. However, we can expect that for most languages we can obtain large amounts of unlabeled surface forms, which may allow for semi-supervised learning over this unlabeled data (entirety of Fig. 1).2

In this work, we propose a new framework for labeled sequence transduction problems: multi-space variational encoder-decoders (MSVED, §3.3). MSVEDs employ continuous and discrete latent variables belonging to multiple separate probability distributions3 to explain the observed data. In the example of morphological reinflection, we introduce a vector of continuous random variables that represents the lemma of the source and target words, and one discrete random variable for each of the labels on the source or the target side.

This model has the advantage of both providing a powerful modeling framework for supervised learning and allowing for learning in an unsupervised setting. For labeled data, we maximize the variational lower bound on the marginal log likelihood of the data and annotated labels. For unlabeled data, we train an auto-encoder to reconstruct a word conditioned on its lemma and morphological labels. While these labels are unavailable, a set of discrete latent variables is associated with each unlabeled word; we can then perform posterior inference on these latent variables and maximize the variational lower bound on the marginal log likelihood of the data.

Experiments on the SIGMORPHON morphological reinflection task (Cotterell et al., 2016) find that our model beats the state of the art for a single model in the majority of languages, and is particularly effective in languages with more complicated inflectional phenomena. Further, we find that semi-supervised learning allows for significant further gains. Finally, qualitative evaluation of lemma representations finds that our model is able to learn lemma embeddings that match human intuition.

2 Faruqui et al. (2016) have attempted a limited form of semi-supervised learning by re-ranking with a standard n-gram language model, but this is not integrated with the learning process for the neural model and the gains are limited.

3 Analogous to multi-space hidden Markov models (Tokuda et al., 2002).

2 Labeled Sequence Transduction

In this section, we first present some notation regarding labeled sequence transduction problems in general, then describe a particular instantiation for morphological reinflection.

Notation: Labeled sequence transduction problems involve transforming a source sequence $x^{(s)}$ into a target sequence $x^{(t)}$, with some labels describing the particular variety of transformation to be performed. We use discrete variables $y^{(t)}_1, y^{(t)}_2, \cdots, y^{(t)}_K$ to denote the labels associated with each target sequence, where $K$ is the total number of labels. Let $y^{(t)} = [y^{(t)}_1, y^{(t)}_2, \cdots, y^{(t)}_K]$ denote a vector of these discrete variables. Each discrete variable $y^{(t)}_k$ represents a categorical feature pertaining to the target sequence, and has a set of possible labels. In the later sections, we also use $y^{(t)}$ and $y^{(t)}_k$ to denote discrete latent variables corresponding to these labels.

Given a source sequence $x^{(s)}$ and a set of associated target labels $y^{(t)}$, our goal is to generate a target sequence $x^{(t)}$ that exhibits the features specified by $y^{(t)}$ using a probabilistic model $p(x^{(t)}|x^{(s)}, y^{(t)})$. The best target sequence $\hat{x}^{(t)}$ is then given by:

$$\hat{x}^{(t)} = \operatorname*{argmax}_{x^{(t)}} \, p(x^{(t)}|x^{(s)}, y^{(t)}). \quad (1)$$

Morphological Reinflection Problem: In morphological reinflection, the source sequence $x^{(s)}$ consists of the characters in an inflected word (e.g., "playing"), while the associated labels $y^{(t)}$ describe some linguistic features (e.g., $y^{(t)}_{\mathrm{pos}} = \mathrm{Verb}$, $y^{(t)}_{\mathrm{tense}} = \mathrm{Past}$) that we hope to realize in the target. The target sequence $x^{(t)}$ is therefore the characters of the re-inflected form of the source word (e.g., "played") that satisfy the linguistic features specified by $y^{(t)}$. For this task, each discrete variable $y^{(t)}_k$ has a set of possible labels (e.g. pos=V, pos=ADJ, etc.) and follows a multinomial distribution.
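To make the data format concrete, the following minimal Python sketch encodes one labeled instance in the form used throughout this paper (source characters, target label set, target characters); the class and field names are ours, chosen purely for exposition.

```python
from typing import Dict, NamedTuple

class ReinflectionInstance(NamedTuple):
    """One labeled example: source x^(s), target labels y^(t), target x^(t)."""
    source: str                     # characters of the inflected source word
    target_labels: Dict[str, str]   # one value per label type k
    target: str                     # characters of the desired re-inflected form

# The running example from Fig. 1: "playing" + {pos=Verb, tense=Past} -> "played".
example = ReinflectionInstance(
    source="playing",
    target_labels={"pos": "Verb", "tense": "Past"},
    target="played",
)
print(example.target_labels["tense"])  # Past
```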

3 Proposed Method

3.1 Preliminaries: Variational Autoencoder

As mentioned above, our proposed model uses probabilistic latent variables in a model based on neural networks. The variational autoencoder (VAE; Kingma and Welling, 2014) is an efficient way to handle (continuous) latent variables in neural models.


Figure 2: Graphical models of (a) VAE, (b) labeled MSVAE, (c) MSVAE, (d) labeled MSVED, and (e) MSVED. White circles are latent variables and shaded circles are observed variables. Dashed lines indicate the inference process while the solid lines indicate the generative process.

We describe it briefly here; interested readers can refer to Doersch (2016) for details. The VAE learns a generative model of the probability $p(x|z)$ of observed data $x$ given a latent variable $z$, and simultaneously uses a recognition model $q(z|x)$ at learning time to estimate $z$ for a particular observation $x$ (Fig. 2(a)). $q(\cdot)$ and $p(\cdot)$ are modeled using neural networks parameterized by $\phi$ and $\theta$ respectively, and these parameters are learned by maximizing the variational lower bound on the marginal log likelihood of the data:

$$\log p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x) \| p(z)) \quad (2)$$

The KL-divergence term (a standard feature of variational methods) ensures that the distributions estimated by the recognition model $q_\phi(z|x)$ do not deviate far from our prior probability $p(z)$ over the values of the latent variables. To optimize the parameters with gradient descent, Kingma and Welling (2014) introduce a reparameterization trick that allows for training using simple back-propagation w.r.t. the Gaussian latent variables $z$. Specifically, we can express $z$ as a deterministic variable $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an independent Gaussian noise variable $\epsilon \sim \mathcal{N}(0, 1)$. The mean $\mu$ and the variance $\sigma^2$ of $z$ are reparameterized by differentiable functions w.r.t. $\phi$. Thus, instead of generating $z$ from $q_\phi(z|x)$, we sample the auxiliary variable $\epsilon$ and obtain $z = \mu_\phi(x) + \sigma_\phi(x) \circ \epsilon$, which enables gradients to backpropagate through $\phi$.
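As a concrete illustration, here is a minimal PyTorch-style sketch of the Gaussian reparameterization $z = \mu_\phi(x) + \sigma_\phi(x) \circ \epsilon$ together with the closed-form KL term of Eq. 2. The module and variable names are ours, and the sketch is not taken from the released implementation.

```python
import torch
import torch.nn as nn

class GaussianInference(nn.Module):
    """q_phi(z|x): map an encoding of x to a diagonal Gaussian over z."""

    def __init__(self, enc_dim: int, z_dim: int):
        super().__init__()
        self.mu = nn.Linear(enc_dim, z_dim)
        self.logvar = nn.Linear(enc_dim, z_dim)

    def forward(self, u):
        mu, logvar = self.mu(u), self.logvar(u)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients flow through mu and sigma (i.e., through phi).
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        # Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl
```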

3.2 Multi-space Variational Autoencoders

As an intermediate step to our full model, we next describe a generative model for a single sequence with both continuous and discrete latent variables: the multi-space variational auto-encoder (MSVAE). MSVAEs are a combination of two threads of previous work: deep generative models with both continuous and discrete latent variables for classification problems (Kingma et al., 2014; Maaløe et al., 2016), and VAEs with only continuous variables for sequential data (Bowman et al., 2016; Chung et al., 2015; Zhang et al., 2016; Fabius and van Amersfoort, 2014; Bayer and Osendorfer, 2014). In MSVAEs, we have an observed sequence $x$, continuous latent variables $z$ as in the VAE, and discrete variables $y$.

In the case of the morphology example, $x$ can be interpreted as an inflected word to be generated. $y$ is a vector representing its linguistic labels, either annotated by an annotator in the observed case, or unannotated in the unobserved case. $z$ is a vector of latent continuous variables, e.g. a latent embedding of the lemma that captures all the information about $x$ that is not already represented in the labels $y$.

MSVAE: Because inflected words can be naturally thought of as "lemma + morphological labels", to interpret a word we resort to discrete and continuous latent variables that represent the linguistic labels and the lemma respectively. In the case when the labels $y$ of the sequence are not observed, we perform inference over possible linguistic labels, and these inferred labels are referenced in generating $x$.

The generative model $p_\theta(x, y, z) = p(z)\, p_\pi(y)\, p_\theta(x|y, z)$ is defined as:

$$p(z) = \mathcal{N}(z|0, I) \quad (3)$$
$$p_\pi(y) = \prod_k \mathrm{Cat}(y_k|\pi_k) \quad (4)$$
$$p_\theta(x|y, z) = f(x; y, z, \theta). \quad (5)$$

Like the standard VAE, we assume the prior of the latent variable $z$ is a diagonal Gaussian distribution with zero mean and unit variance. We assume that each variable in $y$ is independent, resulting in the factorized distribution in Eq. 4, where $\mathrm{Cat}(y_k|\pi_k)$ is a multinomial distribution with parameters $\pi_k$. For the purposes of this study, we set these to a uniform distribution, $\pi_{k,j} = \frac{1}{|\pi_k|}$. $f(x; y, z, \theta)$ calculates the likelihood of $x$, a function parameterized by deep neural networks. Specifically, we employ an RNN decoder to generate the target word conditioned on the lemma variable $z$ and the linguistic labels $y$, detailed in §5.

When inferring the latent variables from the given data $x$, we assume the joint distribution of the latent variables $z$ and $y$ has a factorized form, i.e. $q(z, y|x) = q(z|x)\, q(y|x)$, as shown in Fig. 2(c).


The inference model is defined as follows:

$$q_\phi(z|x) = \mathcal{N}(z|\mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))) \quad (6)$$
$$q_\phi(y|x) = \prod_k q_\phi(y_k|x) = \prod_k \mathrm{Cat}(y_k|\pi_\phi(x)) \quad (7)$$

where the inference distribution over $z$ is a diagonal Gaussian distribution with mean and variance parameterized by neural networks. The inference model $q(y|x)$ on labels $y$ has the form of a discriminative classifier that generates a set of multinomial probability vectors $\pi_\phi(x)$ over all labels for each tag $y_k$. We represent each multinomial distribution $q(y_k|x)$ with an MLP.
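A minimal sketch of the factorized label inference network of Eq. 7, written in a PyTorch style, is shown below; the hidden-layer sizes and names are our own illustrative choices, not necessarily the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LabelInference(nn.Module):
    """q_phi(y|x): one small MLP with a softmax per label type k (Eq. 7)."""

    def __init__(self, enc_dim: int, num_classes_per_label):
        super().__init__()
        # One classification head per tag y_k, with n_k output classes.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(enc_dim, enc_dim), nn.Tanh(),
                          nn.Linear(enc_dim, n_k))
            for n_k in num_classes_per_label
        )

    def forward(self, h):
        # Returns one categorical probability vector pi_phi(x) per label type,
        # each of shape (batch, n_k).
        return [torch.softmax(head(h), dim=-1) for head in self.heads]
```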

The MSVAE is trained by maximizing the following variational lower bound $\mathcal{U}(x)$ on the objective for unlabeled data:

$$\log p_\theta(x) \ge \mathbb{E}_{(y,z) \sim q_\phi(y,z|x)} \log \frac{p_\theta(x,y,z)}{q_\phi(y,z|x)} = \mathbb{E}_{y \sim q_\phi(y|x)}\big[\mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z,y)] - \mathrm{KL}(q_\phi(z|x) \| p(z)) + \log p_\pi(y) - \log q_\phi(y|x)\big] = \mathcal{U}(x) \quad (8)$$

Note that this introduction of discrete variables requires more sophisticated optimization algorithms, which we will discuss in §4.1.

Labeled MSVAE: When $y$ is observed, as shown in Fig. 2(b), we maximize the following variational lower bound on the marginal log likelihood of the data and the labels:

$$\log p_\theta(x,y) \ge \mathbb{E}_{z \sim q_\phi(z|x)} \log \frac{p_\theta(x,y,z)}{q_\phi(z|x)} = \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|y,z) + \log p_\pi(y)] - \mathrm{KL}(q_\phi(z|x) \| p(z)) \quad (9)$$

which is a simple extension of Eq. 2.

Note that the inference model $q_\phi(y|x)$, used when labels are not observed, has the form of a discriminative classifier; we can therefore use observed labels as a supervision signal to learn a better classifier. In this case we also minimize the following cross-entropy as the classification loss:

$$\mathcal{D}(x,y) = \mathbb{E}_{(x,y) \sim p_l(x,y)}[- \log q_\phi(y|x)] \quad (10)$$

where $p_l(x,y)$ is the distribution of labeled data. This is a form of multi-task learning, as this additional loss also informs the learning of our representations.
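Because $q_\phi(y|x)$ factorizes over label types (Eq. 7), the classification loss is simply a sum of per-label cross-entropies. A small illustrative helper (argument names ours), assuming the classifier heads return unnormalized logits:

```python
import torch.nn.functional as F

def classification_loss(label_logits, gold_labels):
    """D(x, y) of Eq. 10: cross-entropy of each label classifier q_phi(y_k|x)
    against the observed label, summed over label types and averaged over the batch.

    label_logits: list of tensors, one per label type, each of shape (batch, n_k)
    gold_labels:  list of tensors, one per label type, each of shape (batch,)
    """
    return sum(F.cross_entropy(logits, gold)
               for logits, gold in zip(label_logits, gold_labels))
```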

3.3 Multi-space Variational Encoder-Decoders

Finally, we discuss the full proposed method: the multi-space variational encoder-decoder (MSVED), which generates the target $x^{(t)}$ from the source $x^{(s)}$ and labels $y^{(t)}$. Again, we discuss two cases of this model: labels of the target sequence observed, and not observed.

MSVED: The graphical model for the MSVED is given in Fig. 2(e). Because the labels of the target sequence are not observed, once again we treat them as discrete latent variables and perform inference on these labels conditioned on the target sequence. The generative process for the MSVED is very similar to that of the MSVAE, with one important exception: while the standard MSVAE conditions the recognition model $q(z|x)$ on $x$ and then generates $x$ itself, the MSVED conditions the recognition model $q(z|x^{(s)})$ on the source $x^{(s)}$ and then generates the target $x^{(t)}$. Because only the recognition model is changed, the generative equations for $p_\theta(x^{(t)}, y^{(t)}, z)$ are exactly the same as Eqs. 3–5, with $x^{(t)}$ swapped for $x$ and $y^{(t)}$ swapped for $y$. The variational lower bound on the conditional log likelihood, however, is affected by the recognition model, and thus is computed as:

$$\log p_\theta(x^{(t)}|x^{(s)}) \ge \mathbb{E}_{(y^{(t)},z) \sim q_\phi(y^{(t)},z|x^{(s)},x^{(t)})} \log \frac{p_\theta(x^{(t)},y^{(t)},z|x^{(s)})}{q_\phi(y^{(t)},z|x^{(s)},x^{(t)})}$$
$$= \mathbb{E}_{y^{(t)} \sim q_\phi(y^{(t)}|x^{(t)})}\big[\mathbb{E}_{z \sim q_\phi(z|x^{(s)})}[\log p_\theta(x^{(t)}|y^{(t)},z)] - \mathrm{KL}(q_\phi(z|x^{(s)}) \| p(z)) + \log p_\pi(y^{(t)}) - \log q_\phi(y^{(t)}|x^{(t)})\big] = \mathcal{L}_u(x^{(t)}|x^{(s)}) \quad (11)$$

Labeled MSVED: When the complete form of $x^{(s)}$, $y^{(t)}$, and $x^{(t)}$ is observed in our training data, the graphical model of the labeled MSVED is illustrated in Fig. 2(d). We maximize the variational lower bound on the conditional log likelihood of observing $x^{(t)}$ and $y^{(t)}$ as follows:

$$\log p_\theta(x^{(t)},y^{(t)}|x^{(s)}) \ge \mathbb{E}_{z \sim q_\phi(z|x^{(s)})} \log \frac{p_\theta(x^{(t)},y^{(t)},z|x^{(s)})}{q_\phi(z|x^{(s)})}$$
$$= \mathbb{E}_{z \sim q_\phi(z|x^{(s)})}[\log p_\theta(x^{(t)}|y^{(t)},z) + \log p_\pi(y^{(t)})] - \mathrm{KL}(q_\phi(z|x^{(s)}) \| p(z)) = \mathcal{L}_l(x^{(t)},y^{(t)}|x^{(s)}) \quad (12)$$

4 Learning MSVED

Now that we have described our overall model, we discuss details of the learning process that prove useful to its success.

4.1 Learning Discrete Latent Variables

One challenge in training our model is that it is not trivial to perform back-propagation through discrete random variables, and thus it is difficult to learn in models containing discrete tags, such as the MSVAE or MSVED.4 To alleviate this problem, we use the recently proposed Gumbel-Softmax trick (Maddison et al., 2014; Gumbel and Lieblein, 1954) to create a differentiable estimator for categorical variables.

The Gumbel-Max trick (Gumbel and Lieblein, 1954) offers a simple way to draw samples from a categorical distribution with class probabilities $\pi_1, \pi_2, \cdots$ by using the argmax operation as follows: $\mathrm{one\_hot}(\operatorname{argmax}_i[g_i + \log \pi_i])$, where $g_1, g_2, \cdots$ are i.i.d. samples drawn from the Gumbel(0,1) distribution.5 When making inferences on the morphological labels $y_1, y_2, \cdots$, the Gumbel-Max trick can be approximated by the continuous softmax function with temperature $\tau$ to generate a sample vector $y_i$ for each label $i$:

$$y_{ij} = \frac{\exp((\log(\pi_{ij}) + g_{ij})/\tau)}{\sum_{k=1}^{N_i} \exp((\log(\pi_{ik}) + g_{ik})/\tau)} \quad (13)$$

where $N_i$ is the number of classes of label $i$. When $\tau$ approaches zero, the generated sample $y_i$ becomes a one-hot vector; when $\tau > 0$, $y_i$ is smooth w.r.t. $\pi_i$. In experiments, we start with a relatively large temperature and decrease it gradually.
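The sampling step of Eq. 13 takes only a few lines of code. The following is a generic Gumbel-Softmax sketch (with a small epsilon added for numerical stability), not the authors' exact implementation.

```python
import torch

def gumbel_softmax_sample(log_pi, tau):
    """Draw a relaxed (differentiable) one-hot sample for one label (Eq. 13).

    log_pi: log class probabilities (log pi_i1, ..., log pi_iN), shape (..., N_i)
    tau:    temperature; as tau -> 0 the sample approaches a one-hot vector.
    """
    # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)  (see footnote 5).
    u = torch.rand_like(log_pi)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    return torch.softmax((log_pi + g) / tau, dim=-1)
```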

4.2 Learning Continuous Latent Variables

The MSVED aims at generating the target sequence conditioned on the latent variable $z$ and the target labels $y^{(t)}$. This requires the encoder to generate an informative representation $z$ encoding the content of $x^{(s)}$. However, the variational lower bound in our loss function contains the KL-divergence between the approximate posterior $q_\phi(z|x)$ and the prior $p(z)$, which is relatively easy to optimize compared with learning to generate output from a latent representation. We observe that with the vanilla implementation the KL cost quickly decreases to near zero, setting $q_\phi(z|x)$ equal to the standard normal distribution.

4 Kingma et al. (2014) solve this problem by marginalizing over all labels, but this is infeasible in our case, where we have an exponential number of label combinations.

5 The Gumbel(0,1) distribution can be sampled by first drawing $u \sim \mathrm{Uniform}(0,1)$ and computing $g = -\log(-\log(u))$.

In this case, the RNN decoder can easily rely on the true output of the last time step during training to decode the next token, so that the model degenerates into an RNN language model. Hence, the latent variables are ignored by the decoder and cannot encode any useful information: the latent variable $z$ learns an undesirable distribution that coincides with the imposed prior distribution but makes no contribution to the decoder. To force the decoder to use the latent variables, we take the following two approaches, which are similar to Bowman et al. (2016).

KL-Divergence Annealing: We add a coefficient $\lambda$ to the KL cost and gradually anneal it from zero to a predefined threshold $\lambda_m$. At the early stage of training, we set $\lambda$ to zero and let the model first figure out how to project the representation of the source sequence to a roughly right point in the space, and then regularize it with the KL cost. Although we are not optimizing the tight variational lower bound, the model balances well between generation and regularization. This technique can also be seen in Kocisky et al. (2016) and Miao and Blunsom (2016).

Input Dropout in the Decoder: Besides annealing the KL cost, we also randomly drop out the input token with a probability of $\beta$ at each time step of the decoder during learning. The previous ground-truth token embedding is replaced with a zero vector when dropped. In this way, the RNN decoder cannot fully rely on the ground-truth previous token, which ensures that the decoder uses the information encoded in the latent variables.
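Both tricks are straightforward to express in code. The sketch below shows a linear KL-annealing schedule toward the threshold $\lambda_m$ and word-level input dropout that zeroes the previous ground-truth character embedding with probability $\beta$. The schedule shape and all names are our assumptions; the paper only specifies the final values used ($\lambda_m = 0.2$, $\beta = 0.4$, see §5).

```python
import torch

def kl_weight(step, warmup_steps, lambda_max=0.2):
    """Linearly anneal the KL coefficient lambda from 0 up to lambda_max."""
    return min(lambda_max, lambda_max * step / warmup_steps)

def drop_decoder_inputs(char_embeddings, beta=0.4, training=True):
    """Randomly replace previous-token embeddings with zero vectors.

    char_embeddings: (batch, seq_len, emb_dim) ground-truth decoder inputs
                     (shifted by one position for teacher forcing).
    """
    if not training:
        return char_embeddings
    keep = (torch.rand(char_embeddings.shape[:2],
                       device=char_embeddings.device) >= beta)
    return char_embeddings * keep.unsqueeze(-1).to(char_embeddings.dtype)
```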

5 Architecture for Morphological Reinflection

Training details: For the morphological reinflection task, our supervised training data consists of source $x^{(s)}$, target $x^{(t)}$, and target tags $y^{(t)}$. We test three variants of our model, trained using different types of data and different loss functions. First, the single-directional supervised model (SD-Sup) is purely supervised: it only decodes the target word from the given source word, with the loss function $\mathcal{L}_l(x^{(t)},y^{(t)}|x^{(s)})$ from Eq. 12. Second, the bi-directional supervised model (BD-Sup) is trained in both directions, decoding the target word from the source word and decoding the source word from the target word, which corresponds to the loss function $\mathcal{L}_l(x^{(t)},y^{(t)}|x^{(s)}) + \mathcal{L}_u(x^{(s)}|x^{(t)})$ using Eqs. 11–12. Finally, the semi-supervised model (Semi-sup) is trained to maximize the variational lower bounds and to minimize the classification cross-entropy of Eq. 10.


Language | #LD | #ULD | MSVED SD-Sup | MSVED BD-Sup | MSVED Semi-sup | MED Single | MED Ensemble
Turkish | 12,798 | 29,608 | 93.25 | 95.66† | 97.25‡ | 89.56 | 95.00
Hungarian | 19,200 | 34,025 | 97.00 | 98.54† | 99.16‡ | 96.46 | 98.37
Spanish | 12,799 | 72,151 | 88.32 | 91.50 | 93.74 | 94.74†‡ | 96.69
Russian | 12,798 | 67,691 | 75.77 | 83.07 | 86.80‡ | 83.55† | 87.13
Navajo | 12,635 | 6,839 | 85.00 | 95.37† | 98.25‡ | 63.62 | 83.00
Maltese | 19,200 | 46,918 | 84.83 | 88.21† | 88.46‡ | 79.59 | 84.25
Arabic | 12,797 | 53,791 | 79.13 | 92.62† | 93.37‡ | 72.58 | 82.80
Georgian | 12,795 | 46,562 | 89.31 | 93.63† | 95.97‡ | 91.06 | 96.21
German | 12,777 | 56,246 | 75.55 | 84.08 | 90.28‡ | 89.11† | 92.41
Finnish | 12,800 | 74,687 | 75.59 | 85.11 | 91.20‡ | 85.63† | 93.18
Avg. Acc | – | – | 84.38 | 90.78† | 93.45‡ | 84.59 | 90.90

Table 1: Results for Task 3 of SIGMORPHON 2016 on morphological reinflection. The SD-Sup, BD-Sup, and Semi-sup columns are the proposed MSVED; Single and Ensemble are the baseline MED. † represents the best single supervised model score, ‡ represents the best model including semi-supervised models, and bold represents the best score overall. #LD and #ULD are the number of supervised examples and unlabeled words respectively.

Figure 3: Model architecture for labeled and unlabeled data (left: the supervised variational encoder-decoder, transducing a source word into its reinflected form; right: the unsupervised variational auto-encoder, reconstructing the source word). For the encoder-decoder model, only one direction from the source to the target is given. The classification model is not illustrated in the diagram.

The overall semi-supervised objective combines the lower bounds of Eqs. 8, 11, and 12 with the classification loss of Eq. 10:

$$\mathcal{L}(x^{(s)}, x^{(t)}, y^{(t)}, x) = \alpha \cdot \mathcal{U}(x) + \mathcal{L}_u(x^{(s)}|x^{(t)}) + \mathcal{L}_l(x^{(t)}, y^{(t)}|x^{(s)}) - \mathcal{D}(x^{(t)}, y^{(t)}) \quad (14)$$

The weight $\alpha$ controls the relative contribution of the loss from unlabeled data versus labeled data.

We use Monte Carlo methods to estimate the expectations over the posterior distributions $q(z|x)$ and $q(y|x)$ inside the objective function of Eq. 14. Specifically, we draw Gumbel noise and Gaussian noise one sample at a time to compute the latent variables $y$ and $z$.
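Written as code, the objective is just a weighted sum of the three bounds minus the classification loss. The sketch below mirrors Eq. 14, assuming each term has already been computed as a single-sample Monte Carlo estimate as described above; the argument names are ours.

```python
def total_loss(alpha, U_x, Lu_src_given_tgt, Ll_tgt_given_src, D_cls):
    """Overall training objective of Eq. 14, returned as a loss to minimize.

    U_x:              MSVAE bound U(x) on an unlabeled word (Eq. 8)
    Lu_src_given_tgt: MSVED bound L_u(x^(s)|x^(t)) for the reverse direction (Eq. 11)
    Ll_tgt_given_src: labeled MSVED bound L_l(x^(t), y^(t)|x^(s)) (Eq. 12)
    D_cls:            classification cross-entropy D(x^(t), y^(t)) (Eq. 10)
    """
    objective = alpha * U_x + Lu_src_given_tgt + Ll_tgt_given_src - D_cls
    return -objective  # maximizing the objective == minimizing its negative
```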

The overall model architecture is shown in Fig. 3. Each character and each label is associated with a continuous vector. We employ Gated Recurrent Units (GRUs) for the encoder and decoder. Let $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the hidden states of the forward and backward encoder RNN at time step $t$. $u$ is the hidden representation of $x^{(s)}$, obtained by concatenating the last hidden states from both directions, i.e. $[\overrightarrow{h}_T; \overleftarrow{h}_T]$, where $T$ is the word length. $u$ is used as the input for the inference model on $z$: we represent $\mu(u)$ and $\sigma^2(u)$ as MLPs and sample $z$ from $\mathcal{N}(\mu(u), \mathrm{diag}(\sigma^2(u)))$, using $z = \mu + \sigma \circ \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Similarly, we obtain the hidden representation of $x^{(t)}$ and use it as the input to the inference model on each label $y^{(t)}_i$, which is an MLP followed by a softmax layer that generates the categorical probabilities of the target labels.

In decoding, we use three types of information in calculating the probability of the next character: (1) the current decoder state, (2) a tag context vector computed with attention (Bahdanau et al., 2015) over the tag embeddings, and (3) the latent variable $z$. The intuition behind this design is that we would like the model to constantly consider the lemma represented by $z$, and also to reference the tag corresponding to the morpheme currently being generated. We do not marginalize over the latent variable $z$; instead, we use the mode $\mu$ of $z$ as its latent representation. We use beam search with a beam size of 8 to perform search over the character vocabulary at each decoding time step.
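As an illustration of this decoding step, the sketch below combines the three information sources using simple dot-product attention over the tag embeddings. The actual system uses the attention of Bahdanau et al. (2015), so treat this as a simplified stand-in with invented module and parameter names.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """Predict the next character from (decoder state, tag context, latent z)."""

    def __init__(self, hid_dim, tag_dim, z_dim, vocab_size):
        super().__init__()
        self.query = nn.Linear(hid_dim, tag_dim)
        self.out = nn.Linear(hid_dim + tag_dim + z_dim, vocab_size)

    def forward(self, state, tag_embs, z):
        # state:    (batch, hid_dim)     current decoder GRU state
        # tag_embs: (batch, K, tag_dim)  embeddings of the K target labels
        # z:        (batch, z_dim)       latent lemma variable (its mode at test time)
        scores = torch.bmm(tag_embs, self.query(state).unsqueeze(-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                  # attention over tags
        context = torch.bmm(weights.unsqueeze(1), tag_embs).squeeze(1)
        logits = self.out(torch.cat([state, context, z], dim=-1))
        return torch.log_softmax(logits, dim=-1)                 # next-character log-probs
```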


Other experimental setups: All hyperparameters are tuned on the validation set, and include the following. For KL cost annealing, $\lambda_m$ is set to 0.2 for all language settings. For character dropout at the decoder, we empirically set $\beta$ to 0.4 for all languages. We set the dimension of character embeddings to 300, tag label embeddings to 200, the RNN hidden state to 256, and the latent variable $z$ to 150. We set $\alpha$, the weight for the unsupervised loss, to 0.8. We train the model with Adadelta (Zeiler, 2012) and use early stopping with a patience of 10.
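For reference, these settings can be collected into a single configuration; the key names below are ours, but the values are the ones reported above.

```python
config = {
    "char_emb_dim": 300,            # character embedding size
    "tag_emb_dim": 200,             # tag label embedding size
    "rnn_hidden_dim": 256,          # GRU hidden state size
    "z_dim": 150,                   # latent lemma variable z
    "kl_lambda_max": 0.2,           # KL annealing threshold lambda_m
    "decoder_input_dropout": 0.4,   # character dropout rate beta
    "unsup_weight_alpha": 0.8,      # weight alpha on the unlabeled-data loss
    "beam_size": 8,
    "optimizer": "Adadelta",
    "early_stop_patience": 10,
}
```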

6 Experiments

6.1 Background: SIGMORPHON 2016

SIGMORPHON 2016 is a shared task on morphological inflection over 10 different morphologically rich languages. There are a total of three tasks, the most difficult of which is task 3, which requires the system to output the reinflection of an inflected word.6 The training data in task 3 consists of triples: (source word, target labels, target word). In the test phase, the system is asked to generate the target word given a source word and the target labels. There are a total of three tracks for each task, divided based on the amount of supervised data that can be used to solve the problem, among which track 2 has the strictest limitation of only using data for the corresponding task. As this is an ideal testbed for our method, which can learn from unlabeled data, we choose track 2 and task 3 to test our model's ability to exploit this data.

As a baseline, we compare our results with the MED system (Kann and Schütze, 2016a), which achieved state-of-the-art results in the shared task. This system used an encoder-decoder model with attention on the concatenated source word and target labels. Its best result is obtained from an ensemble of five RNN encoder-decoders (Ensemble). To make a fair comparison with our models, which do not use ensembling, we also calculated single-model results (Single).

All models are trained using the labeled training data provided for task 3. For our semi-supervised model (Semi-sup), we also leverage unlabeled data from the training and validation data for tasks 1 and 2 to train the variational auto-encoders.

6.2 Results and Analysis

From the results in Tab. 1, we can glean a number of observations. First, comparing the results of our full Semi-sup model, we can see that for all languages except Spanish, it achieves accuracies better than the single MED system, often by a large margin. Even compared to the MED ensembled model, our single-model system is quite competitive, achieving higher accuracies for Hungarian, Navajo, Maltese, and Arabic, as well as achieving average accuracies that are state-of-the-art.

6 Task 1 is inflection of a lemma, and task 2 is reinflection where the source word's labels are also provided.

Language | Prefix | Stem | Suffix
Turkish | 0.21 | 1.12 | 98.75
Hungarian | 0.00 | 0.08 | 99.79
Spanish | 0.09 | 3.25 | 90.74
Russian | 0.66 | 7.70 | 85.00
Navajo | 77.64 | 18.38 | 26.40
Maltese | 48.81 | 11.05 | 98.74
Arabic | 68.52 | 37.04 | 88.24
Georgian | 4.46 | 0.41 | 92.47
German | 0.84 | 3.32 | 89.19
Finnish | 0.02 | 12.33 | 96.16

Table 2: Percentage of inflected word forms that have modified each part of the lemma (Cotterell et al., 2016) (some words can be inflected zero or multiple times, thus sums may not add to 100%).

Figure 4: Performance on test data w.r.t. the percentage of suffixing inflection (x-axis: percentage of suffixing inflection; y-axis: accuracy; two series: MSVED and MED (1)). Points with the same x-axis value correspond to the same language's results.


Next, comparing the different varieties of our proposed models, we can see that the semi-supervised model consistently outperforms the bidirectional model for all languages. Similarly, the bidirectional model consistently outperforms the single-direction model. From these results, we can conclude that the unlabeled data is beneficial for learning latent variables that are useful for decoding the corresponding word.

Examining the linguistic characteristics of the languages in which our model performs well provides even more interesting insights. Cotterell et al. (2016) estimate how often the inflection process involves prefix changes, stem-internal changes, or suffix changes, the results of which are shown in Tab. 2. Among the many languages, the inflection processes of Arabic, Maltese, and Navajo are relatively diverse, and contain a large amount of all three forms of inflection. Examining the experimental results together with the morphological inflection processes of the different languages, we find that Navajo, Maltese, and Arabic obtain the largest gains in performance compared with the ensembled MED system.


Figure 5: Two examples of attention weights on target linguistic labels: Arabic (left: "al-'imaratiyyatu", with tags def=DEF, gen=FEM, num=PL, pos=ADJ, case=NOM, and the remaining tags set to None) and Navajo (right: "nıdajidleeh", with tags aspect=IPFV/PROG, num=PL, per=4, pos=V, mood=REAL, and arg=None). When a tag equals None, it means the word does not have this tag.

To demonstrate this visually, in Fig. 4 we compare the semi-supervised MSVED with the MED single model w.r.t. the percentage of suffixing inflection of each language, showing this clear trend.

This strongly demonstrates that our model is agnostic to different morphological inflection forms, whereas the conventional encoder-decoder with attention on the source input tends to perform better on suffixing-oriented morphological inflection. We hypothesize that for languages in which inflection mostly comes from suffixing, transduction is relatively easy because the source and target words share the same prefix and the decoder can copy the prefix of the source word via attention. However, for languages in which different inflections of a lemma go through different morphological processes, the inflected word and the target word may differ greatly, and thus it is crucial to first analyze the lemma of the inflected word before generating the corresponding reinflected form based on the target labels. This is precisely what our model does, by extracting the lemma representation $z$ learned by the variational inference model.

6.3 Analysis on Tag Attention

To analyze how the decoder attends to the linguistic labels associated with the target word, we randomly pick two words from the Arabic and Navajo test sets and plot the attention weights in Fig. 5. The Arabic word "al-'imaratiyyatu" is an adjective which means "Emirati", and its source word in the test data is "'imaratiyyin".7 Both of these are declensions of "'imaratiyy". The source word is singular, masculine, genitive, and indefinite, while the required inflection is plural, feminine, nominative, and definite.

7 https://en.wiktionary.org/wiki/%D8%A5%D9%85%D8%A7%D8%B1%D8%A7%D8%AA%D9%8A

Figure 6: Visualization of latent variables z for Maltese, with 35 pseudo-lemma groups in the figure.

We can see from the left heat map that the attention weights are turned on at several positions of the word when generating the corresponding inflections. For example, "al-" in Arabic is the definite article that marks definite nouns. The same phenomenon can also be observed in the Navajo example, as well as in other languages, but due to space limitations we do not provide a detailed analysis here.

6.4 Visualization of Latent Lemmas

To investigate the learned latent representations, in this section we visualize the $z$ vectors, examining whether the latent space groups together words with the same lemma. Each sample in SIGMORPHON 2016 contains a source word and a target word which share the same lemma. We run a heuristic process to assign pairs of words to groups that likely share a lemma, by grouping together word pairs for which at least one of the words in each pair shares a surface form. This process is not error-free (errors may occur when multiple lemmas share the same surface form), but the groupings generally reflect lemmas except in these rare erroneous cases, so we dub each of these groups a pseudo-lemma.
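This grouping heuristic can be implemented as a simple union-find over the word pairs; the sketch below is our reading of the procedure described above (merge any pairs that transitively share a surface form), not the authors' exact script.

```python
def pseudo_lemma_groups(word_pairs):
    """Group words from (source, target) pairs that transitively share a surface form."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    def union(a, b):
        parent[find(a)] = find(b)

    # The two words in a pair share a lemma, so merge their groups;
    # pairs that share a surface form end up merged transitively.
    for src, tgt in word_pairs:
        union(src, tgt)

    groups = {}
    for w in parent:
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())

# e.g. ("plays", "played") and ("played", "playing") fall into one pseudo-lemma group.
```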

In Fig. 6, we randomly pick 1,500 words from Maltese and visualize the continuous latent vectors of these words. We compute the latent vectors as $\mu_\phi(x)$ from the variational posterior inference (Eq. 6), without adding the variance. As expected, words that belong to the same pseudo-lemma (shown in the same color) are projected to adjacent points in the two-dimensional space. This demonstrates that the continuous latent variable captures the canonical form of a set of words, and shows the effectiveness of the proposed representation.


Language | Src Word | Tgt Labels | Gold Tgt | MED | Ours
Turkish | kocama | pos=N,poss=PSS1S,case=ESS,num=SG | kocamda | kocama | kocamda
Turkish | yaratmamdan | pos=N,case=NOM,num=SG | yaratma | yaratma | yaratman
Turkish | bitimizde | pos=N,tense=PST,per=1,num=SG | bittik | bitiydik | bittim
Maltese | ndammhomli | pos=V,polar=NEG,tense=PST,num=SG | tindammhiex | ndammejthiex | tindammhiex
Maltese | tqozzhieli | pos=V,polar=NEG,tense=PST,num=SG | tqozzx | tqozzx | qazzejtx
Maltese | tissikkmuhomli | pos=V,polar=POS,tense=PST,num=PL | ssikkmulna | tissikkmulna | tissikkmulna
Finnish | verovapaatta | pos=ADJ,case=PRT,num=PL | verovapaita | verovappaita | verovapaita
Finnish | turrumme | pos=V,mood=POT,tense=PRS,num=PL | turtunemme | turtunemme | turrunemme
Finnish | sukunimin | pos=N,case=PRIV,num=PL | sukunimitt | sukunimeitta | sukunimeitta

Table 3: Randomly picked output examples on the test data. Within each language's block, the first, second, and third lines are outputs where ours is correct and MED's is wrong, ours is wrong and MED's is correct, and both are wrong, respectively.

6.5 Analyzing Effects of the Size of Unlabeled Data

From Tab. 1, we can see that semi-supervised learning always performs better than supervised learning without unlabeled data. In this section, we investigate to what extent the amount of unlabeled data helps performance. We process a German corpus from a 2017 Wikipedia dump and obtain more than 100,000 German words. These words are ranked in order of occurrence frequency in Wikipedia. The data contains a certain amount of noise, since we did not apply any special processing. We shuffle all unlabeled data from both Wikipedia and the shared-task data used in the previous experiments, increase the number of unlabeled words used in learning by 10,000 at a time, and finally use all the unlabeled data (more than 150,000 words) to train the model. Fig. 7 shows that the performance on the test data improves as the amount of unlabeled data increases, which implies that unsupervised learning continues to help improve the model's ability to model the latent lemma representation even as we scale to a noisy, real, and relatively large-scale dataset. Note that the rate of improvement slows as more data is added: although the amount of unlabeled data is increasing, the model has already seen most word patterns in a relatively small vocabulary.

6.6 Case Study on Reinflected Words

In Tab. 3, we examine some model outputs on the test data from the MED system and our model respectively. It can be seen that most errors of both MED and our model can be ascribed to either over-copying or under-copying of characters. In particular, from the complete outputs we observe that our model tends to be more aggressive in its changes, resulting in it performing more complicated transformations, both successfully (such as Maltese "ndammhomli" to "tindammhiex") and unsuccessfully ("tqozzx" to "qazzejtx").

Figure 7: Performance on the German test data w.r.t. the amount of unlabeled Wikipedia data (x-axis: number of unlabeled words; y-axis: accuracy on test data, %).

In contrast, the attentional encoder-decoder model is more conservative in its changes, likely because it is less effective at learning an abstracted representation for the lemma, and instead copies characters directly from the input.

7 Conclusion and Future Work

In this work, we propose a multi-space variational encoder-decoder framework for labeled sequence transduction problems. The MSVED performs well on the task of morphological reinflection, outperforming the state of the art, and improves further with the addition of external unlabeled data. Future work will adapt this framework to other sequence transduction scenarios such as machine translation, dialogue generation, and question answering, where continuous and discrete latent variables can be abstracted to guide sequence generation.

Acknowledgments

The authors thank Jiatao Gu, Xuezhe Ma, Zihang Dai, and Pengcheng Yin for their helpful discussions. This work has been supported in part by an Amazon Academic Research Award.


References

Iñaki Alegria and Izaskun Etxeberria. 2016. EHU at the SIGMORPHON 2016 shared task. A simple proposal: Grapheme-to-phoneme for inflection. In Proceedings of the 2016 Meeting of SIGMORPHON.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In The International Conference on Learning Representations.

Justin Bayer and Christian Osendorfer. 2014. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of CoNLL.

Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. 2013. Translating into morphologically rich languages with synthetic phrases. Association for Computational Linguistics.

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. 2015. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task: Morphological reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Kareem Darwish and Douglas W. Oard. 2007. Adapting morphology for Arabic information retrieval. In Arabic Computational Morphology, Springer, pages 245–262.

Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia.

Otto Fabius and Joost R. van Amersfoort. 2014. Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.

Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California.

Emil Julius Gumbel and Julius Lieblein. 1954. Statistical theory of extreme values and some practical applications: a series of lectures. US Government Printing Office, Washington.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, et al. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. arXiv preprint arXiv:1611.04558.

Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2016. Neural multi-source morphological reinflection. arXiv preprint arXiv:1612.06027.

Katharina Kann and Hinrich Schütze. 2016a. MED: The LMU system for the SIGMORPHON 2016 shared task on morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Berlin, Germany.

Katharina Kann and Hinrich Schütze. 2016b. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany.

Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338, Austin, Texas.

Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, Montreal, Canada.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In The International Conference on Learning Representations.

Tomas Kocisky, Gabor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, and Karl Moritz Hermann. 2016. Semantic parsing with semi-supervised sequential autoencoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. 2016. Auxiliary deep generative models. In Proceedings of the 33rd International Conference on Machine Learning.

Chris J. Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Advances in Neural Information Processing Systems, pages 3086–3094.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Garrett Nicolai, Bradley Hauer, Adam St. Arnaud, and Grzegorz Kondrak. 2016. Morphological reinflection via discriminative string transduction. In Proceedings of the 2016 Meeting of SIGMORPHON.

Robert Östling. 2016. Morphological reinflection with convolutional neural networks. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, page 23.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 35–40.

Dima Taji, Ramy Eskander, Nizar Habash, and Owen Rambow. 2016. The Columbia University - New York University Abu Dhabi SIGMORPHON 2016 morphological reinflection shared task submission. In Proceedings of the 2016 Meeting of SIGMORPHON.

Keiichi Tokuda, Takashi Masuko, Noboru Miyazaki, and Takao Kobayashi. 2002. Multi-space probability distribution HMM. IEICE Transactions on Information and Systems 85(3):455–464.

Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying morphology generation models to machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 514–522.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016. Variational neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.