


Proceedings of the 8th International Joint Conference on Natural Language Processing, pages 142–151, Taipei, Taiwan, November 27 – December 1, 2017. © 2017 AFNLP

Understanding and Improving Morphological Learning in the Neural Machine Translation Decoder

Fahim Dalvi    Nadir Durrani    Hassan Sajjad    Yonatan Belinkov∗    Stephan Vogel

Qatar Computing Research Institute – HBKU, Doha, Qatar
{faimaduddin, ndurrani, hsajjad, svogel}@qf.org.qa

∗MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
[email protected]

Abstract

End-to-end training makes the neural machine translation (NMT) architecture simpler and more elegant than traditional statistical machine translation (SMT). However, little is known about the linguistic patterns of morphology, syntax and semantics learned during the training of NMT systems, and more importantly, which parts of the architecture are responsible for learning each of these phenomena. In this paper we i) analyze how much morphology an NMT decoder learns, and ii) investigate whether injecting target morphology into the decoder helps it produce better translations. To this end we present three methods: i) joint generation, ii) joint-data learning, and iii) multi-task learning. Our results show that explicit morphological information helps the decoder learn target language morphology and improves translation quality by 0.2–0.6 BLEU points.

1 Introduction

Neural machine translation (NMT) offers an elegant end-to-end architecture, improving translation quality compared to traditional phrase-based machine translation. These improvements are attributed to more fluent output (Toral and Sánchez-Cartagena, 2017) and better handling of morphology and long-range dependencies (Bentivogli et al., 2016). However, systematic studies are required to understand what kinds of linguistic phenomena (morphology, syntax, semantics, etc.) are learned by these models and, more importantly, which of the components is responsible for each phenomenon.

A few attempts have been made to understand what NMT models learn about morphology (Belinkov et al., 2017a), syntax (Shi et al., 2016) and semantics (Belinkov et al., 2017b). Shi et al. (2016) used activations at various layers from the NMT encoder to predict syntactic properties on the source side, while Belinkov et al. (2017a) and Belinkov et al. (2017b) used a similar approach to investigate the quality of word representations on the tasks of morphological and semantic tagging.

Belinkov et al. (2017a) found that word representations learned from the encoder are rich in morphological information, while representations learned from the decoder are significantly poorer. However, the paper does not present a convincing explanation for this finding. Our first contribution in this work is to provide a more comprehensive analysis of morphological learning on the decoder side. We hypothesize that other components of the NMT architecture, specifically the encoder and the attention mechanism, learn enough information about the target language morphology for the decoder to perform reasonably well, without incorporating much morphological knowledge into the decoder itself. To probe this hypothesis, we investigate the following questions:

• What is the effect of attention on the performance of the decoder?

• How much does the encoder help the decoder in predicting the correct morphological variant of the word it generates?

To answer these questions, we train NMT models for different language pairs, involving morphologically rich languages such as German and Czech. We then use the trained models to extract features from the decoder for words in the language of interest. Finally, we train a classifier using the extracted features to predict the morphological tag of each word. The accuracy of this external classifier gives us a quantitative measure of how well the NMT model learned features that are relevant to morphology. Our results indicate that both the encoder and the attention mechanism aid the decoder in generating correct morphological forms, and thus limit the need for the decoder to learn target morphology.

Motivated by these findings, we hypothesize that it may be possible to force the decoder to learn more about morphology by injecting morphological information during training, which can in turn improve the overall translation quality. In order to test this hypothesis, we experiment with three possible solutions:

1. Joint Generation: An NMT model is trained on the concatenation of words and morphological tags on the target side.

2. Joint-data Learning: An NMT model is trained where each source sequence is used twice, with an artificial token selecting whether to predict target words or morphological tags.

3. Multi-task Learning: A multi-task NMT system with two objective functions is trained to jointly learn translation and morphological tagging.

Our experiments show that word representations learned after explicitly injecting target morphology improve the morphological tagging accuracy of the decoder by 3% and also improve translation quality by up to 0.6 BLEU points.

The remainder of this paper is organized as follows. Section 2 describes our experimental setup. Section 3 shows an analysis of the decoder. Section 4 describes the three proposed methods to integrate morphology into the decoder. Section 5 presents the results. Section 6 gives an account of related work, and Section 7 concludes the paper.

2 Experimental Design

Parallel Data

We used the German-English and Czech-English datasets from the WIT3 TED corpus (Cettolo, 2016) made available for IWSLT 2016. We used the official training sets to analyze and evaluate the proposed methods for integrating morphology. The corpus also provides four test sets, test-11 through test-14. We used test-11 for tuning and the other test sets for evaluation. The statistics for the sets are provided in Table 1.

Language pair    Sentences    de/cz tokens    en tokens
De↔En            210K         4M              4.2M
Cz↔En            122K         2.1M            2.5M

Table 1: Statistics for the data used for training, tuning and testing

Morphological Annotations

In order to train and evaluate the external classifier on the extracted features, we required data annotated with morphological tags. We used the following tools, recommended on the Moses website[1], to annotate the data: LoPar (Schmid, 2000) for German, Tree-tagger (Schmid, 1994) for Czech and MXPOST (Ratnaparkhi, 1998) for English. The number of tags produced by these taggers is 214 for German and 368 for Czech.

Data Preprocessing

We used the standard MT pre-processing pipeline of tokenizing and truecasing the data using Moses (Koehn et al., 2007) scripts. We did not apply byte-pair encoding (BPE) (Sennrich et al., 2016b), which has recently become a common part of the NMT pipeline, because both our analysis and the annotation tools operate at the word level.[2] However, experimenting with BPE and other representations such as character-based models (Kim et al., 2015) would be interesting.[3]

NMT Systems

We used the seq2seq-attn implementation (Kim, 2016) with the following default settings: word embeddings and LSTM states with 500 dimensions, SGD with an initial learning rate of 1.0 and a decay rate of 0.5 (after the 9th epoch), and a dropout rate of 0.3. We use two uni-directional hidden layers for both the encoder and the decoder. The NMT system is trained for 13 epochs, and the model with the best validation loss is used for extracting features for the external classifier. We use a vocabulary size of 50,000 on both the source and target side.

[1] These have been used frequently to annotate data in previous evaluation campaigns (Birch et al., 2014; Durrani et al., 2014a).

[2] The difficulty with using these is that it is not straightforward to derive word representations out of a decoder that processes BPE-ed text, because the original words are split into subwords. We considered aggregating the representations of BPE subword units, but the choice of aggregation strategy may have an undesired impact on the analysis. For this reason we decided to leave the exploration of BPE for future work.

[3] Character-based models are becoming increasingly popular in neural MT for addressing the rare word problem, and they have previously been used to benefit MT for morphologically rich (Luong et al., 2010; Belinkov and Glass, 2016; Costa-jussà and Fonollosa, 2016) and closely related languages (Durrani et al., 2010; Sajjad et al., 2013).


Figure 1: Features for the word "Nun" (DEC_t1) are extracted from the decoder of a pre-trained NMT system and provided to the classifier for training


Classifier Settings

For the classification task, we used a feed-forward network with one hidden layer, dropout (ρ = 0.5), a ReLU non-linearity, and an output layer mapping to the tag set (followed by a softmax). The size of the hidden layer is set to be identical to the size of the NMT decoder's hidden state (500 dimensions). The classifier has no explicit access to context other than the hidden representation generated by the NMT system, which allows us to focus on the quality of the representation. We use Adam (Kingma and Ba, 2014) with default parameters to minimize the cross-entropy objective.
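A minimal PyTorch sketch of such a probing classifier is given below. The training loop is simplified (full batch, fixed number of epochs) and the variable names are illustrative; it assumes the frozen NMT hidden states and gold tag ids have already been extracted into tensors.

```python
import torch
import torch.nn as nn

class TagClassifier(nn.Module):
    """Feed-forward probe: one hidden layer, ReLU, dropout 0.5, output layer over the tag set."""
    def __init__(self, input_dim=500, hidden_dim=500, num_tags=214):  # 214 German / 368 Czech tags
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, num_tags),  # logits; the softmax is folded into the loss below
        )

    def forward(self, x):
        return self.net(x)

def train_probe(features, tags, num_tags, epochs=10):
    """features: (num_words, 500) frozen NMT states; tags: (num_words,) gold tag ids."""
    model = TagClassifier(input_dim=features.size(1), num_tags=num_tags)
    optimizer = torch.optim.Adam(model.parameters())   # default Adam parameters
    criterion = nn.CrossEntropyLoss()                  # cross-entropy objective
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(features), tags)
        loss.backward()
        optimizer.step()
    return model
```

The probe's accuracy on held-out words is then taken as the measure of how much morphology the extracted representations encode.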

3 Decoder Analysis

3.1 Methodology

We follow a process similar to Shi et al. (2016) and Belinkov et al. (2017a) to analyze the NMT systems, but with a focus on the decoder component of the architecture. Formally, given a source sentence s = {s_1, s_2, ..., s_N} and a target sentence t = {t_1, t_2, ..., t_M}, we first use the encoder (Equation 1) to compute a set of hidden states h = {h_1, h_2, ..., h_N}. We then use an attention mechanism (Bahdanau et al., 2014) to compute, given the previous decoder state d_{i-1}, a weighted average of these hidden states, known as the context vector c_i (Equation 2). The context vector is a real-valued vector of k dimensions, which is set to be the same as the hidden states in our case. The attention model computes a weight w_hi for each hidden state of the encoder, thus giving a soft alignment for each target word. The context vector is then used by the decoder (Equation 3) to generate the next word in the target sequence:

ENC : s = {s_1, ..., s_N} ↦ h = {h_1, ..., h_N}    (1)

ATTN_i : h, d_{i-1}, t_{i-1} ↦ c_i ∈ R^k, 1 ≤ i ≤ M    (2)

DEC : {c_1, ..., c_M} ↦ t = {t_1, t_2, ..., t_M}    (3)

After training the NMT system, we freeze the parameters of the network and use the encoder or the decoder as a feature extractor to generate vectors representing words in the sentence. Let ENC_si denote the representation of a source word s_i. We use ENC_si to train the external classifier to predict the morphological tag of s_i, and evaluate the quality of the representation based on our ability to train a good classifier. For word representations on the target side, we feed our word of interest t_i as the previously predicted word, and extract the representation DEC_ti from the higher layers (see Figure 1 for an illustration).
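A sketch of this extraction step is shown below. It assumes a trained encoder-decoder exposed through hypothetical `encode` and `decode_step` methods (seq2seq-attn itself is a Lua/Torch toolkit, so these names are illustrative): the parameters stay frozen, the gold target word t_i is fed as the previously predicted word, and the resulting top-layer decoder state is kept as DEC_ti.

```python
import torch

@torch.no_grad()                                  # freeze: no gradients, parameters unchanged
def extract_decoder_features(model, src_ids, tgt_ids):
    """Return one top-layer decoder state per target word (DEC_ti)."""
    enc_states = model.encode(src_ids)            # h_1..h_N (Eq. 1)
    features = []
    dec_state = None
    for t_i in tgt_ids:                           # teacher forcing with the reference words
        # feed t_i as the previously predicted word; keep the resulting top-layer state
        dec_state, hidden_top = model.decode_step(t_i, dec_state, enc_states)
        features.append(hidden_top)
    return torch.stack(features)                  # (M, 500) matrix, one row per target word
```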

Note that in the decoder, the target word representation DEC_ti is not learned for predicting the word t_i, but the next word (t_{i+1}). Hence, it is arguable that DEC_ti actually captures morphological information about t_{i+1} rather than t_i, which could also explain the poorer decoder accuracies. To test this argument, we also trained our systems assuming that DEC_ti encodes morphological information about the next word t_{i+1}; in this case, the decoder performance dropped by almost 15%. DEC_ti probably encodes morphological information about both the current word (t_i) and the next word (t_{i+1}). However, we leave this exploration for future work, and proceed with the assumption that DEC_ti encodes information about the word t_i.

3.2 Analysis

Before diving into the decoder's performance, we first compare the performance of the encoder versus the decoder by training De↔En[4] and Cz↔En NMT models. We use the De→En/Cz→En models to extract encoder representations, and the En→De/En→Cz models to extract decoder representations. We then feed these representations to our classifier to predict morphological tags for German and Czech words. Table 2 shows that German and Czech representations learned on the encoder side (using the De→En/Cz→En models) give much better accuracy than those learned on the decoder side (using the En→De/En→Cz models).

[4] By De↔En, we mean independently trained German-to-English and English-to-German models.

Baseline    ENC_si    DEC_ti
De↔En       89.5      44.55
Cz↔En       77.0      36.35

Table 2: Comparison of morphological accuracy for the encoder and decoder representations

Given this difference in performance between the two components of our NMT system, we analyze the decoder further in various settings, comparing its performance i) with and without the attention mechanism, and ii) when augmenting the decoder representation with the representation of the most attended source word. The baseline NMT models were trained with an attention mechanism. To probe what effect the attention mechanism has on the decoder's performance in the context of learning target language morphology, we trained NMT models without attention. Next, we took our baseline model (with attention) and augmented its decoder representations with the encoder hidden state receiving the maximum attention (henceforth denoted ENC_ti). Our hypothesis is that since the decoder focuses on this hidden state to output the next target word, it may also encode some useful information about target morphology. Lastly, we also train a classifier on ENC_ti alone in order to compare the ability of the encoder and the decoder to learn target language morphology.
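A sketch of how the DEC_ti+ENC_ti features used below can be assembled, assuming the per-word decoder states, the encoder states and the soft-alignment (attention) matrix have already been extracted; NumPy, with illustrative names:

```python
import numpy as np

def augment_with_max_attention(dec_states, enc_states, attention):
    """Concatenate each decoder state with the most-attended encoder state (ENC_ti)."""
    # dec_states: (M, k), enc_states: (N, k), attention: (M, N) soft-alignment weights
    most_attended = attention.argmax(axis=1)            # index of the max-attention source word per target word
    enc_t = enc_states[most_attended]                   # (M, k) rows of ENC_ti
    return np.concatenate([dec_states, enc_t], axis=1)  # (M, 2k) DEC_ti+ENC_ti features

# The resulting (M, 2k) matrix is fed to the same morphological-tag classifier as before.
```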

Table 3 summarizes the results of these experiments. Comparing systems with (DEC_ti) and without attention (w/o-ATTN), we see that the accuracy on the morphological tagging task goes up when no attention is used. This can be explained by the fact that without attention, the decoder receives only a single context vector from the encoder and has to learn more information about each target word to make accurate predictions. It is difficult for the encoder to cleanly transfer information about each target word through the same context vector, forcing the decoder to learn more, which results in better decoder performance with respect to the morphological information learned.

            DEC_ti    w/o-ATTN    DEC_ti+ENC_ti    ENC_ti
En→De       44.55     50.26       60.34            43.43
En→Cz       36.35     42.09       48.64            36.36

Table 3: Morphological tagging accuracy of the decoder with and without attention, and the effect of considering the most attended source word (ENC_ti)

The second part of the table presents results involving encoder representations to aid morphological analysis of target words. There is a significant boost in the classifier's performance when the decoder representation for a target word t_i is concatenated with the encoder representation of the most attended source word (DEC_ti+ENC_ti). This hints at several hypotheses: i) because the source and target words are translations of each other, they share some morphological properties (e.g. nouns get translated to nouns), and ii) the encoder also learns and stores information about the target language, so that the attention mechanism can make use of this information when deciding which word to focus on next. To verify that the encoder and decoder indeed learn different information, we also tried to classify the morphological tag of a given word t_i based on the encoder representation of the most attended source word alone (ENC_ti). We see a drop in accuracy, showing that the encoder and decoder learn different things about the same target word and that their representations are complementary. We can also see that the accuracy of the combined representation (DEC_ti+ENC_ti) still lags behind the encoder's performance in predicting source morphology (Table 2). This indicates that there is still room for improvement in the NMT model's ability to learn target-side morphology.

In this section, we showed that the encoder and the decoder learn different amounts of morphology due to the varying nature of their tasks within the NMT architecture. The decoder depends on the encoder and the attention mechanism to generate the correct morphological variant of a target word.

4 Morphology-aware Decoder

Motivated by the result that the decoder learns considerably less morphology than the encoder (Table 2), and that the overall system does not learn as much about target morphology as about source morphology, we investigated three ways to directly inject target morphology into the decoder: i) Joint Generation, ii) Joint-data Learning, and iii) Multi-task Learning. Figure 2 illustrates the approaches.

Figure 2: Various approaches to inject morphological knowledge into the decoder

4.1 Joint Generation

As our first approach, we consider a solution that uses the standard NMT architecture but is trained on a modified dataset. To incorporate morphological information, we modify the target sentence by appending the morphological tag sequence to it. An NMT system trained on this data learns to produce both words and morphological tags simultaneously. Formally, given a source sentence s = {s_1, ..., s_N}, a target sentence t = {t_1, ..., t_M} and its morphological tag sequence m = {m_1, ..., m_M}, we train an NMT system on (s′, t′) pairs, where s′ = s and t′ = t + m. Although this model is quite weak, and each word and its morphological tag are generated far apart, we posit that the attention mechanism might be able to attend to the same source word twice. In that case, the decoder receives a similar representation from which it has to predict a word in the first instance and a tag in the second, thus encouraging shared learning for the two tasks.
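A sketch of the data transformation for joint generation (plain Python; the example words and tags are illustrative toy data, not drawn from our corpus):

```python
def make_joint_generation_pair(src_words, tgt_words, tgt_tags):
    """Build (s', t') with t' = t + m: the target words followed by their morphological tags."""
    src_out = list(src_words)                     # s' = s
    tgt_out = list(tgt_words) + list(tgt_tags)    # t' = t + m
    return src_out, tgt_out

# Toy example with a German target and hypothetical STTS-style tags:
src = ["now", "we", "go"]
tgt = ["nun", "gehen", "wir"]
tags = ["ADV", "VVFIN", "PPER"]
print(make_joint_generation_pair(src, tgt, tags))
# (['now', 'we', 'go'], ['nun', 'gehen', 'wir', 'ADV', 'VVFIN', 'PPER'])
```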

4.2 Joint-data Learning

Given the drawbacks of the first approach, we considered another data augmentation technique, inspired by multilingual NMT systems (Johnson et al., 2016). Instead of having multiple source and target languages, we use one source language and two target language variations. The training data consists of sequences of source→target words and source→target morphological tags. We add an artificial token at the beginning of each source sentence indicating whether the model should generate target words or morphological tags. Using an artificial token in the source sentence has been explored before and shown to work well for controlling the style of the target language (Sennrich et al., 2016a). The objective function is the same as in usual sequence-to-sequence models, and is hence shared to minimize both the morphological tagging and translation error on the mixed data.
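A sketch of the doubled training data used in joint-data learning (plain Python; the artificial token strings <2words> and <2tags> are illustrative, not necessarily the exact tokens used in our experiments):

```python
def make_joint_data_examples(src_words, tgt_words, tgt_tags):
    """Duplicate each source sentence with a task token selecting words or tags as the target."""
    to_words = (["<2words>"] + list(src_words), list(tgt_words))
    to_tags = (["<2tags>"] + list(src_words), list(tgt_tags))
    return [to_words, to_tags]

# Both kinds of examples are mixed together and fed to an unmodified NMT system,
# so a single shared objective covers translation and morphological tagging.
```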

4.3 Multi-task Learning

In this final method, we follow a more principled approach and modify the standard sequence-to-sequence architecture for multi-task learning. The goal in multi-task training is to learn several tasks simultaneously so that each task can benefit from the mutual information learned (Collobert and Weston, 2008).[5] With this motivation, we modified the NMT decoder to predict not only a word but also its corresponding tag. All of the layers below the output layers are shared. We have two output layers in parallel: the first predicts the target word, and the second predicts the morphological tag of the target word. Both output layers have their own separate loss function. During training, we combine the losses from both output layers to jointly train the system. This is different from the joint-data learning technique, where we predict entire sequences of words or tags without any dependence on each other.

[5] For example, Eriguchi et al. (2017) jointly learned the tasks of parsing and translation.

Formally, given a set of N tasks, sequence-to-sequence multi-task learning minimizes an overall loss that is a weighted combination of the N individual task losses. In our scenario, the training corpus is a multi-target corpus consisting of source→target words and source→target morphological tags, i.e., N = 2. Hence, given a set of training examples D = {⟨s^(n), t^(n), m^(n)⟩}, where s is a source sentence, t is the corresponding target sentence and m is the target morphological tag sequence, the objective function to maximize is:

L = (1 − λ) Σ_{n ∈ D} log P(t^(n) | s^(n); θ) + λ Σ_{n ∈ D} log P(m^(n) | s^(n); θ)

where λ is a hyper-parameter used to shift the focus towards translation or towards morphological tagging.[6]

[6] We tuned the weight parameter on held-out data.
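A minimal PyTorch-style sketch of this setup (class and variable names are illustrative): the decoder's shared hidden state feeds two parallel output layers, and their cross-entropy losses, i.e. the negative log-likelihood terms above, are mixed with the weight λ, so minimizing this combined loss corresponds to maximizing the objective above.

```python
import torch
import torch.nn as nn

class MultiTaskOutput(nn.Module):
    """Two parallel output layers over the shared decoder state: target word and its tag."""
    def __init__(self, hidden_dim, vocab_size, tag_size):
        super().__init__()
        self.word_out = nn.Linear(hidden_dim, vocab_size)
        self.tag_out = nn.Linear(hidden_dim, tag_size)

    def forward(self, decoder_states):
        return self.word_out(decoder_states), self.tag_out(decoder_states)

def multitask_loss(word_logits, tag_logits, word_targets, tag_targets, lam=0.2):
    """(1 - lambda) * translation loss + lambda * tagging loss."""
    ce = nn.CrossEntropyLoss()
    word_loss = ce(word_logits, word_targets)   # -log P(t | s)
    tag_loss = ce(tag_logits, tag_targets)      # -log P(m | s)
    return (1.0 - lam) * word_loss + lam * tag_loss
```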

5 Results and Discussion

Our results show that the multi-task learning approach performs best among the three approaches, while the joint generation method performs worst. Figure 3 summarizes the results for the different language pairs. The joint generation method degrades overall translation performance, as expected given its weakness from a modeling perspective. It is possible that even though the attention mechanism is able to focus on the source sequence in two passes, the parts of the network that predict words and tags are not tightly coupled enough to learn from each other.

Figure 3: Improvements from adding morphology. A y-value of zero represents the baseline

The BLEU scores improved with the other two methods. We achieved an improvement of up to 0.6 BLEU points and 3% in tagging accuracy. The best improvements were obtained in the En→De direction, while we observed smaller gains in De→En. This is perhaps because English is morphologically poorer, and the baseline system was already able to learn the required amount of morphological information from the text itself. Improvements were also obtained for the En→Cz direction, although not as large as for German. This could be due to data sparsity: Czech is much richer in morphology,[7] and the available TED En↔Cz data is about 40% smaller than the En↔De data.

Joint-data vs. Multi-task Learning

Both joint-data learning and multi-task learning improved overall translation performance. In the case of En→De, the performance of the two approaches is very similar. However, each has its own pros and cons. The joint-data learning method is a simple approach that allows us to add morphology and other linguistic information without changing the architecture, while the multi-task learning approach is a more principled and powerful way of integrating the same information into the decoder. Having separate objective functions in multi-task learning also allows us to adjust the balance between the two tasks, which can be handy if the quality of the morphological information is not very high. On the flip side, this additional explicit weight adjustment can also be viewed as a potential constraint that is not present in the joint-data learning approach.

[7] The number of morphological tags in Czech is 368, versus 214 in German.

Figure 4: Multi-task learning: translation vs. morphological tagging weight for the En→De model

Multi-task Weight Hyper-Parameter

As discussed, the multi-task learning approach has an additional weight hyper-parameter λ that adjusts the balance between word and tag prediction. Figure 4 shows the result of varying λ from no morphological information (λ = 0) to only morphological information (λ = 1) on the test-11 set. The left y-axis shows the BLEU score and the right y-axis shows the morphological tagging accuracy. The best morphological accuracy is achieved at λ = 1, which does not correspond to the best translation quality, since at that point the model only minimizes the tagging objective. Similarly, at λ = 0, the model falls back to the baseline model with a single objective function minimizing the translation error. For all language pairs, we consistently achieved the best BLEU score at λ = 0.2. The parameter was tuned on a separate held-out development set (test-11), and the results shown in Figure 3 are on the blind test sets (test-12 and test-13); averages over these test sets are reported in the figure.
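The tuning loop implied here can be sketched as follows (illustrative Python; train_and_score is a hypothetical helper that trains a multi-task model with the given λ and returns its BLEU score on the development set):

```python
def tune_lambda(train_and_score, lambdas=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Sweep the multi-task weight on the development set (test-11) and keep the best BLEU."""
    scores = {lam: train_and_score(lam) for lam in lambdas}   # dev BLEU per lambda value
    best = max(scores, key=scores.get)
    return best, scores

# The selected lambda (0.2 in the experiments reported here) is then evaluated
# once on the blind test sets (test-12 and test-13).
```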

6 Related Work

The work related to this paper can be broken into two groups:

Analysis: Several approaches have been devised to analyze MT models and the linguistic properties that are learned during training. A common approach is to use activations from a trained model to train an external classifier that predicts some relevant property of the input. Köhn (2015) and Qian et al. (2016b) analyzed the linguistic information learned in word embeddings, while Qian et al. (2016a) went further and analyzed linguistic properties in the hidden states of a recurrent neural network. Adi et al. (2016) looked at the overall information captured in a sentence summary vector generated by an RNN using a similar approach. Our approach closely aligns with that of Shi et al. (2016) and Belinkov et al. (2017a), where the activations from various layers of a trained NMT system are used to predict linguistic properties.

Integrating Morphology: Some work has also been done on injecting morphological or more general linguistic knowledge into an NMT system. Sennrich and Haddow (2016) proposed a factored model that incorporates linguistic features on the source side as additional factors. An embedding is learned for each factor, just like for a source word, and the word and factor embeddings are combined before being passed on to the encoder. Aharoni and Goldberg (2017) proposed a method to predict the target sentence along with its syntactic tree; they linearize the tree in order to use the existing sequence-to-sequence model. Nadejde et al. (2017) also evaluated several methods of incorporating syntactic knowledge on both the source and target sides. While they used factors on the source side, their best method for the target side was to linearize the information and interleave it between the target words. García-Martínez et al. (2016) used a neural MT model with multiple outputs, as in our multi-task learning setup. Their model predicts two properties at every step, the lemma of the target word and its morphological information; an external tool then uses this information to generate the actual target word. Dong et al. (2015) presented multi-task learning for translating one language into multiple target languages, and Luong et al. (2015) performed experiments involving several levels of source and target language information. There have also been earlier efforts to integrate morphology into MT systems by learning factored models (Koehn and Hoang, 2007; Durrani et al., 2014b) over POS and morphological tags.

7 Conclusion

In this paper we analyzed and investigated ways to improve morphological learning in the NMT decoder. We carried out a series of experiments to understand why the decoder learns considerably less morphology than the encoder in the NMT architecture. We found that the decoder needs assistance from the encoder and the attention mechanism to generate correct target morphology. Additionally, we explored three ways to explicitly inject morphology into the decoder: joint generation, joint-data learning, and multi-task learning. We found multi-task learning to outperform the other two methods; the simpler joint-data learning method also gave decent improvements. The code for the experiments and the modified framework is available at https://github.com/fdalvi/seq2seq-attn-multitask.

References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. arXiv preprint arXiv:1608.04207.

Roee Aharoni and Yoav Goldberg. 2017. Towards String-To-Tree Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 132–140. Association for Computational Linguistics.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017a. What do Neural Machine Translation Models Learn about Morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.

Yonatan Belinkov and James Glass. 2016. Large-Scale Machine Translation between Arabic and Hebrew: Available Corpora and Initial Results. In Proceedings of the Workshop on Semitic Machine Translation, pages 7–12, Austin, Texas. Association for Computational Linguistics.

Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017b. Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks. In Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan. Association for Computational Linguistics.

Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus Phrase-Based Machine Translation Quality: a Case Study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267, Austin, Texas. Association for Computational Linguistics.

Alexandra Birch, Matthias Huck, Nadir Durrani, Nikolay Bogoychev, and Philipp Koehn. 2014. Edinburgh SLT and MT System Description for the IWSLT 2014 Evaluation. In Proceedings of the 11th International Workshop on Spoken Language Translation, IWSLT '14, Lake Tahoe, CA, USA.

Mauro Cettolo. 2016. An Arabic-Hebrew Parallel Corpus of TED Talks. In Proceedings of the AMTA Workshop on Semitic Machine Translation (SeMaT), Austin, US-TX.

Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160–167, New York, NY, USA. ACM.

Marta R. Costa-jussà and José A. R. Fonollosa. 2016. Character-based Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 357–361, Berlin, Germany. Association for Computational Linguistics.

Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-Task Learning for Multiple Language Translation. In ACL (1).

Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. 2014a. Edinburgh's Phrase-based Machine Translation Systems for WMT-14. In Proceedings of the ACL 2014 Ninth Workshop on Statistical Machine Translation, pages 97–104, Baltimore, MD, USA.

Nadir Durrani, Philipp Koehn, Helmut Schmid, and Alexander Fraser. 2014b. Investigating the Usefulness of Generalized Word Representations in SMT. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 421–432, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.

Nadir Durrani, Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2010. Hindi-to-Urdu Machine Translation through Transliteration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 465–474, Uppsala, Sweden. Association for Computational Linguistics.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to Parse and Translate Improves Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 72–78. Association for Computational Linguistics.

Mercedes García-Martínez, Loïc Barrault, and Fethi Bougares. 2016. Factored Neural Machine Translation. CoRR, abs/1609.04621.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. CoRR, abs/1611.04558.

Yoon Kim. 2016. Seq2seq-attn. https://github.com/harvardnlp/seq2seq-attn.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-Aware Neural Language Models. arXiv preprint arXiv:1508.06615.

Diederik Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868–876, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the Association for Computational Linguistics (ACL'07), Prague, Czech Republic.

Arne Köhn. 2015. What's in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2067–2073, Lisbon, Portugal. Association for Computational Linguistics.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task Sequence to Sequence Learning. CoRR, abs/1511.06114.

Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 148–157. Association for Computational Linguistics.

Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Syntax-aware Neural Machine Translation Using CCG. CoRR, abs/1702.01147.

Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016a. Analyzing Linguistic Knowledge in Sequential Model of Sentence. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 826–835, Austin, Texas. Association for Computational Linguistics.

Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016b. Investigating Language Universal and Specific Properties in Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1478–1488, Berlin, Germany. Association for Computational Linguistics.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

Hassan Sajjad, Kareem Darwish, and Yonatan Belinkov. 2013. Translating Dialectal Arabic to English. In Proceedings of the 51st Conference of the Association for Computational Linguistics (ACL).

Helmut Schmid. 1994. Part-of-Speech Tagging with Neural Networks. In Proceedings of the 15th International Conference on Computational Linguistics (Coling 1994), pages 172–176, Kyoto, Japan. Coling 1994 Organizing Committee.

Helmut Schmid. 2000. LoPar: Design and Implementation. Bericht des Sonderforschungsbereiches "Sprachtheoretische Grundlagen für die Computerlinguistik" 149, Institute for Computational Linguistics, University of Stuttgart.

Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, pages 83–91, Berlin, Germany. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling Politeness in Neural Machine Translation via Side Constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does String-Based Neural MT Learn Source Syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.

Antonio Toral and Víctor M. Sánchez-Cartagena. 2017. A Multifaceted Evaluation of Neural versus Phrase-Based Machine Translation for 9 Language Directions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1063–1073, Valencia, Spain. Association for Computational Linguistics.
