
Proceedings of the
21st Annual Conference of the
European Association for Machine Translation

28–30 May 2018
Universitat d'Alacant
Alacant, Spain

Edited by
Juan Antonio Pérez-Ortiz
Felipe Sánchez-Martínez
Miquel Esplà-Gomis
Maja Popović
Celia Rico
André Martins
Joachim Van den Bogaert
Mikel L. Forcada

Organised by


The papers published in these proceedings are, unless indicated otherwise, covered by the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 licence (CC BY-NC-ND 3.0). You may copy, distribute, and transmit the work, provided that you attribute it (authorship, proceedings, publisher) in the manner specified by the author(s) or licensor(s), and that you do not use it for commercial purposes. The full text of the licence may be found at https://creativecommons.org/licenses/by-nc-nd/3.0/deed.en.

© 2018 The authors
ISBN: 978-84-09-01901-4


Implementing a neural machine translation engine for mobile devices: the Lingvanex use case

Zuzanna Parcheta1  Germán Sanchis-Trilles1  Aliaksei Rudak2  Siarhei Bratchenia2

1 Sciling S.L., Carrer del Riu 321, Pinedo, 46012, Spain
2 Nordicwise LLC, 1 Apriliou, 52, Athienou, Larnaca, 7600, Cyprus

{zparcheta, gsanchis}@sciling.com
{alrudak, s.bratchenya}@lingvanex.com

Abstract

In this paper, we present the challenge entailed by implementing a mobile version of a neural machine translation system, where the goal is to maximise translation quality while minimising model size. We explain the whole process of implementing the translation engine on an English–Spanish example, and we describe all the difficulties found and the solutions implemented. The main techniques used in this work are data selection by means of Infrequent n-gram Recovery, appending a special word at the end of each sentence, and generating additional samples without the final punctuation marks. The last two techniques were devised with the purpose of achieving a translation model that generates sentences without a final full stop or other punctuation marks. Also, in this work, Infrequent n-gram Recovery was used for the first time to create a new corpus rather than to enlarge an in-domain dataset. Finally, we obtain a small model whose quality is good enough for daily use.

1 Introduction

Lingvanex1 is a trademark for linguistic products made by the Nordicwise LLC company. The main focus of the company is translator and dictionary applications that work without an internet connection on mobile and desktop platforms.

© 2018 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.
1 https://lingvanex.com/en/

In collaboration with Sciling2, an agency specialised in providing end-to-end machine learning solutions, a small-sized translation model from English to Spanish was implemented.

When implementing a mobile translator, it is crucial to understand its purpose. In our case, the purpose was to be able to generate translations in a daily usage scenario, without requiring an internet connection. This is the typical use case in a travel scenario, where travellers often do not have an internet connection, either because they do not want to assume the cost of roaming, because they do not want to purchase a local SIM card, or because there is no good connection in the places they are travelling to, such as some countries in Africa. In this scenario, the main purpose of the translation engine is to correctly translate short sentences composed of words common in a traveller's domain, where words belonging to, e.g., a parliamentary or medical domain are less frequent. In addition, the model needs to be contained in terms of size, since large models perform poorly on a mobile device.

In this work, we focus on reducing model size mainly through data selection techniques, down to a size of 150MB per model. However, there are other techniques which bring promising results, such as compressing the NMT model via pruning (See et al., 2016).

Throughout this article we determine what has the main influence on model size. To that end, we conducted experiments comparing model size with total vocabulary size and word embedding size. We also compare model size with different numbers of layers on the encoder and decoder sides, and with the size of the recurrent layer.

2 https://sciling.com/



The next step is to select data for training the engines through sentence length filtering and a data selection (DS) technique. During the implementation of our translation engines we found several problems in the generated translations; we describe each of these problems and propose appropriate solutions. After implementing these solutions, we evaluate the quality of our final model on a test set, and compare the results with Google's and Microsoft's mobile translators.

2 Data description

The data used to train the translation model was obtained from the OPUS3 corpus. In total, there were 76M parallel sentences. We also leveraged the Tatoeba corpus for the DS step described in Section 4. Tatoeba is a free collaborative online database of example sentences geared towards foreign language learners. The development set was also built from the Tatoeba corpus, by selecting 2k random sentence pairs. The main figures of the Tatoeba corpus are shown in Table 1. As the test set, we created a small corpus of commonly useful English sentences found on different websites, to which we added some sentences consisting of unigrams and bigrams. In total, we selected 86 sentences.

3 http://opus.nlpl.eu/

Table 1: Tatoeba main figures. k denotes thousands of elements, |S| stands for number of sentences, |W| for number of running words, and |V| for vocabulary size.

language   |S|    |W|    |V|
English    136k   964k   40k
Spanish    136k   931k   61k

3 Model size dependency

When confronting model size reduction, the first question that arises is which hyper-parameters have the most influence on model size. Before moving forward and implementing an NMT system, we conducted experiments comparing model size with total vocabulary size and word embedding size (Mikolov et al., 2013). We also compared model size with different numbers of layers and units per recurrent layer, on both the encoder and decoder sides.

To determine how the previously enumerated hyper-parameters affect model size, we trained different models varying these hyper-parameters.

In the first experiment, we set the number of units in the recurrent layer to 128, with a single layer on both the encoder and decoder sides. We analysed the effect of the total combined (source and target) vocabulary size, pruned to |V| = {5k, 10k, 20k, 50k, 100k} according to the most frequent words in the OPUS corpus, with source and target vocabulary sizes set to |V|/2. In addition, we also studied different embedding vector sizes, |ω| = {64, 128, 256, 512}. The results obtained are shown in Figure 1a.

Next, we analysed the effect of different numbers of hidden units and layers. In this case, we fixed |ω| = 128. We found that the number of layers, using 128 hidden units, has almost no effect on model size; in Figure 1b, we therefore only show 1 and 4 layers for 128 units. Looking at Figure 1, we can conclude that the number of layers has a small effect on model size compared with the number of hidden units and the embedding size. Figure 1 can be leveraged to decide on adequate values for these hyper-parameters, once the model size budget has been fixed to 150MB.
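To make this dependency concrete, the following back-of-the-envelope sketch (our illustration, not code from the paper) counts the parameters of a sequence-to-sequence LSTM of the kind described in Section 5 and converts them to megabytes, assuming 4 bytes per parameter and ignoring attention weights. It shows why the two embedding tables and the output projection, all proportional to |V|, dominate model size, while extra 128-unit layers barely register.

def lstm_params(input_size, hidden, layers):
    # One LSTM layer holds 4 gates, each with input, recurrent and bias weights.
    total, in_size = 0, input_size
    for _ in range(layers):
        total += 4 * (hidden * in_size + hidden * hidden + hidden)
        in_size = hidden
    return total

def model_size_mb(v_total, emb, hidden, layers, bytes_per_param=4):
    v_side = v_total // 2                    # source and target get |V|/2 each
    embeddings = 2 * v_side * emb            # source and target embedding tables
    encoder = lstm_params(emb, hidden, layers)
    decoder = lstm_params(emb + hidden, hidden, layers)  # input feeding widens the input
    generator = (hidden + 1) * v_side        # projection onto the target vocabulary
    return (embeddings + encoder + decoder + generator) * bytes_per_param / 2**20

# model_size_mb(100_000, 512, 128, 1) is dominated by the 2 * 50k * 512
# embedding parameters, in line with the trend visible in Figure 1a.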

4 Data filtering

Data filtering involved two main steps. First, sentences longer than 20 words were discarded, under the assumption that a mobile translator is mainly designed for translating short sentences. Second, we performed data selection, leveraging Infrequent n-gram Recovery (Gascó et al., 2012). The intuition behind this technique is to select, from the full available bilingual data, those sentences that maximise the coverage of the n-grams in a small, domain-specific dataset. The full available bilingual data is sorted by the infrequency score of each sentence, so that the most informative sentences are selected first.
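A minimal sketch of the first step, assuming (since the paper does not say on which side the limit is applied) that pairs are kept only when both sides are at most 20 words long:

def length_filter(pairs, max_len=20):
    # Keep only sentence pairs in which both sides have at most max_len words.
    return [(src, tgt) for src, tgt in pairs
            if len(src.split()) <= max_len and len(tgt.split()) <= max_len]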

Let χ be the set of n-grams that appear in the sentences to be translated, and let w be one of them. C(w) denotes the count of w in the source-language training set, t is the threshold of counts below which an n-gram is considered infrequent, and N(w) is the count of w in the source sentence f to be scored. The infrequency score of f is:

$$ i(f) = \sum_{w \in \chi} \min\big(1, N(w)\big)\, \max\big(0,\, t - C(w)\big) \qquad (1) $$
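The score is straightforward to implement. The sketch below is our illustration, assuming whitespace tokenisation; note that in the full selection algorithm C(w), here selected_counts, grows as sentences are selected, so candidate scores must be recomputed as selection proceeds.

from collections import Counter

def ngrams(tokens, max_n=5):
    # All n-grams of the token list, for n = 1 .. max_n.
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def infrequency_score(sentence, chi, selected_counts, t=30, max_n=5):
    # Equation 1: every in-domain n-gram w present in the sentence contributes
    # the number of occurrences still missing to reach the threshold t.
    present = Counter(ngrams(sentence.split(), max_n))
    return sum(min(1, present[w]) * max(0, t - selected_counts[w])
               for w in chi if w in present)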



[Figure 1: two plots of model size (MB, ranging from 50 to 300) as a function of total vocabulary size (5k–100k).]

(a) Model size depending on vocabulary size and embedding size (|ω| = 64, 128, 256, 512). The number of units in the recurrent layer is set to 128, and the number of layers is 1.

(b) Model size depending on vocabulary size, number of units in the recurrent layer (rnn = 128 or 256), and number of layers (1–4), with |ω| = 128.

Figure 1: Model size dependency on different parameters. k denotes thousands of elements and MB is an abbreviation for megabyte. The vocabulary size is the sum of the source and target vocabulary sizes.

We applied Infrequent n-gram Recovery to the 60M sentences from the OPUS corpus as out-of-domain data. Intuitively, we selected sentences from the available data until all n-grams of order up to 5 extracted from the Tatoeba corpus reached a maximum of 30 occurrences (where possible given the available data). However, applying this technique to the full set of 60M sentences would have led to very long execution times. Hence, we divided the corpus into 6 partitions and performed the selection on each of these partitions. Then we joined the selections from all 6 partitions and conducted a second selection step on the joined corpus, since some n-grams could well have 6 · 30 occurrences. This led to a final selection of 740k sentences. The selected dataset presented a vocabulary of 19.4k words on the source side and 22.9k on the target side; the total combined vocabulary was |V| = 42.4k. Note that the selection was conducted on the tokenised and lowercased corpus.
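The two-pass scheme can be sketched as follows, reusing ngrams and infrequency_score from above. This is an illustrative, deliberately unoptimised version (the repeated max makes it quadratic), not the authors' implementation:

def greedy_select(pool, chi, t=30, max_n=5):
    # Repeatedly pick the highest-scoring sentence and update the counts C(w).
    selected, counts = [], Counter()
    while pool:
        best = max(pool, key=lambda s: infrequency_score(s, chi, counts, t, max_n))
        if infrequency_score(best, chi, counts, t, max_n) == 0:
            break                        # no candidate covers an infrequent n-gram
        selected.append(best)
        pool.remove(best)
        counts.update(ngrams(best.split(), max_n))
    return selected

def two_pass_selection(corpus, chi, partitions=6, t=30):
    # Pass 1: select within each partition; an n-gram may then be covered up to
    # partitions * t times across partitions, hence the second pass on the union.
    step = max(1, -(-len(corpus) // partitions))    # ceiling division
    joined = [s for i in range(0, len(corpus), step)
              for s in greedy_select(corpus[i:i + step], chi, t)]
    return greedy_select(joined, chi, t)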

5 Experimental setup

The system was trained using the OpenNMT (Klein et al., 2017) deep learning framework based on Torch. OpenNMT is mainly focused on developing sequence-to-sequence models covering a variety of tasks such as machine translation, summarisation, image-to-text, and speech recognition. Byte-pair encoding (BPE) (Sennrich et al., 2015) was trained on the selected training dataset, and then applied to the training, development, and test data.

In each experiment we trained a recurrent neural network with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). We used a global attention layer to improve translation by selectively focusing on parts of the source sentence during translation, as well as input feeding, which feeds attentional vectors as inputs to the next time steps to inform the model about past alignment decisions (Luong et al., 2015); however, this option only had a visible effect with 4 or more layers. Training was performed for 50 epochs using the Adam optimiser (Kingma and Ba, 2014) with a learning rate of 0.0002. Finally, we selected the best model according to the highest BLEU (Papineni et al., 2002) score on the development set, and used that model to translate the test set. Given that the test set is very small, we also performed a human evaluation to analyse whether the quality obtained was good enough.
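The checkpoint selection step could look like the following sketch, which assumes the sacrebleu package (the paper does not name its BLEU tooling) and one detokenised hypothesis file per saved epoch; the file naming is ours:

import glob
import sacrebleu

def best_checkpoint(dev_ref_path, hyp_glob="dev.hyp.epoch*.txt"):
    # Score each checkpoint's development-set translation and keep the best.
    with open(dev_ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    best_path, best_bleu = None, -1.0
    for path in sorted(glob.glob(hyp_glob)):
        with open(path, encoding="utf-8") as f:
            hyps = [line.strip() for line in f]
        bleu = sacrebleu.corpus_bleu(hyps, [refs]).score
        if bleu > best_bleu:
            best_path, best_bleu = path, bleu
    return best_path, best_bleu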

6 Results and analysis

We trained different neural network topologies following the conclusions of Section 3. In each experiment we varied the hyper-parameters described in Section 3. Since the total combined vocabulary was fixed to |V| = 42.4k, from Figure 1 we can infer the combinations of hyper-parameters with which the allowed model size will not be exceeded.

Table 2 shows the values of the hyper-parameters used in each experiment, together with the BLEU score obtained by each model and its size.

The best model according to the BLEU score on the development set is the one trained with 2 layers, 128 units in the recurrent layer, and |ω| = 128; it is also the smallest model among those analysed in Table 2.



Table 2: Hyper-parameter values for the different experiments (exp) conducted and the results obtained. |ω| is the size of the word embedding vector; model size is expressed in megabytes.

exp   |ω|   layers   rnn   size   BLEU dev   BLEU test
1     128   2        128   146    39.0       26.6
2     128   3        128   151    36.9       22.8
3     128   4        128   155    37.7       21.8
4     64    4        256   206    38.7       23.8
5     256   4        64    203    32.7       21.1


6.1 Problems found

Analysing the translations from the test set, we found 3 different problems. In the following, we describe each of them and propose the corresponding solutions.

6.1.1 Repeated words problem

Analysing the quality of the best model obtained, we noticed that sentences with more than 7 words were translated correctly. However, translations of very short sentences contained repeated words, e.g. "perro perro perro perro perro perro". Our hypothesis was that this was caused by differences between training and test sentence lengths. To check the validity of this hypothesis, we analysed the histogram of sentence lengths in the training set, shown in Figure 2. As can be seen, the source side of the training data contains very few sentences shorter than 8 words, in contrast to the target side, where the distribution of sentence lengths is more uniform. We believe this difference is caused by the sentence selection algorithm used: selection is conducted on the source language, and the selection algorithm tends to assign higher scores to longer sentences, since the more n-grams a source sentence contains, the more likely it is to include infrequent n-grams. To cope with this, we modified the Infrequent n-gram Recovery strategy as follows:

Re-scoring of sentences: To fix the problem of repeated words, we modified the sentence selection procedure by adding a normalisation step to the Infrequent n-gram Recovery scoring function. Equation 1 thus becomes:

$$ i(f) = \sum_{w \in \chi} \frac{\min\big(1, N(w)\big)\, \max\big(0,\, t - C(w)\big)}{|f| - |w| + 1} \qquad (2) $$

where the denominator normalises each term by the number of n-grams of order |w| in a sentence f of length |f|. With this normalisation, we avoid the side-effect of sentence length on the infrequency score, ultimately leading to the selection of shorter sentences and improving the NMT system's translation of such sentences.
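In the sketch from Section 4, the change amounts to dividing each term by the number of n-grams of the same order as w; everything else stays as in Equation 1:

def normalised_infrequency_score(sentence, chi, selected_counts, t=30, max_n=5):
    # Equation 2: as Equation 1, but each term is divided by the number of
    # n-grams of order len(w) in the sentence, removing the length bias.
    tokens = sentence.split()
    present = Counter(ngrams(tokens, max_n))
    score = 0.0
    for w in chi:
        if w in present:
            score += max(0, t - selected_counts[w]) / (len(tokens) - len(w) + 1)
    return score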

After applying the infrequency score of Equation 2 to select the data anew, we obtained 667k sentences. In Table 3 we show the average sentence length in the source and target languages before and after applying the sentence length normalisation; the average sentence length of the Tatoeba corpus is shown for comparison purposes. As shown, we obtain much shorter sentences by including the normalisation. The model achieved 36.3 BLEU on development and 22.8 on test, with a model size of 121MB. Although this score is slightly worse than the one achieved in experiment 1 (Table 2), we believe BLEU is not always the most adequate metric for evaluating translation quality (Shterionov et al., 2017). By manually analysing the hypotheses, we concluded that the repeated words problem had been successfully solved.

Table 3: Average sentence length of Tatoeba and of the training set before and after applying the normalisation in Equation 2.

                              source   target
Tatoeba                       7.1      6.8
train, before normalisation   17.4     15.1
train, after normalisation    10.4     9.0

6.1.2 Punctuation mark expectation

Analysing the hypotheses generated by our new model, we noticed that it produced wrong translations for very short sentences, e.g. "dog" or "cat", generating surprising translations such as "amor". However, when a punctuation mark was added to the source sentence, e.g. "dog.", the translation was correctly produced. Our first intuition was that the model was expecting a punctuation mark at the end of each sentence. This intuition was confirmed by the fact that 94% of the sentences in the source-language training set ended with a dot or another punctuation mark.



[Figure 2: two histograms (log scale, 10^0 to 10^5 sentences) of sentence lengths from 5 to 20 words.]

(a) Histogram of the source training set (English).

(b) Histogram of the target training set (Spanish).

Figure 2: Histograms of training set sentence lengths.

Moreover, one of the most common final words in sentences without a punctuation mark was precisely "amor". Hence, the network was confused (i.e., the model was poorly estimated) when no such punctuation mark appeared. To deal with this problem, we devised two possible solutions:

Special word ending: We append a special word @@ at the end of each sentence. The model is then forced to learn that a sentence always finishes with @@, and that the second-to-last word may or may not be a punctuation mark. This was applied as pre- and post-processing steps, and will be referred to as special word ending. The model trained using special word ending achieved 36.4 BLEU on development and 26.3 BLEU on test. This model was obtained after 21 epochs and its size was 121MB.
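As a sketch, the pre- and post-processing steps reduce to the following; note (our caveat, not the paper's) that "@@" is also the default subword separator of common BPE implementations, so the marker would have to be added after BPE is applied, or otherwise protected from it:

END_TOKEN = "@@"

def add_end_token(line):
    # Pre-processing: make the true end of the sentence explicit for the model.
    return line.rstrip() + " " + END_TOKEN

def strip_end_token(line):
    # Post-processing: remove the marker from the system output.
    return " ".join(word for word in line.split() if word != END_TOKEN)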

Double corpus: We enlarged the training corpus by adding a copy of every sentence ending with a punctuation mark, with that mark removed. By doing so, the model is able to learn that a sentence can finish with or without punctuation marks. This time, the model had a size of 156MB, and reached 37.3 BLEU on development and 25.1 on test.
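A sketch of the corpus construction, under two assumptions of ours: ".", "?" and "!" count as final marks, and a copy is added only when both sides of the pair end with one of them:

FINAL_MARKS = (".", "?", "!")

def double_corpus(pairs):
    # Append, for every pair ending in a final punctuation mark, a copy of
    # the pair with that mark stripped from both sides.
    doubled = list(pairs)
    for src, tgt in pairs:
        src, tgt = src.rstrip(), tgt.rstrip()
        if src.endswith(FINAL_MARKS) and tgt.endswith(FINAL_MARKS):
            doubled.append((src[:-1].rstrip(), tgt[:-1].rstrip()))
    return doubled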

Both techniques described above solved the problem of punctuation mark expectation. However, since the double corpus strategy produced a larger model with a lower BLEU score, we decided to employ the special word ending technique.

6.1.3 Missed segments

Further analysing the translations generated by our model, we noticed an additional problem: when the segment being translated was composed of several short sentences, only the first of them was actually translated. For instance, "Thank you. That was really helpful." was translated into "Gracias." ("Thank you.").

To solve this problem, we applied a preprocessing step consisting of separating segments composed of several sentences into different segments, according to the punctuation marks ".", "?" and "!" for English, and also "¿" for Spanish. We split the 86 sentences of the test set into 118 segments, given that most of them were composed of short sentences. After this preprocessing step, the translations were correctly generated, reaching 36.4 BLEU on development and 33.7 BLEU on test, the latter being the highest score so far.
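A minimal version of this splitting step; the first pattern splits after ".", "?" or "!", and for Spanish a second pattern also splits right before an opening "¿":

import re

def split_segment(segment, lang="en"):
    # Split a multi-sentence segment so that each sentence is translated alone.
    pieces = re.split(r"(?<=[.?!])\s+", segment.strip())
    if lang == "es":
        pieces = [q for p in pieces for q in re.split(r"\s*(?=¿)", p)]
    return [p for p in pieces if p]

# split_segment("Thank you. That was really helpful.")
# -> ["Thank you.", "That was really helpful."]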

7 Final evaluation

Table 4 summarises the BLEU scores obtained after applying each of the solutions described in Section 6. After applying the normalised infrequency score, the special word ending, and the preprocessing of composed segments, we improved the quality on the test set by about 7 BLEU points.

Table 4: Translation quality, as measured by BLEU, after applying each technique described. Size is given in MB.

technique                 size   BLEU dev   BLEU test
Base model                146    39.0       26.6
Re-scoring of sentences   121    36.3       22.8
Special word ending       121    36.4       26.3
Double corpus             156    37.3       25.1
Sentence splitting        121    36.4       33.7



As a final evaluation of our translation system, we compared its quality with that of Google's and Microsoft's mobile translators. The BLEU score obtained on the test set by each of the analysed translators, along with the corresponding model sizes, is shown in Table 5. In general, all translators generate good quality hypotheses, although some small differences can be observed. We noticed that our model was especially accurate when using punctuation marks and capital letters, whereas Google's translator introduced punctuation marks in wrong places and used capital letters only in a few cases. We believe this is the reason why Google's translator achieved such a low BLEU score compared to the other two systems. However, Google's translator features a much smaller model than the other two. Also, Google's and Microsoft's models are bidirectional, which means that the size of our model should be doubled (2 · 121MB) to be comparable.

Table 5: Translation quality and model size comparison for Google, Microsoft and our best model.

                  Google   Microsoft   our system
BLEU              16.7     28.0        33.7
Model size        29MB     234MB       121MB
Both directions   YES      YES         NO

8 Conclusions

In this work, we have presented our approach to developing a small neural machine translation engine for mobile devices, for the specific case of English–Spanish. We leveraged a data selection technique to select the data best suited to the real use of our translator, and presented some adjustments to the selection algorithm that improved the translation quality obtained. We also proposed a solution to deal with the problem of repeated words, and another for dealing with missed sentence translations within some segments. Finally, we compared the quality of our model with Google's and Microsoft's mobile translator versions. We significantly surpass the BLEU score of both translators, partially due to being able to translate punctuation marks and capital letters correctly. Our model reached a size of 121MB, much smaller than the size we initially considered acceptable, while presenting good translation quality for its specific purpose (the travel domain). The translations obtained by our model are perfectly understandable and fluent, and can be used in scenarios where there is no internet connection. In addition, we are still working on improving quality and on reducing model size even further, using other effective techniques such as weight pruning.

Acknowledgments

Work partially supported by MINECO under grant DI-15-08169 and by Sciling under its R+D programme.

References

Gascó, G. et al. (2012). Does more data always yield better translations? In Proc. of EACL, pages 152–161.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, pages 1735–1780.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Klein, G. et al. (2017). OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.

Luong, M. et al. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Mikolov, T. et al. (2013). Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546.

Papineni, K. et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, pages 311–318.

See, A. et al. (2016). Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274.

Sennrich, R. et al. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Shterionov, D. et al. (2017). Empirical evaluation of NMT and PBSMT quality for large-scale translation production. In Proc. of EAMT, pages 75–80.
