

arXiv:1610.05540v1 [cs.CL] 18 Oct 2016

    SYSTRAN’s Pure Neural Machine Translation Systems

Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart
Egor Akhanov, Patrice Brunelle, Aurélien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss

Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux
Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal

    Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou, Peter Zoldan

    [email protected]

    Abstract

Since the first online demonstration of Neural Machine Translation (NMT) by LISA (Bahdanau et al., 2014), NMT development has recently moved from laboratory to production systems, as demonstrated by several entities announcing the roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations, and the training process of such systems is usually very long, often a few weeks, so the role of experimentation is critical and important to share. In this work, we present our approach to production-ready systems simultaneously with the release of online demonstrators covering a large variety of languages (12 languages, for 32 language pairs). We explore different practical choices: an efficient and evolutive open-source framework; data preparation; network architecture; additional implemented features; tuning for production; etc. We discuss evaluation methodology, present our first findings and finally outline further work.

Our ultimate goal is to share our expertise to build competitive production systems for "generic" translation. We aim at contributing to set up a collaborative framework to speed up adoption of the technology, foster further research efforts and enable the delivery to, and adoption by, industry of use-case-specific engines integrated in real production workflows. Mastering the technology would allow us to build translation engines suited for particular needs, outperforming current simplest/uniform systems.

    1 Introduction

Neural MT has recently achieved state-of-the-art performance in several large-scale translation tasks. As a result, the deep learning approach to MT has received exponential attention, not only from the MT research community but also from a growing number of private entities that have begun to include NMT engines in their production systems.

In the last decade, several open-source MT toolkits have emerged—Moses (Koehn et al., 2007) is probably the best-known out-of-the-box MT system—coexisting with commercial alternatives, thus lowering the entry barriers and bringing new opportunities in both research and business areas. Following this direction, our NMT system is based on the open-source project seq2seq-attn1, initiated by the Harvard NLP group2 with Yoon Kim as main contributor. We are contributing to the project by sharing several features described in this technical report, which are available to the MT community.

Neural MT systems have the ability to directly model, in an end-to-end fashion, the association from an input text (in a source language) to its translation counterpart (in a target language). A major strength of Neural MT lies in the fact that all the necessary knowledge, such as syntactic and semantic information, is learned by taking the global sentence context into consideration when modeling translation. However, Neural MT engines are known to be computationally very expensive, sometimes needing several weeks to complete the training phase, even when making use of cutting-edge hardware to accelerate computations.

1 https://github.com/harvardnlp/seq2seq-attn
2 http://nlp.seas.harvard.edu


Since our interest covers a large variety of languages, and based on our long experience with machine translation, we do not believe that a one-fits-all approach would work optimally for languages as different as Korean, Arabic, Spanish or Russian. We therefore ran hundreds of experiments, and particularly explored language-specific behaviors. One of our goals is indeed to be able to inject existing language knowledge into the training process.

In this work we share our recipes and experience in building our first generation of production-ready systems for "generic" translation, setting a starting point to build specialized systems. We also report on extending the baseline NMT engine with several features that in some cases increase accuracy and/or efficiency, while others boost the learning curve and/or model speed. As a machine translation company, in addition to decoding accuracy for the "generic domain", we also pay special attention to features such as:

    • Training time

• Customization possibility: user terminology, domain adaptation

• Preserving and leveraging internal format tags and misc placeholders

• Practical integration in business applications: for instance online translation box, but also translation batch utilities, post-editing environment...

• Multiple deployment environments: cloud-based, customer-hosted environment or embedded for mobile applications

    • etc

More important than unique and uniform translation options, or reaching state-of-the-art research systems, our focus is to reveal language-specific settings and practical tricks to deliver this technology to the largest number of users.

The remainder of this report is organized as follows: Section 2 covers basic details of the NMT system employed in this work. A description of the translation resources is given in section 3. We report on the different experiments for trying to improve the system by guiding the training process in section 4, and in section 5 we discuss performance. In sections 6 and 7, we report on the evaluation of the models and on practical findings. We finish by describing work in progress for the next release.

    2 System Description

We base our NMT system on the encoder-decoder framework made available by the open-source project seq2seq-attn. With its roots in a number of established open-source projects such as Andrej Karpathy's char-rnn,3 Wojciech Zaremba's standard long short-term memory (LSTM)4 and the rnn library from Element-Research,5 the framework provides a solid NMT basis consisting of LSTM as the recurrent module and faithful reimplementations of the global-general-attention model and input-feeding at each time step of the RNN decoder, as described by Luong et al. (2015).

It also comes with a variety of features such as the ability to train with bidirectional encoders and pre-trained word embeddings, the ability to handle unknown words during decoding by substituting them either by copying the source word with the most attention or by looking up the source word in an external dictionary, and the ability to switch between CPU and GPU for both training and decoding. The project is actively maintained by the Harvard NLP group6.

Over the course of the development of our own NMT system, we have implemented additional features as described in Section 4, and contributed back to the open-source community by making many of them available in the seq2seq-attn repository.

seq2seq-attn is implemented on top of the popular scientific computing library Torch.7 Torch uses Lua, a powerful and light-weight scripting language, as its front-end and uses the C language where efficient implementations are needed. The combination results in a fast and efficient system both at development and at run time. As an extension, to fully benefit from multi-threading, optimize CPU and GPU interactions, and have finer control on the objects at runtime (sparse matrices, quantized tensors, ...), we developed a C-based decoder using the C APIs of Torch, called C-torch, explained in detail in section 5.4.

The number of parameters within an NMT model can grow to hundreds of millions, but there are also a handful of meta-parameters that need to be manually determined.

3 https://github.com/karpathy/char-rnn
4 https://github.com/wojzaremba/lstm
5 https://github.com/Element-Research/rnn
6 http://nlp.seas.harvard.edu
7 http://torch.ch


Model          Embedding dimension: 400-1000
               Hidden layer dimension: 300-1000
               Number of layers: 2-4
               Uni-/Bi-directional encoder
Training       Optimization method
               Learning rate
               Decay rate
               Epoch to start decay
               Number of epochs
               Dropout: 0.2-0.3
Text unit      Vocabulary selection
(Section 4.1)  Word vs. Subword (e.g. BPE)
Train data     Size (quantity vs. quality)
(Section 3)    Max sentence length
               Selection and mixture of domains

Table 1: There are a large number of meta-parameters to be considered during training. The optimal set of configurations differs from language pair to language pair.

For some of these meta-parameters, much previous work presents clear choices regarding their effectiveness, such as using the attention mechanism or feeding the previous prediction as input to the current time step in the decoder. However, there are still many more meta-parameters that have different optimal values across datasets, language pairs, and the configurations of the rest of the meta-parameters. In Table 1, we list the meta-parameter space that we explored during the training of our NMT systems.
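To make this search space concrete, the sketch below shows one hypothetical configuration expressed as a plain Python dictionary. The field names and values are illustrative choices within the ranges of Table 1, not the parameters of our released systems (those are detailed in Appendix B).

```python
# Illustrative only: a hypothetical training configuration drawn from the
# ranges listed in Table 1. Names and values are ours, not the released settings.
example_config = {
    "embedding_dim": 500,          # 400-1000
    "hidden_dim": 800,             # 300-1000
    "num_layers": 4,               # 2-4
    "bidirectional_encoder": True,
    "optimizer": "sgd",
    "learning_rate": 1.0,
    "decay_rate": 0.7,
    "start_decay_epoch": 10,
    "num_epochs": 18,
    "dropout": 0.3,                # 0.2-0.3
    "vocabulary_size": 60000,      # vocabulary shortlist (word-level)
    "max_sentence_length": 50,
}
```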

In appendix B, we detail the parameters used for the online system of this first release.

    3 Training Resources

Training "generic" engines is a challenge, because there is no such notion as generic translation, which is what online translation service users expect from these services. Indeed, online translation covers a very large variety of use cases, genres and domains. Also, available open-source corpora are domain-specific: Europarl (Koehn, 2005), JRC (Steinberger et al., 2006) or MultiUN (Chen and Eisele, 2012) are legal texts, TED talks are scientific presentations, open subtitles (Tiedemann, 2012) are colloquial, etc.

As a result, the training corpora we used for this release were built as a weighted mix of all the available sources. For languages with large resources, we reduced the ratio of the institutional (Europarl, UN-type) and colloquial types, giving preference to news-type texts and a mix of webpages (like Gigaword).

Our strategy, in order to enable more experiments, was to define 3 sizes of corpora for each language pair: a baseline corpus (1 million sentences) for quick experiments (day-scale), a medium corpus (2-5M) for real-scale systems (week-scale), and a very large corpus with more than 10M segments.

The amounts of data used to train the online systems are reported in Table 2, while most of the individual experimental results reported in this document are obtained with the baseline corpora.

Note that the size of the corpus needs to be considered together with the number of training periods, since the neural network is continuously fed with sequences of sentence batches until the network is considered trained. In Junczys-Dowmunt et al. (2016), the authors mention using a corpus of 5M sentences and training on 1.2M batches of 40 sentences each, meaning basically that each sentence of the full corpus is presented 10 times to the training. In Wu et al. (2016), the authors mention 2M steps of 128 examples for English–French, for a corpus of 36M sentences, meaning about 7 iterations over the complete corpus. In our framework, for this release, we systematically extended the training up to 18 epochs, and for some languages up to 22 epochs.

Selection of the optimal system is made after the complete training, by calculating scores on independent test sets. As an outcome, we have seen different behaviours for different language pairs with similar training corpus sizes, apparently connected to the language pair complexity. For instance, English–Korean training perplexity still decreases significantly between epochs 13 and 19, while Italian–English perplexity decreases only marginally after epoch 10. For most languages, in our set-up, optimal systems are achieved around epoch 15.

We also ran some experiments on corpus size. Intuitively, since NMT systems do not have the memorizing capacity of PBMT engines, whether the training uses a 10M-sentence corpus 10 times, or a 5M corpus 20 times, should not make a huge difference.

In one of the experiments, we compared training on a 5M corpus over 20 epochs for English to/from French with training on the same 5M corpus for only 10 epochs, followed by 10 additional epochs on an additional 5M corpus, the full 10M being completely homogeneous. In both directions, we observe that the 5M × 10 + 5M × 10 training completes with a score improvement of 0.8-1.2 compared to the 5M × 20 training, showing that the additional corpus manages to bring a meaningful improvement. This observation leads to a more general question about how much corpus is needed to actually build a high-quality NMT engine (learn the language), the role and timing of diversity in the training, and whether the incremental gain could not be substituted by terminology feeding (learn the lexicon).

    4 Technology

In this section we account for several experiments that improved different aspects of our translation engines. These experiments range from preprocessing techniques to extensions of the network: the ability to handle named entities, the use of multiple word features, and enforcing the attention module to be more like word alignments. We also report on different levels of translation customization.

    4.1 Tokenization

All corpora are preprocessed with an in-house toolkit. We use standard token separators (spaces, tabs, etc.) as well as a set of language-dependent linguistic rules. Several kinds of entities are recognized (URLs and numbers), replacing their content with an appropriate placeholder. A postprocess is used to detokenize translation hypotheses, where the original raw text format is regenerated following equivalent techniques.
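As an illustration of the placeholder mechanism, the sketch below is a simplified stand-in for our in-house toolkit, whose language-dependent rules are not reproduced here: it recognizes URLs and numbers with two simple regular expressions, swaps the entity content for a placeholder token, and remembers the original strings so a post-process can restore them. The placeholder names are approximations.

```python
import re

# Hedged sketch of entity protection during tokenization; the real toolkit uses
# language-dependent linguistic rules, not just these two regular expressions.
URL_RE = re.compile(r"https?://\S+")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)?")

def protect_entities(sentence):
    """Replace URLs and numbers by placeholders and keep the originals aside."""
    saved = []
    def _sub(match, tag):
        saved.append((tag, match.group(0)))
        return tag
    sentence = URL_RE.sub(lambda m: _sub(m, "ent_url"), sentence)
    sentence = NUM_RE.sub(lambda m: _sub(m, "ent_numeric"), sentence)
    return sentence, saved

def restore_entities(translation, saved):
    """Detokenization step: put the original (or translated) values back."""
    for tag, value in saved:
        translation = translation.replace(tag, value, 1)
    return translation

protected, saved = protect_entities("Visit https://systran.net now, it costs 25 dollars.")
print(protected)   # Visit ent_url now, it costs ent_numeric dollars.
print(restore_entities(protected, saved))
```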

For each language, we have access to language-specific tokenization and normalization rules. However, our preliminary experiments showed that there was no obvious gain from using these language-specific tokenization patterns, and that some of the hardcoded rules were actually degrading the performance. This would need more investigation, but for the release of our first batch of systems, we used a generic tokenization model for most of the languages except Arabic, Chinese and German. In our past experience with Arabic, separate segmentation of clitics was beneficial, and we retained the same procedure.

For German and Chinese, we used in-house compound splitting and word segmentation models, respectively.

In our current NMT approach, vocabulary size is an important factor that determines the efficiency and the quality of the translation system; a larger vocabulary size correlates directly with greater computational cost during decoding, whereas low vocabulary coverage leads to severe out-of-vocabulary (OOV) problems, hence lowering translation quality.

In most language pairs, our strategy combines a vocabulary shortlist and a placeholder mechanism, as described in Sections 4.2 and 4.3. This approach is, in general, a practical and linguistically robust option for addressing the fixed-vocabulary issue, since we can take full advantage of internal manually crafted dictionaries and customized user dictionaries (UDs).

A number of previous works, such as character-level (Chung et al., 2016), hybrid word-character-based (Luong and Manning, 2016) and subword-level (Sennrich et al., 2016b) approaches, address issues that arise with morphologically rich languages such as German, Korean and Chinese. These approaches either build accurate open-vocabulary word representations on the source side or improve the translation model's generative capacity on the target side. Among these approaches, subword tokenization yields competitive results, achieving excellent vocabulary coverage and good efficiency at the same time.

For two language pairs, enko and jaen, we used source and target sub-word tokenization (BPE, see (Sennrich et al., 2016b)) to reduce the vocabulary size but also to deal with the rich morphology and spacing flexibility that can be observed in Korean. Although this approach is very appealing in its simplicity, and is also used systematically in (Wu et al., 2016) and (Junczys-Dowmunt et al., 2016), it is not free of side effects (for instance, the generation of impossible words) and is not optimal for dealing with actual word morphology: the same suffix (josa in Korean), depending on the frequency of the word ending it is attached to, will be split into multiple representations. Also, in Korean these josa agree with the preceding syllable based on its final ending, but such simple information is not explicitly or implicitly reachable by the neural network.

The sub-word encoding algorithm Byte Pair Encoding (BPE), described by Sennrich et al. (2016b), was re-implemented in C++ for further speed optimization.
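For reference, the core of the BPE learning procedure is only a few lines; the sketch below is a minimal Python version of the pair-counting and merging loop of Sennrich et al. (2016b), not our C++ reimplementation.

```python
import collections

def learn_bpe(tokens, num_merges):
    """Minimal BPE: tokens is a list of words; returns the list of learned merges."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = collections.Counter(tuple(word) + ("</w>",) for word in tokens)
    merges = []
    for _ in range(num_merges):
        pairs = collections.Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = collections.Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=5))
```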

    4.2 Word Features

Sennrich and Haddow (2016) showed that using additional input features improves translation quality. Similarly to this work, we introduced in the framework support for an arbitrary number of discrete word features as additional inputs to the encoder. Because we do not constrain the number of values these features can take at the same time, we represent them with continuous and normalized vectors. For example, the representation of a feature f at time step t is:

x^{(t)}_i = \begin{cases} 1/n_f & \text{if } f \text{ takes the } i\text{-th value} \\ 0 & \text{otherwise} \end{cases}    (1)

where n_f is the number of possible values the feature f can take, and x^{(t)} \in \mathbb{R}^{n_f}.

These representations are then concatenated with the word embedding to form the new input to the encoder.
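A minimal sketch of this input representation, using NumPy rather than the actual Torch code of seq2seq-attn: each discrete feature is encoded as the normalized indicator vector of equation (1) and concatenated with the word embedding. Dimensions and feature sizes below are arbitrary examples.

```python
import numpy as np

def feature_vector(value_index, n_f):
    """Equation (1): a vector in R^{n_f} with 1/n_f at the active value, 0 elsewhere."""
    x = np.zeros(n_f)
    x[value_index] = 1.0 / n_f
    return x

def encoder_input(word_embedding, feature_values, feature_sizes):
    """Concatenate the word embedding with one vector per feature."""
    parts = [word_embedding]
    for idx, n_f in zip(feature_values, feature_sizes):
        parts.append(feature_vector(idx, n_f))
    return np.concatenate(parts)

emb = np.random.randn(500)                                        # example word embedding
x = encoder_input(emb, feature_values=[2], feature_sizes=[4])     # e.g. a 4-valued case feature
print(x.shape)                                                    # (504,)
```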

We extended this work by also supporting additional features on the target side, which are predicted by the decoder. We used the same input representation as on the encoder side, but shifted the feature sequence relative to the word sequence so that the prediction of the features at time step t depends on the word at time step t that they annotate. In practice, we generate the features at time t+1 for the word generated at time t.

To learn these target features, we added a linear layer to the decoder followed by the softmax function, and used the mean squared error criterion to learn the correct representation.

For this release, we only used case information as an additional feature. It allows us to work with a lowercased vocabulary and treat recasing as a separate problem. We observed that the use of this simple case feature on source and target does improve the translation quality, as illustrated in Figure 1. We also compared the accuracy of the induced recasing with other recasing frameworks (SRI disamb, and an in-house recasing tool based on n-gram language models) and observed that the prediction of case by the neural network was better than using an external recaser, which was expected since the network has access to the source sentence context in addition to the target sentence history.

[Figure 1 plot: BLEU and perplexity as a function of the training epoch for the baseline, source case, src&tgt and src&tgt+embed configurations.]

Figure 1: Comparison of training progress (perplexity/BLEU) with/without source (src) and target (tgt) case features, and with/without feature embedding (embed), on the WMT2013 test corpus for English-French. Scores are calculated on lowercased output. The perplexity increases when the target features are introduced because of the additional classification problem. We also notice a noticeable increase in the score when introducing the features, in particular the target features. So these features do not simply help to reduce the vocabulary, but also by themselves help to structure the NN decoding model.

    4.3 Named Entities (NE)

SYSTRAN's RBMT and SMT translation engines utilize a number and named entity (NE) module to recognize, protect, and translate such entities. Similarly, we used the same internal NE module to recognize numbers and named entities in the source sentence and temporarily replaced them with their corresponding placeholders (Table 3).

Both the source and the target side of the training dataset need to be processed for NE placeholders. To ensure correct entity recognition, we cross-validate the recognized entities across the parallel dataset; that is, a valid entity recognition in a sentence should have the same type of entity in its parallel counterpart, and the words or phrases covered by the entities need to be aligned to each other. We used fast_align (Dyer et al., 2013) to automatically align source words to target words.
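The cross-validation step can be sketched as follows (a simplified stand-in for our pipeline, assuming entities are given as token spans with a type and alignments as source-target index pairs produced by fast_align): an entity is kept only if every source token it covers aligns into a target entity of the same type.

```python
def validate_entities(src_entities, tgt_entities, alignments):
    """Keep source entities whose aligned target tokens fall inside a target
    entity of the same type. Entities: (type, start, end) token spans,
    end exclusive; alignments: set of (src_index, tgt_index) pairs."""
    valid = []
    for etype, s_start, s_end in src_entities:
        tgt_positions = {t for s, t in alignments if s_start <= s < s_end}
        covered = any(
            etype == t_type and all(t_start <= t < t_end for t in tgt_positions)
            for t_type, t_start, t_end in tgt_entities
        )
        if tgt_positions and covered:
            valid.append((etype, s_start, s_end))
    return valid

# "He pays 25 dollars" / "Il paie 25 dollars": the number aligns to a number entity.
src = [("ent_numeric", 2, 3)]
tgt = [("ent_numeric", 2, 3)]
print(validate_entities(src, tgt, {(0, 0), (1, 1), (2, 2), (3, 3)}))
```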

In our datasets, generally about one-fifth of the training instances contained one or more NE placeholders. Our training dataset consists of sentences with NE placeholders as well as sentences without them, to be able to handle both instantiated and recognized entity types.

NE type            Placeholder
Number             ent numeric
Measurement        ent numex measurement
Money              ent numex money
Person (any)       ent person
Title              ent person title
First name         ent person firstname
Initial            ent person initials
Last name          ent person lastname
Middle name        ent person middlename
Location           ent location
Organization       ent organization
Product            ent product
Suffix             ent suffix
Time expressions   ent timex expression
Date               ent date
Date (day)         ent date day
Date (Month)       ent date month
Date (Year)        ent date year
Hour               ent hour

Table 3: Named entity placeholder types

For each source sentence, a list of all entities, along with their translations in the target language if available, is returned by our internal NE recognition module. The entities in the source sentence are then replaced with their corresponding NE placeholders. During beam search, we make sure that an entity placeholder is translated by itself in the target sentence. When the entire target sentence has been produced, along with the attention weights that provide soft alignments back to the original source tokens, placeholders in the target sentence are replaced with either the original source string or its translation.

The substitution of the NE placeholders with their correct values needs language pair-specific considerations. In Table 4, we show that even the handling of Arabic numerals cannot be as straightforward as copying the original value from the source text.

    4.4 Guided Alignments

We re-implemented the guided alignment strategy described in Chen et al. (2016). Guided alignment enforces the attention weights to be more like alignments in the traditional sense (e.g. IBM-4 Viterbi alignments), where the word alignments explicitly indicate that source words aligned to a target word are translation equivalents.

Similarly to the previous work, we created an additional criterion on the attention weights, L_ga, such that the difference between the attention weights and the reference alignments is treated as an error, directly and additionally optimizing the output of the attention module.

En → Ko
Train
  train data           25 billion            250억
  entity-replaced      ent numeric billion   ent numeric억
Decode
  input data           1.4 billion           -
  entity-replaced      ent numeric billion   -
  translated           -                     ent numeric억
  naive substitution   -                     1.4억
  expected result      -                     14억

Table 4: Examples of English and Korean number expressions where naive recognition and substitution fails. Even if the model produces the correct placeholders, simply copying the original value will result in an incorrect translation. These kinds of structural entities need language pair-specific treatment.


L_{ga}(A, \alpha) = \frac{1}{t} \sum_{s,t} (A_{st} - \alpha_{st})^2

The final loss function for the optimization is then:

L_{total} = w_{ga} \cdot L_{ga}(A, \alpha) + (1 - w_{ga}) \cdot L_{dec}(y, x)

where A is the alignment matrix, \alpha the attention weights, s and t the indices in the source and target sentences, and w_{ga} the linear combination weight for the guided alignment loss.

Chen et al. (2016) also report that decaying w_{ga}, thereby gradually reducing the influence of guided alignment over the course of training, was found to be helpful on certain datasets. When guided alignment decay is enabled, w_{ga} is gradually reduced at this decay rate after each epoch from the beginning of the training.

Without searching for the optimal parameter values, we simply took the following configuration from the literature: mean squared error (MSE) as the loss function, 0.5 as the linear-combination weight between the guided alignment loss and the cross-entropy loss of the decoder, and 0.9 as the decay factor for the guided alignment loss.
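A NumPy sketch of the resulting objective follows; it is illustrative, not the seq2seq-attn code, and the normalization follows the loss as written above (details may differ in the actual implementation). The MSE between the reference alignment matrix and the attention weights is linearly combined with the decoder loss, and the guided-alignment weight is decayed after each epoch.

```python
import numpy as np

def guided_alignment_loss(A, alpha):
    """MSE between reference alignments A and attention weights alpha,
    both of shape (source_len, target_len); normalized by the target length."""
    return ((A - alpha) ** 2).sum() / A.shape[1]

def total_loss(A, alpha, decoder_loss, w_ga=0.5):
    """Linear combination of the guided-alignment loss and the decoder loss."""
    return w_ga * guided_alignment_loss(A, alpha) + (1.0 - w_ga) * decoder_loss

# Decay of the guided-alignment weight after each epoch (decay factor 0.9).
w_ga = 0.5
for epoch in range(5):
    # ... train one epoch, computing total_loss(..., w_ga=w_ga) per batch ...
    w_ga *= 0.9
```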

For the alignments, we again utilized the fast_align tool. We store the alignments in a sparse format8 to save memory, and for each minibatch a dense matrix is created for faster computation.

8 Compressed Column Storage (CCS)

[Figure 2 plot: BLEU as a function of the training epoch for the baseline, guided alignment, and guided alignment with decay configurations.]

Figure 2: Effect of guided alignment. This particular learning curve is from training an attention-based model with 4 layers of bidirectional LSTMs of dimension 800 on a 5 million sentence French-to-English dataset.

Applying such a constraint on the attention weights can help locate the original source word more accurately, which we hope benefits NE placeholder substitution and especially unknown-word handling.

In Figure 2, we see the effects of guided alignment and its decay rate on our English-to-French generic model. Unfortunately, this full comparative run was disrupted by a power outage and we did not have time to relaunch it; however, we can still clearly observe that, up to the initial 5 epochs, guided alignment, with or without decay, provides rather big boosts over the baseline training. After 5 epochs, the training with decay slows down compared to the training without, which is rather intuitive: the guided alignment is indeed in conflict with the attention learning. What remains to be seen is whether, by the end of the training, the baseline and the guided alignment with decay converge.

    4.5 Politeness Mode

Many languages have ways to express politeness and deference towards the people referred to in a sentence. In Indo-European languages, two pronouns correspond to the English You; this is called the T-V distinction, between the informal Latin pronoun tu (T) and the polite Latin pronoun Vos (V). Asian languages, such as Japanese and Korean, make extensive use of honorifics (respectful words), which are usually appended to names or pronouns to indicate the relative ages and social positions of the speakers. Expressing politeness can also impact the vocabulary of verbs, adjectives, and nouns used, as well as sentence structures.

Mode       Correct   Incorrect   Accuracy
Formal     30        14          68.2%
Informal   35        9           79.5%

Table 6: Accuracy of generating the correct politeness mode with an English-to-Korean NMT system. The evaluation was carried out on a set of 50 sentences only; 6 sentences were excluded from the evaluation because neither the original nor their translations contained any verbs.

Following the work of Sennrich et al. (2016a), we implemented a politeness feature in our NMT engine: a special token is added to each source sentence during training, indicating the politeness mode observed in the target sentence. Having the ability to specify the politeness mode is very useful, especially when translating from a language where politeness is not expressed, e.g. English, into one where such expressions are abundant, e.g. Korean, because it provides a way of customizing the politeness mode of the translation output. Table 5 presents outputs of our English-to-Korean NMT model trained with politeness mode, and it is clear that the proper verb endings are generated according to the user selection. After a preliminary evaluation on a small English-to-Korean test set, we observed 70 to 80% accuracy of politeness generation (Table 6). We also noticed that 86% of sentences (43 out of 50) have exactly the same meaning preserved across the different politeness modes.
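The mechanism reduces to annotating the training source side; a toy sketch follows, in which the token strings are hypothetical placeholders rather than the exact tokens used in our engines.

```python
def add_politeness_token(source_sentence, target_is_polite):
    """Prepend a side-constraint token describing the politeness observed
    in the target sentence (training) or requested by the user (decoding)."""
    token = "<polite>" if target_is_polite else "<informal>"
    return token + " " + source_sentence

print(add_politeness_token("How are you ?", target_is_polite=True))
# <polite> How are you ?
```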

This simple approach, however, comes at a small price: sometimes the unknown-word replacement scheme tries to copy the special token into the target generation. A more appropriate approach, which we plan to switch to in our future trainings, is to directly feed the politeness mode into the sentential representation of the decoder.

    4.6 Customization

Domain adaptation is a key feature for our customers—it generally encompasses terminology, domain and style adaptation, but can also be seen as an extension of translation memory for human post-editing workflows.

SYSTRAN engines integrate multiple techniques for domain adaptation: training full new in-domain engines, automatically post-editing an existing translation model using translation memories, and extracting and re-using terminology. With Neural Machine Translation, a new notion of "specialization" comes close to the concept of incremental translation as developed for statistical machine translation (Ortiz-Martínez et al., 2010).

    4.6.1 Generic Specialization

Domain adaptation techniques have successfully been used in Statistical Machine Translation. It is well known that a system optimized on a specific text genre obtains higher accuracy than a "generic" system. The adaptation process can be done before, during or after the training process. Our preliminary experiments follow the latter approach: we incrementally adapt a "generic" Neural MT system to a specific domain by running additional training epochs over newly available in-domain data.

Adaptation proceeds incrementally as new in-domain data becomes available, generated by human translators while post-editing, which is similar to the Computer Aided Translation framework described in (Cettolo et al., 2014).

We experiment on an English-to-French translation task. The generic model is trained on a subsample of the corpora made available for the WMT15 translation task (Bojar et al., 2015). The source and target NMT vocabularies are the 60k most frequent words of the source and target training datasets. The in-domain data is extracted from the European Medicines Agency (EMEA) corpus. Table 7 shows some statistics of the corpora used in this experiment.

Type    Corpus    # lines   # src tok (EN)   # tgt tok (FR)
Train   Generic   1M        24M              26M
Train   EMEA      4,393     48k              56k
Test    EMEA      2,787     23k              27k

Table 7: Data used to train and adapt the generic model to a specific domain. The test corpus also belongs to the specific domain.

Our preliminary results show that incremental adaptation is effective even for limited amounts of in-domain data (nearly 50k additional words). Constrained to use the original "generic" vocabulary, adaptation of the models can be run in a few seconds, showing clear quality improvements on in-domain test sets.
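A minimal sketch of this adaptation loop, under the stated constraint that the generic vocabulary is frozen: in-domain tokens outside that vocabulary are mapped to an unknown-word symbol and a few additional epochs are run from the generic checkpoint. The `model.train_epoch` call and the `<unk>` symbol are hypothetical stand-ins for the usual NMT training step, not our actual API.

```python
def restrict_to_vocab(tokens, vocab, unk="<unk>"):
    """The generic vocabulary is frozen: unseen in-domain tokens map to <unk>."""
    return [tok if tok in vocab else unk for tok in tokens]

def adapt(model, in_domain_pairs, vocab, epochs=2):
    """Continue training a generic checkpoint on in-domain data only.
    `model.train_epoch` is a stand-in for one pass of standard NMT training."""
    data = [(restrict_to_vocab(src, vocab), restrict_to_vocab(tgt, vocab))
            for src, tgt in in_domain_pairs]
    for _ in range(epochs):
        model.train_epoch(data)
    return model
```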

[Figure 3 plot: BLEU as a function of the number of additional epochs for the adapt and full systems.]

Figure 3: Adaptation with in-domain data.

Figure 3 compares the accuracy (BLEU) of two systems: full is trained on the concatenation of the generic and in-domain data; adapt is initially trained over the generic data (showing a BLEU score of 29.01 at epoch 0) and adapted by running several training epochs over only the in-domain training data. Both systems share the same "generic" NMT vocabularies. As can be seen, the adapt system drastically improves its accuracy after a single additional training epoch, obtaining a BLEU score similar to the full system (separated by 0.91 BLEU). Note also that each additional epoch over the in-domain training data takes less than 50 seconds to process, while training the full system needs more than 17 hours.

These results validate the utility of the adaptation approach. A human post-editor would take advantage of new training data as soon as it becomes available, without needing to wait for a long full training process. However, the comparison is not entirely fair, since a full training would allow the in-domain vocabulary to be included in the new full model, which would surely result in an additional accuracy improvement.

    4.6.2 Post-editing Engine

The recent success of Pure Neural Machine Translation has led to the application of this technology to various related tasks, and in particular to Automatic Post-Editing (APE). The goal of this task is to simulate the behavior of a human post-editor, correcting translation errors made by an MT system.

Until recently, most APE approaches have been based on phrase-based SMT systems, either monolingual (MT target to human post-edition) (Simard et al., 2007) or source-aware (Béchara et al., 2011).

[Figure 4 plot: BLEU as a function of the training epoch for NMT, NPE multi-source, NPE, SPE and RBMT.]

Figure 4: Accuracy results of RBMT, NMT, NPE and NPE multi-source.

For many years now, SYSTRAN has been offering a hybrid Statistical Post-Editing (SPE) solution to enhance the translation provided by its rule-based MT system (RBMT) (Dugast et al., 2007).

Following the success of Neural Post-Editing (NPE) in the APE task of WMT'16 (Junczys-Dowmunt and Grundkiewicz, 2016), we have run a series of experiments applying the neural approach to improving the RBMT system output. As a first experiment, we compared the performance of our English-to-Korean SPE system trained on technical (IT) domain data with two NPE systems trained on the same data: monolingual NPE and multi-source NPE, where the input-language and MT-hypothesis sequences are concatenated together into one input sequence (separated by a special token).
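Building the multi-source input amounts to a simple concatenation; a sketch follows, in which the separator token is a hypothetical placeholder rather than the exact token used in our systems.

```python
def multi_source_input(source_tokens, mt_hypothesis_tokens, separator="<sep>"):
    """Concatenate the source sentence and the MT hypothesis into a single
    input sequence for the neural post-editing encoder."""
    return source_tokens + [separator] + mt_hypothesis_tokens

print(multi_source_input(["the", "red", "car"], ["la", "voiture", "rouge"]))
```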

Figure 4 illustrates the accuracy (BLEU) results of the four different systems at different training epochs. The RBMT system performs poorly, confirming the importance of post-editing. Both NPE systems clearly outperform SPE. It can also be observed that adding source information, even in the simplest way possible (NPE multi-source), without any source-target alignment, considerably improves NPE translation results.

The NPE multi-source system obtains accuracy results similar to pure NMT. What can be seen is that NPE multi-source essentially employs the information from the original sentence to produce translations. However, notice that the benefit of utilizing multiple inputs from different sources is clear at earlier epochs, while once the model parameters converge, the difference in performance between the NMT and NPE multi-source models becomes negligible.

Further experiments are currently being conducted, aiming at finding more sophisticated ways of combining the original source and the MT translation in the context of NPE.

    5 Performance

As previously outlined, one of the major drawbacks of NMT engines is the need for cutting-edge hardware technology to face the enormous computational requirements at training and runtime.

Regarding training, there are two major issues: the full training time and the required computation power, i.e. the server investment. For this release, most of our trainings have been run on a single GeForce GTX 1080 GPU (about $2.5k), while in (Wu et al., 2016) the authors mention using 96 K80 GPUs for a full week to train one single language pair (about $250k). On our hardware, a full training on 2x5M sentences (see section 3) took a bit less than one month.

A reasonable target is to keep the training time for any language pair under one week while keeping the investment reasonable, so that the full research community can run competitive trainings, and also, indirectly, so that all of our customers can benefit from the training technology. To do so, we need to better leverage multiple GPUs on a single server, which is on-going engineering work. We also need to continue exploring how to learn more with less data. For instance, we are convinced that injecting terminology as part of the training data should be competitive with continuing to add full sentences.

Shortening the training cycle can also be achieved by better control of the training process. We have shown that multiple features boost the training pace, and that going to bigger networks clearly improves performance. For instance, we are using a bidirectional 4-layer RNN in addition to our regular RNN, while in Wu et al. (2016) the authors mention using a bidirectional RNN only for the first layer. We need to understand these phenomena better and restrict the networks to the minimum in order to reduce the model size.

Finally, the work on specialization described in sections 4.6.1 and 5.2 is promising for long-term maintenance: we could reach a point where we do not need to retrain from scratch, but continuously improve existing models and use teacher models to boost initial trainings.

Model                      BLEU
baseline                   49.24
40% pruned                 48.56
50% pruned                 47.96
60% pruned                 45.90
70% pruned                 38.38
60% pruned and retrained   49.26

Table 8: BLEU scores of pruned models on an internal test set.

Regarding runtime performance, we have been exploring the following areas and are reaching today throughputs compatible with production requirements, not only using GPUs but also using CPUs. We report our different strategies in the following sub-sections.

    5.1 Pruning

Pruning the parameters of a neural network is a common technique to reduce its memory footprint. This approach has been proven efficient for NMT tasks in See et al. (2016). Inspired by this work, we introduced similar pruning techniques in seq2seq-attn. We reproduced the finding that model parameters can be pruned by up to 60% without any performance loss after retraining, as shown in Table 8.
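Magnitude pruning itself is straightforward; the NumPy sketch below zeroes out the smallest-magnitude weights of a matrix, in the spirit of See et al. (2016). It is illustrative only, not the seq2seq-attn implementation, and the retraining step is only indicated by a comment.

```python
import numpy as np

def prune(weights, ratio):
    """Zero out the `ratio` fraction of weights with the smallest absolute value."""
    flat = np.abs(weights).ravel()
    k = int(ratio * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

W = np.random.randn(1000, 1000)
W60 = prune(W, 0.6)           # roughly 60% of the parameters set to zero
print((W60 == 0).mean())      # ~0.6; the model would then be retrained with the mask fixed
```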

With a large pruning factor, the neural network's weights can also be represented with sparse matrices. This implementation can lead to lower computation time, but more importantly to a smaller memory footprint that allows us to target more environments. Figures 5 and 6 show experiments involving sparse matrices using Eigen9. For example, when using float precision, a multiplication with a sparse matrix already begins to take less memory when 35% of its parameters are pruned.

Related to this work, we present in Section 5.4 our alternative Eigen-based decoder that allows us to support sparse matrices.

    5.2 Distillation

Despite being surprisingly accurate, NMT systems need deep networks in order to perform well. Typically, a 4-layer LSTM with 1000 hidden units per layer (4 x 1000) is used to obtain state-of-the-art results. Such models require cutting-edge hardware for training in reasonable time, while inference also becomes challenging on standard setups or on small devices such as mobile phones.

9 http://eigen.tuxfamily.org

Figure 5: Processing time to perform a 1000 × 1000 matrix multiplication on a single thread.

Figure 6: Memory needed to perform a 1000 × 1000 matrix multiplication.


Compressing deep models into smaller networks has thus been an active area of research.

Following the work in (Kim and Rush, 2016), we experimented with sequence-level knowledge distillation in the context of an English-to-French NMT task. Knowledge distillation relies on training a smaller student network to perform better by learning the relevant parts of a larger teacher network, rather than 'wasting' parameters on trying to model the entire space of translations. Sequence-level distillation is the variant where the student model mimics the teacher's actions at the sequence level.

    The experiment is summarized in 3 steps:

• train a teacher model on a source/reference training set,

• use the teacher model to produce translations for the source training set,

• train a student model on the new source/translation training set.

For our initial experiments, we produced the 35-best translations for each sentence of the source training set, and used a normalized n-gram matching score computed at the sentence level to select the closest translation to each reference sentence. The original training source sentences and their translated hypotheses were then used as training data to learn a 2 x 300 LSTM network.
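The selection step can be sketched as follows, using a simple normalized n-gram overlap as a stand-in; our exact scoring function may differ.

```python
import collections

def ngram_counts(tokens, max_n=4):
    counts = collections.Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def match_score(hypothesis, reference, max_n=4):
    """Normalized n-gram overlap between a hypothesis and the reference."""
    hyp, ref = ngram_counts(hypothesis, max_n), ngram_counts(reference, max_n)
    overlap = sum(min(c, ref[g]) for g, c in hyp.items())
    return overlap / max(sum(hyp.values()), 1)

def closest_hypothesis(nbest, reference):
    """Pick, among the teacher's n-best translations, the one closest to the reference."""
    return max(nbest, key=lambda hyp: match_score(hyp, reference))

nbest = [["the", "cat", "sits"], ["a", "cat", "is", "sitting"]]
print(closest_hypothesis(nbest, ["the", "cat", "is", "sitting"]))
```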

Results showed slightly higher accuracy for a 70% reduction of the number of parameters and a 30% increase in decoding speed. In a second experiment, we learned a student model with the same structure as the teacher model. Surprisingly, the student clearly outperformed the teacher model by nearly 1.5 BLEU.

We hypothesize that translating the target side of the training set produces a sort of language normalization of data that is, by construction, very heterogeneous. Such normalization eases the translation task, which can then be learned by not-so-deep networks with similar accuracy levels.

    5.3 Batch Translation

To increase the translation speed for large texts, we support batch translation, which works in addition to the beam search: for a beam of size K and a batch of size B, we forward K × B sequences into the model.

[Figure 7 plot: token throughput (tokens/s) as a function of the batch size.]

Figure 7: Token throughput when decoding from a student model (see section 5.2) with a beam of size 2 and using float precision. The experiment was run using a standard Torch + OpenBLAS install and 4 threads on a desktop Intel i7 CPU.

Then, the decoder output is split across each batch and the beam path for each sentence is updated sequentially.

As sentences within a batch can have large variations in size, extra care is needed to mask accordingly the output of the encoder and the attention softmax over the source sequences.
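The masking can be sketched as follows (NumPy, illustrative only): attention scores at padded source positions are pushed to -inf before the softmax, so that shorter sentences in the batch never attend beyond their own length.

```python
import numpy as np

def masked_attention(scores, source_lengths):
    """scores: (batch, max_src_len) raw attention scores; positions beyond each
    sentence's length are masked out before the softmax."""
    batch, max_len = scores.shape
    mask = np.arange(max_len)[None, :] < np.array(source_lengths)[:, None]
    scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    exp = np.where(mask, exp, 0.0)
    return exp / exp.sum(axis=1, keepdims=True)

scores = np.random.randn(2, 5)
print(masked_attention(scores, source_lengths=[5, 3]).round(2))
```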

Figure 7 shows the speedup obtained using batch decoding in a typical setup.

    5.4 C++ Decoder

While Torch is a powerful and easy-to-use framework, we chose to develop an alternative C++ implementation for decoding on CPU. It increases our control over the decoding process and opens the path to further memory and speed improvements, while making deployment easier.

Our implementation is graph-based and uses Eigen for efficient matrix computation. It can load and infer from Torch models.

For this release, experiments show that the decoding speed is on par with or faster than the Torch-based implementation, especially in a multi-threaded context. Figure 8 shows the better use of parallelization by the Eigen-based implementation.

    6 Evaluation

Evaluation of machine translation has always been a challenge and the subject of many papers and dedicated workshops (Bojar et al., 2016).

[Figure 8 plot: token throughput (tokens/s) as a function of the number of threads, for the Eigen-based and Torch-based implementations.]

Figure 8: Token throughput with a batch of size 1, in the same conditions as Figure 7.

While automatic metrics are now used as standard in the research world and have shown good correlation with human evaluation, ad-hoc human evaluation or productivity-analysis metrics are rather used in the industry (Blain et al., 2011).

As a translation solution company, even if automatic metrics are used throughout the training process (and we give scores in section 6.1), we care about human evaluation of the results. Wu et al. (2016) mention human evaluation but simultaneously cast doubt on the reference humans used to translate or evaluate. In this context, the claim of being "almost indistinguishable from human translation" is at the same time strong but also very vague. On our side, we have observed, during all our experiments and the preparation of specialized models, unprecedented levels of quality, and contexts where we could claim "super-human" translation quality.

However, we need to define very carefully the tasks, the humans being compared to, and the nature of the evaluation. For evaluating technical translation, the nature of the evaluation is somewhat easier and really depends on the user expectation: is the meaning properly conveyed, or is the sentence faster to post-edit than to translate? Also, to avoid doubts about the integrity or competency of the evaluation, we sub-contracted the task to CrossLang, a company specialized in machine translation evaluation. The test protocol was defined collaboratively and, for this first release, we decided to perform a ranking of the different systems; we present in section 6.2 the results obtained on two very different language pairs: English to/from French, and English to Korean.

Finally, in section 6.3, we also present some qualitative evaluation results showing specificities of Neural Machine Translation.

6.1 Automatic Scoring and System Comparison

Figure 9 plots automatic accuracy results (BLEU) and perplexities for all language pairs. The high correlation between perplexity and BLEU scores is remarkable, showing that language pairs with lower perplexity yield higher BLEU scores. Note also that different amounts of training data were used for each system (see Table 2). BLEU scores were calculated over an internal test set.

From the beginning of this report we have used "internal" validation and test sets, which makes it difficult to compare the performance of our systems to other research engines. However, we must keep in mind that our goal is to account for improvements in our production systems. We focus on human evaluations rather than on any automatic evaluation score.

    6.2 Human Ranking Evaluation

To evaluate translation outputs and compare with human translation, we have defined the following protocol.

1. For each language pair, 100 "in-domain" sentences (*) are collected,

2. These sentences are sent to human translation (**), and translated with the candidate model and with available online translation services (***).

3. Without being notified of the mix of human and machine translation, a team of 3 professional translators or linguists fluent in both source and target languages is then asked to rank 3 random outputs for each sentence based on their preference as translations. Preference includes accuracy as a priority, but also fluency of the generated translation. They have the choice of giving the outputs 3 different ranks, or can also decide to give 2 or all 3 of them the same rank if they cannot decide.

(*) For the Generic domain, sentences from recent news articles were selected online; for the Technical (IT) domain, sentences from the translation memory described in section 4.6 were kept apart from the training. (**) For human translation, we used a translation agency (human-pro) and an online collaborative translation platform (human-casual).

[Figure 9 plot: BLEU as a function of perplexity for the language pairs enar, aren, ende, deen, ennl, nlen, enit, iten, enbr, bren, enes, esen, frar, arfr, frit, itfr, frde, defr, frbr, brfr, fres, esfr, frzh, zhen, nlfr, enfr, fren, jako, koja and faen.]

Figure 9: Perplexity and BLEU scores obtained by NMT engines for all language pairs. Perplexity and BLEU scores were calculated on validation and test sets, respectively. Note that the English-to-Korean and Japanese-to-English systems are not shown in this plot, achieving respectively (18.94, 16.50) and (8.82, 19.99) scores for perplexity and BLEU.

(***) We used Naver Translator10 (Naver), Google Translate11 (Google) and Bing Translator12 (Bing).

For this first release, we experimented with the evaluation protocol on 2 extremely different categories of language pairs. On one hand, English↔French, which is probably the most studied language pair for MT and for which resources are very large: (Wu et al., 2016) mention about 36M sentence pairs used in their NMT training, and the equivalent PBMT is complemented by web-scale target-side language models13. Also, as English is a lowly inflected language, the current phrase-based technology with English as the target language is more competitive, due to the relative simplicity of the generation and the weight of gigantic language models.

On the other hand, English↔Korean is one of the toughest language pairs, due to the great distance between the English and Korean languages, the limited availability of training corpora, and the rich agglutinative morphology of Korean. For a real comparison, we ran an evaluation against the Naver translation service from English into Korean, Naver being the main South Korean search engine.

Tables 9 and 10 describe the different evaluations and their results.

    Several interesting outcomes:

• Our vanilla English→French model outperforms existing online engines and our best-of-breed technology.

• For French→English, while the model slightly outperforms (human-casual) and our best-of-breed technology, it stays behind Google Translate and, more significantly, behind Microsoft Translator.

• The generic English→Korean model shows results closer to human translation and clearly outperforms existing online engines.

• The "in-domain" specialized model surprisingly outperforms the reference human translation.

We are aware that far more evaluations are necessary and we will be launching a larger evaluation plan for our next release.

10 http://translate.naver.com
11 http://translate.google.com
12 http://translator.bing
13 In 2007, Google already mentioned using 2 trillion words in their language models for machine translation (Brants et al., 2007).

Informally, we do observe that the biggest performance jumps occur on complicated language pairs, like English-Korean or Japanese-English, showing that NMT engines are better able to handle major structural differences, but also on languages with lower resources, like Farsi-English, demonstrating that NMT is able to learn better with less; we will explore this even more.

Finally, we are also launching, in parallel, a real-life beta-testing program with our customers, so that we can also obtain formal feedback on their use cases and on the related specialized models.

    6.3 Qualitative Evaluation

In Table 11, we report the results of an error analysis of NMT, SMT and RBMT for the English-French language pair. This evaluation confirms the translation ranking performed in the previous section, but also exhibits some interesting facts:

• The most salient errors come from missing words or parts of sentences. It is interesting to see, though, that half of these "omissions" were considered okay by the reviewers and most of the time not counted as errors: it indeed shows the ability of the system not only to translate but to summarize and get to the point, as we would expect from human translation. Of course, we need to fix the cases where the "omissions" are not okay.

• Another finding is that the engine handles quotes badly, and we will make sure to specifically teach that in our next release. Other low-hanging fruit are the case generation, which sometimes seems to get off-track, and the handling of named entities, which we have already introduced in the system but did not connect for this release.

• On the positive side, we observe that NMT drastically improves fluency, slightly reduces meaning-selection errors, and handles morphology better, although it does not yet have any specific access to morphology (like sub-word embeddings). Regarding meaning-selection errors, we will focus on teaching more expressions to the system, which is still a major structural weakness compared to PBMT engines.


7 Practical Issues

Translation results from an NMT system are, at first glance, so incredibly fluent that you are diverted from their downsides. Over the course of the training and during our internal evaluation, we found multiple practical issues worth sharing:

    1. translating very long sentences

2. translating user input such as a short word or the title of a news article

    3. cleaning the corpus

    4. alignment

NMT is greatly impacted by the training data, from which NMT learns how to generate accurate and fluent translations holistically. Because the maximal length of a training instance was limited to a certain length during the training of our models, NMT models are puzzled by sentences that exceed this length, not having encountered such training data. Hard splitting of longer sentences has some side effects, since the model considers both parts as full sentences. As a consequence, whatever limit we set on sentence length, we also need to teach the neural network how to handle longer sentences. For that, we are exploring several options, including using a separate model based on source/target alignments to find the optimal breaking point, and introducing dedicated special tokens. Likewise, very short phrases and incomplete or fragmentary sentences were not included in our training data, and consequently NMT systems fail to correctly translate such input texts (e.g. Figure 10). Here also, to enable this, we simply need to teach the model to handle such input by injecting additional synthetic data.

Also, our training corpus includes a number of resources that are known to contain noisy data. While NMT seems more robust than other technologies in handling noise, we can still perceive its effect in translations, in particular for recurring noise. An example is English-to-Korean, where we see the model trying to systematically convert currency amounts in addition to translating. As demonstrated in Section 5.2, preparing the right kind of input for NMT seems to result in more efficient and accurate systems, and such a procedure should also be applied more aggressively directly to the training data.

Finally, let us note that source/target alignment is a must for our users, but this information is missing from NMT output due to soft alignment. To hide this issue from end-users, multiple alignment heuristics are used to reconstruct a traditional target-source alignment.
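As an illustration only, one simple heuristic of this kind can be sketched as follows, assuming the decoder exposes its soft attention matrix; the actual heuristics used in production may differ.

```python
import numpy as np

# Sketch: derive hard target-to-source alignment links from soft attention
# weights. `attention` is assumed to have shape (target_len, source_len),
# with each row summing to 1.

def hard_alignment(attention, threshold=0.0):
    attention = np.asarray(attention)
    links = []
    for t, row in enumerate(attention):
        s = int(row.argmax())
        if row[s] >= threshold:      # optionally drop very diffuse rows
            links.append((t, s))     # target position t aligned to source position s
    return links

# Example: 3 target tokens attending over 4 source tokens.
att = [[0.70, 0.10, 0.10, 0.10],
       [0.05, 0.80, 0.10, 0.05],
       [0.10, 0.10, 0.20, 0.60]]
print(hard_alignment(att))  # [(0, 0), (1, 1), (2, 3)]
```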

    8 Further Work

In this section we outline further experiments currently being conducted. First, we extend NMT decoding with the ability to make use of multiple models: both external models, in particular an n-gram language model, and decoding with multiple networks (ensembles). We also work on using external word embeddings and on modelling unknown words within the network.

    8.1 Extending Decoding

    8.1.1 Additional LM

As proposed by (Gülçehre et al., 2015), we conduct experiments to integrate an n-gram language model, estimated over a large dataset, into our Neural MT system. We follow a shallow fusion integration, similar to how language models are used in a standard phrase-based MT decoder.

In the context of beam search decoding in NMT, at each time step $t$, candidate words $x$ are hypothesized and assigned a score according to the neural network, $p_{NMT}(x)$. Sorted according to their respective scores, the $K$-best candidates are reranked using the score assigned by the language model, $p_{LM}(x)$. The resulting probability of each candidate is obtained as the weighted sum of the log-probabilities, $\log p(x) = \log p_{LM}(x) + \beta \log p_{NMT}(x)$, where $\beta$ is a hyper-parameter that needs to be tuned.
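A minimal sketch of this rescoring step, following the formula above and assuming hypothetical `p_nmt` and `p_lm` scoring functions, could look like this:

```python
import math

# Sketch of shallow-fusion rescoring: the K-best candidate words proposed by
# the NMT model at one decoding step are re-scored with an external n-gram LM.
# `p_nmt` and `p_lm` map a candidate word to its probability under each model
# (hypothetical interfaces); `beta` is the hyper-parameter mentioned above.

def rescore_candidates(candidates, p_nmt, p_lm, beta):
    scored = []
    for x in candidates:
        score = math.log(p_lm(x)) + beta * math.log(p_nmt(x))
        scored.append((score, x))
    # Return the candidates sorted by the fused score, best first.
    return [x for _, x in sorted(scored, reverse=True)]
```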

This technique is especially useful for handling out-of-vocabulary (OOV) words. Deep networks are technically constrained to work with limited vocabulary sets (in our experiments we use target vocabularies of 60k words), hence suffering from important OOV problems. In contrast, n-gram language models can be learned over very large vocabularies.

Initial results show the suitability of the shallow integration technique for selecting the appropriate OOV candidate out of a dictionary list (an external resource). The probability obtained from the language model is the only modeling alternative for those word candidates for which the neural network produces no score.

8.1.2 Ensemble Decoding

Ensemble decoding has been verified as a practical technique to further improve performance compared to a single encoder-decoder model (Sennrich et al., 2016b; Wu et al., 2016; Zhou et al., 2016). The improvement comes from the diversity of predictions made by different neural network models, which are obtained through different random initialization seeds and different shuffling of examples during training, or through different optimization methods towards the development set (Cho et al., 2015). In practice, considering the cost in memory and training speed, 3-8 separate models are trained and ensembled together. Also, (Junczys-Dowmunt et al., 2016) provide methods to accelerate training by choosing different checkpoints as the final models.

We implement ensemble decoding by averaging the output probabilities for each estimation of the target word $x$ with the formula:

$$p^{ens}_x = \frac{1}{M} \sum_{m=1}^{M} p^m_x$$

where $p^m_x$ represents the probability of $x$ under the $m$-th model and $M$ is the number of neural models.
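A minimal sketch of this averaging step, assuming each model exposes its per-step output distribution as a probability vector over the target vocabulary, is given below:

```python
import numpy as np

# Sketch of ensemble decoding as defined above: at each time step the output
# distributions of the M models are averaged before picking (or beam-searching
# over) the next target word. `model_probs` is a list of M probability vectors.

def ensemble_probs(model_probs):
    probs = np.stack(model_probs)   # shape (M, vocab_size)
    return probs.mean(axis=0)       # p^ens_x = (1/M) * sum_m p^m_x

# Example with M = 2 toy models over a 3-word vocabulary.
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.2, 0.5, 0.3])
print(ensemble_probs([p1, p2]))     # [0.4 0.4 0.2]
```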

    8.2 Extending word embeddings

Although NMT technology has recently accomplished a major breakthrough in the Machine Translation field, it remains constrained by its limited vocabulary size and by its reliance on bilingual training data. In order to reduce the negative impact of both limitations, we are currently experimenting with external word embedding weights.

Those external word embeddings are not learned by the NMT network from bilingual data only, but by an external model (e.g. word2vec (Mikolov et al., 2013)). They can therefore be estimated from larger monolingual corpora, incorporating data from different domains.

Another advantage lies in the fact that, since external word embedding weights are not modified during NMT training, it is possible to use a different vocabulary for this fixed part of the input during the application or re-training of the model (provided that the weights for the words of the new vocabulary come from the same embedding space as the original ones). This may allow a more efficient adaptation to data coming from a different domain with a different vocabulary.
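The following minimal PyTorch sketch illustrates the idea of loading externally estimated embeddings and keeping them frozen during training. It is an illustration only: our systems are built on a different framework, and the sizes and random matrix stand in for real word2vec vectors.

```python
import torch
import torch.nn as nn

# Sketch: embedding weights estimated externally (e.g. with word2vec) are
# loaded into the encoder input layer and kept frozen during NMT training,
# so this fixed vocabulary can later be swapped for another one living in
# the same embedding space.

vocab_size, emb_dim = 50000, 500               # illustrative sizes
pretrained = torch.randn(vocab_size, emb_dim)  # stand-in for word2vec vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

ids = torch.tensor([[3, 17, 42]])              # a toy batch of word indices
vectors = embedding(ids)                       # shape (1, 3, 500), no gradient updates
```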

    8.3 Unknown word handling

When an unknown word is generated in the target output sentence, a general encoder-decoder with an attentional mechanism uses heuristics based on the attention weights: the source word receiving the most attention is either directly copied as-is or looked up in a dictionary.
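A minimal sketch of this classic replacement heuristic is shown below; the attention matrix and dictionary interfaces are assumptions made for illustration, and this is distinct from the in-network modelling discussed next.

```python
# Sketch: every <unk> produced by the decoder is replaced by the source word
# receiving the most attention at that step, either copied as-is or looked up
# in a bilingual dictionary.

def replace_unknowns(target_tokens, source_tokens, attention, dictionary=None):
    """attention[t][s]: attention weight of target position t on source position s."""
    output = []
    for t, token in enumerate(target_tokens):
        if token == "<unk>":
            s = max(range(len(source_tokens)), key=lambda j: attention[t][j])
            src = source_tokens[s]
            token = dictionary.get(src, src) if dictionary else src
        output.append(token)
    return output
```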

In the recent literature (Gu et al., 2016; Gulcehre et al., 2016), researchers have attempted to directly model unknown-word handling within the attention and decoder networks. Having the model learn to take control of both decoding and unknown-word handling should be the most effective way to address the single unknown-word replacement problem, and we are implementing and evaluating these approaches within our framework.

    9 Conclusion

Neural MT has progressed at a very impressive rate, and it has proven itself competitive against online systems trained on training data several orders of magnitude larger. There is no doubt that Neural MT is a technology that will continue to have a great impact on academia and industry. However, in its current state it is not without limitations: on language pairs with abundant monolingual and bilingual training data, phrase-based MT still performs better than Neural MT, because Neural MT is still limited by its vocabulary size and its deficient utilization of monolingual data.

Neural MT is not a one-size-fits-all technology in which one general configuration of the model works universally for any language pair. For example, subword tokenization such as BPE provides an easy way out of the limited-vocabulary problem, but we have discovered that it is not always the best choice for all language pairs. The attention mechanism is also not yet satisfactory: it needs to be more accurate to allow better control of the translation output and better user interactions.

For upcoming releases, we have begun further experiments with the injection of various types of linguistic knowledge, in which SYSTRAN possesses foremost expertise. We will also apply our engineering know-how to conquer the practical issues of NMT one by one.

Acknowledgments

We would like to thank Yoon Kim, Prof. Alexander Rush and the rest of the members of the Harvard NLP group for their support with the open-source code, their proactive advice and their valuable insights on the extensions.

We are also thankful to CrossLang and Homin Kwon for their thorough and meaningful definition of the evaluation protocol, and to their evaluation team as well as Inah Hong, Weimin Jiang and SunHyung Lee for their work.

References

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473. Demoed at NIPS 2014: http://lisa.iro.umontreal.ca/mt-demo/.

[Béchara et al.2011] Hanna Béchara, Yanjun Ma, and Josef van Genabith. 2011. Statistical post-editing for a statistical MT system. In MT Summit, volume 13, pages 308–315.

[Blain et al.2011] Frédéric Blain, Jean Senellart, Holger Schwenk, Mirko Plitt, and Johann Roturier. 2011. Qualitative analysis of post-editing for high quality machine translation. MT Summit XIII: the Thirteenth Machine Translation Summit [organized by the] Asia-Pacific Association for Machine Translation (AAMT), pages 164–171.

[Bojar et al.2015] Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal, September.

[Bojar et al.2016] Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 metrics shared task. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, August.

[Brants et al.2007] Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Citeseer.

[Cettolo et al.2014] Mauro Cettolo, Nicola Bertoldi, Marcello Federico, Holger Schwenk, Loic Barrault, and Christophe Servan. 2014. Translation project adaptation for MT-enhanced computer assisted translation. Machine Translation, 28(2):127–150, October.

[Chen and Eisele2012] Yu Chen and Andreas Eisele. 2012. MultiUN v2: UN documents with multilingual alignments. In LREC, pages 2500–2504.

[Chen et al.2016] Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. CoRR, abs/1607.01628v1.

[Cho et al.2015] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007.

[Chung et al.2016] Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. CoRR, abs/1603.06147.

[Dugast et al.2007] Loïc Dugast, Jean Senellart, and Philipp Koehn. 2007. Statistical Post-Editing on SYSTRAN's Rule-Based Translation System. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 220–223, Prague, Czech Republic, June. Association for Computational Linguistics.

[Dyer et al.2013] Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia, June.

[Gu et al.2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany, August. Association for Computational Linguistics.

[Gülçehre et al.2015] Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.

[Gulcehre et al.2016] Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany, August. Association for Computational Linguistics.

[Junczys-Dowmunt and Grundkiewicz2016] Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. CoRR, abs/1605.04800.

[Junczys-Dowmunt et al.2016] M. Junczys-Dowmunt, T. Dwojak, and H. Hoang. 2016. Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions. CoRR, abs/1610.01108, October.

[Kim and Rush2016] Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. CoRR, abs/1606.07947.

[Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions.

[Koehn2005] Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79–86.

[Luong and Manning2016] Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. CoRR, abs/1604.00788.

[Luong et al.2015] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal, September. Association for Computational Linguistics.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Ortiz-Martínez et al.2010] Daniel Ortiz-Martínez, Ismael García-Varea, and Francisco Casacuberta. 2010. Online learning for interactive statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 546–554. Association for Computational Linguistics.

[See et al.2016] Abigail See, Minh-Thang Luong, and Christopher D. Manning. 2016. Compression of neural machine translation models via pruning. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), Berlin, Germany, August.

[Sennrich and Haddow2016] Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. CoRR, abs/1606.02892.

[Sennrich et al.2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California, USA, June. Association for Computational Linguistics.

[Sennrich et al.2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August. Association for Computational Linguistics.

[Simard et al.2007] Michel Simard, Cyril Goutte, and Pierre Isabelle. 2007. Statistical phrase-based post-editing. In Proceedings of NAACL.

[Steinberger et al.2006] Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.

[Tiedemann2012] Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC, pages 2214–2218.

[Wu et al.2016] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. Technical report, Google.

[Zhou et al.2016] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371–383.

Language Pair | Train #Sents | Train #Tokens (src/tgt) | Train Vocab (src/tgt) | Test #Sents | Test #Tokens (src/tgt) | Test Vocab (src/tgt) | Test OOV (src/tgt)
en↔br | 2.7M | 74.0M / 76.6M | 150k / 213k | 2k | 51k / 53k | 6.7k / 8.1k | 47 / 64
en↔it | 3.6M | 98.3M / 100M | 225k / 312k | 2k | 52k / 53k | 7.3k / 8.8k | 66 / 85
en↔ar | 5.0M | 126M / 155M | 295k / 357k | 2k | 51k / 62k | 7.5k / 8.7k | 43 / 47
en↔es | 3.5M | 98.8M / 105M | 375k / 487k | 2k | 53k / 56k | 8.5k / 9.8k | 110 / 119
en↔de | 2.6M | 72.0M / 69.1M | 150k / 279k | 2k | 53k / 51k | 7.0k / 9.6k | 30 / 77
en↔nl | 2.1M | 57.3M / 57.4M | 145k / 325k | 2k | 52k / 53k | 6.7k / 7.9k | 50 / 141
en→ko | 3.5M | 57.5M / 46.4M | 98.9k / 58.4k | 2k | 30k / 26k | 7.1k / 11k | 0 / -
en↔fr | 9.3M | 220M / 250M | 558k / 633k | 2k | 48k / 55k | 8.2k / 8.6k | 77 / 63
fr↔br | 1.6M | 53.1M / 47.9M | 112k / 135k | 2k | 62k / 56k | 7.4k / 8.1k | 55 / 59
fr↔it | 3.1M | 108M / 96.5M | 202k / 249k | 2k | 69k / 61k | 8.2k / 8.8k | 47 / 57
fr↔ar | 5.0M | 152M / 152M | 290k / 320k | 2k | 60k / 60k | 8.5k / 8.6k | 42 / 61
fr↔es | 2.8M | 99.0M / 91.7M | 170k / 212k | 2k | 69k / 64k | 8.0k / 8.6k | 37 / 55
fr↔de | 2.4M | 73.4M / 62.3M | 172k / 253k | 2k | 57k / 48k | 7.5k / 9.0k | 59 / 104
fr→zh | 3.0M | 98.5M / 76.3M | 199k / 168k | 2k | 67k / 51k | 8.0k / 5.9k | 51 / -
ja↔ko | 1.4M | 14.0M / 13.9M | 61.9k / 55.6k | 2k | 19k / 19k | 9.3k / 8.5k | 0 / 0
nl→fr | 3.0M | 74.8M / 84.7M | 446k / 260k | 2k | 49k / 55k | 7.9k / 7.5k | 150 / -
fa→en | 795k | 21.7M / 20.2M | 166k / 147k | 2k | 54k / 51k | 7.7k / 8.7k | 197 / -
ja→en | 1.3M | 28.0M / 22.0M | 24k / 87k | 2k | 41k / 32k | 6.2k / 7.3k | 3 / -
zh→en | 5.8M | 145M / 154M | 246k / 225k | 2k | 48k / 51k | 5.5k / 6.9k | 34 / -

Table 2: Corpora statistics for each language pair (ISO 639-1 two-letter codes, except for Brazilian Portuguese, noted as "br"). All language pairs are bidirectional except nl→fr, fr→zh, ja→en, fa→en, en→ko and zh→en. The training columns indicate the number of sentences, running words and vocabulary sizes of the training datasets; the testing columns give the same figures for the test datasets; the last column gives the OOV counts of the source and target test sets (M stands for millions, k for thousands). Since ja↔ko and en→ko are trained using BPE tokenization (see Section 4.1), there is no OOV.

En: A senior U.S. treasury official is urging china to move faster on making its currency more flexible.

    Ko with Formal mode:미국재무부고위관계자는중국이위안화의융통성을더유연하게만들기위해더빨리움직일것을촉구했습니다.

    Ko with Informal mode:미국재무부고위관계자는중국이위안화를좀더유연하게만들기위해더빨리움직일것을촉구하고있어요.

Table 5: A translation example from an En-Ko system where the choice of different politeness modes affects the output.

Language Pair | Domain (*) | Human Translation (**) | Online Translation (***)
English → French | Generic | human-pro, human-casual | Bing, Google
French → English | Generic | human-pro, human-casual | Bing, Google
English → Korean | News | human-pro | Naver, Google, Bing
English → Korean | Technical (IT) | human-pro | N/A

    Table 9: Performed evaluations

Language Pair | Human-pro | Human-casual | Bing | Google | Naver | Systran V8
English → French | -64.2 | -18.5 | +48.7 | +10.4 | - | +17.3
French → English | -56.8 | +5.5 | -23.1 | -8.4 | - | +5
English → Korean | -15.4 | - | +35.5 | +33.7 | +31.4 | +13.2
English → Korean (IT) | +30.3 | - | - | - | - | -

Table 10: This table shows the relative preference for SYSTRAN NMT compared to the other outputs, calculated as follows: for each triplet in which outputs A and B were compared, we note pref(A>B) the number of times A was strictly preferred to B, and E_A the total number of triplets including output A. For each output E, the number in the table is compar(SNMT, E) = (pref(SNMT>E) - pref(E>SNMT)) / E_SNMT. compar(SNMT, E) is a percent value in the range [-1, 1].

Figure 10: Effect of translating a single word with a model that was not trained for such input.

Category | Subcategory | NMT | RB | SMT | Example
Entity | Major | 7 | 5 | 0 | Galaxy Note 7 → note 7 de Galaxy vs. Galaxy Note 7
Entity | Format | 3 | 1 | 1 | (number localization): $2.66 → 2.66$ vs. 2,66$
Morphology | Minor - Local | 3 | 2 | 3 | (tense choice): accused → accusait vs. a accusé
Morphology | Minor - Sentence Level | 3 | 3 | 5 | the president [...], she emphasized → la président [...], elle a souligné vs. la [...], elle
Morphology | Major | 3 | 4 | 6 | he scanned → il scanné vs. il scannait
Meaning Selection | Minor | 9 | 17 | 7 | game → jeu vs. match
Meaning Selection | Major - Prep Choice | 4 | 9 | 10 | [... facing war crimes charges] over [its bombardment of ...] → contre vs. pour
Meaning Selection | Major - Expression | 3 | 7 | 1 | [two] counts of murder → chefs de meutre vs. chefs d’accusation de meutre
Meaning Selection | Major - Not Translated | 5 | 1 | 4 | he scanned → il scanned vs. il a scanné
Meaning Selection | Major - Contextual Meaning | 14 | 39 | 14 | 33 senior Republicans → 33 républicains supérieurs vs. 33 ténors républicains
Word Ordering and Fluency | Minor | 2 | 28 | 15 | (determiner): [without] a [specific destination in mind] → sans une destination [...] vs. sans destination [...]
Word Ordering and Fluency | Major | 3 | 16 | 15 | (word ordering): in the Sept. 26 deaths → dans les morts septembre de 26 vs. dans les morts du 26 septembre
Missing or Duplicated | Missing Minor | 7 | 3 | 1 | a week after the hurricane struck → une semaine après l’ouragan vs. une semaine après que l’ouragan ait frappé
Missing or Duplicated | Missing Major | 6 | 1 | 3 | As a working oncologist, Giordano knew [...] → Giordano savait vs. En tant qu’oncologue en fonction, Giordano savait
Missing or Duplicated | Duplicated Major | 2 | 2 | 1 | for the Republican presidential nominee → au candidat républicain républicain vs. au candidat républicain
Misc. (Minor) | Quotes, Punctuations | 2 | 0 | 0 | (misplaced quotes)
Misc. (Minor) | Case | 6 | 0 | 2 | [...] will be affected by Brexit → [...] Sera touchée Par le brexit vs. [...] sera touchée par le Brexit
Total | Major | 47 | 84 | 54 |
Total | Minor | 36 | 55 | 35 |
Total | Minor & Major | 83 | 139 | 89 |

Table 11: Human error analysis performed on 50 sentences of the corpus defined in Section 6.2 for English-French on NMT, SMT (Google) and RBMT outputs. Error categories are: issues with entity handling (Entity); issues with Morphology, either local or reflecting missing sentence-level agreement; issues with Meaning Selection, split into different sub-categories; issues with Word Ordering or Fluency (wrong or missing determiner, for instance); and missing or duplicated words. An error is Minor when the reader can still understand the sentence without access to the source; otherwise it is considered Major. Erroneous words are counted in only one category even if several problems add up, for instance ordering and meaning selection.

A Remarkable Results

In this section, we highlight a series of “remarkable” translations (mostly positively remarkable, but also a few negatively outstanding) that we found during evaluation, for a variety of languages.

Language Pair | Source | NMT | Google Translate (2016/10/15)

en→fr The report also said that although three-quarters of Chinese say their country is playing a larger role in the world than it did a decade ago, most want their government to focus on domestic issues rather than helping other nations.

Le rapport dit aussi que, bien que les trois quarts des Chinois disent que leur pays joue un rôle plus important dans le monde qu’il ne l’a fait il y a dix ans, la plupart veulent que leur gouvernement se concentre sur les questions nationales plutôt que d’aider d’autres nations.

Le rapport indique également que, bien que les trois quarts des Chinois disent que leur pays joue un rôle plus important dans le monde qu’il y a dix ans, la plupart veulent que leur gouvernement de se concentrer sur les questions intérieures plutôt que d’aider les autres nations.

fr→en Fenêtre à rue se fermait mal, très très bruyant et mal dormi. Petit déjeuner banal.

The room was very clean and the room was very clean.

street window will not close, very very noisy and slept badly. mundane breakfast.

en→ko Forty Republican senators and congressmen have revoked their support for the Republican presidential nominee — with nearly 30 of them calling on him to quit the race altogether in recent days.

40명의 공화당 상원의원들과 하원의원들이 공화당 대통령 후보자에 대한 그들의 지지를 철회했고, 거의 30명의 사람들이 최근 며칠 동안 그 경주를 그만두도록 그에게 요청했습니다.

(From Naver NMT): 40여 명의 공화당 의원들과 하원의원들은 거의 30명의 공화당 대통령 후보들을 철회했고, 그들 중 거의 30명이 최근에 그를 은퇴 직전에 그만두게 할 것을 촉구했다.

en→it all this leads to feelings of insecurity and conflict thus reinforcing negative spirals.

tutto ciò porta a sentimenti di insicurezza e di conflitto, rafforzando così la spirale negativa.

tutto ciò genera elementi di insicurezza e di conflitto rinforzando quindi spirali negative.

it→fr Mi sembra che ciò non abbia nulla a che vedere con una penuria alimentare: si tratta di una vera carestia, di una vera e propria catastrofe che non è naturale ma strutturale, provocata da un regime folle, criminale, psicopatico, peggiore dei peggiori regimi esistiti nell’Unione Sovietica e nella Romania di Ceaucescu.

Il me semble que cela n’a rien à voir avec une pénurie alimentaire: il s’agit d’une véritable famine, d’une véritable catastrophe qui n’est pas naturelle mais structurelle, provoquée par un régime insensé, criminel, psychopathe, pire que les pires régimes de l’union soviétique et de la Roumanie de Ceaucescu.

Il me semble que cela n’a rien à voir avec une pénurie alimentaire: il est une véritable famine, une catastrophe qui est pas naturel, mais une structure, provoquée par un régime criminel fou, psychopathe, le pire des pires régimes existait l’Union soviétique et en Roumanie de Ceaucescu.

it→fr Ebbene io sono in totale disaccordo: lo Stato non deve passare in secondo piano; deve invece intervenire di più e garantire una maggiore vigilanza in ambito economico.

je suis tout à fait d’accord: l’état ne doit pas passer sous silence; il doit plutôt agir et assurer une plus grande vigilance dans le domaine économique.

Eh bien, je suis en désaccord total: l’Etat ne doit pas prendre un siège arrière; Il doit plutôt agir plus et assurer une plus grande supervision dans l’économie.

es→en ¿Es posible un Bangladesh más amable con los niños?

Is Bangladesh more friendly with children?

Can a kinder Bangladesh with children?

en→de On the other hand, businesses are constantly complaining about the shortage of specialist staff, using this supposed shortage as an excuse to use cheap labour.

Andererseits beschweren sich Unternehmen ständig über den Mangel an Fachpersonal, indem sie diese angebliche Knappheit als Vorwand benutzen, billige Arbeitskräfte einzusetzen.

Auf der anderen Seite beschweren sich Unternehmen ständig über den Mangel an Fachkräften, wobei diese angebliche Mangelhaftigkeit als Ausrede für billige Arbeitskräfte verwendet wird.

zh→en 朝闻天下,现在又到了我们的互动时间啦.

Morning News, now it’s our interactive time.

(From Google GNMT): North Korea heard the world, and now to our interaction time friends.


br→en Face às decisões de Nice, tínhamos de ter em conta esta situação e de adotar uma metodologia que contemplasse as transformações necessárias.

Faced with the Nice decisions, we had to take this situation into account and adopt a methodology that would address the necessary changes.

Nice view of the decisions we had to take account of this situation and adopt a methodology that took into consideration the necessary changes.

fr→br Une information présentée au moment opportun signifie la transparence, laquelle crée la confiance et évite à l’entreprise de subir des pertes.

A informação apresentada no momento oportuno significa transparência, que cria confiança e evita que a empresa sofra perdas.

Informação apresentada em uma transparência meio oportuna, que cria confiança e evita a empresa a sofrer perdas.

    B Online System Parameters

All systems were trained with 4 LSTM layers; the size of the word embedding vectors was 500, dropout was set to 0.3, and we used a bidirectional RNN (BRNN). The Guided Alignment column indicates whether the network was trained with guided alignments and at which epoch the feature was stopped.

Language Pair | Tokenization | RNN size | Optimal Epoch | Guided Alignment | NER aware | Special
zh→en | word boundary-generic | 800 | 12 | epoch 4 | yes |
en→it | generic | 800 | 16 | epoch 4 | yes |
it→en | generic | 800 | 16 | epoch 4 | yes |
en→ar | generic-crf | 800 | 15 | epoch 4 | no |
ar→en | crf-generic | 800 | 15 | epoch 4 | no |
en→es | generic | 800 | 18 | epoch 4 | yes |
es→en | generic | 800 | 18 | epoch 4 | yes |
en→de | generic-compound splitting | 800 | 17 | epoch 4 | yes |
de→en | compound splitting-generic | 800 | 18 | epoch 4 | yes |
en→nl | generic | 800 | 17 | epoch 4 | no |
nl→en | generic | 800 | 14 | epoch 4 | no |
en→fr | generic | 800 | 18 | epoch 4 | yes | double corpora (2 x 5M)
fr→en | generic | 800 | 17 | epoch 4 | yes |
ja→en | bpe | 800 | 11 | no | no |
fr→br | generic | 800 | 18 | epoch 4 | yes |
br→fr | generic | 800 | 18 | epoch 4 | yes |
en→pt | generic | 800 | 18 | epoch 4 | yes |
br→en | generic | 800 | 18 | epoch 4 | yes |
fr→it | generic | 800 | 18 | epoch 4 | yes |
it→fr | generic | 800 | 18 | epoch 4 | yes |
fr→ar | generic-crf | 800 | 10 | epoch 4 | no |
ar→fr | crf-generic | 800 | 15 | epoch 4 | no |
fr→es | generic | 800 | 18 | epoch 4 | yes |
es→fr | generic | 800 | 18 | epoch 4 | yes |
fr→de | generic-compound splitting | 800 | 17 | epoch 4 | no |
de→fr | compound splitting-generic | 800 | 16 | epoch 4 | no |
nl→fr | generic | 800 | 16 | epoch 4 | no |
fr→zh | generic-word segmentation | 800 | 18 | epoch 4 | no |
ja→ko | bpe | 1000 | 18 | no | no |
ko→ja | bpe | 1000 | 18 | no | no |
en→ko | bpe | 1000 | 17 | no | no | politeness
fa→en | basic | 800 | 18 | yes | no |

    Table 12: Parameters for online systems.
