

Bilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences

Xiangyu Duan1, Baijun Ji1, Hao Jia1, Min Tan1, Min Zhang1∗, Boxing Chen2, Weihua Luo2, Yue Zhang3

1 Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University

2 Alibaba DAMO Academy
3 School of Engineering, Westlake University

{xiangyuduan,minzhang}@suda.edu.cn; {bjji,hjia,mtan2017}@stu.suda.edu.cn; {boxing.cbx,weihua.luowh}@alibaba-inc.com; [email protected]

Abstract

In this paper, we propose a new task of machine translation (MT) that uses no parallel sentences but can refer to a ground-truth bilingual dictionary. Motivated by the ability of a monolingual speaker to learn to translate by looking up a bilingual dictionary, we propose the task to see how much potential an MT system can attain using the bilingual dictionary and large-scale monolingual corpora, while being independent of parallel sentences. We propose anchored training (AT) to tackle the task. AT uses the bilingual dictionary to establish anchoring points for closing the gap between the source language and the target language. Experiments on various language pairs show that our approaches are significantly better than various baselines, including dictionary-based word-by-word translation, dictionary-supervised cross-lingual word embedding transformation, and unsupervised MT. On distant language pairs on which unsupervised MT struggles to perform well, AT performs remarkably better, achieving performances comparable to supervised SMT trained on more than 4M parallel sentences.1

1 Introduction

Motivated by a monolingual speaker acquiring translation ability by referring to a bilingual dictionary, we propose a novel MT task in which no parallel sentences are available, while a ground-truth bilingual dictionary and large-scale monolingual corpora can be utilized. This task departs from the unsupervised MT task, in which no parallel resources, including the ground-truth bilingual dictionary, may be utilized (Artetxe et al., 2018c; Lample et al., 2018b).

∗ Corresponding Author.
1 Code is available at https://github.com/mttravel/Dictionary-based-MT

This task is also distinct from supervised/semi-supervised MT tasks, which mainly depend on parallel sentences (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017; Chen et al., 2018; Sennrich et al., 2016a).

The bilingual dictionary is often utilized as a seed in bilingual lexicon induction (BLI), which aims to induce more word pairs within the language pair (Mikolov et al., 2013). Another utilization of the bilingual dictionary is for translating low-frequency words in supervised NMT (Arthur et al., 2016; Zhang and Zong, 2016). We are the first to utilize the bilingual dictionary and large-scale monolingual corpora to see how much potential an MT system can achieve without using parallel sentences. Unlike approaches that use artificial bilingual dictionaries generated by unsupervised BLI to initialize an unsupervised MT system (Artetxe et al., 2018c,b; Lample et al., 2018a), we use the ground-truth bilingual dictionary and apply it throughout the training process.

We propose Anchored Training (AT) to tackle this task. Since word representations are learned over monolingual corpora without any parallel sentence supervision, the representation distances between the source language and the target language are often quite large, leading to significant translation difficulty. As one solution, AT selects words covered by the bilingual dictionary as anchoring points to draw the source language space and the target language space closer, so that translation between the two languages becomes easier. Furthermore, we propose Bi-view AT, which places anchors based on either the source language view or the target language view, and combines both views to enhance translation quality.

Experiments on various language pairs show that AT performs significantly better than various baselines, including word-by-word translation through looking up the dictionary, unsupervised MT, and dictionary-supervised cross-lingual word embedding transformation that makes the distances between both languages closer. Bi-view AT further improves over AT due to the mutual strengthening of both views of the monolingual data. When combined with cross-lingual pretraining (Lample and Conneau, 2019), Bi-view AT achieves performances comparable to traditional SMT systems trained on more than 4M parallel sentences. The main contributions of this paper are as follows:

• A novel MT task is proposed that uses only the ground-truth bilingual dictionary and monolingual corpora, while being independent of parallel sentences.

• AT is proposed as a solution to the task. AT uses the bilingual dictionary to place anchors that encourage the monolingual spaces of both languages to become closer, so that translation becomes easier.

• Detailed evaluation on various language pairs shows that AT, especially Bi-view AT, performs significantly better than various methods, including word-by-word translation, unsupervised MT, and cross-lingual embedding transformation. On distant language pairs where unsupervised MT struggles to be effective, AT and Bi-view AT perform remarkably better.

2 Related Work

The bilingual dictionaries used in previous works are mainly for bilingual lexicon induction (BLI), which independently learns the embeddings in each language using monolingual corpora, and then learns a transformation from one embedding space to another by minimizing the squared Euclidean distances between all word pairs in the dictionary (Mikolov et al., 2013; Artetxe et al., 2016). Later efforts for BLI include optimizing the transformation further through new training objectives, constraints, or normalizations (Xing et al., 2015; Lazaridou et al., 2015; Zhang et al., 2016; Artetxe et al., 2016; Smith et al., 2017; Faruqui and Dyer, 2014; Lu et al., 2015). Besides, the bilingual dictionary is also used for supervised NMT, which requires large-scale parallel sentences (Arthur et al., 2016; Zhang and Zong, 2016). To our knowledge, we are the first to use the bilingual dictionary for MT without using any parallel sentences.
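To make the mapping objective concrete, the following is a minimal numpy sketch of the classic dictionary-supervised linear mapping (Mikolov et al., 2013); the toy random embeddings, shapes, and names are illustrative assumptions, not taken from any released code.

```python
# Minimal sketch (not the paper's code): learn a linear map W that sends source
# embeddings onto target embeddings by minimizing the squared Euclidean distance
# over the dictionary pairs, i.e. min_W ||XW - Y||^2 (Mikolov et al., 2013).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy embedding dimension
X = rng.normal(size=(100, d))          # source embeddings of dictionary words
Y = rng.normal(size=(100, d))          # target embeddings of their translations

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # closed-form least-squares solution

# Map an unseen source word vector into the target embedding space.
mapped = rng.normal(size=(1, d)) @ W
print(mapped.shape)                    # (1, 8)
```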

Our work is closely related to unsupervised NMT (UNMT) (Artetxe et al., 2018c; Lample et al., 2018b; Yang et al., 2018; Sun et al., 2019), which does not use parallel sentences either. The difference is that UNMT may use an artificial dictionary generated by unsupervised BLI for initialization (Artetxe et al., 2018c; Lample et al., 2018a), or abandon the artificial dictionary by using joint BPE so that multiple BPE units can be shared by both languages (Lample et al., 2018b). We use the ground-truth dictionary instead and apply it throughout a novel training process. UNMT works well on close language pairs such as English-French, while performing remarkably badly on distant language pairs, for which aligning the embeddings of the two languages is quite challenging. We use the ground-truth dictionary to alleviate this problem, and experiments on distant language pairs show the necessity of using the bilingual dictionary.

Other utilizations of the bilingual dictionary for tasks beyond MT include cross-lingual dependency parsing (Xiao and Guo, 2014), unsupervised cross-lingual part-of-speech tagging and semi-supervised cross-lingual super sense tagging (Gouws and Søgaard, 2015), multilingual word embedding training (Ammar et al., 2016; Duong et al., 2016), and transfer learning for low-resource language modeling (Cohn et al., 2017).

3 Our Approach

There are multiple freely available bilingual dictionaries such as the Muse dictionary2 (Conneau et al., 2018), Wiktionary3, and PanLex4. We adopt the Muse dictionary, which contains 110 large-scale ground-truth bilingual dictionaries.

We propose to inject the bilingual dictionary into MT training by placing anchoring points on the large-scale monolingual corpora to drive the semantic spaces of both languages closer, so that MT training without parallel sentences becomes easier. We present the proposed Anchored Training (AT) and Bi-view AT in the following.

3.1 Anchored Training (AT)

Since word embeddings are trained on monolingual corpora independently, the embedding spaces of the two languages are quite different, leading to significant translation difficulty. AT forces the words of a translation pair to share the same word embedding as an anchor.

2 https://github.com/facebookresearch/MUSE
3 https://en.wiktionary.org/wiki/Wiktionary:Main_Page
4 https://panlex.org/


Figure 1: Illustration of (a) AT and (b) Bi-view AT. We use a source language sentence “s1s2s3s4” and a target language sentence “t1t2t3t4t5” from the large-scale monolingual corpora as an example. “.” denotes an anchoring point, which replaces a word with its translation based on the bilingual dictionary. Thin arrows (↓) denote NMT decoding, thick arrows (⇓) denote training an NMT model, and ⇢ and ⇠ denote generating the anchored sentence based on the dictionary. Words with primes such as t1′ denote the decoding output of a thin arrow.

We place multiple anchors by selecting words covered by the bilingual dictionary. With stable anchors, the embedding spaces of the two languages become closer and closer during the AT process.

As illustrated in Figure 1 (a), given the source sentence “s1s2s3s4” with the words s2 and s3 covered by the bilingual dictionary, we replace the two words with their translation words according to the dictionary. This results in the anchored source sentence “s1 s2.t s3.t s4”, in which s2.t and s3.t serve as the anchors; they are actually the target language words obtained by translating s2 and s3 according to the dictionary, respectively. Through the anchors, some words on the source side share the same word embeddings with the corresponding words on the target side. The AT process strengthens the consistency of the embedding spaces of both languages based on these anchors.
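The anchoring step itself is a simple dictionary lookup. Below is a minimal sketch under the assumption that the dictionary is stored as a plain source-to-target word mapping; the helper name and toy entries are illustrative, not from the released code.

```python
# Minimal sketch: replace every dictionary-covered word in a source sentence
# with its target-side translation, keeping all other words unchanged.
def anchor(sentence, dictionary):
    return [dictionary.get(word, word) for word in sentence]

# Toy example mirroring Figure 1 (a): s2 and s3 are covered by the dictionary.
src_dict = {"s2": "s2.t", "s3": "s3.t"}             # hypothetical entries
print(anchor(["s1", "s2", "s3", "s4"], src_dict))   # ['s1', 's2.t', 's3.t', 's4']
```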

The training process illustrated in Figure 1 (a) consists of a mutual back-translation procedure. The anchored source sentence “s1 s2.t s3.t s4” is translated into the target sentence “t1′ t2′ t3′” by source-to-target decoding; then “t1′ t2′ t3′” and “s1 s2.t s3.t s4” constitute a sentence pair for training the target-to-source translation model. Conversely, the target sentence “t1t2t3t4t5” is translated into the anchored source sentence “s1′ s2′ s3.t′ s4′” by target-to-source decoding; then both sentences constitute a sentence pair for training the source-to-target translation model. Note that during training of the translation model, the input sentences are always pseudo sentences generated by decoding an MT model, while the output sentences are always true or anchored true sentences. Besides this mutual back-translation procedure, a denoising procedure used in unsupervised MT (Lample et al., 2018b) is also adopted. Deletion and permutation noises are added to the source/target sentence, and the translation model is also trained to denoise them into the original source/target sentence.
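For readers unfamiliar with the denoising objective, the sketch below illustrates the two noise types in the spirit of Lample et al. (2018b); the drop probability and shuffle window are illustrative values, not the paper's settings.

```python
# Minimal sketch of deletion and permutation noise for the denoising objective.
import random

random.seed(0)

def add_noise(words, p_drop=0.1, shuffle_k=3):
    # Deletion noise: drop each word independently with probability p_drop.
    kept = [w for w in words if random.random() > p_drop] or words[:1]
    # Permutation noise: each word may move by at most roughly shuffle_k positions.
    keys = [i + random.uniform(0, shuffle_k) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept))]

print(add_noise("t1 t2 t3 t4 t5".split()))
```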

During testing, a source sentence is first transformed into an anchored sentence by looking up the bilingual dictionary. Then we use the source-to-target model trained in the AT process to decode the anchored sentence.

We use the Transformer architecture (Vaswani et al., 2017) as our translation model, with four stacked layers in both the encoder and the decoder. In the encoder, the last three layers are shared by both languages, and the first layer is not shared. In the decoder, the first three layers are shared by both languages, and the last layer is not shared. Such an architecture is designed to capture both the common and the language-specific characteristics of the two languages in one model during training.
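The sharing pattern can be expressed compactly in PyTorch. The sketch below shows only the encoder side (a language-specific first layer plus three shared layers); the hidden sizes follow Section 4.2, while the number of attention heads and all module names are illustrative assumptions rather than the released model.

```python
# Minimal PyTorch sketch of the encoder sharing pattern: the first layer is
# private to each language, layers 2-4 are shared by both languages.
import torch
import torch.nn as nn

def make_layer():
    return nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

class PartiallySharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.first = nn.ModuleDict({"src": make_layer(), "tgt": make_layer()})
        self.shared = nn.ModuleList([make_layer() for _ in range(3)])

    def forward(self, x, lang):
        x = self.first[lang](x)          # language-specific first layer
        for layer in self.shared:        # layers shared by both languages
            x = layer(x)
        return x

enc = PartiallySharedEncoder()
out = enc(torch.randn(10, 2, 512), lang="src")   # (seq_len, batch, d_model)
```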


3.2 Bi-view AT

AT as illustrated in Figure 1 (a) actually models the sentences of both languages in the target language view, using partial source sentences whose covered words are replaced with target words, together with the full target language sentences. Bi-view AT enhances AT by adding the other language view. Figure 1 (b) adds the source language view, shown in the right part, to accompany the target language view of Figure 1 (a). In particular, the target language sentence “t1t2t3t4t5” becomes “t1 t2 t3.s t4 t5.s” after looking up the bilingual dictionary. Such partially replaced target sentences, together with the full source language sentences such as “s1s2s3s4”, constitute the source language view.

Based on the target language view shown in the left part and the source language view shown in the right part, we further combine both views through the pseudo sentences denoted by primes in Figure 1 (b). As shown by “⇢” in Figure 1 (b), “t1′t2′t3′” is further transformed into “t1′ t2.s′ t3.s′” by looking up the bilingual dictionary. Similarly, “s1′s2′s3′” is further transformed into “s1′s2′s3.t′”, as shown by “⇠”. Finally, the solid line box represents training the source-to-target model on data from both views, and the dashed line box represents training the target-to-source model on data from both views.

Bi-view AT starts by training both views in parallel. After both views converge, we generate pseudo sentences in both the solid line box and the dashed line box, and pair these pseudo sentences (as input) with genuine sentences (as output) to train the corresponding translation model. This generation and training process iterates until Bi-view AT converges. Through such rich views, the translation models of both directions are mutually strengthened.

3.3 Anchored Cross-lingual Pretraining (ACP)

Cross-lingual pretraining has demonstrated effectiveness on tasks such as cross-lingual classification and unsupervised MT (Lample and Conneau, 2019). It is conducted over large monolingual corpora by masking random words and training to predict them as a cloze task. Instead, we propose ACP to pretrain on data that is obtained by transforming the genuine monolingual corpora of both languages into the anchored version. For example, words in the source language corpus that are covered by the bilingual dictionary are replaced with their respective translation words. Such words are anchoring points that can drive the pretraining to close the gap between the source language space and the target language space better than the original pretraining method of Lample and Conneau (2019) does, as evidenced by the experiments in Section 4.5. The anchored source language corpus and the genuine target language corpus constitute the target language view for ACP.
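As a concrete illustration, the sketch below builds target-language-view ACP data: the source corpus is anchored with the dictionary, concatenated with the genuine target corpus, and random tokens are masked for the cloze objective. The masking rate and helper names are illustrative assumptions, not the paper's exact preprocessing.

```python
# Minimal sketch of preparing ACP pretraining data for the target language view.
import random

random.seed(0)

def mask_for_cloze(words, p_mask=0.15):
    # Cloze-style masking: hide a random subset of tokens for the model to predict.
    return [("[MASK]" if random.random() < p_mask else w) for w in words]

src_corpus = [["s1", "s2", "s3", "s4"]]
tgt_corpus = [["t1", "t2", "t3", "t4", "t5"]]
src_dict = {"s2": "s2.t", "s3": "s3.t"}            # hypothetical dictionary entries

anchored_src = [[src_dict.get(w, w) for w in sent] for sent in src_corpus]
pretraining_data = [mask_for_cloze(s) for s in anchored_src + tgt_corpus]
print(pretraining_data)
```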

ACP can be conducted in either the source language view or the target language view. After ACP, each pretrained model is used to initialize the encoder of the corresponding AT system.

3.4 Training Procedure

For AT, the pseudo sentence generation step and the NMT training step are interleaved. Take the target language view AT shown in Figure 1 (a) for example: we extract anchored source sentences as one batch and decode them into pseudo target sentences; then we use the same batch to train the NMT model of target-to-anchored source. In the meantime, a batch of target sentences is decoded into pseudo anchored source sentences, and then we use the same batch to train the NMT model of anchored source-to-target. The above process repeats until AT converges.
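The interleaving described above can be summarized in the following sketch of one AT iteration in the target language view; translate() and train_step() are placeholders for the actual NMT decoding and parameter-update calls, not functions from the released code.

```python
# Minimal sketch of one AT iteration (target language view, Figure 1a): pseudo
# sentences generated by one direction become training inputs for the other.
def at_iteration(anchored_src_batch, tgt_batch, src2tgt, tgt2src):
    # Anchored source -> pseudo target; train the target-to-(anchored-)source model.
    pseudo_tgt = [src2tgt.translate(s) for s in anchored_src_batch]
    tgt2src.train_step(inputs=pseudo_tgt, outputs=anchored_src_batch)

    # Target -> pseudo anchored source; train the (anchored-)source-to-target model.
    pseudo_src = [tgt2src.translate(t) for t in tgt_batch]
    src2tgt.train_step(inputs=pseudo_src, outputs=tgt_batch)
```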

For Bi-view AT, after each mono-view AT converges, we set a larger batch for generating pseudo sentences as shown in the solid/dashed line boxes in Figure 1 (b), and train the corresponding NMT model using the same batch.

For ACP, we follow the XLM procedure (Lample and Conneau, 2019) and conduct pretraining on the anchored monolingual corpus concatenated with the genuine corpus of the other language.

4 Experimentation

We conduct experiments on English-French, English-Russian, and English-Chinese translation to check the potential of our MT system with only a bilingual dictionary and large-scale monolingual corpora. The English-French task deals with translation between closely related languages, while the English-Russian and English-Chinese tasks deal with translation between distant languages that do not share the same alphabets.


4.1 Datasets

For the English-French translation task, we use the monolingual data released by XLM (Lample and Conneau, 2019)5. For the English-Russian translation task, we use monolingual data identical to Lample et al. (2018a), which uses all available sentences from the WMT monolingual News Crawl datasets from the years 2007 to 2017. For the English-Chinese translation task, we extract Chinese sentences from half of the 4.4M parallel sentences from LDC, and extract English sentences from the complementary half. We use WMT newstest-2013/2014, WMT newstest-2015/2016, and NIST2006/NIST2002 as validation/test sets for English-French, English-Russian, and English-Chinese, respectively.

For cross-lingual pretraining, we extract raw sentences from Wikipedia dumps, which contain 80M, 60M, 13M, and 5.5M monolingual sentences for English, French, Russian, and Chinese, respectively.

Muse ground-truth bilingual dictionaries are used for our dictionary-related experiments. If a word has multiple translations, we select the translation word that appears most frequently in the monolingual corpus. Table 1 summarizes the number of word pairs and their coverage of the monolingual corpora on the source side.

       entry no.   coverage
fr→en  97,046      60.91%
en→fr  94,719      69.77%
ru→en  45,065      65.43%
en→ru  42,725      88.77%
zh→en  13,749      50.20%
en→zh  32,495      47.02%

Table 1: Statistics of Muse bilingual dictionaries.
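The rule described above for collapsing multi-translation entries (keeping the candidate that is most frequent in the monolingual corpus) can be written as a one-liner; the sketch below uses toy counts and is only illustrative.

```python
# Minimal sketch: choose, among several dictionary translations of a word, the
# one that occurs most frequently in the target-side monolingual corpus.
from collections import Counter

def pick_translation(candidates, corpus_counts):
    return max(candidates, key=lambda w: corpus_counts.get(w, 0))

corpus_counts = Counter({"maison": 120, "domicile": 30})        # toy frequencies
print(pick_translation(["maison", "domicile"], corpus_counts))  # 'maison'
```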

4.2 Experiment Settings

For AT/Bi-view AT without cross-lingual pretraining, we use a Transformer with 4 layers, 512 embedding/hidden units, and a 2048 feed-forward filter size, for a fair comparison to UNMT (Lample et al., 2018b). For AT/Bi-view AT with ACP, we use a Transformer with 6 layers, 1024 embedding/hidden units, and a 4096 feed-forward filter size, for a fair comparison to XLM (Lample and Conneau, 2019).

We conduct joint byte-pair encoding (BPE) on the monolingual corpora of both languages, with a shared vocabulary of 60k tokens for the English-French and English-Russian tasks, and 40k tokens for the English-Chinese task (Sennrich et al., 2016b).

5 https://github.com/facebookresearch/XLM/blob/master/get-data-nmt.sh

During training, we set the batch size to 32 and limit the sentence length to 100 BPE tokens. We employ the Adam optimizer with lr = 0.0001, t_warmup = 4000, and dropout = 0.1. At decoding time, we generate greedily with length penalty α = 1.0.

4.3 Baselines

• Word-by-word translation by looking up the ground-truth dictionary or the artificial dictionary generated by Conneau et al. (2018).

• Unsupervised NMT (UNMT), which does not rely on any parallel resources (Lample et al., 2018b)6. Besides, cross-lingual pretraining (XLM) based UNMT (Lample and Conneau, 2019)7 is also set as a stronger baseline (XLM+UNMT).

• We implement UNMT initialized by Unsupervised Word Embedding Transformation (UNMT+UWET) as a baseline (Artetxe et al., 2018d). The transformation function is learned in an unsupervised way without using any ground-truth bilingual dictionaries (Conneau et al., 2018)8.

• We also implement a UNMT system initialized by Supervised Word Embedding Transformation (UNMT+SWET) as a baseline. Instead of the UWET used in Artetxe et al. (2018d), we use the ground-truth bilingual dictionary as the supervision signal to train the transformation function for transforming the source word embeddings into the target language space (Conneau et al., 2018). After such initialization, the gap between the embedding spaces of both languages is narrowed for easier UNMT training.

4.4 Experimental Results: without Cross-lingual Pretraining

The upper part of Table 2 presents the results of various baselines and our AT approaches. AT and Bi-view AT significantly outperform the baselines, and Bi-view AT is consistently better than AT. Detailed comparisons are listed below:

Results of Word-by-word Translation

6 https://github.com/facebookresearch/UnsupervisedMT
7 https://github.com/facebookresearch/XLM
8 https://github.com/facebookresearch/MUSE


system                                       fr→en  en→fr  ru→en  en→ru  zh→en  en→zh
Without Cross-lingual Pre-training
Word-by-word using artificial dictionary      7.76   4.88   3.05   1.60   1.99   1.14
Word-by-word using ground-truth dictionary    7.97   6.61   4.17   2.81   2.68   1.79
UNMT (Lample et al., 2018b)                  24.02  25.10   9.09   7.98   1.50   0.45
UNMT+SWET                                    21.11  21.22   9.79   4.07  19.78   7.84
UNMT+UWET                                    19.80  21.27   8.79   6.21  15.54   6.62
AT                                           25.07  26.36  10.20   9.91  19.83   9.18
Bi-view AT                                   27.11  27.54  12.85  10.64  21.16  11.23
With Cross-lingual Pre-training
XLM+UNMT (Lample and Conneau, 2019)          33.28  35.10  17.39  13.29  20.68  11.28
ACP+AT                                       33.51  36.15  16.41  15.43  26.80  13.91
ACP+Bi-view AT                               34.05  36.56  20.09  17.62  30.12  17.05
Supervised SMT                                   -      -  21.48  14.54  31.86  16.55

Table 2: Experiment results evaluated by BLEU using the multi-bleu script.

Table 2 shows that using the ground-truth dictionary is slightly better than using the artificial one generated by Conneau et al. (2018). Both performances are remarkably bad, indicating that simple word-by-word translation is not qualified as an MT method. A more effective utilization of the bilingual dictionary is needed to improve translation performance.

Comparison between UNMT and UNMT with WET Initialization

UNMT-related systems generally improve over word-by-word translation. On the closely related language pair English-French, UNMT is better than UNMT+UWET/SWET. This is partly because numerous BPE units are shared by English and French, making it easy to establish a shared word embedding space for both languages. In this case, WET, which transforms the source word embeddings into the target language space, does not seem to be a necessary initialization step, since the shared BPE units already establish the shared space.

On distant language pairs, UNMT does not have an advantage over UNMT with WET initialization. Especially on English-Chinese, UNMT performs extremely badly, even worse than the word-by-word translation method. We argue that this is because the BPE units shared by both languages are so few that UNMT fails to align the language spaces. In contrast, using the bilingual dictionary greatly alleviates this problem for distant language pairs. UNMT+SWET, which transforms the source word embeddings into the target word embedding space supervised by the bilingual dictionary, outperforms UNMT by more than 18 BLEU points on Chinese-to-English and more than 7 BLEU points on English-to-Chinese. This indicates the necessity of the bilingual dictionary for translation between distant language pairs.

Comparison between AT/Bi-view AT and the Baselines

Our proposed AT approaches significantly outperform the baselines. The baselines that use the ground-truth bilingual dictionary, i.e., word-by-word translation using the dictionary and UNMT+SWET, which uses the dictionary to supervise the word embedding transformation, are inferior to our AT approaches.

The AT approaches consistently improve performance on both the closely related language pair English-French and the distant language pairs English-Russian and English-Chinese. Our Bi-view AT achieves the best performance on all language pairs.

4.5 Experimental Results: with Cross-lingual Pretraining

The bottom part of Table 2 reports the performance of UNMT with XLM, which conducts cross-lingual pretraining on concatenated non-parallel corpora (Lample and Conneau, 2019), and the performance of our AT/Bi-view AT with anchored cross-lingual pretraining, i.e., ACP. The results show that our proposed AT approaches are still superior when equipped with cross-lingual pretraining.

UNMT obtains a great improvement when combined with XLM, achieving state-of-the-art unsupervised MT performance, better than Unsupervised SMT (Artetxe et al., 2019) and Unsupervised NMT (Lample et al., 2018b), across close and distant language pairs.


Figure 2: Visualization of the bilingual word embeddings after Bi-view AT.

ACP+AT/Bi-view AT performs consistently better than XLM+UNMT. Especially on distant language pairs, ACP+Bi-view AT gains 2.7-9.4 BLEU improvements over the strong XLM+UNMT. This indicates that AT/Bi-view AT with ACP builds closer language spaces via anchored pretraining and anchored training. We examine this advantage in the analyses of Section 4.6.

Comparison with Supervised SMT

To check the ability of our system using only the dictionary and non-parallel corpora, we make a comparison to supervised SMT trained on over 4M parallel sentences, which are from WMT19 for English-Russian and from LDC for English-Chinese. We use Moses9 as the supervised SMT system, with a 5-gram language model trained on the target language part of the parallel corpora.

The bottom part of Table 2 shows that ACP+Bi-view AT performs comparably to supervised SMT, and performs even better on English-to-Russian and English-to-Chinese.

4.6 Analyses

We analyze the cross-lingual properties of our approaches at both the word level and the sentence level. We also compare the performance of the ground-truth dictionary with that of the artificial dictionary. In the end, we vary the size of the bilingual dictionary and report its impact on AT training.

9 http://www.statmt.org/moses/. We use the default settings of Moses.

Effect on Bilingual Word Embeddings

As shown in Figure 2, we depict the word embeddings of some sampled words in English-Chinese after our Bi-view AT. The dimensions of the embedding vectors are reduced to two by using t-SNE and are visualized by the visualization tool in Tensorflow10.

We first sample English words that are not covered by the dictionary, then search for their nearest Chinese neighbors in the embedding space. Figure 2 shows that the words that constitute a new ground-truth translation pair do appear as neighboring points in the two-dimensional visualization.

Precision of New Word Pairs

We continue the study of the bilingual word embeddings with a quantitative analysis of new word pairs, which are detected by searching for bilingual words that are neighbors in the word embedding space, and evaluate them against the ground-truth bilingual dictionary. In particular, we split the Muse dictionary of Chinese-to-English into a standard training set and test set as in BLI (Artetxe et al., 2018a). The training set is used for the dictionary-based systems, including our AT/Bi-view AT, UNMT+SWET, and Muse, which is a BLI toolkit. The test set is used to evaluate these systems by computing the precision of the discovered translation words given the source words in the test set. The neighborhood is computed by the CSLS distance (Conneau et al., 2018).
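For completeness, the following is a minimal numpy sketch of how precision@k under CSLS ranking can be computed; the embeddings and gold dictionary are toy placeholders, and the CSLS neighborhood size of 10 is a common choice rather than a setting reported in the paper.

```python
# Minimal sketch of precision@k for discovered word pairs under CSLS ranking
# (Conneau et al., 2018). All data here are random toy placeholders.
import numpy as np

def csls_scores(src_emb, tgt_emb, k_csls=10):
    S = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cos = S @ T.T                                       # pairwise cosine similarity
    r_src = np.sort(cos, axis=1)[:, -k_csls:].mean(1)   # mean sim to k nearest targets
    r_tgt = np.sort(cos, axis=0)[-k_csls:, :].mean(0)   # mean sim to k nearest sources
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

def precision_at_k(scores, gold_ids, k=5):
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([gold_ids[i] in topk[i] for i in range(len(gold_ids))]))

rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(50, 16)), rng.normal(size=(50, 16))
gold_ids = np.arange(50)        # toy gold: the i-th source pairs with the i-th target
print(precision_at_k(csls_scores(src, tgt), gold_ids, k=5))
```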

Table 3 shows the precision, where precision@k indicates the accuracy of the top-k predicted candidates. Muse induces new word pairs in either a supervised or an unsupervised way. MuseSupervised is better than MuseUnsupervised since it is supervised by the ground-truth bilingual dictionary. Our AT/Bi-view AT surpasses MuseSupervised by a large margin. UNMT+SWET/UWET also obtains good performance through the word embedding transformation. Bi-view AT significantly surpasses UNMT+SWET/UWET in precision@5 and precision@10, while being worse in precision@1. This indicates that Bi-view AT can produce better n-best translation words, which are beneficial for NMT beam decoding to find better translations.

Through the word level analysis, we can see that AT/Bi-view AT leads to a more consistent word embedding space shared by both languages, making translation between the two languages easier.

10 https://projector.tensorflow.org/

              MuseUnsupervised  MuseSupervised  UNMT+SWET  UNMT+UWET  AT     Bi-view AT
Precision@1   30.51             35.38           48.01      45.85      43.32  45.49
Precision@5   55.42             58.48           68.05      67.15      68.23  72.02
Precision@10  62.45             63.18           72.02      72.20      73.83  76.71

Table 3: Precision of Discovered New Word Pairs.

Figure 3: Sentence level cosine similarity of the parallel sentences on each encoder layer.

Sentence Level Similarity of Parallel Sentences

We check the sentence-level representational invariance across languages for the cross-lingual pretraining methods. In detail, following Arivazhagan et al. (2018), we apply a max-pooling operation to collect the sentence representation at each encoder layer for all Chinese-to-English sentence pairs in the test set. Then we calculate the cosine similarity for each sentence pair and average all cosine scores.
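A minimal sketch of this measurement is shown below: the per-layer token states of a parallel sentence pair are max-pooled into sentence vectors and compared with cosine similarity. The tensors are random placeholders standing in for actual encoder outputs.

```python
# Minimal sketch of the sentence-level similarity measurement: max-pool the
# encoder states of each sentence into one vector, then take cosine similarity.
import torch
import torch.nn.functional as F

def sentence_similarity(src_states, tgt_states):
    # states: (seq_len, hidden) encoder outputs of one layer for one sentence.
    src_vec = src_states.max(dim=0).values
    tgt_vec = tgt_states.max(dim=0).values
    return F.cosine_similarity(src_vec, tgt_vec, dim=0).item()

# Toy placeholders for one zh-en sentence pair at one encoder layer.
zh_states, en_states = torch.randn(12, 1024), torch.randn(9, 1024)
print(sentence_similarity(zh_states, en_states))
```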

Figure 3 shows the sentence-level cosine similarity. ACP+Bi-view AT consistently has a higher similarity for parallel sentences than XLM+UNMT on all encoder layers. Comparing Bi-view AT with AT, Bi-view AT is better on more encoder layers.

From both the word-level and sentence-level analyses, we can see that our AT methods achieve better cross-lingual invariance and significantly reduce the gap between the source language space and the target language space, leading to decreased translation difficulty between the two languages.

Ground-Truth Dictionary vs. Artificial Dictionary

Table 4 presents the comparison in English-Chinese. The ground-truth dictionary is from the Muse dictionary deposit, and the artificial dictionary is generated by unsupervised BLI (Conneau et al., 2018). We extract the top-n word pairs as the artificial dictionary, where n is the same as the number of entries in the ground-truth dictionary.

             Ground-Truth Dict.   Artificial Dict.
             zh→en    en→zh       zh→en    en→zh
AT           19.83    9.18        16.7     6.98
Bi-view AT   21.16    11.23       18.23    8.50

Table 4: BLEU of AT methods using either the ground-truth dictionary or the artificial dictionary.

XLM+UNMT                              20.68
ACP+AT with 1/4 of the dictionary     22.84
ACP+AT with 1/2 of the dictionary     24.32
ACP+AT with the full dictionary       26.80

Table 5: BLEU of ACP+AT using different sizes of the dictionary in zh→en translation.

Both dictionaries are used with the AT methods for translation. As shown in Table 4, the ground-truth dictionary performs significantly better than the artificial dictionary for both methods and both translation directions.

The Effect of the Dictionary Size

We randomly select a portion of the ground-truth bilingual dictionary to study the effect of the dictionary size on performance. Table 5 reports the performance of ACP+AT using a quarter or a half of the zh→en dictionary.

It shows that, in comparison to the XLM+UNMT baseline that does not use a dictionary, a quarter of the dictionary, consisting of around 3k word pairs, is capable of improving the performance significantly. More word pairs in the dictionary lead to better translation results, suggesting that expanding the size of the current Muse dictionary by collecting various dictionaries built by human experts may improve translation performance further.

5 Discussion and Future Work

In the literature of unsupervised MT that only uses non-parallel corpora, Unsupervised SMT (USMT) and Unsupervised NMT (UNMT) are complementary to each other. Combining them (USMT+UNMT) achieves a significant improvement over the individual systems, and performs comparably to XLM+UNMT (Lample et al., 2018b; Artetxe et al., 2019).

We have set XLM+UNMT as a stronger baseline, and our ACP+AT/Bi-view AT surpasses it significantly. Following the literature of unsupervised MT, we could also combine ACP+AT/Bi-view AT with SMT. We leave this as future work.

6 Conclusion

In this paper, we explore how much potential an MT system can achieve when using only a bilingual dictionary and large-scale monolingual corpora. This task simulates how people acquire translation ability by looking up the dictionary without depending on parallel sentence examples. We propose to tackle the task by injecting the bilingual dictionary into MT via anchored training, which drives both language spaces closer so that translation becomes easier. Experiments show that, on both close and distant language pairs, our proposed approach effectively reduces the gap between the source language space and the target language space, leading to significant improvements in translation quality over MT approaches that do not use the dictionary and approaches that use the dictionary only to supervise the cross-lingual word embedding transformation.

Acknowledgments

The authors would like to thank the anonymous reviewers for their helpful comments. This work was supported by the National Natural Science Foundation of China (Grant No. 61525205, 61673289) and the National Key R&D Program of China (Grant No. 2016YFE0132100), and was also partially supported by the joint research project of Alibaba and Soochow University.

References

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. ArXiv:1602.01925.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2018. The missing ingredient in zero-shot neural machine translation. ArXiv:1903.07091.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proc. of EMNLP.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proc. of ACL.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018b. Unsupervised statistical machine translation. In Proc. of EMNLP.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019. An effective approach to unsupervised machine translation. In Proc. of ACL.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018c. Unsupervised neural machine translation. In Proc. of ICLR.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018d. Unsupervised neural machine translation. In Proc. of ICLR.

Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In Proc. of EMNLP.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.

Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Michael Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proc. of ACL.

Trevor Cohn, Steven Bird, Graham Neubig, Oliver Adams, and Adam J. Makarucha. 2017. Cross-lingual word embeddings for low-resource language modeling. In Proc. of EACL.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In Proc. of ICLR.

Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning crosslingual word embeddings without bilingual corpora. In Proc. of EMNLP.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proc. of EACL.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In Proc. of ICML.

Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In Proc. of HLT-NAACL.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. ArXiv:1901.07291.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018a. Unsupervised machine translation using monolingual corpora only. In Proc. of ICLR.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Phrase-based & neural unsupervised machine translation. In Proc. of EMNLP.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proc. of ACL.

Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Deep multilingual correlation for improved word embeddings. In Proc. of HLT-NAACL.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. ArXiv:1309.4168.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proc. of ACL.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proc. of ACL.

Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In Proc. of ICLR.

Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, and Tiejun Zhao. 2019. Unsupervised bilingual word embedding agreement for unsupervised neural machine translation. In Proc. of ACL.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. of NIPS.

Min Xiao and Yuhong Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proc. of CoNLL.

Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proc. of HLT-NAACL.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Unsupervised neural machine translation with weight sharing. In Proc. of ACL.

Jiajun Zhang and Chengqing Zong. 2016. Bridging neural machine translation and bilingual dictionaries. ArXiv:1610.07272.

Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi S. Jaakkola. 2016. Ten pairs to tag - multilingual pos tagging via coarse mapping between embeddings. In Proc. of HLT-NAACL.