
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1934–1945, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Earth Mover's Distance Minimization for Unsupervised Bilingual Lexicon Induction

Meng Zhang†‡  Yang Liu†‡  Huanbo Luan†  Maosong Sun†‡

†State Key Laboratory of Intelligent Technology and Systems
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Science and Technology, Tsinghua University, Beijing, China
‡Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China

[email protected], [email protected], [email protected], [email protected]

Abstract

Cross-lingual natural language processing hinges on the premise that there exists invariance across languages. At the word level, researchers have identified such invariance in the word embedding semantic spaces of different languages. However, in order to connect the separate spaces, cross-lingual supervision encoded in parallel data is typically required. In this paper, we attempt to establish the cross-lingual connection without relying on any cross-lingual supervision. By viewing word embedding spaces as distributions, we propose to minimize their earth mover's distance, a measure of divergence between distributions. We demonstrate the success on the unsupervised bilingual lexicon induction task. In addition, we reveal an interesting finding that the earth mover's distance shows potential as a measure of language difference.

1 Introduction

Despite tremendous variation and diversity, languages are believed to share something in common. Indeed, this belief forms the underlying basis of computational approaches to cross-lingual transfer (Täckström et al., 2013, inter alia); otherwise it would be inconceivable for the transfer to successfully generalize.

Linguistic universals manifest themselves at various levels of linguistic units. At the word level, there is evidence that different languages represent concepts with similar structure (Youn et al., 2016). Interestingly, as computational models of word semantics, monolingual word embeddings also exhibit isomorphism across languages (Mikolov et al., 2013a). This finding opens up the possibility to use a simple transformation, e.g. a linear map, to connect separately trained word embeddings cross-lingually. Learning such a transformation typically calls for cross-lingual supervision from parallel data (Faruqui and Dyer, 2014; Lu et al., 2015; Dinu et al., 2015; Lazaridou et al., 2015; Smith et al., 2017).

In this paper, we ask the question: Can we uncover the transformation without any cross-lingual supervision? At first sight, this task appears formidable, as it would imply that a bilingual semantic space can be constructed by using monolingual corpora only. On the other hand, the existence of structural isomorphism across monolingual embedding spaces points to the feasibility of this task: The transformation exists right there, only to be discovered by the right tool.

We propose such a tool to answer the above question in the affirmative. The key insight is to view embedding spaces as distributions, and the desired transformation should make the two distributions close. This naturally calls for a measure of distribution closeness, for which we introduce the earth mover's distance. Therefore, our task can be formulated as the minimization of the earth mover's distance between the transformed source embedding distribution and the target one with respect to the transformation. Importantly, the minimization is performed at the distribution level, and hence no word-level supervision is required.

We demonstrate that the earth mover's distance minimization successfully uncovers the transformation for cross-lingual connection, as evidenced by experiments on the bilingual lexicon induction task. In fact, as an unsupervised approach, its performance turns out to be highly competitive with supervised methods. Moreover, as an interesting byproduct, the earth mover's distance provides a distance measure that may quantify a facet of language difference.


Figure 1: An illustration of our earth mover's distance minimization formulation. The subplots on the left schematically visualize Chinese and English embeddings. Due to isomorphism, there exists a simple transformation G that aligns the two embedding spaces well, as shown on the right. We expect to find the transformation G by minimizing the earth mover's distance without the need for cross-lingual word-level supervision, because the earth mover's distance holistically measures the closeness between two sets of weighted points. It computes the minimal cost of transporting one set of points to the other, whose weights are indicated by the sizes of squares and dots. We show the transport scheme in the right subplot with arrows, which can be interpreted as word translations.

2 Background

2.1 Aligning Isomorphic Embeddings

As discovered by previous work (Mikolov et al., 2013a), monolingual word embeddings exhibit isomorphism across languages, i.e., they appear similar in structure. However, as they are trained independently, the specific "orientation" of each embedding space is arbitrary, as illustrated in the left part of Figure 1. In order to connect the separate embedding spaces, we can try to transform the source embeddings so that they align well with the target ones. Naturally, we need a measure for the quality of the alignment to guide our search for the transformation.

As we aim to eliminate the need for cross-lingual supervision from word translation pairs, the measure cannot be defined at the word level as in previous work (Mikolov et al., 2013a). Rather, it should quantify the difference between the entire distributions of embeddings. With this in mind, we find the earth mover's distance to be a suitable choice (Zhang et al., 2016b). Its workings are illustrated in the right part of Figure 1. We can think of target embeddings as piles of earth, and transformed source embeddings as holes to be filled. Then the earth mover's distance computes the minimal cost of moving the earth to fill the holes. Clearly, if the two sets of embeddings align well, the earth mover's distance will be small. Therefore, we can try to find the transformation that minimizes the earth mover's distance.

Another desirable feature of the earth mover's distance is that the computed transport scheme can be readily interpreted as translations. Moreover, this interpretation naturally handles multiple alternative translations. For example, the Chinese word "mao" can be translated to "cat" or "kitten", as shown in Figure 1.

2.2 The Form of the Transformation

The approximate isomorphism across embedding spaces inspires researchers to use a simple form of transformation. For example, Mikolov et al. (2013a) chose to use a linear transformation, i.e. the transformation G parametrized by a matrix. Later, proposals for using an orthogonal transformation are supported empirically (Xing et al., 2015; Zhang et al., 2016c; Artetxe et al., 2016) and theoretically (Smith et al., 2017). Indeed, an orthogonal transformation has desirable properties in this setting. If $G$ is an orthogonal matrix that transforms the source embeddings into the target space, then its transpose (also its inverse) $G^\top$ performs the transformation in the reverse direction. In that case, any word embedding $a$ can be recovered by transforming back and forth because $G^\top G a = a$. Moreover, computing the cosine similarity between a source embedding $a$ and a target embedding $b$ will be independent of the semantic space in which the similarity is measured, because $b^\top G a / (\|Ga\| \|b\|) = a^\top G^\top b / (\|a\| \|G^\top b\|)$. Therefore we are inclined to use an orthogonal transformation for our task.
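Both properties are easy to verify numerically. The following sketch is ours (a random orthogonal matrix stands in for a learned transformation) and simply checks the two identities:

```python
import numpy as np

# Random orthogonal G via QR decomposition (a toy stand-in for a learned transformation).
rng = np.random.default_rng(0)
d = 50
G, _ = np.linalg.qr(rng.normal(size=(d, d)))

a = rng.normal(size=d)   # a "source" embedding
b = rng.normal(size=d)   # a "target" embedding

cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Back-and-forth recovery: G^T G a = a
assert np.allclose(G.T @ (G @ a), a)

# Cosine similarity is the same whether measured in the target space (Ga vs. b)
# or in the source space (a vs. G^T b).
assert np.isclose(cos(G @ a, b), cos(a, G.T @ b))
```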

2.3 The Earth Mover’s Distance

The earth mover's distance (EMD) is a powerful tool widely used in computer vision and natural language processing (Rubner et al., 1998; Kusner et al., 2015; Huang et al., 2016; Zhang et al., 2016b,a). Mathematically speaking, the EMD defines a distance between probability distributions. In the discrete case, a probability distribution can be represented by a sum of Dirac delta functions. For a pair of discrete distributions $P_1 = \sum_i u_i \delta_{x_i}$ and $P_2 = \sum_j v_j \delta_{y_j}$, the EMD is defined as

$$\mathrm{EMD}(P_1, P_2) = \min_{T \in U(u, v)} \sum_i \sum_j T_{ij} \, c(x_i, y_j), \quad (1)$$

where $c(x_i, y_j)$ gives the ground distance between $x_i$ and $y_j$, and $U(u, v)$ is known as the transport polytope, defined as

$$U(u, v) = \left\{ T \,\middle|\, T_{ij} \ge 0, \; \sum_j T_{ij} = u_i, \; \sum_i T_{ij} = v_j, \; \forall i, j \right\}. \quad (2)$$

After solving the minimization program (1), the transport matrix $T$ stores information of the transport scheme: A non-zero $T_{ij}$ indicates the amount of probability mass transported from $y_j$ to $x_i$. For our task, this can be interpreted as evidence for word translation (Zhang et al., 2016b), as indicated by arrows in the right part of Figure 1.

The EMD is closely related to the Wasserstein distance in mathematics, defined as

$$W(P_1, P_2) = \inf_{\gamma \in \Gamma(P_1, P_2)} \mathbb{E}_{(x, y) \sim \gamma} [c(x, y)], \quad (3)$$

where $\Gamma(P_1, P_2)$ denotes the set of all joint distributions $\gamma(x, y)$ with marginals $P_1$ and $P_2$ on the first and second factors respectively. As we can see, the Wasserstein distance generalizes the EMD to allow continuous distributions. In our context, we will use both terms interchangeably.
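To make program (1)-(2) concrete, the sketch below (ours, not part of the paper's implementation) computes the EMD between two small discrete distributions by solving the linear program directly with SciPy; the row and column constraints are exactly the transport polytope (2):

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd(x, y, u, v):
    """Exact EMD between sum_i u_i * delta_{x_i} and sum_j v_j * delta_{y_j}."""
    n, m = len(u), len(v)
    C = cdist(x, y)                               # ground distances c(x_i, y_j)
    # Equality constraints: row sums of T equal u, column sums equal v.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0          # sum_j T_ij = u_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                   # sum_i T_ij = v_j
    b_eq = np.concatenate([u, v])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    T = res.x.reshape(n, m)                       # optimal transport matrix
    return res.fun, T

x = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
u = np.array([0.5, 0.5])
v = np.array([0.4, 0.4, 0.2])
dist, T = emd(x, y, u, v)
print(dist)   # minimal transport cost
print(T)      # non-zero T_ij can be read off as "translations"
```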

Figure 2: The Wasserstein GAN for unsupervised bilingual lexicon induction. The generator G transforms the source word embeddings into the target space. The critic D takes both sets of embeddings and tries to estimate their Wasserstein distance, and this information will be passed to the generator G during training to guide it towards minimizing the Wasserstein estimate.

3 Approaches

In our task, we are interested in a pair of distributions of word embeddings, one for the source language and the other for the target language. A source word embedding $w^S_s$ is a $d$-dimensional column vector that represents the $s$-th source word in the $V^S$-sized source language vocabulary. Its distribution is characterized by a positive vector of frequencies $f^S$ satisfying $\sum_{s=1}^{V^S} f^S_s = 1$, i.e. $P(w^S_s) = f^S_s$. Notations are similar for the target side. We assume the embeddings are normalized to have unit $L_2$ norm, which makes no difference to the result as we use cosine to measure semantic similarity.

Under this setting, we develop two approaches to our EMD minimization idea, called WGAN (Section 3.1) and EMDOT (Section 3.2) respectively.
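Concretely, the inputs to both approaches are unit-normalized embedding matrices together with frequency vectors that sum to one; a small sketch of this setup (ours; the toy counts are illustrative):

```python
import numpy as np

def prepare(embeddings, counts):
    """Unit-normalize embedding rows and turn raw word counts into a frequency vector."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = embeddings / norms                 # each w_s has unit L2 norm
    f = counts / counts.sum()              # f_s > 0 and sum_s f_s = 1
    return W, f

# toy example: 4 source words with 50-dimensional embeddings
rng = np.random.default_rng(0)
W_src, f_src = prepare(rng.normal(size=(4, 50)), np.array([100.0, 40.0, 30.0, 10.0]))
```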

3.1 Wasserstein GAN (WGAN)

Generative adversarial nets (GANs) were originally proposed to generate natural images (Goodfellow et al., 2014). They can generate sharp images if trained well, but they are notoriously difficult to train. Therefore, a great deal of research effort has been dedicated to the investigation of stabler training (Radford et al., 2015; Salimans et al., 2016; Nowozin et al., 2016; Metz et al., 2016; Poole et al., 2016; Arjovsky and Bottou, 2017), and the recently proposed Wasserstein GAN (Arjovsky et al., 2017) is a promising technique along this line of research.

While the original GAN is formulated as an adversarial game (hence its name), the Wasserstein GAN can be directly understood as minimizing the Wasserstein distance (3). Figure 2 illustrates the concept in the context of our unsupervised bilingual lexicon induction task. The generator $G$ takes source word embeddings and transforms them, with the goal that the transformed source distribution $P_{G(S)}$ and the target distribution $P_T$ should be close as measured by the Wasserstein distance. The critic $D$ takes both transformed source word embeddings and target word embeddings and attempts to accurately estimate their Wasserstein distance, which will guide the generator during training. The overall objective is

$$\min_{G \in \mathbb{R}^{d \times d}} W\!\left(P_{G(S)}, P_T\right), \quad (4)$$

where $P_{G(S)} = \sum_{s=1}^{V^S} f^S_s \delta_{G w^S_s}$ and $P_T = \sum_{t=1}^{V^T} f^T_t \delta_{w^T_t}$ are the distributions of transformed source word embeddings and target word embeddings. Here we do not impose the orthogonal constraint on $G$ to facilitate the use of a gradient-based optimizer. With the ground distance $c$ being the Euclidean distance $L_2$, the Kantorovich-Rubinstein duality (Villani, 2009) gives

$$W\!\left(P_{G(S)}, P_T\right) = \frac{1}{K} \sup_{\|f\|_L \le K} \mathbb{E}_{y \sim P_T} [f(y)] - \mathbb{E}_{y \sim P_{G(S)}} [f(y)], \quad (5)$$

where the supremum is over all $K$-Lipschitz functions $f$. As neural networks are universal function approximators (Hornik, 1991), we can attempt to approximate $f$ with a neural network, called the critic $D$, with weight clipping to ensure the function family is $K$-Lipschitz. Therefore the objective of the critic is

$$\max_D \; \mathbb{E}_{y \sim P_T} [f_D(y)] - \mathbb{E}_{x \sim P_S} [f_D(Gx)]. \quad (6)$$

Conceptually, the critic $D$ assigns scores $f_D$ to real target embeddings and fake ones generated by the generator $G$. When the objective (6) is trained until optimality, the difference of the scores will approximate the Wasserstein distance up to a multiplicative constant. The generator $G$ then aims to minimize the approximate distance, which leads to

$$\min_{G \in \mathbb{R}^{d \times d}} -\mathbb{E}_{x \sim P_S} [f_D(Gx)]. \quad (7)$$
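A minimal PyTorch sketch of the training loop implied by objectives (6) and (7) follows. This is our own illustration, not the authors' released code: the toy random embeddings, the clipping threshold of 0.01, and the loop counts are assumptions, while the critic architecture, orthogonal initialization, minibatch size, and learning rates follow Appendix C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, V_s, V_t = 50, 3000, 3000
# Toy unit-normalized embeddings and uniform frequencies (assumptions for illustration).
src_emb = F.normalize(torch.randn(V_s, d), dim=1)
tgt_emb = F.normalize(torch.randn(V_t, d), dim=1)
src_freq = torch.ones(V_s) / V_s
tgt_freq = torch.ones(V_t) / V_t

G = nn.Linear(d, d, bias=False)                    # generator: a d x d matrix
torch.nn.init.orthogonal_(G.weight)                # initialized orthogonally (Appendix C.1)
D = nn.Sequential(nn.Linear(d, 500), nn.ReLU(), nn.Linear(500, 1))  # critic f_D

opt_G = torch.optim.RMSprop(G.parameters(), lr=0.05)
opt_D = torch.optim.RMSprop(D.parameters(), lr=0.0005)

def sample(emb, freq, n=1024):
    """Sample a minibatch of embeddings according to word frequencies."""
    idx = torch.multinomial(freq, n, replacement=True)
    return emb[idx]

num_steps, n_critic = 1000, 5                      # assumed loop counts for the sketch
for step in range(num_steps):
    # Critic: maximize E_{y~P_T}[f_D(y)] - E_{x~P_S}[f_D(Gx)]   (objective 6)
    for _ in range(n_critic):
        x, y = sample(src_emb, src_freq), sample(tgt_emb, tgt_freq)
        loss_D = -(D(y).mean() - D(G(x)).mean())
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
        for p in D.parameters():                   # weight clipping keeps f_D K-Lipschitz
            p.data.clamp_(-0.01, 0.01)
    # Generator: minimize -E_{x~P_S}[f_D(Gx)]      (objective 7)
    x = sample(src_emb, src_freq)
    loss_G = -D(G(x)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```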

3.2 EMD Minimization Under Orthogonal Transformation (EMDOT)

Alternative to minimizing the Wasserstein distance by duality, the primal program with the orthogonal constraint can be formalized as

$$\min_{G \in O(d)} \mathrm{EMD}\!\left(P_{G(S)}, P_T\right), \quad (8)$$

where $O(d)$ is the orthogonal group in dimension $d$. The exact solution to this minimization program is NP-hard (Ding and Xu, 2016). Fortunately, an alternating minimization procedure is guaranteed to converge to a local minimum (Cohen and Guibas, 1999). Starting from an initial matrix $G^{(0)}$, we alternate between the following subprograms repeatedly:

$$T^{(k)} = \arg\min_{T \in U(f^S, f^T)} \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T_{st} \, c\!\left(G^{(k)} w^S_s, w^T_t\right), \quad (9)$$

$$G^{(k+1)} = \arg\min_{G \in O(d)} \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \, c\!\left(G w^S_s, w^T_t\right). \quad (10)$$

The minimization in (9) is the EMD program (1), with existing solvers available. For better scalability, we choose an approximate solver (Cuturi, 2013).

The minimization in (10) aims to find the transformation $G^{(k+1)}$ with cross-lingual connection provided in $T^{(k)}$. This is exactly the supervised scenario, and previous works typically resort to gradient-based solvers (Mikolov et al., 2013a). But they can be cumbersome especially as we impose the orthogonal constraint on $G$. Fortunately, if we choose the ground distance $c$ to be the squared Euclidean distance $L_2^2$, the program (10) is an extension of the orthogonal Procrustes problem (Schonemann, 1966), which admits a closed-form solution:

$$G^{(k+1)} = U V^\top, \quad (11)$$

where $U$ and $V$ are obtained from a singular value decomposition (SVD):

$$\sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \, w^T_t \, w^{S\top}_s = U S V^\top. \quad (12)$$

Note that the SVD is efficient because it is performed on a $d \times d$ matrix, which is typically low-dimensional. Choosing $c = L_2^2$ is also motivated by its equivalence to the cosine dissimilarity, as proved in Appendix A.
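For illustration, a compact sketch of this alternating procedure (ours, not the authors' implementation): program (9) is approximated with a plain Sinkhorn iteration in the spirit of Cuturi (2013), whose regularization strength and iteration counts are assumptions, and program (10) is solved in closed form via (11)-(12).

```python
import numpy as np

def sinkhorn(f_s, f_t, C, reg=0.05, iters=200):
    """Entropy-regularized approximation of program (9) (Cuturi, 2013)."""
    K = np.exp(-C / reg)
    u = np.ones_like(f_s)
    for _ in range(iters):
        v = f_t / (K.T @ u)
        u = f_s / (K @ v)
    return u[:, None] * K * v[None, :]             # transport matrix T

def emdot(W_s, W_t, f_s, f_t, G0, iters=10):
    """Alternating minimization: Sinkhorn step (9) + Procrustes step (10)-(12)."""
    G = G0
    for _ in range(iters):
        X = W_s @ G.T                              # transformed source embeddings G w_s
        # squared Euclidean ground distance c = L2^2
        C = ((X[:, None, :] - W_t[None, :, :]) ** 2).sum(-1)
        T = sinkhorn(f_s, f_t, C)
        M = W_t.T @ T.T @ W_s                      # sum_{s,t} T_st w_t w_s^T  (d x d)
        U, _, Vt = np.linalg.svd(M)
        G = U @ Vt                                 # closed-form orthogonal update (11)
    return G, T
```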

3.3 Discussion

Starting from the idea of earth mover's distance minimization, we have developed two approaches towards the goal. They employ different optimization techniques, which in turn lead to different practical choices. For example, we choose $c = L_2^2$ for the EMDOT approach to obtain a closed-form solution to the subprogram (10), otherwise we would have to use gradient-based solvers. In contrast, the WGAN approach calls for $c = L_2$ because the Kantorovich-Rubinstein duality takes a simple form only in this case.

The EMDOT approach is attractive for several reasons: It is consistent for training and testing (the equivalence between the ground distance $c = L_2^2$ and cosine dissimilarity), compatible with the orthogonal constraint, mathematically sound (without much assumption and approximation), guaranteed to converge, almost hyperparameter free, and fast in speed (the alternating subprograms have either effective approximate solvers or closed-form solutions). However, it suffers from a serious limitation: The alternating minimization procedure only converges to local minima, and they often turn out to be rather poor in practice.

Although the WGAN approach employs a stochastic-gradient-based optimizer (RMSProp) and does not guarantee global optima either, it works reasonably well in practice. It seems better at exploring the parameter space and finally landing in a neighborhood of a good optimum. Like other success stories of using stochastic-gradient-based optimizers to train neural networks, theoretical understanding of the behavior remains elusive.

We can enjoy the best of both worlds by incorporating the merits of both approaches: First the WGAN approach locates a good neighborhood of the parameter space, and then, starting from a reasonable initialization, the EMDOT approach efficiently explores the neighborhood to achieve enhanced performance.

4 Experiments

We first investigate the learning behavior of our WGAN approach, and then present experiments on the bilingual lexicon induction task, followed by a showcase of the earth mover's distance as a language distance measure. Details of the data sets and hyperparameters are described in Appendices B and C.

4.1 Learning Behavior of WGAN

We analyze the learning behavior of WGAN by looking at a typical training trajectory on Chinese-English. During training, we save 100 models, translate based on the nearest neighbor, and record their accuracy as the bilingual lexicon induction performance indicator at these training checkpoints. In theory, the critic objective (6) provides an estimate of the Wasserstein distance up to a multiplicative constant, and a smaller Wasserstein distance should mean the transformed source embedding space and the target embedding space align better, which should in turn result in a better bilingual lexicon. This is validated in Figure 3 by the correlation between the Wasserstein estimate and accuracy. Therefore, the Wasserstein estimate can serve as an indicator for the bilingual lexicon induction performance, and we can save the model with the lowest value during training as the final model.

Figure 3: A typical training trajectory of WGAN. The three curves all correlate well. The Wasserstein estimate is rescaled because its magnitude is irrelevant.

In Figure 3, we also plot the value of $\|G^\top G - I\|_F$, which indicates the degree of orthogonality of the transformation matrix $G$. Interestingly, this also correlates nicely with the other curves, even though our WGAN formulation does not encourage $G$ towards orthogonality. This finding confirms that a good transformation matrix is indeed close to orthogonality, and empirically justifies the orthogonal constraint for the EMDOT formulation.
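Both monitoring quantities are cheap to compute at each checkpoint. A small sketch (ours; it reuses the G and D objects from the WGAN sketch above as assumptions):

```python
import torch

@torch.no_grad()
def checkpoint_metrics(G, D, src_batch, tgt_batch):
    """Wasserstein estimate (objective 6) and orthogonality deviation ||G^T G - I||_F."""
    w_estimate = D(tgt_batch).mean() - D(G(src_batch)).mean()
    W = G.weight
    ortho_dev = torch.norm(W.T @ W - torch.eye(W.shape[0]), p="fro")
    return w_estimate.item(), ortho_dev.item()

# Model selection: keep the checkpoint with the lowest Wasserstein estimate.
```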

Finally, we observe that the curves in Figure 3 are not very smooth. This means that although WGAN does well in exploring the parameter space and locating a reasonable transformation matrix, it cannot stably refine the transformation. Fortunately, this is where EMDOT thrives, and hence combining them enjoys the benefits of both approaches.

method   # seeds   zh-en   es-en   it-en   ja-zh   tr-en
TM       50         1.71    1.80    1.31    1.40    0.41
TM       100       17.27   24.93   23.22   22.91   15.02
TM       200       24.87   30.19   30.09   31.30   25.50
TM       500       28.24   32.11   31.69   35.79   32.63
IA       50        14.02   20.48   16.88   17.71    9.06
IA       100       22.14   28.73   25.99   28.24   18.37
IA       200       25.63   30.59   30.24   32.66   25.15
IA       500       27.21   31.94   31.54   35.33   31.50
WGAN     0         21.36   29.91   27.23   27.14    9.76
EMDOT    0         27.78   32.26   31.37   34.83   21.95

Table 1: F1 scores for bilingual lexicon induction on Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. The supervised methods TM and IA require seeds to train, and are listed for reference. Our EMDOT approach is initialized with the transformation found by WGAN, and consistently improves on it, reaching competitive performance with supervised methods.

4.2 Bilingual Lexicon Induction Performance

We test the quality of the cross-lingual transformation by evaluating on the bilingual lexicon induction task for five language pairs: Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English.

As the EMD automatically handles multiple alternative translations, we follow (Zhang et al., 2016b,a) and use the F1 score as the preferred evaluation metric.
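The paper does not spell the metric out, but for multiple alternative translations the natural choice is set-based precision, recall, and F1 over predicted versus gold translation sets; the following is our own sketch under that assumption:

```python
def lexicon_f1(predicted, gold):
    """predicted, gold: dict mapping each source word to a set of target translations."""
    tp = fp = fn = 0
    for word, gold_set in gold.items():
        pred_set = predicted.get(word, set())
        tp += len(pred_set & gold_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```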

Baselines

Our formulation is based on the isomorphism found across monolingual word embeddings. This idea has led to previous supervised methods:

• Translation matrix (TM) (Mikolov et al., 2013a): the pioneer of this type of methods, using a linear transformation. We use a publicly available implementation.¹

• Isometric alignment (IA) (Zhang et al., 2016c): an extension of TM by augmenting its learning objective with the isometric (orthogonal) constraint. Although Zhang et al. (2016c) had subsequent steps for their POS tagging task, it could be used for bilingual lexicon induction as well.

1 http://clic.cimec.unitn.it/~georgiana.dinu/down

Although they need seed word translation pairs to train and thus are not directly comparable to our system, we nonetheless report their results using {50, 100, 200, 500} seeds for a ballpark range of expected performance on this task, and skip the set of 500 seeds when testing all systems. We ensure the same input embeddings for these methods and ours. Their seeds are obtained through Google Translate (details in Appendix B.2). We apply the EMD as a postprocessing step (Zhang et al., 2016b) to allow them to handle multiple alternative translations. This is also done for our WGAN approach, as it does not produce the transport scheme to interpret as translation due to its duality formulation.
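For concreteness, here is a minimal sketch of the two supervised baselines given a seed dictionary. It is our own simplification, not the cited implementations: TM is reduced to an ordinary least-squares linear map, and IA to a Procrustes solution that enforces orthogonality exactly rather than as a penalty.

```python
import numpy as np

def translation_matrix(X_seed, Y_seed):
    """TM (simplified): least-squares map G minimizing ||X_seed G^T - Y_seed||_F^2."""
    # X_seed, Y_seed: (n_seeds, d) source/target embeddings of seed translation pairs.
    G_T, *_ = np.linalg.lstsq(X_seed, Y_seed, rcond=None)
    return G_T.T

def isometric_alignment(X_seed, Y_seed):
    """IA (simplified): orthogonal G via the Procrustes solution on the seed pairs."""
    U, _, Vt = np.linalg.svd(Y_seed.T @ X_seed)
    return U @ Vt
```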

Results

Table 1 shows the F1 scores on the five language pairs. As we can see, WGAN successfully finds a transformation that produces reasonable word translations. On top of that, EMDOT considerably improves the performance, which indicates that EMDOT refines the transformation found by WGAN.

Similar behavior across language pairs proves the generality of our approaches, as they build on embeddings learned from monolingual corpora without language-specific engineering. The quality of the embeddings, thus, will have an important effect on the performance, which may explain the lower scores on Turkish-English, as this low-resource setting may lack sufficient data to produce reliable embeddings. Higher noise levels in the preprocessing and ground truth for this language pair (cf. the supplemental material), as well as the morphological richness of Turkish, may also be contributing factors to the relatively low scores.

                             zh-en   es-en   it-en   ja-zh   tr-en
EMD                          0.650   0.445   0.559   0.599   0.788
typology dissimilarity       0.467   0.342   0.259   0.433   0.541
geographical distance (km)   8161    1246    1464    2095    2854

Table 2: The earth mover's distance (EMD), typology dissimilarity, and geographical distance for Chinese-English, Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. The EMD shows correlation with both factors of linguistic difference.

Concerning the supervised methods TM and IA, they attain better performance with more supervision from seeds, as expected. For TM in particular, hundreds of seeds are needed for generalization, in line with the finding in (Vulic and Korhonen, 2016). Below that threshold, its performance drops dramatically, and this is when IA fares better with the orthogonal constraint. This indicates the importance of orthogonality when the seeds are few, or even zero as faced by our system. As the number of seeds increases, the performance of the supervised methods converges to a level comparable to our system.

4.3 The EMD as Language Distance

As our system minimizes the earth mover's distance between embeddings of two languages, we show here that the final EMD can indicate the degree of difference between languages, serving as a proxy for language distance. Table 2 lists the EMD for the five language pairs considered in this paper, as well as their typology dissimilarity and geographical distance. The typology dissimilarity is computed from features in the WALS database (Dryer and Haspelmath, 2013). It is defined as one minus relative Hamming similarity, which is in turn defined as the number of agreeing features divided by the number of total features available for the language pair (Albu, 2006; Cysouw, 2013b). As a rough approximation, the geographical distance is measured by the distance between the capital cities of the countries where the considered languages are spoken (Eger et al., 2016).
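The typology dissimilarity reduces to a Hamming-style computation over the WALS features shared by a language pair. A sketch (ours; the feature values below are illustrative placeholders, not actual WALS entries):

```python
def typology_dissimilarity(feats_a, feats_b):
    """1 - relative Hamming similarity over WALS features available for both languages."""
    shared = set(feats_a) & set(feats_b)
    if not shared:
        return float("nan")
    agreeing = sum(feats_a[f] == feats_b[f] for f in shared)
    return 1.0 - agreeing / len(shared)

# hypothetical WALS-style feature values (feature id -> value)
english = {"81A": "SVO", "85A": "Prepositions", "97A": "VO and AdjN"}
turkish = {"81A": "SOV", "85A": "Postpositions", "97A": "OV and AdjN"}
print(typology_dissimilarity(english, turkish))   # 1.0 under these toy values
```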

The typology dissimilarity reflects genealogical influence on the divergence between languages, while the geographical distance indicates the effect of language contact. Both play important roles in shaping the languages we perceive today, and they also correlate with each other (Cysouw, 2013a). As we analyze Table 2, we find the EMD may be explained by both factors. Spanish-English and Italian-English are close both genealogically and geographically, and their EMD values are the lowest. English, Chinese, and Japanese belong to different language families, but the geographical proximity of the latter two enables intensive language contact, especially for the vocabularies, causing relatively smaller EMD. Finally, Turkish and English are distant in both aspects, and the EMD between them is large. Note that, however, the large EMD may also be caused by the relatively poor quality of monolingual embeddings due to low resource, and this should be a caveat of using the EMD to measure language distance.

5 Related Work

5.1 Bilingual Lexicon Induction

Bilingual lexicon induction is a long-standing research task in cross-lingual natural language processing. Traditional methods build statistical models for monolingual word co-occurrence, and combine cross-lingual supervision to solve the task. As word alignment for parallel sentences can produce fairly good bilingual lexica (Och and Ney, 2003), these methods focus on non-parallel data with a seed lexicon as cross-lingual supervision (Rapp, 1999; Gaussier et al., 2004).

An exception that does not rely on cross-lingual supervision is the decipherment approach (Dou and Knight, 2012, 2013; Dou et al., 2015). It views the source language as a cipher for the target language, and solves a statistical model that attempts to decipher the source language.

Following the popularity of monolingual word embeddings, cross-lingual word representation learning has also attracted significant attention in recent years. Building bilingual lexica from the learned cross-lingual embeddings is often considered an evaluative tool. Most methods rely on supervision encoded in parallel data, at the document level (Vulic and Moens, 2015), the sentence level (Zou et al., 2013; Chandar A P et al., 2014; Hermann and Blunsom, 2014; Kocisky et al., 2014; Gouws et al., 2015; Luong et al., 2015; Coulmance et al., 2015; Oshikiri et al., 2016), or the word level (i.e. in the form of a seed lexicon) (Gouws and Søgaard, 2015; Wick et al., 2016; Duong et al., 2016; Shi et al., 2015; Mikolov et al., 2013a; Faruqui and Dyer, 2014; Lu et al., 2015; Dinu et al., 2015; Lazaridou et al., 2015; Ammar et al., 2016; Zhang et al., 2016a, 2017; Smith et al., 2017).

There is a recent work that aims to remove the need for cross-lingual supervision (Cao et al., 2016). Similar to ours, the underlying idea is to match cross-lingually at the level of distribution rather than word. However, the distributions considered in that work are the hidden states of neural embedding models during the course of training. They are assumed to be Gaussian, so that the matching of distributions reduces to matching their means and variances, but this assumption is hard to justify and interpret. In contrast, our proposal does not make any assumption on the distributions, and directly matches the transformed source embedding distribution with the target distribution by minimizing their earth mover's distance.

Another attempt to learn cross-lingual embedding transformations without supervision is (Barone, 2016). Architectures of generative adversarial nets and adversarial autoencoders (Makhzani et al., 2015) are experimented with, but the reported results are not positive. We tried the publicly available code on our data and obtained negative results as well. This outcome is likely caused by the training difficulty pointed out by (Arjovsky and Bottou, 2017), as traditional GAN training minimizes the Jensen-Shannon divergence between distributions, which can provide pathological gradients to the generator and hamper its learning. The use of the Wasserstein GAN addresses this problem and allows our simple architecture to be trained successfully.

5.2 Language Distance

Quantifying language difference is an open question with on-going efforts that put forward better measures based on manually compiled data (Albu, 2006; Hammarstrom and O'Connor, 2013). Researchers in computational linguistics also try to contribute corpus-based approaches to this question. Parallel data is typically exploited, and ideas range from information-theoretic (Juola, 1998) and statistical (Mayer and Cysouw, 2012) to graph-based (Eger et al., 2016; Asgari and Mofrad, 2016). To our knowledge, the earth mover's distance is proposed for language distance for the first time, with the distinctive feature of relying on non-parallel data only.

5.3 The Earth Mover’s Distance

First introduced into computer vision (Rubner et al., 1998), the earth mover's distance also finds application in natural language processing (Kusner et al., 2015; Huang et al., 2016), including bilingual lexicon induction (Zhang et al., 2016b,a). Zhang et al. (2016b) build upon bilingual word embeddings and apply the EMD program as a postprocessing step to automatically produce multiple alternative translations. Later, Zhang et al. (2016a) introduce the EMD into the training objective of bilingual word embeddings as a regularizer. These previous works rely on cross-lingual supervision, and do not approach the task from the view of embedding transformation, while our work formulates the task as EMD minimization to allow zero supervision.

Apart from the usage as a regularizer (Zhang et al., 2016a), the EMD can also play other roles in optimization programs designed for various applications (Cuturi and Doucet, 2014; Frogner et al., 2015; Montavon et al., 2016).

6 Conclusion and Future Work

In this work, we attack the problem of finding a cross-lingual transformation between monolingual word embeddings in a purely unsupervised setting. We introduce earth mover's distance minimization to tackle this task by exploiting its distribution-level matching to sidestep the requirement for word-level cross-lingual supervision. Even though zero supervision poses a clear challenge, our system attains competitive performance with supervised methods for bilingual lexicon induction. In addition, the earth mover's distance provides a natural measure that may prove helpful for quantifying language difference.

We have implemented the earth mover's distance minimization framework from two paths, and their combination has worked well, but both can be potentially improved by recent advances in optimization techniques (Gulrajani et al., 2017; Ding and Xu, 2016). Future work should also evaluate the earth mover's distance between more languages to assess its quality as language distance.

A Proof

The following proof shows that using the squared Euclidean distance as the ground distance ($c = L_2^2$) is equivalent to using cosine dissimilarity when minimizing Equation (10).

$$\begin{aligned}
& \min_{G \in O(d)} \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \left\| G w^S_s - w^T_t \right\|^2 \\
&= \min_{G \in O(d)} \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \left( \left\| w^S_s \right\|^2 + \left\| w^T_t \right\|^2 - 2\, w^{T\top}_t G w^S_s \right) \\
&= \min_{G \in O(d)} -2 \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \cos\!\left( G w^S_s, w^T_t \right) + \mathrm{const} \\
&= \min_{G \in O(d)} - \sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T^{(k)}_{st} \cos\!\left( G w^S_s, w^T_t \right). \quad (13)
\end{aligned}$$

The second step uses $\|G w^S_s\| = \|w^S_s\|$ for orthogonal $G$ and the unit-norm assumption on the embeddings, so that $w^{T\top}_t G w^S_s = \cos(G w^S_s, w^T_t)$ and the norm terms are constants.
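A quick numerical sanity check of this equivalence (our own sketch): with unit-norm embeddings and a transport matrix summing to one, the weighted squared-Euclidean objective equals a constant plus twice the negative weighted cosine, so both rank orthogonal transformations identically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_s, n_t = 10, 5, 6
W_s = rng.normal(size=(n_s, d)); W_s /= np.linalg.norm(W_s, axis=1, keepdims=True)
W_t = rng.normal(size=(n_t, d)); W_t /= np.linalg.norm(W_t, axis=1, keepdims=True)
T = rng.random((n_s, n_t)); T /= T.sum()           # a valid transport matrix surrogate

def objectives(G):
    X = W_s @ G.T                                  # rows are (G w_s)^T, still unit norm
    sq = (T * ((X[:, None, :] - W_t[None, :, :]) ** 2).sum(-1)).sum()
    neg_cos = -(T * (X @ W_t.T)).sum()
    return sq, neg_cos

for _ in range(3):
    G, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal candidate
    sq, neg_cos = objectives(G)
    # sq = 2 + 2 * neg_cos, so both objectives order transformations identically
    assert np.isclose(sq, 2.0 + 2.0 * neg_cos)
```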

B Data Preparation

B.1 Non-Parallel Corpora for Training Embeddings

The data for training monolingual word embeddings comes from Wikipedia comparable corpora.² Following (Vulic and Moens, 2013), we retain only nouns with at least 1,000 occurrences, except for Turkish-English, whose frequency cutoff threshold is 100, as the amount of data is relatively small in this low-resource setting. For the Chinese side, we first use OpenCC³ to normalize characters to be simplified, and then perform Chinese word segmentation and POS tagging with THULAC.⁴ The preprocessing of the English side involves tokenization, POS tagging, lemmatization, and lowercasing, which we carry out with the NLTK toolkit⁵ for the Chinese-English pair. For Spanish-English and Italian-English, we choose to use TreeTagger⁶ for preprocessing, as in (Vulic and Moens, 2013). For the Japanese corpus, we use MeCab⁷ for word segmentation and POS tagging. For Turkish, we utilize the preprocessing tools (tokenization and POS tagging) provided in LORELEI Language Packs (Strassel and Tracey, 2016), and its English side is preprocessed by NLTK. The statistics of the preprocessed corpora are given in Table 3.

2 http://linguatools.org/tools/corpora/wikipedia-comparable-corpora
3 https://github.com/BYVoid/OpenCC
4 http://thulac.thunlp.org
5 http://www.nltk.org
6 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger

pair    lang   # tokens   vocabulary size
zh-en   zh     21m        3,349
        en     53m        5,154
es-en   es     61m        4,774
        en     95m        6,637
it-en   it     73m        8,490
        en     93m        6,597
ja-zh   ja     38m        6,043
        zh     16m        2,814
tr-en   tr     6m         7,482
        en     28m        13,220

Table 3: Statistics of the non-parallel corpora for training monolingual word embeddings. Language codes: zh = Chinese, en = English, es = Spanish, it = Italian, ja = Japanese, tr = Turkish.

B.2 Seed Word Translation Pairs

The seed word translation pairs for the translation matrix (TM) and isometric alignment (IA) approaches are obtained as follows. First, we ask Google Translate⁸ to translate the source language vocabulary. Then the target translations are queried again and translated back to the source language, and those that do not match the original source words are discarded. This helps to ensure the translation quality. Finally, the translations are discarded if they fall out of our target language vocabulary.

B.3 Ground Truth

As the ground truth bilingual lexicon for evaluation, we use Chinese-English Translation Lexicon Version 3.0 (LDC2002L27) for the Chinese-English pair. For Spanish-English and Italian-English, we access Open Multilingual WordNet⁹ through NLTK. For Japanese-Chinese, we use an in-house lexicon. For Turkish-English, we build a set of ground truth translation pairs in the same way as how we obtain seed word translation pairs from Google Translate, described above.

7 http://taku910.github.io/mecab
8 https://translate.google.com
9 http://compling.hss.ntu.edu.sg/omw

C Hyperparameters

C.1 WGAN

We parametrize the critic D as a feed-forward neural network with one hidden layer of 500 neurons. The generator G is initialized with a random orthogonal matrix. The expectations in the critic and generator objectives (6)(7) are approximated by minibatches of 1024 samples. We train for 10⁷ minibatches. Most other hyperparameters follow from (Arjovsky et al., 2017) except the learning rates, for which larger values of 0.05 and 0.0005 are used for the generator and the critic respectively for faster convergence.

C.2 EMDOT

The approximate EMD solver (Cuturi, 2013) gives a fairly accurate approximation with orders of magnitude speedup. However, it makes the transport matrix $T$ no longer sparse. This is problematic, as we rely on interpreting a non-zero $T_{st}$ as evidence to translate the $s$-th source word to the $t$-th target word (Zhang et al., 2016b). We therefore retain the largest $pV^S$ elements of $T$, where $p$ encodes our belief of the expected number of translations a source word can have. We set $p = 1.3$.

The alternating minimization procedure converges very fast. We run 10 iterations.
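Since the approximate solver returns a dense transport matrix, the sparsification described above amounts to keeping the top $pV^S$ entries; a small sketch (ours):

```python
import numpy as np

def sparsify_transport(T, p=1.3):
    """Keep only the largest p * V_S entries of the dense transport matrix T."""
    V_S = T.shape[0]
    k = int(p * V_S)
    threshold = np.sort(T, axis=None)[-k]          # k-th largest value overall
    T_sparse = np.where(T >= threshold, T, 0.0)
    return T_sparse                                # non-zero T_st read off as translations
```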

C.3 Monolingual Word Embeddings

As input monolingual word embeddings to the tested systems, we train the CBOW model (Mikolov et al., 2013b) with default hyperparameters in word2vec.¹⁰ The embedding dimension d is 50.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. This work is supported by the National Natural Science Foundation of China (No. 61522204), the 973 Program (2014CB340501), and the National Natural Science Foundation of China (No. 61331013). This research is also supported by the Singapore National Research Foundation under its International Research Centre@Singapore Funding Initiative and administered by the IDM Programme.

10 https://code.google.com/archive/p/word2vec

References

Mihai Albu. 2006. Quantitative analyses of typological data. Ph.D. thesis, Univ. Leipzig.
Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively Multilingual Word Embeddings. arXiv:1602.01925 [cs].
Martin Arjovsky and Leon Bottou. 2017. Towards Principled Methods For Training Generative Adversarial Networks. In ICLR.
Martin Arjovsky, Soumith Chintala, and Leon Bottou. 2017. Wasserstein GAN. arXiv:1701.07875 [cs, stat].
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In EMNLP.
Ehsaneddin Asgari and Mohammad R. K. Mofrad. 2016. Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP.
Antonio Valerio Miceli Barone. 2016. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP.
Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A Distribution-based Model to Learn Bilingual Word Embeddings. In COLING.
Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C Raykar, and Amrita Saha. 2014. An Autoencoder Approach to Learning Bilingual Word Representations. In NIPS.
Scott Cohen and Leonidas Guibas. 1999. The Earth Mover's Distance Under Transformation Sets. In ICCV.
Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, Fast Cross-lingual Word-embeddings. In EMNLP.
Marco Cuturi. 2013. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In NIPS.
Marco Cuturi and Arnaud Doucet. 2014. Fast Computation of Wasserstein Barycenters. In ICML.
Michael Cysouw. 2013a. Disentangling geography from genealogy. Space in language and linguistics: Geographical, interactional, and cognitive perspectives.
Michael Cysouw. 2013b. Predicting language learning difficulty. Approaches to measuring linguistic differences.
Hu Ding and Jinhui Xu. 2016. FPTAS for Minimizing the Earth Mover's Distance Under Rigid Transformations and Related Problems. Algorithmica.
Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving Zero-Shot Learning by Mitigating the Hubness Problem. In ICLR Workshop.
Qing Dou and Kevin Knight. 2012. Large scale decipherment for out-of-domain machine translation. In EMNLP-CoNLL.
Qing Dou and Kevin Knight. 2013. Dependency-Based Decipherment for Resource-Limited Machine Translation. In EMNLP.
Qing Dou, Ashish Vaswani, Kevin Knight, and Chris Dyer. 2015. Unifying Bayesian Inference and Vector Space Models for Improved Decipherment. In ACL-IJCNLP.
Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2016. Learning Crosslingual Word Embeddings without Bilingual Corpora. In EMNLP.
Steffen Eger, Armin Hoenen, and Alexander Mehler. 2016. Language classification from bilingual word embedding graphs. In COLING.
Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. In EACL.
Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. 2015. Learning with a Wasserstein Loss. In NIPS.
Eric Gaussier, J.M. Renders, I. Matveeva, C. Goutte, and H. Dejean. 2004. A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora. In ACL.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NIPS.
Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. BilBOWA: Fast Bilingual Distributed Representations without Word Alignments. In ICML.
Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In NAACL-HLT.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved Training of Wasserstein GANs. arXiv:1704.00028 [cs, stat].
Harald Hammarstrom and Loretta O'Connor. 2013. Dependency-sensitive typological distance. Approaches to measuring linguistic differences.
Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual Distributed Representations without Word Alignment. In ICLR.
Kurt Hornik. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks.
Gao Huang, Chuan Guo, Matt J Kusner, Yu Sun, Fei Sha, and Kilian Q Weinberger. 2016. Supervised Word Mover's Distance. In NIPS.
Patrick Juola. 1998. Cross-Entropy and Linguistic Typology. In CoNLL.
Tomas Kocisky, Karl Moritz Hermann, and Phil Blunsom. 2014. Learning Bilingual Word Representations by Marginalizing Alignments. In ACL.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From Word Embeddings To Document Distances. In ICML.
Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and Pollution: Delving into Cross-Space Mapping for Zero-Shot Learning. In ACL-IJCNLP.
Ang Lu, Weiran Wang, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Deep Multilingual Correlation for Improved Word Embeddings. In NAACL-HLT.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual Word Representations with Monolingual Quality in Mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial Autoencoders. arXiv:1511.05644 [cs].
Thomas Mayer and Michael Cysouw. 2012. Language comparison through sparse multilingual word alignment. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH.
Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. 2016. Unrolled Generative Adversarial Networks. arXiv:1611.02163 [cs, stat].
Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168 [cs].
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.
Gregoire Montavon, Klaus-Robert Muller, and Marco Cuturi. 2016. Wasserstein Training of Restricted Boltzmann Machines. In NIPS.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv:1606.00709 [cs, stat].
Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. CL.
Takamasa Oshikiri, Kazuki Fukui, and Hidetoshi Shimodaira. 2016. Cross-Lingual Word Representations via Spectral Graph Embeddings. In ACL.
Ben Poole, Alexander A. Alemi, Jascha Sohl-Dickstein, and Anelia Angelova. 2016. Improved generator objectives for GANs. arXiv:1612.02780 [cs, stat].
Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 [cs].
Reinhard Rapp. 1999. Automatic Identification of Word Translations from Unrelated English and German Corpora. In ACL.
Y. Rubner, C. Tomasi, and L.J. Guibas. 1998. A Metric for Distributions with Applications to Image Databases. In ICCV.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved Techniques for Training GANs. In NIPS.
Peter H. Schonemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika.
Tianze Shi, Zhiyuan Liu, Yang Liu, and Maosong Sun. 2015. Learning Cross-lingual Word Embeddings via Matrix Co-factorization. In ACL-IJCNLP.
Samuel Smith, David Turban, Steven Hamblin, and Nils Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In ICLR.
Stephanie Strassel and Jennifer Tracey. 2016. LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages. In LREC.
Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and Type Constraints for Cross-Lingual Part-of-Speech Tagging. TACL.
Cedric Villani. 2009. Optimal Transport: Old and New.
Ivan Vulic and Anna Korhonen. 2016. On the Role of Seed Lexicons in Learning Bilingual Word Embeddings. In ACL.
Ivan Vulic and Marie-Francine Moens. 2013. Cross-Lingual Semantic Similarity of Words as the Similarity of Their Semantic Word Responses. In NAACL-HLT.
Ivan Vulic and Marie-Francine Moens. 2015. Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon Induction. In ACL-IJCNLP.
Michael Wick, Pallika Kanani, and Adam Pocock. 2016. Minimally-Constrained Multilingual Embeddings via Artificial Code-Switching. In AAAI.
Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In NAACL-HLT.
Hyejin Youn, Logan Sutton, Eric Smith, Cristopher Moore, Jon F. Wilkins, Ian Maddieson, William Croft, and Tanmoy Bhattacharya. 2016. On the universal structure of human lexical semantics. Proceedings of the National Academy of Sciences.
Meng Zhang, Yang Liu, Huanbo Luan, Yiqun Liu, and Maosong Sun. 2016a. Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization. In COLING.
Meng Zhang, Yang Liu, Huanbo Luan, Maosong Sun, Tatsuya Izuha, and Jie Hao. 2016b. Building Earth Mover's Distance on Bilingual Word Embeddings for Machine Translation. In AAAI.
Meng Zhang, Haoruo Peng, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Bilingual Lexicon Induction From Non-Parallel Data With Minimal Supervision. In AAAI.
Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi Jaakkola. 2016c. Ten Pairs to Tag – Multilingual POS Tagging via Coarse Mapping between Embeddings. In NAACL-HLT.
Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In EMNLP.