

Balancing Training for Multilingual Neural Machine Translation

Xinyi Wang Yulia Tsvetkov Graham Neubig
Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213

{xinyiw1,ytsvetko,gneubig}@cs.cmu.edu

Abstract

When training multilingual machine translation (MT) models that can translate to/from multiple languages, we are faced with imbalanced training sets: some languages have much more training data than others. Standard practice is to up-sample less resourced languages to increase representation, and the degree of up-sampling has a large effect on the overall performance. In this paper, we propose a method that instead automatically learns how to weight training data through a data scorer that is optimized to maximize performance on all test languages. Experiments on two sets of languages under both one-to-many and many-to-one MT settings show our method not only consistently outperforms heuristic baselines in terms of average performance, but also offers flexible control over the performance of which languages are optimized.1

1 Introduction

Multilingual models are trained to process different languages in a single model, and have been applied to a wide variety of NLP tasks such as text classification (Klementiev et al., 2012; Chen et al., 2018a), syntactic analysis (Plank et al., 2016; Ammar et al., 2016), named-entity recognition (Xie et al., 2018; Wu and Dredze, 2019), and machine translation (MT) (Dong et al., 2015; Johnson et al., 2016). These models have two particularly concrete advantages over their monolingual counterparts. First, deploying a single multilingual model is much more resource efficient than deploying one model for each language under consideration (Arivazhagan et al., 2019; Aharoni et al., 2019). Second, multilingual training makes it possible to transfer knowledge from high-resource languages (HRLs) to improve performance on low-resource languages (LRLs) (Zoph et al., 2016; Nguyen and Chiang, 2018; Neubig and Hu, 2018; Wang and Neubig, 2019; Aharoni et al., 2019).

1 The code is available at https://github.com/cindyxinyiwang/multiDDS/.

A common problem with multilingual training is that the data from different languages are both heterogeneous (different languages may exhibit very different properties) and imbalanced (there may be wildly varying amounts of training data for each language). Thus, while LRLs will often benefit from transfer from other languages, for languages where sufficient monolingual data exists, performance will often decrease due to interference from the heterogeneous nature of the data. This is especially the case for modestly-sized models that are conducive to efficient deployment (Arivazhagan et al., 2019; Conneau et al., 2019).

To balance the performance on different languages, the standard practice is to heuristically adjust the distribution of data used in training, specifically by over-sampling the training data from LRLs (Johnson et al., 2016; Neubig and Hu, 2018; Arivazhagan et al., 2019; Conneau et al., 2019). For example, Arivazhagan et al. (2019) sample training data from different languages based on the dataset size scaled by a heuristically tuned temperature term. However, such heuristics are far from perfect. First, Arivazhagan et al. (2019) find that the exact value of this temperature term significantly affects results, and we further show in experiments that the ideal temperature varies significantly from one experimental setting to another. Second, this heuristic ignores factors other than data size that affect the interaction between different languages, despite the fact that language similarity has been empirically proven important in examinations of cross-lingual transfer learning (Wang and Neubig, 2019; Lin et al., 2019).

In this paper, we ask the question: "is it possible to learn an optimal strategy to automatically balance the usage of data in multilingual model training?" To this effect, we propose a method that learns a language scorer that can be used throughout training to improve the model performance on all languages. Our method is based on the recently proposed approach of Differentiable Data Selection (Wang et al., 2019b, DDS), a general machine learning method for optimizing the weighting of different training examples to improve a pre-determined objective. In this work, we take this objective to be the average loss from different languages, and directly optimize the weights of training data from each language to maximize this objective on a multilingual development set. This formulation has no heuristic temperatures, and enables the language scorer to consider the interaction between languages.

Based on this formulation, we propose an algorithm that improves the ability of DDS to optimize multiple model objectives, which we name MultiDDS. This is particularly useful in the case where we want to optimize performance on multiple languages simultaneously. Specifically, MultiDDS (1) has a more flexible scorer parameterization, (2) is memory efficient when training on multiple languages, and (3) stabilizes the reward signal so that it improves all objectives simultaneously instead of being overwhelmed by a single objective.

While the proposed methods are model-agnostic and thus potentially applicable to a wide variety of tasks, we specifically test them on the problem of training multilingual NMT systems that can translate many languages in a single model. We perform experiments on two sets of languages (one with more similarity between the languages, one with less) and two translation directions (one-to-many and many-to-one, where the "one" is English). Results show that MultiDDS consistently outperforms various baselines in all settings. Moreover, we demonstrate MultiDDS provides a flexible framework that allows the user to define a variety of optimization objectives for multilingual models.

2 Multilingual Training Preliminaries

Monolingual Training Objective A standard NMT model is trained to translate from a single source language S to a target language T. The parameters of the model are generally trained by preparing a training dataset D_train, and defining the empirical distribution of sentence pairs 〈x, y〉 sampled from D_train as P. We then minimize the empirical risk J(θ, D_train), which is the expected value of the loss function ℓ(x, y; θ) over this distribution:

$\theta^* = \arg\min_\theta J(\theta, D_{\text{train}})$, where $J(\theta, D_{\text{train}}) = \mathbb{E}_{x,y \sim P(X,Y)}[\ell(x, y; \theta)]$   (1)

Multilingual Training Formulation A multilingual NMT model can translate n pairs of languages {S^1-T^1, S^2-T^2, ..., S^n-T^n}, from any source language S^i to its corresponding target T^i. To train such a multilingual model, we have access to n sets of training data D_train = {D^1_train, D^2_train, ..., D^n_train}, where D^i_train is the training data for language pair S^i-T^i. From these datasets, we can define P^i, the distribution of sentences from S^i-T^i, and consequently also define a risk J(θ, P^i) for each language following the monolingual objective in Eq. 1.

However, the question now becomes: "how do we define an overall training objective given these multiple separate datasets?" Several different methods to do so have been proposed in the past. To discuss all of these different methods in a unified framework, we further define a distribution P_D over the n sets of training data, and define our overall multilingual training objective as

$J_{\text{mult}}(\theta, P_D, D_{\text{train}}) = \mathbb{E}_{i \sim P_D(i;\psi)}[J(\theta, D^i_{\text{train}})]$.   (2)

In practice, this overall objective can be approximated by selecting a language according to i ∼ P_D(i), then calculating gradients with respect to θ on a batch of data from D^i_train.
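As a concrete illustration of this sampling procedure, here is a minimal Python sketch (our own, not from the released code); `train_sets` is assumed to be a list of per-language lists of (source, target) sentence pairs, and `p_d` is the sampling distribution P_D.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_language_then_batch(train_sets, p_d, batch_size=4):
    """Stochastic approximation of Eq. 2: draw a language index i ~ P_D(i),
    then draw a minibatch of sentence pairs from that language's training set."""
    i = rng.choice(len(train_sets), p=p_d)
    idx = rng.choice(len(train_sets[i]),
                     size=min(batch_size, len(train_sets[i])), replace=False)
    return i, [train_sets[i][j] for j in idx]
```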

Evaluation Methods Another important question is how to evaluate the performance of such multilingual models. During training, it is common to use a separate development set for each language, D_dev = {D^1_dev, D^2_dev, ..., D^n_dev}, to select the best model. Given that the objective of multilingual training is generally to optimize the performance on all languages simultaneously (Arivazhagan et al., 2019; Conneau et al., 2019), we can formalize this objective as minimizing the average of dev risks2:

$J_{\text{dev}}(\theta, D_{\text{dev}}) = \frac{1}{n} \sum_{i=1}^{n} J(\theta, D^i_{\text{dev}})$.   (3)

2 In reality, it is common for the loss ℓ to be a likelihood-based objective, while another metric such as BLEU score is measured at test time; for simplicity we will assume that these two metrics are correlated.


Relation to Heuristic Strategies This formulation generalizes a variety of existing techniques that define P_D(i) using a heuristic strategy, and keep it fixed throughout training.

Uniform: The simplest strategy sets P_D(i) to a uniform distribution, sampling minibatches from each language with equal frequency (Johnson et al., 2016).

Proportional: It is also common to sample data in portions equivalent to the size of the corresponding corpora in each language (Johnson et al., 2016; Neubig and Hu, 2018).

Temperature-based: Finally, because both of the strategies above are extreme (proportional under-weighting LRLs, and uniform causing overfitting by re-sampling sentences from limited-size LRL datasets), it is common to sample according to data size exponentiated by a temperature term τ (Arivazhagan et al., 2019; Conneau et al., 2019):

$P_D(i) = \frac{q_i^{1/\tau}}{\sum_{k=1}^{n} q_k^{1/\tau}}$, where $q_i = \frac{|D^i_{\text{train}}|}{\sum_{k=1}^{n} |D^k_{\text{train}}|}$.   (4)

When τ = 1 or τ = ∞ this is equivalent to proportional or uniform sampling respectively, and when a number in the middle is chosen it becomes possible to balance between the two strategies.
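To make Eq. 4 concrete, the following minimal Python sketch (ours, not from the released code) computes the temperature-scaled sampling distribution for a list of dataset sizes; the example sizes are the Related-group training-set sizes from § A.3.

```python
import numpy as np

def temperature_sampling_probs(sizes, tau):
    """Heuristic dataset sampling distribution of Eq. 4.
    tau=1 recovers proportional sampling; tau=float('inf') recovers uniform."""
    q = np.asarray(sizes, dtype=float)
    q = q / q.sum()                    # q_i: proportional to data size
    p = q ** (1.0 / tau)               # exponentiate by 1/tau
    return p / p.sum()                 # renormalize to get P_D(i)

# Related-group training sizes (aze, bel, glg, slk, tur, rus, por, ces)
sizes = [5940, 4510, 10000, 61500, 182000, 208000, 185000, 103000]
print(temperature_sampling_probs(sizes, tau=1))   # proportional
print(temperature_sampling_probs(sizes, tau=5))   # smoothed toward uniform
```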

As noted in the introduction, these heuristic strategies have several drawbacks regarding sensitivity to the τ hyperparameter, and lack of consideration of similarity between the languages. In the following sections we will propose methods to resolve these issues.

3 Differentiable Data Selection

Now we turn to the question: is there a better way to optimize P_D(i) so that we can achieve our final objective of performing well on a representative development set over all languages, i.e. minimizing J_dev(θ, D_dev)? In order to do so, we turn to a recently proposed method of Differentiable Data Selection (Wang et al., 2019b, DDS), a general purpose machine learning method that allows for weighting of training data to improve performance on a separate set of held-out data.

Specifically, DDS uses a technique called bi-level optimization (Colson et al., 2007) that learns a second set of parameters ψ that modify the training objective that we use to learn θ, so as to maximize the final objective J_dev(θ, D_dev). Specifically, it proposes to learn a data scorer P(x, y; ψ), parameterized by ψ, such that training using data sampled from the scorer optimizes the model performance on the dev set. To take the example of learning an NMT system to translate a single language pair i using DDS, the general objective in Eq. 1 could be rewritten as

$\psi^* = \arg\min_\psi J(\theta^*(\psi), D^i_{\text{dev}})$, where $\theta^*(\psi) = \arg\min_\theta \mathbb{E}_{x,y \sim P(x,y;\psi)}[\ell(x, y; \theta)]$.   (5)

DDS optimizes θ and ψ iteratively throughout the training process. Given a fixed ψ, the update rule for θ is simply

$\theta_t \leftarrow \theta_{t-1} - \nabla_\theta \mathbb{E}_{x,y \sim P(x,y;\psi)}[\ell(x, y; \theta)]$

To update the data scorer, DDS uses reinforcement learning with a reward function that approximates the effect of the training data on the model's dev performance:

$R(x, y; \theta_t) \approx \nabla_\theta J(\theta_t, D^i_{\text{dev}})^\top \cdot \nabla_\theta \ell(x, y; \theta_{t-1}) \approx \cos\!\left(\nabla_\theta J(\theta_t, D^i_{\text{dev}}), \nabla_\theta \ell(x, y; \theta_{t-1})\right)$   (6)

where cos(·) is the cosine similarity of two vectors. This reward can be derived by directly differentiating J(θ(ψ), D^i_dev) with respect to ψ, but intuitively, it indicates that the data scorer should be updated to up-weigh the data points that have similar gradients with the dev data. According to the REINFORCE algorithm (Williams, 1992), the update rule for the data scorer then becomes

$\psi_{t+1} \leftarrow \psi_t + R(x, y; \theta_t) \cdot \nabla_\psi \log P(x, y; \psi)$   (7)
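The reward of Eq. 6 and the update of Eq. 7 can be sketched in a few lines of PyTorch. This is our own illustrative rendering, not the authors' implementation: gradients are assumed to be pre-flattened into single vectors, and the scorer here is a toy softmax over 8 logits with a hypothetical step size of 0.1.

```python
import torch
import torch.nn.functional as F

def dds_reward(dev_grad: torch.Tensor, train_grad: torch.Tensor) -> torch.Tensor:
    """Eq. 6: cosine similarity between the (flattened) dev-set gradient
    and the gradient of one training example."""
    return F.cosine_similarity(dev_grad.flatten(), train_grad.flatten(), dim=0)

# Toy gradients standing in for real model gradients.
dev_grad, train_grad = torch.randn(1000), torch.randn(1000)
reward = dds_reward(dev_grad, train_grad)

# Eq. 7: REINFORCE-style update of the scorer parameters psi, where
# log_prob = log P(x, y; psi) is computed from psi with autograd.
psi = torch.zeros(8, requires_grad=True)       # e.g. one logit per candidate
log_prob = torch.log_softmax(psi, dim=0)[0]    # log-probability of the sampled item
grad_psi = torch.autograd.grad(log_prob, psi)[0]
with torch.no_grad():
    psi += 0.1 * reward * grad_psi             # gradient ascent on the reward
```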

4 DDS for Multilingual Training

In this section, we use the previously described DDS method to derive a new framework that, instead of relying on fixed heuristics, adaptively optimizes usage of multilingual data for the best model performance on multiple languages. We illustrate the overall workflow in Fig. 1.

First, we note two desiderata for our multilingual training method: 1) generality: the method should be flexible enough so that it can be utilized universally for different multilingual tasks and settings (such as different translation directions for NMT); 2) scalability: the method should be stable and efficient if one wishes to scale up the number of languages that a multilingual model supports. Based on these two properties, we introduce MultiDDS, an extension of the DDS method tailored for multilingual training.

Method MultiDDS directly parameterizes the standard dataset sampling distribution for multilingual training with ψ:

$P_D(i; \psi) = \frac{e^{\psi_i}}{\sum_{k=1}^{n} e^{\psi_k}}$   (8)

and optimizes ψ to minimize the dev loss. Notably, unlike standard DDS we make the design decision to weight training datasets rather than score each training example 〈x, y〉 directly, as it is more efficient and also likely easier to learn.
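Because the scorer only weights whole datasets, it reduces to one scalar logit per language. Below is a minimal PyTorch sketch of Eq. 8 (ours; the function names are illustrative), with ψ initialized so that P_D starts out proportional to dataset size, as in line 1 of Alg. 1.

```python
import torch

def scorer_probs(psi: torch.Tensor) -> torch.Tensor:
    """Eq. 8: the dataset-level scorer is a softmax over one logit per language."""
    return torch.softmax(psi, dim=0)

def init_psi_proportional(sizes) -> torch.Tensor:
    """Initialize psi so that P_D(i; psi) is proportional to dataset size."""
    q = torch.tensor(sizes, dtype=torch.float)
    q = q / q.sum()
    return torch.log(q)            # softmax(log q) recovers q exactly

psi = init_psi_proportional([5940, 4510, 10000, 61500, 182000, 208000, 185000, 103000])
print(scorer_probs(psi))           # starts proportional; psi is then learned by MultiDDS
```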

We can thus rewrite the objective in Eq. 2 to incorporate both ψ and θ as:

$\psi^* = \arg\min_\psi J_{\text{dev}}(\theta^*(\psi), D_{\text{dev}})$, where $\theta^* = \arg\min_\theta \mathbb{E}_{i \sim P_D(i;\psi)}[J(\theta, D^i_{\text{train}})]$   (9)

In other words, while the general DDS framework evaluates the model performance on a single dev set and optimizes the weighting of each training example, our multilingual training objective evaluates the performance over an aggregation of n dev sets and optimizes the weighting of n training sets.

The reward signal for updating ψ_t is

$R(i; \theta_t) \approx \cos\!\left(\nabla_\theta J_{\text{dev}}(\theta_t, D_{\text{dev}}),\, \nabla_\theta J(\theta_{t-1}, D^i_{\text{train}})\right) = \cos\!\left(\nabla_\theta \Big(\frac{1}{n} \sum_{k=1}^{n} J(\theta_t, D^k_{\text{dev}})\Big),\, \nabla_\theta J(\theta_{t-1}, D^i_{\text{train}})\right),$   (10)

where J_dev(·) defines the combination of the n dev sets, and we simply plug in its definition from Eq. 3. Intuitively, Eq. 10 implies that we should favor the training language i if its gradient aligns with the gradient of the aggregated dev risk of all languages.

Implementing the Scorer Update The pseudo-code for the training algorithm using MultiDDS can be found in Alg. 1. Notably, we do not update the data scorer ψ on every training step, because it is too computationally expensive for NMT training (Wang et al., 2019b). Instead, after training the multilingual model θ for a certain number of steps, we update the scorer for all languages. This implementation is not only efficient, but also allows us to re-estimate more frequently the effect of languages that have low probability of being sampled.

Figure 1: An illustration of the MultiDDS algorithm. Solid lines represent updates for θ, and dashed lines represent updates for ψ. The scorer defines the distribution over n training languages, from which training data is sampled to train the model. The scorer is updated to favor the datasets with similar gradients as the gradient of the aggregated dev sets.

In order to do so, it is necessary to calculate the effect of each training language on the current model, namely R(i; θ_t). We estimate this value by sampling a batch of data from each D^i_train to get the training gradient for θ_t, and use this to calculate the reward for this language. This process is detailed in lines 12 to 21 of Alg. 1.

Unlike the algorithm in DDS which requires storing n model gradients,3 this approximation does not require extra memory even if n is large, which is important given recent efforts to scale multilingual training to 100+ (Arivazhagan et al., 2019; Aharoni et al., 2019) or even 1000+ languages (Ostling and Tiedemann, 2017; Malaviya et al., 2017).

5 Stabilized Multi-objective Training

In our initial attempts to scale DDS to highly multilingual training, we found that one challenge was that the reward for updating the scorer became unstable. This is because the gradient of a multilingual dev set is less consistent and of higher variance than that of a monolingual dev set, which influences the fidelity of the data scorer reward.4

3 The NMT algorithm in Wang et al. (2019b) estimates the reward by storing the moving average of n training gradients, which is not memory efficient (see Line 7 of Alg. 2 in Wang et al. (2019b)). In preliminary experiments, our approximation performs as well as the moving average approximation (see App. A.1). Thus, we use our approximation method as the component of MultiDDS for the rest of the experiments.

4 Suppose the dev set gradient of language k has variance var(g^k_dev) = σ, and that the dev gradients of each language {g^1_dev, ..., g^n_dev} are independent. Then the sum of the gradients from the n languages has a variance of var(Σ_{k=1}^n g^k_dev) = nσ.


Algorithm 1: Training with MultiDDS
Input: D_train; M: amount of data to train the multilingual model before updating ψ
Output: the converged multilingual model θ*

      // Initialize P_D(i; ψ) to be proportional to dataset size
 1:   P_D(i; ψ) ← |D^i_train| / Σ_{j=1}^n |D^j_train|
 2:   while not converged do
      // Load training data with ψ
 3:     X, Y ← ∅
 4:     while |X, Y| < M do
 5:       i ∼ P_D(i; ψ)
 6:       (x, y) ∼ D^i_train
 7:       X, Y ← X, Y ∪ (x, y)
 8:     end
      // Train the NMT model for multiple steps
 9:     for (x, y) in X, Y do
10:       θ ← GradientUpdate(θ, ∇_θ ℓ(x, y; θ))
11:     end
      // Estimate the effect of each language, R(i; θ)
12:     for i from 1 to n do
13:       (x′, y′) ∼ D^i_train
14:       g_train ← ∇_θ ℓ(x′, y′; θ)
15:       θ′ ← GradientUpdate(θ, g_train)
16:       g_dev ← 0
17:       for j from 1 to n do
18:         (x_d, y_d) ∼ D^j_dev
19:         g_dev ← g_dev + ∇_θ′ ℓ(x_d, y_d; θ′)
20:       end
21:       R(i; θ) ← cos(g_dev, g_train)
22:     end
      // Optimize ψ
23:     dψ ← Σ_{i=1}^n R(i; θ) · ∇_ψ log(P_D(i; ψ))
24:     ψ ← GradientUpdate(ψ, dψ)
25:   end
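For readers who prefer code, the following compact PyTorch sketch mirrors Alg. 1. It is our own illustration rather than the released fairseq implementation: `loss_fn`, the `sample()` methods of the dataset objects, and the step-ahead learning rate `inner_lr` are assumed placeholders, and gradients are flattened into single vectors for the cosine reward.

```python
import copy
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    """Gradient of `loss` w.r.t. `params`, flattened into one vector."""
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, params)])

def multidds_train(model, optimizer, loss_fn, train_sets, dev_sets,
                   M=1000, psi_lr=0.1, inner_lr=0.01, outer_steps=100):
    """Sketch of Alg. 1; `train_sets[i].sample()` / `dev_sets[j].sample()`
    return a minibatch (x, y), and `loss_fn(model, x, y)` returns its loss."""
    n = len(train_sets)
    sizes = torch.tensor([len(d) for d in train_sets], dtype=torch.float)
    psi = torch.log(sizes / sizes.sum())            # line 1: P_D proportional to size

    for _ in range(outer_steps):                    # line 2: until convergence
        probs = torch.softmax(psi, dim=0)           # P_D(i; psi), Eq. 8

        # Lines 3-11: sample M minibatches according to P_D and train theta.
        for _ in range(M):
            i = torch.multinomial(probs, 1).item()
            x, y = train_sets[i].sample()
            optimizer.zero_grad()
            loss_fn(model, x, y).backward()
            optimizer.step()

        # Lines 12-22: estimate R(i; theta) for every training language.
        params = [p for p in model.parameters() if p.requires_grad]
        rewards = torch.zeros(n)
        for i in range(n):
            x, y = train_sets[i].sample()
            g_train = flat_grad(loss_fn(model, x, y), params)        # line 14

            # Line 15: one step-ahead update theta' on a copy of the model.
            model_prime = copy.deepcopy(model)
            params_prime = [p for p in model_prime.parameters() if p.requires_grad]
            with torch.no_grad():
                offset = 0
                for p in params_prime:
                    k = p.numel()
                    p -= inner_lr * g_train[offset:offset + k].view_as(p)
                    offset += k

            # Lines 16-20: aggregate dev gradients of all n languages at theta'.
            g_dev = torch.zeros_like(g_train)
            for j in range(n):
                xd, yd = dev_sets[j].sample()
                g_dev += flat_grad(loss_fn(model_prime, xd, yd), params_prime)

            rewards[i] = F.cosine_similarity(g_dev, g_train, dim=0)  # line 21

        # Lines 23-24: REINFORCE update of psi with reward-weighted log-probs.
        psi = psi.detach().requires_grad_(True)
        surrogate = (rewards * torch.log_softmax(psi, dim=0)).sum()
        d_psi = torch.autograd.grad(surrogate, psi)[0]
        psi = (psi + psi_lr * d_psi).detach()

    return model, psi
```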

Thus, instead of using the gradient alignment between the training data and the aggregated loss of the n dev sets as the reward, we propose a second approach: first calculate the gradient alignment reward between the data and each of the n dev sets, then take the average of these as the final reward.

This can be expressed mathematically as follows:

$R'(i; \theta_t) \approx \cos\!\left(\nabla_\theta \Big(\frac{1}{n}\sum_{k=1}^{n} J(\theta_t, D^k_{\text{dev}})\Big),\, \nabla_\theta J(\theta_{t-1}, D^i_{\text{train}})\right) \approx \frac{1}{n}\sum_{k=1}^{n} \cos\!\left(\nabla_\theta J(\theta_t, D^k_{\text{dev}}),\, \nabla_\theta J(\theta_{t-1}, D^i_{\text{train}})\right)$   (11)

To implement this, we can simply replace the standard reward calculation at line 21 of Alg. 1 with the stable reward. We name this setting MultiDDS-S. In § 6.6 we show that this method has less variance than the reward in Eq. 10.
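Below is a minimal PyTorch sketch (ours, with illustrative function names) contrasting the stabilized reward of Eq. 11 with the aggregated reward of Eq. 10, assuming the per-language dev gradients and the training gradient have already been flattened into vectors.

```python
import torch
import torch.nn.functional as F

def standard_reward(dev_grads, train_grad):
    """Eq. 10: cosine with the aggregated dev gradient (sum over languages)."""
    return F.cosine_similarity(torch.stack(dev_grads).sum(dim=0), train_grad, dim=0)

def stabilized_reward(dev_grads, train_grad):
    """Eq. 11: average of the per-dev-set cosines, which has lower variance
    than the cosine with the (noisier) aggregated dev gradient."""
    cosines = [F.cosine_similarity(g, train_grad, dim=0) for g in dev_grads]
    return torch.stack(cosines).mean()
```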

6 Experimental Evaluation

6.1 Data and Settings

We use the 58-languages-to-English parallel data from Qi et al. (2018). A multilingual NMT model is trained for each of two sets of language pairs with different levels of language diversity:

Related: 4 LRLs (Azerbaijani: aze, Belarusian: bel, Galician: glg, Slovak: slk) and a related HRL for each LRL (Turkish: tur, Russian: rus, Portuguese: por, Czech: ces)

Diverse: 8 languages with varying amounts of data, picked without consideration for relatedness (Bosnian: bos, Marathi: mar, Hindi: hin, Macedonian: mkd, Greek: ell, Bulgarian: bul, French: fra, Korean: kor)

Statistics of the datasets are in § A.3. For each set of languages, we test two varieties of translation: 1) many-to-one (M2O): translating 8 languages to English; 2) one-to-many (O2M): translating English into 8 different languages. A target language tag is added to the source sentences for the O2M setting (Johnson et al., 2016).

6.2 Experiment Setup

All translation models use the standard transformer model (Vaswani et al., 2017) as implemented in fairseq (Ott et al., 2019), with 6 layers and 4 attention heads. All models are trained for 40 epochs. We preprocess the data using sentencepiece (Kudo and Richardson, 2018) with a vocabulary size of 8K for each language. The complete set of hyperparameters can be found in § A.2. Model performance is evaluated with BLEU score (Papineni et al., 2002), using sacreBLEU (Post, 2018).


Baselines We compare with the three standard heuristic methods explained in § 2: 1) Uniform (τ = ∞): datasets are sampled uniformly, so that LRLs are over-sampled to match the size of the HRLs; 2) Temperature: scales the proportional distribution by τ = 5 (following Arivazhagan et al. (2019)) to slightly over-sample the LRLs; 3) Proportional (τ = 1): datasets are sampled proportional to their size, so that there is no over-sampling of the LRLs.

Ours We run MultiDDS with either the standard reward (MultiDDS) or the stabilized reward proposed in Eq. 11 (MultiDDS-S). The scorer for MultiDDS simply maps the ID of each dataset to its corresponding probability (see Eq. 8); the scorer has n parameters for a dataset with n languages.

6.3 Main Results

We first show the average BLEU score over all languages for each translation setting in Tab. 1. First, comparing the baselines, we can see that there is no consistently strong strategy for setting the sampling ratio, with proportional sampling being best in the M2O setting, but worst in the O2M setting. Next, we can see that MultiDDS outperforms the best baseline in three of the four settings and is comparable to proportional sampling in the last M2O-Diverse setting. With the stabilized reward, MultiDDS-S consistently delivers better overall performance than the best baseline, and outperforms MultiDDS in three settings. From these results, we can conclude that MultiDDS-S provides a stable strategy to train multilingual systems over a variety of settings.

Method              M2O                 O2M
                    Related  Diverse    Related  Diverse
Baseline
  Uni. (τ=∞)        22.63    24.81      15.54    16.86
  Temp. (τ=5)       24.00    26.01      16.61    17.94
  Prop. (τ=1)       24.88    26.68      15.49    16.79
Ours
  MultiDDS          25.26    26.65      17.17    18.40
  MultiDDS-S        25.52    27.00      17.32    18.24

Table 1: Average BLEU for the baselines and our methods. Bold indicates the highest value.

Next, we look closer at the BLEU score of each language pair for MultiDDS-S and the best baseline. The results for all translation settings are in Tab. 2. In general, MultiDDS-S outperforms the baseline on more languages. In the best case, for the O2M-Related setting, MultiDDS-S brings significant gains for five of the eight languages, without hurting the remaining three. The gains for the Related group are larger than for the Diverse group, likely because MultiDDS can take better advantage of language similarities than the baseline methods.

It is worth noting that MultiDDS does not impose a large training overhead. For example, for our M2O system, the standard method needs around 19 hours and MultiDDS needs around 20 hours to converge. The change in training time is not significant because MultiDDS only optimizes a simple distribution over the training datasets.

6.4 Prioritizing what to Optimize

Prior works on multilingual models generally focus on improving the average performance of the model on all supported languages (Arivazhagan et al., 2019; Conneau et al., 2019). The formulation of MultiDDS reflects this objective by defining the aggregation of the n dev sets using Eq. 3, which is simply the average of dev risks. However, average performance might not be the most desirable objective under all practical usage settings. For example, it may be desirable to create a more egalitarian system that performs well on all languages, or a more specialized system that does particularly well on a subset of languages.

In this section, we examine the possibility of using MultiDDS to control the priorities of the multilingual model by defining different dev set aggregation methods that reflect these priorities. To do so, we first train the model for 10 epochs using regular MultiDDS, then switch to a different dev set aggregation method. Specifically, we compare MultiDDS with three different priorities (a sketch of the corresponding aggregation functions follows the list):

Regular: this is the standard MultiDDS that optimizes all languages throughout training using the average dev risk aggregation in Eq. 3

Low: a more egalitarian system that optimizes the average of the four languages with the worst dev perplexity, so that MultiDDS can focus on optimizing the low-performing languages

High: a more specialized system that optimizes the four languages with the best dev perplexity, for MultiDDS to focus on optimizing the high-performing languages
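As referenced above, here is a minimal Python sketch (ours; the function name and signature are illustrative, not from the released code) of how the dev-set aggregation can be swapped to reflect these priorities.

```python
import numpy as np

def aggregate_dev_risk(dev_losses, mode="regular", k=4):
    """Aggregate per-language dev risks into a single objective.
    "regular" averages all languages (Eq. 3); "low" averages the k languages
    with the worst dev loss; "high" averages the k languages with the best."""
    losses = np.asarray(dev_losses, dtype=float)
    if mode == "regular":
        return losses.mean()
    order = np.argsort(losses)                 # ascending: best-performing first
    if mode == "high":
        return losses[order[:k]].mean()
    if mode == "low":
        return losses[order[-k:]].mean()
    raise ValueError(f"unknown mode: {mode}")
```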

We performed experiments with these aggregation methods on the Diverse group, mainly because there is more performance trade-off among these languages. First, in Tab. 3 we show the average BLEU over all languages, and find that MultiDDS with different optimization priorities still maintains competitive average performance compared to the baseline.


        Method       Avg.   aze    bel    glg    slk    tur    rus    por    ces
M2O     Prop.        24.88  11.20  17.17  27.51  28.85  23.09* 22.89  41.60  26.80
        MultiDDS-S   25.52  12.20* 19.11* 29.37* 29.35* 22.81  22.78  41.55  27.03
O2M     Temp.        16.61   6.66  11.29  21.81  18.60  11.27  14.92  32.10  16.26
        MultiDDS-S   17.32   6.59  12.39* 21.65  20.61* 11.58  15.26* 33.52* 16.98*

        Method       Avg.   bos    mar    hin    mkd    ell    bul    fra    kor
M2O     Prop.        26.68  23.43  10.10  22.01  31.06  35.62* 36.41* 37.91* 16.91
        MultiDDS-S   27.00  25.34* 10.57  22.93* 32.05* 35.27  35.77  37.30  16.81
O2M     Temp.        17.94  14.73*  4.93  15.49  20.59  24.82  26.60  29.74*  6.62
        MultiDDS-S   18.24  14.02   4.76  15.68* 21.44  25.69* 27.78* 29.60   7.01*

Table 2: BLEU scores of the best baseline and MultiDDS-S for all translation settings. MultiDDS-S performs better on more languages. For each setting, bold indicates the highest value, and * means the gains are statistically significant with p < 0.05.

Setting  Baseline   MultiDDS-S
                    Regular  Low    High
M2O      26.68      27.00    26.97  27.08
O2M      17.94      18.24    17.95  18.55

Table 3: Average BLEU of the best baseline and three MultiDDS-S settings for the Diverse group. MultiDDS-S always outperforms the baseline.

More interestingly, in Fig. 2 we plot the BLEU score difference of High and Low compared to Regular for all 8 languages. The languages are ordered on the x-axis from left to right in decreasing perplexity. Low generally performs better on the low-performing languages on the left, while High generally achieves the best performance on the high-performing languages on the right, with results most consistent in the O2M setting. This indicates that MultiDDS is able to prioritize different predefined objectives.

It is also worth noting that low-performing languages are not always low-resource languages. For example, Korean (kor) has the largest amount of training data, but its BLEU score is among the lowest. This is because it is typologically very different from English and the other training languages. Fig. 2 shows that Low is still able to focus on improving kor, which aligns with the predefined objective. This fact is not considered in baseline methods that only consider data size when sampling from the training datasets.

Figure 2: The difference between Low and High optimization objectives compared to Regular for the Diverse language group. MultiDDS successfully optimizes for different priorities. Left: M2O; right: O2M.

6.5 Learned Language Distributions

In Fig. 3, we visualize the language distribution learned by MultiDDS throughout the training process. Under all settings, MultiDDS gradually increases the usage of LRLs. Although initialized with the same distribution for both one-to-many and many-to-one settings, MultiDDS learns to up-sample the LRLs more in the one-to-many setting, likely due to the increased importance of learning language-specific decoders in this setting. For the Diverse group, MultiDDS learns to decrease the usage of Korean (kor) the most, probably because it is very different from the other languages in the group.

6.6 Effect of Stabilized Rewards

Next, we study the effect of the stabilized reward proposed in § 5. In Fig. 4, we plot the regular reward (used by MultiDDS) and the stable reward (used by MultiDDS-S) throughout training. For all settings, the reward in MultiDDS and MultiDDS-S follows a similar trend, while the stable reward used in MultiDDS-S has consistently less variance.

MultiDDS-S also results in smaller variance in the final model performance. We run MultiDDS and MultiDDS-S with 4 different random seeds, and record the mean and variance of the average BLEU score. Tab. 4 shows results for the Diverse group, which indicate that the model performance achieved using MultiDDS-S has lower variance and a higher mean than MultiDDS.


Figure 3: Language usage by training step. Left: many-to-one; Right: one-to-many; Top: related language group; Bottom: diverse language group.

Method       M2O            O2M
             Mean   Var.    Mean   Var.
MultiDDS     26.85  0.04    18.20  0.05
MultiDDS-S   26.94  0.02    18.24  0.02

Table 4: Mean and variance of the average BLEU score for the Diverse group. The models trained with MultiDDS-S perform better and have less variance.

Additionally, we compare the learned language distributions of MultiDDS-S and MultiDDS in Fig. 5. The learned language distribution in both plots fluctuates similarly, but MultiDDS has more drastic changes than MultiDDS-S. This is also likely due to the reward of MultiDDS-S having less variance than that of MultiDDS.

Figure 4: Variance of the reward. Left: M2O; Right: O2M; Top: Related language group; Bottom: Diverse language group. (The panel legends report reward variances of 0.0012–0.0026 for MultiDDS vs. 0.0003–0.0007 for MultiDDS-S.)

Figure 5: Language usage for the M2O-Diverse setting. Left: MultiDDS-S; Right: MultiDDS. The two figures follow similar trends, while MultiDDS changes more drastically.

7 Related Work

Our work is related to multilingual training methods in general. Multilingual training has a rich history (Schultz and Waibel, 1998; Mimno et al., 2009; Shi et al., 2010; Tackstrom et al., 2013), but has become particularly prominent in recent years due to the ability of neural networks to easily perform multi-task learning (Dong et al., 2015; Plank et al., 2016; Johnson et al., 2016). As stated previously, recent results have demonstrated the importance of balancing HRLs and LRLs during multilingual training (Arivazhagan et al., 2019; Conneau et al., 2019), which is largely done with heuristic sampling using a temperature term; MultiDDS provides a more effective and less heuristic method. Wang and Neubig (2019) and Lin et al. (2019) choose languages from multilingual data to improve the performance on a particular language, while our work instead aims to train a single model that handles translation between many languages. Zaremoodi et al. (2018) and Wang et al. (2018, 2019a) propose improvements to the model architecture to improve multilingual performance, while MultiDDS is model-agnostic and optimizes multilingual data usage.

Our work is also related to machine learning methods that balance multitask learning (Chen et al., 2018b; Kendall et al., 2018). For example, Kendall et al. (2018) propose to weigh the training loss from a multitask model based on the uncertainty of each task. Our method focuses on optimizing the multilingual data usage, and is both somewhat orthogonal to and less heuristic than such loss weighting methods. Finally, our work is related to meta-learning, which is used in hyperparameter optimization (Baydin et al., 2018), model initialization for fast adaptation (Finn et al., 2017), and data weighting (Ren et al., 2018). Notably, Gu et al. (2018) apply meta-learning to learn an NMT model initialization for a set of languages, so that it can be quickly fine-tuned for any language. This is different in motivation from our method because it requires an adapted model for each of the languages, while our method aims to optimize a single model to support all languages. To our knowledge, our work is the first to apply meta-learning to optimize data usage for multilingual objectives.

8 Conclusion

In this paper, we propose MultiDDS, an algorithm that learns a language scorer to optimize multilingual data usage to achieve good performance on many different languages. We extend and improve over previous work on DDS (Wang et al., 2019b), with a more efficient algorithmic instantiation tailored for the multilingual training problem and a stable reward to optimize multiple objectives. MultiDDS not only outperforms prior methods in terms of overall performance on all languages, but also provides a flexible framework to prioritize different multilingual objectives.

Notably, MultiDDS is not limited to NMT, and future work may consider applications to other multilingual tasks. In addition, there are other conceivable multilingual optimization objectives than those we explored in § 6.4.

9 Acknowledgement

The first author is supported by a research grant from the Tang Family Foundation. This work was supported in part by NSF grant IIS-1812327. The authors would like to thank Amazon for providing GPU credits.

References

Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In NAACL.
Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. Many languages, one parser. TACL, 4:431–444.
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint.
Atilim Gunes Baydin, Robert Cornish, David Martínez-Rubio, Mark Schmidt, and Frank Wood. 2018. Online learning rate adaptation with hypergradient descent. In ICLR.
Xilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. 2018a. Adversarial deep averaging networks for cross-lingual sentiment classification. In ACL.
Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018b. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML.
Benoît Colson, Patrice Marcotte, and Gilles Savard. 2007. An overview of bilevel optimization. Annals OR, 153(1).
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. In EMNLP.
Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In ACL.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In EMNLP.
Melvin Johnson et al. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. In TACL.
Alex Kendall, Yarin Gal, and Roberto Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR.
Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In ACL.
Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In EMNLP.
David M. Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In EMNLP.
Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In EMNLP.
Toan Q. Nguyen and David Chiang. 2018. Transfer learning across low-resource, related languages for neural machine translation. In NAACL.
Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In IWSLT.
Robert Ostling and Jorg Tiedemann. 2017. Continuous multilinguality with language vectors. In EACL.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL: Demonstrations.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.
Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In ACL.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In WMT.
Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In NAACL.
Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to reweight examples for robust deep learning. In ICML.
Tanja Schultz and Alex Waibel. 1998. Multilingual and crosslingual speech recognition. In Proc. DARPA Workshop on Broadcast News Transcription and Understanding.
Lei Shi, Rada Mihalcea, and Mingjun Tian. 2010. Cross language text classification by model translation and semi-supervised learning. In EMNLP.
Oscar Tackstrom, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. In TACL.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
Xinyi Wang and Graham Neubig. 2019. Target conditioned sampling: Optimizing data selection for multilingual neural machine translation. In ACL.
Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. 2019a. Multilingual neural machine translation with soft decoupled encoding. In ICLR.
Xinyi Wang, Hieu Pham, Paul Michel, Antonis Anastasopoulos, Jaime Carbonell, and Graham Neubig. 2019b. Optimizing data usage via differentiable rewards. arXiv preprint.
Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. 2018. Three strategies to improve one-to-many multilingual translation. In EMNLP.
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
Shijie Wu and Mark Dredze. 2019. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In EMNLP.
Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In EMNLP.
Poorya Zaremoodi, Wray L. Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multitask learning: Improving low-resource neural machine translation. In ACL.
Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low resource neural machine translation. In EMNLP.


A Appendix

A.1 Effect of Step-ahead Reward

Setting  Baseline   MultiDDS
                    Moving Ave.  Step-ahead
M2O      24.88      25.19        25.26
O2M      16.61      17.17        17.17

Table 5: Average BLEU for the Related language group. The step-ahead reward proposed in the paper is better than or comparable with the moving average, and both are better than the baseline.

A.2 Hyperparameters

In this section, we list the details of preprocessing and the hyperparameters we use for the experiments.

• We use 6 encoder and decoder layers, with 4 attention heads

• The embedding size is set to 512, and the feed-forward layer has a dimension of 1024

• We use a dropout rate of 0.3

• The batch size is set to 9600 tokens

• We use label smoothing with a rate of 0.1

• We use scaled l2 normalization before the residual connection, which is shown to be helpful for small data (Nguyen and Salazar, 2019)

A.3 Dataset statistics

Language  Train   Dev   Test
aze       5.94k   671   903
bel       4.51k   248   664
glg       10.0k   682   1007
slk       61.5k   2271  2445
tur       182k    4045  5029
rus       208k    4814  5483
por       185k    4035  4855
ces       103k    3462  3831

Table 6: Statistics of the related language group.

Language  Train   Dev   Test
bos       5.64k   474   463
mar       9.84k   767   1090
hin       18.79k  854   1243
mkd       25.33k  640   438
ell       134k    3344  4433
bul       174k    4082  5060
fra       192k    4320  4866
kor       205k    4441  5637

Table 7: Statistics of the diverse language group.

A.4 Detailed Results for All Settings


Method        Avg.   aze    bel    glg    slk    tur    rus    por    ces
Uni. (τ=∞)    22.63   8.81  14.80  25.22  27.32  20.16  20.95  38.69  25.11
Temp. (τ=5)   24.00  10.42  15.85  27.63  28.38  21.53  21.82  40.18  26.26
Prop. (τ=1)   24.88  11.20  17.17  27.51  28.85  23.09  22.89  41.60  26.80
MultiDDS      25.26  12.20  18.60  28.83  29.21  22.24  22.50  41.40  27.22
MultiDDS-S    25.52  12.20  19.11  29.37  29.35  22.81  22.78  41.55  27.03

Table 8: BLEU score of the baselines and our method on the Related language group for many-to-one translation.

Method        Avg.   bos    mar    hin    mkd    ell    bul    fra    kor
Uni. (τ=∞)    24.81  21.52   9.48  19.99  30.46  33.22  33.70  35.15  15.03
Temp. (τ=5)   26.01  23.47  10.19  21.26  31.13  34.69  34.94  36.44  16.00
Prop. (τ=1)   26.68  23.43  10.10  22.01  31.06  35.62  36.41  37.91  16.91
MultiDDS      26.65  25.00  10.79  22.40  31.62  34.80  35.22  37.02  16.36
MultiDDS-S    27.00  25.34  10.57  22.93  32.05  35.27  35.77  37.30  16.81

Table 9: BLEU score of the baselines and our method on the Diverse language group for many-to-one translation.

Method        Avg.   aze    bel    glg    slk    tur    rus    por    ces
Uni. (τ=∞)    15.54   5.76  10.51  21.08  17.83   9.94  13.59  30.33  15.35
Temp. (τ=5)   16.61   6.66  11.29  21.81  18.60  11.27  14.92  32.10  16.26
Prop. (τ=1)   15.49   4.42   5.99  14.92  17.37  12.86  16.98  34.90  16.53
MultiDDS      17.17   6.24  11.75  21.46  20.67  11.51  15.42  33.41  16.94
MultiDDS-S    17.32   6.59  12.39  21.65  20.61  11.58  15.26  33.52  16.98

Table 10: BLEU score of the baselines and our method on the Related language group for one-to-many translation.

Method        Avg.   bos    mar    hin    mkd    ell    bul    fra    kor
Uni. (τ=∞)    16.86  14.12   4.69  14.52  20.10  22.87  25.02  27.64   5.95
Temp. (τ=5)   17.94  14.73   4.93  15.49  20.59  24.82  26.60  29.74   6.62
Prop. (τ=1)   16.79   6.93   3.69  10.70  15.77  26.69  29.59  33.51   7.49
MultiDDS      18.40  14.91   4.83  14.96  22.25  24.80  27.99  30.77   6.75
MultiDDS-S    18.24  14.02   4.76  15.68  21.44  25.69  27.78  29.60   7.01

Table 11: BLEU score of the baselines and our method on the Diverse language group for one-to-many translation.