
Analyzing Bayesian Crosslingual Transfer in Topic Models

Shudong Hao
Boulder, CO
[email protected]

Michael J. Paul
Information Science
University of Colorado
Boulder, CO
[email protected]

Abstract

We introduce a theoretical analysis of crosslingual transfer in probabilistic topic models. By formulating posterior inference through Gibbs sampling as a process of language transfer, we propose a new measure that quantifies the loss of knowledge across languages during this process. This measure enables us to derive a PAC-Bayesian bound that elucidates the factors affecting model quality, both during training and in downstream applications. We provide experimental validation of the analysis on a diverse set of five languages, and discuss best practices for data collection and model design based on our analysis.

1 Introduction

Crosslingual learning is an important area of natural language processing that has driven applications including text mining in multiple languages (Ni et al., 2009; Smet and Moens, 2009), cultural difference detection (Gutiérrez et al., 2016), and various linguistic studies (Shutova et al., 2017; Barrett et al., 2016). Crosslingual learning methods generally extend monolingual algorithms by using various multilingual resources. In contrast to traditional high-dimensional vector space models, modern crosslingual models tend to rely on learning low-dimensional word representations that are more efficient and generalizable.

A popular approach to representation learning comes from the word embedding community, in which words are represented as vectors in an embedding space shared by multiple languages (Ruder et al., 2018; Faruqui and Dyer, 2014; Klementiev et al., 2012). Another direction is from the topic modeling community, where words are projected into a probabilistic topic space (Ma and Nasukawa, 2017; Jagarlamudi and Daumé III, 2010). While formulated differently, both types of models apply the same principle: low-dimensional vectors exist in a shared crosslingual space, wherein vector representations of similar concepts across languages (e.g., "dog" and "hund") should be nearby in the shared space.

To enable crosslingual representation learning, knowledge is transferred from a source language to a target language, so that representations have similar values across languages. In this study, we focus on probabilistic topic models, and "knowledge" refers to a word's probability distribution over topics. Little is known about the characteristics of crosslingual knowledge transfer in topic models, so this paper provides an analysis, both theoretical and empirical, of crosslingual transfer in multilingual topic models.

1.1 Background and Contributions

Multilingual Topic Models  Given a multilingual corpus D^(1,...,L) in languages ℓ = 1, ..., L as input, a multilingual topic model learns K topics. Each multilingual topic k^(1,...,L) (k = 1, ..., K) is defined as an L-dimensional tuple (ϕ_k^(1), ..., ϕ_k^(L)), where ϕ_k^(ℓ) is a multinomial distribution over the vocabulary V^(ℓ) in language ℓ. From a human's perspective, a multilingual topic k^(1,...,L) can be interpreted by looking at the word types that have the C highest probabilities in ϕ_k^(ℓ) for each language ℓ; C is called the cardinality of the topic. Thus, a multilingual topic can loosely be thought of as a group of word lists, where each language ℓ has its own version of the topic.

Multilingual topic models are generally extended from Latent Dirichlet Allocation (Blei et al., 2003, LDA). Though many variations have been proposed, the underlying structures of multilingual topic models are similar. These models require either a parallel/comparable corpus in multiple languages, or word translations from a dictionary. One of the most popular models is the polylingual topic model (Mimno et al., 2009, PLTM), where comparable document pairs share distributions over topics θ, while each language ℓ has its own distributions {ϕ_k^(ℓ)}_{k=1}^K over the vocabulary V^(ℓ). By re-marginalizing the estimations {ϕ_k^(ℓ)}_{k=1}^K, we obtain a word representation φ(w) ∈ R^K for each word w, where φ(w)_k = Pr(z_w = k | w), i.e., the probability of topic k given word type w.
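
As an illustration of this re-marginalization step, the following minimal numpy sketch turns topic-word distributions into word representations under a uniform prior over topics. The function name and array shapes are our own, not part of any PLTM implementation; with a non-uniform topic prior one would weight each column by the topic's overall frequency.

```python
import numpy as np

def word_representations(phi):
    """Re-normalize topic-word distributions into word representations.

    phi: array of shape (K, V); phi[k] is a distribution over the
         vocabulary for topic k (each row sums to 1).
    Returns an array of shape (V, K) whose w-th row estimates
    Pr(z = k | word = w) under a uniform prior over topics.
    """
    per_word = phi.T                              # (V, K): Pr(w | k) per column k
    return per_word / per_word.sum(axis=1, keepdims=True)

# toy example: 3 topics, 4 word types
phi = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.6, 0.2, 0.1],
                [0.1, 0.1, 0.2, 0.6]])
print(word_representations(phi)[0])               # representation of word type 0
```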

Crosslingual Transfer  Knowledge transfer through crosslingual representations has been studied in prior work. Smet and Moens (2009) and Heyman et al. (2016) show empirically how document classification using topic models implements the ideas of crosslingual transfer, but to date there has been no theoretical framework to analyze this transfer process in detail.

In this paper, we describe two types of transfer, on-site and off-site, based on the nature of where and how the transfer takes place. We refer to transfer that happens while training topic models (i.e., during representation learning) as on-site. Once we obtain the low-dimensional representations, they can be used for downstream tasks. We refer to transfer in this phase as off-site, since the crosslingual tasks are usually detached from the process of representation learning.

Contributions  Our study provides a theoretical analysis of crosslingual transfer learning in topic models. Specifically, we first formulate on-site transfer as circular validation, and derive an upper bound based on PAC-Bayesian theory (Section 2). The upper bound explicitly shows the factors that can affect knowledge transfer. We then move on to off-site transfer, and focus on crosslingual document classification as a downstream task (Section 3). Finally, we show experimentally that the on-site transfer error can affect the performance of downstream tasks (Section 4).

2 On-Site Transfer

On-site transfer refers to the training procedure of multilingual topic models, which usually involves Bayesian inference techniques such as variational inference and Gibbs sampling. Our work focuses on the analysis of collapsed Gibbs sampling (Griffiths and Steyvers, 2004), showing how knowledge is transferred across languages and how a topic space is formed through the sampling process.

To this end, we first describe a specific formulation of knowledge transfer in multilingual topic models as the starting point of our analysis (Section 2.1). We then formulate Gibbs sampling as circular validation and quantify a loss during this phase (Section 2.2). This formulation leads us to a PAC-Bayesian bound that explicitly shows the factors affecting crosslingual training (Section 2.3). Lastly, we look into the different transfer mechanisms in more depth (Section 2.4).

2.1 Transfer through Priors

Priors are an important component in Bayesian models like PLTM. In the original generative process of PLTM, each comparable document pair (d_S, d_T) in the source and target languages (S, T) is generated by the same multinomial θ ∼ Dir(α).

Hao and Paul (2018) showed that knowledge transfer across languages happens through priors. Specifically, assume the source document is generated from θ^(d_S) ∼ Dir(α) and has sufficient statistics n_{d_S} ∈ N^K, where each cell n_{k|d_S} is the count of topic k in document d_S. When generating the corresponding comparable document d_T, the Dirichlet prior of the distribution over topics θ^(d_T), instead of a symmetric α, is parameterized by α + n_{d_S}. This formulation yields the same posterior estimation as the original joint model and is the foundation of our analysis in this section.

To see this transfer process more clearly, we look closer at the conditional distributions during sampling, taking PLTM as an example. When sampling a token x_T in the target language, the Gibbs sampler calculates a conditional distribution P_{x_T} over K topics, from which a topic k is randomly drawn and assigned to x_T (denoted z_{x_T}). Assume the token x_T is in document d_T, whose comparable document in the source language is d_S. The conditional distribution for x_T is

P_{x,k} = \Pr(z_x = k; \mathbf{w}_-, \mathbf{z}_-) \propto \big( n_{k|d_T} + n_{k|d_S} + \alpha \big) \cdot \frac{n_{w_T|k} + \beta}{n_{\cdot|k} + V^{(T)}\beta},    (1)

where the quantity n_{k|d_S} is added and thus transferred from the source document. The calculation of P_x therefore incorporates the knowledge transferred from the other language.
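
The sketch below spells out Equation (1) for a single target-language token. It is not the MALLET implementation; the count arrays (n_k_dT, n_k_dS, n_w_k, n_k) are hypothetical names for the sufficient statistics that a collapsed Gibbs sampler would maintain, with the current token excluded from the counts.

```python
import numpy as np

def pltm_conditional(n_k_dT, n_k_dS, n_w_k, n_k, alpha, beta, V_T):
    """Conditional distribution of Equation (1) for one target-language token.

    n_k_dT : (K,) topic counts in the target document d_T
    n_k_dS : (K,) topic counts in the linked source document d_S
    n_w_k  : (K,) counts of this token's word type under each topic
    n_k    : (K,) total token counts per topic in the target language
    """
    weights = (n_k_dT + n_k_dS + alpha) * (n_w_k + beta) / (n_k + V_T * beta)
    return weights / weights.sum()

# toy example: K = 3 topics, a target vocabulary of 1,000 types
p = pltm_conditional(np.array([2., 0., 1.]), np.array([5., 1., 0.]),
                     np.array([3., 0., 1.]), np.array([40., 55., 30.]),
                     alpha=1.0, beta=0.01, V_T=1000)
print(p)  # draw z ~ Categorical(p) to complete the sampling step
```

Adding n_k_dS to the target document's counts is exactly where the source document's knowledge enters the calculation.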

Now that we have identified the transfer process, we provide an alternative view of Gibbs sampling, i.e., circular validation, in the next section.

Figure 1: The Gibbs sampler is sampling the token "djur" (animal). Using the classifier h_k sampled from its conditional distribution P_{x_T}, circular validation evaluates h_k on all the tokens of type "human".

2.2 Circular Validation

Circular validation (or reverse validation) was proposed by Zhong et al. (2010) and Bruzzone and Marconcini (2010) in transfer learning. Briefly, a learning algorithm A is trained on both source and target datasets (D_S and D_T), where the source is labeled and the target is unlabeled. After predicting the labels for the target dataset using A (predictions denoted A(D_T)), circular validation trains another algorithm A′ in the reverse direction, i.e., it uses A(D_T) and D_T as the labeled dataset and D_S as the unlabeled dataset. The error is then evaluated on A′(D_S). This "train-predict-reverse-repeat" cycle has a similar flavor to the iterative manner of Gibbs sampling, which inspires us to view the sampling process as circular validation.

Figure 1 illustrates this process. Suppose the Gibbs sampler is currently sampling x_T of word type w_T in target language T. As discussed for Equation (1), the calculation of the conditional distribution P_{x_T} incorporates the knowledge transferred from the source language. We then treat the process of drawing a topic from P_{x_T} as a classification of the token x_T. Let P_{x_T} be a distribution over K unary classifiers {h_k}_{k=1}^K, where the k-th classifier labels the token as topic k with probability one:

h_k \sim P_{x_T}, \quad \Pr(z_{x_T} = k; h_k) = 1.    (2)

This process is repeated between the two languages until the Markov chain converges.

The training of topic models is unsupervised, i.e., there is no ground truth for labeling a topic, which makes it difficult to analyze the effect of transfer learning. Thus, after calculating P_{x_T}, we take an additional step called reverse validation, in which we design and calculate a measure, the circular validation loss, to quantify the transfer.

Definition 1 (Circular validation loss, CVL). Let S_w be the set containing all the tokens of type w throughout the whole training corpus, and call it the sample of w. Given a bilingual word pair (w_T, w_S), where w_T is in target language T and w_S in source language S, let S_{w_T} and S_{w_S} be the samples of the two types, and n_{w_T} and n_{w_S} their sizes. The empirical circular validation loss (ĈVL) is defined as

\widehat{\mathrm{CVL}}(w_T, w_S) = \frac{1}{2} \mathbb{E}_{x_S, x_T} \big[ \widehat{L}(x_T, w_S) + \widehat{L}(x_S, w_T) \big],

\widehat{L}(x_T, w_S) = \frac{1}{n_{w_S}} \sum_{x_S \in S_{w_S}} \mathbb{E}_{h \sim P_{x_T}} \big[ \mathbf{1}\{ h(x_S) \neq z_{x_S} \} \big] = \frac{1}{n_{w_S}} \sum_{x_S \in S_{w_S}} \big( 1 - P_{x_T, z_{x_S}} \big),

where P_{x_T,k} is the conditional probability of token x_T being assigned topic k. Taking expectations over all tokens x_S and x_T, we have the general CVL:

\mathrm{CVL}(w_T, w_S) = \frac{1}{2} \mathbb{E}_{x_S, x_T} \big[ L(x_T, w_S) + L(x_S, w_T) \big],

L(x_T, w_S) = \mathbb{E}_{x_S} \mathbb{E}_{h \sim P_{x_T}} \big[ \mathbf{1}\{ h(x_S) \neq z_{x_S} \} \big].

When sampling a token x_T, we still follow the two-step process in Equation (2), but instead of labeling x_T itself, we use its conditional P_{x_T} to label the entire sample of a word type w_S in the source language. Since all the topic labels for the source language are fixed, we take them as the assumed "correct" labelings, and compare x_S's labels with the predictions from P_{x_T}. This is the intuition behind CVL.
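
A minimal numpy sketch of the empirical CVL in Definition 1 is given below, assuming the conditional distributions and current topic assignments for the two samples are available as arrays; the variable names are ours.

```python
import numpy as np

def empirical_cvl(P_T, z_S, P_S, z_T):
    """Empirical circular validation loss for a bilingual word pair (w_T, w_S).

    P_T : (n_wT, K) conditional distributions P_x for every token of type w_T
    z_S : (n_wS,)   current topic assignments for every token of type w_S
    P_S : (n_wS, K) conditional distributions for tokens of type w_S
    z_T : (n_wT,)   current topic assignments for tokens of type w_T
    """
    # mean over tokens x_T and x_S of (1 - P_{x_T, z_{x_S}})
    loss_T_on_S = 1.0 - P_T[:, z_S].mean()
    # mean over tokens x_S and x_T of (1 - P_{x_S, z_{x_T}})
    loss_S_on_T = 1.0 - P_S[:, z_T].mean()
    return 0.5 * (loss_T_on_S + loss_S_on_T)
```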

Note that the choice of word types w_T and w_S for calculating CVL is arbitrary. However, CVL is only meaningful when the two word types are semantically related, such as word translations, because those word pairs are where the knowledge transfer takes place. On the other hand, the Gibbs sampler does not calculate CVL explicitly, so adding the reverse validation step does not affect the training of the model. It does, however, help us expose and analyze the knowledge transfer mechanism. In fact, as we show in the next theorem, sampling is also a procedure of optimizing CVL.

Theorem 1. Let \widehat{\mathrm{CVL}}^{(t)}(w_T, w_S) be the empirical circular validation loss of any bilingual word pair at iteration t of Gibbs sampling. Then \widehat{\mathrm{CVL}}^{(t)}(w_T, w_S) converges as t → ∞.


Proof. See Appendix.

2.3 PAC-Bayes View

A question that follows from the formulation of CVL is: what factors lead to better transfer during this process, particularly for semantically related words? To answer this, we turn to theory that bounds the performance of classifiers and apply it to this formulation of topic sampling as classification.

The PAC-Bayes theorem was introduced by McAllester (1999) to bound the performance of Bayes classifiers. Given a hypothesis set H, the majority vote classifier (or Bayes classifier) uses every hypothesis h ∈ H to perform binary classification on an example x, and uses the majority output as the final prediction. Since minimizing the error of the Bayes classifier is NP-hard, an alternative is to use a Gibbs classifier as an approximation. The Gibbs classifier first draws a hypothesis h ∈ H according to a posterior distribution over H, and then uses this hypothesis to predict the label of an example x (Germain et al., 2012). The generalization loss of this Gibbs classifier can be bounded as follows.

Theorem 2 (PAC-Bayes theorem, McAllester (1999)). Let P be a posterior distribution over all classifiers h ∈ H, and Q a prior distribution. With probability at least 1 − δ, we have

L \le \widehat{L} + \sqrt{ \frac{1}{2n} \left( \mathrm{KL}(\mathcal{P} \,\|\, \mathcal{Q}) + \ln \frac{2\sqrt{n}}{\delta} \right) },

where L and \widehat{L} are the general loss and the empirical loss on a sample of size n.

In our framework, a token x_T provides a posterior P_{x_T} over K classifiers. The loss L(x_T, w_S) is then calculated on the sample S_{w_S} in language S. The following theorem shows that for a bilingual word pair (w_T, w_S), the general CVL can be bounded using several quantities.

Theorem 3. Given a bilingual word pair (w_T, w_S), with probability at least 1 − δ, the following bound holds:

\mathrm{CVL}(w_T, w_S) \le \widehat{\mathrm{CVL}}(w_T, w_S) + \frac{1}{2} \sqrt{ \frac{1}{n} \left( \mathrm{KL}_{w_T} + \mathrm{KL}_{w_S} + 2 \ln \frac{2}{\delta} \right) } + \frac{\ln n^\star}{n},    (3)

n = \min\{ n_{w_T}, n_{w_S} \}, \quad n^\star = \max\{ n_{w_T}, n_{w_S} \}.

For brevity we use KL_w to denote KL(P_x || Q_x), where P_x is the conditional distribution from Gibbs sampling of the token x of word type w that gives the highest loss L(x, w), and Q_x is a prior.
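
For concreteness, the following sketch evaluates the right-hand side of Theorem 3 from empirical quantities. It assumes the two worst-case KL-divergence terms have already been computed; the function names are hypothetical.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions over K topics."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def cvl_upper_bound(cvl_hat, kl_wT, kl_wS, n_wT, n_wS, delta=0.05):
    """Right-hand side of Theorem 3: empirical CVL plus the complexity terms."""
    n, n_star = min(n_wT, n_wS), max(n_wT, n_wS)
    slack = 0.5 * np.sqrt((kl_wT + kl_wS + 2.0 * np.log(2.0 / delta)) / n)
    return cvl_hat + slack + np.log(n_star) / n
```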

Proof. See Appendix.

2.4 Multilevel Transfer

Recall that knowledge transfer happens through priors in topic models (Section 2.1). Because the KL-divergence terms in Theorem 3 include this prior Q, we can use the theorem to analyze the transfer mechanisms more concretely.

The conditional distribution for sampling a topic z_x for a token x can be factorized into document-topic and topic-word levels:

P_{x,k} = \Pr(z_x = k \mid w_x = w, \mathbf{w}_-, \mathbf{z}_-) = \Pr(z_x = k \mid \mathbf{z}_-) \cdot \Pr(w_x = w \mid z_x = k, \mathbf{w}_-, \mathbf{z}_-)
\propto \underbrace{ \Pr(z_x = k \mid \mathbf{z}_-) }_{\text{document level}} \cdot \underbrace{ \Pr(z_x = k \mid w_x = w, \mathbf{w}_-) }_{\text{word level}} \;\triangleq\; P_{\theta,x,k} \cdot P_{\phi,x,k},
\qquad P_x \triangleq P_{\theta,x} \otimes P_{\phi,x},

where ⊗ is element-wise multiplication. Thus, we have the following inequality:

\mathrm{KL}(P_x \,\|\, Q_x) = \mathrm{KL}(P_{\theta,x} \otimes P_{\phi,x} \,\|\, Q_{\theta,x} \otimes Q_{\phi,x}) \le \mathrm{KL}(P_{\theta,x} \,\|\, Q_{\theta,x}) + \mathrm{KL}(P_{\phi,x} \,\|\, Q_{\phi,x}),

and the KL-divergence term in Theorem 3 is simply the sum of the KL-divergences between the conditional and prior distributions at all levels.

Recall that PLTM transfers knowledge at the document level, through Q_{θ,x}, by linking document translations together (Equation (1)). Assume the current token x is from a target document linked to a document d_S in the source language. Then the prior for P_{θ,x} is θ^(d_S), i.e., the normalized empirical distribution over topics of d_S.

Since the words are generated within each language under PLTM, i.e., ϕ_k^(S) is irrelevant to ϕ_k^(T), no transfer happens at the word level. In this case Q_{φ,x}, the prior for P_{φ,x}, is simply a K-dimensional uniform distribution U. Then:

\mathrm{KL}_w \le \mathrm{KL}\big( P_{\theta,x} \,\|\, \theta^{(d_S)} \big) + \mathrm{KL}(P_{\phi,x} \,\|\, U) = \underbrace{ \mathrm{KL}\big( P_{\theta,x} \,\|\, \theta^{(d_S)} \big) }_{\text{crosslingual entropy}} + \underbrace{ \log K - H(P_{\phi,x}) }_{\text{monolingual entropy}}.

Thus, at levels where transfer happens (document- or word-level), a low crosslingual entropy is preferred, to offset the impact of the monolingual entropy at levels where no transfer happens.


Most multilingual topic models are generative admixture models in which the conditional probabilities can be factorized into different levels, so the KL-divergence term in Theorem 3 can be decomposed and analyzed in the same way as in this section for models that transfer at other levels, such as Hao and Paul (2018), Heyman et al. (2016), and Hu et al. (2014). For example, if a model has word-level transfer, i.e., the model assumes that word translations share the same distributions, we have a KL-divergence term

\mathrm{KL}_w \le \mathrm{KL}\big( P_{\phi,x} \,\|\, \phi^{(w_S)} \big) + \mathrm{KL}(P_{\theta,x} \,\|\, U) = \mathrm{KL}\big( P_{\phi,x} \,\|\, \phi^{(w_S)} \big) + \log K - H(P_{\theta,x}),

where w_S is the word translation of word w.
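
The decomposition above can be computed directly from the sampler's factorized conditionals. The sketch below shows the document-level transfer (PLTM) case; P_theta, P_phi, and theta_dS are hypothetical arrays for the document-level conditional, the word-level conditional, and the linked source document's topic distribution.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float) + eps
    return float(-np.sum(p * np.log(p)))

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def pltm_kl_terms(P_theta, P_phi, theta_dS):
    """Split the Section 2.4 bound on KL_w (document-level transfer) into
    its crosslingual and monolingual entropy terms."""
    K = len(P_phi)
    crosslingual = kl(P_theta, theta_dS)       # KL(P_theta,x || theta^(d_S))
    monolingual = np.log(K) - entropy(P_phi)   # log K - H(P_phi,x)
    return crosslingual, monolingual           # their sum upper-bounds KL_w
```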

3 Off-Site Transfer

Off-site transfer refers to language transfer that happens when applying trained topic models to downstream crosslingual tasks such as document classification. Because transfer happens through the trained representations, the performance of off-site transfer heavily depends on that of on-site transfer. To analyze this problem, we focus on the task of crosslingual document classification.

In crosslingual document classification, a document classifier h is trained on documents from one language and then applied to documents from another language. Specifically, after training a bilingual topic model, we have K bilingual word distributions {ϕ_k^(S)}_{k=1}^K and {ϕ_k^(T)}_{k=1}^K. These two sets of distributions are used to infer document-topic distributions θ for unseen documents in the test corpus, and each document is represented by its inferred distribution. A document classifier is then trained on the θ vectors as features in the source language S and tested on the target language T.

We aim to show how the generalization risk on the target language T, denoted R_T(h), is related to the training risk on the source language S, R_S(h). To differentiate the loss and classifiers in this section from those in Section 2, we use the term "risk" here, and h refers to the document classifiers, not the topic labeling process of the sampler.

3.1 Languages as Domains

Classic learning theory requires training and test sets to come from the same distribution D, i.e., (θ, y) ∼ D, where θ is the document representation (features) and y the document label (Valiant, 1984). In practice, however, corpora in S and T may be sampled from different distributions, i.e., D^(S) = {(θ^(d_S), y)} ∼ 𝒟^(S) and D^(T) = {(θ^(d_T), y)} ∼ 𝒟^(T). We refer to these distributions as document spaces. To relate R_T(h) and R_S(h), we therefore have to take their distribution bias into consideration. This is often formulated as a problem of domain adaptation, and here we treat each language as a "domain".

We follow the seminal work by Ben-David et al. (2006), and define the H-distance as follows.

Definition 2 (H-distance, Ben-David et al. (2006)). Let H be a symmetric hypothesis space, i.e., for every hypothesis h ∈ H its counterpart 1 − h is also in H. Let m = |D^(S)| + |D^(T)| be the total size of the test corpus. The H-distance between D^(S) and D^(T) is defined as

\frac{1}{2} d_{\mathcal{H}}\big( D^{(S)}, D^{(T)} \big) = \max_{h \in \mathcal{H}} \frac{1}{m} \sum_{\ell \in \{S,T\}} \; \sum_{x_d : h(x_d) = \ell} \mathbf{1}\big\{ x_d \in D^{(\ell)} \big\},

where x_d is the representation of document d, and h(x_d) outputs the language of this document.

This distance measures how identifiable the languages are based on their representations. If the source and target languages come from entirely different distributions, a classifier can easily identify language-specific features, which could affect the performance of the document classifiers.
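
In practice the H-distance can be approximated by how well a classifier separates the two languages from the document representations, which is also how we estimate it in Section 4.3. A hedged scikit-learn sketch, with hypothetical variable names:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def h_distance_proxy(theta_S, theta_T):
    """Estimate how separable two languages are in the shared topic space.

    theta_S, theta_T: (m_S, K) and (m_T, K) document-topic vectors.
    Returns the cross-validated accuracy of a linear classifier trained to
    predict the language of each document; values near 0.5 suggest the two
    document spaces are hard to tell apart.
    """
    X = np.vstack([theta_S, theta_T])
    y = np.concatenate([np.zeros(len(theta_S)), np.ones(len(theta_T))])
    return float(cross_val_score(LinearSVC(), X, y, cv=5).mean())
```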

With the H-distance, we have a measure of the "distance" between the two distributions D^(S) and D^(T). We state the following theorem from domain adaptation theory.

Theorem 4 (Ben-David et al. (2006)). Let m be the corpus size of the source language, i.e., m = |D^(S)|, let c be the VC-dimension of the document classifiers h ∈ H, and let d_H(D^(S), D^(T)) be the H-distance between the two languages in the document space. With probability at least 1 − δ, we have the following bound:

R_T(h) \le R_S(h) + d_{\mathcal{H}}\big( D^{(S)}, D^{(T)} \big) + \lambda + \sqrt{ \frac{4}{m} \left( c \log \frac{2em}{c} + \log \frac{4}{\delta} \right) },    (4)

\lambda = \min_{h \in \mathcal{H}} R_S(h) + R_T(h).    (5)


The term λ in Theorem 4 defines a joint risk, i.e., the training error on both source and target documents. This term usually cannot be estimated in practice, since the labels for target documents are unavailable. However, we can still calculate this term for the purpose of analysis.

The theorem shows that the crosslingual classification risk is bounded by two critical components: the H-distance and the joint risk λ. Interestingly, these two quantities are based on the same set of features with different labeling rules: for the H-distance, the label for each instance is its language, while λ uses the actual document label. Therefore, a better bound requires consistency of features across languages, under both the language and the document labelings.

3.2 From Words to Documents

Since the consistency of features depends on the document representations θ, we need to trace back to the upstream training of the topic model and show how errors propagate to the formation of document representations. Thus, we first show the relation between CVL and the word representations φ in the following lemma.

Lemma 1. Given any bilingual word pair (w_T, w_S), let φ(w) denote the distribution over topics of word type w. Then we have

1 - \phi(w_T)^\top \cdot \phi(w_S) \le \mathrm{CVL}(w_T, w_S).

Proof. See Appendix.

We need to connect the word representations φ, which are central to on-site transfer, to the document representations θ, which are central to off-site transfer. To do this, we assume that the inferred distribution over topics θ^(d) for each test document d is a weighted average over all word vectors, i.e., θ^(d) ∝ Σ_w f_w^d · φ(w), where f_w^d is the normalized frequency of word w in document d (Arora et al., 2013). When this assumption holds, we can bound the similarity of the document representations θ^(d_S) and θ^(d_T) in terms of the word representations and hence their CVL.
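
Under this assumption, a document vector can be sketched as the frequency-weighted average below; the helper doc_representation and its inputs are our own names, and real held-out inference (e.g., running the Gibbs sampler on the new document) would differ.

```python
import numpy as np
from collections import Counter

def doc_representation(tokens, phi_hat, vocab_index):
    """Approximate theta(d) as a frequency-weighted average of word vectors.

    tokens      : list of word types (strings) observed in document d
    phi_hat     : (V, K) word representations phi(w)
    vocab_index : dict mapping word type -> row index in phi_hat
    """
    counts = Counter(t for t in tokens if t in vocab_index)
    n = sum(counts.values())
    K = phi_hat.shape[1]
    if n == 0:
        return np.full(K, 1.0 / K)        # fall back to uniform if nothing is in-vocabulary
    theta = np.zeros(K)
    for w, c in counts.items():
        theta += (c / n) * phi_hat[vocab_index[w]]
    return theta / theta.sum()            # renormalize over the K topics
```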

Theorem 5. Let θ^(d_S) be the distribution over topics for document d_S (similarly for d_T), and let

F(d_S, d_T) = \Big( \sum_{w_S} \big( f^{d_S}_{w_S} \big)^2 \cdot \sum_{w_T} \big( f^{d_T}_{w_T} \big)^2 \Big)^{\frac{1}{2}},

where f^d_w is the normalized frequency of word w in document d, and let K be the number of topics. Then

\theta^{(d_S)\top} \cdot \theta^{(d_T)} \le F(d_S, d_T) \cdot \sqrt{K} \cdot \Big( \sum_{w_S, w_T} \big( \mathrm{CVL}(w_T, w_S) - 1 \big)^2 \Big)^{\frac{1}{2}}.

Proof. See Appendix.

This provides a spatial connection between document pairs and the word pairs they contain. Many kernelized classifiers, such as support vector machines (SVM), explicitly use this inner product in the dual optimization objective (Platt, 1998). Since the inner product is directly related to cosine similarity, Theorem 5 indicates that if two documents are spatially close, their inner product should be large, and thus the CVL of all word pairs they share should be small. In an extreme case, if CVL(w_T, w_S) = 1 for all the bilingual word pairs appearing in the document pair (d_S, d_T), then θ^(d_S)ᵀ · θ^(d_T) = 0, meaning the two documents are orthogonal and tend to be topically irrelevant.

Together with the upstream training discussed in Section 2, we see that CVL has an impact on the consistency of features across languages. A low CVL indicates that the transfer from source to target is sufficient in two ways. First, the languages share similar distributions, and therefore it is harder to distinguish the languages based on their distributions. Second, if there exists a latent mapping from a distribution to a label, it should produce similar labelings on both source and target data, since they are similar. These two aspects correspond to the language H-distance and the joint risk λ in Theorem 4.

4 Experiments

We experiment with five languages: Arabic (AR, Semitic), German (DE, Germanic), Spanish (ES, Romance), Russian (RU, Slavic), and Chinese (ZH, Sinitic). In the first two experiments, we pair each with English (EN, Germanic) and train PLTM on each language pair individually.

Training Data  For each language pair, we use a subsample of 3,000 Wikipedia comparable document pairs, i.e., 6,000 documents in total. We set K = 50 and train PLTM with the default hyperparameters (McCallum, 2002). We run each experiment five times and average the results.

Test Data  For experiments with document classification, we use Global Voices (GV) in all five language pairs as test sets. Each document in this dataset has a "categories" attribute that can be used as the document label. In our classification experiments, we use culture, technology, and education as the labels to perform multiclass classification.

Evaluation  To evaluate topic quality, we use Crosslingual Normalized Pointwise Mutual Information (Hao et al., 2018, CNPMI), an intrinsic metric of crosslingual topic coherence. For any bilingual word pair (w_T, w_S),

\mathrm{CNPMI}(w_T, w_S) = \frac{ -\log \frac{\Pr(w_T, w_S)}{\Pr(w_T)\,\Pr(w_S)} }{ \log \Pr(w_T, w_S) },    (6)

where Pr(w_T, w_S) is the probability of w_T and w_S appearing in the same pair of comparable documents. We use 10,000 Wikipedia comparable document pairs outside the PLTM training data for each language pair to calculate CNPMI scores. All datasets are publicly available at http://opus.nlpl.eu/ (Tiedemann, 2012). Additional details of our datasets and experimental setup can be found in the appendix.
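
A small sketch of Equation (6), estimated from comparable document pairs represented as sets of word types. The co-occurrence estimator and the fallback value for pairs that never co-occur are our assumptions.

```python
import math

def cnpmi(word_T, word_S, doc_pairs):
    """CNPMI for one bilingual word pair, estimated from comparable document pairs.

    doc_pairs : list of (set_of_types_in_d_T, set_of_types_in_d_S)
    """
    N = len(doc_pairs)
    n_T = sum(word_T in d_T for d_T, _ in doc_pairs)
    n_S = sum(word_S in d_S for _, d_S in doc_pairs)
    n_TS = sum(word_T in d_T and word_S in d_S for d_T, d_S in doc_pairs)
    if n_TS == 0 or n_T == 0 or n_S == 0:
        return -1.0  # assumed convention when the pair never co-occurs
    p_T, p_S, p_TS = n_T / N, n_S / N, n_TS / N
    return -math.log(p_TS / (p_T * p_S)) / math.log(p_TS)
```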

4.1 Sampling as Circular Validation

Our first experiment shows how CVL changes over time during Gibbs sampling. According to the definition, the arguments of CVL can be any bilingual word pair; however, we suggest that it should be calculated specifically for word pairs that are expected to be related (and thus enable transfer). In our experiments, we select word pairs in the following way.

Recall that the output of a bilingual topic model is K topics, where each language has its own distribution. For each topic k, we can calculate CVL(w_S, w_T) such that w_S and w_T belong to the same topic (i.e., are among the top C most probable words in that topic) in the two languages, respectively. Using a cardinality C for each of the K topics, we have in total C² × K bilingual word pairs in the calculation of CVL.
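
A sketch of this pair-selection step, assuming the topic-word distributions are available as matrices; names are hypothetical.

```python
import numpy as np
from itertools import product

def topic_word_pairs(phi_S, phi_T, C=5):
    """Collect the C^2 * K bilingual word pairs used to evaluate CVL and CNPMI:
    for each topic, pair every top-C source word with every top-C target word.

    phi_S, phi_T : (K, V_S) and (K, V_T) topic-word distributions
    Returns a list of (topic, source_word_id, target_word_id) triples.
    """
    pairs = []
    for k in range(phi_S.shape[0]):
        top_S = np.argsort(phi_S[k])[::-1][:C]
        top_T = np.argsort(phi_T[k])[::-1][:C]
        pairs.extend((k, w_s, w_t) for w_s, w_t in product(top_S, top_T))
    return pairs
```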

At certain iterations, we collect the topic words as described above with cardinality C = 5, and calculate CVL(w_T, w_S), CNPMI(w_T, w_S), and the error term (the \frac{1}{2}\sqrt{\cdots} term in Theorem 3) for all the bilingual word pairs. In the middle panel of Figure 2, CVL over all word pairs from topic words decreases as sampling proceeds and becomes stable by the end of sampling. Meanwhile, the correlation between CNPMI and CVL constantly decreases. This negative correlation implies that a lower CVL is associated with higher topic quality, since a higher-quality topic has higher CNPMI but lower CVL.

4.2 What the PAC-Bayes Bound Shows

Theorem 3 provides insights into how knowledge is transferred during sampling and the factors that could affect this process. We analyze the bound from two aspects: the size of the training data (corresponding to the ln n⋆/n term) and the model assumptions (as in the crosslingual entropy terms).

4.2.1 Training Data: Downsampling

One factor that could affect CVL, according to Theorem 3, is the balance of tokens of a word pair. In an extreme case, if a word type w_S has only one token while another word type w_T has a large number of tokens, the transfer from w_S to w_T is negligible. In this experiment, we test whether increasing the ratio term ln n⋆/n in the corpus lowers the performance of crosslingual transfer learning.

To this end, we specify a sample rate ρ = 0.2, 0.4, 0.6, 0.8, or 1.0. For each word pair (w_T, w_S), we calculate n as in the ratio term ln n⋆/n, and remove (1 − ρ) · n tokens from the corpus (rounded to the nearest integer). A smaller ρ removes more tokens from the corpus and thus yields a larger ratio term on average.
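
One possible reading of this downsampling protocol is sketched below. The text does not specify exactly which tokens are dropped, so removing tokens of the rarer type (which enlarges the imbalance term) is our assumption.

```python
import random

def downsample_pair(tokens_by_type, w_T, w_S, rho, rng=None):
    """Return token positions to drop for one translation pair (w_T, w_S).

    tokens_by_type maps a word type to the list of its token positions;
    n = min(|S_{w_T}|, |S_{w_S}|), and round((1 - rho) * n) tokens of the
    rarer type are dropped, enlarging ln(n*)/n for this pair on average.
    """
    rng = rng or random.Random(0)
    n = min(len(tokens_by_type.get(w_T, [])), len(tokens_by_type.get(w_S, [])))
    rarer = min((w_T, w_S), key=lambda w: len(tokens_by_type.get(w, [])))
    tokens = tokens_by_type.get(rarer, [])
    k = min(round((1 - rho) * n), len(tokens))
    return set(rng.sample(tokens, k))   # positions to remove before retraining PLTM
```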

We use a dictionary from Wiktionary to collect word pairs, where each pair (w_S, w_T) is a translation pair. Figure 3 shows the results of downsampling. Decreasing the sample rate ρ lowers topic quality. This implies that although PLTM can process comparable corpora, which need not be exact translations, one still needs to be careful about the token balance between linked document pairs.

For many low-resource languages, the target-language corpus is much smaller than the source corpus, so the effect of this imbalance is important to be aware of. This is an important issue when choosing comparable documents, and Wikipedia is an illustrative example. Although one can collect comparable documents via Wikipedia's interlanguage links, articles under the same title but in different languages can vary greatly in length, causing the sample imbalance ln n⋆/n and thus potentially suboptimal performance of crosslingual training.

Figure 2: As Gibbs sampling progresses, CVL of topic words drops, which leads to higher-quality topics and thus increases CNPMI. The left panel shows this negative correlation, and we use shades to indicate standard deviations across five chains.

Figure 3: Increasing ρ results in smaller values of ln n⋆/n for translation pairs. Topic quality, evaluated by CNPMI, increases as well (shown for each language paired with English, with cardinalities C = 5 and C = 10).

4.2.2 Model Assumptions

Recall that the crosslingual entropy term can be decomposed into different levels, e.g., the document level and the word level, and we prefer a model with low crosslingual entropy but high monolingual entropy. In this experiment, we show how these two quantities affect topic quality, using English-German (EN-DE) documents as an example.

Given the PLTM output for (EN, DE) and a cardinality C = 5, we collect C² × K bilingual word pairs as described in Section 4.1. For each word pair, we calculate three quantities: CVL, CNPMI, and the inner product of the word representations. In Figure 4, each dot is a word pair (w_S, w_T) colored by the values of these quantities. The dots are positioned by their crosslingual and monolingual entropies.

We observe that it is harder to obtain a low CVL as the document-level crosslingual entropy grows: a larger crosslingual entropy requires a larger monolingual entropy to decrease the bound, as shown in Section 2.4. On the other hand, the inner product of the word representations shows the opposite pattern to CVL, indicating a negative correlation (Lemma 1). In Figure 2 we see that the correlation between CNPMI and CVL is around −0.4 at the end of sampling, so the patterns for CNPMI in Figure 4 are less clear. However, we also notice that word pairs with higher CNPMI scores often appear at the bottom, where the crosslingual entropy is low and the monolingual entropy is high.

Figure 4: Each dot is an (EN, DE) word pair, positioned by its document-level crosslingual entropy KL(P_θ || θ) and its monolingual entropy H(P_φ); the color shows the corresponding value of the indicated quantity (CVL(w_T, w_S), φ(w_T)ᵀ · φ(w_S), or CNPMI(w_T, w_S)). Best viewed in color.

4.3 Downstream Task

We move on to crosslingual document classification as a downstream task. At various iterations of Gibbs sampling, we infer topics on the test sets for another 500 iterations and calculate the quantities shown in Figure 5 (averaged over all languages), including the H-distances for both training and test sets, and the joint risk λ.

We treat English as the source language and train support vector machines to obtain the best classifier h⋆ that fits the English documents. This classifier is then used to calculate the source and target risks R_S(h⋆) and R_T(h⋆). We also include ½ d_H(S, T), the H-distance based on the word representations φ. As mentioned in Section 3.1, we train support vector machines with languages as labels, and use the accuracy score as the H-distance.

Figure 5: Gibbs sampling optimizes CVL, which decreases the joint risk λ and the H-distances on the test data. Curves show, per language pair, the empirical risks R_T(h⋆) and R_S(h⋆), the joint risk λ, the H-distance ½ d_H(D^(S), D^(T)) on training and test documents, and ½ d_H(S, T) on the vocabularies.

The classification risks R_S(h⋆), R_T(h⋆), and λ decrease as expected (upper row in Figure 5), showing trends very similar to those of CVL in Figure 2. On the other hand, we notice that the H-distances of the training documents and of the vocabularies, ½ d_H(D^(S), D^(T)) and ½ d_H(S, T), stabilize around 0.5 to 0.6, meaning it is difficult to differentiate the languages based on their representations. Interestingly, the H-distances of the test documents remain at a less ideal value, although they decrease slightly for most languages except AR. However, recall that the target risk also depends on factors other than the H-distance (Theorem 4); we use Figure 6 to illustrate this point.

We further explore the relationship between the predictability of languages and of document classes in Figure 6. We collect documents correctly classified for both the document class and the language label, randomly choose 200 documents per language, and use θ to plot t-SNE scatterplots. Note that the two plots show the same set of documents, so the spatial relations between any two points are fixed, but we color them with different labelings. Although the classifier can identify the languages (right panel), the features are still consistent: on the left panel, the decision boundary changes direction and also successfully classifies the documents by their actual document class. This illustrates why the H-distance alone does not necessarily imply inconsistent features across languages or a high target risk.

Figure 6: t-SNE plots of the same documents, colored by document class (left: technology vs. non-technology) and by language (right: English vs. Chinese). Although the classifier identifies the languages (right), the features are still consistent based on the actual document class (left).

5 Conclusions and Future Directions

This study gives new insights into crosslingual transfer learning in multilingual topic models. By formulating the inference process as circular validation, we derive a PAC-Bayesian theorem that shows the factors affecting the success of crosslingual learning. We also connect topic model learning with downstream crosslingual tasks to show how errors propagate.

As a first step toward more theoretically justified crosslingual transfer learning, our study suggests considerations for constructing crosslingual transfer models in general. For example, an effective model should strengthen crosslingual transfer while minimizing non-transferred components, use a balanced dataset or specific optimization algorithms for low-resource languages, and support evaluation metrics that relate to CVL.


References

Sanjeev Arora, Rong Ge, Yonatan Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A Practical Algorithm for Topic Modeling with Provable Guarantees. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), pages 280–288.

Maria Barrett, Frank Keller, and Anders Søgaard. 2016. Cross-lingual Transfer of Correlations between Parts of Speech and Gaze Features. In Proceedings of COLING 2016, pages 1330–1339.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. 2006. Analysis of Representations for Domain Adaptation. In Advances in Neural Information Processing Systems 19 (NIPS 2006), pages 137–144.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Lorenzo Bruzzone and Mattia Marconcini. 2010. Domain Adaptation Problems: A DASVM Classification Technique and a Circular Validation Strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):770–787.

Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. In Proceedings of EACL 2014, pages 462–471.

Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. 2012. PAC-Bayesian Learning and Domain Adaptation. CoRR, abs/1212.2340.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235.

E. Dario Gutiérrez, Ekaterina Shutova, Patricia Lichtenstein, Gerard de Melo, and Luca Gilardi. 2016. Detecting Cross-cultural Differences Using a Multilingual Topic Model. Transactions of the Association for Computational Linguistics, 4:47–60.

Shudong Hao, Jordan L. Boyd-Graber, and Michael J. Paul. 2018. Lessons from the Bible on Modern Topics: Low-Resource Multilingual Topic Model Evaluation. In Proceedings of NAACL-HLT 2018, pages 1090–1100.

Shudong Hao and Michael J. Paul. 2018. Learning Multilingual Topics from Incomparable Corpora. In Proceedings of COLING 2018, pages 2595–2609.

Geert Heyman, Ivan Vulić, and Marie-Francine Moens. 2016. C-BiLDA: Extracting Cross-lingual Topics from Non-parallel Texts by Distinguishing Shared from Unshared Content. Data Mining and Knowledge Discovery, 30(5):1299–1323.

Yuening Hu, Jordan L. Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive Topic Modeling. Machine Learning, 95(3):423–469.

Jagadeesh Jagarlamudi and Hal Daumé III. 2010. Extracting Multilingual Topics from Unaligned Comparable Corpora. In Advances in Information Retrieval (ECIR 2010), pages 444–456.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing Crosslingual Distributed Representations of Words. In Proceedings of COLING 2012, pages 1459–1474.

Tengfei Ma and Tetsuya Nasukawa. 2017. Inverted Bilingual Topic Models for Lexicon Extraction from Non-parallel Data. In Proceedings of IJCAI 2017, pages 4075–4081.

David A. McAllester. 1999. PAC-Bayesian Model Averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (COLT 1999), pages 164–170.

Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit.

David M. Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual Topic Models. In Proceedings of EMNLP 2009, pages 880–889.

Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng Chen. 2009. Mining Multilingual Topics from Wikipedia. In Proceedings of WWW 2009, pages 1155–1156.

John Platt. 1998. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Technical report.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2018. A Survey of Cross-lingual Word Embedding Models. Journal of Artificial Intelligence Research, abs/1706.04902.

Ekaterina Shutova, Lin Sun, E. Dario Gutiérrez, Patricia Lichtenstein, and Srini Narayanan. 2017. Multilingual Metaphor Processing: Experiments with Semi-Supervised and Unsupervised Learning. Computational Linguistics, 43(1):71–123.

Wim De Smet and Marie-Francine Moens. 2009. Cross-language Linking of News Stories on the Web Using Interlingual Topic Modelling. In Proceedings of the 2nd ACM Workshop on Social Web Search and Mining (CIKM-SWSM 2009), pages 57–64.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of LREC 2012, pages 2214–2218.

Leslie G. Valiant. 1984. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142.

Erheng Zhong, Wei Fan, Qiang Yang, Olivier Verscheure, and Jiangtao Ren. 2010. Cross Validation Framework to Choose amongst Models and Datasets for Transfer Learning. In Proceedings of ECML PKDD 2010, Part III, pages 547–562.

Appendix A Notation

See Table 1.

Notation | Description
S, T | Source and target languages. They are interchangeable during Gibbs sampling; for example, when training on English and German, English can be either source or target.
w_ℓ | A word type of language ℓ.
x_ℓ | An individual token of language ℓ.
z_{x_ℓ} | The topic assignment of token x_ℓ.
S_{w_ℓ} | The sample of word type w_ℓ: the set containing all tokens x_ℓ of this word type.
P_{x_ℓ}, P_{x_ℓ,k} | P_{x_ℓ} denotes the conditional distribution over all topics for token x_ℓ; the conditional probability of sampling topic k from P_{x_ℓ} is denoted P_{x_ℓ,k}.
D^(ℓ) | The set of documents in language ℓ (usually the test corpus).
D̂^(ℓ) | The array of document representations from the corpus D^(ℓ) and their document labels.
ϕ_k^(ℓ) | The empirical distribution over the vocabulary of language ℓ for topic k = 1, ..., K.
φ(w) | The word representation, i.e., the empirical distribution over K topics for a word type w, obtained by re-normalizing ϕ_k^(ℓ).
θ(d) | The document representation, i.e., the empirical distribution over K topics for a document d.

Table 1: Notation.

Appendix B Proofs

Theorem 1. Let \widehat{\mathrm{CVL}}^{(t)}(w_T, w_S) be the empirical circular validation loss of any bilingual word pair at iteration t of Gibbs sampling. Then \widehat{\mathrm{CVL}}^{(t)}(w_T, w_S) converges as t → ∞.

Proof. We first notice the triangle inequality:

\big| \widehat{\mathrm{CVL}}^{(t)}(w_T, w_S) - \widehat{\mathrm{CVL}}^{(t-1)}(w_T, w_S) \big|
= \Big| \mathbb{E}_{x_S, x_T}\big[ \widehat{L}^{(t)}(x_T, w_S) + \widehat{L}^{(t)}(x_S, w_T) \big] - \mathbb{E}_{x_S, x_T}\big[ \widehat{L}^{(t-1)}(x_T, w_S) + \widehat{L}^{(t-1)}(x_S, w_T) \big] \Big|
\le \Big| \Delta \mathbb{E}_{x_T \in S_{w_T}}\big[ \widehat{L}(x_T, w_S) \big] \Big| + \Big| \Delta \mathbb{E}_{x_S \in S_{w_S}}\big[ \widehat{L}(x_S, w_T) \big] \Big|,

where Δ denotes the difference between iterations t and t − 1. We examine the first term; the other can be treated in the same way. Let \bar{P}_{x_T} denote the invariant distribution of the conditional P^{(t)}_{x_T} as t → ∞, and let

\bar{P}_{x_T, z_{x_S}} = \Pr\big( k = z_{x_S}; w = w_{x_T}, \mathbf{z}_-, \mathbf{w}_- \big)

be the corresponding conditional probability of token x_T being assigned topic z_{x_S}. We additionally assume that once the source language has converged, its states are kept fixed, i.e., z^{(t)}_{x_S} = z^{(t-1)}_{x_S}, and only the target language is sampled. Taking the difference between the expectations at iterations t and t − 1, we have

\lim_{t\to\infty} \Big| \Delta \mathbb{E}_{x_T \in S_{w_T}}\big[ \widehat{L}(x_T, w_S) \big] \Big|
= \lim_{t\to\infty} \frac{1}{n_{w_S}} \sum_{x_S \in S_{w_S}} \mathbb{E}_{x_T \in S_{w_T}} \Big[ \big| \mathbb{E}_{h \sim P^{(t)}_{x_T}} \mathbf{1}\{ h(x_S) \neq z_{x_S} \} - \mathbb{E}_{h \sim P^{(t-1)}_{x_T}} \mathbf{1}\{ h(x_S) \neq z_{x_S} \} \big| \Big]
= \lim_{t\to\infty} \frac{1}{n_{w_S}} \sum_{x_S \in S_{w_S}} \mathbb{E}_{x_T \in S_{w_T}} \Big[ \big| \big( 1 - P^{(t)}_{x_T, z_{x_S}} \big) - \big( 1 - P^{(t-1)}_{x_T, z_{x_S}} \big) \big| \Big]
= \lim_{t\to\infty} \frac{1}{n_{w_S}} \sum_{x_S \in S_{w_S}} \mathbb{E}_{x_T \in S_{w_T}} \Big[ \big| P^{(t-1)}_{x_T, z_{x_S}} - P^{(t)}_{x_T, z_{x_S}} \big| \Big]
= \frac{1}{n_{w_S}} \sum_{x_S \in S_{w_S}} \mathbb{E}_{x_T \in S_{w_T}} \Big[ \big| \bar{P}_{x_T, z_{x_S}} - \bar{P}_{x_T, z_{x_S}} \big| \Big] = 0.

Therefore,

\lim_{t\to\infty} \big| \widehat{\mathrm{CVL}}^{(t)}(w_T, w_S) - \widehat{\mathrm{CVL}}^{(t-1)}(w_T, w_S) \big| \le \lim_{t\to\infty} \Big| \Delta \mathbb{E}_{x_T \in S_{w_T}}\big[ \widehat{L}(x_T, w_S) \big] \Big| + \Big| \Delta \mathbb{E}_{x_S \in S_{w_S}}\big[ \widehat{L}(x_S, w_T) \big] \Big| = 0.

Theorem 3. Given a bilingual word pair (w_T, w_S), with probability at least 1 − δ, the following bound holds:

\mathrm{CVL}(w_T, w_S) \le \widehat{\mathrm{CVL}}(w_T, w_S) + \frac{1}{2} \sqrt{ \frac{1}{n} \left( \mathrm{KL}_{w_T} + \mathrm{KL}_{w_S} + 2 \ln \frac{2}{\delta} \right) } + \frac{\ln n^\star}{n},
\qquad n = \min\{ n_{w_T}, n_{w_S} \}, \quad n^\star = \max\{ n_{w_T}, n_{w_S} \}.

For brevity we use KL_w to denote KL(P_x || Q_x), where P_x is the conditional distribution from Gibbs sampling of the token x of word type w that gives the highest loss L(x, w), and Q_x is a prior.

Proof. From Theorem 2, for the target language, with probability at least 1 − δ,

L(x_T, w_S) \le \widehat{L}(x_T, w_S) + \sqrt{ \frac{ \mathrm{KL}(P_{x_T} \| Q_{x_T}) + \ln\frac{2\sqrt{n_{w_S}}}{\delta} }{ 2 n_{w_S} } }
= \widehat{L}(x_T, w_S) + \sqrt{ \frac{ \mathrm{KL}(P_{x_T} \| Q_{x_T}) + \ln\frac{2}{\delta} + \frac{\ln n_{w_S}}{2} }{ 2 n_{w_S} } } \equiv \widehat{L}(x_T, w_S) + \epsilon(x_T, w_S).

For the source language, similarly, with probability at least 1 − δ,

L(x_S, w_T) \le \widehat{L}(x_S, w_T) + \sqrt{ \frac{ \mathrm{KL}(P_{x_S} \| Q_{x_S}) + \ln\frac{2}{\delta} + \frac{\ln n_{w_T}}{2} }{ 2 n_{w_T} } } \equiv \widehat{L}(x_S, w_T) + \epsilon(x_S, w_T).

Given a word type, only the KL-divergence term in ϵ varies across tokens. Thus we use KL_{w_S} and KL_{w_T} to denote the maximal values of the KL-divergence over all the tokens:

\mathrm{KL}_{w_S} = \mathrm{KL}\big( P_{x^\star_T} \,\|\, Q_{x^\star_T} \big), \quad x^\star_T = \arg\max_{x_T \in S_{w_T}} \epsilon(x_T, w_S);
\mathrm{KL}_{w_T} = \mathrm{KL}\big( P_{x^\star_S} \,\|\, Q_{x^\star_S} \big), \quad x^\star_S = \arg\max_{x_S \in S_{w_S}} \epsilon(x_S, w_T).

Let n = min{n_{w_T}, n_{w_S}} and n⋆ = max{n_{w_T}, n_{w_S}}. Using the fact that \sqrt{x} + \sqrt{y} \le \sqrt{2}\sqrt{x + y} for x, y > 0, we have

\mathrm{CVL}(w_T, w_S) = \frac{1}{2}\,\mathbb{E}_{x_S, x_T}\big[ L(x_T, w_S) + L(x_S, w_T) \big]
\le \frac{1}{2}\Big( \mathbb{E}_{x_T \in S_{w_T}} \widehat{L}(x_T, w_S) + \mathbb{E}_{x_S \in S_{w_S}} \widehat{L}(x_S, w_T) \Big) + \frac{1}{2}\Big( \mathbb{E}_{x_T \in S_{w_T}} \epsilon(x_T, w_S) + \mathbb{E}_{x_S \in S_{w_S}} \epsilon(x_S, w_T) \Big)
= \widehat{\mathrm{CVL}}(w_T, w_S) + \frac{1}{2}\Big( \mathbb{E}_{x_T \in S_{w_T}} \epsilon(x_T, w_S) + \mathbb{E}_{x_S \in S_{w_S}} \epsilon(x_S, w_T) \Big)
\le \widehat{\mathrm{CVL}}(w_T, w_S) + \frac{1}{2}\big( \epsilon(x^\star_T, w_S) + \epsilon(x^\star_S, w_T) \big)
\le \widehat{\mathrm{CVL}}(w_T, w_S) + \frac{1}{2}\left( \sqrt{ \frac{1}{2 n_{w_T}}\Big( \mathrm{KL}_{w_T} + \ln\frac{2}{\delta} + \frac{1}{2}\ln n_{w_T} \Big) } + \sqrt{ \frac{1}{2 n_{w_S}}\Big( \mathrm{KL}_{w_S} + \ln\frac{2}{\delta} + \frac{1}{2}\ln n_{w_S} \Big) } \right)
\le \widehat{\mathrm{CVL}}(w_T, w_S) + \frac{1}{2}\sqrt{ \frac{ \mathrm{KL}_{w_T} + \mathrm{KL}_{w_S} + 2\ln\frac{2}{\delta} }{ n } } + \frac{\ln(n_{w_T} \cdot n_{w_S})}{2n}
\le \widehat{\mathrm{CVL}}(w_T, w_S) + \frac{1}{2}\sqrt{ \frac{ \mathrm{KL}_{w_T} + \mathrm{KL}_{w_S} + 2\ln\frac{2}{\delta} }{ n } } + \frac{\ln n^\star}{n},

which gives us the result.

Lemma 1. Given any bilingual word pair (w_T, w_S), let φ(w) denote the distribution over topics of word type w. Then we have

1 - \phi(w_T)^\top \cdot \phi(w_S) \le \mathrm{CVL}(w_T, w_S).

Proof. We expand the definition of ĈVL as follows:

\widehat{\mathrm{CVL}}(w_T, w_S) = \frac{1}{2}\,\mathbb{E}_{x_S, x_T}\big[ \widehat{L}(x_T, w_S) + \widehat{L}(x_S, w_T) \big]
= \frac{1}{2}\left( \frac{ \sum_{x_T \in S_{w_T}} \sum_{x_S \in S_{w_S}} \mathbb{E}_{h \sim P_{x_T}}\big[ \mathbf{1}\{ h(x_S) \neq z_{x_S} \} \big] }{ n_{w_T} \cdot n_{w_S} } + \frac{ \sum_{x_S \in S_{w_S}} \sum_{x_T \in S_{w_T}} \mathbb{E}_{h \sim P_{x_S}}\big[ \mathbf{1}\{ h(x_T) \neq z_{x_T} \} \big] }{ n_{w_S} \cdot n_{w_T} } \right)
= \frac{1}{2}\left( \frac{ \sum_{x_T} \sum_{x_S} \big( 1 - P_{x_T, z_{x_S}} \big) }{ n_{w_T} \cdot n_{w_S} } + \frac{ \sum_{x_S} \sum_{x_T} \big( 1 - P_{x_S, z_{x_T}} \big) }{ n_{w_S} \cdot n_{w_T} } \right)
= 1 - \frac{1}{2}\left( \frac{ \sum_{x_T} \sum_{x_S} P_{x_T, z_{x_S}} }{ n_{w_T} \cdot n_{w_S} } + \frac{ \sum_{x_S} \sum_{x_T} P_{x_S, z_{x_T}} }{ n_{w_S} \cdot n_{w_T} } \right)
= 1 - \frac{1}{2} \sum_{k=1}^{K} \left( \frac{ n_{k|w_S} \cdot \sum_{x_T \in S_{w_T}} P_{x_T, k} }{ n_{w_T} \cdot n_{w_S} } + \frac{ n_{k|w_T} \cdot \sum_{x_S \in S_{w_S}} P_{x_S, k} }{ n_{w_S} \cdot n_{w_T} } \right)
= 1 - \frac{1}{2} \sum_{k=1}^{K} \left( \phi(w_S)_k \cdot \frac{ \sum_{x_T \in S_{w_T}} P_{x_T, k} }{ n_{w_T} } + \phi(w_T)_k \cdot \frac{ \sum_{x_S \in S_{w_S}} P_{x_S, k} }{ n_{w_S} } \right)
\ge 1 - \frac{1}{2} \sum_{k=1}^{K} \left( \phi(w_S)_k \cdot \frac{ n_{k|w_T} }{ n_{w_T} } + \phi(w_T)_k \cdot \frac{ n_{k|w_S} }{ n_{w_S} } \right)
= 1 - \frac{1}{2} \sum_{k=1}^{K} \Big( \phi(w_S)_k \cdot \phi(w_T)_k + \phi(w_T)_k \cdot \phi(w_S)_k \Big)
= 1 - \phi(w_S)^\top \cdot \phi(w_T),

which concludes the proof.

Theorem 5. Let θ^(d_S) be the distribution over topics for document d_S (similarly for d_T), and let

F(d_S, d_T) = \Big( \sum_{w_S} \big( f^{d_S}_{w_S} \big)^2 \cdot \sum_{w_T} \big( f^{d_T}_{w_T} \big)^2 \Big)^{\frac{1}{2}},

where f^d_w is the normalized frequency of word w in document d, and let K be the number of topics. Then

\theta^{(d_S)\top} \cdot \theta^{(d_T)} \le F(d_S, d_T) \cdot \sqrt{K} \cdot \Big( \sum_{w_S, w_T} \big( \mathrm{CVL}(w_T, w_S) - 1 \big)^2 \Big)^{\frac{1}{2}}.

Proof. We first expand the inner product θ^(d_S)ᵀ · θ^(d_T) as follows:

\theta^{(d_S)\top} \cdot \theta^{(d_T)} = \sum_{k=1}^{K} \theta^{(d_S)}_k \cdot \theta^{(d_T)}_k
= \sum_{k=1}^{K} \Big( \sum_{w_S \in V^{(S)}} f^{d_S}_{w_S} \cdot \phi(w_S)_k \Big) \cdot \Big( \sum_{w_T \in V^{(T)}} f^{d_T}_{w_T} \cdot \phi(w_T)_k \Big)
\le F(d_S, d_T) \cdot \sum_{k=1}^{K} \Big( \sum_{w_S \in V^{(S)}} \phi(w_S)_k^2 \Big)^{\frac{1}{2}} \cdot \Big( \sum_{w_T \in V^{(T)}} \phi(w_T)_k^2 \Big)^{\frac{1}{2}},

where

F(d_S, d_T) = \Big( \sum_{w_S \in V^{(S)}} \big( f^{d_S}_{w_S} \big)^2 \Big)^{\frac{1}{2}} \cdot \Big( \sum_{w_T \in V^{(T)}} \big( f^{d_T}_{w_T} \big)^2 \Big)^{\frac{1}{2}}

is a constant independent of the topic k, and the last inequality is due to Hölder's inequality. We then focus on the topic-dependent part of the last expression:

\sum_{k=1}^{K} \Big( \sum_{w_S} \phi(w_S)_k^2 \Big)^{\frac{1}{2}} \cdot \Big( \sum_{w_T} \phi(w_T)_k^2 \Big)^{\frac{1}{2}}
= \sum_{k=1}^{K} \Big( \sum_{w_S, w_T} \big( \phi(w_S)_k \cdot \phi(w_T)_k \big)^2 \Big)^{\frac{1}{2}}
\le \sqrt{K} \cdot \Big( \sum_{k=1}^{K} \sum_{w_S, w_T} \big( \phi(w_S)_k \cdot \phi(w_T)_k \big)^2 \Big)^{\frac{1}{2}}
= \sqrt{K} \cdot \Big( \sum_{w_S, w_T} \sum_{k=1}^{K} \big( \phi(w_S)_k \cdot \phi(w_T)_k \big)^2 \Big)^{\frac{1}{2}}
\le \sqrt{K} \cdot \Big( \sum_{w_S, w_T} \Big( \sum_{k=1}^{K} \phi(w_S)_k \cdot \phi(w_T)_k \Big)^2 \Big)^{\frac{1}{2}}
= \sqrt{K} \cdot \Big( \sum_{w_S, w_T} \big( \phi(w_T)^\top \cdot \phi(w_S) \big)^2 \Big)^{\frac{1}{2}}.

Thus, we have the following inequality:

\theta^{(d_S)\top} \cdot \theta^{(d_T)} \le F(d_S, d_T) \cdot \sqrt{K} \cdot \Big( \sum_{w_S, w_T} \big( \phi(w_T)^\top \cdot \phi(w_S) \big)^2 \Big)^{\frac{1}{2}}.

Plugging in Lemma 1, we see that

\theta^{(d_S)\top} \cdot \theta^{(d_T)} \le F(d_S, d_T) \cdot \sqrt{K} \cdot \Big( \sum_{w_S, w_T} \big( \mathrm{CVL}(w_T, w_S) - 1 \big)^2 \Big)^{\frac{1}{2}}.

Appendix C Dataset Details

C.1 Pre-processing

For all languages, we use existing stemmers to stem words in the corpora and the entries in Wiktionary. Since Chinese does not have stemmers, we loosely use "stem" to mean "segment" Chinese sentences into words. We also use fixed stopword lists to filter out stop words. Table 2 lists the sources of the stemmers and stopword lists.

1 http://snowball.tartarus.org; 2 http://arabicstemmer.com; 3 https://github.com/6/stopwords-json; 4 https://github.com/fxsjy/jieba.

C.2 Training Sets

Our training set is a comparable corpus from Wikipedia. For each Wikipedia article, there exists an interlanguage link to view the article in another language. This link provides the same article in different languages and is commonly used to create comparable corpora in multilingual studies. We show the statistics of this training corpus in Table 3. The numbers are calculated after stemming and lemmatization.

C.3 Test Sets

C.3.1 Topic Coherence Evaluation Sets

Topic coherence evaluation for multilingual topic models was proposed by Hao et al. (2018), where a comparable corpus is used to calculate bilingual word pair co-occurrences and CNPMI scores. We use a Wikipedia corpus to calculate this score, and the statistics are shown in Table 4. This Wikipedia corpus does not overlap with the training set.

C.3.2 Unseen Document Inference

We use the Global Voices (GV) corpus to create test sets, which can be retrieved directly from https://globalvoices.org or from the OPUS collection at http://opus.nlpl.eu/GlobalVoices.php. We show the statistics in Table 5. After the column showing the number of documents, we also include the statistics of the specific labels. The multiclass labels are mutually exclusive, and each document has only one label.

Note that although all the language pairs share the same set of English test documents, the document representations are inferred from different topic models trained specifically for each language pair. Thus, the document representations for the same English document differ across language pairs.

Lastly, the number of word types is based on the training set after stemming and lemmatization. When a word type in the test set does not appear in the training set, we ignore it.

C.3.3 Wiktionary

In the downsampling experiments (Section 4.2), we use the English Wiktionary to create bilingual dictionaries, which can be downloaded at https://dumps.wikimedia.org/enwiktionary/.


Language | Family | Stemmer | Stopwords
AR | Semitic | Assem's Arabic Light Stemmer 1 | GitHub 2
DE | Germanic | SnowBallStemmer 3 | NLTK
EN | Germanic | SnowBallStemmer | NLTK
ES | Romance | SnowBallStemmer | NLTK
RU | Slavic | SnowBallStemmer | NLTK
ZH | Sinitic | Jieba 4 | GitHub

Table 2: Sources of the stemmers and stopword lists used in the experiments.

English
Language | #docs | #tokens | #types
AR | 3,000 | 724,362 | 203,024
DE | 3,000 | 409,381 | 125,071
ES | 3,000 | 451,115 | 134,241
RU | 3,000 | 480,715 | 142,549
ZH | 3,000 | 480,142 | 141,679

Paired language
Language | #docs | #tokens | #types
AR | 3,000 | 223,937 | 61,267
DE | 3,000 | 285,745 | 125,169
ES | 3,000 | 276,188 | 95,682
RU | 3,000 | 276,462 | 96,568
ZH | 3,000 | 233,773 | 66,275

Table 3: Statistics of the Wikipedia training corpus.

Appendix D Topic Model Configurations

For each experiment, we run five chains of Gibbs sampling using the Polylingual Topic Model implemented in MALLET, 5 and average over all chains. Each chain runs for 1,000 iterations, and we do not set a burn-in period. We set the topic number K = 50. The other hyperparameters are α = 50/K = 1 and β = 0.01, which are the default settings. We do not enable hyperparameter optimization procedures.

5 http://mallet.cs.umass.edu/topics-polylingual.php.

English
Language | #docs | #tokens | #types
AR | 10,000 | 3,092,721 | 143,504
DE | 10,000 | 2,779,963 | 146,757
ES | 10,000 | 3,021,732 | 149,423
RU | 10,000 | 3,016,795 | 154,442
ZH | 10,000 | 1,982,452 | 112,174

Paired language
Language | #docs | #tokens | #types
AR | 10,000 | 1,477,312 | 181,734
DE | 10,000 | 1,702,101 | 227,205
ES | 10,000 | 1,737,312 | 142,086
RU | 10,000 | 2,299,332 | 284,447
ZH | 10,000 | 1,335,922 | 144,936

Table 4: Statistics of the Wikipedia corpus for topic coherence evaluation (CNPMI).

Language | #docs | #tokens | #types
EN | 11,012 | 3,838,582 | 104,164
AR | 1,086 | 314,918 | 53,030
DE | 773 | 334,611 | 38,702
ES | 7,470 | 3,454,304 | 110,134
RU | 1,035 | 454,380 | 67,202
ZH | 1,590 | 804,720 | 61,319

Language | #tech. | #culture | #edu.
EN | 4,384 | 4,679 | 1,949
AR | 457 | 430 | 199
DE | 315 | 294 | 164
ES | 2,961 | 3,121 | 1,388
RU | 362 | 456 | 217
ZH | 619 | 622 | 349

Table 5: Statistics of the Global Voices (GV) corpus.