
MULTILINGUAL HIERARCHICAL ATTENTION NETWORKS FOR DOCUMENT CLASSIFICATION

Nikolaos Pappas Andrei Popescu-Belis

Idiap-RR-17-2017

Version of SEPTEMBER 08, 2017

Centre du Parc, Rue Marconi 19, P.O. Box 592, CH-1920 Martigny. T +41 27 721 77 11, F +41 27 721 77 12, [email protected], www.idiap.ch


Multilingual Hierarchical Attention Networks for Document Classification

Nikolaos Pappas, Idiap Research Institute
Rue Marconi 19, CH-1920 Martigny, Switzerland

[email protected]

Andrei Popescu-Belis, Idiap Research Institute
Rue Marconi 19, CH-1920 Martigny, Switzerland

[email protected]

Abstract

Hierarchical attention networks have recently achieved remarkable performance for document classification in a given language. However, when multilingual document collections are considered, training such models separately for each language entails linear parameter growth and lack of cross-language transfer. Learning a single multilingual model with fewer parameters is therefore a challenging but potentially beneficial objective. To this end, we propose multilingual hierarchical attention networks for learning document structures, with shared encoders and/or attention mechanisms across languages, using multi-task learning and an aligned semantic space as input. We evaluate the proposed models on multilingual document classification with disjoint label sets, on a large dataset which we provide, with 600k news documents in 8 languages and 5k labels. The multilingual models outperform strong monolingual ones in low-resource as well as full-resource settings, and use fewer parameters, thus confirming their computational efficiency and the utility of cross-language transfer.

1 Introduction

Learning word sequence representations has become increasingly useful for a variety of NLP tasks such as document classification (Tang et al., 2015; Yang et al., 2016), neural machine translation (NMT) (Cho et al., 2014; Luong et al., 2015), question answering (Chen et al., 2015; Kumar et al., 2015) and summarization (Rush et al., 2015).

Figure 1: Vectors of documents labeled with ‘Europe’, ‘Culture’ and their Arabic counterparts. The multilingual hierarchical attention network separates topics better than monolingual ones.

However, when data are available in multiple languages, representation learning must address two main challenges. Firstly, the computational cost of training separate models for each language grows linearly with their number, or even quadratically in the case of multi-way multilingual NMT (Firat et al., 2016a). Secondly, the models should be capable of cross-language transfer, which is an important component of human language learning (Ringbom, 2007). For instance, Johnson et al. (2016) attempted to use a single sequence-to-sequence neural network model for NMT across multiple language pairs.

Previous studies in document classification attempted to overcome these issues by employing multilingual word embeddings which allow direct comparisons and groupings across languages (Klementiev et al., 2012; Hermann and Blunsom, 2014; Ferreira et al., 2016). However, they are only applicable when common label sets are available across languages, which is often not the case (e.g. Wikipedia or news). Moreover, despite recent advances in monolingual document modeling (Tang et al., 2015; Yang et al., 2016), multilingual models are still based on shallow approaches.



In this paper, we propose Multilingual Hierarchical Attention Networks to learn shared document structures across languages for document classification with disjoint label sets, as opposed to training hierarchical attention networks (HANs) in a language-specific manner (Yang et al., 2016). Our networks have a hierarchical structure with word and sentence encoders, along with attention mechanisms. Each of these can either be shared across languages or kept language-specific. To enable cross-language transfer, the networks are trained with multi-task learning across languages, using an aligned semantic space as input. Fig. 1 displays document vectors, projected with t-SNE (van der Maaten, 2009), for two topics and two languages, either learned by monolingual HANs (a) or by our multilingual HAN (b). The multilingual HAN achieves better separation between the ‘Europe’ and ‘Culture’ topics in English as a result of the knowledge transfer from Arabic.

We evaluate our model against strong monolingual baselines, in low-resource and full-resource scenarios, on a large multilingual document collection with 600k documents, labeled with general (1.2k) and specific topics (4.4k), from Deutsche Welle – which we will make available. Our multilingual models outperform monolingual ones in both scenarios, thus confirming the utility of cross-language transfer and the computational efficiency of the proposed architecture. To encourage further research in multilingual representation learning, our code and dataset are made available at https://github.com/idiap/mhan.

2 Related Work

Research on learning multilingual word representations is based on early work on word embeddings (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014). The goal is to learn an aligned word embedding space for multiple languages by leveraging bilingual dictionaries (Faruqui and Dyer, 2014; Ammar et al., 2016), parallel sentences (Gouws et al., 2015) or comparable documents such as Wikipedia pages (Yih et al., 2011; Al-Rfou et al., 2013). Bilingual embeddings were learned from word alignments using neural language models (Klementiev et al., 2012; Zou et al., 2013), including auto-encoders (Chandar et al., 2014). Despite progress at the word level, the document level remains comparatively less explored: the approaches proposed by Hermann and Blunsom (2014) or Ferreira et al. (2016) are based on shallow modeling and are applicable only to classification tasks with label sets shared across languages. These sets are costly to produce, and are often unavailable; here, we remove this constraint. We also develop deeper multilingual document models with hierarchical structure, based on prior art at the word level.

Early work on neural document classification was based on shallow feed-forward networks, which required unsupervised pre-training (Le and Mikolov, 2014). Later studies focused on neural networks with hierarchical structure. Kim (2014) proposed a convolutional neural network (CNN) for sentence classification. Johnson and Zhang (2015) proposed a CNN for high-dimensional data classification, while Zhang et al. (2015) adopted a character-level CNN for text classification. Lai et al. (2015) proposed a recurrent CNN to capture sequential information, which outperformed simpler CNNs. Lin et al. (2015) and Tang et al. (2015) proposed hierarchical recurrent NNs and showed that they were superior to CNN-based models. Recently, Yang et al. (2016) demonstrated that a hierarchical attention network with bi-directional gated encoders outperforms traditional and neural baselines. Using such networks in multilingual settings has two drawbacks: the computational complexity increases linearly with the number of languages, and knowledge is acquired separately for each language. We address these issues by proposing a new multilingual model based on HANs, which learns shared document structures and transfers knowledge across languages.

Early examples of attention mechanisms appeared in computer vision tasks such as optical character recognition (Larochelle and Hinton, 2010) and image tracking (Denil et al., 2012) or classification (Mnih et al., 2014). In text classification, studies which aimed to learn the importance of sentences include Yessenalina et al. (2010), Pappas and Popescu-Belis (2014), Yang et al. (2016) and, more recently, Pappas and Popescu-Belis (2017) and Ji and Smith (2017). In NMT, Bahdanau et al. (2015) proposed an attention-based encoder-decoder network, while Luong et al. (2015) proposed a local and ensemble attention model for NMT. Firat et al. (2016a) proposed a single encoder-decoder model with shared attention across language pairs for multi-way, multilingual NMT. Hermann et al. (2015) developed attention-based document readers for question answering. Chen et al. (2015) proposed a recurrent attention model over an external memory.


Similarly, Kumar et al. (2015) introduced a dynamic memory network for question answering and other tasks. We propose hierarchical models with shared multi-level attention across languages, which, to our knowledge, has not been attempted before.

3 Background: Hierarchical Attention Networks for Document Classification

We adopt a general hierarchical attention architecture for document representation, derived from the specific one proposed by Yang et al. (2016) and displayed in Figure 2. The goal of this generality is to be able to represent a large number of architectures rather than individual ones. We consider a dataset D = {(x_i, y_i), i = 1, ..., N} made of N documents x_i with labels y_i ∈ {0, 1}^k. Each document x_i = {w_11, w_12, ..., w_KT} is represented by the sequence of d-dimensional embeddings of its words grouped into sentences, T being the maximum number of words in a sentence and K the maximum number of sentences in a document.

The network takes as input a document x_i and outputs a document vector u_i. In particular, it has two levels of abstraction, word vs. sentence. The former is made of an encoder g_w with parameters H_w and an attention a_w with parameters A_w, while the latter similarly includes an encoder and an attention (g_s, H_s and a_s, A_s). The output u_i is used by the classification layer to determine y_i.
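To make this input representation concrete, here is a minimal sketch in Python/NumPy (with a hypothetical toy vocabulary and random embeddings; not the released implementation) of how one document becomes a K × T grid of d-dimensional word embeddings, zero-padded as in our experimental settings (Section 6.1):

    import numpy as np

    # Toy dimensions following Section 3: d-dim embeddings, at most K sentences
    # of at most T words each; out-of-vocabulary words fall back to padding here.
    d, K, T = 40, 30, 30
    rng = np.random.default_rng(0)
    vocab = {"<pad>": 0, "europe": 1, "culture": 2, "news": 3}   # hypothetical
    E = rng.normal(size=(len(vocab), d))
    E[0] = 0.0                                # padding embeds to the zero vector

    def doc_to_tensor(sentences):
        """Map a tokenized document (list of token lists) to a K x T x d array."""
        x = np.zeros((K, T, d))
        for i, sent in enumerate(sentences[:K]):
            for t, tok in enumerate(sent[:T]):
                x[i, t] = E[vocab.get(tok, 0)]
        return x

    x_i = doc_to_tensor([["europe", "news"], ["culture"]])
    print(x_i.shape)                          # (30, 30, 40)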

[Figure 2: word-level encoders H_w with attention α_w over the words w_11 ... w_KT of each sentence produce sentence vectors s_1 ... s_K; a sentence-level encoder H_s with attention α_s produces the document vector u.]

Figure 2: General architecture of hierarchical attention neural networks for modeling documents.

3.1 Encoder Layers

At the word level, the function g_w encodes the sequence of input words {w_it | t = 1, ..., T} for each sentence i of the document, noted as:

h_w^{(it)} = g_w(w_{it}), \quad t \in [1, T]    (1)

and at the sentence level, after combining the intermediate word vectors {h_w^{(it)} | t = 1, ..., T} into a sentence vector s_i (see Section 3.2), the function g_s encodes the sequence of sentence vectors {s_i | i = 1, ..., K}, noted h_s^{(i)}.

The g_w and g_s functions can be any feed-forward or recurrent networks with parameters H_w and H_s respectively. We consider the following networks: a fully-connected one, noted as Dense; a Gated Recurrent Unit network (Cho et al., 2014), noted as GRU¹; and a bi-directional GRU which captures temporal information forward and backward in time, noted as biGRU. The latter is defined as a concatenation of the hidden states for each input vector obtained from the forward GRU, \overrightarrow{g}_w, and the backward GRU, \overleftarrow{g}_w:

h_w^{(it)} = \big[\, \overrightarrow{g}_w(h_w^{(it)})\, ;\ \overleftarrow{g}_w(h_w^{(it)}) \,\big]    (2)

The same concatenation is applied for the hidden-state representation of a sentence, h_s^{(i)}.
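As an illustration of these encoder choices, the following NumPy sketch implements a Dense encoder, a plain GRU and the biGRU concatenation of Eq. 2; it is a simplified stand-in for the released implementation, with toy dimensions and randomly initialized parameters:

    import numpy as np

    rng = np.random.default_rng(1)
    d, dw = 40, 100                     # input embedding size, encoder dimension

    def dense_encoder(X, W, b):
        """Dense g_w (Eq. 1): encode each word embedding independently; X: (T, d)."""
        return np.maximum(X @ W + b, 0.0)          # relu, as in our settings

    def gru_encoder(X, params):
        """Minimal GRU over a sequence X of shape (T, d); returns all hidden states."""
        Wz, Uz, Wr, Ur, Wh, Uh = params
        h, H = np.zeros(Uz.shape[0]), []
        for x in X:
            z = 1 / (1 + np.exp(-(Wz @ x + Uz @ h)))        # update gate
            r = 1 / (1 + np.exp(-(Wr @ x + Ur @ h)))        # reset gate
            h_tilde = np.tanh(Wh @ x + Uh @ (r * h))        # candidate state
            h = (1 - z) * h + z * h_tilde                   # Cho et al. (2014)
            H.append(h)
        return np.stack(H)

    def bigru_encoder(X, params_fwd, params_bwd):
        """biGRU (Eq. 2): concatenate forward and backward hidden states per step."""
        Hf = gru_encoder(X, params_fwd)
        Hb = gru_encoder(X[::-1], params_bwd)[::-1]
        return np.concatenate([Hf, Hb], axis=1)

    def make_gru_params(d_in, d_out, rng):
        return tuple(rng.normal(scale=0.1, size=s)
                     for s in [(d_out, d_in), (d_out, d_out)] * 3)

    X = rng.normal(size=(7, d))          # one sentence of 7 word embeddings
    H = bigru_encoder(X, make_gru_params(d, dw, rng), make_gru_params(d, dw, rng))
    print(H.shape)                       # (7, 200)

The sentence-level encoder g_s is obtained in the same way, applied to the sequence of sentence vectors s_1, ..., s_K.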

3.2 Attention Layers

A typical way to obtain a representation for a given word sequence at each level is to take the last hidden-state vector output by the encoder. However, it is hard to encode all the relevant input information in a fixed-length vector. This problem is addressed by introducing an attention mechanism at each level (noted α_w and α_s) that estimates the importance of each hidden-state vector for the representation of the sentence or document meaning, respectively. The sentence vector s_i \in \mathbb{R}^{d_w}, where d_w is the dimension of the word encoder, is thus obtained as follows:

s_i = \frac{1}{T} \sum_{t=1}^{T} \alpha_w^{(it)} h_w^{(it)} = \frac{1}{T} \sum_{t=1}^{T} \frac{\exp(v_{it}^\top u_w)}{\sum_j \exp(v_{ij}^\top u_w)}\, h_w^{(it)}    (3)

where v_{it} = f_w(h_w^{(it)}) is a fully-connected neural network with parameters W_w. Similarly, the document vector u \in \mathbb{R}^{d_s}, where d_s is the dimension of the sentence encoder, is obtained as follows:

u = \frac{1}{K} \sum_{i=1}^{K} \alpha_s^{(i)} h_s^{(i)} = \frac{1}{K} \sum_{i=1}^{K} \frac{\exp(v_i^\top u_s)}{\sum_j \exp(v_j^\top u_s)}\, h_s^{(i)}    (4)

where v_i = f_s(h_s^{(i)}) is a fully-connected neural network with parameters W_s. The vectors u_w and u_s are parameters which encode the word context and the sentence context respectively, and are learned jointly with the rest of the parameters. The total set of parameters for a_w is A_w = {W_w, u_w} and for a_s it is A_s = {W_s, u_s}.

¹GRU is a simplified version of the Long Short-Term Memory unit, LSTM (Hochreiter and Schmidhuber, 1997).
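To make Eqs. 3–4 concrete, here is a small NumPy sketch of the attention pooling used at both levels; the tanh activation of the fully-connected network f and the toy dimensions are our own assumptions, and the constant 1/T factor of Eq. 3 is omitted since it only rescales the result:

    import numpy as np

    def attention_pool(H, W, b, u):
        """Attention pooling over hidden states H of shape (steps, dim), cf. Eqs. 3-4.
        v_t = f(h_t) is a one-layer network; softmax(v_t . u) weights each step."""
        V = np.tanh(H @ W + b)                  # v_t for every step
        scores = V @ u - (V @ u).max()          # v_t^T u, shifted for stability
        alpha = np.exp(scores) / np.exp(scores).sum()
        return alpha @ H, alpha                 # pooled vector and weights

    # Toy usage with hypothetical sizes: 7 word states of dimension 200.
    rng = np.random.default_rng(2)
    H = rng.normal(size=(7, 200))
    W, b, u = rng.normal(size=(200, 200)), np.zeros(200), rng.normal(size=200)
    s_i, alpha_w = attention_pool(H, W, b, u)
    print(s_i.shape, round(alpha_w.sum(), 2))   # (200,) 1.0

The same function, with parameters W_s and u_s, pools the sentence states h_s^{(i)} into the document vector u.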


3.3 Classification Layers

The output of such a network is typically fed to a softmax layer for classification, with a loss based on the cross-entropy between gold and predicted labels (Tang et al., 2015) or on the negative log-likelihood of the correct labels (Yang et al., 2016). However, softmax overemphasizes the probability of the most likely label, which may not be ideal for multi-label classification, because each document has very few labels with likely values (e.g. 0.2–0.9). Instead, it is more appropriate to make independent predictions for each label, as we also verified empirically in our preliminary experiments. Hence, we replace the softmax with a sigmoid function: for each document i, represented by the vector u_i, we model the probability of the k labels as follows:

\hat{y}_i = p(y \mid u_i) = \frac{1}{1 + e^{-(W_c u_i + b_c)}} \in [0, 1]^k    (5)

where W_c is a d_s × k matrix and b_c is the bias term of the classification layer. The training loss based on cross-entropy is computed as follows:

L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} H(y_i, \hat{y}_i)    (6)

where θ denotes all the parameters of the model (i.e. H_w, A_w, H_s, A_s, W_c), and H is the binary cross-entropy between the gold labels y_i and the predicted labels \hat{y}_i for document i. The above objective is differentiable and can be minimized with stochastic gradient descent (Bottou, 1998) or variants such as Adam (Kingma and Ba, 2014), to maximize classification performance.
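A minimal sketch of this classification layer and loss (NumPy, with toy sizes of our own choosing):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(u, Wc, bc):
        """Independent per-label probabilities (Eq. 5)."""
        return sigmoid(u @ Wc + bc)              # shape (k,), each entry in [0, 1]

    def binary_cross_entropy(y_true, y_hat, eps=1e-7):
        """H(y_i, y_hat_i) summed over the k labels; Eq. 6 averages it over documents."""
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat)).sum()

    rng = np.random.default_rng(3)
    u = rng.normal(size=100)                                    # document vector (d_s = 100)
    Wc, bc = rng.normal(scale=0.1, size=(100, 5)), np.zeros(5)  # k = 5 labels
    y_hat = predict(u, Wc, bc)
    loss = binary_cross_entropy(np.array([1, 0, 0, 1, 0]), y_hat)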

4 Multilingual Hierarchical Attention Networks: MHANs

When multilingual data is available, the above network can be trained on each language separately, but the number of parameters then grows linearly with the number of languages. Moreover, such networks do not exploit common knowledge across languages or transfer it from one language to another. We propose here a hierarchical attention network with components shared across languages, which has slower (sublinear) parameter growth compared to monolingual models and enables knowledge transfer across languages. We now consider M languages, noted L = {L_l | l = 1, ..., M}, and a multilingual set of topic-labeled documents D = {(x_i^{(l)}, y_i^{(l)}) | i = 1, ..., N_l} defined as above.

[Figure 3 sketches the three configurations: (a) Sharing Encoders: common H_w, H_s with language-specific attentions α_W1 ... α_WM and α_S1 ... α_SM; (b) Sharing Attentions: common α_w, α_s with language-specific encoders H_W1 ... H_WM and H_S1 ... H_SM; (c) Sharing Both. In all cases, the per-language words w_11 ... w_KT are mapped to sentence vectors s_1 ... s_K and to document vectors u_1 ... u_M, and the classification layers W_C1 ... W_CM remain language-specific.]

Figure 3: Multilingual attention networks for modeling documents over disjoint label spaces.

4.1 Sharing Components across Languages

To enable multilingual learning, we propose three distinct ways of sharing components between networks in a multi-task learning setting, depicted in Figure 3, namely: (a) sharing the parameters of the word and sentence encoders, hence θ_enc = {H_w, W_w^{(l)}, H_s, W_s^{(l)}, W_c^{(l)}}; (b) sharing the parameters of the word and sentence attention models, hence θ_att = {H_w^{(l)}, W_w, H_s^{(l)}, W_s, W_c^{(l)}}; and (c) sharing both previous sets of parameters, hence θ_both = {H_w, W_w, H_s, W_s, W_c^{(l)}}. For instance, the document representation of a text in language l based on a shared sentence-level attention would be computed with Eq. 4, using the same parameters W_s and u_s across languages.

Let θ_mono = {H_w^{(l)}, W_w^{(l)}, H_s^{(l)}, W_s^{(l)}, W_c^{(l)}} be the parameters of multiple independent monolingual models with Dense encoders; then we have:

|\theta_{mono}| > |\theta_{enc}| > |\theta_{att}| > |\theta_{both}|    (7)

where |·| is the number of parameters in a set. For GRU and biGRU encoders, the inequalities still hold, but with |θ_enc| and |θ_att| swapped. Excluding the classification layer, which is necessarily language-specific², the (a) and (b) networks have sublinear numbers of parameters and the (c) network has a constant number of parameters with respect to the number of languages.

Depending on the label sets, several types of document classification problems can be solved with such architectures. First, label sets can be common or disjoint across languages. Second, considering labels as k-hot vectors, k = 1 corresponds to a multi-class task, while k > 1 is a multi-label task. We focus here on the multi-label problem with disjoint label sets.

²Note that the word embeddings are not considered as parameters in our setup because they are fixed during training. For learned word embeddings, the analysis still applies if we consider their parameters as part of the word-level encoder.


Moreover, we assume an aligned input space, i.e. multilingual word embeddings with aligned meanings across languages (Ammar et al., 2016). With non-aligned word embeddings, multilingual transfer is harder due to the complete lack of parallel information, as we show in Section 6.

4.2 Training over Disjoint Label Sets

For training, we replace the monolingual training objective (Eq. 6) with a joint multilingual objective that facilitates the sharing of components, i.e. a subset of the parameters θ_1, θ_2, ..., θ_M of each language, across the different language networks:

L(\theta_1, \ldots, \theta_M) = -\frac{1}{Z} \sum_{i=1}^{N_e} \sum_{l=1}^{M} H(y_i^{(l)}, \hat{y}_i^{(l)})    (8)

where Z = M × N_e, with N_e being the epoch size. It may also be beneficial to add a term γ_l to each language objective, encoding prior knowledge about its importance; we leave this as future work. The joint objective L can be minimized with respect to the parameters θ_1, θ_2, ..., θ_M using SGD as before. However, when training on examples from different languages consecutively (over I iterations), it is difficult to learn a shared space which works well across languages.

This is because the updates of each language apply only to a subset of the parameters and may bias the model away from the other languages. To address this, we employ the training strategy of Firat et al. (2016a), who sampled parallel sentences for multi-way machine translation from different language pairs in a cyclic fashion at each iteration.³ Here, we sample instead a document-label pair from each language l = 1, ..., M at each iteration j, noted L_l^{(j)}, as follows:

\big(L_1^{(1)} \ldots L_M^{(1)}\big) \rightarrow \ldots \rightarrow \big(L_1^{(I)} \ldots L_M^{(I)}\big)    (9)

For minibatch SGD, the number of samples per language is equal to the batch size divided by M.
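The sampling scheme of Eq. 9 amounts to building every minibatch from all languages in equal proportions, so that the shared parameters receive updates from every language at each step. Here is a sketch of such a batch generator (Python; the data layout and function names are hypothetical, not the released training code):

    import numpy as np

    def multilingual_batches(data, langs, batch_size, epoch_size, rng):
        """Yield minibatches mixing all languages (Eq. 9): at every iteration,
        draw batch_size // M examples from each of the M languages."""
        per_lang = batch_size // len(langs)
        for _ in range(epoch_size // batch_size):
            batch = []
            for l in langs:                          # cycle over L_1 ... L_M
                idx = rng.integers(0, len(data[l]), size=per_lang)
                batch.extend((l, data[l][j]) for j in idx)
            yield batch

    # Hypothetical usage: data[l] is a list of (document, labels) pairs.
    rng = np.random.default_rng(5)
    data = {l: [(f"{l}-doc{j}", None) for j in range(1000)] for l in ["en", "de", "ar"]}
    for batch in multilingual_batches(data, ["en", "de", "ar"], 16, 25000, rng):
        pass   # forward pass, joint loss of Eq. 8, one SGD/Adam step per batch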

5 A New Corpus for Multilingual Document Classification: DW

Multilingual document classification datasets are usually limited in size, have target categories aligned across languages, and assign documents to only one category.

³We verified this empirically in our preliminary experiments and found that mixing languages in a single batch performed better than keeping them in separate batches.

Language (L)   Documents |X|    s       w       |Yg|    |Ys|

English        112,816          17.9    516.2   327     1,058
German         132,709          22.3    424.1   367     809
Spanish        75,827           13.8    412.9   159     684
Portuguese     39,474           20.2    571.9   95      301
Ukrainian      35,423           17.6    342.9   28      260
Russian        108,076          16.4    330.1   102     814
Arabic         57,697           13.3    357.7   91      344
Persian        36,282           18.7    538.4   71      127

All            598,304          17.52   436.7   1,240   4,397

Table 1: Statistics of the Deutsche Welle corpus: s and w are the average numbers of sentences and words per document.

However, classification is often necessary in cases where the categories are not strictly aligned and can be multiple per document. For instance, this is the case for online news agencies keeping track of multilingual news written and annotated by different journalists, a process which is expensive and time-consuming.

Two datasets for multilingual document classification have been used in previous studies: Reuters RCV1/RCV2 (6,000 documents, 2 languages and 4 labels) and the TED talk transcripts (12,078 documents, 12 languages and 15 labels) introduced by Hermann and Blunsom (2014). The former is tailored for evaluating word embeddings aligned across languages, rather than complex multilingual document models. The latter is twice as large and covers more languages, in a multi-label setting, but biases evaluation by including translations of talks in all languages.

Here, we present and use a much larger dataset collected from Deutsche Welle, Germany's public international broadcaster. The DW dataset contains nearly 600,000 documents annotated by journalists with several topic labels in 8 languages, as shown in Table 1. Documents are on average 2.6 times longer than in Yang et al.'s (2016) monolingual dataset (436 vs. 163 words). There are two types of labels, namely general topics (Yg) and specific ones (Ys), described by one or more words. We consider (and count in Table 1) only those specific labels that appear at least 100 times, to avoid sparsity issues.
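For reference, this frequency filtering of specific labels can be expressed as follows (Python; the loading function is hypothetical and not part of the released code):

    from collections import Counter

    def frequent_labels(docs, min_count=100):
        """Keep only the specific labels occurring at least min_count times
        (as counted in Table 1), to avoid sparsity in the long tail."""
        counts = Counter(label for labels in docs for label in labels)
        return {label for label, c in counts.items() if c >= min_count}

    # docs = load_dw_annotations("en")   # hypothetical loader: list of label lists
    # keep = frequent_labels(docs)
    # docs = [[l for l in labels if l in keep] for labels in docs]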

The number of labels varies greatly across DW's language services. Moreover, we found for instance that only 25–30% of the labels could be manually aligned between English and German. The commonalities are mainly concentrated on the most frequent labels, reflecting a shared top-level division of the news domain, but the long tail exhibits significant independence across languages.


                           English + Auxiliary → English              English + Auxiliary → Auxiliary
  Models                   de    es    pt    uk    ru    ar    fa     de    es    pt    uk    ru    ar    fa

Ygeneral
  Mono   NN (Avg)          50.7                                        53.1  70.0  57.2  80.9  59.3  64.4  66.6
         HNN (Avg)         70.0                                        67.9  82.5  70.5  86.8  77.4  79.0  76.6
         HAN (Att)         71.2                                        71.8  82.8  71.3  85.3  79.8  80.5  76.6
  Multi  MHAN-Enc          71.0  69.9  69.2  70.8  71.5  70.0  71.3    69.7  82.9  69.7  86.8  80.3  79.0  76.0
         MHAN-Att          74.0  74.2  74.1  72.9  73.9  73.8  73.3    72.5  82.5  70.8  87.7  80.5  82.1  76.3
         MHAN-Both         72.8  71.2  70.5  65.6  71.1  68.9  69.2    70.4  82.8  71.6  87.5  80.8  79.1  77.1

Yspecific
  Mono   NN (Avg)          24.4                                        21.8  22.1  24.3  33.0  26.0  24.1  32.1
         HNN (Avg)         39.3                                        39.6  37.9  33.6  42.2  39.3  34.6  43.1
         HAN (Att)         43.4                                        44.8  46.3  41.9  46.4  45.8  41.2  49.4
  Multi  MHAN-Enc          45.4  45.9  44.3  41.1  42.1  44.9  41.0    43.9  46.2  39.3  47.4  45.0  37.9  48.6
         MHAN-Att          46.3  46.0  45.9  45.6  46.4  46.4  46.1    46.5  46.7  43.3  47.9  45.8  41.3  48.0
         MHAN-Both         45.7  45.6  41.5  41.2  45.6  44.6  43.0    45.9  46.4  40.3  46.3  46.1  40.7  50.3

Table 2: Full-resource classification performance (F1) on general (top) and specific (bottom) topic categories, using bilingual training with English as target (left) and the auxiliary language as target (right).

6 Evaluation

We evaluate our multilingual models in full-resource and low-resource scenarios of multilingual document classification over disjoint label sets on the Deutsche Welle corpus.

6.1 Settings

Following the typical evaluation protocol in the field, the corpus is split per language into 80% for training, 10% for validation and 10% for testing. We evaluate both types of labels (Yg, Ys) in the full-resource scenario and only the general topics (Yg) in the low-resource scenario. We report the micro-averaged F1 scores for each test set, as in previous work (Hermann and Blunsom, 2014).

Model configuration. For all models, we use the aligned pre-trained 40-dimensional multilingual embeddings trained on the Leipzig corpus using multi-CCA by Ammar et al. (2016). The non-aligned embeddings are trained with the same method and data. We zero-pad documents up to a maximum of 30 words per sentence and 30 sentences per document. The hyper-parameters were selected on the validation sets. We used the following settings: 100-dimensional encoder and attention embeddings (at every level), relu activation for all intermediate layers, a batch size of 16, an epoch size of 25k, and optimization with SGD using Adam until convergence.

All the hierarchical models have Dense encoders in both scenarios (Tables 2, 4, and 5), and GRU and biGRU encoders in the full-resource scenario for English+Arabic (Table 3). For the low-resource scenario, we define three levels of data availability: tiny, from 0.1% to 0.5%; small, from 1% to 5%; and medium, from 10% to 50% of the original training set. We report the average F1 scores on the test set for each level, based on discrete increments of 0.1, 1 and 10 respectively. The decision threshold for the full-resource scenario is set to 0.4 for |Ys| < 400 and 0.2 for |Ys| ≥ 400, and for the low-resource scenario it is 0.3 for all sets.
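For clarity, the evaluation metric and thresholding can be sketched as follows (Python/NumPy; a simplified restatement of the protocol above, not the exact evaluation script):

    import numpy as np

    def micro_f1(y_true, y_prob, threshold):
        """Micro-averaged F1 over all (document, label) decisions,
        with a fixed decision threshold on the sigmoid outputs."""
        y_pred = (y_prob >= threshold).astype(int)
        tp = int(((y_pred == 1) & (y_true == 1)).sum())
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def full_resource_threshold(num_specific_labels):
        """Decision threshold used in the full-resource scenario (Section 6.1)."""
        return 0.4 if num_specific_labels < 400 else 0.2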

Baselines. We compare against the following monolingual neural networks, with shallow or hierarchical structure. These networks are based on the state of the art in the field, reviewed in Section 2, and thus represent strong baselines.

• NN: A neural network which feeds the average vector of the input words directly to a classification layer, used as the common baseline for multilingual document classification by Klementiev et al. (2012).

• HNN: A hierarchical network with encoders and average pooling at every level, followed by a classification layer.

• HAN: A hierarchical network with encoders and attention, followed by a classification layer. This is the model proposed by Yang et al. (2016), adapted to our task.

When training our multilingual system with the three options described in Section 4.1, noted Enc, Att and Both, we use in fact the multilingual version of the third monolingual baseline, as illustrated in Fig. 3, trained with the objective of Eq. 8.

6.2 Results

Full-resource Scenario. Table 2 displays the results of full-resource document classification using Dense encoders for general and specific labels. In particular, on the left, the performance on the English sub-corpus is shown when English and an auxiliary sub-corpus are used for training.


Figure 4: Low-resource document classification performance (F1) of our multilingual attention network ensemble vs. a monolingual attention network on the DW corpus.

On the right, the performance on the auxiliary sub-corpus is shown when that sub-corpus and English are used for training.

The multilingual model trained on pairs of languages outperforms on average all the examined monolingual models, from simple bag-of-words neural models to hierarchical neural models using average pooling or attention. The best-performing multilingual architecture is the one with shared attention across languages, especially when tested on English (documents and labels). Interestingly, this reveals that the transfer of knowledge across languages in a full-resource setting is maximized with language-specific word and sentence encoders, but language-independent (i.e. shared) attention for both words and sentences.

However, when transferring from English to Portuguese (en→pt), Russian (en→ru) and Persian (en→fa) on general categories, it is more effective to have only language-independent components. We hypothesize that this is due to the underlying commonality between the label sets rather than the relationship between the languages. The consistent gain for English as target could be attributed to the alignment of the word embeddings to English and to the large number of English labels, which makes it easier to find multilingual labels from which to transfer knowledge.

We now quantify the impact of three important model choices on performance: encoder type, word embeddings, and number of languages used for training. In Table 3, we observe that when we replace the Dense encoder layers with GRU or biGRU layers, the improvement from multilingual training is still present. In particular, the multilingual models with shared attention are superior to the alternatives regardless of the encoders employed. For reference, using simple logistic regression with bag-of-words (counts) for classification leads to F1 scores of 75.8% in English and 81.9% in Arabic, using many more parameters than biGRU: 56.5M vs. 410k in English and 5.8M vs. 364k in Arabic.


Ygeneral                    Mono    Multi
          Encoders          HAN     Enc     Att     Both

ar→en     Dense             71.2    70.0    73.8    68.9
          GRU               77.0    74.8    77.5    75.4
          biGRU             77.7    77.1    77.5    76.7
en→ar     Dense             80.5    79.0    82.1    79.1
          GRU               81.5    81.2    83.4    83.1
          biGRU             82.2    82.7    84.0    83.0

Table 3: Full-resource classification performance (F1) for English-Arabic with various encoders.

Word emb.      |L|    Ygeneral (n_l, f_l)         Yspecific (n_l, f_l)

–              1      50K   –    77.41   –        90K   –    44.90   –
aligned        2      40K   ↓    78.30   ↑        80K   ↓    45.72   ↑
aligned        8      32K   ↓    77.91   ↑        72K   ↓    45.82   ↑
non-aligned    8      32K   ↓    71.23   ↓        72K   ↓    33.41   ↓

Table 4: Average number of parameters per language (n_l), average F1 per language (f_l), and their variation (arrows) with the number of languages |L| and the word embeddings used for training.

In Table 4, we observe that when we train our multilingual model (MHAN-Att) on eight languages at the same time, the F1 scores improve across languages, for both types of labels (general and specific), while the number of parameters per language decreases, by 36% for Ygeneral and 20% for Yspecific. Lastly, when we train the same model with word embeddings that are not aligned across languages, the performance of the multilingual model drops significantly. An input space that is aligned across languages is thus crucial.

Low-resource Scenario. We assess the ability of the multilingual attention networks to transfer knowledge across languages in a low-resource scenario. The results for seven languages trained jointly with English are displayed in detail in Table 5 and summarized in Figure 4. In all cases, at least one of the multilingual models outperforms the monolingual one, which demonstrates the usefulness of multilingual training for low-resource document classification.

Moreover, the improvements obtained with our multilingual models at the lower levels of availability (tiny and small) are larger than at the higher level (medium).


                      Mono    Multi
Ygeneral   Size       HAN     Enc     Att     Both    Δ%

en→de      0.1-0.5%   29.9    41.0    37.0    39.4    +37.2
           1-5%       51.3    51.7    49.7    52.6    +2.6
           10-50%     63.5    63.0    63.8    63.8    +0.5
en→es      0.1-0.5%   39.5    38.7    33.3    41.5    +4.9
           1-5%       45.6    45.5    50.8    50.1    +11.6
           10-50%     74.2    75.7    74.2    75.2    +2.0
en→pt      0.1-0.5%   30.9    25.3    31.6    33.8    +9.6
           1-5%       44.6    44.3    37.5    47.3    +6.0
           10-50%     60.9    61.9    62.1    62.1    +1.9
en→uk      0.1-0.5%   60.4    62.4    59.8    60.9    +3.1
           1-5%       68.2    67.7    70.6    69.0    +3.4
           10-50%     76.4    76.2    76.3    76.7    +0.3
en→ru      0.1-0.5%   27.6    26.6    27.0    29.1    +5.4
           1-5%       39.3    38.2    39.6    40.2    +2.2
           10-50%     69.2    70.5    70.4    69.4    +1.9
en→ar      0.1-0.5%   35.4    35.5    39.5    36.6    +11.7
           1-5%       45.6    48.7    47.2    46.6    +6.9
           10-50%     48.9    52.2    46.8    47.8    +6.8
en→fa      0.1-0.5%   36.0    35.6    33.6    41.3    +14.6
           1-5%       55.0    55.6    51.9    55.5    +1.0
           10-50%     69.2    70.3    70.1    70.0    +1.5

Table 5: Low-resource classification performance (F1) with various sizes of training data.

This is also clearly observed in Figure 4 with our multilingual attention network ensemble, i.e. when we perform model selection among the three multilingual variants on the development set. The best-performing architecture in the majority of cases is the one which shares both the encoders and the attention mechanisms across languages. Moreover, this architecture also has the fewest parameters for document modeling.

This promising finding for the low-resource scenario means that classification performance can greatly benefit from multilingual training (sharing encoders and attention) without increasing the number of parameters beyond that of a single monolingual document model. Nevertheless, in a few cases, we observe that the other two architectures, with increased complexity, perform better than the “shared both” model. For instance, sharing encoders is superior to the alternatives for Arabic, i.e. the knowledge transfer benefits from shared word and sentence representations. Hence, to generalize to a large number of languages, we may need to consider more sophisticated models than sharing a monolingual document model across languages. Lastly, we did not generally observe a negative (or positive) correlation between the closeness of languages and the performance in the low-resource scenario, although the largest improvements were observed on languages more related to English (German, Spanish, Portuguese) than others (Arabic).


[Figure 5 panels: en → de, en → pt, en → ar, en → ru.]

Figure 5: Cumulative true positive (TP) difference between monolingual and multilingual (ensemble) models for topic classification with specific labels, in the full-resource scenario.

Overall, the above experiments pinpointed the most suitable general class of architectures for each language-availability setting rather than specific ones. That said, increasing the sophistication of the encoders is expected to improve accuracy across the board (as shown in Table 3).

6.3 Qualitative Analysis

We analyze the performance of the model over the full range of labels, to observe on which types of labels it performs better than the monolingual model, and provide some qualitative examples. Figure 5 shows the cumulative true positive (TP) difference between the monolingual and multilingual models on the German, Portuguese, Arabic and Russian test sets, ordered by label frequency. We can observe that the cumulative TP difference of the multilingual model consistently keeps increasing as the frequency of the label decreases. This shows that labels across the entire range of frequencies benefit from joint training with English, and not only a subset such as the most frequent labels.
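The curve of Figure 5 can be computed directly from the per-label true-positive counts of the two models (Python/NumPy sketch; the variable names are hypothetical):

    import numpy as np

    def cumulative_tp_difference(tp_multi, tp_mono, label_freq):
        """Cumulative difference of true positives (multilingual - monolingual),
        with labels ordered from most to least frequent, as in Figure 5."""
        order = np.argsort(-np.asarray(label_freq))        # most frequent first
        diff = np.asarray(tp_multi)[order] - np.asarray(tp_mono)[order]
        return np.cumsum(diff)

    # curve = cumulative_tp_difference(tp_multi, tp_mono, label_freq)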

For example, the top 5 labels on which the multilingual model performed better than the monolingual one for en → de were: russland (21), berlin (19), irak (14), wahlen (13) and nato (13), while for the opposite direction they were:


germany (259), german (97), soccer (73), football (47) and merkel (25). These topics are likely better covered in the respective auxiliary language, which helps the multilingual model to better distinguish them in the target language. This is also visible in Figure 1, as the improved separation of topics by the multilingual model compared to the monolingual ones.

7 Conclusion

We proposed multilingual hierarchical attention networks for document classification and showed that they can benefit both full-resource and low-resource scenarios while using fewer parameters than monolingual networks. In the former scenario, the most beneficial choice was to share only the attention mechanisms, while in the latter it was to share the encoders along with the attention mechanisms. These results confirm the merits of language transfer, which is also an important component of human language learning (Odlin, 1989; Ringbom, 2007). Moreover, our study broadens the applicability of multilingual document classification, since our framework is not restricted to common label sets.

There are several future directions for this study. In their current form, our models cannot generalize to languages without any training examples, as attempted by Firat et al. (2016b) for neural MT. This could be achieved with a label-size-independent classification layer, as in zero-shot classification (Qiao et al., 2016; Nam et al., 2016). Moreover, although we explored three distinct architectures, other configurations could be examined to improve document modeling, for example by sharing the attention mechanism at the sentence level only. Lastly, the learning objective could be further constrained with sentence-level parallel information, to embed multilingual vectors of similar topics more closely together in the learned space.

Acknowledgments

We are grateful for the support of the European Union through its Horizon 2020 program (SUMMA project n. 688139, www.summa-project.eu). We would also like to thank Sebastiao Miranda for crawling the news articles from Deutsche Welle, and the anonymous reviewers for their helpful suggestions.

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proc. of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. CoRR, abs/1602.01925.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the 5th International Conference on Learning Representations, San Diego, CA, USA.

Leon Bottou. 1998. On-line learning and stochastic approximations. In David Saad, editor, On-line Learning in Neural Networks, pages 9–42. Cambridge University Press.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems 27, pages 1853–1861.

Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. 2015. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Advances in Neural Information Processing Systems 28, pages 1765–1773, Montreal, Canada.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Doha, Qatar.

Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. 2012. Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8):2151–2184.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proc. of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471, Gothenburg, Sweden.

Daniel C. Ferreira, Andre F. T. Martins, and Mariana S. C. Almeida. 2016. Jointly learning to embed and predict with multiple languages. In Proc. of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2019–2028, Berlin, Germany.


Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016a. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proc. of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California.

Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016b. Zero-resource translation with multi-lingual neural machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 268–277, Austin, Texas.

Stephan Gouws, Yoshua Bengio, and Gregory S. Corrado. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In Proc. of the 32nd International Conference on Machine Learning, pages 748–756, Lille, France.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 58–68, Baltimore, Maryland.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proc. of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pages 1693–1701, Montreal, Canada.

Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780. MIT Press.

Yangfeng Ji and Noah Smith. 2017. Neural discourse structure for text categorization. CoRR, abs/1702.01829.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.

Rie Johnson and Tong Zhang. 2015. Effective use of word order for text categorization with convolutional neural networks. In Proc. of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 103–112, Denver, Colorado.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.

Diederik P. Kingma and Jimmy Lei Ba. 2014. Adam: A method for stochastic optimization. In Proc. of the International Conference on Learning Representations, Banff, Canada.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proc. of the International Conference on Computational Linguistics, Bombay, India.

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. In Proc. of the 33rd International Conference on Machine Learning, pages 334–343, New York City, NY, USA.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proc. of the 29th AAAI Conference on Artificial Intelligence, pages 2267–2273, Austin, Texas.

Hugo Larochelle and Geoffrey Hinton. 2010. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Proc. of the 23rd International Conference on Neural Information Processing Systems, pages 1243–1251, Vancouver, Canada.

Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proc. of the 31st International Conference on Machine Learning, pages 1188–1196, Beijing, China.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 899–907, Lisbon, Portugal.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.

Laurens van der Maaten. 2009. Learning a parametric embedding by preserving local structure. In Proc. of the 12th International Conference on Artificial Intelligence and Statistics, pages 384–391, Clearwater Beach, FL, USA.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. of the International Conference on Learning Representations, Scottsdale, AZ, USA.

Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. CoRR, abs/1406.6247.


Jinseok Nam, Eneldo Loza Mencia, and Johannes Furnkranz. 2016. All-in text: Learning document, label, and word representations jointly. In Proc. of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1948–1954, Phoenix, AZ, USA.

Terence Odlin. 1989. Language Transfer: Cross-linguistic Influence in Language Learning. Cambridge Applied Linguistics. Cambridge University Press.

Nikolaos Pappas and Andrei Popescu-Belis. 2014. Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 455–466, Doha, Qatar.

Nikolaos Pappas and Andrei Popescu-Belis. 2017. Explicit document modeling through weighted multiple-instance learning. Journal of Artificial Intelligence Research, pages 591–626.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar.

Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2016. Less is more: Zero-shot learning from online textual documents with noise suppression. In Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 2249–2257, Las Vegas, NV, USA.

Hakan Ringbom. 2007. Cross-linguistic Similarity in Foreign Language Learning. Second Language Acquisition. Multilingual Matters.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1422–1432, Lisbon, Portugal.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394, Uppsala, Sweden.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proc. of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California.

Ainur Yessenalina, Yisong Yue, and Claire Cardie. 2010. Multi-level structured models for document-level sentiment classification. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1046–1056, Cambridge, MA.

Wen-tau Yih, Kristina Toutanova, John C. Platt, and Christopher Meek. 2011. Learning discriminative projections for text similarity measures. In Proc. of the 15th Conference on Computational Natural Language Learning, pages 247–256, Portland, OR, USA.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657, Montreal, Canada.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, pages 1393–1398, Seattle, Washington, USA.
Will Y. Zou, Richard Socher, Daniel Cer, and Christo-pher D. Manning. 2013. Bilingual word embeddingsfor phrase-based machine translation. In Proc. of theConference on Empirical Methods in Natural Lan-guage Processing, pages 1393–1398, Seattle, Wash-ington, USA.