
Multilingual Dialogue Generation with Shared-Private Memory

Chen Chen1, Lisong Qiu2, Zhenxin Fu2, Junfei Liu3, and Rui Yan2*

1 School of Software and Microelectronics of Peking University, Beijing, China
2 Institute of Computer Science and Technology of Peking University, Beijing, China

3 National Engineering Research Center for Software Engineering (Peking University), Beijing, China

{chenchen,qiulisong,fuzhenxin,liujunfei,ruiyan}@pku.edu.cn

Abstract. Existing dialogue systems are monolingual, and features shared among different languages are rarely explored. In this paper, we introduce a novel multilingual dialogue system. Specifically, we augment the sequence-to-sequence framework with an improved shared-private memory. The shared memory learns common features among different languages and facilitates cross-lingual transfer to boost dialogue systems, while the private memory is owned by each separate language to capture its unique features. Experiments conducted on Chinese and English conversation corpora of different scales show that our proposed architecture outperforms individually learned models with the help of the other language, and the improvement is particularly distinct when the training data is limited.

Keywords: Multilingual dialogue system · Memory network · Seq2Seq · Multi-task learning.

1 Introduction

Dialogue systems have long been of interest to the natural language processing community due to their wide range of applications. These systems can be classified as task-oriented or non-task-oriented: task-oriented dialogue systems accomplish a specific task, while non-task-oriented dialogue systems are designed to chat in the open domain as chatbots [1]. In particular, the sequence-to-sequence (Seq2Seq) framework [2], which learns to generate responses according to given queries, achieves promising performance and has grown popular [3].

Building a state-of-the-art generation-based dialogue system requires large-scale conversational data. However, the difficulty of collecting conversational data varies greatly across languages [4, 5]. For example, minority languages can rarely collect enough dialogue corpora to build a dialogue generation model the way majority languages (e.g., English and Chinese) do. Herein, we investigate moving the frontier of dialogue generation forward from a different angle.

* Corresponding author: Rui Yan ([email protected]).


More specifically, we find that some common features, e.g., dialogue logic, are shared across different languages but take different linguistic forms. Leveraging a multi-task framework for cross-lingual transfer learning can alleviate the problems caused by the scarcity of resources [6-8]. Through common dialogue features shared among different languages, the logic knowledge of different languages can be transferred and the robustness of the conversational model can be improved. However, to the best of our knowledge, no existing study has ever tackled multilingual generation-based dialogue systems.

This paper proposes a multi-task learning architecture for multilingual open-domain dialogue systems that leverages the common dialogue features shared among different languages. Inspired by [16], we augment the Seq2Seq framework by adding an architecturally improved key-value memory layer between the encoder and decoder. Concretely, the memory layer consists of two parts: the key memory is used for query addressing, and the value memory stores the semantic representation of the corresponding response. To capture both shared and private features of different languages, the memory layer is further divided into shared and private memories. Though proposed for open-domain dialogue systems, the multilingual shared-private memory architecture can be adapted flexibly and used for other tasks.

Experiments conducted on Weibo and Twitter conversational corpora of different sizes show that our proposed multilingual architecture outperforms existing techniques on both automatic and human evaluation metrics. Especially when the training data is scarce, the dialogue capability can be enhanced significantly with the help of the multilingual model.

To this end, the main contributions of our work are fourfold: 1) To the best of our knowledge, this work is the first to provide a solution for multilingual dialogue systems. 2) We improve the traditional key-value memory structure to expand its capacity, with which we extend the Seq2Seq model to capture dialogue features. 3) Based on the memory-augmented dialogue model, a multi-task learning architecture with shared-private memory is proposed to transfer dialogue features among different languages. 4) We empirically demonstrate the effectiveness of multi-task learning in the dialogue generation task and investigate some characteristics of this framework.

2 Related Works

2.1 Dialogue Systems

Building a dialogue system is a challenging task in natural language processing (NLP). The focus in previous decades was on template-based models [9]. However, recent generation-based dialogue systems are of growing interest due to their effectiveness and scalability. Ritter et al. [10] proposed a response generation model using statistical machine-translation methods. This idea was further developed by [11], who represented previous utterances as a context vector and incorporated the context vector into response generation.


Many methods are applied in dialogue generation. Attention helps the generation-based dialogue system by aligning the context and the response. [12] improved the performance of a recurrent neural network dialogue model via a dynamic attention mechanism. In addition, some works concentrate on various aspects of dialogue generation, including diversity, coherence, personality, knowledgeability, and controllability [13]. In these approaches, the corpora used are always in the same language; such systems are referred to as monolingual dialogue systems. As far as we know, this study is the first to explore a multilingual architecture for generation-based dialogue systems.

2.2 Memory Networks

Memory networks [14, 15] are a class of neural network models that are augmented with external memory resources. Valuable information can be stored and reused through the memory components. Based on the end-to-end memory network architecture [15], [16] proposed a key-value memory network architecture for question answering. The memory stores facts in a key-value structure so that the model can learn to use the keys to address memories relevant to the question and return the corresponding values for answering. [17], [18] and [19] built goal-oriented dialogue systems based on memory-augmented neural networks. Compared with the above models, our memory components are not built from specific knowledge bases but are tuned during training, which makes the model more flexible. We further divide each memory module into several blocks to improve its capacity.

2.3 Multi-task Learning

Multi-task learning (MTL) is an approach to learning multiple related tasks simultaneously. It improves generalization by leveraging the domain-specific information contained in the training signals of related tasks [20]. [21] confirmed that NLP models benefit from the MTL approach. Many recent deep-learning approaches to multilingual problems also use MTL as part of their models.

In the context of deep learning, MTL is usually done with either hard or soft parameter sharing of hidden layers: hard parameter sharing explicitly shares hidden layers between tasks while keeping several task-specific output layers, whereas soft parameter sharing usually employs regularization techniques to encourage the parameters of different tasks to be similar [22]. Hard parameter sharing is the most commonly used approach to MTL in neural networks. [7] learned a model that simultaneously translates sentences from one source language into multiple target languages. [23] proposed an adversarial multi-task learning framework for text classification. [8] demonstrated a single deep learning model that jointly learns large-scale tasks from various domains, including multiple translation tasks, an English parsing task, and an image captioning task. However, to date, no multilingual dialogue-generation system based on a multi-task learning framework has been built.


Fig. 1. Memory structure of MemSeq2Seq and ImpMemSeq2Seq (the latter shown with n = 3 blocks). M represents a memory module; k and v represent the input (key) and output (value) memory respectively. MemSeq2Seq uses a single module with k = {k_j} and v = {v_j}, 1 ≤ j ≤ t, while ImpMemSeq2Seq splits the memory into n blocks M_1, ..., M_n with k_i = {k_ij} and v_i = {v_ij}, 1 ≤ i ≤ n, 1 ≤ j ≤ t.

3 Model

In this section, we first review the vanilla Seq2Seq, then propose the key-value memory augmented Seq2Seq models, and finally extend them with shared-private memory components to implement the multilingual dialogue system.

3.1 Preliminary Background Knowledge

A Seq2Seq model maps input sequences to output sequences. It consists of two key components: an encoder, which encodes the source input into a fixed-size context vector using a recurrent neural network (RNN), and a decoder, which generates the output sequence with another RNN based on the context vector.

Given a source sequence of words (query) $q = \{x_1, x_2, \dots, x_{n_q}\}$ and a target sequence of words (response) $r = \{y_1, y_2, \dots, y_{n_r}\}$, a basic Seq2Seq-based dialogue system automatically generates the response r conditioned on the query q by maximizing the generation probability p(r|q). Specifically, the encoder encodes q into a context vector c, and the decoder generates r word by word with c as input. The objective function of Seq2Seq can be written as

$h_t = f(x_t, h_{t-1}), \quad c = h_{n_q}$,   (1)

$p(r|q) = p(y_1|c)\prod_{t=2}^{n_r} p(y_t \mid c, y_1, \dots, y_{t-1})$,   (2)

where h_t is the hidden state at time t and f is a non-linear transformation. Moreover, gated recurrent units (GRU) and the attention mechanism proposed by [24] are used in this work.
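To make the preliminaries concrete, the following is a minimal PyTorch sketch of the encoder side, using the embedding size (630) and hidden size (1024) reported in Section 4.3; the class and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Minimal GRU encoder: maps a token-id sequence to a context vector c = h_{n_q}."""
    def __init__(self, vocab_size, emb_dim=630, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, query_ids):             # query_ids: (batch, n_q)
        embedded = self.embed(query_ids)      # (batch, n_q, emb_dim)
        outputs, h_n = self.gru(embedded)     # h_n: (1, batch, hidden_dim)
        # outputs feed the attention mechanism; the final hidden state serves as c.
        return outputs, h_n.squeeze(0)
```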

3.2 Key-Value Memory Augmented Seq2Seq

Inspired by the key-value memory network [16], we introduce the MemSeq2Seq model, which adds a key-value memory layer between the encoder and decoder to learn dialogue features, and the ImpMemSeq2Seq model, which divides the memory of MemSeq2Seq into blocks to expand model capacity.


MemSeq2Seq. The MemSeq2Seq augments the Seq2Seq with a key-value memory layer between the encoder and decoder. The memory component consists of two parts: the input (key) and output (value) memories. The input memory is used for addressing with the query representation, while the output memory stores the representation of the corresponding response information. The model retrieves information from the value memory with weights computed as the similarity between the query representation and the key memory, with the goal of selecting the values most relevant to the query.

Formally, we first encode a query q into a context vector c, and then calculate the similarity p = {p_1, ..., p_t} between c and each item of the key memory using softmax weights. The model then computes a new context vector c*, which is a weighted sum of the value memory according to p:

$p_j = \mathrm{softmax}(c \cdot k_j)$,   (3)

$c^* = \sum_{j=1}^{t} p_j v_j$,   (4)

where k_j and v_j are items in the key and value memory, and t is the number of key and value items. During training, all items in memory and all parameters of the Seq2Seq are jointly learned to maximize the likelihood of generating the ground-truth responses conditioned on the queries in the training set.
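A minimal sketch of the key-value read in Eqs. (3)-(4), assuming the keys and values are learnable tensors of shape (t, d); it illustrates the addressing step only and is not the authors' code.

```python
import torch
import torch.nn.functional as F

def key_value_read(c, keys, values):
    """Key-value memory read of Eqs. (3)-(4).

    c:      (batch, d) context vectors from the encoder
    keys:   (t, d)     input (key) memory items k_j
    values: (t, d)     output (value) memory items v_j
    Returns the new context c* = sum_j p_j v_j with p = softmax(c . k).
    """
    scores = c @ keys.t()            # (batch, t): inner products c . k_j
    p = F.softmax(scores, dim=-1)    # addressing weights p_j
    return p @ values                # (batch, d): weighted sum of values
```

In practice, keys and values would be registered as nn.Parameter tensors so that they are updated jointly with the Seq2Seq parameters, as described above.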

ImpMemSeq2Seq. In MemSeq2Seq, the number of key-value pairs is limited: it grows linearly with the number of items in memory. To expand capacity, we further divide the entire memory into several individual blocks and accordingly split the input vector into several segments to compute the similarity scores. After this division, similar to the multi-head attention mechanism [25], different representation subspaces at different positions are projected individually, and the number of addressable key-value combinations becomes the number of slot combinations across the blocks, while one key still corresponds to one value.

The model first splits a context vector c into n segments, then computes new context segments c*_i with the memory blocks independently; the final new context vector c* is the concatenation of the c*_i:

$c_1, \dots, c_n = \mathrm{split}(c), \quad c^*_i = M_i(c_i)$,   (5)

$c^* = \mathrm{concat}(c^*_1, c^*_2, \dots, c^*_n)$,   (6)

where M_i represents the computation in the i-th memory block. The ImpMemSeq2Seq calculates the weights p at a finer granularity, which makes the addressing more precise and flexible. Besides, with a parallel implementation, the memory layer becomes more efficient.
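The block decomposition of Eqs. (5)-(6) can be sketched as follows; the block count (32) and context dimension (1024) follow the settings in Section 4.3, while the number of slots per block and the initialisation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockKeyValueMemory(nn.Module):
    """Block-decomposed key-value memory (ImpMemSeq2Seq, Eqs. 5-6):
    the context is split into n segments, each segment addresses its own
    key/value block, and the per-block reads are concatenated."""
    def __init__(self, dim=1024, n_blocks=32, n_slots=1024):
        super().__init__()
        assert dim % n_blocks == 0
        seg = dim // n_blocks
        self.n_blocks = n_blocks
        self.keys = nn.Parameter(0.02 * torch.randn(n_blocks, n_slots, seg))
        self.values = nn.Parameter(0.02 * torch.randn(n_blocks, n_slots, seg))

    def forward(self, c):                                # c: (batch, dim)
        segments = c.chunk(self.n_blocks, dim=-1)        # n segments c_i
        outs = []
        for i, c_i in enumerate(segments):
            p = F.softmax(c_i @ self.keys[i].t(), dim=-1)   # (batch, n_slots)
            outs.append(p @ self.values[i])                 # c*_i: (batch, seg)
        return torch.cat(outs, dim=-1)                   # c* = concat(c*_1, ..., c*_n)
```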

3.3 Seq2Seq with Shared-Private Memory

The models introduced in the previous sections can be extended to multilingual tasks.


Fig. 2. SPImpMem model for the multilingual dialogue system. Each language has its own encoder, attention and decoder; the encoded context c_lang is split, matched against the private memory M_lang and the shared memory M_global via inner products, softmax weights and a weighted average, and the results are concatenated into the new context c*. M_global represents the shared memory; superscripts lang1 and lang2 represent the two different languages respectively.

Specifically, we augment the MemSeq2Seq and ImpMemSeq2Seq for multilingual tasks and name the extensions SPMem and SPImpMem, respectively. Following the multi-task learning paradigm, dialogue systems in two different languages can be trained simultaneously. By sharing representations between the two dialogue tasks, the model facilitates the cross-lingual transfer of dialogue capability.

Our multilingual model consists of four kinds of modules: per-language encoders, per-language decoders, a private memory for each language, and a shared memory used by all languages. Figure 2 gives an illustration of SPImpMem; SPMem is the special case where the number of memory blocks n is set to 1. More specifically, given an input query q, the encoder of its language first encodes it into a context vector c, and the model then feeds c to both the private memory of that language and the shared memory. The private memory is occupied by the language corresponding to the input, whereas the shared memory is expected to capture common features of conversations among different languages. By matching and addressing the shared and private memory components, we obtain two output vectors that are concatenated into a new context vector c*. The returned vector is expected to contain features from both the input's own language and the other languages involved in the multilingual model, and it is fed to the decoder of the input's language.
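A minimal sketch of the shared-private read described above: the same context vector is addressed against the language's private memory and the global shared memory, and the two outputs are concatenated. Both memories could be instances of the BlockKeyValueMemory sketch from Section 3.2; the function name and interface are assumptions for illustration.

```python
import torch

def shared_private_read(c, private_memory, shared_memory):
    """c: (batch, dim) context of one language's query.
    private_memory / shared_memory: modules mapping (batch, dim) -> (batch, dim)."""
    c_private = private_memory(c)   # language-specific features (M_lang)
    c_shared = shared_memory(c)     # cross-lingual features (M_global)
    # The concatenation is the new context c* fed to that language's decoder.
    return torch.cat([c_private, c_shared], dim=-1)
```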

Given the first-language conversational corpus $\{(q^1_i, r^1_i)\}_{i=1}^{T_1}$ and the second-language conversational corpus $\{(q^2_i, r^2_i)\}_{i=1}^{T_2}$, the parameters $\Theta$ are learned by minimizing the negative log-likelihood between the generated responses $\tilde r$ and the references $r$, which is equivalent to maximizing the conditional probability of the responses $r^1$ and $r^2$ given $\Theta$, $q^1$ and $q^2$:

$J = \frac{1}{T_1}\sum_{i=1}^{T_1} \log p(r^1_i \mid q^1_i, \Theta_{s_1}, \Theta_{M_1}, \Theta_{M_g}) + \frac{1}{T_2}\sum_{i=1}^{T_2} \log p(r^2_i \mid q^2_i, \Theta_{s_2}, \Theta_{M_2}, \Theta_{M_g})$,   (7)

where $\Theta_s$ is the collection of parameters of the encoders and decoders, $\Theta_M$ the parameters of the memory contents, and $T$ the size of a corpus; the subscripts 1, 2 and g refer to lang1, lang2 and global in Figure 2, respectively.


4 Experimental Settings

4.1 Datasets

We conducted experiments on open-domain single-turn Chinese (Zh) and English (En) conversational corpora. The Chinese corpus consists of 4.4 million conversations and the English corpus of 2.1 million conversations [10]. The conversations were scraped from Sina Weibo (http://weibo.com) and Twitter (http://www.twitter.com), respectively.

The experiments include two parts, balanced and unbalanced tests, which are distinguished by the relative sizes of the training data for each language. In the balanced tests, the sizes of the Chinese and English corpora are comparable. We empirically set the dataset size to 100k, 400k, 1m and the whole corpora (4.4m-Zh, 2.1m-En) to evaluate model performance at different data scales. The unbalanced tests use training data of (1m-Zh, 100k-En) and (100k-Zh, 1m-En), respectively. Subsets are sampled randomly. All experiments use the same validation and testing data, of size 10k.

4.2 Evaluation Metrics

Three different metrics are used in our experiments:

– Word overlap based metric: Following previous work [11], we employ BLEU [26] to measure the word overlap between a generated response and the reference response.

– Distinct-1 & Distinct-2: Distinct-1 and Distinct-2 are the ratios of distinct unigrams and bigrams in the generated responses, respectively [27], and measure the diversity of the generated responses (a minimal computation sketch follows this list).

– Three-scale human annotation: We adopt human evaluation following [28]. Four human annotators were recruited to judge the quality of 500 generated responses from different models. All responses are pooled and randomly permuted. The criteria are as follows: +2, the response is relevant and natural; +1, the response is a correct reply but contains minor errors; 0, the response is irrelevant, meaningless, or has serious grammatical errors.
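For reference, Distinct-n can be computed as below; the helper is a straightforward sketch of the metric defined in [27], with tokenisation assumed to be done beforehand.

```python
from collections import Counter

def distinct_n(responses, n):
    """Ratio of distinct n-grams to total n-grams over all generated responses.

    responses: list of token lists, e.g. [["not", "bad", "."], ...]
    """
    ngrams = Counter()
    for tokens in responses:
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Example: distinct_n([["not", "bad"], ["not", "bad", "at", "all"]], 1) -> 4/6
```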

4.3 Implementation Details

The Adam algorithm is adopted for optimization during training. All embeddings are 630-dimensional and hidden states are 1024-dimensional. Considering both efficiency and memory size, we restrict the source and target vocabularies to 60k and the batch size to 32. Chinese word segmentation is performed on the Chinese conversational data. For the single-block memory components, the number of cells in the memory block is empirically set to 1024, and the dimension of each cell is adjusted according to the encoder.



The memory block is further divided into 32 parts in our improved memory model. In the multilingual models, the numbers of blocks for the shared and private memory components are the same. To prevent the multilingual model from favoring one particular language, we alternate batches of the two languages during training, as sketched below.
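A minimal sketch of this alternating schedule under the joint objective of Eq. (7); `model(batch, lang=...)` is an assumed interface that returns the negative log-likelihood of one language's batch, with gradients flowing into that language's encoder, decoder and private memory as well as the shared memory.

```python
def train_multilingual(model, optimizer, loader_lang1, loader_lang2):
    """Alternate one lang1 batch and one lang2 batch per step so that
    neither language dominates the updates to the shared memory."""
    for batch1, batch2 in zip(loader_lang1, loader_lang2):
        for lang, batch in (("lang1", batch1), ("lang2", batch2)):
            optimizer.zero_grad()
            loss = model(batch, lang=lang)   # per-batch NLL term of Eq. (7)
            loss.backward()
            optimizer.step()
```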

4.4 Comparisons

We compare our framework with the following methods:

– Seq2Seq. A Seq2Seq model with an attention mechanism.
– MemSeq2Seq. The key-value memory augmented Seq2Seq model in Section 3.2.
– ImpMemSeq2Seq. The improved memory augmented Seq2Seq model with the memory decomposed into several blocks, as in Section 3.2.
– SPMem. The proposed multilingual model with shared-private memory, extended from MemSeq2Seq. It is a special case of SPImpMem where the number of memory blocks n is set to 1.
– SPImpMem. The proposed multilingual model with shared-private memory, extended from ImpMemSeq2Seq, as in Section 3.3.

5 Results and Analysis

We present the evaluation results of the balanced and unbalanced tests in Tables 1 and 2, respectively. Table 1 includes the results of the monolingual dialogue systems, with Seq2Seq, MemSeq2Seq and ImpMemSeq2Seq as baselines. Table 2 should be viewed in conjunction with the data in Table 1.

5.1 Monolingual Models

From Table 1, we observe that the MemSeq2Seq model only slightly outperforms the Seq2Seq model. However, with the memory decomposed into several parts, the ImpMemSeq2Seq model surpasses the basic Seq2Seq model. We therefore conclude that our modification of the memory components improves the capability of the model. Another observation is that in English a good conversation model can be trained with less data; hence the English models do not gain significantly as the data size increases.

5.2 Multilingual Models

Balanced Test. From the experimental results shown in Table 1, we observe that the proposed multilingual models outperform the monolingual baselines on English corpora of all sizes. For the Chinese corpus, the improvement diminishes as the size of the training data increases, and is thus only visible at the smaller sizes (i.e., 100k and 400k). Similar results were also observed in [29]. There are several interpretations of this phenomenon.


Dataset                  Lang  Seq2Seq  MemSeq2Seq  ImpMemSeq2Seq  SPMem  SPImpMem
100k                     Zh    0.485    0.478       0.492          0.524  0.549
100k                     En    0.779    0.780       0.825          0.831  0.805
400k                     Zh    2.024    2.144       2.142          2.565  2.317
400k                     En    1.001    0.937       0.974          0.976  1.034
1m                       Zh    3.135    3.131       3.268          2.732  2.827
1m                       En    1.027    1.100       1.099          1.135  1.146
all (4.4m-Zh, 2.1m-En)   Zh    3.600    3.383       3.755          2.955  2.765
all (4.4m-Zh, 2.1m-En)   En    1.082    1.140       1.208          1.254  1.336

Table 1. BLEU-4 scores of the balanced test. Seq2Seq, MemSeq2Seq and ImpMemSeq2Seq are the monolingual baselines; SPMem and SPImpMem are the multilingual models, included for comparison.

Dataset          Lang  SPMem  SPImpMem
100k-Zh, 1m-En   Zh    1.442  1.484
100k-Zh, 1m-En   En    1.083  1.075
1m-Zh, 100k-En   Zh    2.607  2.690
1m-Zh, 100k-En   En    0.800  0.839

Table 2. BLEU-4 scores of the unbalanced test. Numbers in bold indicate the best performance among all models trained with the corresponding dataset.

1) Through the shared memory component in the proposed multilingual model, common features are learned and transferred between the two languages. Thus, when one language's corpus is insufficient, common features from the other language are helpful. 2) As the scale of the corpus increases, the monolingual model is already capable enough that noisy information from the other language may hinder the original system.

Nevertheless, the contrasting behaviors of the multilingual model on the Chinese and English corpora call for explanation. As the scale of the training data grows, SPMem and SPImpMem keep outperforming the monolingual baselines on the English corpus, while their performance decreases on the Chinese corpus. This may result from the varying quality of the different corpora, which in turn influences the features in the shared memory blocks. The Chinese monolingual model, whose parameters are already well estimated, is hindered by noise from the shared memory, whereas the English monolingual model, which is relatively poorly trained, benefits from the shared features. In short, the higher-quality corpus needs multilingual training less. Our model focuses on the scenario in which the corpus of one language is scarce.

Unbalanced Test. Since Table 2 shows that models benefit considerably from multilingual training when training data is scarce, we present more detailed evaluation results of the models trained with the 100k datasets in Tables 3 and 4. It is clear that, with the help of another resource-rich language corpus, the multilingual model improves the performance of the language with limited training data on the automatic evaluation metrics except Distinct-1. The improvements also hold when the unbalanced test results are compared with the balanced test results, in which the other language corpus has the same limited size. According to the human evaluation results, SPMem and SPImpMem generate more informative and interesting responses (+2 responses) but perform worse on +1 responses due to grammatical errors.


Dataset (100k-En)        BLEU-1  BLEU-2  BLEU-3  BLEU-4  Distinct-1  Distinct-2  0     +1    +2    Kappa
Seq2Seq                  8.513   3.331   1.566   0.779   0.019       0.077       0.36  0.60  0.04  0.54
MemSeq2Seq               8.613   3.378   1.576   0.780   0.017       0.061       0.38  0.59  0.03  0.58
ImpMemSeq2Seq            9.178   3.623   1.682   0.825   0.018       0.070       0.37  0.58  0.05  0.47
SPMem (with 100k-zh)     8.844   3.520   1.651   0.831   0.017       0.071       0.56  0.41  0.03  0.43
SPImpMem (with 100k-zh)  9.048   3.573   1.637   0.805   0.020       0.080       0.52  0.44  0.04  0.58
SPMem (with 1m-zh)       10.669  3.686   1.580   0.800   0.007       0.124       0.31  0.62  0.07  0.46
SPImpMem (with 1m-zh)    9.118   3.648   1.701   0.839   0.025       0.103       0.52  0.33  0.15  0.51

Table 3. Evaluation results of models trained with the 100k English dataset.

Dataset (100k-Zh)        BLEU-1  BLEU-2  BLEU-3  BLEU-4  Distinct-1  Distinct-2  0     +1    +2    Kappa
Seq2Seq                  9.463   3.035   1.168   0.485   0.026       0.113       0.47  0.45  0.08  0.74
MemSeq2Seq               9.210   2.859   1.101   0.478   0.018       0.073       0.43  0.43  0.14  0.74
ImpMemSeq2Seq            9.682   3.041   1.164   0.492   0.021       0.094       0.54  0.34  0.12  0.65
SPMem (with 100k-en)     10.329  3.132   1.213   0.524   0.026       0.114       0.52  0.33  0.15  0.70
SPImpMem (with 100k-en)  9.978   3.136   1.242   0.549   0.026       0.114       0.32  0.53  0.15  0.76
SPMem (with 1m-en)       11.940  4.193   2.211   1.442   0.017       0.148       0.55  0.27  0.18  0.82
SPImpMem (with 1m-en)    11.991  4.179   2.236   1.484   0.019       0.164       0.53  0.23  0.24  0.85

Table 4. Evaluation results of models trained with the 100k Chinese dataset.

Fig. 3. A 2-dimensional PCA projection of the input block in the memory networks. The two private memory blocks are differently oriented, and the shared memory block tends to be a mixture of them. The curves above and to the right show more details of the distributions along the two axes.

Fleiss' kappa for all models is larger than 0.4, indicating acceptable agreement among the human annotators. Therefore, some features captured by the shared memory from one language can be efficiently utilized by the other language.

5.3 Model Analysis

To illustrate the information stored in the memory components, Figure 3 visualizes the first input block of each memory, namely the two private and one shared memory components. From the scatter diagram and the fitted Gaussian distributions, we observe some characteristics of the memory layer. Tuned explicitly by each separate language, the two private memory blocks learn and store different features, which appear to be distributed differently in the two dimensions after principal component analysis (PCA) projection.


Nevertheless, the shared memory, which is jointly updated by the two languages, is likely to keep some common features of each private memory block.

6 Conclusion

This paper proposes a multi-task learning architecture with shared-private memory for multilingual open-domain dialogue generation. The private memory is occupied by each separate language, while the shared memory is expected to capture and transfer common dialogue features among different languages by exploiting non-parallel corpora. To expand the capacity of the vanilla memory network, the entire memory is further divided into individual blocks. Experimental results show that our model outperforms separately learned monolingual models when the training data is limited.

Acknowledgments

We thank the anonymous reviewers for their insightful comments on this paper. This work was supported by the National Key Research and Development Program of China (No. 2017YFC0804001) and the National Science Foundation of China (NSFC Nos. 61876196 and 61672058).

References

1. Chen, H., Liu, X., Yin, D., Tang, J.: A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explorations 19(2), 25-35 (2017)
2. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to Sequence Learning with Neural Networks. In: Neural Information Processing Systems, pp. 3104-3112 (2014)
3. Serban, I.V., Sordoni, A., Bengio, Y., Courville, A.C., Pineau, J.: Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In: National Conference on Artificial Intelligence, pp. 3776-3783 (2016)
4. Serban, I.V., Lowe, R., Henderson, P., Charlin, L., Pineau, J.: A Survey of Available Corpora for Building Data-Driven Dialogue Systems. arXiv: Computation and Language (2015)
5. Li, Y., Su, H., Shen, X., Li, W., Cao, Z., Niu, S.: DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In: International Joint Conference on Natural Language Processing, pp. 986-995 (2017)
6. Heigold, G., Vanhoucke, V., Senior, A.W., Nguyen, P., Ranzato, M., Devin, M., Dean, J.: Multilingual Acoustic Models Using Distributed Deep Neural Networks. In: International Conference on Acoustics, Speech, and Signal Processing (2013)
7. Dong, D., Wu, H., He, W., Yu, D., Wang, H.: Multi-Task Learning for Multiple Language Translation. In: International Joint Conference on Natural Language Processing (2015)
8. Kaiser, L., Gomez, A.N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., Uszkoreit, J.: One Model to Learn Them All. arXiv preprint arXiv:1706.05137 (2017)
9. Wallace, R.S.: The Anatomy of ALICE. In: Parsing the Turing Test, pp. 181-210. Springer, Dordrecht (2009)
10. Ritter, A., Cherry, C., Dolan, W.B.: Data-Driven Response Generation in Social Media. In: Empirical Methods in Natural Language Processing, pp. 583-593 (2011)
11. Sordoni, A., Galley, M., Auli, M., et al.: A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In: North American Chapter of the Association for Computational Linguistics, pp. 196-205 (2015)
12. Mei, H., Bansal, M., Walter, M.R.: Coherent Dialogue with Attention-Based Language Models. In: National Conference on Artificial Intelligence, pp. 3252-3258 (2017)
13. Yan, R.: "Chitty-Chitty-Chat Bot": Deep Learning for Conversational AI. In: International Joint Conferences on Artificial Intelligence (2018)
14. Weston, J., Chopra, S., Bordes, A.: Memory Networks. In: International Conference on Learning Representations (2015)
15. Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: End-to-End Memory Networks. In: Neural Information Processing Systems, pp. 2440-2448 (2015)
16. Miller, A.H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., Weston, J.: Key-Value Memory Networks for Directly Reading Documents. In: Empirical Methods in Natural Language Processing, pp. 1400-1409 (2016)
17. Bordes, A., Boureau, Y., Weston, J.: Learning End-to-End Goal-Oriented Dialog. In: International Conference on Learning Representations (2017)
18. Madotto, A., Wu, C., Fung, P.: Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems. In: Meeting of the Association for Computational Linguistics, pp. 1468-1478 (2018)
19. Wu, C., Madotto, A., Winata, G.I., Fung, P.: End-to-End Dynamic Query Memory Network for Entity-Value Independent Task-Oriented Dialog. In: International Conference on Acoustics, Speech, and Signal Processing (2018)
20. Caruana, R.: Multitask Learning. Machine Learning 28(1), 41-75 (1997)
21. Collobert, R., Weston, J.: A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In: Proceedings of the 25th International Conference on Machine Learning, pp. 160-167 (2008)
22. Ruder, S.: An Overview of Multi-Task Learning in Deep Neural Networks. arXiv preprint arXiv:1706.05098 (2017)
23. Liu, P., Qiu, X., Huang, X.: Adversarial Multi-task Learning for Text Classification. In: Meeting of the Association for Computational Linguistics, pp. 1-10 (2017)
24. Bahdanau, D., Cho, K., Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate. In: International Conference on Learning Representations (2015)
25. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention Is All You Need. In: Neural Information Processing Systems, pp. 5998-6008 (2017)
26. Papineni, K., Roukos, S., Ward, T., Zhu, W.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Meeting of the Association for Computational Linguistics, pp. 311-318 (2002)
27. Li, J., Galley, M., Brockett, C., et al.: A Diversity-Promoting Objective Function for Neural Conversation Models. In: North American Chapter of the Association for Computational Linguistics, pp. 110-119 (2016)
28. Wu, Y., Wu, W., Li, Z., Xu, C., Yang, D.: Neural Response Generation with Dynamic Vocabularies. In: National Conference on Artificial Intelligence, pp. 5594-5601 (2018)
29. Firat, O., Cho, K., Bengio, Y.: Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. In: North American Chapter of the Association for Computational Linguistics, pp. 866-875 (2016)