

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

Bilateral Personalized Dialogue Generation withContrastive Learning

Bin Li, Hanjun Deng

Abstract—Generating personalized responses is one of the major challenges in natural human-robot interaction. Current research in this field mainly focuses on generating responses consistent with the robot's pre-assigned persona, while ignoring the user's persona. Such responses may be inappropriate or even offensive, which may lead to a bad user experience. Therefore, we propose a Bilateral Personalized Dialogue Generation (BPDG) method for dyadic conversation, which integrates user and robot personas into dialogue generation by designing a dynamic persona-aware fusion method. To bridge the gap between the learning objective function and the evaluation metrics, the Conditional Mutual Information Maximum (CMIM) criterion is adopted with contrastive learning to select the proper response from the generated candidates. Moreover, a bilateral persona accuracy metric is designed to measure the degree of bilateral personalization. Experimental results demonstrate that, compared with several state-of-the-art methods, the final results of the proposed method are more personalized and consistent with bilateral personas in terms of both automatic and manual evaluations.

Index Terms—Bilateral persona-consistent, conditional mutual information maximum, contrastive learning, personalized dialogue generation.

I. INTRODUCTION

ONE of the major challenges in human-robot interaction is to develop an intelligent agent that generates natural, personalized, information-rich, and consistent responses [1]. For this purpose, dialogue agents have to learn to express personalized information appropriately, like humans. Currently, personalized dialogue agents have been widely applied in various human-robot interaction scenarios, such as intelligent personal assistants [2], public service robots [3], wearable devices [4], etc. Agents with personalization are considered reliable and trustworthy, and can gain the user's confidence and trust [5].

In the past decades, personalization has played an important role in dialogue systems and attracted wide attention [6]–[10]. According to the different ways of modeling personalized information, existing personalized dialogue systems are mainly categorized into two types: implicit personalization [7], [8] and explicit personalization [9], [10].

The implicit personalized dialogue system models personas with unstructured natural language utterances (e.g., "I am a musician.", "I like to play the guitar."), where the persona is implicitly mapped into hidden state vectors. However,

Bin Li is with the College of Electrical and Information Engineering, and with the Key Laboratory of Visual Perception and Artificial Intelligence of Hunan Province, Hunan University, Changsha, 410082, China ([email protected]).

General GPT2 model with unilateral persona:
User: Where are you from?
Robot: I am from Guangzhou.
User: Come to my city and play with me?
Robot: Yes, I want to go to Guangzhou.

Our model with bilateral personas:
User: Where are you from?
Robot: I come from Guangzhou.
User: Come to my city and play with me?
Robot: Yes, I want to go to Beijing.

User's personalized attributes: Gender: Female; Area: Beijing; Interests: Shopping, Sport.
Robot's personalized attributes: Gender: Male; Area: Guangzhou; Interests: Cooking, Movie.

Fig. 1. Exemplar dialogues with/without bilateral persona consistency in dyadic conversation. The general GPT2 model with a unilateral persona generates a response that only meets the robot's persona and ignores the persona of the other party. The proposed method incorporates bilateral personas and generates a response that matches the personas of both parties.

these implicit space mapping methods are poor in interpretability and may over-fit during the training process. Besides, since the given utterances are mostly short and limited to only a few personas, the model may fail to utilize the persona properly when generating responses [11]. Indeed, an implicit personalized dialogue corpus [7] that reflects personas in every response also differs from the way of interpersonal conversation.

The explicit personalized dialogue system models the personas with structured personalized attributes, which are explicitly formatted as key-value pairs (e.g., <Gender, Female>, <Area, Beijing>). Such explicit persona modeling is more straightforward and interpretable. Specifically, the explicit personalized dialogue corpora [9], [10] are crawled on a large scale from social networking sites, such as Weibo1, where people may unintentionally show their personality during the conversation. However, the explicit personalization in [9] models the robot's persona in the form of a pre-assigned profile and only emphasizes unilateral persona consistency. The latest works [10], [12] incorporate the structured speaker's profile into the generated response to ensure the persona consistency of the speaker. Although these methods solve the problem of unilateral persona consistency to some extent, the robot may ignore the user's persona during the conversation. As a result, the generated responses may conflict with the

1 https://www.weibo.com

arXiv:2106.07857v3 [cs.CL] 21 Sep 2021


user's personalized information.

In dyadic interpersonal conversations, both interacting parties know each other's personalized information [13]. When responding, the speaker should not only focus on their own personalized expression, but also consider the questions and persona of the other party [14]. As shown in Fig. 1, during the conversation, the robot should generate responses consistent with its own personalized attributes (i.e., unilateral persona-consistent). Furthermore, the robot should also know the user's persona and generate responses consistent with the user's personalized attributes (i.e., bilateral persona-consistent). Once these factors are ignored, the robot may annoy the user and reduce the user experience.

To solve the above problem, we propose a Bilateral Personalized Dialogue Generation (BPDG) method to generate responses consistent with both personas. Specifically, the BPDG method is based on a language model structure with multi-task transfer learning. The proposed method optimizes three tasks simultaneously: the language model task, the persona presence prediction task, and the dialogue generation task. In the language model task, the dialogue utterances, embedded with the corresponding personalized attributes and relative positions, are used to train the encoder. In the persona presence prediction task, the dialogue contextual encoding is used to predict the probabilities of the personas' presence in the response. More precisely, the encodings of the dialogue context, the bilateral personas, and the right-shifted outputs are fused with four different fusion methods (i.e., add, max, mean, and dynamic) to demonstrate the importance of bilateral personas. In the dialogue generation task, the fused encoding is input into the decoder to generate response candidates with the diverse beam search strategy [15]. Finally, to ensure that the generated responses are more personalized and bilateral persona-consistent, we adopt the Conditional Mutual Information Maximum (CMIM) criterion with contrastive learning to select the final response from the diversified generated candidates. Thus, the proposed BPDG method can utilize bilateral personalized information to generate personalized and bilateral persona-consistent responses for a better user experience in human-robot interaction.

The main contributions of this article can be summarized as follows.

1) We propose a novel BPDG method, which integrates the bilateral personas to generate responses consistent with both personas. To the best of our knowledge, this is the first work to propose bilateral persona consistency in personalized dialogue generation.

2) A dynamic persona-aware fusion module is developed to adaptively control the encodings of the bilateral personalized information, the dialogue context, and the shifted-right outputs for decoding to generate bilateral persona-consistent responses.

3) We adopt the CMIM criterion with contrastive learning, which bridges the gap between the learning objective and the evaluation metrics.

4) Both automatic and manual evaluations show that our method outperforms state-of-the-art methods.

The remainder of this article is structured as follows. Section II reviews the work related to personalized dialogue systems. Section III formulates the problem and details the proposed BPDG method. Section IV describes the experimental setups. Automatic and human evaluations are illustrated and analyzed in detail in Section V and Section VI, respectively. Finally, conclusions and possible future work are given in Section VII.

II. RELATED WORK

A. Personalized Dialogue Generation

Inspired by the "Big Five" [16] in psychology, Mairesse et al. [17] took the lead in incorporating personalities into the framework of dialogue generation, thereby generating responses with a recognizable personality. However, the personality of the "Big Five" is extremely implicit and subtle, and rules must be built to capture personality characteristics. Besides, constructing such a corpus is a challenge, as collection is limited and laborious. With the popularity of deep learning, hand-crafted rule modeling has gradually been replaced by data-driven modeling. Li et al. [18] first propose a personalized dialogue generation model, mapping the persona in natural utterances into distributed representation vectors on the seq2seq framework, which benefits from neural machine translation [19]. Subsequently, other methods have been applied to personalized dialogue generation modeling. For example, Song et al. [8] adopt a CVAE method that implicitly learns from responses containing personalized information to generate personalized responses. Madotto et al. [20] design a personalized dialogue generation model with meta-learning. Yang et al. [21] describe an empirical survey of personalized dialogue generation via reinforcement learning. The above methods are effective, but they also face the problem of generating generic or bilaterally inconsistent responses. Different from previous work, the proposed BPDG method further integrates personalized information from both parties into a pre-trained decoder-only framework to generate bilateral persona-consistent responses with multi-task learning and transfer learning.

B. Multi-task Transfer Learning

Multi-task transfer learning aims to extract and transfer knowledge from a source domain to a target domain [22] with different well-designed learning tasks, and has been very popular in the field of NLP in the past decade [23]. Recent advances in natural language generation rely on pre-training a large generative language model on a large corpus of unsupervised data, mainly following the two-stage paradigm of pre-training and fine-tuning. In the field of personalized dialogue generation, Zhang et al. [24] first introduce transfer learning into two-stage personalized dialogue generation. Wolf et al. [25] design a pre-trained dialogue generation model that jointly learns two tasks (i.e., next sentence prediction and language model) when fine-tuning. The experimental results show that multi-task learning can greatly improve the scores on automatic metrics. Golovanov et al. [26] integrate multi-task learning into the transferred model with shared parameters,


[Fig. 2 diagram: the dialogue context (personalized history plus user input) and the bilateral personas pass through word and position embeddings into a 12-block shared encoder; the dynamic persona-aware fusion module combines the user persona encoding, the robot persona encoding, the dialogue context encoding, and the shifted-right outputs encoding before a 12-block decoder with a linear output layer.]

Fig. 2. The overview of the proposed BPDG method.

and design three sub-tasks, including a language model task, a dialogue generation task, and an expected risk task. These tasks are proven to improve performance in human evaluation. Zheng et al. [12] leverage target persona information in generating unilateral persona-consistent responses by designing three different tasks, including the language model, persona routing, and dialogue generation. In this article, apart from the language model task and the dialogue generation task, we further design a persona prediction task for the dynamic persona-aware fusion module, which adaptively fuses the encodings of different information for decoding, to generate responses consistent with bilateral personas.

C. Contrastive Learning

Contrastive learning [27]–[29] uses self-supervised methods to learn the representations of positive and negative examples. The contrastive learning method learns the general features of an unlabeled corpus by teaching the model which data are similar or different. In the field of natural language processing, contrastive learning performs well in tasks such as language modeling [30], image captioning [31], and text summarization [32]. In the field of human-robot interaction, contrastive learning is conducive to capturing the information implicit in the dialogue [33], and it is useful for filling the gap between the learning objective function and the evaluation metrics [32]. Therefore, this paper introduces the conditional mutual information criterion into bilateral personalized dialogue generation. By ranking the candidate responses through contrastive learning, the final outputs can be rich in bilateral personalized information.

III. PROPOSED METHOD

In dyadic interpersonal conversation, both interacting parties have their own personas, such as gender, area, and individual interests. Such information may be presented in the response. In the human-robot dialogue, given the user persona U, the robot persona R, the personalized history H, and the user input X, the robot generates a natural, fluent, and personalized response Y, which can be formulated as follows:

Y = argmax_{Y′} P(Y′ | X, H, U, R)    (1)

where the user persona U and the robot persona R can be represented with the personal profile, which is formatted as a set of attributes composed of key-value pairs. Each attribute in the user persona U = {u1, u2, . . . , um} is a key-value pair ui = ⟨ki, vi⟩. The robot persona R is represented likewise. The personalized history is represented as H = {{XU_1, U}, {XR_2, R}, . . . , {XR_l, R}}, where the superscript indicates the speaker and the subscript indicates the number of the dialogue round. Each sentence is associated with the persona of the corresponding speaker. The user input X = {XU_{l+1}, U} contains the user's current input XU_{l+1} with the user persona U.

Combining the user input X and the personalized history H into the dialogue context C, equation (1) can be further written as equation (2):

Y = argmax_{Y′} P(Y′ | C, U, R)    (2)

where the dialogue context C = ⟨H, X⟩ represents the personalized history H concatenated with the current user input X.
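The formulation above can be mirrored with plain containers. The following is a minimal sketch, assuming personas are stored as key-value dictionaries and every history utterance is paired with its speaker's persona; all names and values here are illustrative, not part of the paper:

```python
# Hypothetical plain-Python containers mirroring the formulation: personas U and R
# are sets of key-value attributes, and each history utterance is paired with the
# persona of its speaker.
U = {"Gender": "Female", "Area": "Beijing", "Interests": "Shopping; Sport"}
R = {"Gender": "Male", "Area": "Guangzhou", "Interests": "Cooking; Movie"}

H = [
    ("Where are you from?", U),     # {X_1, U}: user turn
    ("I come from Guangzhou.", R),  # {X_2, R}: robot turn
]
X = ("Come to my city and play with me?", U)  # current user input at round l + 1

C = H + [X]  # dialogue context C = <H, X>: history concatenated with the input
```

A response selected under this formulation can then be scored against both U and R, which is what the bilateral persona accuracy metric measures.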

Fig. 2 shows the overview of the proposed BPDG method. The BPDG method consists of the encoder, the dynamic persona-aware fusion module, and the decoder. Following the GPT2 framework, the encoder and decoder share the same weights and act as a backbone to learn sentence representations. The encoder trains the language model with the dialogue context embedding and encodes the embedding of the user


[Fig. 3 diagram: for the user input and robot output, the utterance embeddings, gender embeddings, tag embeddings, area embeddings, and position embeddings are summed token-wise to form the history embeddings, with <SEP> separating utterances.]

Fig. 3. The structure of personalized history embeddings.

persona and the robot persona independently. The persona-aware fusion module is used for fusing the dialogue context encoding, the bilateral persona encodings, and the shifted-right outputs encoding. Afterwards, the fused encoding is sent into the decoder to generate several candidate responses with the diverse beam search strategy. Finally, the CMIM criterion is adopted to output a personalized and bilateral persona-consistent response.
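The final selection step can be sketched as re-ranking candidates by an approximate conditional pointwise mutual information. In the sketch below, `log_p_cond` and `log_p_free` are hypothetical stand-ins for a persona-conditioned scorer and a persona-free scorer; the paper's exact CMIM estimator and contrastive training may differ:

```python
def cmim_rerank(candidates, log_p_cond, log_p_free, lam=1.0):
    """Rank candidates by log p(y | C, U, R) - lam * log p(y | C): responses that
    become much more likely once the bilateral personas are given rank first."""
    return sorted(candidates,
                  key=lambda y: log_p_cond(y) - lam * log_p_free(y),
                  reverse=True)

# Toy log-probability tables standing in for the two scorers.
cond = {"I often go shopping.": -1.5, "I do not know.": -1.0}
free = {"I often go shopping.": -3.0, "I do not know.": -0.5}
ranked = cmim_rerank(list(cond), cond.__getitem__, free.__getitem__)
```

The generic response is likely under both scorers, so its score is low; the persona-bearing response gains most from conditioning and is ranked first.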

A. Dialogue Context Modeling

Dialogue context modeling means that each dialogue utterance embedding is added with the corresponding persona embedding and relative position embedding to obtain the embeddings of the personalized history. The dialogue context embedding can be obtained by concatenating the embeddings of the personalized history and the current user input. The dialogue context encoding is obtained by encoding the dialogue context embedding. The process can be described as follows:

1) Utterance Embedding: The utterances of the user and the robot are first embedded with word embeddings, respectively. XU represents the embedded user input, and XR represents the embedded robot output. Both embeddings are specified with the same length n. If the corresponding length does not reach the specified length, we use <PAD> as a placeholder. Otherwise, a truncation operation is taken. The word embedding process is shown as follows:

XU = {xU_1, xU_2, xU_3, . . . , xU_n}    (3)

XR = {xR_1, xR_2, xR_3, . . . , xR_n}    (4)

where XU is the embedding of the user input, xU_i is the word embedding of the i-th token in the sentence input by the user, XR is the embedding of the robot response, and xR_i is the word embedding of the i-th token in the sentence output by the robot.
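The fixed-length step above can be sketched as a small helper; the function name is illustrative, and only the pad-or-truncate behavior comes from the text:

```python
PAD = "<PAD>"

def fix_length(tokens, n):
    """Fix an utterance to the specified length n: truncate if too long,
    otherwise pad with <PAD> placeholders (Sec. III-A.1)."""
    return tokens[:n] if len(tokens) >= n else tokens + [PAD] * (n - len(tokens))

xu = fix_length(["where", "are", "you", "from", "?"], 8)  # padded to 8 tokens
```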

2) Persona Embedding: Persona embedding means the utterances are embedded with the corresponding persona attributes. As mentioned before, the profile consists of three attributes: gender, area, and individual interests. The value of the gender is binary (i.e., 0 for male and 1 for female). The value of the area is represented by the index of the corresponding item in a look-up table. The items of the look-up table are sorted by the occurrence frequency of the area in the corpus. The individual interests are represented in a similar way. Taking the operation for the user as an example, the process is shown in equation (5):

GU = {gU_1, gU_2, . . . , gU_j, . . . , gU_n | gU_j = gU}
AU = {aU_1, aU_2, . . . , aU_j, . . . , aU_n | aU_j = aU}
TU = {tU_1, tU_2, . . . , tU_j, . . . , tU_n | tU_j = tU}    (5)

where gU represents the word embedding of the user's gender extracted from the profile, and gU_j represents the gender embedding gU corresponding to position j in the user input embedding XU, j ∈ [1, n]. aU and tU represent the word embeddings of the user's area and individual interests tag extracted from the profile, respectively. For multiple individual interests, we take the average of the first three embeddings of individual interests.
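Equation (5) repeats one attribute embedding at every token position, and multiple interests are averaged. A minimal numpy sketch, with toy 4-dimensional vectors standing in for learned embeddings:

```python
import numpy as np

def persona_row(attr_vec, n):
    """Broadcast one attribute embedding to all n token positions,
    as in Eq. (5) where gU_j = gU for every j in [1, n]."""
    return np.tile(attr_vec, (n, 1))

def interests_embedding(interest_vecs):
    """Average the embeddings of the first three individual interests."""
    return np.mean(np.stack(interest_vecs[:3]), axis=0)

gU = np.ones(4)            # toy gender embedding
GU = persona_row(gU, n=5)  # one identical row per token position
tU = interests_embedding([np.zeros(4), np.ones(4), 2 * np.ones(4)])
```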

The relative position embedding [34] is adopted to make the embedded tokens more sensitive to their position in the sentence for the further attention operation. The position embedding is written as follows:

Ei(2k) = sin(i / 10000^(2k/dmodel))
Ei(2k + 1) = cos(i / 10000^(2k/dmodel))    (6)

where i is the position of the token in the sentence, k represents the k-th dimension of the word embedding, and dmodel is the fixed embedding dimension.
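Equation (6) is the standard sinusoidal position embedding; a numpy sketch (assuming an even dmodel so the sine and cosine halves align):

```python
import numpy as np

def position_embedding(seq_len, d_model):
    """Sinusoidal position embedding of Eq. (6):
    E_i(2k) = sin(i / 10000^(2k/d_model)), E_i(2k+1) = cos(i / 10000^(2k/d_model)).
    Assumes d_model is even."""
    E = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]                          # token position i
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    E[:, 0::2] = np.sin(pos / div)   # even dimensions
    E[:, 1::2] = np.cos(pos / div)   # odd dimensions
    return E

E = position_embedding(10, 8)
```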

3) Personalized History Embeddings: Fig. 3 shows the structure of the personalized history embeddings. The personalized history embeddings are a combination of the aforementioned three types of embeddings, i.e., the utterance embeddings, the persona embeddings, and the position embeddings, with <SEP> used as the separator. Specifically, the personalized history embeddings are formatted utterance by utterance with concatenation, which can be written as equation (7).

H = Concat{h1, h2, . . . , hj, . . . , hl}

hj = { DU, if mod(j, 2) = 0
       DR, if mod(j, 2) = 1 },  j ∈ [1, l]    (7)

where Concat{} represents the concatenation operation, l represents the total number of rounds of the personalized history, and hj represents the personalized history of the j-th round, j ∈ [1, l]. For each utterance, the personalized history embeddings are calculated by aligning the embeddings by token and performing token-wise aggregation. This process can be expressed as follows:

DU = Add(XU, GU, AU, TU, E)    (8)

DR = Add(XR, GR, AR, TR, E)    (9)

where Add() represents the token-wise addition of the different embeddings with the same embedded length.
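Equations (7)–(9) combine same-shape matrices by token-wise addition and then concatenate the rounds. A numpy sketch with toy shapes (n = 6 tokens, d = 16 dimensions):

```python
import numpy as np

def add_embeddings(*embeds):
    """Token-wise Add() of Eqs. (8)-(9): sum same-shape embedding matrices."""
    return np.sum(np.stack(embeds), axis=0)

rng = np.random.default_rng(0)
n, d = 6, 16  # toy sequence length and embedding dimension
XU, GU, AU, TU, E = (rng.standard_normal((n, d)) for _ in range(5))
DU = add_embeddings(XU, GU, AU, TU, E)                                 # Eq. (8)
DR = add_embeddings(*(rng.standard_normal((n, d)) for _ in range(5)))  # Eq. (9)

# Eq. (7): the personalized history is the utterance-wise concatenation.
H = np.concatenate([DU, DR], axis=0)
```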


[Fig. 4 diagram: the user persona encoding EU, the robot persona encoding ER, and the dialogue context encoding EC each attend to the shifted-right outputs encoding Eprev through unmasked multi-head attention, while Eprev passes through masked multi-head attention; the persona presence prediction weights α, β, γ scale OU, OR, and OC, which are fused with Oprev into Oenc.]

Fig. 4. The structure of the dynamic persona-aware fusion module.

4) Dialogue Context Embedding: The personalized history embeddings and the user's current input at the (l + 1)-th round are concatenated into the dialogue context embedding C, which can be expressed as follows:

C = Concat{H, h_{l+1}}    (10)

Finally, the dialogue context encoding EC is obtained after the dialogue context embedding C is encoded.

B. Bilateral Profile Modeling

To take advantage of the bilateral personas in dialogue generation, the explicit form of persona, i.e., the profile, is used in the proposed method. Word embedding is performed on the profile text to represent the semantic information in the same way as for the utterances, which benefits further processing. Specifically, the word embeddings of the user persona U and the robot persona R can be written as follows:

U = {u1,u2,u3 | ui = {s,v}, i = 1, 2, 3} (11)

R = {r1, r2, r3 | ri = {s′,v′} , i = 1, 2, 3} (12)

where each attribute ui in the embedded user persona U is the word embedding of the key-value pair. The embedded user persona U is the concatenation of the three attributes corresponding to gender, area, and individual interests, respectively. The comma is used as the separator to concatenate each key-value pair. The embedded robot persona R is formatted likewise.

Further, the embedded user persona U with the relative position embedding E is input into the encoder to obtain the user persona encoding EU, and the embedded robot persona R is turned into ER in the same way. The above process is implemented independently, which means that EU and ER do not participate in the training of the encoder.

C. Persona-aware Fusion Module

In bilateral personalized dialogue generation, two critical problems have to be addressed for appropriate persona expression: (1) when to express a persona, and (2) whose persona should be expressed. Therefore, we propose dynamic persona-aware fusion to predict the presence of the bilateral personas and adaptively fuse them into the encodings for the subsequent personalized response generation. Fig. 4 shows the structure of the dynamic persona-aware fusion module. Persona-awareness means that the presence of a persona in the generated response can be predicted from the dialogue contextual encoding OC obtained from the attention operation. The prediction probabilities are used to dynamically weight the corresponding attention encodings for fusion.

1) Encoding Attention Mechanism: In order to effectively utilize the information of the encodings, we design different encoding attention mechanisms. Each encoding from the encoder participates in the unmasked multi-head attention mechanism. The masked multi-head attention mechanism is designed to avoid feeding the shifted-right ground-truth tokens during training. prev represents the previously decoded output words, which are turned into the outputs encoding Eprev with word embedding and position embedding. EU is input into the unmasked multi-head attention network to obtain the user personalized encoding OU, and the robot personalized encoding OR is obtained in the same way. The unmasked multi-head attention process is shown as follows:

OU = Multi-head (Eprev, EU , EU ) (13)

OR = Multi-head (Eprev, ER, ER) (14)

where Eprev is the query and EU is both the key and the value in the unmasked multi-head mechanism; the operation for the robot personalized encoding OR is the same.

The context encoding EC and the outputs encoding at the previous moment Eprev are used to obtain the dialogue contextual encoding OC with the unmasked multi-head attention network, as shown in equation (15):

OC = Multi-head (Eprev, EC , EC) (15)

where the Eprev is the query, the EC is both the key and thevalue in the unmasked multi-head mechanism.

Furthermore, a masked multi-head attention network is used to obtain the previous outputs encoding Oprev, as shown in equation (16):

Oprev = MaskedMulti-Head (Eprev, Eprev, Eprev) (16)

where the Eprev is the query, the key, and the value in themasked multi-head mechanism.
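The masked/unmasked distinction in Eqs. (13)–(16) can be illustrated with single-head scaled dot-product attention (a simplification of the multi-head networks, for exposition only):

```python
import numpy as np

def attention(Q, K, V, masked=False):
    """Single-head scaled dot-product attention. masked=False plays the role of
    the unmasked attention in Eqs. (13)-(15); masked=True adds the causal mask
    of Eq. (16) so a position cannot attend to future (shifted-right) tokens."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if masked:
        scores = np.where(np.tril(np.ones_like(scores)) > 0, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # row-wise softmax
    return w @ V

rng = np.random.default_rng(1)
Eprev = rng.standard_normal((4, 8))  # outputs encoding (query)
EU = rng.standard_normal((5, 8))     # user persona encoding (key and value)
OU = attention(Eprev, EU, EU)                        # Eq. (13)
Oprev = attention(Eprev, Eprev, Eprev, masked=True)  # Eq. (16)
```

With the causal mask, the first output position attends only to itself, so the first row of Oprev equals the first row of Eprev.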

2) Persona Presence Prediction: The presence of the bilateral personas in the response is predicted for the dynamic persona-aware fusion of the different encodings. To train a subnetwork for this task, we construct a heuristic script to label each utterance with one of three labels based on the presence of the bilateral personas. The dialogue contextual encoding OC is used to predict the probabilities of the three types of information being presented in the response sentence. The loss function is designed as follows:

LP(θ) = − Σ^3_{j=1} lj log Pθ(l = j | OC)    (17)


where lj represents the label of the different persona types, and log Pθ(l = j | OC) represents the log-probability of the persona type predicted in the generated response based on the dialogue contextual encoding OC.

3) Persona Encoding Fusion: To utilize the personalized information of the different encodings, dynamic encoding fusion is designed to adaptively control the persona presented in the generated response. The probabilities of the three categories are used as the persona-aware weights for dynamic encoding fusion. Specifically, each category probability is obtained with the softmax operation, as shown in equation (18):

Pθ(l = j | OC) = exp(OC^(j)) / ∑_i exp(OC^(i)),   j = 0, 1, 2   (18)

where OC^(j) represents the dialogue contextual encoding OC corresponding to the j-th label, which is obtained with a two-layer perceptron network with global average pooling.

Each prediction probability is defined as a persona-aware weight as follows:

α = Pθ (l = 0 | OC) (19)

β = Pθ (l = 1 | OC) (20)

γ = Pθ (l = 2 | OC) (21)

where α represents the probability that the user personalized information is presented in the response, β represents the probability that the robot personalized information is presented in the response, and γ represents the probability that no personalized information is presented in the response, i.e., the response is context-related. The three different encodings are dynamically weighted and fused with the dialogue contextual encoding OC and the previous outputs encoding Oprev. These encodings together form the fused encoding Oenc, as shown in equation (22):

Oenc = αOU + βOR + (γ + 1)OC +Oprev (22)

where α + β + γ = 1.

After fusing the different encodings with the dynamic persona-aware fusion module, the fused encoding Oenc is input into the decoder for dialogue generation.
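The dynamic persona-aware fusion of equations (18)-(22) can be sketched as follows; the logits and encoding shapes are hypothetical toy values standing in for the outputs of the two-layer perceptron over OC.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical pooled logits over the three persona labels (eq. 18);
# in the paper they come from a two-layer perceptron over O_C.
logits = np.array([1.2, 0.4, -0.3])
alpha, beta, gamma = softmax(logits)   # persona-aware weights, eqs. (19)-(21)

rng = np.random.default_rng(0)
L, d = 4, 8                            # illustrative shapes
O_U, O_R, O_C, O_prev = (rng.standard_normal((L, d)) for _ in range(4))

# Eq. (22): dynamically weighted fusion of the four encodings
O_enc = alpha * O_U + beta * O_R + (gamma + 1) * O_C + O_prev
```

Because the weights come from a softmax, the constraint α + β + γ = 1 holds by construction.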

D. Multi-task Learning for Dialogue Generation

To train the proposed BPDG model, three different tasks have to be accomplished: the language model task, the persona prediction task, and the dialogue generation task. These tasks are described below.

1) Language Model Task: A pre-trained model is first utilized to initialize the parameters of the GPT2 framework. In order to bridge the gap between the data used in the pre-training and fine-tuning stages, the language model is then fine-tuned on the bilateral personalized dialogue dataset described in Section IV-A. The language model is trained by optimizing the standard maximum log-likelihood loss, as shown in equation (23):

LLM(ϕ) = − ∑i log Pϕ(xi | xi−k, . . . , xi−1)   (23)

where ϕ represents the parameters of the language model, k is the size of the context window, and xi−k, . . . , xi−1, xi is a sequence of tokens sampled from the training corpus.

2) Persona Prediction Task: The persona prediction task is to predict the persona presence according to the contextual encoding OC. The loss function is shown in equation (17). The prediction probabilities are then used to dynamically weight the different encodings to obtain the fused encoding Oenc. Finally, Oenc is input into the decoder for dialogue generation.

3) Dialogue Generation Task: The dialogue generation task is designed to generate the bilateral personalized responses; its loss function is shown in equation (24):

LD(ϕ) = − ∑i log Pϕ(yi | y0, . . . , yi−1, EC, EU, ER)
       = − ∑i log Pϕ(yi | Oenc)   (24)

where yi represents the i-th word generated by the decoder, and y0, . . . , yi−1 is the sequence of previously generated words. Identically, the input of the decoder can also be written as the fused encoding.

Finally, the joint loss function of the entire model is presented in equation (25):

L(ϕ, θ) = LD(ϕ) + λ1LLM (ϕ) + λ2LP (θ) (25)

where λ1 and λ2 are the balance weights of the loss function of the language model task and the loss function of the persona prediction task, respectively.
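The joint objective of equation (25) can be sketched as a weighted sum of the three per-task negative log-likelihoods; the per-token probabilities below are hypothetical toy values, while λ1 and λ2 follow the values reported in Section IV-D.

```python
import numpy as np

def nll(token_probs):
    """Negative log-likelihood of a sequence of per-token probabilities."""
    return -np.sum(np.log(token_probs))

# Hypothetical per-token probabilities for the three tasks
p_dialogue = np.array([0.4, 0.6, 0.5])  # P(yi | O_enc), eq. (24)
p_lm = np.array([0.3, 0.7])             # P(xi | x_{i-k..i-1}), eq. (23)
p_persona = np.array([0.8])             # P(l = j | O_C), eq. (17)

lambda1, lambda2 = 0.2, 0.5             # balance weights (Section IV-D)

# Eq. (25): joint multi-task objective
loss = nll(p_dialogue) + lambda1 * nll(p_lm) + lambda2 * nll(p_persona)
```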

E. Candidate Selection with CMIM

After dialogue generation via dynamic persona-aware fusion, the response is output with a decoding strategy. However, the top-ranked candidates from the beam search strategy are usually general, short, or even unrelated [35], so responses related to both personas and history conditions often fail to achieve high ranking scores. To remedy this, the criterion of CMIM [36] is adopted to constrain the personalized and history information reflected in the response. Specifically, the BPDG method utilizes the diverse beam search strategy to generate the most diverse top-20 candidate list, and adopts the CMIM criterion to select the response with the largest conditional mutual information value as the final response.

1) Conditional Mutual Information Modeling: In order to simplify the modeling process, the user persona U, the robot persona R, and the personalized history information H can be regarded as the condition Z. The illustration of conditional mutual information is shown in Fig. 5. Given the conditions, i.e., H, U, R in the same dialogue, the conditional mutual information value CMIv between the user input X and the robot-generated candidate response Yi can be expressed as equation (26):

CMIv(Yi) ≡ I(Yi; X | H, U, R)
         = H(Yi | H, U, R) − H(Yi | X, H, U, R)   (26)
           [relevance ranking]  [dialogue generation]


[Figure: Venn diagram of the information entropies H(X), H(Y), and H(Z), with the conditional mutual information I(Y; X | Z) at the intersection of H(X) and H(Y) given Z.]

Fig. 5. Illustration of conditional mutual information. The circles represent the information entropies of the different variables. The dashed circle represents the information entropy of the generated responses.

where the CMIM criterion can be modeled with two terms, i.e., the dialogue generation term and the relevance ranking term.

According to the definition of the CMI [36], the maximum of equation (26) can be achieved by solving the following optimization problem:

Y∗ = argmax_{Yi} log [P(Yi | X, Z) / P(Yi | Z)]   (27)

where Y∗ represents the final response from the top-20 candidate list. P(Yi | X, Z) and P(Yi | Z) correspond to the dialogue generation term and the relevance ranking term in equation (26), respectively.

P(Yi | X, Z) is the probability of the generated response conditioned on the input and the context at the word granularity, while P(Yi | Z) is the relevance of the response to the contextual content at the sentence granularity. Therefore, P(Yi | X, Z) and P(Yi | Z) in equation (27) are not optimized jointly.

2) Dialogue Generation: P(Yi | X, Z) can be modeled with the BPDG model and calculated with the diverse beam search score. By substituting Z with H, U, R, P(Yi | X, H, U, R) can be written as equation (28):

log P(Yi | X, H, U, R) ≡ log Pψ(Yi | X, EH, EU, ER)
                       = log Pψ(Yi | Oenc)   (28)

where ψ represents the parameters of the trained BPDG model, containing all the parameters in (25).

3) Relevance Ranking with Contrastive Learning: After the candidate list is generated with the diverse beam search strategy, each candidate can be ranked with relevance ranking. Given the condition Z, i.e., the user persona U, the robot persona R, and the personalized history H, the relevance probability is calculated as

log P(Yi | H, U, R) = log [Pφ(Yi, H, U, R) / Pφ(H, U, R)]
                    ∝ log Pφ(Yi, H, U, R)   (29)

where φ represents the parameters of the content relevance classifier trained on the corpus, the co-occurrence probability Pφ(H, U, R) is unrelated to Yi and can thus be omitted, and Pφ(Yi, H, U, R) represents the co-occurrence probability of Yi, H, U, and R in the same dialogue.

Therefore, the relevance probability of each candidate can be modeled with the content relevance classifier Pφ(Yi, H, U, R); we adopt the contrastive learning training method [33] to perform the relevance ranking step. To construct the training corpus for the content relevance classifier, the Y, H, U, and R from the corpus D are used as positive training samples, marked as (Y, H, U, R)+, while Y′, H, U, and R sampled from different dialogues serve as negative samples, marked as (Y′, H, U, R)−, which is inspired by the practice in [37]. The cross-entropy loss function used to train the content relevance classifier φ is as follows:

Lφ = − (1/N) ∑_{(Y,H,U,R)∈D} log P((Y, H, U, R)+; φ)
     − (1/N) ∑_{(Y′,H,U,R)∈D} log[1 − P((Y′, H, U, R)−; φ)].   (30)
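For a binary relevance classifier, the contrastive objective of equation (30) reduces to a cross-entropy over positive tuples (Y, H, U, R)+ and shuffled negatives (Y′, H, U, R)−; below is a minimal sketch with hypothetical classifier probabilities.

```python
import numpy as np

def relevance_loss(pos_probs, neg_probs):
    """Cross-entropy of eq. (30): positive tuples (Y, H, U, R)+ are pushed
    toward probability 1, negative tuples (Y', H, U, R)- toward 0."""
    pos_probs, neg_probs = np.asarray(pos_probs), np.asarray(neg_probs)
    n = len(pos_probs)
    return (-np.sum(np.log(pos_probs)) / n
            - np.sum(np.log(1.0 - neg_probs)) / n)

# Hypothetical classifier outputs P(.; phi) for three dialogues
pos = [0.9, 0.8, 0.95]   # true (response, history, personas) tuples
neg = [0.2, 0.1, 0.3]    # responses paired with other dialogues' conditions
loss = relevance_loss(pos, neg)
```

A classifier that separates positives from negatives more sharply yields a lower loss, which is what training toward equation (30) encourages.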

4) Candidate Selection: With the BPDG model and the content relevance classifier, the optimization problem in (27) can be written as follows.

Y∗ = argmax_{Yi} log [Pψ(Yi | Oenc) / Pφ(Yi, H, U, R)]   (31)

where Yi represents the response candidates.

In order to further tune the relevance of the generated response to the contextual content, an additional penalty parameter λ3 is introduced for the relevance ranking term. A too-large value of λ3 may reduce the relevance of the final response to the user input. Thus, the final response can be selected by equation (32):

Y∗ = argmax_{Yi} [log Pψ(Yi | Oenc) − λ3 log Pφ(Yi, H, U, R)]   (32)

IV. EXPERIMENTS

A. Data Sets Description

To evaluate the effectiveness of the BPDG method, extensive experiments are conducted on the PersonalDialog dataset [10]. This corpus contains sparse multi-party personas, where the personalized responses in dyadic dialogues involve bilateral personalized information. Since it is very challenging to choose which persona to generate from, we pick dyadic dialogues from the original corpus for our research. This dataset provides personalized profiles of both speakers, including three personal attributes, i.e., "Gender", "Area", and "Individual interests".

Since the personalized profiles are missing in some cases of the original corpus, we construct a heuristic script to select the data with complete personalized information of both


parties. The constructed dialogue dataset is referred to as the bilateral personalized dataset in this article. The bilateral personalized dataset consists of 410K dialogues in total, of which 400K are randomly sampled as the training set and the remaining 10K serve as the validation set. Each dialogue spans about 3.5 rounds on average, and each sentence contains about 7.45 characters on average.

The evaluation settings of the ECDT2 are adopted to test the performance of different methods in different contexts. Specifically, two test sets3 (i.e., a random set and a biased set) are constructed for the evaluation. The random set is a collection of dialogues between both parties, most of which do not contain personas. It is constructed for testing the performance of different methods in a context where the two interacting parties do not intentionally show their personas. The biased test set is the dialogue set between both parties with personalized information, where the speaker tends to reveal personalized information about both parties during the conversation. It is used for testing the performance of different methods in a context where the speakers intentionally show their personas.

B. Bilateral Persona Classifier

To better evaluate whether the response is bilateral persona-consistent or not, we design the bilateral persona classifier Pπ as an objective metric, which is trained with the aforementioned personalized labels. Each sentence is labeled with one of three labels: 0 for a sentence related to the persona of the user, 1 for a sentence related to the persona of the robot, and 2 for a sentence that does not contain a persona.

The bilateral persona classifier is used to evaluate whether the response Y contains the user persona U or the robot persona R. To calculate the probability of each category, the response Y is concatenated with the bilateral personas using <SEP>. After each probability is calculated, the probabilities of category 0 and category 1 are added together as the probability of a bilateral personalized response. About 10K rounds of dialogues containing bilateral personas are randomly sampled from the bilateral personalized dataset, where the category ratio is 1:1:3. Then, we divide the above data into training, validation, and test sets at a ratio of 8:1:1 to train the bilateral persona classifier. The accuracy of the classifier on the test set reaches 90.2% in a 5-fold cross-validation setting.

C. Content Relevance Classifier

The content relevance classifier is used for ranking the candidates under the criterion of the CMIM. After the candidate list is generated by the BPDG model, we calculate the content relevance probability of each generated response co-occurring in the current dialogue under the conditions of the personalized history H, the user persona U, and the robot persona R. These conditions and each generated response

2 http://conference.cipsc.org.cn/smp2019/evaluation.html
3 https://worksheets.codalab.org/worksheets/0x8f68b61a8b2249d7b314c6e800e2dace

are concatenated with <SEP> for calculating the content relevance probability. After the probability of each generated response is calculated, the final response is selected for output. Specifically, the content relevance classifier is trained on the bilateral personalized dataset by fine-tuning the ERNIE-base model [38] on the labeled dialogues. In the 5-fold cross-validation setting, the accuracy reaches 80.4%.

D. Implementation Details

We implement all the experiments of bilateral personalized dialogue generation with the pre-trained model called LCCC-base [39], a Chinese pre-trained model based on the GPT2 framework with a vocabulary of 13088 characters, which is used to initialize the parameters of the encoder and decoder with transfer learning. According to [40], shared weights of the encoder and decoder are adopted in this article, as this is beneficial for improving the quality of generated responses. The encoder and decoder include 12 Transformer blocks, with 12 self-attention heads each. The size of the token embedding is 768 and the context window is 512. The parameters are dmodel = 512, n = 64, λ1 = 0.2, λ2 = 0.5, and λ3 = 0.3. The diverse beam search strategy is adopted in the proposed method to generate the candidate list with the BPDG model, where the beam size is set to 20 and the group size is set to 4. The content relevance classifier calculates the relevance probability for each sentence in the candidate list under the criterion of the CMIM, and the final generated response Y∗ is selected for output. The BPDG model is fine-tuned directly on the bilateral personalized dataset for 30 epochs, where the batch size is 64 with gradient accumulation, using the Noam optimization scheduler [41] with 2000 warm-up steps on two NVIDIA 2080 Ti GPUs. All the experimental codes are released at https://github.com/Lireanstar/BPDG4.

E. Compared Methods

Several state-of-the-art baseline methods are compared with ours. These methods are described below:

1) S2S + Atten.: This method applies a three-layer Bi-GRU to project the input text into fixed-size embeddings. Another three-layer GRU utilizes an attention mechanism [42] for response generation. The word embedding parameters of the encoder and decoder are initialized with the pre-trained word vectors5. The parameter weights of the GRU network are initialized with a uniform distribution [-0.05, 0.05]. The model is optimized with the Adam optimization scheduler.

2) Trans.: The Trans. employs the Transformer [34] using the self-attention mechanism to generate responses. The model is initialized with the uniform distribution [-0.02, 0.02] and takes the concatenated dialogue history as input without using personas. We optimize the model with the Noam [41] optimization scheduler.

4 Code and data will be publicly available.
5 https://github.com/Embedding/Chinese-Word-Vectors


3) TTransfo.: The TTransfo. is introduced by [25], optimizing a multi-task objective for training. This model is initialized with the LCCC-base pre-trained model and fine-tuned directly on the bilateral personalized dataset with only the concatenated history. The Noam optimization scheduler is used for training the model with gradient accumulation (with batch size 64).

4) LConv.: The LConv. represents the multi-input model proposed in [43]. This model is initialized with the LCCC-base pre-trained model, which shares the parameters of the encoder and decoder. The model is fine-tuned directly on the bilateral personalized dataset with the concatenated dialogue history. The Noam optimization scheduler is used for training the model with gradient accumulation (with batch size 64).

5) TTransfo.+P: It extends the TTransfo. by incorporating the speaker's persona. When fine-tuning, the contextual dialogues concatenated with the speaker's personalized information are input into the model. The Noam optimization scheduler is implemented for training, where the batch size is set to 64 with gradient accumulation.

6) LConv.+P: It extends the LConv. by incorporating the speaker's persona. When fine-tuning, the contextual dialogues concatenated with the speaker's personalized information are input into the model. The Noam optimization scheduler is implemented for training, where the batch size is set to 64 with gradient accumulation.

7) PWDWP: The PWDWP [12] is initialized with the LCCC-base pre-trained model and fine-tuned on the bilateral personalized dataset. This model incorporates personalized attribute embeddings in the dialogue context for each speaker and devises a persona routing to weigh the persona-related encodings input into the decoder. The Noam optimization scheduler is implemented for training, where the batch size is set to 64 with gradient accumulation. This model is similar to our method and serves as the strong baseline among explicit personalized dialogue systems.

V. AUTOMATIC EVALUATION

In order to fully evaluate the effectiveness of the proposed method compared with the baseline methods, we choose various metrics for the automatic evaluation. In this section, we introduce these metrics and give a detailed analysis of the results.

A. Objective Metrics Introduction

1) Bi-Persona Acc: The Bi-Persona Acc (BPAcc) is used to measure the degree of personalization in the response. We extend the unilateral persona-consistency [12], which indicates that the persona is consistent with the speaker's, to the bilateral persona-consistency. The Bi-Persona Acc represents the bilateral persona classification accuracy of a sentence that is consistent not only with the speaker's persona but also with the persona of the other party. Each generated response and the bilateral personas are input into the bilateral persona classifier to obtain the Bi-Persona Acc. Therefore, we add the user and robot persona classification accuracies together to obtain the probability that the response contains bilateral personalized information. A higher Bi-Persona Acc score means that the generated response is more personalized and more likely to be bilateral persona-consistent.

BPAcc = [Pπ(1) + Pπ(2)] / [Pπ(0) + Pπ(1) + Pπ(2)]   (33)

where Pπ(x) represents the output probability of label x from the bilateral persona classifier.

2) BLEU: The BLEU (bilingual evaluation understudy) [44] is utilized to evaluate the quality of text in translation. In dialogue generation, the BLEU is calculated with the weighted n-gram overlap between the ground-truth response Y and the generated response Y∗. The n-gram calculation is shown in equation (34):

Pn(Y, Y∗) = ∑_k min(Cntclip(k, Y), Cntclip(k, Y∗)) / ∑_k Cnt(k, Y)   (34)

where k traverses all the n-gram candidates, Y and Y∗ represent the ground-truth response and the generated response respectively, Cntclip(k, Y∗) represents the clipped n-gram count in the generated response Y∗, and Cnt(k, Y) represents the n-gram count in the ground-truth response Y. The brevity penalty BP(Y, Y∗) can be calculated as equation (35):

BP(Y, Y∗) = { 1,                if |Y∗| > |Y|
            { e^(1 − |Y|/|Y∗|), if |Y∗| ≤ |Y|   (35)

where |Y∗| represents the length of the generated response and |Y| represents the length of the ground-truth response. The BLEU is calculated as follows:

BLEU = BP(Y, Y∗) · exp(∑_{n=1}^{N} wn log Pn(Y, Y∗))   (36)

where N is set to 2 and the weighting factor wn is set to 1/N; the percentile fraction is set to 1000, the same setting as in NLTK6. The higher the BLEU score, the better the quality of the generated response.
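A minimal BLEU-2 sketch in the standard clipped-precision form of equations (34)-(36) (with N = 2 and wn = 1/2, and without NLTK's smoothing or percentile scaling) could look like this; the token sequences are illustrative:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(reference, candidate):
    """BLEU-2 sketch: clipped n-gram precisions (eq. 34) for n = 1, 2,
    combined with a brevity penalty (eqs. 35-36), wn = 1/2."""
    precisions = []
    for n in (1, 2):
        ref, cand = ngrams(reference, n), ngrams(candidate, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(max(clipped, 1e-9) / max(sum(cand.values()), 1))
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(0.5 * math.log(p) for p in precisions))

ref = "are there any interesting attractions in shanghai".split()
hyp = "what is fun in shanghai".split()
score = bleu2(ref, hyp)
```

A candidate identical to the reference scores 1.0, and shorter candidates are discounted by the brevity penalty.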

3) F1: The F1 score measures the accuracy of the model on the data set compared with the ground truth, and consists of two parts: precision and recall. The precision is the proportion of words in the generated response that are contained in the ground-truth response, and the recall is the proportion of words in the ground-truth response that are contained in the generated response. The calculation of the F1 score is the same as in [45] and can be written as equation (37):

F1 = 2 × (precision · recall) / (precision + recall)   (37)

4) Distinct

6 https://github.com/nltk/nltk


TABLE I
EXPERIMENTAL RESULTS OF FIVE METRICS ON RANDOM TEST SET

Method        BPAcc  BLEU   F1     Distinct  PPL
S2S + Atten.  9.71   3.18   4.32   0.098     87.41
Trans.        11.84  3.27   7.48   0.187     77.55
TTransfo.     15.43  6.33   10.40  0.255     47.48
TTransfo.+P   44.68  6.27   10.11  0.251     52.14
LConv.        12.41  3.38   7.52   0.243     51.32
LConv.+P      51.36  6.02   8.81   0.234     55.04
PWDWP         63.13  10.56  10.47  0.280     60.33
Ours          64.63  11.66  11.30  0.282     61.93
Ours, α=1     87.12  10.64  10.97  0.279     60.55
Ours, β=1     82.42  10.32  10.52  0.274     60.12
Ours, γ=1     52.11  8.34   10.22  0.272     59.92

The Distinct [46] is adopted to measure the average of the numbers of unique unigrams and bigrams contained in the generated responses, divided by the total number of generated words. The equation can be written as follows:

Distinct = (Cnt(Uuni) + Cnt(Ubi)) / (2 × Numtokens)   (38)

where Cnt(Uuni) represents the number of unigrams that are not repeated in the response compared with the ground-truth response, and Numtokens represents the total number of generated words. The higher the Distinct score, the more specific and diverse the generated response.
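A common implementation of equation (38) counts the unique unigrams and bigrams of the generated response itself; this sketch follows that reading rather than the exact de-duplication against the ground truth described above:

```python
def distinct(tokens):
    """Distinct sketch (eq. 38): unique unigrams plus unique bigrams,
    normalized by twice the number of generated tokens."""
    unigrams = set(tokens)
    bigrams = set(zip(tokens, tokens[1:]))
    return (len(unigrams) + len(bigrams)) / (2 * len(tokens))

tokens = "what is fun in shanghai i am in guangzhou".split()
score = distinct(tokens)
```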

5) PPL: The PPL (perplexity) [47] is widely used to measure how well the model predicts the different utterances in the test set. For the ground-truth response Y = {y1, y2, . . . , ym}, the perplexity is calculated by the trained model as equation (39):

Perplexity = exp(− (1/m) ∑_{i=1}^{m} ti)   (39)

ti = { log P(yi) + ε,         if yi ∈ F
     { log(P(unk)/|R|) + ε,   if yi ∈ R   (40)

where F represents the set of words in the frequent vocabulary, R represents the set of words in the rare vocabulary, P(unk) represents the logit of the unknown token predicted by the model, |R| is the number of words in the rare vocabulary, and ε is set to 10−8 to ensure that the logits are not zero.
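Ignoring the rare-vocabulary branch of equation (40), the perplexity of equation (39) reduces to the exponent of the negative mean token log-probability; a sketch with hypothetical per-token probabilities:

```python
import math

def perplexity(token_probs, eps=1e-8):
    """Perplexity sketch (eq. 39): exponent of the negative mean
    log-probability the model assigns to the ground-truth tokens."""
    m = len(token_probs)
    return math.exp(-sum(math.log(p) + eps for p in token_probs) / m)

# Hypothetical per-token probabilities for a ground-truth response
probs = [0.25, 0.5, 0.125, 0.5]
ppl = perplexity(probs)
```

A model that assigns uniform probability 1/k to every token has perplexity k, which is why lower perplexity indicates a sharper, better-fitting model.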

B. Results and Analysis

Table I. and Table II. respectively show the comparison results of the proposed method and the different baseline methods on five metrics, and also present the performance of our method with different persona-aware weights. It can be seen from the results that, compared with the baseline methods, our method is superior on all metrics except the PPL. Note that the PPL score is inconsistent with [12], because they used an external personalized corpus for pre-training, and this pre-training corpus is not open source. Table I. and Table II.

TABLE II
EXPERIMENTAL RESULTS OF FIVE METRICS ON BIASED TEST SET

Method        BPAcc  BLEU   F1     Distinct  PPL
S2S + Atten.  22.42  4.32   4.71   0.116     97.53
Trans.        29.14  6.89   8.07   0.251     79.23
TTransfo.     43.12  15.22  15.93  0.268     49.59
TTransfo.+P   60.25  15.55  16.27  0.260     57.63
LConv.        42.73  10.03  9.09   0.270     56.41
LConv.+P      57.43  12.13  10.02  0.263     58.74
PWDWP         88.19  17.21  16.40  0.285     60.81
Ours          92.14  24.64  18.05  0.287     58.59
Ours, α=1     93.75  23.51  17.97  0.286     61.77
Ours, β=1     91.12  24.72  18.52  0.285     60.11
Ours, γ=1     57.62  17.51  17.01  0.282     62.66

show that we have used the open-source LCCC model for initializing all the baseline models.

Under the same experimental conditions, further conclusions are that: (1) under the same automatic weighting setting, our method is better than the strong baseline method (i.e., PWDWP). On the random set, it outperforms it by 1.5% in BPAcc, 1.1% in BLEU, 0.83% in F1, and 0.2% in Distinct, while on the biased set, it outperforms it by 3.95% in BPAcc, 7.43% in BLEU, 1.65% in F1, and 0.2% in Distinct. Especially on the biased set, our method is superior to the compared baseline methods. This shows that our method can generate more personalized and better responses. (2) Both in Table I. and Table II., the PPL scores in bold (i.e., 47.48 and 49.59) show that the best PPL results are achieved by TTransfo., the method without incorporating personalized information, whereas the methods with personalized information (i.e., TTransfo.+P, LConv.+P, PWDWP, and our method) all obtain higher PPL scores. This indicates that generating responses with personalized information hurts the PPL score. It occurs because the words involving the persona in social conversation are relatively rare. Such words may bring bias and lead to a worse perplexity score, which is in line with the results in [12], [45]. The baseline methods with a lower perplexity score tend to generate more general responses, and thus cannot generate responses that match the bilateral personas. As a result, the BPAcc scores of these baseline methods are relatively low. (3) Compared with the methods without personalized information (i.e., S2S + Atten., TTransfo., and LConv.), the methods with unilateral personalized information (i.e., TTransfo.+P, LConv.+P, and PWDWP) achieve higher BPAcc scores on the two test sets. Moreover, the method with bilateral personalized information (i.e., our method) has a higher BPAcc score on the two test sets than the strong baseline method with unilateral personalized information (i.e., PWDWP).
This indicates the effectiveness of the proposed bilateral persona classifier in evaluating the degree of personalization and bilateral consistency. (4) On the random set, the proposed method outperforms the other baseline methods that only incorporate the unilateral persona in BPAcc (i.e., 87.12 in bold). Similar trends are observed on the biased set (i.e., 93.75


TABLE III
ABLATION RESULTS OF OUR PROPOSED METHOD ON RANDOM TEST SET

Method     BPAcc  BLEU   F1     Distinct  PPL
BPDG       64.63  11.66  11.30  0.282     61.93
w/o LM     40.42  9.21   10.27  0.273     60.17
w/o PAF    47.12  8.06   10.51  0.267     57.34
w/o PreT   42.19  4.71   8.74   0.247     88.89
w/o PEmb   43.24  9.12   11.01  0.270     63.12
w/o CMIM   60.14  8.13   8.83   0.301     61.93

TABLE IV
ABLATION RESULTS OF OUR PROPOSED METHOD ON BIASED TEST SET

Method     BPAcc  BLEU   F1     Distinct  PPL
BPDG       92.14  24.64  18.05  0.287     58.59
w/o LM     69.54  20.96  17.87  0.254     62.50
w/o PAF    77.22  16.95  17.15  0.240     60.34
w/o PreT   74.57  13.56  15.12  0.232     77.52
w/o PEmb   74.41  18.79  17.44  0.249     61.19
w/o CMIM   87.11  17.22  16.12  0.312     58.59

in bold), which indicates that incorporating the other party's personalized information in the decoding process is beneficial for generating more personalized and more bilateral persona-consistent responses. (5) The proposed persona-aware weights (i.e., α, β, and γ) can be used to control the persona presented in the generated response. The results on the two test sets show that, under different context settings, different persona-aware weights improve the effect of personalized response generation. This indicates that the proposed dynamic persona-aware fusion module is beneficial for generating diversified dialogue responses rich in bilateral personalized information.

C. Ablation Study

In order to test the contribution of different modules to the proposed method, several ablation experiments are implemented as follows. (1) Each module of the multi-task setting is deleted respectively, including the language model (w/o LM) and the dynamic persona-aware fusion module (w/o PAF). (2) The pre-trained model is also deleted (w/o PreT) to test the effect of transfer learning. (3) The persona embeddings of the dialogue utterances (w/o PEmb) and the conditional mutual information maximum criterion (w/o CMIM) are deleted respectively to test the effect of different strategies on the BPDG method.

Table III. and Table IV. show the ablation results. From the results, further conclusions can be drawn: (1) the LM module learns the language's semantics from the dialogue context. Removing the LM module hurts the dynamic persona-aware fusion of the BPDG method, and as a result the BPAcc score decreases the most. (2) The PAF module is beneficial for generating more personalized and diversified responses. The different modules of multi-task learning above prove to improve the overall effect of personalized

TABLE V
HUMAN EVALUATION ON THE RANDOM AND BIASED TEST SET. * INDICATES A SIGNIFICANT DIFFERENCE FROM THE BEST RESULT (T-TEST AND P-VALUE < 0.05)

                Sentence Fluency   Bilateral Persona Consistency   Context Consistency
Method          Rand    Biasd      Rand    Biasd                   Rand    Biasd
S2S + Atten.    1.24*   1.14*      0.72*   0.91*                   0.95*   1.12*
Trans.          1.39*   1.42*      0.88*   1.07*                   1.12*   1.38*
TTransfo.       1.41*   1.37*      0.92*   1.19*                   1.04*   1.41
TTransfo.+P     1.42*   1.43*      0.96*   1.27*                   1.05*   1.39*
LConv.          1.55*   1.54*      1.08*   1.42                    1.01*   1.58
LConv.+P        1.70*   1.66*      1.12*   1.45                    1.23    1.60*
PWDWP           1.79    1.82       1.35    1.63*                   1.25*   1.71
Ours            1.82    1.87*      1.41    1.74                    1.27    1.75
Ours, α=1       1.79*   1.89*      1.42    1.80*                   1.18*   1.71*
Ours, β=1       1.81    1.85       1.47    1.72*                   1.16*   1.67
Ours, γ=1       1.76*   1.78*      1.09*   1.32*                   1.24    1.69
Human Resp      1.89    1.90       1.07*   1.78                    1.67    1.88

dialogue generation. (3) The pre-trained language model provides a good parameter initialization for the BPDG method, which helps to improve training efficiency by transferring knowledge from the original domain to the target domain. (4) The PEmb strategy improves the final performance by embedding the personalized attributes into the corresponding dialogue utterances. (5) More importantly, the CMIM criterion is effective in improving the BPAcc, BLEU, and F1 scores, but it may decrease the Distinct scores, which are bolded in Table III. and Table IV. This is because the sorting and selection steps over the candidates may hurt the diversity of the generated responses.

VI. HUMAN EVALUATION

We also perform a human evaluation to test the quality of responses generated by different methods. In this section, we introduce these metrics and give a comprehensive analysis of the results.

A. Subjective Metrics Introduction

The evaluation metrics we choose cover three aspects, as shown below.

1) Sentence fluency: Sentence fluency represents the fluency of the responses generated by different methods.

2) Bilateral persona consistency: Bilateral persona consistency indicates whether the information is consistent with the user's or the robot's personalized information when a response is generated by different methods.

3) Context consistency: Context consistency means whether the response generated by different methods is consistent with the dialogue context.

Three annotators are required to rate the quality of the responses according to the following three rating criteria: (1) +2: the response is not only semantically and grammatically related, but also bilateral persona-consistent. (2) +1: the response satisfies the grammatical rules and can be used as a


User profile: {"Gender": "Male", "Area": "Shanghai", "Interests": "Sports; Movie"}
Robot profile: {"Gender": "Female", "Area": "Guangzhou", "Interests": "Shopping; Makeups"}
History:
  User: What are your recent arrangements?
  Robot: I plan to travel to Nanjing recently.
User's question: Nanjing is not fun, come to visit me?
Ground-truth response: Are there any interesting attractions in Shanghai?
S2S + Atten.: Ok, it's fun.
Trans.: Ok, let's play fun.
LConv.: Ok, let's play together.
LConv. + P: Go to fun attractions together?
TTransfo.: I have a chance to come to play with you!
TTransfo. + P: I will come to Guangzhou.
PWDWP: I will see you in Guangzhou, and play with you!
Ours: What's fun in Shanghai? I'm in Guangzhou.
Ours, α = 1: What's fun in Shanghai?
Ours, β = 1: Why are you not so interesting about Nanjing? I'm in Guangzhou.
Ours, γ = 1: Are there any fun places to travel?

Fig. 6. Sampled responses generated by baseline methods and our method.

response, but is too general and trivial. (3) +0: the response is semantically irrelevant, ungrammatical, or conflicts with the personalized information.

B. Results and Analysis

We sample 100 dialogue sessions from the original random and biased test sets respectively for the human evaluation. The inter-annotator agreement is measured with Fleiss’s kappa κ [48]. In particular, the κ values for sentence fluency, bilateral persona consistency, and context consistency are 0.81, 0.71, and 0.64 on the random test set, and 0.75, 0.67, and 0.61 on the biased test set. These results indicate that the annotations of sentence fluency, bilateral persona consistency, and context consistency achieve substantial agreement on both test sets.
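As a concrete illustration of the agreement computation, the sketch below implements the standard Fleiss’ kappa over per-item rating tuples on the {+0, +1, +2} scale. Note that [48] describes the free-marginal variant, and the rating data in the test are illustrative, not the study’s actual annotations.

```python
from collections import Counter

def fleiss_kappa(ratings, categories=(0, 1, 2)):
    """Fleiss' kappa for N items, each rated by the same number of annotators.

    ratings: list of per-item rating lists, e.g. [[2, 2, 1], [0, 1, 0], ...]
    """
    N = len(ratings)
    n = len(ratings[0])  # annotators per item (assumed constant)
    totals = Counter()   # overall counts per category
    P_bar = 0.0          # mean per-item agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= N
    # Chance agreement from the marginal category proportions
    p_j = [totals[c] / (N * n) for c in categories]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)  # undefined if all ratings share one category
```

Perfect agreement yields κ = 1; values above roughly 0.6 are conventionally read as substantial agreement, which is the threshold the reported 0.61–0.81 values clear.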

Table V shows the results of the human evaluation: the proposed method outperforms all baseline methods on all human metrics (t-test, p-value < 0.05). Further observations indicate that (1) incorporating bilateral personas into the generated response can impair sentence fluency and context consistency, which corresponds to the high BPAcc score and the worse PPL score in the automatic evaluation. Despite this, our method achieves significant advantages in fluency and context consistency on both test sets compared with the other methods. (2) The proposed dynamic persona-aware fusion module controls the different persona-aware weights for personalized response generation, and contributes to better bilateral persona consistency. At the same time, the bilateral persona consistency outperforms the human reference on both the random and the biased test sets. This shows that the proposed dynamic persona-aware fusion module is conducive to generating more personalized responses in both dialogue

contexts. This observation is also in line with the BPAcc results of the automatic evaluation shown in Table I and Table II. (3) Compared with the PWDWP method, the proposed BPDG achieves a large improvement in context consistency. This is due to the effect of the CMIM criterion, which selects the response from the generated candidate list conditioned on the bilateral personas and the context. This observation also corresponds with the automatic evaluation results on the BLEU and F1 metrics shown in Table III and Table IV.
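The CMIM-based selection in point (3) can be sketched as a rerank over generated candidates. The scoring function below is a simplified stand-in for the paper’s criterion: it rewards likelihood conditioned on context and personas while penalizing unconditional likelihood, so generic replies are demoted. The λ weight and the toy log-probabilities are illustrative assumptions, not the paper’s actual models or values.

```python
def select_by_cmim(candidates, log_p_cond, log_p_marg, lam=0.5):
    """Rerank candidates with a CMIM-style score:
    score(r) = log p(r | context, personas) - lam * log p(r).
    Subtracting the marginal likelihood penalizes generic responses
    that are likely regardless of the context and personas.
    """
    return max(candidates, key=lambda r: log_p_cond[r] - lam * log_p_marg[r])

# Toy example: the generic reply is likely a priori, so it is demoted
# even though its conditional likelihood is slightly higher.
cands = ["Ok, it's fun.", "What's fun in Shanghai? I'm in Guangzhou."]
log_p_cond = {cands[0]: -5.0, cands[1]: -6.0}  # conditional log-likelihoods
log_p_marg = {cands[0]: -2.0, cands[1]: -9.0}  # unconditional log-likelihoods
best = select_by_cmim(cands, log_p_cond, log_p_marg)
```

In this toy setting the persona-grounded candidate scores −6 − 0.5·(−9) = −1.5 versus −4 for the generic one, so the grounded response is selected.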

C. Case Study

The case study is shown in Fig. 6. The proposed method can generate a response consistent with the personas of both parties in the conversation. As we can see, the responses generated by the TTransfo.+P and PWDWP methods may be only unilaterally persona-consistent, since they do not incorporate the persona of the other party. The other baseline methods (i.e., S2S + Atten., Trans., TTransfo., LConv., LConv.+P) may generate a general response that lacks personalized information. The proposed BPDG method utilizes bilateral personalized information to generate responses that are in line with human cognition, while constraining the content of the generated responses with the CMIM criterion. Specifically, given the user input and the bilateral personas, our method can control the generated response content with different persona-aware weights. α = 1 means that the user’s personalized information is present in the response, such as Shanghai. β = 1 means that the robot’s personalized information is present in the response, such as Guangzhou. γ = 1 means that no personalized information is present in the response, but the response is relevant to the context, such as travel.


VII. CONCLUSION

In this article, we propose the BPDG method to generate more personalized and bilateral persona-consistent responses. Our method utilizes transfer learning by initializing parameters with a pre-trained model, and then jointly trains the encoder, the dynamic persona-aware fusion module, and the decoder with multi-task learning. Experiments show that the transfer learning and multi-task learning methods are conducive to improving dialogue performance on various metrics. In addition, the final response is selected from the generated candidates with the CMIM criterion, which greatly improves its quality. Extensive experiments conducted to measure the effectiveness of the BPDG method show that BPDG has advantages on all metrics but the PPL. The human evaluation results also show that the BPDG method generates more fluent, context-consistent, and bilateral persona-consistent responses than several state-of-the-art methods.

It is worth noting that in open-domain dialogue, human responses are one-to-many, and an open-domain corpus cannot cover all situations. Moreover, people respond and reason based on existing information during a conversation. In the future, we will explore dialogue generation methods based on other fusion strategies, with comprehensive reasoning over existing information, to further improve the quality of the generated responses.

REFERENCES

[1] D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu et al., “Towards a human-like open-domain chatbot,” arXiv preprint arXiv:2001.09977, 2020.

[2] A. de Barcelos Silva, M. M. Gomes, C. A. da Costa, R. da Rosa Righi, J. L. V. Barbosa, G. Pessin, G. De Doncker, and G. Federizzi, “Intelligent personal assistants: A systematic literature review,” Expert Systems with Applications, vol. 147, p. 113193, 2020.

[3] A. Sheth, H. Y. Yip, A. Iyengar, and P. Tepper, “Cognitive services and intelligent chatbots: current perspectives and special issue introduction,” IEEE Internet Computing, vol. 23, no. 2, pp. 6–12, 2019.

[4] A. Queiros, A. G. Silva, P. Simoes, C. Santos, C. Martins, N. P. da Rocha, and M. Rodrigues, “Smartwalk: personas and scenarios definition and functional requirements,” in 2018 2nd International Conference on Technology and Innovation in Sports, Health and Wellbeing (TISHW). IEEE, 2018, pp. 1–7.

[5] S. Roller, Y.-L. Boureau, J. Weston, A. Bordes, E. Dinan, A. Fan, D. Gunning, D. Ju, M. Li, S. Poff et al., “Open-domain conversational agents: Current progress, open problems, and future directions,” arXiv preprint arXiv:2006.12442, 2020.

[6] M. Huang, X. Zhu, and J. Gao, “Challenges in building intelligent open-domain dialog systems,” ACM Transactions on Information Systems (TOIS), vol. 38, no. 3, pp. 1–32, 2020.

[7] S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston, “Personalizing dialogue agents: I have a dog, do you have pets too?” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2204–2213.

[8] H. Song, W.-N. Zhang, Y. Cui, D. Wang, and T. Liu, “Exploiting persona information for diverse generation of conversational responses,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2019, pp. 5190–5196.

[9] Q. Qian, M. Huang, H. Zhao, J. Xu, and X. Zhu, “Assigning personality/profile to a chatting machine for coherent conversation generation,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 4279–4285.

[10] Y. Zheng, G. Chen, M. Huang, S. Liu, and X. Zhu, “Personalized dialogue generation with diversified traits,” arXiv preprint arXiv:1901.09672, 2019.

[11] M. Xu, P. Li, H. Yang, P. Ren, Z. Ren, Z. Chen, and J. Ma, “A neural topical expansion framework for unstructured persona-oriented dialogue generation,” arXiv preprint arXiv:2002.02153, 2020.

[12] Y. Zheng, R. Zhang, M. Huang, and X. Mao, “A pre-training based personalized dialogue generation model with persona-sparse data.” AAAI Press, 2020, pp. 9693–9700.

[13] M. A. Walker, J. E. Cahn, and S. J. Whittaker, “Improvising linguistic style: Social and affective bases for agent personality,” in Proc. 1st Int. Conf. Autonom. Agents, 1997, pp. 96–105.

[14] A. Isard, C. Brockmann, and J. Oberlander, “Individuality and alignment in generated dialogues,” in Proceedings of the Fourth International Natural Language Generation Conference, 2006, pp. 25–32.

[15] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra, “Diverse beam search: Decoding diverse solutions from neural sequence models,” arXiv preprint arXiv:1610.02424, 2016.

[16] L. R. Goldberg, “The structure of phenotypic personality traits.” American Psychologist, vol. 48, no. 1, pp. 26–34, 1993.

[17] F. Mairesse and M. Walker, “Personage: Personality generation for dialogue,” in Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, 2007, pp. 496–503.

[18] J. Li, M. Galley, C. Brockett, G. Spithourakis, J. Gao, and B. Dolan, “A persona-based neural conversation model,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 994–1003.

[19] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

[20] A. Madotto, Z. Lin, C.-S. Wu, and P. Fung, “Personalizing dialogue agents via meta-learning,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5454–5459.

[21] M. Yang, W. Huang, W. Tu, Q. Qu, Y. Shen, and K. Lei, “Multitask learning and reinforcement learning for personalized dialog generation: An empirical study,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, Mar. 2020.

[22] K. Mo, S. Li, Y. Zhang, J. Li, and Q. Yang, “Personalizing a dialogue system with transfer reinforcement learning,” arXiv preprint arXiv:1610.02891, 2016.

[23] D. Wang and T. F. Zheng, “Transfer learning for speech and language processing,” in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2015, pp. 1225–1237.

[24] W.-N. Zhang, Q. Zhu, Y. Wang, Y. Zhao, and T. Liu, “Neural personalized response generation as domain adaptation,” World Wide Web, vol. 22, no. 4, pp. 1427–1446, Jun. 2018.

[25] T. Wolf, V. Sanh, J. Chaumond, and C. Delangue, “Transfertransfo: A transfer learning approach for neural network based conversational agents,” arXiv preprint arXiv:1901.08149, 2019.

[26] S. Golovanov, A. Tselousov, R. Kurbanov, and S. I. Nikolenko, “Lost in conversation: A conversational agent based on the transformer and transfer learning,” in The NeurIPS’18 Competition. Springer, 2020, pp. 295–315.

[27] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742.

[28] M. U. Gutmann and A. Hyvarinen, “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics.” Journal of Machine Learning Research, vol. 13, no. 2, 2012.

[29] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.

[30] P. Baltescu and P. Blunsom, “Pragmatic neural language modelling in machine translation,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 820–829.

[31] B. Dai and D. Lin, “Contrastive learning for image captioning,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 898–907.

[32] Y. Liu and P. Liu, “Simcls: A simple framework for contrastive learning of abstractive summarization,” arXiv preprint arXiv:2106.01890, 2021.

[33] H. Cai, H. Chen, Y. Song, Z. Ding, Y. Bao, W. Yan, and X. Zhao, “Group-wise contrastive learning for neural dialogue generation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, 2020, pp. 793–802.


[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.

[35] I. Kulikov, J. Lee, and K. Cho, “Multi-turn beam search for neural dialogue modeling,” arXiv preprint arXiv:1906.00141, 2019.

[36] F. Fleuret, “Fast binary feature selection with conditional mutual information,” Journal of Machine Learning Research, vol. 5, pp. 1531–1555, 2004.

[37] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in International Conference on Learning Representations, 2019.

[38] Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu, “Ernie: Enhanced representation through knowledge integration,” arXiv preprint arXiv:1904.09223, 2019.

[39] Y. Wang, P. Ke, Y. Zheng, K. Huang, Y. Jiang, X. Zhu, and M. Huang, “A large-scale Chinese short-text conversation dataset,” arXiv preprint arXiv:2008.03946, 2020.

[40] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma, “Dual learning for machine translation,” in Advances in neural information processing systems, 2016, pp. 820–828.

[41] A. M. Rush, “The annotated transformer,” in Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 2018, pp. 52–60.

[42] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.

[43] S. Golovanov, R. Kurbanov, S. Nikolenko, K. Truskovskyi, A. Tselousov, and T. Wolf, “Large-scale transfer learning for natural language generation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6053–6058.

[44] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

[45] E. Dinan, V. Logacheva, V. Malykh, A. Miller, K. Shuster, J. Urbanek, D. Kiela, A. Szlam, I. Serban, R. Lowe et al., “The second conversational intelligence challenge (convai2),” arXiv preprint arXiv:1902.00098, 2019.

[46] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan, “A diversity-promoting objective function for neural conversation models,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 110–119.

[47] F. Huang, D. Wan, Z. Shao, P. Ke, J. Guan, Y. Niu, X. Zhu, and M. Huang, “Cotk: An open-source toolkit for fast development and fair evaluation of text generation,” arXiv preprint arXiv:2002.00583, 2020.

[48] J. J. Randolph, “Free-marginal multirater kappa (multirater k [free]): An alternative to Fleiss’ fixed-marginal multirater kappa.” Online submission, 2005.