TRANSCRIPT
Towards Intelligent Conversational Systems:
Informativeness, Diversity and Controllability
Jiafeng Guo (郭嘉丰), Professor
Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Homepage: www.bigdatalab.ac.cn/~gjf
Collaborators
Hainan Zhang Ruqing Zhang Yixing Fan Yanyan Lan Xueqi Cheng
The Turing test, developed by Alan Turing in 1950, is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses.
ELIZA (1966) - the Earliest Social Bot
• Created from 1964 to 1966 at the MIT Artificial Intelligence Laboratory by Joseph Weizenbaum
• Features:
  • Hand-crafted scripts
  • Keyword spotting
  • Template matching
  • Substitution
Weizenbaum, Joseph (1966). "ELIZA—a computer program for the study of natural language communication between man and machine". Communications of the ACM. 9: 36–45.
Conversation is a Big Trend
From Yun-Hung Chen
Conversational Systems
Task-Oriented: personal assistant, achieves a certain task
Chit-Chat: no specific goal, focuses on conversation flow
Xiao Ice
Mitsuku
笨笨
Methodology in Conversational System
Pipeline-based (1960s~)
Four main modules:
• Natural Language Understanding
• Dialogue Tracker
• Policy Learning
• Natural Language Generation
[Goddeau et al., 1996; Lemon and Pietquin, 2007; Young et al., 2010; Williams, 2012; Su et al., 2016]
➢ Large human effort
➢ Accumulative errors
➢ Difficult to optimize

Retrieval-based (2000s~)
Two steps:
• Retrieve <utterance, response> pairs from a database as candidate responses
• Rerank the candidate responses with matching functions
[Leuski et al., 2009; Hu et al., 2014; Lu and Li, 2013; Wu et al., 2016; Lowe et al., 2015; Zhou et al., 2016b]
➢ Simple and effective
➢ Requires a large database
➢ Limited language flexibility

Generative-based (2010s~)
Neural encoder-decoder:
• Encode the post as a fixed vector with an RNN/LSTM
• Decode the response step by step based on that vector
[Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Shang et al., 2015b; Serban et al., 2016; Serban et al., 2017a,b]
➢ More like the human process
➢ Fully learnable
End-to-end methods
Methodology in Conversational System
Neural Encoder-Decoder Model
Grey box (statistical memory)? Black box?
Similar to sequence models in Neural Machine Translation (NMT), summarization, etc.; uses RNN, LSTM, or GRU units.
Data-driven: the responses are in general fluent and relevant
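To make the encoder-decoder picture concrete, here is a minimal PyTorch sketch with MLE (cross-entropy) training. It is an illustration only; the layer sizes, module names, and toy batch are assumptions, not the speaker's actual system.

```python
# Minimal GRU encoder-decoder sketch for response generation (illustrative only).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, post_ids, resp_ids):
        # Encode the post into a fixed vector (last hidden state).
        _, h = self.encoder(self.emb(post_ids))
        # Decode the response step by step, conditioned on that vector
        # (teacher forcing: feed the gold response shifted by one position).
        dec_out, _ = self.decoder(self.emb(resp_ids[:, :-1]), h)
        return self.out(dec_out)            # logits over the vocabulary

# MLE training: cross-entropy between predicted logits and the gold next tokens.
model = Seq2Seq(vocab_size=10000)
post = torch.randint(0, 10000, (2, 6))      # toy batch of posts
resp = torch.randint(0, 10000, (2, 7))      # toy batch of responses
logits = model(post, resp)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), resp[:, 1:].reshape(-1))
loss.backward()
```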
Challenge: The blandness problem
[Figure: examples of general, meaningless responses. Image from Jianfeng Gao]
Challenge: The collapse problem
X: Can you recommend me a tourist city?
Y1: Yes, Beijing is a beautiful city.
Y2: Yes, Beijing is a very beautiful city!
Y3: Yes, Beijing is a beautiful city ~!
X: The Shenzhou spacecraft is about to dock with Tiangong.
Y1: Wow, our motherland is really strong.
Y2: Wow, our motherland is very strong!
Y3: Wow, our motherland is so strong ~!
Responses follow a similar pattern.
Challenge: The consistency problem
[Figure: example of poor response consistency. Image from Jianfeng Gao]
Challenge: The no-grounding problem
[Figure: responses are not knowledge-grounded; the model only learns the general shape of conversations]
Towards Intelligent Conversational Systems: beyond Fluency and Relevance
• Informativeness
• Diversity
• Consistency
• Engagement
• Grounded
• Personalized
• …
Reinforcing Coherence for Sequence to Sequence Model in Dialogue Generation (IJCAI 2018)
Informativeness
Many-to-one characteristic of conversation: many different posts receive the same common responses.
Posts:
• Shenzhou 8 spacecraft is going to launch at 5:58:10 pm.
• Today I have got my PhD degree finally.
• The new TV show time has been determined.
• The NBA team is coming to Shanghai next month.
Common responses: "Support! Cheers!", "Glad to know."
The Cause of the Blandness Problem
Traditional Seq2Seq (Sutskever et al., 2014)
Observation: Seq2Seq models trained with MLE are likely to generate common response patterns, lacking specific information.
Seq2Seq with MLE is equivalent to optimizing the Kullback–Leibler (KL) divergence:
$$
\begin{aligned}
\mathcal{L} &= -\mathbb{E}_{Y \sim P_r(Y|X)}\big[\log P_g(Y|X)\big] = -\int_Y P_r(Y|X)\,\log P_g(Y|X)\,dY \\
&= \int_Y P_r(Y|X)\,\log\frac{P_r(Y|X)}{P_g(Y|X)}\,dY \;-\; \int_Y P_r(Y|X)\,\log P_r(Y|X)\,dY \\
&= \mathrm{KL}\big(P_r(Y|X)\,\|\,P_g(Y|X)\big) + H\big(P_r(Y|X)\big)
\end{aligned}
$$
Analysis: this objective does not penalize the case where the generation probability is high while the true probability is low.
The Cause of the Blandness Problem
[Figure: generation probability vs. true probability]
Key Idea
Use the true data probability to rectify the generation probability; but the true data probability is difficult to estimate.
[Figure: (a) the Seq2Seq model acts as an agent that takes actions (post → response) and receives a reward from a coherence model serving as the environment; (b) dual agents, a post→response Seq2Seq model and a response→post Seq2Seq model, reward each other.]
Key Idea: Statistical Analysis
Cosine similarity (%) between post and generated response, grouped by human score:
Human score 1: 60.19, 2: 57.39, 3: 60.30, 4: 62.06, 5: 66.04
Human criteria
1. Nonfluent or logically wrong
2. Not related
3. Common response
4. Strongly related
5. Like a real person’s tone
Observation: The coherence between a post and its generated response is consistent with the human evaluation. In other words, the true probability of a response is highly likely to be proportional to the coherence score between the post and the response.
Manual annotation over 300 randomly sampled posts and their corresponding generated responses.
Model: Unlearned Similarity Function
[Figure: the same agent-environment architecture as above, with the coherence model instantiated as an unlearned similarity function.]
$$r_{cos} = \langle h(X), h(G) \rangle$$
where $h(X)$ and $h(G)$ are obtained by mean-pooling the word embeddings of the post X and the generation G, and the reward is the cosine similarity between them.
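A minimal sketch of this unlearned similarity reward (illustrative only; the random embedding table and the helper names are assumptions, and in practice pre-trained word embeddings would be used):

```python
# Sketch of r_cos: average the word embeddings of post X and generation G,
# then take the cosine similarity of the two mean-pooled vectors.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(10000, 128))          # assumed embedding table

def mean_embedding(token_ids):
    return emb[token_ids].mean(axis=0)       # h(X) or h(G): mean-pooled vector

def coherence_reward(post_ids, gen_ids):
    hx, hg = mean_embedding(post_ids), mean_embedding(gen_ids)
    return float(hx @ hg / (np.linalg.norm(hx) * np.linalg.norm(hg)))

print(coherence_reward([12, 45, 7], [45, 981, 3]))
```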
Model: Pre-trained Matching Function
[Figure: the same architecture, with the coherence reward given by a matching model pre-trained on post-response pairs.]
Matching models: GRU bilinear model, MatchPyramid.
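As a sketch of what such a pre-trained matching reward could look like, below is a simple GRU + bilinear matcher; the class name, sizes, and the training recipe in the comments are assumptions, and MatchPyramid would be a drop-in alternative scorer.

```python
# Sketch of a pre-trained matching function as coherence reward:
# two GRU encodings and a bilinear score sigmoid(h_X^T M h_G).
import torch
import torch.nn as nn

class BilinearMatcher(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.M = nn.Parameter(torch.randn(hid_dim, hid_dim) * 0.01)

    def forward(self, post_ids, resp_ids):
        _, hx = self.rnn(self.emb(post_ids))      # (1, B, H)
        _, hg = self.rnn(self.emb(resp_ids))
        score = (hx.squeeze(0) @ self.M * hg.squeeze(0)).sum(-1)
        return torch.sigmoid(score)               # matching probability in (0, 1)

# Would be trained on observed <post, response> pairs vs. random pairs,
# then frozen and used as the coherence reward for the generator.
matcher = BilinearMatcher(vocab_size=10000)
print(matcher(torch.randint(0, 10000, (2, 6)), torch.randint(0, 10000, (2, 7))))
```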
Model: Dual Learning Architecture
[Figure: two Seq2Seq agents, post→response ($P_1$) and response→post ($P_2$), reward each other.]
$$G_1 = \arg\max P_1(G_1|X), \qquad r_{dual}^{1}(X, G_1) = \log P_2(X|G_1)$$
$$G_2 = \arg\max P_2(G_2|Y), \qquad r_{dual}^{2}(Y, G_2) = \log P_1(Y|G_2)$$
Model: Dual Learning
Motivation: mutual reciprocity
Forward: Post → Response
Backward: Response → Post
Common response: "I don't know!" → poor backward reconstruction (hard to recover the post)
Specific response: "I think the team had a great game today!" → easy to recover posts such as "What do you think of the team today?", "Do you think the team is doing well today?", "How is the game of the team today?", ...
Reinforce the coherence of post and generation based on the post-response mutual reciprocity.
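One common way to plug such a coherence/dual reward into the generator is a REINFORCE-style update; the sketch below is my own reading of the agent-environment picture above, and `forward_model.sample` and `reward_fn` are hypothetical placeholders rather than the paper's code.

```python
# REINFORCE-style sketch: sample a response G from the forward Seq2Seq agent,
# score it with a coherence/dual reward (e.g. log P_backward(X | G)), and scale
# the log-likelihood gradient of the sample by that reward.
import torch

def reinforce_step(forward_model, reward_fn, post_ids, optimizer, baseline=0.0):
    gen_ids, log_prob = forward_model.sample(post_ids)   # assumed sampling helper
    reward = reward_fn(post_ids, gen_ids)                 # e.g. backward log P(X|G)
    loss = -(reward - baseline) * log_prob                # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```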
Experiments
Experiment Datasets
• Chinese Weibo Dataset (STC): post-review pairs extracted from the Chinese Weibo website
  • 3 million training pairs
  • 0.39 million development pairs
  • 0.4 million test pairs
• English OpenSubtitles Dataset (OSDb): a large, open-domain dataset containing roughly 60M-70M scripted lines spoken by movie characters
  • 3 million training pairs
  • 0.4 million development pairs
  • 0.4 million test pairs
  • https://github.com/jiweil/Neural-Dialogue-Generation
Experiments
• Baseline Methods• Seq2Seq model [Sutskever et al., 2014]
• RNN-encdec model [Cho et al., 2014]
• Seq2Seq with attention model [Bahdanau et al., 2015]
• MMI-based models: MMI [Li et al., 2016b], back-MMI [Li et al., 2016b]
• GAN-based model: Adver-REGS [Li et al., 2017]
• Evaluation Metrics
• Automatic evaluation: PPL, BLEU, distinct-1, distinct-2
• Human evaluation
Experiments
Our coherence models produce more fluent and specific results, as compared with baseline methods.
Metric-based Evaluation
Experiments
The end-to-end dual learning approach improves distinct-2 by 2.8%.
Metric-based Evaluation
Experiments
Human Evaluation
The percentage of strongly related sentences from Seq2SeqCo-dual is significantly higher.
Seq2SeqCo-dual improves the human evaluation score by 14.5%.
Experiments
Case Study
Summary
• A neural generation model with the MLE objective is likely to generate common, meaningless responses, because MLE does not penalize responses whose generation probability is high while the true probability is low.
• By reinforcing the coherence between posts and responses, we are able to remedy the blandness problem, leading to informative responses.
• The dual learning architecture, which uses the mutual reciprocity of conversation, can significantly improve the informativeness of responses.
Tailored Sequence to Sequence Models to Different Conversation Scenarios (ACL 2018)
Diversity
Motivation
Teacher: Please describe the winter. (大家描述一下冬天)
Student 1: Winter is so beautiful. (冬天真美丽)
Student 2: Winter is as white as snow. (冬天洁白如雪)
Student 3: As if a spring breeze came overnight, thousands of pear trees burst into blossom! (忽如一夜春风来,千树万树梨花开)
Student 4: In winter, a thin layer of snow, like a huge soft wool blanket, covered the vast wilderness, shining with cold silver. (冬天,一层薄薄的白雪,像巨大的轻软的羊毛毯子,覆盖在这广漠的荒原上,闪着寒冷的银光)
People are able to generate diverse responses given the same input utterance.
People: the one-to-many characteristic of conversation
Motivation
Teacher: Describe the winter.
Bot 1: So beautiful.
Bot 2: Nice view.
Bot 3: Nice view!
Bot 4: So beautiful!
Machine: top-k results with beam search are highly similar to each other, leading to boring responses (the collapse problem)
The Cause of the Collapse Problem
MLE learning under the independence assumption
(The same teacher-student winter example as above: one post, many diverse ground-truth responses.)
① It ignores the underlying one-to-many structure.
② It learns what is easy to learn and does not care about the worst cases.
Our idea: recover the structure and consider the cost in the worst cases.
Cost Sensitive Loss
• Value at Risk (VaR)
• VaR is the maximum cost that might be incurred with probability at least α
• VaR is the α-quantile of the distribution of X
• VaR is the smallest cost in the (1- α)*100% worst cases
• VaR is the highest cost in the α *100% best cases
A prominent risk measure used extensively in finance
$$VaR_\alpha(X) := \min\{c : P(X \le c) \ge \alpha\}$$
Cost Sensitive Loss
• Conditional Value at Risk (CVaR)
  • Also known as averaged VaR, expected shortfall, or tail conditional expectation
  • $CVaR_\alpha(X) := E[X \mid X \ge VaR_\alpha(X)]$
• Entropy Value at Risk (EVaR)
  • $EVaR_\alpha(X) := \inf_{z>0}\{z^{-1}\ln(M_X(z)/(1-\alpha))\}$, where $M_X(z)$ is the moment-generating function of X at z
Both are coherent risk measures, and we have
$$VaR_\alpha(X) \le CVaR_\alpha(X) \le EVaR_\alpha(X)$$
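A toy numeric check of the three risk measures on a handful of made-up per-response costs (X = -log P(Y|X)); the numbers and the choice alpha = 2/3 are assumptions for illustration.

```python
# Empirical VaR / CVaR / EVaR on a small sample of hypothetical costs.
import numpy as np

costs = np.array([0.7, 1.2, 2.3, 4.0, 5.5, 9.0])   # hypothetical -log P values
alpha = 2.0 / 3.0

var = np.quantile(costs, alpha)                     # VaR: alpha-quantile of the cost
cvar = costs[costs >= var].mean()                   # CVaR: mean cost beyond VaR
z = 1.0                                             # EVaR takes inf over z > 0; try one z
evar_z = np.log(np.mean(np.exp(z * costs)) / (1 - alpha)) / z
print(var, cvar, evar_z)                            # VaR <= CVaR <= EVaR (at the optimal z)
```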
Robust Model I
Solution: optimize the conditional value-at-risk (CVaR) instead of the traditional likelihood.
Given the post X and its ground-truth responses $\{Y_X^{(1)}, Y_X^{(2)}, \ldots, Y_X^{(m_X)}\}$, where $m_X$ is the number of ground-truth responses for post X, we use $-\log P(Y|X)$ to denote the cost.
The objective function of CVaR is to minimize:
$$\mathcal{L} = -\sum_X \frac{1}{1-\alpha} \sum_{Y_X^{(k)} \in y_{1-\alpha}} \log P\big(Y_X^{(k)}\,\big|\,X\big)$$
where $y_{1-\alpha}$ is a collection of ground-truth responses such that
$$\sup\big\{P\big(Y_X^{(i)}\,\big|\,X\big) : Y_X^{(i)} \in y_{1-\alpha}\big\} \le 1-\alpha.$$
CVaR pays more attention to the post-response pairs that have not been generated well so far.
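A minimal sketch of a CVaR-style loss over one post's ground-truth responses, assuming the worst-case set is simply the (1-alpha) fraction of responses with the highest cost; a simplified reading of Robust Model I, not the paper's exact implementation.

```python
# CVaR-style loss sketch: average the cost only over the worst-generated responses.
import torch

def cvar_loss(log_probs, alpha=2.0/3.0):
    """log_probs: tensor of log P(Y_k | X) for the m_X ground-truth responses."""
    costs = -log_probs
    k = max(1, int(round((1 - alpha) * len(costs))))   # size of the worst-case set
    worst, _ = torch.topk(costs, k)                    # highest-cost responses
    return worst.mean()

log_probs = torch.tensor([-0.7, -1.6, -1.2])           # toy values; 1-alpha = 1/3 keeps one
print(cvar_loss(log_probs))                            # averages only the worst case (1.6)
```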
Model Illustration (suppose 1-α = 1/3)
Post: 总决赛继续等待韦德 (Waiting for Wade in the final games.)
Response 1: 每个人都有每个人的喜爱 (Everyone has his favorite stars.)
Response 2: 比新浪分析的好多了 (The analysis is much better than Sina's.)
Response 3: 等待闪电侠彻底爆发 (Waiting for the explosion of Mr. Flash.)
Step 1: P(Response1|X) = 0.5, P(Response2|X) = 0.2, P(Response3|X) = 0.3
Step 2: P(Response1|X) = 0.3, P(Response2|X) = 0.3, P(Response3|X) = 0.4
Step 3: P(Response1|X) = 0.33, P(Response2|X) = 0.33, P(Response3|X) = 0.34
By repeatedly focusing on the worst-generated (1-α) fraction of responses, CVaR pushes the generation probabilities of the ground-truth responses towards balance, improving the ability to generate diverse responses with beam search.
Robust Model II
Solution: optimize the entropy value-at-risk (EVaR) instead of the traditional likelihood.
Given the post X and its ground-truth responses $\{Y_X^{(1)}, Y_X^{(2)}, \ldots, Y_X^{(m_X)}\}$, where $m_X$ is the number of ground-truth responses for post X, the objective function of EVaR is to minimize:
$$\mathcal{L} = \sum_X \frac{1}{z}\left[\log\left(\frac{1}{m_X}\sum_{i=1}^{m_X} e^{\,z\,(-\log P(Y_i|X))}\right) - \log(1-\alpha)\right]$$
EVaR adjusts the loss of each training sample so that every ground-truth response achieves a balanced generation probability:
➢ If $P(Y_i|X)$ is large, its contribution to the loss changes little.
➢ If $P(Y_i|X)$ is small, its contribution to the loss changes a lot.
➢ The larger α is, the more emphasis is placed on the poorly generated responses.
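A matching sketch of the EVaR-style loss; z and alpha are assumed hyper-parameters, and in the full definition z would be optimized (the infimum in the EVaR formula).

```python
# EVaR-style loss sketch: a log-sum-exp over per-response costs, which smoothly
# up-weights poorly generated responses instead of discarding well-generated ones.
import torch

def evar_loss(log_probs, z=1.0, alpha=2.0/3.0):
    """log_probs: tensor of log P(Y_i | X) for the m_X ground-truth responses."""
    costs = -log_probs
    log_mean_exp = torch.logsumexp(z * costs, dim=0) - torch.log(torch.tensor(float(len(costs))))
    return (log_mean_exp - torch.log(torch.tensor(1.0 - alpha))) / z

print(evar_loss(torch.tensor([-0.7, -1.6, -1.2])))
```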
Experimental Settings
• Chinese Weibo Dataset (STC): post-review pairs extracted from the Chinese Weibo website
  • 3 million training pairs, 0.38 million development pairs, 0.4 million test pairs
• Baseline methods:
  • Seq2Seq model [Sutskever et al., 2014]
  • RNN-encdec model [Cho et al., 2014]
  • Seq2Seq with attention model [Bahdanau et al., 2015]
  • GAN-based method: Adver-REGS [Li et al., 2017]
  • Style-based method: Mechanism [Zhou et al., 2017]
Both risk-aware models obtain lower overlap and divrs scores than the baselines. CVaR obtains a larger improvement on diversity than EVaR, as it can change the response distribution more efficiently.
Experiments
Metric-based Evaluation
[Table: diversity measures for each method]
Both EVaR and CVaR obtain higher scores in the manual evaluation compared with the baselines. EVaR achieves a higher score than CVaR: EVaR is an upper bound of CVaR and learns over all the data, while CVaR may ignore some data it considers well trained.
Experiments
Manual Evaluation
1. The response is nonfluent or logically wrong, or the response is fluent but not related to the post;
2. The response is fluent and weakly related, but it is a common response that could reply to many other posts;
3. The response is fluent and strongly related to its post, like following a real person's tone.
Experiments
Case Study
Our model produces both fluent and diverse results
• Neural generation models learned with MLE objectives under the independent-sample assumption are likely to face the collapse problem.
• We are the first to use robust distribution models (the CVaR and EVaR models) to enhance the diversity of dialogue.
• Both the CVaR and EVaR models can be used in many other tasks to improve diversity or robustness.
Summary
Learning to Control the Specificity in Neural Response Generation (ACL 2018)
Controllability
Conversation Systems
Neural generation models
input x → output y
Fully data-driven, modelling correlation; lacking a mechanism to intervene in the generation.
Human Conversation Process
Input: "Do you know a good eating place for Australian special food?"
The reply depends on latent factors such as current mood, knowledge state (e.g., "I'm familiar with the topic"), and the dialogue partner, which together determine the response purpose:
• General response: "I don't know."
• Specific response: "Good Australian eating places include steak, seafood, cake, etc. What do you want to choose?"
• Positive response: "Haha, I know several wonderful places in the downtown."
• Negative response: "No, to be frank, I do not like Australian food."
Key Idea
• Introduce a causal generation model with an explicit control variable to represent the response purpose
  - Summarizes many latent factors (current mood, knowledge state, dialogue partner, ...)
  - Has an explicit meaning, e.g., specificity or emotion
  - Actively controls the generation, and could handle the blandness/collapse problem in another way
[Figure: the explicit control variable is fed into the neural generation model together with the input x to produce the output y.]
Model Architecture
• The control variable s is introduced into the Seq2Seq model
  • One-fits-all model → multiple causal models: different <utterance, response> pairs, different s, different models
• Dual representations
  • Semantic representations: relate to the semantic meaning
  • Usage representations: relate to the usage preference
Model - Encoder
• Bi-RNN: model the utterance from both the forward and the backward direction
  ➢ forward states $\{\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T\}$ and backward states $\{\overleftarrow{h}_T, \ldots, \overleftarrow{h}_1\}$
  ➢ $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_{T-t+1}]$
Model - Decoder
• Predict the target word based on a mixture of two probabilities: the semantic-based and the usage-based generation probability
$$p(y_t) = \beta\, p_M(y_t) + \gamma\, p_S(y_t)$$
➢ Semantic-based probability: decides what to say next given the input
$$p_M(y_t = w) = \mathbf{w}^T\big(\mathbf{W}_M^{h}\cdot \mathbf{h}_{y_t} + \mathbf{W}_M^{e}\cdot \mathbf{e}_{t-1} + \mathbf{b}_M\big)$$
where $\mathbf{h}_{y_t}$ is the decoder hidden state and $\mathbf{e}_{t-1}$ is the semantic representation of the previously generated word.
Model – Decoder (Continuous control variable)
➢ Specificity-based probability: decides how specific the reply should be
• Specificity control variable $s \in [0, 1]$
  ✓ 0 denotes the most general response
  ✓ 1 denotes the most specific response
• Gaussian kernel layer
  ✓ the specificity control variable interacts with the usage representation of words
  ✓ lets the word usage representation regress to the variable s
$$p_S(y_t = w) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\big(\Psi_S(\mathbf{U}, \mathbf{w}) - s\big)^2}{2\sigma^2}\right), \qquad \Psi_S(\mathbf{U}, \mathbf{w}) = \sigma\big(\mathbf{w}^T(\mathbf{U}\cdot\mathbf{W}_U + \mathbf{b}_U)\big)$$
where $\mathbf{U}$ is the usage representation matrix and $\sigma^2$ is the variance of the Gaussian kernel.
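A small sketch of the specificity-based word probability and the final mixture; the shapes, the stand-in semantic distribution, and the mixture weights are assumptions based only on the formulas above.

```python
# Specificity-controlled word probability: project each word's usage representation
# to a scalar, score it with a Gaussian kernel centred at the control variable s,
# then mix with the semantic-based probability.
import torch

def specificity_probs(U, W_u, b_u, s, sigma=0.1):
    """U: (V, d) usage representations; returns an (unnormalised) p_S over the vocab."""
    psi = torch.sigmoid(U @ W_u + b_u).squeeze(-1)       # Psi_S(U, w) in [0, 1]
    return torch.exp(-(psi - s) ** 2 / (2 * sigma ** 2)) / (sigma * (2 * torch.pi) ** 0.5)

V, d = 10000, 64
U = torch.randn(V, d)                                    # toy usage representations
W_u, b_u = torch.randn(d, 1), torch.zeros(1)
p_S = specificity_probs(U, W_u, b_u, s=1.0)              # s = 1: favour specific words
p_M = torch.softmax(torch.randn(V), dim=0)               # stand-in semantic probability
p = 0.5 * p_M + 0.5 * p_S / p_S.sum()                    # mixture p(y_t) = beta*p_M + gamma*p_S
```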
Model – Decoder (Discrete control variable)
➢ Emotion-based probability: decides which emotion to express
• Emotion control variable $s \in \{1, \ldots, N\}$
  ✓ N denotes the number of emotion categories
  ✓ The emotion categories may include angry, disgust, happy, like, sad, and others
• Softmax layer
  ✓ the emotion control variable interacts with the usage representation of words
  ✓ the word usage representation is classified to the variable s
$$p_S(y_t = w) = \mathrm{softmax}\big(\Phi_E(\mathbf{U}, \mathbf{w})\big)_s, \qquad \Phi_E(\mathbf{U}, \mathbf{w}) = \mathbf{w}^T(\mathbf{U}\cdot\mathbf{W}_U + \mathbf{b}_U)$$
Model Training
• Objective function: log likelihood
$$\mathcal{L} = \sum_{(\mathbf{X}, \mathbf{Y})\in\mathcal{D}} \log P(\mathbf{Y} \mid \mathbf{X}, s; \theta)$$
• Training data: triples (𝑿, 𝒀, 𝑠)
• Sometimes, we only observe (𝑿, 𝒀) while the control variable s is not directly available in the raw conversation corpus
How to obtain s to learn our model?
We propose to acquire distant labels for 𝑠.
Distant Supervision (Specificity)
• Normalized Inverse Response Frequency (NIRF)
  ➢ A response is more general if it corresponds to more input utterances
  ➢ Computed from the Inverse Response Frequency (IRF) in a conversation corpus
• Normalized Inverse Word Frequency (NIWF)
  ➢ A response is more specific if it contains more specific words
  ➢ Computed as the maximum of the Inverse Word Frequency (IWF) over all the words in a response
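As an illustration of how an NIWF-style distant label could be computed, here is a toy sketch; the exact IWF formula and the min-max normalisation are assumptions in the spirit of the slide, not the paper's definition.

```python
# Toy NIWF distant label: max inverse word frequency over a response's words,
# normalised to [0, 1] over the corpus (0 = most general, 1 = most specific).
import math
from collections import Counter

corpus = [["i", "do", "not", "know"],
          ["i", "do", "not", "know", "either"],
          ["the", "shenzhou", "spacecraft", "will", "dock", "with", "tiangong"]]

freq = Counter(w for resp in corpus for w in resp)
n_resp = len(corpus)

def iwf(word):                                   # inverse word frequency (assumed form)
    return math.log(1 + n_resp) / (1 + freq[word])

raw = [max(iwf(w) for w in resp) for resp in corpus]
lo, hi = min(raw), max(raw)
niwf = [(x - lo) / (hi - lo) for x in raw]       # min-max normalised distant labels
print(list(zip(niwf, [" ".join(r) for r in corpus])))
```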
Specificity Controlled Response Generation
• Given a new input utterance, we can generate responses at different specificity levels by varying the control variable s
• Different s, different models, different responses
  ➢ s = 1: the most specific response
  ➢ s ∈ (0, 1): dynamic style in the response
  ➢ s = 0: the most general response
[Figure: a slider for s from 0 (general response) to 1 (specific response)]
Emotion Controlled Response Generation
• Given a new input utterance, we can generate responses in different emotional states by selecting the control variable s
• Different s, different emotions, different responses
  ➢ s = 1: angry
  ➢ s = 2: disgust
  ➢ s = 3: happy
  ➢ s = 4: like
  ➢ s = 5: sad
  ➢ s = 6: others
Experiments - Dataset
• Short Text Conversation (STC) dataset ➢ released in NTCIR-13
➢ a large repository of post-comment pairs from the Sina Weibo
➢ 3.8 million post-comment pairs
➢ Jieba Chinese word segmenter
• Emotional STC (ESTC) Dataset ➢ released in NLPCC-2017
➢ a Bi-LSTM classifier is applied to annotate the STC dataset with six emotion categories
➢ Jieba Chinese word segmenter
[Table: STC statistics]
[Table: ESTC statistics]
Experiments - Model Analysis
1. We vary the continuous control variable s by setting it to five different values (0, 0.2, 0.5, 0.8, 1).
2. As s varies from 0 to 1, the generated responses turn from general to specific.
3. NIWF (word-based) is a good distant label for response specificity.
Table: model analysis of our SC-Seq2Seq under automatic evaluation for specificity-controlled generation.
Experiments - Comparisons
1. When s = 1, our SC-Seq2SeqNIWF model achieves the best specificity performance.
2. When s = 0.5, our SC-Seq2SeqNIWF model achieves the best performance compared with all the baseline methods.
Automatic evaluation
Table: Comparisons between our SC-Seq2Seq and the baselines under the automatic evaluation for the specificity controlled generation
Experiments - Comparisons
1. SC-Seq2SeqNIWF with s = 1 generates the most informative and interesting responses compared with all the baseline models.
2. The biggest kappa value is achieved by SC-Seq2SeqNIWF with s = 0.
Human evaluation
Table: Results on the human evaluation for the specificity controlled generation
+2: the response is not only semantically relevant and grammatical, but also informative and interesting;
+1: the response is grammatically correct and can be used as a response to the utterance, but is too trivial (e.g., "I don't know");
+0: the response is semantically irrelevant or ungrammatical (e.g., grammatical errors or UNK).
Experiments - Analysis
1. Neighbors based on semantic representations are semantically related.
2. Neighbors based on usage representations are not as related, but have similar specificity levels.
Table: Target words and their top-5 similar words under usage and semantic representations respectively for the specificity controlled generation.
Figure: t-SNE embeddings of usage and semantic vectors for the specificitycontrolled generation.
Experiments - Case study
As s varies from 1 to 0, SC-Seq2SeqNIWF moves from very long and specific responses to shorter and more general responses.
Table: examples of response generation from the STC test data; s = 1, 0.8, 0.5, 0.2, 0 are the outputs of our SC-Seq2SeqNIWF with different s values for specificity-controlled generation.
Experiments - Case study
With s sampled from the set {0, 1, 2, 3, 4, 5}, SC-Seq2Seq generates the response corresponding to the given emotion category.
Table: examples of response generation from the ESTC test data; s = 0, 1, 2, 3, 4, 5 are the outputs of our SC-Seq2Seq with different s values for emotion-controlled generation.
Conclusion
• Modern neural generation model-based conversation systems lack a mechanism to actively control the style of the output responses.
• We propose to introduce a causal generation model with an explicit control variable to handle different post-response relationships in terms of specificity and emotion.
  ➢ Such a control mechanism also addresses the blandness and collapse problems in a different way.
Future Work
• No grounding → Grounded
• Short-term/immediate reward → Long-term/user feedback
• Correlation → Causal/interpretable
• Neural → Neural-symbolic
• SL → RL
• Data-driven/statistics → Causal/cognitive theory
Jiafeng Guo (郭嘉丰), [email protected]
Thank you