
MSc Artificial Intelligence
Master Thesis

Task-Oriented Dialog Agents Using Memory-Networks and Ensemble Learning

by

Ricardo Fabián Guevara Meléndez
11390786

August 23, 2018

36 credits
January - June 2018

Supervisor: Maarten Stol

Assessor: Shaojie Jiang


Abstract

Task-oriented dialog agents are increasingly relevant systems that engage with a user through voice or text natural language input to fulfill domain-specific tasks. Recently, neural approaches have been gaining popularity over traditional rule-based systems thanks to promising results (Williams et al., 2017) in standard tasks (Bordes et al., 2016). While most modern approaches follow an architecture based on a pipeline of components (Young et al., 2013; Bocklisch et al., 2017; Williams et al., 2017), there is still much room for design in these architectures, and many promising models that can be incorporated into the pipeline. One such pipeline architecture is Hybrid Code Networks (HCN) (Williams et al., 2017), which uses example human-bot conversations to train an LSTM (Hochreiter and Schmidhuber, 1997) to track conversation state and predict the next action. Inspired by the promising results of Bordes et al. (2016), this thesis builds an architecture similar to HCN but improves the action policy, first by using a Memory Network instead of an LSTM, and then by using an ensemble that combines the Memory Network and the LSTM from HCN into a single action policy. By combining the benefits of the HCN architecture with the Memory Network and the ensemble policy, the model achieves perfect scores on bAbI task 5, almost 4% accuracy improvement over Bordes et al. (2016) and almost 8% higher than the LSTM action policy from Williams et al. (2017), while the ensemble policy by itself proved responsible for more than 1% improvement over the HCN architecture on bAbI task 6. The final results demonstrate not only the advantages of a Memory Network over an LSTM in some scenarios but, more importantly, that both policies complement each other and benefit from working together, even when trained on the same data.


Acknowledgements

This work would not have been possible without the help of my assessor, Shaojie Jiang. His expertise, expressed in the form of feedback as well as in setting the right priorities, dramatically increased the quality of the descriptions and results presented here.

Of equal importance was the role played by BrainCreators B.V., which provided the environment and day-to-day support to make this possible. Special thanks to my company supervisor, Maarten Stol, who allowed and encouraged the creativity that enabled me to work on my research interests in Dialogue Agents.

Finally, I want to thank all the dear friends who constantly provided their support during the last two years, nearby and from afar. And especially my mother and Simon, without whom this long-awaited dream would not have been possible.


Contents

1 Introduction

2 Related Work
  2.1 Dialog State Tracking
  2.2 Neural Based Approaches
    2.2.1 From User Input to Features
    2.2.2 Action Policies

3 Theoretical Foundation
  3.1 Natural Language Understanding
  3.2 Dialog State Tracker
  3.3 Action Policy
    3.3.1 LSTM
    3.3.2 Memory Networks
    3.3.3 Policy Ensemble
  3.4 Natural Language Generator

4 Memory Networks and Ensemble Learning as Action Policy
  4.1 Input Features for the Action Policies
  4.2 Memory Networks as Action Policy
  4.3 Building an Action Policy Ensemble
  4.4 Designing and Extracting Intents and Actions
    4.4.1 Dataset Description
    4.4.2 How to get User Intents and Actions from the bAbI Tasks

5 Experimental Setup
  5.1 Implementation Details
    5.1.1 NLU
    5.1.2 Memory Network
    5.1.3 LSTM
    5.1.4 Stacking Ensemble
  5.2 Test Conditions
    5.2.1 NLU in Isolation
    5.2.2 Policy Performance

6 Results
  6.1 NLU Performance in Isolation
  6.2 Policy Performance
    6.2.1 Task 5
    6.2.2 Task 5 OOV
    6.2.3 Task 6

7 Conclusions
  7.1 Summary
  7.2 Future Work

A Task 6 bot templates

B Task 6 user intent map rules

C Example Dialogs


Chapter 1

Introduction

Dialog agents have been regarded both as an ultimate proof of actual machine intelligence and as a very hard task ever since Turing (1950), even before the term Artificial Intelligence (AI) was coined. While there is still a long way to go before general-domain dialog agents can really deceive a human, a simpler version can be achieved with varying degrees of success using current technology: task-oriented dialog agents. While many authors (Chen et al., 2017; Vlad Serban et al., 2015) make no further classification than task-oriented or non-task-oriented chatbots, Jurafsky and Martin (2018) use the term chatbot exclusively for open-domain or chitchat bots, while using the term ‘frame-based dialog agents’ for bots that fill slots by asking the user for information until they have sufficient input to perform a task and provide an answer. This fits the description of the models explored in this work, which will henceforth be referred to as ‘task-oriented dialog agents’. These domain-constrained models are especially relevant for industries eager for automation (e.g. customer support, question answering or transaction processing). In these scenarios the agent only needs to deal with a constrained domain of possible actions to execute, deciding on one after each user query in a sequence. This work is further constrained to text input only (as opposed to spoken dialog), where choosing an action means selecting a text template answer. This is appropriate for practical scenarios, since many such bots are featured in online chat windows on websites or leverage existing messaging platforms such as Facebook’s Messenger or Telegram.

Until recently, task-oriented dialog agents have traditionally worked as completely deterministic rule-based systems. This approach is not only expensive to engineer (and especially to maintain as the number of rules grows large), but also limited in the amount of context the agent can use to provide sensible answers: the number of rules grows dramatically as more context is considered for the next decision. In fact, many such agents just detect keywords and select an answer based on them (Bordes et al., 2016), in a Q&A fashion with no further context.

Today’s availability of data and computing power allows for machine learning approaches that do more than just detecting keywords and acting upon them. At the same time, framing this as the problem of choosing the right action at each step encourages the use of the wide variety of classifiers available. Young et al. (2013) model this problem as separate tasks handled by a pipeline of modules (see figure 2.1), including a natural language understanding module to handle the inputs from the user, a representation of the conversation state, a policy to decide the next action based on this state, and a natural language generator module to output an answer corresponding to the action. Several authors (Williams et al., 2017; Bocklisch et al., 2017) propose to use an LSTM (Hochreiter and Schmidhuber, 1997) as action policy, learning the right action from example conversations. The Hybrid Code Networks (HCN) architecture (Williams et al., 2017) successfully integrates a neural approach to the action policy with domain-specific rules to guide the learning process. On the other hand, Bordes et al. (2016) proposed a Memory Network (Weston et al., 2014) action policy, since this recent model family has achieved promising results in tasks related to sequence processing, even outperforming LSTM and other RNN architectures at language modeling (Sukhbaatar et al., 2015). This Memory Network action policy obtains the best results among all the baselines and tasks in their experiments.

The goal of this thesis is to start from the HCN architecture and then improve it in three different ways, specifically:

• To propose Memory Networks as an action policy, due to their appropriate structure (further explained in section 3.3.2). Although this was already proposed by Bordes et al. (2016), they used a Memory Network to select an utterance from a set of several thousand at each turn in the dialog. This thesis, however, makes better usage of the model by implementing the template actions from HCN instead, which dramatically reduces the number of classes from several thousand to less than a hundred. This achieves perfect scores on the toy dataset, improving the results obtained by Bordes et al. (2016) by 3.9% in action accuracy and by 50.6% in percentage of perfect dialogs.

• To incorporate an ensemble of action classifiers as yet another action policy, consisting of the same LSTM from Williams et al. (2017) plus the Memory Network. This led to a consistent improvement in all test scenarios, and to new state-of-the-art results on the hard dataset (bAbI task 6), with more than 1% turn accuracy increase in some scenarios.

• To incorporate and show the benefits of using a Natural Language Understanding module to deal with the user inputs and compute features. While the ‘input features’ of Williams et al. (2017) are straightforward and effective, they put an extra burden on the action policy, since it has to learn from high-dimensional, noisy inputs. An NLU, on the other hand, is a model specialized in dealing with natural language input. It produces a simple label that captures the meaning of the text, effectively freeing the action classifier from having to deal with all the different ways a user can express the same idea. This produces slight but consistent improvements on the toy dataset. As an added benefit, the NLU can also perform Named Entity Recognition (NER), a necessary task for slot filling (e.g. requesting area and date to book a hotel). Williams et al. rely on simple regular expressions to look for known names in the user text, but this cannot generalize to unseen entity values, which are a common occurrence. The NLU, on the contrary, leverages more sophisticated methods such as Conditional Random Fields (CRF), which generalize better to unseen entities.

This document is organized in the following chapters: §Related Work reviews other existing approaches to task-oriented dialog agents. §Theoretical Foundation explains the theory behind the models explored. §Memory Networks and Ensemble Learning as Action Policy explains the original contributions of this work in detail, the general design and the training data formatting. §Experimental Setup summarizes the research questions the models aim to answer, as well as the score metrics, the different test settings and specific implementation details of the models explored. The §Results chapter presents the scores and interesting observations obtained in the experiments. Finally, the §Conclusions chapter explains what can be inferred from those results.


Chapter 2

Related Work

Dialog agents are computer models aimed at natural language interaction with a user. According to their purpose, these systems are classified into two classes (Su et al., 2016):

1. open domain or chitchat bots: dialog systems that engage in a natural conversation with the user without a predefined goal. This scenario is commonly tackled with sequence-to-sequence models that build an entire phrase word by word, conditioned on a phrase from the user. Vinyals and Le (2015) is a famous example of this approach, achieving engaging and often interesting conversations with the user. More sophisticated approaches may use a whole ensemble of models to decide what to say; Serban et al. (2017) is a remarkable example of this technique. Chitchat bots, however, can hardly keep track of any meaningful context in the conversation and are therefore inadequate for performing tasks for the user, since a task usually unfolds over a sequence of exchanges or dialog turns, making it highly context sensitive.

2. task-oriented chatbots: these dialog agents are meant to deal with a small domain, specializing in just a few possible requests from the user, and so their architectures are optimized to keep track of the dialog context. Normally they follow a more straightforward approach and just select the next phrase to say based on this context. This phrase could be a template with free slots that are filled in according to that context, conditioned on the dialog state. Young et al. (2013) review a series of approaches for this sort of statistical agent, stating the problem as a Partially Observable Markov Decision Process (POMDP), i.e. the next action is determined by an estimated current state. All these approaches can be seen as a pipeline of components. Figure 2.1 shows a general diagram for these agents (Chen et al., 2017; Young et al., 2013). Here the user input is processed by a natural language interpreter module (assuming the user inputs audio, which need not be the case). Its output is used to estimate a new conversation state, on which a policy acts to determine an action that is finally translated into language.

Figure 2.1: General architecture of task-oriented dialog agents (Chen et al., 2017)


2.1 Dialog State Tracking

Task-oriented dialog agents are highly context sensitive: at each turn, their answers need to take the previous turns in the dialog into account to make the right choice. In the pipeline architecture of Young et al. (2013), the module in charge of keeping all this dialog context is the Dialog State Tracker. In their work, the conversation state is a hidden discrete variable, and its value is estimated at each turn in the dialog. On the other hand, Henderson et al. (2014b) use Recurrent Neural Networks (RNNs) to keep track of the state and decide the next action, so the dialog state tracker and the action policy overlap.

In the Rasa architecture (Bocklisch et al., 2017), there is an explicit dialog state tracking object containing information such as the current slot values, the past actions taken by the agent and the previous utterances from the user. At every point, the action policy has access to the information contained in this object to decide the next action.

The HCN architecture (Williams et al., 2017) tracks the slot values explicitly, just like Rasa, but dialog state tracking relies mostly on the memory capacity of its LSTM action policy, just like Henderson et al. (2014b). The same holds in this thesis, since it is based on this architecture.

2.2 Neural Based Approaches

The Hybrid Code Networks architecture of Williams et al. (2017) is a hybrid machine learning/rule-based model that excels at the bAbI tasks, achieving state-of-the-art results. This raises the question of whether a fully statistical approach could fulfill these tasks and, if so, what must be changed. The experiments ahead will show that this is indeed possible, and that such an approach can excel at the bAbI tasks.

2.2.1 From User Input to Features

Using machine learning methods raises a second question of which user input features to compute, and so several ways to process the natural language input from the user into informative features are tried in this thesis. Figure 2.1 implies the use of semantic features, often in the form of semantic classes or dialog acts along with their confidence levels. Lee (2013) also uses semantic features to estimate the dialog state with positive results, but is not conclusive on whether the success is due to the features, the model or both. On the other hand, Henderson et al. (2014b) explore the use of word-based features (i.e. disregarding the semantic classes from the NLU component and working with the user input words instead). By using n-grams detected by the Automatic Speech Recognition (ASR) module as features for an RNN, they achieve better results than by using the semantic classes computed from those words. They conclude that this is because word-based features keep the most information from the user input, allowing the model to learn any useful features from it, while using semantic classes inevitably takes information away. However, their results come from the Second Dialog State Tracking Challenge (DSTC2) (Henderson et al., 2014a), a highly noisy dataset where keeping all that extra information might be beneficial. This need not be the case for less noisy datasets. Along this line, HCN also uses word-based features, consisting mainly of Bag of Words (BoW) and word embeddings (Mikolov et al., 2013), achieving the state of the art on bAbI tasks 5 and 6.

User input usually comes as text, mostly due to the use of chatbots on commercial websites and messaging platforms such as Messenger, Telegram or Slack. This by itself removes much of the noise in the input and makes semantic features a more attractive option. This is evidenced by the number of available NLU services, such as Microsoft’s Luis, Amazon’s Alexa, IBM’s Watson or the recently created open source Rasa NLU (Bocklisch et al., 2017), of which Braun et al. (2017) make an extensive comparison. They conclude that Luis performs better overall, but Rasa is not far behind. This thesis uses Rasa, given the advantages of an open source solution, such as knowing the underlying models and allowing any customization needed (e.g. it uses a fully customizable pipeline of models to produce its outputs). There is also a Rasa Core module providing an entire framework based on HCN, claiming to be robust under scarcity of data. Their architecture is roughly the same as that of Young et al. (2013), as can be observed in figure 2.2.

2.2.2 Action Policies

Most machine learning approaches rely on some sort of RNN, such as Henderson et al. (2014b) or the LSTM used by both Williams et al. (2017) and Bocklisch et al. (2017). Bordes et al. (2016) provide several baselines, including Information Retrieval (IR) techniques, but among non-rule-based systems, their best (by a wide margin) and most interesting result was obtained by using a Memory Network action policy, like the one from Sukhbaatar et al. (2015). However, they do not provide results with semantic features, and the model predicts from among the 2407 utterances ever observed in bAbI task 6 train, development or test data (4212 for task 5). This is a huge number, especially considering that most bot utterances are very similar except for the entities they mention (e.g. ‘here it is resto tokyo expensive thai 4stars address’ and ‘here it is resto seoul expensive thai 7stars address’).


Figure 2.2: Rasa task oriented chatbot architecture (Bocklisch et al., 2017)

This can be simplified by using templates with slots that are later filled by a module that keeps track of the entities (such as the NLU). This is in fact what HCN does, resulting in just 58 possible action templates for task 6 and 16 for task 5 (without even using an NLU, but simple keyword matching on user inputs to detect slot values). This gap leads to the main focus of this work: the Memory Network action policy proved promising according to the results of Bordes et al. (2016), but it could be severely hampered by the lack of action templates in the policy. At the same time, HCN does use templates, but not a Memory Network that could potentially improve its performance. For instance, they achieve perfect scores on task 5, but only by using hard-coded rules. This makes it tempting to explore whether a fully machine learning based method could also achieve the perfect score, since hard-coded rules are expensive to develop and maintain. Moreover, a model could combine the results of HCN’s LSTM policy and the Memory Network, using an ensemble just like Serban et al. (2017), an idea also explored by Henderson et al. (2014a). Finally, neither of those two models uses an NLU; they rely only on word-based features, which are easy to compute and perform well in noisy dialogs (Williams et al., 2017) but do not work well with out-of-vocabulary (OOV) terms.


Chapter 3

Theoretical Foundation

The proposed modified HCN architecture requires several components, each of which is explained below.

3.1 Natural Language Understanding

An NLU module takes natural language input from the user and produces a dialog act out of it. Additionally, it can also provide Named Entity Recognition (NER), which is required for slot filling, a crucial part of the dialog agent architectures explored in this work (Bordes et al., 2016). Both aspects are explained below:

• Intent classification: Natural language has a huge search space, even under the limited domain of a task-oriented dialog agent. While the user can express an intention in many grammatical ways, there can only be a few such intentions in a constrained domain. Classifying this intention means taking a natural language sentence and assigning a class to it, as if determining what the user wants the agent to do (Bocklisch et al., 2017). This is therefore a classification task, and the Rasa framework (used in the experiments ahead) performs it by computing features representing the natural language input (e.g. word embeddings and Part of Speech (POS) tags). These features are then fed to a classifier, such as a Support Vector Machine (SVM).

• Entity recognition: A task-oriented dialog agent normally requires input values from the user in order to perform a task. For instance, it could require a date before booking a hotel room for the user. This implies that the bot recognizes the existence of the ‘date’ entity, and when the user provides a value for it, the agent needs to detect it. This task is referred to as slot filling (Jurafsky and Martin, 2018) and it can also be regarded as a classification task, but one that classifies word sequences instead of single elements. Detecting these slot values or entities is crucial, and techniques can be as simple as finding known phrases with regular expressions. Rasa NLU uses a Conditional Random Field (CRF), which can generalize to detect unknown entity values. To do so, it classifies each word using Beginning (B), Inside (I) and Outside (O) labels, as explained by Jurafsky and Martin (2018). Table 3.1 shows an example of IOB labels used to train such a classifier. A B label indicates the first word of a slot. If the slot is composed of more than one word (as in south asian, for the example cuisine slot), then subsequent words are labelled with an I label. Words that are not part of any slot value are labelled with an O. The intent label is present to make explicit that the same sentences used to train the entity recognizer are usually used to train the intent classifier as well.

Sentence:  please  find  a  south      asian      restaurant  in  London
Label:     O       O     O  B-cuisine  I-cuisine  O           O   B-location
Intent:    search restaurant

Table 3.1: Example labelled sentence for intent classification and slot filling using IOB labels
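To make the IOB scheme concrete, the following Python sketch shows how the labelled example of Table 3.1 could be represented, and how contiguous B-/I- spans are collected into slot values. The dictionary layout and helper function are illustrative assumptions, not the actual Rasa NLU training schema.

```python
# Hypothetical representation of one labelled example (intent + IOB entity labels).
example = {
    "text": "please find a south asian restaurant in London",
    "intent": "search restaurant",
    "tokens": ["please", "find", "a", "south", "asian", "restaurant", "in", "London"],
    "iob":    ["O", "O", "O", "B-cuisine", "I-cuisine", "O", "O", "B-location"],
}

def iob_to_slots(tokens, iob):
    """Collect contiguous B-/I- spans into slot values, e.g. {'cuisine': 'south asian'}."""
    slots, name, words = {}, None, []
    for token, tag in zip(tokens, iob):
        if tag.startswith("B-"):
            if name:
                slots[name] = " ".join(words)
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            words.append(token)
        else:
            if name:
                slots[name] = " ".join(words)
            name, words = None, []
    if name:
        slots[name] = " ".join(words)
    return slots

print(iob_to_slots(example["tokens"], example["iob"]))
# {'cuisine': 'south asian', 'location': 'London'}
```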

There are many modules available to perform this task, like Facebook’s Wit AI¹ or IBM’s Watson², ready to be used even as a cloud service. Further details on how this approach is used in this work are given in chapter 4.

¹ https://wit.ai/
² https://www.ibm.com/watson/services/conversation/


3.2 Dialog State Tracker

The architecture used in this thesis relies mostly on Hybrid Code Networks: the memory of the action policy plus a separate component that explicitly keeps the slot values. In some of the experiments, the action policy is implemented as a Memory Network, which keeps a feature vector for each previous turn. This means the policy relies on itself to check the dialog context that enables the right choices. In later experiments, the policy is implemented by an ensemble of both a Memory Network and an LSTM, hence relying on the memory capacity of those models for dialog state tracking. In all these settings, the entities are always kept by a separate module, which feeds the action policies at each turn and also completes the values of the action templates (e.g. so that an action template such as ‘there are no more restaurants which serve <cuisine> food’ can be completed as ‘there are no more restaurants which serve thai food’).
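A minimal sketch of such a slot-keeping module is shown below; the class and method names are illustrative, not the actual implementation used in the experiments.

```python
class SlotTracker:
    """Hypothetical slot-keeping module: stores entity values as they are
    detected and fills the open slots of a predicted action template."""

    def __init__(self):
        self.slots = {}

    def update(self, entities):
        # entities: e.g. {'cuisine': 'thai'}, coming from NER or regex matching
        self.slots.update(entities)

    def fill(self, template):
        text = template
        for name, value in self.slots.items():
            text = text.replace(f"<{name}>", value)
        return text


tracker = SlotTracker()
tracker.update({"cuisine": "thai"})
print(tracker.fill("there are no more restaurants which serve <cuisine> food"))
# -> there are no more restaurants which serve thai food
```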

3.3 Action Policy

This section provides a brief summary of the action policies considered in this work, namely the LSTM and Memory Networks. Their specific use for task-oriented dialog agents is further explained in chapter 4.

3.3.1 LSTM

An LSTM is a well-known recurrent neural model for dealing with sequential data. In a dialog, each turn is represented by a feature vector or embedding, which is fed to the network. At each step, the LSTM action policy outputs a vector giving a distribution over actions, and the argmax of this output vector is taken as the selected action.

The features representing a dialog turn must contain sufficient information for the action policy to make the right decision. In the HCN architecture, the features representing a turn include information such as which slots are filled, the last action of the bot and other features representing the words in the user utterance. The Rasa (Bocklisch et al., 2017) framework uses its LSTM action policy in a similar way, with the interesting difference that each token in the sequence contains features representing not just the last turn, but a fixed number of previous turns. Although the authors do not explain the reason for this design decision, this way of defining tokens could compensate for the memory loss suffered by RNNs: since the most recent turns are usually the most relevant in the dialog, keeping more than just the last one in the current turn’s input could alleviate the memory loss.
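As an illustration, the following PyTorch sketch implements an LSTM action policy of this kind; the dimensions and the random input are placeholders, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """One feature vector per dialog turn in, a distribution over action
    templates out (feature/hidden/action sizes are hypothetical)."""

    def __init__(self, feature_dim=128, hidden_dim=64, num_actions=16):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_actions)

    def forward(self, turns):                # turns: (batch, num_turns, feature_dim)
        states, _ = self.lstm(turns)         # one hidden state per turn
        return self.out(states).softmax(-1)  # action distribution at every turn

policy = LSTMPolicy()
dialog = torch.randn(1, 5, 128)              # a 5-turn dialog with random features
action = policy(dialog)[0, -1].argmax()      # argmax = selected action at the last turn
```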

3.3.2 Memory Networks

The use of Memory Networks for dialog action prediction is a crucial aspect of this work. For this reason, and because they are a relatively recent invention, this section explains the model in detail.

When dealing with sequential data, Recurrent Neural Networks (RNNs) (Goller and Kuchler, 1996) are possibly the most used approach, being the de facto option for sequential problems such as language modeling, POS tagging or machine translation. These well-studied models process the sequence token by token, usually making a decision at each step. Memory Networks (Weston et al., 2014) are a recent alternative that aims to tackle the tendency of RNNs to forget long-range dependencies (Hochreiter and Schmidhuber, 1997). It is important to realize that memory networks are not a single model but a family of models, just like RNNs. This work explores the specific approach to Memory Networks taken by Sukhbaatar et al. (2015), since it is easier to train and more adapted to the dialog domain than the original work on the subject by Weston et al. (2014), and is therefore the one tried by Bordes et al. (2016).

Instead of keeping a single memory vector in charge of representing every token in the sequence so far, a Memory Network computes a representation or embedding for each token and considers them all at once when making a decision. This gives them all an equal chance to contribute to the answer. This is not the case with RNNs, where the most recently processed tokens have a higher chance of influencing the output, because all tokens share a single memory that tends to keep less information about the old tokens every time a new token is processed. Figure 3.1 shows a diagram of the Memory Network architecture adopted in this work, proposed by Sukhbaatar et al. (2015).

Figure 3.1: Diagram of a Memory Network (Sukhbaatar et al., 2015)

The input of the model is the current utterance from the user (q) and the conversation history, a sequence of feature vectors x_1, ..., x_M, each one representing the respective dialog turn. Here, M is called the memory size, a hyperparameter of the model: every time an utterance is added to the history, the oldest is discarded, so that no more than M memories are ever considered. In Sukhbaatar et al. (2015), the feature vector q is assumed to be computed in the same way as each utterance in the history, so that the current q can be added to the history for the next turn’s prediction without further operations. This is not strictly necessary, but it is also the way it is done in this work. An embedding is computed for q and for each utterance in the history by means of embedding matrices B (d × h) and A (of the same dimensions as B):

u = qB

m_i = x_i A

Each of the M memory embeddings is compared with u by a dot product. A softmax is then applied to produce a simplex that scores each of the candidate memories. A final fixed-size memory representation is obtained by computing another M embeddings of the history, c_1, ..., c_M, using a matrix C of the same dimensions as A and B. The final embedding o is the average of the c_i, weighted by the attention vector p:

p_i = exp(m_i · u) / Σ_{i'} exp(m_{i'} · u)

c_i = x_i C

o = Σ_i p_i c_i

From here, all that remains is to map this to the desired output space, whose dimension is the number of actions V. The authors propose to take the sum o + u and map it to the V space simply by using a matrix W (h × V). They also report better results when applying another embedding to q with a matrix H (d × d), so this is the alternative chosen in this work.

The model performs well up to this point, but the authors report performance gains on several tasks by applying a recursive step called a memory hop. This is akin to the model reconsidering and refining its answer. Instead of applying the final transformation with matrix W to the output o + u (or o + qH), a hop uses this term as the input for another round through the entire model, repeating this recursive step a fixed number of times and applying the final W transformation only at the last step. This multiplies the number of parameters of the model, raising the question of which ones to share or constrain. The authors report increased performance on many tasks, such as language modeling, with up to 6 hops. Regarding the number of parameters, this work adopts the so-called ‘adjacent’ approach from Sukhbaatar et al. (2015), in which all matrices are shared across hops except A and C: there are as many A matrices as hops, and in hop j, C_j = A_{j+1}. This is the same approach taken by Bordes et al. (2016). Figure 3.2 shows a diagram of the recursive operations in a Memory Network.

Figure 3.2: Diagram of a recursive Memory Network with 3 hops (Sukhbaatar et al., 2015)

The history is the same across hops (and the number of hops is fixed for the model, not dependent on the current turn). In this diagram, matrix H is omitted and no constraint is explicitly enforced on any matrix. Note how matrix W is applied only to the output of the final hop; only then is the final answer a computed.
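The following NumPy sketch summarizes the forward pass just described, including the hop recursion and the adjacent weight sharing; the toy dimensions and random matrices are only for illustration, not the settings used in the experiments.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_network(q, history, A_list, B, C_list, W, hops=1):
    """q: (d,) current-turn features; history: (M, d), one row per stored turn.
    B and each A_k, C_k are d x h; W is h x V (adjacent scheme: C_j = A_{j+1})."""
    u = q @ B                        # query embedding u = qB
    for k in range(hops):
        m = history @ A_list[k]      # memory embeddings m_i = x_i A
        p = softmax(m @ u)           # attention weights p over the M memories
        c = history @ C_list[k]      # output embeddings c_i = x_i C
        o = p @ c                    # weighted summary o = sum_i p_i c_i
        u = o + u                    # input to the next hop; W only at the end
    return softmax(u @ W)            # distribution over the V action templates

# Toy run: 3 remembered turns, 2 hops, 5 candidate actions.
rng = np.random.default_rng(0)
d, h, V, hops = 12, 8, 5, 2
A = [rng.normal(size=(d, h)) for _ in range(hops + 1)]
C = A[1:]                            # adjacent constraint: C_j = A_{j+1}
B, W = rng.normal(size=(d, h)), rng.normal(size=(h, V))
probs = memory_network(rng.normal(size=d), rng.normal(size=(3, d)), A, B, C, W, hops)
```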

3.3.3 Policy Ensemble

Using several action policies to predict an answer and leveraging all their predictions to compute a better one is what makes an ensemble, and this has proven beneficial in many domains, including dialog agents (Henderson et al., 2014a; Serban et al., 2017).


Given a set of M models, each one computes a prediction in the form of a categorical distribution over the available actions. Let p_i be the prediction of model i for any given turn in the dialog. The immediate question is how to leverage the knowledge in each of the M distributions to compute a final, better answer. Henderson et al. (2014a) propose two methods, namely score averaging and stacking, which are explained below along with the simple highest-confidence approach.

Highest Confidence

The most straightforward solution: let the model with the highest confidence decide:

action = argmax(max(p_1, ..., p_M))

In the case where one model is right more often than the rest, or where some subset of models complement each other by being correctly confident where the others are not, this approach should improve overall prediction accuracy.

Average Prediction

A similarly simple but often smarter approach is to take the average across all predictions and use that as the final categorical distribution:

action = argmax((1/M) Σ_{i=1}^{M} p_i)

As argued by Henderson et al. (2014a), as long as each model’s predictions are correct more than half the time and their errors are not correlated, average prediction is guaranteed to improve performance. Using different models encourages decorrelation, and using different training data is a simple way to help decorrelate as well.

Stacking

Instead of deciding what to do with each prediction, a separate model can be used to learn from those predictions. This idea comes from Wolpert (1992), who calls it stacked generalization.

action = f(p_1, ..., p_M)

where f is a learned function. This approach has the advantage that it can potentially learn whatever features are required to get the right answer given the independent predictions of each underlying policy, and can therefore obtain better results than averaging, as is the case in Henderson et al. (2014a). However, it requires an extra dataset to train this separate model.
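A compact sketch of the three combination rules follows, with toy distributions over three actions; the scikit-learn MLP stands in for the stacking combiner, and the random training data is purely illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

p_lstm   = np.array([0.10, 0.70, 0.20])   # toy softmax outputs of two policies
p_memnet = np.array([0.50, 0.30, 0.20])   # over the same three actions

def highest_confidence(preds):
    stacked = np.stack(preds)
    best = stacked.max(axis=1).argmax()    # policy with the single highest score
    return int(stacked[best].argmax())

def average_prediction(preds):
    return int(np.stack(preds).mean(axis=0).argmax())

print(highest_confidence([p_lstm, p_memnet]))  # LSTM is most confident -> action 1
print(average_prediction([p_lstm, p_memnet]))  # mean [0.3, 0.5, 0.2]   -> action 1

# Stacking: a learned f(p_1, ..., p_M), trained on a held-out split so the
# combiner never sees the data the base policies were trained on.
X = np.random.rand(200, 6)                 # concatenated [p_lstm, p_memnet] per turn
y = np.random.randint(0, 3, size=200)      # gold action indices (stand-ins)
stacker = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
action = stacker.predict(np.concatenate([p_lstm, p_memnet])[None, :])[0]
```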


3.4 Natural Language Generator

In open-domain chitchat bots, Natural Language Generation is a complex task, requiring a model of the utterance posterior probability given (at least) the dialog state (Vlad Serban et al., 2015), a task often solved with generative language models such as Sequence to Sequence (Vinyals and Le, 2015). On the contrary, task-oriented dialog agents do not need to produce creative answers, only to fulfill a given goal. Therefore, generating an answer often takes little more than the policy deciding on a message template with unfilled slots (e.g. ‘Hotel La belle ville is in <city>’) and then filling those slots with the values tracked in the Dialog State Tracker module before outputting the answer. This approach is used by many authors, such as Williams et al. (2017); Bocklisch et al. (2017); Vlad Serban et al. (2015), and it is the one used in this work as well.


Chapter 4

Memory Networks and Ensemble Learning as Action Policy

This chapter goes deeper into the core contributions of this thesis. Section 4.1 lists and explains the features compared in the experiments. Section 4.2 explains how a Memory Network can be used as the action policy of a task-oriented dialog agent, highlighting its differences from an LSTM. Section 4.3 continues with how to build an ensemble of action policies from the Memory Network and the LSTM. Finally, section 4.4 covers the data used and how it was formatted to be compatible with the architectures used in this thesis.

4.1 Input Features for the Action Policies

One important contribution of this work is the design of task-specific user intents and the use of an NLU to compute this feature and add it to the HCN architecture. This feature is directly compared to the default features from HCN. While the HCN architecture uses word-based features, user intent is a common feature provided by most popular NLU cloud services (Braun et al., 2017). Rasa NLU (Bocklisch et al., 2017) is the one chosen for the experiments performed in this work, because it is completely open source, highly customizable and effective, as can be seen in chapter 6. An advantage of user intents is that they can be common across different domains, increasing the availability of training data. Another potential advantage is that they are less noisy than word-based features, making it easier for an action policy to learn from them. This work explores the value of using an NLU to get the user intent feature in comparison to the HCN word-based features, as well as the NLU entity recognition in comparison to HCN’s regular expression based pattern matching. The following list explains the features used in the experiments ahead and how to compute them (a sketch of how they are assembled into a single turn vector follows the list). Most of the list consists of features taken from the HCN architecture, while the user intent and the way entity flags are computed are direct contributions of this thesis.

• User intent: implemented as a vector whose size equals the total number of intents defined. In this work, this feature is computed by Rasa NLU, using labelled training data to feed an SVM as intent classifier. The exact set of intents to use is domain dependent and is a design decision. Section 4.4.2 explains the intents used on each task explored in this thesis.

• Entity flags: binary flags indicating the presence of an entity in an utterance. They can either be provided by Rasa NLU (using a CRF) or computed with regular expressions that detect known phrases in the utterances, which is the approach used by HCN. Both options are compared in the experiments.

• Bag of Words: a binary vector of vocabulary size, plus one extra bit to accommodate unknown words at test time. Each entry corresponds to a word in the vocabulary and is set to 1 if the user utterance includes that word. This is easy to compute, with the major drawback of losing word order and disregarding word similarity (e.g. a test input could be very similar to a train input except for a word replaced by a synonym; the BoW approach would disregard this similarity). Another major drawback is that it performs poorly with unknown and low-frequency words.

• Word Embeddings: these features fix most flaws of BoW features, since each word is now represented by a vector holding semantic information. There are several implementations of word embeddings; HCN uses those from Mikolov et al. (2013), specifically the Google News 300-dimensional embeddings trained with a 3M vocabulary size¹. The sentence embedding is simply calculated by averaging the word embeddings.

¹ Google News 100B model from https://github.com/3Top/word2vec-api


This approach still loses word order, but the impact of that in a task-oriented domain is negligible, as opposed to an open domain full of nuances and sentiment. The word embeddings along with the BoW features are commonly referred to as ‘word-based features’ in this work.

• Turn: an integer representing the index of a turn in the dialog. This feature is especially important for a Memory Network action policy, which otherwise does not consider the order of its memories, unlike an RNN. Sukhbaatar et al. (2015) even propose to learn a time-embedding matrix, with one row for each turn index that can be considered; this time embedding is added to the input features and learned during training. In this work, an integer is preferred, since it is sufficient to represent the order information, while the benefit of time embeddings is not justified by the authors. A time embedding also puts a limit on the number of turns that can be considered in a dialog, as the matrix needs to add rows accordingly. Another design decision concerning the turn feature is whether to encode it as a binary number or as an integer. The latter introduces a bias when the action policy learns with back-propagation, since higher numbers obtain a training boost. But since in a dialog the higher-numbered (i.e. more recent) turns usually have more influence on the dialog context (this is studied in section 6.2.3), this bias can be beneficial.

• Bot previous action: a one-hot vector indicating the previous action of the dialog agent. This can be either the ground-truth last bot action or the actual last predicted action, with significant consequences for performance that are explored in the experiments. For more details on these two possibilities, see section 5.2.2.

• Context flags: the HCN architecture includes special context bits in the input, meant to provide context about the dialog. These features are highly dependent on the restaurant booking domain of bAbI tasks 5 and 6. The complete list of such flags is:

1. presence of each entity in the dialog state. Note that this is different from the features indicating the presence of each entity in the current utterance; HCN thus uses two sets of features dedicated to entities. This is the only context feature that HCN uses for bAbI task 5, while task 6 uses all the context features on this list.

2. whether the database has been queried yet. It is not used in this work, since it contains almost the same information as the ‘results presented’ context feature explained below, except for marginal scenarios where the database has no results for the given query.

3. whether there are any restaurants to offer given the current filters. Williams et al. (2017) implement this as two binary flags representing the two possibilities as 1 0 and 0 1.

4. whether any results have been presented so far.

5. whether all results have been exhausted for the current query (this one uses two bit flags as well).

6. whether the cuisine type is unknown (which happens often in the bAbI dialogs).

7. whether the query yields any results in the training set. In this work, this feature was disregarded, since it proved to add little to no value during hyper-parameter optimization, arguably because of the differences between the train and test bots.

This thesis compares the HCN word-based features with the NLU-provided user intent feature. The rest of the features are common to both settings.
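As the sketch promised above, the helper below concatenates these features into the single turn vector fed to the action policies; the argument names, sizes and ordering are illustrative assumptions, not the exact layout used in the experiments.

```python
import numpy as np

def turn_features(intent_onehot, entity_flags, prev_action_onehot,
                  turn_index, context_flags):
    """Assemble one turn's input vector (NLU-intent setting): intent one-hot,
    per-utterance entity flags, previous bot action, integer turn index and
    the domain-specific context bits described above."""
    return np.concatenate([
        np.asarray(intent_onehot, dtype=float),
        np.asarray(entity_flags, dtype=float),
        np.asarray(prev_action_onehot, dtype=float),
        np.array([float(turn_index)]),
        np.asarray(context_flags, dtype=float),
    ])

# e.g. 9 intents, 4 entities, 16 actions, 7 context bits -> one flat vector
x = turn_features([0]*8 + [1], [1, 0, 0, 0], [0]*16, 3, [1, 0, 1, 0, 0, 1, 0])
```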

4.2 Memory Networks as Action Policy

The main goal of the action policy is to decide the next action. Recurrent Neural Networks are a popular choice because their memory capacity enables them to fulfill both the state tracking and the action policy tasks. Memory Networks also possess memory capacity, which in some tasks, such as Q&A or language modeling, has achieved state-of-the-art results, even outperforming LSTM-based approaches (Sukhbaatar et al., 2015).

Inspired by Bordes et al. (2016), this work proposes the use of a Memory Network as action policy. The approach presented here differs, however, in two crucial aspects:

1. Action templates: Bordes et al. (2016) consider every possible bot utterance in their domain as an action to classify on, which results in thousands of actions for their experiments. The model presented here considers action templates instead, just like Williams et al. (2017). An action template is a bot utterance with some slots to be filled in after the prediction is performed. This results in just tens of possible actions for the same experiments used by Bordes et al. (2016). The Memory Network therefore has an easier task at hand, which allows it to perform even better.


2. NLU input processing: While other authors make use of word-based features such as Bag of Words or word embeddings (Williams et al., 2017; Bordes et al., 2016), this thesis experiments with the effect of using a Natural Language Understanding module to process the raw user text input and produce a semantic feature, namely a one-hot vector indicating the user intention. This intent vector is easier for the action policy to deal with, albeit losing some information. The experiments that follow test this effect both in artificial toy scenarios and in real human-bot conversations under high-noise conditions.

Other than the above, the usage of a Memory Network for this task is the same as in Bordes et al. (2016). That is, every conversation turn is converted into a fixed-size vector of features. At every turn the network keeps track of the history of such vectors, up to a fixed length. This history is composed of the x_i sentences in figure 3.1. The user utterance at the current turn corresponds to the question q. Consider, for instance, that at a given turn q is a vector of features representing the utterance ‘I want a restaurant please’. If the user has not provided any slot values yet, this will be visible in the history, and the network can conclude that the best next action is to ask for a slot value, for instance ‘what kind of cuisine would you like?’. Otherwise, if enough information is present in the history, the policy can conclude that the best action is to search for a restaurant and offer it to the user.

The history of previous turns plus the current turn’s vector of features is the input of the Memory Network at each turn. Its output is an action from the set of action templates. These templates are strings of text with open slots to be filled by another module that specializes in entity recognition and does not care about action prediction (such as Rasa NLU’s NER or the regular expression based pattern matching technique from HCN).

4.3 Building an Action Policy Ensemble

A Memory Network is just one potential action policy, just like an LSTM. Once there are several models, and assuming they have different knowledge (that is, their predictions perform sufficiently well and their errors are not fully correlated), their knowledge can be combined to make a final prediction. This approach has been used successfully for open-domain dialog agents (Serban et al., 2017) and also for dialog state tracking (Henderson et al., 2014a). This thesis proposes to use it for task-oriented dialog agents. It explores the three ensemble approaches explained in section 3.3.3, using the Memory Network as proposed in section 4.2 and the LSTM as used by Williams et al. (2017) as action policies. For any of the three ensemble approaches, the input is always the output prediction of each policy, in the form of a vector with the softmax over available classes. For the stacking ensemble, a Multi-Layer Perceptron (MLP) is used as the classifier, which is in line with Henderson et al. (2014a) and is a popular choice for simple classification tasks such as this one.

4.4 Designing and Extracting Intents and Actions

This section explains important contributions of this work regarding the way it used the bAbI datasets. The exact set of actions predicted by the action policies, as well as the user intents classified by the NLU, require important design decisions regarding the dataset used. Before explaining these decisions, the bAbI tasks are properly introduced to better understand the data format and main characteristics.

4.4.1 Dataset Description

The bAbI tasks

The bAbI tasks (Bordes et al., 2016) are a well-known dataset aimed at training, testing and comparing task-oriented dialog agents, produced by Facebook AI Research² with many benchmarks available. It is composed of a series of 6 tasks with increasing levels of complexity, making it a popular choice used by authors such as Williams et al. (2017). Each task tests a specific aspect of dialog and builds on the previous tasks in an incremental way. For instance, the first task is meant to test the ability of a system to learn to issue API calls when required, the second task focuses on the ability to update the API calls if the user changes her mind, and so on. Task 5 tests full dialogs, effectively subsuming the previous 4 tasks; therefore, both HCN and this work focus on it and on task 6, a similar but much harder task explained below.

Each task consists of a training, validation and test set. Each set is composed of dialogs, which are lists of user-bot utterance pairs. They are all in the domain of restaurant booking, where the user has a specific need in mind (e.g. thai food in the center of town). The bot must obtain these query filters (also called slot values or entities) in order to query a restaurant database and offer the results. For task 5, the slot values that the user can provide to filter the database are type of cuisine, location, number of people and price range.

² https://research.fb.com/downloads/babi/


Task                       T5    T6
Average num. utterances    55    54
Average user utterances    13     6
Average bot utterances     18     8
Vocabulary size          3747  1229
Different bot utterances 4212  2406
Train set dialogs        1000  1618
Dev. set dialogs         1000   500
Test set dialogs         1000  1117

Table 4.1: Statistics for bAbI tasks 5 and 6

For an example of the raw dialog format, see appendix C, figure C.2.
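For illustration, a small parser for this dialog format is sketched below, under the assumption that the files follow the released bAbI dialog layout (a turn index, the user utterance, a tab, the bot utterance, with blank lines separating dialogs); see figure C.2 for the actual format.

```python
def load_babi_dialogs(path):
    """Parse a bAbI dialog file into a list of dialogs, each a list of
    (user, bot) utterance pairs. Assumes '<id> <user>\t<bot>' lines."""
    dialogs, current = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line marks a dialog boundary
                if current:
                    dialogs.append(current)
                current = []
                continue
            _, _, rest = line.partition(" ")  # drop the leading turn index
            user, _, bot = rest.partition("\t")
            current.append((user, bot))
    if current:
        dialogs.append(current)
    return dialogs
```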

There is an extra task 5 test set with out-of-vocabulary (OOV) entity values, that is, a set of dialogs where the user mentions entity values never seen in the training set. To produce this OOV test file, the authors split all cuisine types and locations in half: the training, development and test sets use only restaurants from one of those halves, while the OOV test file uses values only from the other.

Given the simplicity of this dataset, it is desirable to see how a model would perform in a more realistic environment. To this end, many authors turn to bAbI task 6 (Williams et al., 2017; Bordes et al., 2016), an adaptation of the Second Dialog State Tracking Challenge (Henderson et al., 2014a) to the bAbI tasks format. This task also deals with the restaurant booking domain.

Table 4.1 summarizes relevant statistics about bAbI tasks 5 and 6.

The Second Dialog State Tracking Challenge

This challenge was originally created to test how well a given model can predict the state of a conversation at each turn. To this end, 3 simple bots with different levels of complexity were created to have phone conversations with human users pretending to have an actual need for a restaurant.

The DSTC2 dialogs have many sources of noise, including the actual noise on the phone line. In fact, conversations were recorded under different noise levels on purpose; in some extreme cases even the human transcribers could not decide what the message was. Added to that, the bots handling the calls are rather simple, relying on basic probabilistic models to track the conversation state and hand-crafted action policies (only the bot from the test set used reinforcement learning in its action policy, making the distributions governing the test dialogs different from those of the training and validation sets). Very often the bot misunderstands the user message, but the conversation carries the mistake forward nonetheless, posing an important challenge for any model trying to predict the bot side.

The conversations were thoroughly annotated with a wide range of meta-data, such as transcripts, user-provided quality assessments, semantic annotations on both the bot and human utterances, and the Automatic Speech Recognition hypotheses for the words uttered by the user. The actual challenge gathered 31 contestants (from 9 research groups) whose models had to learn to estimate the conversation state at each turn, defined as the current value of each slot, the currently requested slots (such as the phone number or address of the restaurant) and the currently desired search method (which roughly corresponds to the user intent as defined in section 4.1). There are other Dialog State Tracking Challenges, like the original one (Williams et al., 2013), which focuses on the bus timetable domain with easier dialogs (e.g. users cannot change their mind), or the third challenge, which uses the same training data as the second but provides a test set on the tourist information domain to explore domain adaptability. There are yet other challenges, up to the sixth, exploring more and more complex aspects, including a switch of language at test time. The second DSTC is used in this work since it is the most appropriate for the problem at hand, it is freely available, it has already been converted to the bAbI format and it has been used by other authors, including the ones this thesis focuses on most (Bordes et al., 2016; Williams et al., 2017).

The dialogs are formatted into json files, with one file for the human side and another for the bot. An example fragment showing the file format is available in appendix C, figure C.1.

4.4.2 How to get User Intents and Actions from the bAbI Tasks

This section describes the design decisions made in this thesis to choose which intents to define for each task and how to get the training data for the NLU to detect them, as well as how to define the bot action templates for the action policy to classify on.


Intent           Templates  Count  Example
greet            3          1000   good morning
inform           8          4998   i am looking for a cheap restaurant
deny             4          2404   no i don't like that
affirm           4          1000   it's perfect
request phone    3          755    do you have its phone number
request address  3          779    do you have its address
thankyou         3          1000   thanks
bye              2          1000   no thanks
silence          1          5404   <SILENCE>

Table 4.2: Intent statistics for bAbI task 5 observed in the training data (development and test data have the same intents, and the proportions are similar). The silence special utterance is used for turns in which the bot says something that requires no input from the user, such as in line 8 of figure C.2.

Act                      Count  Template
greet                    1000   hello what can i help you with today
on it                    1000   I'm on it
ask location             512    where should it be
ask number of people     480    how many people would be in your party
ask price                494    which price range are looking for
announce search          2000   ok let me look into some options for you
api call                 2000   api call <cuisine> <location> <number> <price>
request updates          2018   sure is there anything else to update
suggest restaurant       2404   what do you think of this option: <restaurant name>
announce keep searching  1404   sure let me find an other option for you
reserve                  1000   great let me do the reservation
give phone               755    here it is <phone>
ask anything else        1000   is there anything i can help you with
bye                      1000   you're welcome
ask cuisine              494    any preference on a type of cuisine
give address             779    here it is <address>

Table 4.3: Bot dialog act statistics for bAbI task 5. Counts come from the training set; templates remain the same across the train, development and test sets.

Task 5 and Task 5 OOV

In these dialogs the user tries to get a restaurant by providing up to 4 slot values (namely, cuisine, number of people, price and location), potentially requesting either the phone number or the address once an option is offered. The user can also change her mind during the conversation and the bot needs to accommodate this. The utterances from the user are also very simple, since these conversations were artificially produced; thus, neither the bot nor the human utterances have much diversity. In fact, rule based systems can effectively excel at these tasks, but an important goal in this work is to succeed at them without relying on anything other than machine learning.

By templatizing every user utterance, just 31 possible utterance templates are obtained (e.g. 3 ways to greet, namely 'good morning', 'hello' and 'hi'). This is valuable knowledge for the NLU, as all those 31 templates can be further grouped into just 9 intents. Table 4.2 shows the intent classes defined for this dataset as part of the specific model design used in this thesis. Do note that this intent classification is not enforced or suggested in any way by the authors of the dataset; but given the lack of grammatical diversity, and the context in which the utterances are used, any author having to classify those utterances by intent could hardly arrive at a different classification.

The bot phrases are also limited and are used very consistently in each context. They can all be represented with 16 templates, as can be seen in table 4.3.

Task 6

This task is the result of taking the DSTC2 dialogs and putting them in the bAbI format. The authors explain that the training and development set split is not the same as in DSTC2, but the test set is preserved. Furthermore, the utterance diversity on both the human and bot side is significantly greater than that of task 5, requiring more effort and creativity to define the user intents and bot actions.


Figure 4.1: Total occurrences of each bot act in bAbI t6 test set

Figure 4.2: Frequencies of each bot act in bAbI t6, both for test and training set

Bot action templates

Since there were different bots in each set, the templates seen in the train, development and test data are not entirely the same, but they can easily be grouped together based on their meaning. By doing so, 56 templates were identified. This is an easier set for the action policy to work with than the 4212 possible bot utterances used by Bordes et al. (2016). Some of these templates are very similar to others, such as 'I'm sorry but there is no restaurant serving <cuisine>' and 'I am sorry but there is no <cuisine> restaurant that matches your request'. Such subtle variations might be due only to the fact that a different bot produced each one, and that would constitute a latent variable that is not explicitly accounted for in the models explored in this work or by other authors such as Williams et al. (2017). The complete list of templates is available in appendix A. Figure 4.1 shows the total occurrences of each bot action template in the bAbI task 6 test set.

Since the bot generating the utterances of the test set is different from those of the train set, it is interesting to see how the distributions vary across datasets. Figure 4.2 shows both frequencies side by side. Both distributions are reasonably similar, but the training set is more skewed towards the 3 dominating acts, while others appear only marginally or not at all (note for instance the huge gap for the repeat act, which almost never appears in the train set but does much more often in the test set). This act is very hard to predict, not only because of this gap but mostly because it depends heavily on the output of the ASR in the bots generating the data, and this information is not available for bAbI task 6. This is one more reason why task 6 is so much harder than task 5.

To get a better idea of how different the training and test distributions are for bAbI task 6, one can use task 5 as a baseline. Figure 4.3 shows this frequency comparison for task 5.

For bAbI task 5, the training distribution is extremely similar to that of the test set. A common measure of how similar two distributions are is the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), which is 0 for two identical distributions and grows without bound as they diverge. Task 5 reports a KL(train, test) of 0.0004, while that of task 6 is 0.3658, a difference of roughly three orders of magnitude.
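
As a reference for how such a figure can be obtained, the following is a minimal sketch of computing KL(train, test) from per-act counts; the smoothing constant and the example counts are illustrative assumptions, not the actual task statistics.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """KL(P || Q) for two discrete distributions given as raw counts.
    eps smooths zero counts so the divergence stays finite."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical per-act counts in the train and test sets (illustrative only)
train_counts = [1000, 2000, 512, 480]
test_counts = [1117, 1088, 574, 270]
print(kl_divergence(train_counts, test_counts))
```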

User acts

Since the utterances were produced by actual humans, the diversity is too great to handle with templates as was done on the bot side, and templates from the training set would not be of much use in the test set. Fortunately, the DSTC2 dataset is provided with semantic annotations.


Figure 4.3: Frequencies of each bot act in bAbI t5, both for test and training set

Intent            Description
affirm            affirmation
bye               dialog end cue
dontcare          indicate the answer to the question just asked is irrelevant
inform            provide slot values for a query
negate            answer with a negation
reqalts           indicate the offered option is not desired and ask for a new one
request address   request the address of the offered restaurant
request food      request the food type of the offered restaurant
request location  request the location of the offered restaurant
request phone     request the phone of the offered restaurant
request postcode  request the postcode of the offered restaurant
request price     request the price range of the offered restaurant
silence           used in the bAbI format for turns that require no input from the user (e.g. the bot suggesting a restaurant right after an api call is considered a new turn with no input from the user)
unknown           any utterance not matched by the rules

Table 4.4: User intents extracted from bAbI task 6, using word matching rules on all user utterances from the training set. The rules were determined according to what can be served from the bot side.

Every utterance in that dataset has semantic labels provided by Amazon Mechanical Turk, from 13 possible options, and each utterance can have more than one label. For instance, the utterance 'ah yes im looking for persian food and i dont care about the price' is annotated with the labels 'affirm', 'inform(food=persian)' and 'inform(price=dontcare)'. Since the NLU architecture is designed to assign a single intent to each utterance, every possible combination observed in the DSTC2 training set was analyzed and mapped to a single intent. This by itself still does not answer the question of which intents to define. To answer that, all the possible bot template answers were considered, so that every user intent defined can be acted upon by the bot. Every other possible user utterance is assigned to the 'unknown' intent. Table 4.4 lists the intents used for task 6.
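
A minimal sketch of this label-collapsing step is shown below; the dictionary entries and function names are illustrative assumptions, not the exact mapping used in the thesis.

```python
# Hypothetical mapping from observed DSTC2 label combinations to a single
# intent; the entries here are illustrative, not the thesis's full mapping.
COMBINATION_TO_INTENT = {
    frozenset({"affirm"}): "affirm",
    frozenset({"inform"}): "inform",
    frozenset({"affirm", "inform"}): "inform",
    frozenset({"negate", "inform"}): "inform",
    frozenset({"reqalts"}): "reqalts",
}

def single_intent(dstc2_labels):
    """Collapse a set of DSTC2 dialog-act labels into one NLU intent,
    falling back to 'unknown' for unseen combinations."""
    acts = frozenset(label.split("(")[0] for label in dstc2_labels)
    return COMBINATION_TO_INTENT.get(acts, "unknown")

# e.g. 'ah yes im looking for persian food and i dont care about the price'
print(single_intent(["affirm", "inform(food=persian)", "inform(price=dontcare)"]))
# -> 'inform'
```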

To detect entities in the user utterances and produce training data for the NER component of the NLU, regular expressions were used to detect the known values of each entity type. Do note that Williams et al. (2017) use these regular expressions as their entire NER component, whereas in this thesis regular expressions are only used to gather the NLU training data; the NLU itself should be able to deal better with unknown entity values. The complete list of regular expressions used is available in appendix B.
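
As an illustration of how such training data can be gathered, the following sketch marks known entity values with character-level spans of the kind Rasa NLU expects; the value list and pattern are assumptions, the real expressions being those of appendix B.

```python
import re

# Illustrative cuisine values; the actual per-entity expressions are in
# appendix B.
CUISINES = ["british", "indian", "persian"]
CUISINE_RE = re.compile(r"\b(" + "|".join(CUISINES) + r")\b")

def mark_entities(utterance, entity="cuisine"):
    """Return character start/end entity spans for one utterance, to be
    used as NER training data."""
    return [{"start": m.start(), "end": m.end(),
             "value": m.group(), "entity": entity}
            for m in CUISINE_RE.finditer(utterance)]

print(mark_entities("im looking for persian food"))
# [{'start': 15, 'end': 22, 'value': 'persian', 'entity': 'cuisine'}]
```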


Chapter 5

Experimental Setup

This chapter establishes the research questions that this work intends to answer. For each such question, it provides specifics about the test environment, such as metrics, conditions and implementation details.

5.1 Implementation Details

This section describes the technical specifications of the models developed as part of this work. For a description of the feature sizes, see table 5.1.

5.1.1 NLU

Rasa NLU works as a customizable pipeline of models. Each model computes features and takes as input the user utterance as well as any features computed by models earlier in the pipeline. The default Rasa NLU pipeline (as of version 0.11.3) was used. This pipeline uses GloVe embeddings (Pennington et al., 2014), n-grams, Part of Speech (POS) tags and synonyms as its main features. It then feeds them to the sklearn SVM implementation, using a linear kernel to classify intents and grid search to optimize the C parameter, and to sklearn crfsuite 1 for entity tagging.
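
For reference, a minimal sketch of this intent classification step (a linear-kernel SVM with grid search over C); the feature matrix and the C grid are illustrative assumptions rather than Rasa NLU's exact defaults.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-ins for averaged GloVe sentence vectors and intent labels.
X = np.random.RandomState(0).randn(40, 300)
y = np.arange(40) % 3  # three dummy intent classes

# Linear-kernel SVM; C grid values are an assumption.
clf = GridSearchCV(SVC(kernel="linear", probability=True),
                   param_grid={"C": [1, 2, 5, 10, 20, 100]}, cv=5)
clf.fit(X, y)
print(clf.best_params_)
```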

5.1.2 Memory Network

The model is an original implementation in tensorflow 1.6.0 of the 'adjacent' architecture as described in Sukhbaatar et al. (2015). Dropout was used in the last layer as regularization, since it proved to increase performance slightly. Early stopping with patience parameter 5 was used; that is, training stops after 5 epochs with no increase in validation set performance, with accuracy as the chosen performance metric.
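
A minimal sketch of this early stopping scheme, assuming placeholder training and validation routines:

```python
def train_with_early_stopping(train_epoch, validate, patience=5, max_epochs=35):
    """Run training epochs, stopping after `patience` epochs with no
    improvement in validation accuracy. `train_epoch` and `validate` are
    placeholders for the actual training and evaluation code."""
    best_acc, stale = 0.0, 0
    for _ in range(max_epochs):
        train_epoch()
        acc = validate()
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_acc

# Toy usage with a made-up validation curve that plateaus
curve = iter([0.5, 0.6, 0.65, 0.65, 0.64, 0.65, 0.64, 0.63, 0.65, 0.62])
print(train_with_early_stopping(lambda: None, lambda: next(curve)))  # 0.65
```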

The development set was used for hyper parameter tuning. The input dimension depends on the type of features used and on whether the bot's previous utterance is added to the input or not. Table 5.1 lists the contribution of each feature to the input dimension. Hyper parameter values are listed in table 5.2.

1https://sklearn-crfsuite.readthedocs.io/en/latest/

Feature                   Setting              Task  Size
Entities in current turn  Both                 Both  4 (task 5), 3 (task 6)
Turn                      Both                 Both  1
Intent                    NLU                  Both  16 (task 5), 14 (task 6)
Previous action           Both (offline test)  Both  16 (task 5), 56 (task 6)
Bag of Words              HCN                  Both  85 (task 5), 523 (task 6)
Word embeddings           HCN                  Both  300
Context                   HCN                  6     1 (task 5), 9 (task 6)

Table 5.1: Dimensions each type of feature adds to the input data, both for the LSTM and Memory Network policies. Column 'Setting' indicates whether the feature is used only by the HCN architecture, only by the alternative NLU based setting proposed in this thesis, or by both, while column 'Task' indicates whether the feature is used in task 5, task 6 or both. Do note HCN reports 14 context features while this table reports 9. The difference is due to the fact that HCN counts the entities in the current turn among those 14 and uses 2 features to denote whether the database has been queried, a feature not used in this work, as explained in section 4.1.


Hyper parameter        Value
hops                   2
embedding size         100
batch size             32
mem size               9
epochs                 <35 *
gradient norm clip     1
keep probability       0.8
weight initialization  N(0, 1)
optimizer              Adam, lr=1e-4, eps=1e-8
error function         cross-entropy

Table 5.2: Hyper parameters for the Memory Network policy for both bAbI tasks. Values were optimized according to accuracy on the development set, using early stopping (patience 5) to determine the number of epochs.
* depends on the setting due to early stopping; always below 35.
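
For reference, a sketch of the optimization setup implied by table 5.2 in tensorflow 1.x (Adam with lr=1e-4 and eps=1e-8, gradient norms clipped at 1); the dummy quadratic loss stands in for the Memory Network's actual cross-entropy.

```python
import tensorflow as tf  # the thesis targets tensorflow 1.x (1.6.0)

# Dummy variable and loss standing in for the Memory Network graph.
w = tf.get_variable("w", shape=[10],
                    initializer=tf.random_normal_initializer())
loss = tf.reduce_sum(tf.square(w))  # placeholder for cross-entropy

opt = tf.train.AdamOptimizer(learning_rate=1e-4, epsilon=1e-8)
grads_and_vars = opt.compute_gradients(loss)
clipped = [(tf.clip_by_norm(g, 1.0), v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(clipped)
```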

Hyper parameter        Value
hidden size            128
gradient norm clip     1
batch size             1 turn
optimizer              AdaDelta, lr=0.1
epochs                 <35 *
error function         cross-entropy
weight initialization  Xavier with uniform distribution

Table 5.3: Hyper parameters for the LSTM policy for both bAbI tasks. Values were optimized according to accuracy on the development set, using early stopping (patience 5) to determine the number of epochs.
* depends on the setting due to early stopping; always below 35.

5.1.3 LSTM

The model is an original tensorflow 1.6.0 implementation, based on the implementation details provided by the HCN authors (Williams et al., 2017), except for the action mask, since this work focuses on fully neural action policies with no domain-specific rules. The input is first projected to the 'hidden size' space. Gradients are computed every turn, with full unrolling back to the very first turn of the conversation (that is, the batch size is 1 dialog turn). This is feasible because there are never more than 50 turns. The input data is the same as in table 5.1; hyper parameters are listed in table 5.3.
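
A minimal sketch of this policy in tensorflow 1.x is shown below; the feature size and variable names are illustrative, not the thesis code.

```python
import tensorflow as tf  # tensorflow 1.x, as used in the thesis

n_features, hidden_size, n_actions = 423, 128, 16  # sizes are illustrative

# One dialog at a time: shape (1, n_turns, n_features)
features = tf.placeholder(tf.float32, [1, None, n_features])
projected = tf.layers.dense(features, hidden_size)   # input projection
cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
outputs, _ = tf.nn.dynamic_rnn(cell, projected, dtype=tf.float32)
logits = tf.layers.dense(outputs, n_actions)         # one prediction per turn
action_probs = tf.nn.softmax(logits)
```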

5.1.4 Stacking Ensemble

The ensemble uses the outputs of both the Memory Network and LSTM policies as input and outputs a final categorical distribution over actions. Of the possible ensembles described in section 3.3.3, stacking is the only one that requires further implementation details. The input is simple enough, being the concatenation of the outputs of both policies; since the policies are expected to agree very often, learning to find the argmax of both distributions is not difficult for the stacking ensemble. Just like Henderson et al. (2014a), it is implemented as a Multilayer Perceptron using Keras 2.1.5 with the tensorflow backend. It consists of 4 dense layers with ReLU activations, except the last one, which uses softmax. The loss function is categorical cross-entropy and the optimizer is RMSProp. Since the model should be trained with data different from that used to train each policy in it, the exact topology and hyper parameters were determined by training on the same training data and using the development data for hyper parameter tuning. Once the topology was fixed, the ensemble was trained from scratch using the development data.
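
A sketch of such a stacking network in standalone Keras is shown below; the layer widths are assumptions, since only the depth, activations, loss and optimizer are specified above.

```python
from keras.models import Sequential
from keras.layers import Dense

n_actions = 56  # bAbI task 6 action templates

# Four dense layers, ReLU except the softmax output, as described above;
# the hidden widths (128/64/64) are illustrative assumptions.
model = Sequential([
    Dense(128, activation="relu", input_shape=(2 * n_actions,)),
    Dense(64, activation="relu"),
    Dense(64, activation="relu"),
    Dense(n_actions, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
# model.fit(np.concatenate([memnet_probs, lstm_probs], axis=1), targets)
```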

5.2 Test Conditions

5.2.1 NLU in Isolation

This is one of the three main areas of interest of this thesis. Since this module is the entry point of the dialog agent architecture, it is convenient to check its performance in isolation first, to lay a reliable foundation for subsequent experiments. The following sections then deal with end-to-end tests of the dialog agent, which imply a comparison of the NLU against other types of features derived from the user input.


NLU Training Data

Task 5

Remember that this task consists of simulated conversations with very limited grammatical diversity. It is therefore possible to identify the intent of every observed sentence with simple regular expressions. One such rule was defined to classify each utterance into one of the intents presented in table 4.2. These rules work equally well on the train, development and test sets.

For entity recognition, Rasa NLU requires example sentences where the entities are marked (i.e. indicating character start and end). Regular expressions were used to detect known entity values in the train set. Task 5 OOV, however, includes values that were never seen in the train set, so detecting those relies completely on the NLU's generalization capacity.

36797 training examples were obtained from the task 5 train and development sets.

Task 6

The greater variety of task 6 makes it infeasible to apply the same approach as in task 5. Instead, intents were obtained from the DSTC2 semantic annotations, as explained in section 4.4.2, obtaining examples for each of the intents in table 4.4.

For entity recognition, the approach was identical to that of task 5, since there is no fundamental difference in this regard.

14186 training examples were obtained from the DSTC2 train set.

NLU Performance

Task 5 is too simple for an NLU, since the grammatical structure of the sentences is well known. Therefore, for this task, this component will only be tested end to end. For task 6, the DSTC2 semantic annotations from the test set were used as the ground truth. The NLU is tested on the usual classification metrics of precision, recall and F1 score.

5.2.2 Policy Performance

This is the original question motivating this work. Being a relatively recent model, the literature is yet to discover most of the Memory Network's potential. This thesis intends to see how well a Memory Network performs on the common benchmark of the bAbI tasks, using an architecture similar to HCN and putting it in direct comparison against an LSTM policy. A further interesting question then arises: can a Memory Network and an LSTM trained on the same data benefit from each other's knowledge? To this end, the ensemble is also tested, as yet another policy.

The Memory Network is tested on both tasks 5 and 6 to check whether there is any significant difference in performance under different levels of noise and complexity. The LSTM policy is tested in the same settings to allow for comparison.

The following subsections explain the metrics used to measure this performance and the specific conditions of the tests.

Evaluation Metrics

This is a typical classification setting, for which there are many common metrics, such as precision, recall, accuracy or F1 score. This work uses turn accuracy and percentage of perfect dialogs, allowing direct comparison with Bordes et al. (2016) and Williams et al. (2017). Turn accuracy is broadly used in much of dialog agent research, including dialog state tracking (Young et al., 2013; Henderson et al., 2014b).

Turn accuracy refers to the fraction of correctly predicted actions over every turn in every dialog, expressed as a percentage. Perfect dialog accuracy refers to the fraction of perfectly predicted dialogs (i.e. dialogs where every turn was correctly predicted), also expressed as a percentage.
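
A minimal sketch of both metrics, assuming each dialog is given as a list of (predicted, ground truth) action pairs:

```python
def turn_accuracy(dialogs):
    """Percentage of turns whose action was correctly predicted."""
    turns = [(p == t) for d in dialogs for p, t in d]
    return 100.0 * sum(turns) / len(turns)

def perfect_dialog_accuracy(dialogs):
    """Percentage of dialogs where every turn was correctly predicted."""
    perfect = [all(p == t for p, t in d) for d in dialogs]
    return 100.0 * sum(perfect) / len(perfect)

dialogs = [[("greet", "greet"), ("on it", "on it")],
           [("bye", "ask price")]]
print(turn_accuracy(dialogs), perfect_dialog_accuracy(dialogs))  # 66.7 50.0
```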

Other conditions

Online vs Offline Test

The previous bot action is a very important feature, since only this and the context features provide the policy with dialog state information to determine the next action. This feature is used by the HCN architecture and by the models explored here. There are three possible test settings regarding this feature:

1. Offline testing: the ground truth prediction, as present in the test set, is added as a feature for the next turn. This is the approach used to compare against Williams et al. (2017) and Bordes et al. (2016), although it is unrealistic, since such ground truth is not available in an actual live test scenario.


2. Online testing: the actual predictions are used as the features, so errors accumulate and degrade performance.

3. No previous action features: a compromise between the previous 2 approaches. Since having the previous turn information is troublesome in realistic scenarios, it is interesting to know how models would perform without this information at all.

Literal match vs act match

In both tasks 5 and 6, the policies predict an action in the form of a message template. A fully formed message is produced after the template slots are filled with the entity values tracked either by the NLU or by the pattern matching rules. This yields two possible test settings:

1. Literal match: the action predicted by the model is filled with the tracked slot values to produce a fully formed message. This message is compared against the ground truth message character by character, and only perfect matches count. This effectively tests both the correct prediction and the entity tracking.

2. Act match: since entity tracking can fail (especially for task 6), it is interesting to separate errors produced by it from those produced by a bad prediction from the policy. Act match comparison consists in comparing only the predicted action (i.e. the bot act) and not the fully formed message.

Comparison between HCN and NLU Features

To see whether the user intent feature computed by the NLU is more informative than the word based features used by HCN, the user natural language input is processed in both ways and the experiments are performed with both possible feature sets. For a fair comparison, the 'NLU features' comprise not only the user intent, but also the dialog context and entity flag features defined in section 4.1 (with one exception for task 5, where context flags are not used in any case, in line with Williams et al. (2017)). The exact features and their sizes for both the NLU and the HCN feature settings are given in table 5.1.


Chapter 6

Results

6.1 NLU Performance in Isolation

The intent confusion matrix for task 6 is presented in figure 6.1b. As a baseline, another NLU model was trained with training data gathered entirely from regular expressions on the bAbI task 6 train set (for the complete list of regular expressions mapped to each intent, see appendix B). The corresponding confusion matrix is presented in figure 6.1a. Table 6.1 shows the precision, recall, F-score and number of test cases for each intent, for both NLU modules tested.

The NLU trained with the help of semantic annotations from DSTC2 has a clear advantage. This is mostly because the regular expression based approach is limited in the diversity of examples it can gather, causing many training examples to be classified under the 'unknown' intent (the default when no regular expression matches). This class imbalance is reflected in the large number of false positives for the 'unknown' intent.

The results for entity recognition are presented in table 6.2.

Even though both NLU modules obtained their entity training data with the same method, the results are superior for the NLU trained with the DSTC2 semantic labels, with most of the difference in the cuisine entity, since the others have just a few possible values. This is potentially due to the fact that this approach provided more training examples.

Finally, table 6.3 shows how this NLU performs on entity recognition compared to the approach of Williams et al. (2017), which is based solely on regular expressions to detect entity values.

These results show the NLU has a clear advantage; therefore, no HCN style regular expression based entity recognition is performed in further experiments, only NLU based recognition.

6.2 Policy Performance

6.2.1 Task 5

The results for the Memory Network and LSTM policies, for both sets of input feature types considered and all test settings explained in section 5.2.2, are presented in table 6.4.

There is a stark contrast between the Memory Network policy, which produced perfect scores in almost all settings, and the LSTM.

The LSTM from Williams et al. (2017) does achieve perfect performance, but only through the use of hard-coded domain-specific rules (such as 'do not request the type of food if that is already known'), while the model tested here relied completely on learning from data, without any knowledge priors.

Since the Memory Network had perfect results, there is no significant use for an ensemble in this setting, and those tests are reserved for task 6 only.

Figure 6.2 shows the confusion matrix of the LSTM to make clear where it fails.

Most confusion is caused by dialog acts related to requesting slot values. For instance, it is common to request the type of cuisine when the agent should have asked for the location, price or number of people instead; the LSTM failed at learning the distribution governing this behavior, while the Memory Network excelled at it. Along the same lines, the model never predicted the action to ask for the price range, replacing it with the request action for the other slots or with the 'announce search' action.

A second interesting observation is that the NLU performed flawlessly, as can be seen from the fact that the literal match performance is always identical to the act match performance, meaning the action templates are always filled with the right slot values.

Yet another interesting finding is that the word-based features (listed as HCN) consistently outperform the semantic features, on both the literal match and act match accuracy metrics.


(a) trained with pattern-matched sentences (b) trained with DSTC2 annotations

Figure 6.1: Intent classification confusion matrix for both NLU modules. a) trained with sentences extracted from the bAbI task 6 train and development sets using the pattern matching rules from appendix B. b) trained with sentences from the DSTC2 train set, classifying them with the included semantic annotations. Both modules were tested using the sentences from the DSTC2 test set.

Intent            Precision     Recall       F1-score     Support
affirm            0.99 (0.99)   0.98 (0.98)  0.99 (0.99)  535
bye               1 (0)         1 (0)        1 (0)        1169
dontcare          0.99 (-0.01)  0.98 (0.13)  0.98 (0.06)  899
inform            0.98 (0.04)   0.99 (0.09)  0.98 (0.06)  3785
negate            0.98 (-0.02)  0.97 (0.05)  0.97 (0.01)  179
reqalts           0.93 (0.03)   0.96 (0.11)  0.94 (0.03)  219
request address   1 (0)         0.99 (0.14)  0.99 (0.06)  934
request food      0.97 (0.03)   0.98 (0.12)  0.98 (0.08)  125
request location  1 (0)         0.98 (0.92)  0.99 (0.86)  97
request phone     0.99 (0.12)   0.99 (0.01)  0.99 (0.06)  804
request postcode  0.99 (0)      1 (0.01)     1 (0.08)     164
request price     0.99 (0)      0.93 (0.15)  0.96 (0.09)  97
unknown           0.93 (0.58)   0.91 (0.15)  0.92 (0.44)  883
avg/total         0.98 (0.13)   0.98 (0.14)  0.98 (0.15)  9890

Table 6.1: Performance metrics for the NLU intent classifier trained with utterances extracted from the DSTC2 train set using the provided semantic annotations. Performance is measured against the DSTC2 test set, using the same semantic annotations to obtain ground truth labels. Next to each value, in parentheses, is the variation in performance with respect to the NLU trained with bAbI task 6 utterances.

Entity     Precision    Recall       F1-score     Support
cuisine    0.98 (0)     0.99 (0.17)  0.98 (0.09)  2254
location   0.98 (0)     1 (0)        0.99 (0)     1137
price      0.99 (0.01)  1 (0)        0.99 (0)     990
none       1 (0.01)     1 (0)        1 (0.01)     32567
avg/total  1 (0.01)     1 (0.01)     1 (0.01)     36948

Table 6.2: Performance metrics for the NLU entity recognizer trained with utterances extracted from the DSTC2 train set using the provided semantic annotations. Performance is measured against the DSTC2 test set, using the same semantic annotations to obtain ground truth labels. In parentheses next to each number is the improvement with respect to the NLU trained with bAbI task 6 user utterances.


       Entity    Precision  Recall  F-score
NLU    cuisine   0.98       0.99    0.98
       location  0.98       1       0.99
       price     0.99       1       0.99
Regex  cuisine   0.84       0.91    0.87
       location  0.98       0.99    0.98
       price     0.99       0.99    0.99

Table 6.3: Performance metrics of the NLU based Named Entity Recognizer (NER) vs the regular expression based one.

                          Offline                          Online                           No prev. bot utter.
Policy          Features  Literal match    Act match       Literal match    Act match       Literal match    Act match
Memory Network  NLU       100% (100%)      100% (100%)     100% (100%)      100% (100%)     99.36% (88.3%)   99.36% (88.3%)
                HCN       100% (100%)      100% (100%)     100% (100%)      100% (100%)     100% (100%)      100% (100%)
LSTM            NLU       92.88% (5.1%)    92.88% (5.1%)   92.88% (5.1%)    92.88% (5.1%)   92.47% (5.9%)    92.47% (5.9%)
                HCN       93.28% (9.5%)    93.28% (9.5%)   93.29% (9.5%)    93.29% (9.5%)   93.08% (6.8%)    93.08% (6.8%)

Table 6.4: Accuracy for several model settings for bAbI task 5. Results are reported for both policies and for both feature sets tested (NLU or word-based as used in HCN (Williams et al., 2017)), usage of the previous prediction of the policy in the input (offline if using the ground truth, online if using the actual last prediction, or not using it at all) and match type (literal when comparing the bot's actual phrase verbatim, or act match when comparing only the dialog act, regardless of the entity values it was filled with). Both feature sets include turn, bot previous action and entity flags; the only difference is that the NLU features include the user intent, while the HCN features use BoW and word embeddings instead. Next to each turn accuracy, in parentheses, is the perfect dialog accuracy.

Figure 6.2: Confusion matrix for bot act predictions with the LSTM policy on bAbI task 5, using word-based features and the offline setting (i.e. the previous target prediction is used as a feature for the current turn).


Action                   Precision     Recall        F1-score      Support
announce keep searching  1             1             1             1470
announce search          0.87 (0.04)   0.9 (-0.01)   0.89 (0.002)  2000
api call                 1             1             1             2000
ask anything else        1             1             1             1000
ask cuisine              0.49          1             0.66          489
ask location             0.58 (0.16)   0.28 (-0.14)  0.38 (-0.04)  491
ask number of people     0.45 (-0.04)  0.64 (0.34)   0.53 (0.16)   472
ask price                0             0             0             506
bye                      1             1             1             1000
give address             1             1             1             723
give phone               1             1             1             752
greet                    1             1             1             1000
on it                    1             1             1             1000
request updates          1             1             1             2025
reserve                  1             1             1             1000
suggest restaurant       1             1             1             2470
avg/total                0.92          0.93          0.92          18398

Table 6.5: Performance metrics for the LSTM action policy on bAbI task 5 using word-based features. Numbers in parentheses indicate the variation with respect to using semantic features (i.e. NLU intent classification) in the offline setting. No value in parentheses indicates no variation.

Features                   Accuracy
Intent                     92.88% (5.1%)
Intent + BoW               93.14% (6.7%)
Intent + Embeddings        93.05% (7.4%)
Intent + BoW + Embeddings  93.18% (8.4%)
BoW + Embeddings           93.28% (9.5%)

Table 6.6: Accuracy for the bAbI task 5 LSTM policy in the offline setting with different combinations of user input features. All combinations include turn features and entity tracking with the NLU module.

This is an unexpected result, since the intent classification works perfectly on this artificial dataset, and word features could hardly add any information not present in the intent. The most likely reason is simply that word based features imply more model parameters and therefore more learning capacity. More insight into the exact advantages obtained from these features is available in table 6.5.

The single most important reason for the LSTM policy with word based features to win over the LSTM policy with the user intent feature is the prediction of the 'ask number of people' action. The exact cause of this phenomenon is hard to see from high level performance metrics alone, so in order to know exactly which features helped the most, the same model (i.e. LSTM policy, offline setting) was trained and tested with several combinations of features, obtaining the results in table 6.6.

Though the results are very similar, there is not a single instance in which adding the intent information helped over just having word-based features. This makes sense since, at least in a simple dataset like this one, the information in every sentence is very well captured by just a bag of words, let alone word embeddings (sentences are short and the vocabulary is small, so an average embedding is a safe way to represent a sentence with features). This shows that the intent feature adds no extra information over word-based features. Even more: if a model already has that word-based information, adding the intent could introduce unnecessary noise. Consider sentences like 'i want a table for six people in london please' and 'for four people'. The two sentences provide different information, which the action policy could use to decide which slot not to request next; however, both correspond to an 'inform' intent, whereas a Bag of Words or an average embedding does distinguish between the two.
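
A toy illustration of this point: two 'inform' utterances that collapse to the same intent but yield different bag-of-words vectors (the vectorizer setup is illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer

utterances = ["i want a table for six people in london please",
              "for four people"]
bow = CountVectorizer().fit_transform(utterances).toarray()
print((bow[0] != bow[1]).any())  # True: the BoW vectors differ
# whereas an intent classifier maps both to the single intent 'inform'
```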

6.2.2 Task 5 OOV

Rule based systems as well as HCN can excel at this task, reporting perfect accuracy. The added challenge of this task is twofold. First, entity tracking must be robust enough to capture unknown entity values, which is simple since the sentence structure is so limited; a pattern matching NER module could rely on this structure to know exactly where in the sentence to expect a cuisine type, for instance. Secondly, the policy should be robust to these unknown values. The use of word embeddings is very likely to help, as unknown entity values are expected to have embeddings similar to those of known values.


                     Act match       Literal match
Memory Network  NLU  99.22% (88.7%)  88.71% (52.8%)
                HCN  98.93% (81%)    88.28% (49.7%)
LSTM            NLU  93.85% (13.1%)  83.35% (8.4%)
                HCN  93.07% (73%)    82.57% (3.1%)

Table 6.7: Accuracy for bAbI task 5 Out of Vocabulary (OOV) for both policies with each type of features. Numbers in parentheses denote the percentage of perfect dialogs.

              Memory Network  LSTM            Ensemble
                                              Highest         Average         Stacking
NLU  Act      52.17% (0.27%)  54.7% (0%)      55.49% (0.27%)  55.29% (0.27%)  48.16% (0.9%)
     Literal  47.43% (0.27%)  50.7% (0%)      50.8% (0.27%)   50.75% (0.27%)  44.4% (0.72%)
HCN  Act      53.13% (1.1%)   55.98% (0.09%)  56.16% (0.45%)  56.45% (0.36%)  53.56% (0.98%)
     Literal  47.79% (0.72%)  50.65% (0.09%)  50.89% (0.36%)  51.03% (0.27%)  48.28% (0.81%)

Table 6.8: bAbI task 6 offline results for all policies, including ensembles, and all possible features (NLU and word-based from HCN, as defined for task 5, with the difference that here both feature sets include the context flags defined in section 4.1). Numbers in parentheses denote the percentage of perfect dialogs.

Since the restaurant database for task 5 OOV is different and not provided along with the bAbI tasks dataset, it was produced from the OOV test file, and this new database was used when retrieving results.

Results for both the Memory Network and LSTM policies are available in table 6.7.

Again the Memory Network policy outperformed the LSTM, by a wide margin in all settings, confirming

once more that this model is superior, at least in a simple, noise-free environment such as that of task 5. The best result obtained by the Memory Network (i.e. 99.22% turn accuracy and 88.7% perfect dialog accuracy, both using NLU features) is still below the perfect scores reported by Williams et al. (2017) on task 5 OOV, but they achieve that result by using domain-specific rules. When relying only on the LSTM policy without those rules, the Memory Network achieves more than 5% higher turn accuracy and an 8% improvement in the percentage of perfect dialogs (the improvement is up to 75.6% when using NLU features).

6.2.3 Task 6

The main goal of this test is to compare with the results from Williams et al. (2017). Although they do not report their exact test settings, it is reasonable to assume their best result of 55.6% turn accuracy and 1.9% perfect dialogs was achieved in the offline setting with act match, since that is the best-case scenario for the policy and the results in other settings are much lower.

Offline results are available in table 6.8. As usual, NLU and HCN features are as defined in table 5.1. Interestingly, the LSTM policy outperformed the Memory Network, contrary to what the task 5 results suggested. A possible explanation is that the Memory Network gives a potentially equal chance to each previous memory to influence the current turn action, while the LSTM has a natural tendency to consider the recent turns more than the older ones, and in the noisy setting of task 6 this might be desirable. To examine this property of the Memory Network, an analysis of how it allocates its attention over the memories in the history is presented in section 6.2.3.

Another interesting phenomenon observed in table 6.8 is that the ensemble is beneficial when using the highest and average predictions, but not with stacking. The top result in each setting was achieved with the highest confidence ensemble, with the average prediction ensemble improving over the individual policies as well, but stacking always produced worse results than the LSTM policy alone. This might be because the stacking ensemble was trained with the development data (to avoid using the same set used to train the individual policies in the ensemble, in line with Henderson et al. (2014a)), which has only 500 dialogs, while the train set has 1618. In fact, training the ensemble with the same train set used for the individual policies produced better results than using the development set (though still worse than the LSTM policy). To find out whether the volume of data is the cause of this behavior, another experiment was performed in which the 1618 train dialogs were combined with the 500 development dialogs; half of this combined set was then used to train the individual policies and the other half the ensemble. The results are available in section 6.2.3.
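
For clarity, a minimal sketch of the two non-trained ensembles compared here, as understood from section 3.3.3: averaging the two output distributions, or keeping the more confident policy's prediction. The probability vectors are illustrative.

```python
import numpy as np

def average_ensemble(p_memnet, p_lstm):
    """Predict the argmax of the averaged output distributions."""
    return int(np.argmax((p_memnet + p_lstm) / 2.0))

def highest_confidence_ensemble(p_memnet, p_lstm):
    """Predict with whichever policy is more confident in its argmax."""
    if p_memnet.max() >= p_lstm.max():
        return int(np.argmax(p_memnet))
    return int(np.argmax(p_lstm))

p1 = np.array([0.6, 0.3, 0.1])   # Memory Network output (illustrative)
p2 = np.array([0.2, 0.7, 0.1])   # LSTM output (illustrative)
print(average_ensemble(p1, p2), highest_confidence_ensemble(p1, p2))
```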


Figure 6.3: Average p vector of the Memory Network across all bAbI task 6 test set dialogs. The p vector indicates how much attention the network allocates to each past memory in order to make the current decision. The x axis refers to the memory corresponding to each given number of turns ago, up to 20 (in the task 6 test set there are never more than 29 turns, and conversations have 10 turns on average).

Memory Attention Analysis

The Memory Network computes an embedding m_i for the i-th memory and then computes a measure p_i indicating how relevant this memory is with respect to the current query. This results in a vector p whose values add up to 1 (see figure 3.1). This vector is recomputed every turn and provides insight into the decisions of the network at each step, adding interpretability.
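
A minimal sketch of this attention step, following the end-to-end Memory Network formulation of Sukhbaatar et al. (2015): p is the softmax of the inner products between the query embedding and each memory embedding. The sizes match table 5.2; the random inputs are placeholders for the learned embeddings.

```python
import numpy as np

def attention(memories, query):
    """memories: (n_mem, d) embedded memories m_i; query: (d,) embedding.
    Returns the p vector, which sums to 1."""
    scores = memories @ query
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

m = np.random.RandomState(0).randn(9, 100)   # mem size 9, embedding size 100
u = np.random.RandomState(1).randn(100)
p = attention(m, u)
print(round(p.sum(), 6))  # 1.0
```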

This vector was computed for the entire HCN setting of table 6.8, for the last 20 turns (double the average number of turns in bAbI task 6), and the distribution is presented in the blue bar plot of figure 6.3.

The network clearly focuses almost exclusively on the last turn. This makes sense, since the context features included in the input contain the relevant state of the dialog, such as entity and context flags, which include bits indicating whether an API call has been placed and so forth (see section 4.1). With all this dialog information concentrated in just the last turn, it is natural for the Memory Network to mostly ignore all other memories and base its decision on the very last turn only. To verify this hypothesis, the Memory Network was retrained, this time using the same features except the context ones (e.g. no more bits indicating whether an API call was already issued). The average p vector across the whole bAbI task 6 test set was recomputed and is presented in the red bar plot of figure 6.3. Although the last turn is still the most relevant, it lost a considerable part of its allocated attention, which is redistributed to older turns.

To show how the Memory Network learns to allocate its attention over the right past memories, the p vector was computed at each turn of an example conversation taken from the bAbI task 5 test set, with a Memory Network trained without the context features but still with the bits indicating which entities are already set. Task 5 is preferred for this analysis, since the excessive noise of task 6 implies that the right action might not have much to do with the previous turns that common sense would indicate as relevant. The values of the p vector for the last 5 memories at every turn of the dialog are shown in table 6.9.

The Memory Network still usually pays attention to the last turn; after all, that one always indicates the already set entities. But some interesting observations show the Memory Network behaving in a natural way. For instance, in the fourth turn it focuses mostly on the second memory (rather than the third), since that one carries the most important cue that it must ask for information (requesting information from the user is the common action to perform after the 'on it' action). It is also interesting to see that it still assigns more attention to the last turn than to the first turn of the conversation (by an entire order of magnitude), which might be due to the fact that the last turn contains the information indicating that the type of cuisine has already been requested, so another piece of information should be requested now. The same behavior can be observed at the fifth turn, where the model focuses most of its attention on the second turn (the 'on it' action), which is still the most important cue for the current action. From there onward, the last turn consistently gets most of the attention, with some interesting allocations to past turns.


Turn  User              Bot                   Past Memory Attention
                                              1     2     3     4     5
1     greet             greet                 NA    NA    NA    NA    NA
2     inform(location)  on it                 1     NA    NA    NA    NA
3     silence           ask cuisine           0.99  2e-3  NA    NA    NA
4     inform(cuisine)   ask number of people  1e-3  0.99  7e-4  NA    NA
5     inform(number)    ask price             0.01  0.00  0.98  6e-4  NA
6     inform(price)     announce search       0.98  0.00  0.00  0.01  0.00
7     silence           api call              0.99  0.00  0.00  0.00  0.00
8     inform(cuisine)   request updates       0.99  0.00  0.00  0.00  0.00
9     deny              announce search       0.99  0.00  1e-3  0.00  0.00
10    silence           api call              0.99  2e-3  5e-4  3e-3  0.00
11    silence           suggest restaurant    0.99  0.00  0.00  0.00  0.00
12    affirm            reserve               0.99  1e-3  0.00  0.00  0.00
13    request address   give address          0.99  0.00  0.00  0.00  0.00
14    thankyou          ask anything else     0.99  1e-4  0.00  0.00  0.00
15    bye               bye                   0.99  0.00  0.00  0.00  0.00

Table 6.9: Attention distribution over past memories of the Memory Network, for an example dialog from the bAbI task 5 test set. For each turn, the 'Past Memory Attention' columns indicate the fraction of attention allocated to the past 5 memories. Any value below 0.0001 is rounded down to 0. No memory beyond 5 turns back ever received more than this threshold of attention.

                  Memory Network  LSTM            Stacking Ensemble
Combined dataset  51.61% (0.8%)   51.67% (0.5%)   52.63% (1.2%)
Train/dev         53.13% (1.1%)   55.98% (0.09%)  53.56% (0.98%)

Table 6.10: Accuracy of the stacking ensemble and each of its underlying policies. The 'Combined dataset' row shows the results when training the individual policies with half of the combined bAbI task 6 train and development sets, with the other half used for the stacking ensemble. The 'Train/dev' row shows the results with the original approach (i.e. using the train set for the individual policies and the development set for the stacking ensemble, as in table 6.8). All results were obtained with the HCN features, using the offline setting and act match only. As usual, the numbers in parentheses refer to the percentage of perfect dialogs.

Combined Train and Development Sets to Train Stacking Ensemble

To check whether the poor performance of the stacking ensemble was due to the lack of training data, a new dataset was produced by combining the bAbI task 6 train and development sets, for a total of 1618 plus 500 dialogs (2118). This was randomly split into two halves, one used to train both independent policies and the other to train the stacking ensemble, in an attempt to give the ensemble a fairer chance to improve over the policies by having more data. In both cases, 10% of each half was kept as development data for hyper parameter tuning. The performance of each individual policy and the stacking ensemble is shown in table 6.10.

Here the stacking ensemble finally outperforms the two independent policies, both in turn accuracy and in perfect dialog percentage. However, comparing with the original results from table 6.8, the original stacking ensemble performed even better, albeit worse than some of its independent policies. This indicates that even when extra training data is available, it is still better to use it to train the independent policies rather than a stacking ensemble, since these models proved to make better use of the extra data. Either an average or a highest confidence ensemble can then be used, since they are effective and require no extra data.


Chapter 7

Conclusions

This thesis explored several machine learning based architectures for task-oriented dialog agents, including Memory Network, LSTM and ensemble learning based action policies, using different sets of features and approaches for entity recognition. The usage of a well-known dataset allows both direct comparison with other authors and a better understanding of the advantages and nuances of the different policies.

7.1 Summary

The bAbI task 5 results gave a clear advantage to the Memory Network policy, which achieved perfect results with all features tried, outperforming the LSTM policy in every setting.

For bAbI task 5 OOV, the Memory Network obtained results at least 5% higher than the LSTM action policy in turn accuracy, and 8% higher in perfect dialog accuracy. This showed that the perfect scores reported by Williams et al. (2017) require domain-specific rules, whereas in a completely neural approach the Memory Network wins over the LSTM action policy by a wide margin. This trend was however reversed in task 6, where the LSTM consistently outperformed the Memory Network. The main difference between the two tasks is the level of noise and complexity, indicating that the LSTM is better at handling noisy settings.

To better understand why the Memory Network performs worse than the LSTM in the noisy setting, several experiments were performed to see which memories the Memory Network considers when making its decisions. This revealed that the Memory Network does have the capacity to focus on the relevant memories, no matter how long ago in the history they occurred. For instance, when using context features, which encode conversation-relevant information from previous turns in the current one, the Memory Network learns to pay attention to the last turn only, since it contains all the relevant information about the dialog so far. However, when not using context features, the Memory Network pays more attention to the turns where the relevant information was provided. In a noisy setting such as task 6, this means noisy memories have a fair chance of being considered by the Memory Network, while the LSTM has a natural tendency to forget older memories and be more attentive to recent turns.

The ensemble experiments proved that even with the LSTM outperforming the Memory Network in the noisy setting, the Memory Network still holds relevant information. This means its errors are not completely correlated with those of the LSTM.

Using the average prediction of the Memory Network and the LSTM was consistently better than using either of their individual predictions, and using the highest confidence prediction was often even better than the average prediction, in all settings tried. This proved the value of using an ensemble of models for this task, even when the independent models were trained on the same data.

Henderson et al. (2014a) proposed that each individual model in the ensemble should be trained with a different subset of the data, to help decorrelate their errors. Such an approach is more expensive, as more data is required. The ensemble results obtained in this work are therefore encouraging, since no extra data was used, relying only on the models' differences to decorrelate their errors and achieving very positive results, with up to 1.47% turn accuracy improvement.

The initial results for the stacking ensemble were not positive, performing worse than the best individual policy, but further experiments revealed this was due only to the low amount of training data available for the stacking ensemble.

The stacking ensemble achieved better results when trained with the same data as its individual underlying policies, simply because that dataset was bigger, even if it was harder to extract from it any extra knowledge not already present in the individual policies.

When training the ensemble and the individual policies, each with half of the combined training and development data, the ensemble finally improved over its underlying policies. But the results were still worse than those


obtained by the LSTM trained with the entire training data. From this we can conclude that even when extra training data is available, it is better to use it to improve the individual action policies. The ensemble can still rely on average or highest confidence prediction, which improves the results without requiring more data.

The use of a Natural Language Understanding module (specifically, the Rasa NER pipeline (Bocklisch et al., 2017)) proved more effective than using regular expressions to detect entities, since it could generalize better to unseen entity values and sentence structures. Using an NLU to classify intents and feeding those intents as features produced perfect results for bAbI task 5, but was consistently outperformed by the use of word-based features from Hybrid Code Networks. These features are cheap to compute but noisier than an intent encoded as a one-hot vector, and therefore might require more training data for the policy to learn to map them. Therefore, for a clean domain such as that of bAbI task 5, the use of an intent classifier is preferred.

Considering all these findings, the architectures proposed in this work proved to be an effective approach for the tasks evaluated, with results on par with, and in some settings higher than, the state of the art. All of this shows the value of using a Memory Network as the action policy, an NLU for entity detection and intent classification, and an ensemble of policies to implement task-oriented dialog agents.

7.2 Future Work

The fact that the user intent feature performed consistently worse than the word level features was unexpected, especially for the simple task 5. This was probably because the word level features (i.e. a Bag of Words of vocabulary size and word embeddings of size 300) imply more parameters in the action policy, giving it more learning power. To confirm or discard this explanation, the action policy tests should be repeated while making sure the number of model parameters stays the same, no matter which features are used. The embedding size is a hyper parameter that influences the total number of model parameters, so it could be adjusted accordingly for this experiment.

Regarding the NLU model, there are many possibilities. For instance, having a single intent per user utterance is limiting; the DSTC2 utterances acknowledge this by allowing several semantic labels per utterance. Newer versions of Rasa NLU are starting to support multiple intents and even intent hierarchies. These modifications could give NLU semantic features an advantage over word based features.

An important advantage that could be expected from NLU semantic features is that they make the input data simpler for the action policies, potentially improving performance when training data is scarce. To this end, it would be interesting to see how models trained with different features respond to smaller training sets.

Similarly, it would be interesting to see how different action policies react to smaller training sets. Data scarcity is a very common problem, so knowing which model is better suited to deal with it can greatly influence design decisions.

Williams et al. (2017) use an action mask to encode hard rules in the action policy, which is necessary to achieve their perfect score in task 5 and potentially also for their high scores in task 6. While this thesis focused on approaches fully based on machine learning, it would be interesting to see whether such an action mask benefits an action policy such as the Memory Network. However, the results obtained this way would be hard to compare with Williams et al. (2017), since they do not report their specific hard rules and general test settings.


Bibliography

T. Bocklisch, J. Faulkner, N. Pawlowski, and A. Nichol. Rasa: Open Source Language Understanding and Dialogue Management. ArXiv e-prints, December 2017.

A. Bordes, Y. Boureau, and J. Weston. Learning End-to-End Goal-Oriented Dialog. ArXiv e-prints, May 2016.

Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174–185. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/W17-5522.

H. Chen, X. Liu, D. Yin, and J. Tang. A Survey on Dialogue Systems: Recent Advances and New Frontiers. ArXiv e-prints, November 2017.

C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In Neural Networks, 1996., IEEE International Conference on, volume 1, pages 347–352, June 1996. doi: 10.1109/ICNN.1996.548916.

M. Henderson, B. Thomson, and J. Williams. The Second Dialog State Tracking Challenge. In Proceedings of SIGdial, 2014a.

Matthew Henderson, Blaise Thomson, and Steve Young. Word-based dialog state tracking with recurrent neural networks. In Proceedings of SIGdial, pages 292–299, 2014b.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, Upper Saddle River, NJ, USA, 3rd edition, 2018. ISBN 0130950696.

S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Statist., 22(1):79–86, 03 1951. doi: 10.1214/aoms/1177729694. URL https://doi.org/10.1214/aoms/1177729694.

Sungjin Lee. Structured discriminative model for dialog state tracking. Pages 442–451. Association for Computational Linguistics, August 2013.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. ArXiv e-prints, October 2013.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.

I. V. Serban, C. Sankar, M. Germain, S. Zhang, Z. Lin, S. Subramanian, T. Kim, M. Pieper, S. Chandar, N. R.Ke, S. Rajeshwar, A. de Brebisson, J. M. R. Sotelo, D. Suhubdy, V. Michalski, A. Nguyen, J. Pineau, andY. Bengio. A Deep Reinforcement Learning Chatbot. ArXiv e-prints, September 2017.

P.-H. Su, M. Gasic, N. Mrksic, L. Rojas-Barahona, S. Ultes, D. Vandyke, T.-H. Wen, and S. Young. On-lineActive Reward Learning for Policy Optimisation in Spoken Dialogue Systems. ArXiv e-prints, May 2016.

S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-To-End Memory Networks. ArXiv e-prints, March2015.

A. M. Turing. I.—computing machinery and intelligence. Mind, LIX(236):433–460, 1950. doi:10.1093/mind/LIX.236.433. URL http://dx.doi.org/10.1093/mind/LIX.236.433.

O. Vinyals and Q. Le. A Neural Conversational Model. ArXiv e-prints, June 2015.

35

Page 37: Task-Oriented Dialog Agents Using Memory …...MSc Artificial Intelligence Master Thesis Task-Oriented Dialog Agents Using Memory-Networks and Ensemble Learning by Ricardo Fabi an

I. Vlad Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau. A Survey of Available Corpora for BuildingData-Driven Dialogue Systems. ArXiv e-prints, December 2015.

J. Weston, S. Chopra, and A. Bordes. Memory Networks. ArXiv e-prints, October 2014.

J. Williams, A. Raux, D. Ramachandran, and A. Black. The Dialog State Tracking Challenge. In Proceedingsof SIGdial, 2013.

J. D. Williams, K. Asadi, and G. Zweig. Hybrid Code Networks: practical and efficient end-to-end dialog controlwith supervised and reinforcement learning. ArXiv e-prints, February 2017.

David H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.

S. Young, M. Gasic, B. Thomson, and J. D. Williams. Pomdp-based statistical spoken dialog systems: A review.Proceedings of the IEEE, 101(5):1160–1179, May 2013. ISSN 0018-9219. doi: 10.1109/JPROC.2012.2225812.


Appendix A

Task 6 bot templates

Act | Trn count | Tst count | Template
greet | 1618 (0.112) | 1117 (0.099) | Hello , welcome to the Cambridge restaurant system . You can ask for restaurants by area , price range or food type . How may I help you ?
api call | 1844 (0.128) | 1088 (0.097) | api call <cuisine> <location> <price>
offer rest area price | 579 (0.040) | 574 (0.051) | <name> is a nice place in the <location> of town and the prices are <price>
offer rest area food | 454 (0.032) | 270 (0.024) | <name> is a nice place in the <location> of town serving tasty <cuisine> food
offer rest area food price | 245 (0.017) | 71 (0.006) | <name> is a great restaurant serving <price> <cuisine> food in the <location> of town .
offer rest area | 261 (0.018) | 90 (0.008) | <name> is a nice place in the <location> of town
offer rest food price | 219 (0.015) | 268 (0.024) | <name> serves <cuisine> food in the <price> price range
offer rest food | 568 (0.039) | 371 (0.033) | <name> serves <cuisine> food
offer rest price | 482 (0.033) | 110 (0.010) | <name> is in the <price> price range
offer rest price postcode | 2 (0.000) | 3 (0.000) | <name> is in the <price> price range , and their post code is <postcode>
offer rest | 471 (0.033) | 303 (0.027) | <name> is a great restaurant
confirm food dontcare | 49 (0.003) | 247 (0.022) | You are looking for a restaurant serving any kind of food right?
confirm price | 98 (0.007) | 24 (0.002) | Let me confirm , You are looking for a restaurant in the <price> price range right?
confirm price dontcare | 28 (0.002) | 8 (0.001) | Let me confirm , You are looking for a restaurant and you dont care about the price range right?
confirm area | 156 (0.011) | 116 (0.010) | Did you say you are looking for a restaurant in the <location> of town?
confirm area dontcare | 86 (0.006) | 45 (0.004) | Ok , a restaurant in any part of town is that right?
bye | 1618 (0.112) | 1117 (0.099) | you are welcome
ask food | 907 (0.063) | 302 (0.027) | What kind of food would you like?
ask area | 680 (0.047) | 244 (0.022) | What part of town do you have in mind?
ask price | 359 (0.025) | 157 (0.014) | Would you like something in the cheap , moderate , or expensive price range?
canthelp food | 511 (0.035) | 597 (0.053) | I'm sorry but there is no restaurant serving <cuisine> food
canthelp food2 | 74 (0.005) | 24 (0.002) | I am sorry but there is no other <cuisine> restaurant that matches your request
canthelp food price | 68 (0.005) | 10 (0.001) | I am sorry but there is no other <cuisine> restaurant in the <price> price range
canthelp food area | 14 (0.001) | 14 (0.001) | I am sorry but there is no other <cuisine> restaurant in the <location> of town
canthelp price area | 21 (0.001) | 30 (0.003) | Sorry but there is no other restaurant in the <price> price range and the <location> of town
canthelp food price area | 7 (0.000) | 10 (0.001) | Sorry but there is no other <cuisine> restaurant in the <price> price range and the <location> of town
canthelp food area2 | 152 (0.011) | 85 (0.008) | I'm sorry but there is no <cuisine> restaurant in the <location> of town
canthelp price food | 45 (0.003) | 54 (0.005) | I'm sorry but there is no restaurant serving <price> <cuisine> food
canthelp food area price | 37 (0.003) | 11 (0.001) | I'm sorry but there is no <cuisine> restaurant in the <location> of town and the <price> price range
repeat | 58 (0.004) | 518 (0.046) | Could you please repeat that?
canthear | 14 (0.001) | 25 (0.002) | Sorry, I can't hear you
confirm food area ask price | 0 (0.000) | 107 (0.010) | There are restaurants serving <cuisine> in the <location> of town . What price range would you like?
confirm food price ask area | 4 (0.000) | 106 (0.009) | There are restaurants serving <cuisine> in the <price> price range . What area would you like?
confirm food price dontcare ask area | 0 (0.000) | 10 (0.001) | There are restaurants serving <cuisine> in any price range . What area would you like?
confirm area food dontcare ask price | 0 (0.000) | 61 (0.005) | There are restaurants in the <location> of town serving any kind of food . What price range would you like?
confirm price area ask food | 3 (0.000) | 210 (0.019) | There are restaurants in the <price> price range and the <location> of town . What type of food would you like?
confirm food ask area | 2 (0.000) | 164 (0.015) | There are restaurants serving <cuisine> food . What area do you want?
confirm food area dontcare ask price | 0 (0.000) | 107 (0.010) | There are restaurants serving <cuisine> food in any part of town . What price range would you like?
confirm food ask price | 0 (0.000) | 14 (0.001) | There are restaurants serving <cuisine> food . What price range do you want?
confirm area ask food | 0 (0.000) | 155 (0.014) | There are restaurants in the <location> of town . What type of food do you want?
confirm price ask food | 0 (0.000) | 184 (0.016) | There are restaurants in the <price> price range . What type of food do you want?
confirm price ask area | 0 (0.000) | 1 (0.000) | There are restaurants in the <price> price range . What area do you want?
confirm area dontcare ask price | 0 (0.000) | 1 (0.000) | There are restaurants in all parts of town . What type of price range do you want?
confirm food dontcare ask area | 0 (0.000) | 21 (0.002) | There are restaurants if you don't care about the food . What area do you want?
confirm food dontcare ask price | 0 (0.000) | 1 (0.000) | There are restaurants if you don't care about the food . What price range do you want?
confirm area dontcare food dontcare ask price | 0 (0.000) | 15 (0.001) | There are restaurants if you don't care about the area or the type of food . What price range would you like?
confirm food dontcare price ask area | 0 (0.000) | 39 (0.003) | There are restaurants serving any kind of food in the <price> price range . What area would you like?
confirm area dontcare ask food | 0 (0.000) | 8 (0.001) | There are restaurants in all parts of town . What type of food do you want?
confirm food | 223 (0.015) | 368 (0.033) | You are looking for a <cuisine> restaurant right?
give phone | 1167 (0.081) | 795 (0.071) | The phone number of <name> is <phone>
give phone2 | 0 (0.000) | 1 (0.000) | <name> is a great restaurant serving <cuisine> food . Their phone number is <phone> .
give postcode | 242 (0.017) | 163 (0.015) | The post code of <name> is <postcode>
give address | 0 (0.000) | 3 (0.000) | The address of <name> is <address> .
give area | 983 (0.068) | 611 (0.054) | Sure , <name> is on <location>
give address2 | 26 (0.002) | 24 (0.002) | <name> is on <address>
askmore | 29 (0.002) | 165 (0.015) | Can I help you with anything else?

Table A.1: Bot acts for bAbI task 6, along with their total occurrences in training and test data (frequency in parentheses) and the template they represent in test data
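To illustrate how these templates become concrete utterances, here is a minimal sketch of slot substitution. The slot values and the relexicalize helper are hypothetical and only show the mechanics:

import re

# Hypothetical slot values, e.g. taken from an API call result.
slot_values = {"name": "prezzo", "location": "north", "price": "moderate"}

template = "<name> is a nice place in the <location> of town and the prices are <price>"

def relexicalize(template, slot_values):
    """Replace each <slot> placeholder with its concrete value,
    leaving unknown placeholders untouched."""
    return re.sub(r"<(\w+)>",
                  lambda m: slot_values.get(m.group(1), m.group(0)),
                  template)

print(relexicalize(template, slot_values))
# prezzo is a nice place in the north of town and the prices are moderate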


Appendix B

Task 6 user intent map rules

silence:
    <SILENCE>

dontcare:
    dont care
    it doesnt matter
    ^any (part|area)
    ^any price range$
    ^doesnt matter$
    ^any$
    ^anything$
    ^any kind$

bye:
    \s*(good)?bye\s*

reqalts:
    anything else
    give me a different restaurant
    ^what else$
    other (restaurant|choice)

request address:
    what is the address
    address( of the venue)?( please)?( there)?$
    where is it$
    address and the price range
    address and (the )?(?!phone)

request phone:
    what is the phone number
    phone( number)?( please)?( of the venue)?$
    phone number and (?!address)

request postcode:
    post ?code( of the venue)?( please)?$
    postal code$

request food:
    (type|kind) of food (do they)|(does it)
    ^what type of food is it$
    ^what type of food( does it serve)?$
    ^(what is the )?type of food$

request location:
    ^what is the area$
    ^area$
    ^what area$
    ^and the area$
    ^(and )?what area( of town)? is it in$
    what part of town

request price:
    whats the price
    what is the price
    whats its price range
    the price range( of the venue)?$
    ^price range$

inform:
    .*(world|....|afghan).*(north|...|south).*(cheap|...|expensive).*
    .*(north|...|south).*(world|....|afghan).*(cheap|...|expensive).*
    ...
    .*(world|welsh|....|afghan).*

affirm:
    ^yes$

negate:
    ^no$

Table B.1: Rules used to extract examples to train the NLU for bAbI task 6. Each entry shows one of the acts defined (based on what the bot side could answer), followed by the regular expressions (one per line) tried to capture examples of that intent in the training/dev data. The checks were applied in this order, from top to bottom
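A minimal sketch of how these rules can be applied is given below. The rule list is abridged to a handful of acts from Table B.1, and label_utterance is a hypothetical name; the point is only the first-match-wins, top-to-bottom order:

import re

# Abridged selection of the rules in Table B.1; order matters.
RULES = [
    ("silence", [r"<SILENCE>"]),
    ("dontcare", [r"dont care", r"it doesnt matter", r"^any$", r"^anything$"]),
    ("bye", [r"\s*(good)?bye\s*"]),
    ("request address", [r"what is the address"]),
    ("affirm", [r"^yes$"]),
    ("negate", [r"^no$"]),
]

def label_utterance(utterance):
    """Return the first act whose patterns match the utterance,
    checking acts top to bottom as in Table B.1; None if no rule fires."""
    for act, patterns in RULES:
        if any(re.search(p, utterance) for p in patterns):
            return act
    return None

print(label_utterance("it doesnt matter"))  # dontcare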


Appendix C

Example Dialogs

{
    "turn-index": 0,
    "goal-labels": {
        "pricerange": "moderate",
        "area": "north"
    },
    "transcription": "moderately priced north part of town",
    "method-label": "byconstraints",
    "audio-file": "pt344x_0000981_0001356.wav",
    "requested-slots": [],
    "semantics": {
        "json": [
            {
                "slots": [
                    ["pricerange", "moderate"]
                ],
                "act": "inform"
            },
            {
                "slots": [
                    ["area", "north"]
                ],
                "act": "inform"
            }
        ],
        "cam": "inform(pricerange=moderate,area=north)"
    }
}

Figure C.1: Example human turn extracted from a DSTC2 dialog
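The following sketch shows one way such a turn can be flattened into NLU training examples; the file name turn.json and the extract_semantics helper are hypothetical:

import json

def extract_semantics(turn):
    """Flatten a DSTC2 human turn (as in Figure C.1) into its
    transcription and a list of (act, slot, value) triples."""
    triples = []
    for item in turn["semantics"]["json"]:
        for slot, value in item["slots"]:
            triples.append((item["act"], slot, value))
    return turn["transcription"], triples

with open("turn.json") as f:  # hypothetical file holding the turn above
    turn = json.load(f)

print(extract_semantics(turn))
# ('moderately priced north part of town',
#  [('inform', 'pricerange', 'moderate'), ('inform', 'area', 'north')])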


1 good morning	hello what can i help you with today
2 i'd like to book a table for two in london in a moderate price range	i'm on it
3 <SILENCE>	any preference on a type of cuisine
4 with spanish cuisine	ok let me look into some options for you
5 <SILENCE>	api call spanish london two moderate
6 instead could it be with british food	sure is there anything else to update
7 no	ok let me look into some options for you
8 <SILENCE>	api call british london two moderate
9 resto london moderate british 5stars R phone resto london moderate british 5stars phone
10 resto london moderate british 5stars R cuisine british
11 resto london moderate british 5stars R address resto london moderate british 5stars address
12 resto london moderate british 5stars R location london
13 resto london moderate british 5stars R number two
14 resto london moderate british 5stars R price moderate
15 resto london moderate british 5stars R rating 5
16 resto london moderate british 3stars R phone resto london moderate british 3stars phone
17 resto london moderate british 3stars R cuisine british
18 resto london moderate british 3stars R address resto london moderate british 3stars address
19 resto london moderate british 3stars R location london
20 resto london moderate british 3stars R number two
21 resto london moderate british 3stars R price moderate
22 resto london moderate british 3stars R rating 3
23 resto london moderate british 2stars R phone resto london moderate british 2stars phone
24 resto london moderate british 2stars R cuisine british
25 resto london moderate british 2stars R address resto london moderate british 2stars address
26 resto london moderate british 2stars R location london
27 resto london moderate british 2stars R number two
28 resto london moderate british 2stars R price moderate
29 resto london moderate british 2stars R rating 2
30 <SILENCE>	what do you think of this option: resto london moderate british 5stars
31 i love that	great let me do the reservation
32 what is the phone number of the restaurant	here it is resto london moderate british 5stars phone
33 thank you	is there anything i can help you with
34 no thank you	you're welcome

Figure C.2: Example dialog from bAbI task 5. Each line starts with the turn number, followed by the user utterance and the bot utterance, separated by a tab. In this dialog the user changes her mind on turn 6, before the bot displays the API call results, so the bot modifies the query before querying the database again. After displaying the results, the bot picks the option with the highest rating and offers it. If the user happened to dislike the offered option, the bot would offer the second-highest-rated option
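A minimal sketch of reading this format back into (user, bot) turns is given below; parse_babi_dialog and dialog.txt are hypothetical names:

def parse_babi_dialog(lines):
    """Split bAbI dialog lines (as in Figure C.2) into (user, bot) turns.
    Conversational lines hold the turn number, the user utterance, a tab,
    and the bot utterance; API result lines carry no tab and are
    collected separately as knowledge-base facts."""
    turns, kb_facts = [], []
    for line in lines:
        _, _, body = line.rstrip("\n").partition(" ")  # drop the turn number
        if "\t" in body:
            user, _, bot = body.partition("\t")
            turns.append((user, bot))
        elif body:
            kb_facts.append(body)
    return turns, kb_facts

with open("dialog.txt") as f:  # hypothetical file holding the dialog above
    turns, kb_facts = parse_babi_dialog(f)
print(turns[0])  # ('good morning', 'hello what can i help you with today')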
