

A Teacher-Student Framework for Maintainable Dialog Manager

Weikang Wang†‡, Jiajun Zhang†‡, Han Zhang#, Mei-Yuh Hwang#, Chengqing Zong†‡∗ and Zhifei Li#

†National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, China
‡University of Chinese Academy of Sciences, Beijing, China
∗CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing, China
#Mobvoi AI Lab, Seattle, US

{weikang.wang, jjzhang, cqzong}@nlpr.ia.ac.cn
{hanzhang, mhwang, zfli}@mobvoi.com

Abstract

Reinforcement learning (RL) is an attractive solution for task-oriented dialog systems. However, extending RL-based systems to handle new intents and slots requires a system redesign. The high maintenance cost makes it difficult to apply RL methods to practical systems on a large scale. To address this issue, we propose a practical teacher-student framework to extend RL-based dialog systems without retraining from scratch. Specifically, the “student” is an extended dialog manager based on a new ontology, and the “teacher” is the set of existing resources used to guide the learning process of the “student”. By specifying constraints that hold in the new dialog manager, we transfer the knowledge of the “teacher” to the “student” without additional resources. Experiments show that the performance of the extended system is comparable to a system trained from scratch. More importantly, the proposed framework makes no assumptions about the unsupported intents and slots, which makes it possible to improve RL-based systems incrementally.

1 Introduction

With the flourishing development of virtual personal assistants (e.g., Amazon Alexa and Google Assistant), task-oriented dialog systems, which can help users accomplish tasks naturally, have become a focal point of both academic and industrial research. In early work, the task-oriented dialog system was merely a set of hand-crafted mapping rules defined by experts, which is referred to as a rule-based system. Although rule-based systems often have acceptable performance, they are inconvenient to build and difficult to optimize. Recently, reinforcement learning approaches have been applied to optimize dialog systems through interaction with a user simulator or with real users online (Gasic et al., 2011; Su et al., 2016b; Li et al., 2016, 2017b).

Figure 1: An example of a task-oriented dialog after the system comes online (S: system, U: user). The user's goal is to find a Sichuan restaurant in Zhongguancun. The system completes the task but confuses the user, because the “confirm” intent (e.g., “Is this restaurant located in Zhongguancun?”) has not been considered in the deployed system. User feedback: “I hope the system can deal with my confirmation.” Dialog rule: if the user Confirms, then the system should Inform. Dialog rules should be embedded in a new system to handle such situations.

It has been proven that RL-based dialog systems can abandon the hand-crafted dialog manager and achieve more robust performance than rule-based systems (Young et al., 2013).

Typically, the first step in building an RL-based dialog system is defining a user model¹ and the necessary system actions to complete a specific task (e.g., seeking restaurant information or booking hotels). Based on such an ontology, developers can extract dialog features and train the dialog manager model in an interaction environment. Such systems work well if real users are consistent with the predefined user model. However, as shown in Fig. 1, unanticipated actions² of real users will lead to a poor user experience.

In this situation, the original system should be extended to support new user actions based on user feedback. However, adding new intents or slots will change the predefined ontology. As a consequence, developers need to extract additional

¹ The user model defines what users can do in a dialog system, including domain-specific intents and slots.

² User actions consist of intents and slots.


dialog features based on the new ontology. Besides, new system actions may be added to deal with new user actions. The network architecture of the new system will differ from that of the original one, so the new system cannot directly inherit the parameters of the old one; the original dialog manager model becomes invalid. Therefore, developers have to retrain the new system by interacting with users from scratch. Although there are many methods to train an RL-based dialog manager efficiently (Su et al., 2016b, 2017; Lipton et al., 2017; Chang et al., 2017; Chen et al., 2017), unmaintainable RL-based dialog systems will still be put on the shelf in real-world applications (Paek and Pieraccini, 2008; Paek, 2006).

To alleviate this problem, we propose a teacher-student framework to maintain the RL-based dialog manager without training from scratch. The idea is to transfer the knowledge of existing resources to a new dialog manager.

Specifically, after the system is deployed, if developers find that some intents and slots were missing before, they can define a few simple dialog rules to handle such situations. For example, in the situation shown in Fig. 1, a reasonable strategy is to inform the user of the location of the restaurant. We then encode the information of such hand-crafted logic rules into the new dialog manager model. Meanwhile, the user logs and the dialog policy of the original system can guide the new system to complete tasks like the original one. Under the guidance of the “teacher” (logic rules, user logs, and the original policy), we can reforge an extended dialog manager (the “student”) without a new interaction environment.

We conduct a series of experiments with simulated and real users in the restaurant domain. The extensive experiments demonstrate that our method can overcome the problems caused by unpredictable user behavior after deployment. Owing to the reuse of existing resources, our framework saves the time of designing new interaction environments and retraining RL-based systems from scratch. More importantly, our method does not make any assumptions about the unsupported intents and slots, so the system can be incrementally extended once developers find new intents and slots that were not taken into account before. To the best of our knowledge, we are the first to systematically discuss the maintainability of deep reinforcement learning based dialog systems.

2 Related Work

Dialog Manager The dialog manager of a task-oriented dialog system, which consists of a state tracker and a dialog policy module, controls the dialog flow. Recently, deep reinforcement learning (Mnih et al., 2013, 2015) has been applied to optimize the dialog manager in an “end-to-end” way, including deep Q-Networks (Lipton et al., 2017; Li et al., 2017b; Peng et al., 2017; Zhao and Eskenazi, 2016) and policy gradient methods (Williams et al., 2017; Su et al., 2016a; Dhingra et al., 2016). RL methods have shown great potential for building robust dialog systems automatically. However, RL-based approaches are rarely used in real-world applications because of the maintainability problem (Paek and Pieraccini, 2008; Paek, 2006). To extend the domain of dialog systems, Gasic et al. (2014) explicitly defined kernel functions between belief states that come from different domains. However, defining an appropriate kernel function is non-trivial when the ontology has changed drastically. Shah et al. (2016) proposed to integrate turn-level feedback with a task-level reward signal to learn how to handle new user intents. This approach alleviates the problem that arises from the difference between the training and deployment phases, but it still fails when the developers have not considered all user actions in advance. Lipton et al. (2017) proposed BBQ-Networks to extend the domain. However, similar to Shah et al. (2016), BBQ-Networks reserve a few bits in the feature vector for new intents and slots, and the system actions for handling new user actions are already considered in the original system design. This assumption is not practical enough. Compared to existing domain extension methods, our work addresses a more practical problem: new intents and slots are unknown to the original system. If we need to extend the dialog system, we should design a new network architecture to represent new user actions and take new system actions into account.

Knowledge Distillation Our proposed framework is inspired by recent work in knowledge distillation (Bucilu et al., 2006; Ba and Caruana, 2014; Li et al., 2014). Knowledge distillation means training a compact model to mimic a larger teacher model by approximating the function learned by the teacher. Hinton et al. (2015) introduced knowledge distillation to transfer knowledge from


Figure 2: An overview of the RL-based dialog manager used in our work³. Turn-level features (LU results such as Inform(area=Beihai) or Request(area), the previous system action, and the database query result, e.g., “Select * where area=Beihai”) are encoded and fed to an RNN that maintains the dialog state; a policy network then samples the next system action (e.g., Request(price)). In the last turn, the system inquires “Where do you want to go?” In the current turn, the user input is “Find a restaurant in Beihai.”

a large, highly regularized model into a smaller model. The knowledge that can be transferred is not restricted to models. Stewart and Ermon (2017) proposed to distill physics and domain knowledge to train neural networks without labeled data. Hu et al. (2016) enabled a neural network to learn simultaneously from labeled instances as well as logic rules. Zhang et al. (2017) integrated multiple prior knowledge sources into neural machine translation using posterior regularization. Our experiments are based on such insights. By defining appropriate regularization terms, we can distill different knowledge (e.g., a trained model or prior knowledge) into a newly designed model, alleviating the need for new labeled data or expensive interaction environments.

3 RL-based Dialog Manager

Before going into the details of our method, we provide some background on the RL-based dialog manager in this section. Fig. 2 shows an overview of such a dialog manager. We describe each of its parts briefly below.

Feature Extraction At the $t$-th turn of a dialog, the user input $u_t$ is parsed into domain-specific intents and slots to form a semantic frame $a^u_t$ by a language understanding (LU) module. $o^u_t$ and $o^s_{t-1}$ are the one-hot representations of such semantic frames for the current user input and the last system output, respectively. Alternatively, $o^u_t$ can be a simple n-gram representation of $u_t$, but

³ Similar dialog model architectures can be found in recent work (Liu and Lane, 2017; Williams et al., 2017; Su et al., 2016a). Designing a dialog model architecture is not our main purpose; we focus on endowing RL-based dialog systems with maintainability and scalability. The dialog model used in our work can be replaced with similar architectures from the related work.

the vocabulary size is relatively large in real-world applications, which would yield slow convergence in the absence of an LU module. Based on the slot-value pair output with the highest probability, a query is sent to a database to retrieve the user-requested information. $o^{db}_t$ is the one-hot representation of the database result. As a result, the observable information $x_t$ is the concatenation of $o^u_t$, $o^s_{t-1}$ and $o^{db}_t$.

State Representation Based on the extracted feature vector $x_t$ and the previous internal state $s_{t-1}$, a recurrent neural network (RNN) is used to infer the latent representation of the dialog state $s_t$ at step $t$. The current state $s_t$ can be interpreted as a summary of the dialog history $h_t$ up to the current step.

Dialog Policy Next, the dialog state representation $s_t$ is fed into a policy network. The output $\pi(a|h_t;\theta)$ of the policy network is a probability distribution over a predefined system action set $A_s$. Lastly, the system samples an action $a^s_t \in A_s$ based on $\pi(a|h_t;\theta)$ and receives a new observation $x_{t+1}$ with an assigned reward $r_t$. The policy parameters $\theta$ can be learned by maximizing the expected discounted cumulative reward:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\sum_{k=0}^{T-t} \gamma^k r_{t+k}\right] \qquad (1)$$

where $T$ is the maximal step and $\gamma$ is the discount factor. Usually the parameters $\theta$ can be iteratively updated by the policy gradient (Williams, 1992) approach. The policy gradient can be empirically estimated as:

$$\nabla_\theta \mathcal{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi(a^s_{i,t}\,|\,h_{i,t};\theta)\,(G_{i,t}-b) \qquad (2)$$

where $N$ is the number of sampled episodes in a batch, $G_{i,t} = \sum_{k=0}^{T-t} \gamma^k r_{i,t+k}$ is the sum of discounted rewards at step $t$ in episode $i$, and $b$ is a baseline that estimates the average reward of the current policy.

4 Notations and Problem Definition

Let $A_u$ and $A_s$ denote the supported user and system action sets in the original system design, respectively. $u_t$ denotes the user input in the $t$-th turn. The LU module converts $u_t$ into a domain-specific intent and associated slots to form a user action $a^u_t \in A_u$. The system returns an action $a^s_t \in A_s$ according to the dialog manager $\pi(\theta)$. Note that not all user actions are taken into account at the beginning of system design.


Figure 3: Two kinds of strategies to extend the original system: (a) retraining a new system from scratch, and (b) an overview of our proposed teacher-student framework. (a) means redesigning and retraining a new system in an expensive interaction environment from scratch, and (b) means transferring knowledge from existing resources (the old system and logic rules) to a new system. The network in red indicates the old system based on the original ontology. The networks in gray and purple represent the initialized and trained dialog manager models based on a new ontology, respectively. On account of the change in ontology, the extended system has a different network architecture. The dashed lines in (a) and (b) show that the abilities of a new model derive from various sources.

After deployment, the developers may find, based on the human-machine interaction logs $D$, that some user actions $A_u^{new}$ cannot be handled by the original system. Generally speaking, $A_u^{new}$ consists of new intents and slots. Our goal is to extend the original system to support the new user action set $A'_u = A_u \cup A_u^{new}$. The extended dialog manager and the new system action set are denoted as $\pi(\theta')$ and $A'_s$, respectively. To handle new user actions, more system actions may be added to the new system, which means that $A_s$ is a subset of $A'_s$.

5 Approach

Fig. 3 shows two kinds of strategies to extend the original system. The first strategy requires a new interaction environment. However, building a user simulator or hiring real users every time the system needs to be extended is costly and impractical in real-world applications. By contrast, our method enhances the reuse of existing resources. The basic idea is to use the existing user logs, the original dialog policy model, and logic rules (the “teacher”) to guide the learning process of a new dialog manager (the “student”). Without an expensive interaction environment, developers can maintain RL-based dialog systems as efficiently and straightforwardly as rule-based systems.

5.1 Distill Knowledge from the Original System

Although the ontology of the new system is different from the original one, the extended dialog manager can still reuse the dialog policy of the ill-considered system circuitously. Given the user logs $D$ and the original dialog manager $\pi(\theta)$, we define a loss $\mathcal{L}(\theta';D,\theta)$ to minimize the difference between the new dialog manager $\pi(\theta')$ and the old one:

$$\mathcal{L}(\theta';D,\theta) = \sum_{d\in D}\sum_{t=1}^{|d|} \mathrm{KL}\big(\pi(a|h_t;\theta)\,\|\,\pi(a|h_t;\theta')\big) \qquad (3)$$

where $\pi(a|h_t;\theta)$ and $\pi(a|h_t;\theta')$ are the policy distributions over $A_s$ and $A'_s$ given the same dialog history $h_t$, and $|d|$ is the number of turns of a specific dialog $d \in D$. To deal with unsupported user actions, $A_s$ will be a subset of $A'_s$. As a result, the KL term in equation (3) can be defined as follows:

$$\mathrm{KL}\big(\pi(a|h_t;\theta)\,\|\,\pi(a|h_t;\theta')\big) = \sum_{k=1}^{|A_s|} \pi(a_k|h_t;\theta)\big(\log\pi(a_k|h_t;\theta) - \log\pi(a_k|h_t;\theta')\big) \qquad (4)$$

As the original policy parameters $\theta$ are fixed, the loss function in equation (3) can be rewritten as:

$$\mathcal{L}(\theta';D,\theta) = -\sum_{d\in D}\sum_{t=1}^{|d|}\sum_{k=1}^{|A_s|} \pi(a_k|h_t;\theta)\,\log\pi(a_k|h_t;\theta') \qquad (5)$$

This objective transfers the knowledge of the original system to the “student” at the turn level. Under the guidance of the original system, the extended system will be equipped with the primary strategy to complete a task.
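For illustration, equation (5) amounts to a cross-entropy between the frozen teacher distribution over $A_s$ and the first $|A_s|$ entries of the student distribution over $A'_s$, evaluated on logged dialog histories. The sketch below assumes both policies expose their action logits and that the old action set occupies the leading positions of the new one; the names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits):
    """Eq. (5): cross-entropy between the frozen teacher policy over A_s and the
    student policy over A'_s on the same logged dialog histories.

    teacher_logits: [T, |A_s|]; student_logits: [T, |A'_s|], where the first
    |A_s| student actions correspond to the old action set (A_s is a subset
    of A'_s).
    """
    n_old = teacher_logits.size(1)
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()   # theta is fixed
    student_log_probs = F.log_softmax(student_logits, dim=-1)[:, :n_old]
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```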

5.2 Distill Knowledge from Logic Rules

It’s easy for the developers to give logic rules on the system responses to handle new user actions.


For example, if users ask to confirm a slot, the system should inform them of the value of that slot immediately. Note that the system actions which handle new user actions may not exist in the old model. This means the architecture of the new system is different from that of the old one.

We define a set of logic constraints $R = \{(h_l, a_l)\}_{l=1}^{L}$, where $h_l \in H_R$ indicates the dialog context condition in the $l$-th rule, and $a_l \in A'_s$ is the corresponding system action. The number of logic rules $L$ is equal to the number of new user actions. These rules can be seen as triggers: if the dialog context $h_t$ in the current turn $t$ meets the context condition $h_l$ defined in the logic rules, then the system should execute $a_l$. In our work, we use the output of the LU module to judge whether the current dialog context meets the condition defined by the logic rules. An alternative method is simple rule matching. To distill the knowledge of the rules into the new system, we define a loss function $\mathcal{L}(\theta';D,R)$ to embed such constraints in the new system:

$$\mathcal{L}(\theta';D,R) = -\sum_{d\in D}\sum_{t=1}^{|d|}\sum_{h_l\in H_R} \mathbf{1}\{h_t = h_l\} \sum_{k=1}^{|A'_s|} \mathbf{1}\{a_k = a_l\}\,\log\pi(a_k|h_t;\theta') \qquad (6)$$

where $\mathbf{1}\{\cdot\}$ is the indicator function. Equation (6) says that the new dialog manager $\pi(\theta')$ will be penalized if it violates the instructions defined by the dialog rules. Note that, for simplicity, we assume these rules are absolutely correct and mutually exclusive. Although this hypothesis may lead to a non-optimal dialog system, these rules define reasonable system actions for the corresponding dialog contexts. It implies that the new system can be further refined by reinforcement learning once a new interaction environment is available.
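In code, equation (6) reduces to a standard cross-entropy against the rule-specified action on the turns whose context matches a rule condition. The sketch below assumes the matched rule action has already been looked up per turn; variable names are our own illustration.

```python
import torch
import torch.nn.functional as F

def rule_loss(student_logits, rule_actions, rule_mask):
    """Eq. (6): penalize the student when it deviates from the rule-specified action,
    averaged over the matched turns.

    student_logits: [T, |A'_s|]; rule_actions: [T] long tensor with the index of a_l
    (arbitrary where no rule matches); rule_mask: [T] float tensor, 1.0 where the
    turn context h_t matches some rule condition h_l, else 0.0.
    """
    log_probs = F.log_softmax(student_logits, dim=-1)
    picked = log_probs.gather(1, rule_actions.unsqueeze(1)).squeeze(1)
    return -(rule_mask * picked).sum() / rule_mask.sum().clamp(min=1.0)
```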

5.3 Extension of the Dialog Manager

In the absence of a new training environment, learning is made possible by exploiting structure that holds in the new dialog manager. On one hand, we expect the new system to complete tasks like the original one. On the other hand, it should satisfy the constraints defined by the dialog rules. So the learning objective of the new dialog manager $\pi(\theta')$ can be defined as follows:

$$\mathcal{L}(\theta';D,\theta,R) = \begin{cases} \mathcal{L}(\theta';D,R) & \text{if } h_t \in H_R \\ \mathcal{L}(\theta';D,\theta) & \text{otherwise} \end{cases} \qquad (7)$$

When the dialog context $h_t$ in the $t$-th turn satisfies a condition defined in $H_R$, we distill the knowledge of the rules into the new system. Otherwise, we distill the knowledge of the original system into the new one. Instead of retraining from scratch, developers can extend RL-based systems by reusing existing resources.
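Combining the two terms, a per-turn implementation of equation (7) could look like the following self-contained sketch, which switches between the rule term and the distillation term depending on whether the turn's context matched a rule condition. All tensor layouts and names are illustrative assumptions, consistent with the sketches above.

```python
import torch
import torch.nn.functional as F

def extension_loss(teacher_logits, student_logits, rule_actions, rule_mask):
    """Eq. (7): per turn, use the rule loss if h_t matches a rule condition in H_R,
    otherwise distill from the original policy.

    teacher_logits: [T, |A_s|]; student_logits: [T, |A'_s|]; rule_actions: [T] long;
    rule_mask: [T] float, 1.0 where some rule condition h_l applies.
    """
    n_old = teacher_logits.size(1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)

    # Per-turn distillation term (Eq. 5), restricted to the old action set A_s.
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    distill = -(teacher_probs * student_log_probs[:, :n_old]).sum(dim=-1)

    # Per-turn rule term (Eq. 6): negative log-probability of the rule action a_l.
    rule = -student_log_probs.gather(1, rule_actions.unsqueeze(1)).squeeze(1)

    per_turn = torch.where(rule_mask.bool(), rule, distill)
    return per_turn.mean()
```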

6 Experiments

To evaluate our method, we conduct experiments on a dialog system extension task in the restaurant domain.

6.1 Domain

The dialog system provides restaurant information in Beijing. The database we use includes 2988 restaurants. This domain consists of 8 slots (name, area, price range, cuisine, rating, number of comments, address and phone number), of which the first four (inform slots) can be used for searching for the desired restaurant, and all of which (request slots) can be asked about by users. In each dialog, the user has a goal containing a set of slots, indicating the user's constraints and requests. For example, an inform slot such as “inform(cuisine=Sichuan cuisine)” indicates that the user is looking for a Sichuan restaurant, and a request slot such as “request(area)” indicates that the user is asking the system for information (Li et al., 2016, 2017b; Peng et al., 2017).
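For illustration, a user goal in this domain can be represented as a simple structure of inform and request slots; the field names below are our own illustrative choice, not taken from the paper.

```python
# Illustrative user goal: the user wants a Sichuan restaurant in Zhongguancun
# and will ask for its address and phone number.
user_goal = {
    "inform_slots": {"cuisine": "Sichuan cuisine", "area": "Zhongguancun"},
    "request_slots": ["address", "phone number"],
}
```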

6.2 Measurements

A main advantage of our approach is that previously unconsidered user actions can be handled by the extended system. In addition to traditional measurements (e.g., success rate, average turns and average reward), we define an objective measurement called “Satis.” (user satisfaction) to verify this feature in the simulated evaluation. “Satis.” indicates the rate at which the system takes reasonable actions in unsupported dialog situations. It can be calculated as follows:

$$\text{Satis.} = \frac{\sum_{d\in D}\sum_{t=1}^{|d|}\sum_{l=1}^{L} \mathbf{1}\{h_t = h_l\}\,\mathbf{1}\{a^s_t = a_l\}}{\sum_{d\in D}\sum_{t=1}^{|d|}\sum_{l=1}^{L} \mathbf{1}\{h_t = h_l\}} \qquad (8)$$

where $h_t$ and $a^s_t$ are the dialog history and system action in the $t$-th turn, and $h_l$ and $a_l$ are the dialog context condition and corresponding system action defined in the $l$-th rule. Intuitively, an unreasonable system reply will frustrate users, and a low “Satis.” indicates a poor user experience.
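A direct implementation of equation (8) simply counts, over all logged turns whose context matches a rule condition, how often the system action equals the rule-specified action. The sketch below assumes each turn has already been annotated with the matched rule action (if any); names are illustrative.

```python
def satisfaction(dialogs):
    """Eq. (8): fraction of rule-matched turns answered with the rule-specified action.

    dialogs: list of dialogs; each dialog is a list of (matched_rule_action,
    system_action) tuples, where matched_rule_action is None if no rule
    condition h_l applies to that turn.
    """
    matched, reasonable = 0, 0
    for dialog in dialogs:
        for rule_action, system_action in dialog:
            if rule_action is None:
                continue
            matched += 1
            if system_action == rule_action:
                reasonable += 1
    return reasonable / matched if matched else 0.0
```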


                   Sim1                               Sim2
LU Error Rate      Succ.   Turn   Reward   Satis.     Succ.   Turn   Reward   Satis.
0.00               0.962   13.6   3.94     -          0.901   13.2   2.95     0.57
0.05               0.937   13.7   3.41     -          0.877   14.4   2.41     0.48
0.10               0.910   14.3   2.65     -          0.841   13.9   1.41     0.47
0.20               0.845   15.2   0.58     -          0.784   14.7   0.01     0.47

Table 1: Performance of the original system when interacting with different user simulators. LU error means simulating slot errors and intent errors at different rates. Succ.: success rate, Turn: average turns, Reward: average reward.

LU Error Rate       0.00     0.05     0.10     0.20
Total               25857    26166    27077    28385
New User Actions    1853     1859     1998     1912

Table 2: Statistics of turns when S1 interacts with Sim2.

Although “Satis.” is obtained based on our hand-crafted dialog rules, it approximately measures the subjective experience of real users after system deployment.

6.3 User Simulator

Training RL-based dialog systems requires a large number of interactions with users. It is common to use a user simulator to train RL-based dialog systems in an online fashion (Georgila et al., 2005; Pietquin and Dutoit, 2006; Scheffler and Young, 2002; Li et al., 2016). As a consequence, we construct an agenda-based user simulator, which we refer to as Sim1, to train the original RL-based system. The user action set of Sim1 is denoted as $A_u$, which includes the following intents⁴: “hello”, “bye”, “inform”, “deny”, “negate”, “affirm”, “request”, “reqalts” and “null”. The slots of Sim1 are shown in Section 6.1. In each turn, the user action consists of an intent and slots, and we append the value of each slot according to the user goal.

6.4 Implementation of the Original System

For the original RL-based dialog system, a feature vector $x_t$ of size 191 is extracted. This vector is the concatenation of encodings of the LU results, the previous system reply, the database results and the current turn number. The LU module is implemented with an SVM⁵ for intent detection and a CRF⁶ for slot filling. The language generation

⁴ A detailed explanation of these intents is given in DSTC2 (Henderson et al., 2013).

⁵ We use the publicly available SVM tool at http://scikit-learn.org.

⁶ We use the publicly available CRF tool at https://pypi.python.org/pypi/sklearn-crfsuite.

module is implemented by a rule-based method. The hidden dialog state representation is inferred by a GRU (Chung et al., 2014). We set the hidden state size of the GRU to 120. The policy network is implemented as a multilayer perceptron (MLP) with one hidden layer of size 80. The output dimension of the policy network is 15, which corresponds to the number of system actions. To encourage shorter interactions, we set a small per-turn negative reward $R_{turn} = -1$. The maximal number of turns is set to 40. If the user goal is satisfied, the policy is encouraged by a large positive reward $R_{succ} = 10$; otherwise the policy is penalized by a negative reward $R_{fail} = -10$. The discount factor is $\gamma = 0.9$. The baseline $b$ of the current policy is estimated on the sampled episodes in a batch. The batch size $N$ is set to 32. Adadelta (Zeiler, 2012) is used to update the model parameters. The original system S1 is trained by interacting with Sim1. After about 2400 interactions, the performance of S1 starts to converge.
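For reference, the dialog manager described above (a 191-dimensional feature vector, a GRU with 120 hidden units, and an MLP policy head with an 80-unit hidden layer and 15 output actions) could be sketched in PyTorch as follows. This is our own reconstruction from the stated hyperparameters, not the authors' code; in particular, the hidden-layer activation (Tanh) is an assumption.

```python
import torch
import torch.nn as nn

class DialogManager(nn.Module):
    """GRU state tracker plus MLP policy head, with the sizes reported for S1."""

    def __init__(self, feat_dim=191, state_dim=120, hidden_dim=80, n_actions=15):
        super().__init__()
        self.state_tracker = nn.GRUCell(feat_dim, state_dim)
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),                      # assumed activation
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, x_t, s_prev):
        # x_t: [batch, feat_dim] turn features; s_prev: [batch, state_dim].
        s_t = self.state_tracker(x_t, s_prev)      # dialog state s_t
        logits = self.policy(s_t)                  # unnormalized pi(a | h_t)
        return torch.distributions.Categorical(logits=logits), s_t
```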

6.5 Simulated Evaluation

To evaluate our approach, we design another user simulator, which we denote as Sim2, to simulate unpredictable real customers. The user action set of Sim2 is denoted as $A'_u$. The difference between $A_u$ and $A'_u$ is reflected in the domain-specific intents⁷. Specifically, in addition to the intents of Sim1, $A'_u$ includes the “confirm” intent. The difference in user action sets results in different interaction strategies between Sim1 and Sim2. To verify whether a recommended restaurant meets his or her constraints, Sim1 can only request the value of a specific slot, whereas Sim2 can request or confirm.

After obtaining the original system S1, we deploy it to interact with Sim1 and Sim2 respectively,

⁷ A more complex situation is shown in the human evaluation.


Figure 4: Performance of different systems under simulation, plotted against the LU error rate: (a) dialog success rate of the different systems; (b) “Satis.” of the different systems. The original, extended and contrast systems are shown in red, blue and purple bars. We test these systems by interacting with Sim2.

under different LU error rates (Li et al., 2017a). In each condition, we simulate 3200 episodes to obtain the performance. Table 1 shows the details of the test performance, and Table 2 shows the statistics of turns when S1 interacts with Sim2.

As shown in Table 1, S1 achieves a higher dialog success rate and reward when testing with Sim1. When interacting with Sim2, nearly half of the responses to unsupported user actions are not reasonable. Notice that even though Sim2 contains new user actions, some of the new actions may still be appropriately handled by S1, which we attribute to the robustness of our RL-based system; nevertheless, this is far from what is desired. The unpredictable real user behavior in the deployment stage will lead to a poor user experience in real-world applications. This proves the importance of a maintainable system.

To maintain the original system, we define a few simple logic rules to handle unsupported user actions: if users confirm the value of a slot in the current turn, the system should inform users of that value. These rules⁸ are intuitive and reasonable for handling queries such as “Is this restaurant located in Zhongguancun?”. There are four slots⁹ that can be used for confirmation, so we define four logic rules in all. Due to the change in ontology, we add a new status to the dialog features to represent the “confirm” intent of users. This leads to a change in the model architecture of the extended dialog manager. Then we distill the knowledge of S1 and the logic rules into the extended system. No additional data is used to

⁸ In a practical dialog system, we can inject more complex logic rules and take the dialog history into account. These rules are not limited to question/answer mappings.

⁹ They are “name”, “area”, “price range” and “cuisine”.

obtain the extended system.

For comparison, we retrain another new system (the contrast system) from scratch by interacting with Sim2. After about 2600 interactions with Sim2, the performance of the contrast system starts to converge. Note that in order to build the contrast system, the developers need to redesign a new user simulator or hire real users, which is expensive and impractical in industrial applications. We then simulate 3200 interactions with Sim2 to obtain its performance. Fig. 4 illustrates the performance of the different systems.

As can be seen, the extended system performs better than the original system in terms of dialog success rate and “Satis.”. This is largely attributable to the consideration of new user actions. Fig. 4(a) shows that the contrast system achieves a higher dialog success rate than the extended system, but the gap is negligible. However, the contrast system is trained from scratch in a new interaction environment, while the extended system is trained by transferring knowledge from the original system and the logic rules. To train the contrast system, about 2600 episodes are sampled by interacting with a new interaction environment, whereas no additional data is used to train the extended system.

In Fig. 4(b), the “Satis.” of the extended system is slightly higher than that of the contrast system. This is because the extended system learns how to deal with new user actions from the logic rules, whereas the contrast system obtains its dialog policy by exploring the environment. As a result, the contrast system learns a more flexible dialog


Average Rating          2.35
Average Turns           12.1
Unseen Intents          takeTaxi, bookTable
Unseen Inform Slots     withoutWaiting, carryoutService, goodforDating, privateRoom, hasInternet
Unseen Request Slots    waitingLine, discount, businessHours
Unseen Confirm Slots    carryoutService, goodforDating, privateRoom, hasInternet

Table 3: Details of the real user logs. Users are encouraged to interact with the original system using unsupported intents and slots. We find there are 14 user actions unseen before.

Dialog Condition         System Action
takeTaxi                 API call
bookTable                API call
inform unseen slots      recommend a restaurant
request unseen slots     offer information of slots
confirm unseen slots     offer information of slots

Table 4: Different types of rules for new user actions. The left column shows the dialog context condition; the right column shows the corresponding system action. We define 14 rules in all to handle the newfound intents and slots shown in Table 3.

policy than the extended system¹⁰. However, “Satis.” is biased toward the suboptimal rules rather than the optimal policy gained from the environment. This suggests that the extended system can be further refined by reinforcement learning once a new interaction environment is available.

6.6 Human Evaluation

In any case, developers cannot guarantee that all user actions have been considered. Fortunately, our method makes no assumptions about the new user actions or the new dialog model architecture. As a result, the system can be extended over multiple iterations.

To evaluate this characteristic, we deploy the extended system¹¹ from Section 6.5 to interact with real human users. Users are given a goal sampled from our corpus for reference. To elicit more complex situations, they are encouraged to interact with our system using new intents and slots related to the restaurant domain. At the end of each dialog, they are asked to give a subjective rating on a scale from 1 to 5 based on the naturalness of the system (1 is the worst, 5 is the

¹⁰ Table 6 in the Appendix shows sample dialogs from the extended system and the contrast system.

¹¹ The extended system from the simulated evaluation serves as the original system in our human evaluation.

Figure 5: Distribution of user ratings (1–5) for the original system and the extended system.

best). After filtering out dialog sessions unrelated to our task, we collect 315 episodes in total. Table 3 shows the details of the user logs. As shown in Table 3, after deployment there are a few slots and intents unseen before. For example, users may ask for discount information or for a taxi to the restaurant. To represent the new intents and slots, the dimension of the extracted dialog features is extended to 236. Meanwhile, the number of system actions is extended to 29 to handle new user actions. To deal with the newfound user actions, we define 14 rules in total. Table 4 shows the details of the newly defined logic rules. Then we distill the knowledge of the original system and the logic rules into a new system. Fig. 5 shows the comparison in user ratings. The extended system gets a significantly higher subjective rating than the original one. This proves that the extended system can give reasonable responses in unseen dialog situations.

Table 5 shows sample dialogs between real users and the original and extended systems. We can see that the extended system is much more coherent, since it takes new user actions into account. This suggests that a complicated RL-based dialog system can start from a simple one and be improved incrementally based on our framework. This design pattern will be much more efficient when the interaction environment is hard to access.

7 Conclusion

In this paper, we present a practical solution for maintaining RL-based dialog systems without retraining from scratch. By reusing existing resources, developers can extend the RL-based dialog manager to deal with unpredictable user actions after deployment. Furthermore, our method places no constraints on the architecture of the new system and retains the key benefit of data-driven methods.


Original System:
User: I want to find a Sichuan restaurant.
System: Where do you want to go?
User: Zhongguancun area.
System: “Hot Charge” is a nice Sichuan restaurant in Zhongguancun area.
User: Can you tell me the discount?
System: The average cost is 42 CNY.
User: Tell me the discount!
System: “Hot Charge” is a nice Sichuan restaurant in Zhongguancun area.
User: Bye.
System: Bye.
Subjective Rating = 2

Extended System:
User: I want to find a Sichuan restaurant in Zhongguancun area.
System: “Hot Charge” is a nice Sichuan restaurant in Zhongguancun area.
User: What’s the discount of this restaurant?
System: This restaurant is offering 10% off.
User: Is there a waiting line?
System: Currently there are 5 parties waiting in line.
User: Can you book a table for me?
System: Ok.
User: Thanks.
System: You are welcome.
Subjective Rating = 5

Table 5: Sample dialogs of the original and extended systems in the human evaluation. The original system collapses when encountering unseen slots and intents. After extension, the new system can handle more user actions.

This means the new system can be maintained iteratively and further refined by RL once a new interaction environment is available. The simulation and human experiments show that our proposed framework ensures maintainability and scalability in RL-based systems, which are necessary for any industrial application.

8 Acknowledgments

The research work described in this paper has been supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002103 and by the Natural Science Foundation of China under Grant No. 61333018.

References

Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662.

Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. 2006. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 535–541. ACM.

Cheng Chang, Runzhe Yang, Lu Chen, Xiang Zhou, and Kai Yu. 2017. Affordable on-line dialogue policy learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2190–2199.

Lu Chen, Xiang Zhou, Cheng Chang, Runzhe Yang, and Kai Yu. 2017. Agent-aware dropout DQN for safe and efficient on-line dialogue policy learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2444–2454.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2016. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777.

Milica Gasic, Filip Jurcicek, Blaise Thomson, Kai Yu, and Steve Young. 2011. On-line policy optimisation of spoken dialogue systems via live interaction with human subjects. In Automatic Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, pages 312–317. IEEE.

Milica Gasic, Dongho Kim, Pirros Tsiakoulis, Catherine Breslin, Matthew Henderson, Martin Szummer, Blaise Thomson, and Steve Young. 2014. Incremental on-line adaptation of POMDP-based dialogue managers to extended domains. In Proceedings of InterSpeech.

Kallirroi Georgila, James Henderson, and Oliver Lemon. 2005. Learning user simulations for information state update dialogue systems. In Ninth European Conference on Speech Communication and Technology.

Matthew Henderson, Blaise Thomson, and Jason Williams. 2013. Dialog state tracking challenge 2 & 3.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.

Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. 2014. Learning small-size DNN with output-distribution-based criteria. In Fifteenth Annual Conference of the International Speech Communication Association.

Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017a. Investigation of language understanding impact for reinforcement learning based dialogue systems. arXiv preprint arXiv:1703.07055.

Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.

Xiujun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. 2017b. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.

Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, and Li Deng. 2017. BBQ-Networks: Efficient exploration in deep reinforcement learning for task-oriented dialogue systems. arXiv preprint arXiv:1711.05715.

Bing Liu and Ian Lane. 2017. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. arXiv preprint arXiv:1709.06136.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.

Tim Paek. 2006. Reinforcement learning for spoken dialogue systems: Comparing strengths and weaknesses for practical deployment. In Proc. Dialog-on-Dialog Workshop, Interspeech.

Tim Paek and Roberto Pieraccini. 2008. Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, 50(8):716–729.

Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue system via hierarchical deep reinforcement learning. arXiv preprint arXiv:1704.03084.

Olivier Pietquin and Thierry Dutoit. 2006. A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech, and Language Processing, 14(2):589–599.

Konrad Scheffler and Steve Young. 2002. Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning. In Proceedings of the Second International Conference on Human Language Technology Research, pages 12–19. Morgan Kaufmann Publishers Inc.

Pararth Shah, Dilek Hakkani-Tur, and Larry Heck. 2016. Interactive reinforcement learning for task-oriented dialogue management. In NIPS 2016 Deep Learning for Action and Interaction Workshop.

Russell Stewart and Stefano Ermon. 2017. Label-free supervision of neural networks with physics and domain knowledge. In AAAI, pages 2576–2582.

Pei-Hao Su, Pawel Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130.

Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016a. Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689.

Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016b. On-line active reward learning for policy optimisation in spoken dialogue systems. arXiv preprint arXiv:1605.07669.

Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv preprint arXiv:1702.03274.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Matthew D Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Jiacheng Zhang, Yang Liu, Huanbo Luan, Jingfang Xu, and Maosong Sun. 2017. Prior knowledge integration for neural machine translation using posterior regularization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1514–1523.

Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv preprint arXiv:1606.02560.