
Proceedings of the SIGDIAL 2013 Conference, pages 467–471, Metz, France, 22–24 August 2013. © 2013 Association for Computational Linguistics

Deep Neural Network Approach for the Dialog State Tracking Challenge

Matthew Henderson, Blaise Thomson and Steve Young
Department of Engineering, University of Cambridge, U.K.
{mh521, brmt2, sjy}@eng.cam.ac.uk

Abstract

While belief tracking is known to be important in allowing statistical dialog systems to manage dialogs in a highly robust manner, until recently little attention has been given to analysing the behaviour of belief tracking techniques. The Dialogue State Tracking Challenge has allowed for such an analysis, comparing multiple belief tracking approaches on a shared task. Recent success in using deep learning for speech research motivates the Deep Neural Network approach presented here. The model parameters can be learnt by directly maximising the likelihood of the training data. The paper explores some aspects of the training, and the resulting tracker is found to perform competitively, particularly on a corpus of dialogs from a system not found in the training.

1 Introduction

Statistical dialog systems, in maintaining a distribution over multiple hypotheses of the true dialog state, are able to behave in a robust manner when faced with noisy conditions and ambiguity. Such systems rely on probabilistic tracking of dialog state, with improvements in the tracking quality being important to the system-wide performance of a dialog system (see e.g. Young et al. (2009)).

This paper presents a Deep Neural Network (DNN) approach for dialog state tracking which has been evaluated in the context of the Dialog State Tracking Challenge (DSTC) (Williams, 2012a; Williams et al., 2013).¹

Using Deep Neural Networks allows for the modelling of complex interactions between arbitrary features of the dialog. This paper shows improvements in using deep networks over networks with fewer hidden layers. Recent developments in speech research have shown promising results using deep learning, motivating its use in the context of dialog (Hinton et al., 2012; Li et al., 2013).

¹ More information on the DSTC is available at http://research.microsoft.com/en-us/events/dstc/

This paper presents a technique which solves the task of outputting a sequence of probability distributions over an arbitrary number of possible values using a single neural network, by learning tied weights and using a form of sliding window. As the classification task is not split into multiple sub-tasks for a given slot, the log-likelihood of the tracker on training data can be directly maximised using gradient ascent techniques.

The domain of the DSTC is bus route information in the city of Pittsburgh, but the presented technique is easily transferable to new domains, with the learned models in fact being domain independent. No domain-specific knowledge is used, and the classifier learned does not require knowledge of the set of possible values. The tracker performed highly competitively on the 'test4' dataset, which consists of data from a dialog system not seen in training. This suggests the model is capable of capturing the important aspects of dialog in a robust manner without overtuning to the specifics of a particular system.

Most attention in the dialog state belief tracking literature has been given to generative Bayesian network models (Paek and Horvitz, 2000; Thomson and Young, 2010). Few trackers have been published using discriminative classifiers, a notable exception being Bohus and Rudnicky (2006). An analysis by Williams (2012b) demonstrates how such generative models can in fact degrade belief tracking performance relative to a simple baseline. The successful use of discriminative models for belief tracking has recently been alluded to by Williams (2012a) and Li et al. (2013), and was a prominent theme in the results of the DSTC.


2 The Dialog State Tracking Challenge

This section describes the domain and methodology of the Dialog State Tracking Challenge (DSTC). The Challenge uses data collected during the course of the Spoken Dialog Challenge (Black et al., 2011), in which participants implemented dialog systems to provide bus route information in the city of Pittsburgh. This provides a large corpus of real phone calls from members of the public with real information needs.

Set         Number of calls   Notes
train1a     1013              Labelled training data
train1b&c   10619             Same dialog system as train1a, but unlabelled
train2      678               Similar to train1*
train3      779               Different participant to other train sets
test1       765               Very similar to train1* and train2
test2       983               Somewhat similar to train1* and train2
test3       1037              Very similar to train3
test4       451               System not found in any training set

Table 1: Summary of datasets in the DSTC

Table 1 summarises the data provided in the challenge. Labelled training sets provide labels for the caller's true goal in each dialog for 5 slots: route, from, to, date and time.

Participants in the DSTC were asked to report the results of their tracker on the four test sets in the form of a probability distribution over each slot for each turn. Performance was determined using a basket of metrics designed to capture different aspects of tracker behaviour (Williams et al., 2013). These are discussed further in Section 4.

The DNN approach described here is referred to in the results of the DSTC as 'team1/entry1'.

3 Model

For a given slot s at turn t in a dialog, let S_{t,s} denote the set of possible values for s which have occurred as hypotheses in the SLU for turns ≤ t. A tracker must report a probability distribution over S_{t,s} ∪ {other}, representing its belief of the user's true goal for the slot s. The probability of 'other' represents the probability that the user's true goal is yet to appear as an SLU hypothesis.

A neural network structure is defined which gives a discrete distribution over the |S_{t,s}| + 1 values, taking the turns ≤ t as input.

Figure 1 illustrates the structure used in this approach. Feature functions f_i(t, v) for i = 1 ... M are defined which extract information about the value v from the SLU hypotheses and machine actions at turn t. A simple example would be f_SLU(t, v), the SLU score that s=v was informed at turn t. A list of the feature functions actually used in the trial is given in Section 3.1. For notational convenience, feature functions at negative t are defined to be zero: ∀i ∀v, t' < 0 ⇒ f_i(t', v) = 0.

The input layer for a given value v is fixed in size by choosing a window size T, such that the feature functions are summed for turns ≤ t − T. The input layer therefore consists of (T × M) input nodes set to f_i(t', v) for t' = t−T+1 ... t and i = 1 ... M, and M nodes set to ∑_{t'=0}^{t−T} f_i(t', v) for i = 1 ... M.

[Figure 1: The neural network structure for computing E(t, v) ∈ ℝ for each possible value v in the set S_{t,s}. The vector f is a concatenation of all the input nodes, and the layers compute
    h_1 = tanh(W_0 f^T + b_0)
    h_2 = tanh(W_1 h_1^T + b_1)
    h_3 = tanh(W_2 h_2^T + b_2)
    E(t, v) = W_3 h_3^T ]

A feed-forward structure of hidden layers is chosen, which reduces to a single node denoted E(t, v). Each hidden layer introduces a weight matrix W_i and a bias vector b_i as parameters, which are independent of v but possibly trained separately for each s. The equations for each layer in the network are given in Figure 1.

The final distribution from the tracker is:

    P(s = v) = e^{E(t, v)} / Z
    P(s ∉ S_{t,s}) = e^B / Z
    Z = e^B + ∑_{v' ∈ S_{t,s}} e^{E(t, v')}

where B is a new parameter of the network, independent of v and possibly trained separately for each slot s.
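As a concrete illustration of the network and the output distribution above, here is a minimal NumPy sketch. The shapes, the candidate values and the randomly initialised parameters are all assumptions for illustration; in the paper the parameters are learnt by maximising the training-data likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T = 14, 10
n_inputs = M * (T + 1)   # T windowed turns plus the summed tail, per feature

def E_score(f, params):
    """E(t, v) for one candidate value, from its concatenated input vector f."""
    W0, b0, W1, b1, W2, b2, W3 = params
    h1 = np.tanh(W0 @ f + b0)    # h1 = tanh(W0 f^T + b0)
    h2 = np.tanh(W1 @ h1 + b1)   # h2 = tanh(W1 h1^T + b1)
    h3 = np.tanh(W2 @ h2 + b2)   # h3 = tanh(W2 h2^T + b2)
    return (W3 @ h3).item()      # E(t, v) = W3 h3^T

def belief(E, B):
    """P(s = v) for each candidate, and P(s not in S_{t,s}) via the shared Z."""
    Z = np.exp(B) + np.exp(E).sum()
    return np.exp(E) / Z, np.exp(B) / Z

# Funnelling [20, 10, 2] hidden structure from Section 4.3.
sizes = [n_inputs, 20, 10, 2]
params = []
for a, b in zip(sizes, sizes[1:]):
    params += [rng.normal(0, 0.1, (b, a)), np.zeros(b)]
params.append(rng.normal(0, 0.1, (1, sizes[-1])))   # W3, no bias

B = 0.0                                  # the learned 'other' parameter
candidates = ["61c", "61d"]              # hypothetical SLU hypotheses for 'route'
E = np.array([E_score(rng.normal(size=n_inputs), params) for _ in candidates])
p_values, p_other = belief(E, B)
print(dict(zip(candidates, p_values.round(3))), "other:", round(p_other, 3))
```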


3.1 Feature Functions

As explained above, a feature function is a function f(t, v) which (for a given dialog) returns a real number representing some aspect of the turn t with respect to a possible value v. A turn consists of a machine action and the subsequent Spoken Language Understanding (SLU) results. The functions explored in this paper are listed below:

1. SLU score; the score assigned by the SLU to the user asserting s=v.
2. Rank score; 1/r where r is the rank of s=v in the SLU n-best list, or 0 if it is not on the list.
3. Affirm score; SLU score for an affirm action if the system just confirmed s=v.
4. Negate score; as previous but with negate.
5. Go back score; the score assigned by the SLU to a goback action matching s=v.
6. Implicit score; 1 − the score given in the SLU to a contradictory action if the system just implicitly confirmed s=v, otherwise 0.
7. User act type; a feature function for each possible user act type, giving the total score of the user act type in the SLU. Independent of s & v.
8. Machine act type; a feature function for each possible machine act type, giving the total number of machine acts with the type in the turn. Independent of s & v.
9. Cant help; 1 if the system just said that it cannot provide information on s=v, otherwise 0.
10. Slot confirmed; 1 if s=v′ was just confirmed by the system for some v′, otherwise 0.
11. Slot requested; 1 if the value of s was just requested by the system, otherwise 0.
12. Slot informed; 1 if the system just gave information on a set of bus routes which included a specific value of s, otherwise 0.
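As an illustration of the first two feature functions, here is a small sketch; the n-best list representation is an invented stand-in, since the actual DSTC log format is richer than this.

```python
from typing import List, Tuple

NBest = List[Tuple[str, float]]   # (hypothesised value, SLU score), best first

def f_slu(nbest: NBest, v: str) -> float:
    """Feature 1: the SLU score assigned to the user asserting s=v."""
    return sum(score for value, score in nbest if value == v)

def f_rank(nbest: NBest, v: str) -> float:
    """Feature 2: 1/r for the rank r of s=v in the n-best list, else 0."""
    for r, (value, _) in enumerate(nbest, start=1):
        if value == v:
            return 1.0 / r
    return 0.0

nbest = [("61c", 0.6), ("61d", 0.25)]             # hypothetical SLU output
print(f_slu(nbest, "61d"), f_rank(nbest, "61d"))  # 0.25 0.5
```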

4 Training

The derivatives of the training data likelihood with respect to all the parameters of the model can be computed using back propagation, i.e. the chain rule. Stochastic Gradient Descent with mini-batches is used to optimise the parameters by descending the negative log-likelihood in the direction of the derivatives (Bottou, 1991). Termination is triggered when performance on a held-out development set stops improving.
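A self-contained sketch of this training regime follows, using PyTorch autograd as a stand-in for the paper's hand-derived back-propagation gradients; the network sizes, learning rate, batch size, stopping rule and synthetic data are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_inputs = 14 * 11   # M * (T + 1) input nodes, illustrative

net = nn.Sequential(                  # funnelling [20, 10, 2] structure
    nn.Linear(n_inputs, 20), nn.Tanh(),
    nn.Linear(20, 10), nn.Tanh(),
    nn.Linear(10, 2), nn.Tanh(),
    nn.Linear(2, 1, bias=False),
)
B = torch.zeros(1, requires_grad=True)    # the 'other' parameter

def nll(value_feats, label):
    """-log P(label): softmax over E(t, v) for each candidate plus B."""
    E = net(value_feats).squeeze(-1)
    logits = torch.cat([E, B])
    return -torch.log_softmax(logits, dim=0)[label]

def make_instance():
    """Synthetic (turn, slot) instance: candidate features + correct index."""
    n_cand = int(torch.randint(1, 5, ()))
    return torch.randn(n_cand, n_inputs), int(torch.randint(0, n_cand + 1, ()))

train = [make_instance() for _ in range(200)]
dev = [make_instance() for _ in range(50)]

opt = torch.optim.SGD(list(net.parameters()) + [B], lr=0.05)
best = float("inf")
for epoch in range(20):
    for i in range(0, len(train), 16):                        # mini-batches
        opt.zero_grad()
        batch = train[i:i + 16]
        loss = torch.stack([nll(f, y) for f, y in batch]).mean()
        loss.backward()                                       # back propagation
        opt.step()
    with torch.no_grad():
        dev_nll = torch.stack([nll(f, y) for f, y in dev]).mean().item()
    print(f"epoch {epoch}: dev NLL {dev_nll:.3f}")
    if dev_nll >= best:
        break              # held-out performance stopped improving
    best = dev_nll
```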

Each turn t and slot s in a dialog for which |S_{t,s}| > 0 provides a non-zero summand to the total log-likelihood of the training data. These instances may be split up by slot to train a separate network for each slot. Alternatively the data can be combined to learn a slot-independent model. The best approach found was to train a slot-independent model for a few epochs, and then switch to training one model per slot (see Section 4.4).

This section presents experiments varying the training of the model. In each case the parameters are trained using all of the labelled training sets. The results are reported for test4, since this system is not found in the training data. They are therefore unbiased and avoid overtuning problems.

The ROC curves, accuracy, Mean Reciprocal Rank (MRR) and l2 norm of the tracker across all slots are reported here. (A full definition of the metrics is found in Williams et al. (2013).) These are computed throughout using statistics at every turn t where |S_{t,s}| > 0 (referred to as 'schedule 2' in the terminology of the challenge). Table 2 and Figure 3 in Appendix A show these metrics. The 'Baseline' system ('team0/entry1' in the challenge) considers only the top SLU hypothesis so far, and assigns the SLU confidence score as the tracker probability. It does not therefore incorporate any belief tracking.

4.1 Window Size

The window size, T, was varied from 2 to 20. T must be selected so that it is large enough to capture enough of the sequence of the dialog, whilst ensuring sufficient data to train the weights connecting the inputs from the earlier turns. The results suggest that T = 10 is a good compromise.

4.2 Feature Set

The features enumerated in Section 3.1 were split into 4 sets: F1 = {1} includes only the SLU scores; F2 = {1, ..., 6} includes feature functions which depend on the user act and the value; F3 = {1, ..., 8} also includes the user act and machine act types; and finally F4 = {1, ..., 12} includes functions which depend on the system act and the value. The results clearly show that adding more and more features in this manner monotonically increases the performance of the tracker.

4.3 Structure

Some candidate structures of the hidden layers (h1, h2, ...) were evaluated, including having no hidden layers at all, which gives a logistic regression model. In Table 2 the structure is represented as a list giving the size of each hidden layer in turn.

Three layers in a funnelling [20, 10, 2] configuration are found to outperform the other structures. The l2 norm is the metric most affected by deeper network structures, suggesting that depth is chiefly useful for tweaking the calibration of the confidence scores.


                      Acc.     MRR      l2
Baseline              0.5841   0.7574   0.5728

Window Size
  T = 2               0.6679   0.8044   0.5405
  T = 5               0.6875   0.8191   0.5164
  T = 10              0.6922   0.8207   0.5331
  T = 15              0.6718   0.8107   0.5352
  T = 20              0.6817   0.8190   0.5174

Feature Set
  F1                  0.5495   0.7364   0.6838
  F2                  0.6585   0.7954   0.6631
  F3                  0.6823   0.8134   0.5525
  F4                  0.6922   0.8207   0.5331

Structure
  []                  0.6751   0.8074   0.5658
  [50]                0.6679   0.8046   0.5450
  [20]                0.6656   0.8060   0.5394
  [50, 10]            0.6645   0.8045   0.5404
  [20, 2]             0.6543   0.7952   0.5514
  [20, 10, 2]         0.6922   0.8207   0.5331

Initialisation
  Separate            0.6907   0.8206   0.5472
  Single Model        0.6779   0.8111   0.5570
  Shared Init.        0.6922   0.8207   0.5331

Table 2: Results for variant trackers described in Section 4. By default, we train using the shared initialisation training method with T = 10, all the features enumerated in Section 3.1, and 3 hidden layers of size 20, 10 and 2. (ROC performance for these variants is plotted in Figure 3.)

4.4 Initialisation

The three methods of training alluded to in Section 4 were evaluated: training a model for each slot without sharing data between slots (Separate); training a single slot-independent model (Single Model); and training a slot-independent model for a few epochs, then using this to initialise the training of separate models (Shared Initialisation).

The method of shared initialisation appears to be the most effective, scoring the best on accuracy, MRR and l2. Training in this manner is particularly beneficial for slots which are under-represented in the training data, as it initialises the parameters to sensible values before going on to specialise to that particular slot.
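A small sketch of the three schemes, under the same assumed PyTorch setup as the earlier training sketch; the pooled and per-slot training loops themselves are elided.

```python
import copy
import torch.nn as nn

n_inputs = 14 * 11   # illustrative, as before

def make_net():
    return nn.Sequential(nn.Linear(n_inputs, 20), nn.Tanh(),
                         nn.Linear(20, 10), nn.Tanh(),
                         nn.Linear(10, 2), nn.Tanh(),
                         nn.Linear(2, 1, bias=False))

SLOTS = ["route", "from", "to", "date", "time"]

# 'Separate': one independently initialised model per slot, trained only
# on that slot's instances.
separate = {s: make_net() for s in SLOTS}

# 'Single Model': one slot-independent model trained on instances pooled
# across all slots.
single = make_net()

# 'Shared Initialisation': train the pooled model for a few epochs first,
# then clone it as the starting point for per-slot specialisation.
shared_init = {s: copy.deepcopy(single) for s in SLOTS}
```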

5 Performance in the DSTC

A DNN tracker was trained for entry in the DSTC. Training used T = 10, the full feature set, a [20, 10, 2] hidden structure and the shared initialisation training method. Other parameters such as the learning rate and regularisation coefficient were tweaked by analysing performance on a held-out subset of the training data. All the labelled training data available was used. The tracker is labelled as 'team1/entry1' in the DSTC.

[Figure 2: Accuracy and MRR of the 28 entries in the DSTC for all slots, on test1, test2, test3, test4 and all sets combined. Boxplots show minimum, maximum, quartiles and the median. The dark dot is the location of the entry presented in this paper (DNN system).]

The DNN approach performed competitively in the challenge. Figure 2 summarises the performance of the approach relative to all 28 entries in the DSTC. The results are less competitive in test2 and test3, but very strong in test1 and test4.

The performance in test4, dialogs with an unseen system, was probably the best because the chosen feature functions forced the learning of a general model which was not able to exploit the specifics of particular ASR+SLU configurations. Features which depend on the identity of the slot values would have allowed better performance in test2 and test3, allowing the model to learn different behaviours for each value and to learn typical confusions. It would also have been possible to exploit the system-specific data available in the challenge, such as more detailed confidence metrics from the ASR.

For a full comparison across the entries in the DSTC, see Williams et al. (2013). In making comparisons it should be noted that this team did not alter the training for different test sets, and submitted only one entry.

6 Conclusion

This paper has presented a discriminative approach for tracking the state of a dialog which takes advantage of deep learning. While simple gradient ascent training was tweaked in this paper using the 'Shared Initialisation' scheme, a promising future direction would be to experiment further with more recent methods for training deep structures, e.g. initialising the networks layer by layer (Hinton et al., 2006).

Richer feature representations of the dialog contribute strongly to the performance of the model. The feature set presented is applicable across a broad range of slot-filling dialog domains, suggesting the possibility of using the models across domains without domain-specific training data.


A ROC Curves

[Figure 3: ROC (Receiver Operating Characteristic) curves for the Window Size, Feature Set, Structure and Initialisation variants. The x-axis and y-axis are false acceptance and true acceptance respectively. Lines are annotated as per Table 2.]

Acknowledgments

The authors would like to thank the organisers of the DSTC. The principal author was funded by a studentship from the EPSRC.

References

Alan W. Black, Susanne Burger, Alistair Conkie, Helen Wright Hastie, Simon Keizer, Oliver Lemon, Nicolas Merigaud, Gabriel Parent, Gabriel Schubiner, Blaise Thomson, Jason D. Williams, Kai Yu, Steve Young, and Maxine Eskenazi. 2011. Spoken dialog challenge 2010: Comparison of live and control test results. In SIGDIAL.

Dan Bohus and Alex Rudnicky. 2006. A K-hypotheses + Other belief updating model. In Proc. of the AAAI Workshop on Statistical and Empirical Methods in Spoken Dialogue Systems.

Léon Bottou. 1991. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91, Nîmes, France. EC2.

Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Computation.

Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. 2012. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine.

Deng Li, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael L. Seltzer, Geoff Zweig, Xiaodong He, Jason D. Williams, Yifan Gong, and Alex Acero. 2013. Recent advances in deep learning for speech research at Microsoft. In ICASSP.

Tim Paek and Eric Horvitz. 2000. Conversation as action under uncertainty. In The Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.

Blaise Thomson and Steve Young. 2010. Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems. Computer Speech & Language.

Jason D. Williams, Antoine Raux, Deepak Ramachandran, and Alan W. Black. 2013. The Dialogue State Tracking Challenge. In SIGDIAL.

Jason D. Williams. 2012a. A belief tracking challenge task for spoken dialog systems. In NAACL-HLT 2012 Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data. Association for Computational Linguistics.

Jason D. Williams. 2012b. Challenges and opportunities for state tracking in statistical spoken dialog systems: Results from two public deployments. IEEE Journal of Selected Topics in Signal Processing, 6(8):959–970.

Steve Young, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2009. The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language.
