
Dialog State Tracking with Reinforced Data Augmentation

Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Qun Liu
Huawei Noah's Ark Lab

{yinyichun, shang.lifeng, jiang.xin, chen.xiao2, qun.liu}@huawei.com

Abstract

Neural dialog state trackers are generally limited by the lack of quantity and diversity of annotated training data. In this paper, we address this difficulty by proposing a reinforcement learning (RL) based framework for data augmentation that can generate high-quality data to improve the neural state tracker. Specifically, we introduce a novel contextual bandit generator to learn fine-grained augmentation policies that can generate new effective instances by choosing suitable replacements for a specific context. Moreover, by alternately learning between the generator and the state tracker, we can keep refining the generative policies to generate more high-quality training data for the neural state tracker. Experimental results on the WoZ and MultiWoZ (restaurant) datasets demonstrate that the proposed framework significantly improves performance over state-of-the-art models, especially with limited training data.

1 Introduction

With the increasing popularity of intelligent assistants such as Alexa, Siri and Google Duplex, research on spoken dialog systems has gained a great deal of attention in recent years (Gao et al., 2018). Dialog state tracking (DST) (Williams et al., 2013) is an essential component of most spoken dialog systems, aiming to track the user's goal at each step in a dialog. Based on that, the dialog agent decides how to converse with the user. In a slot-based dialog system, the dialog states are typically formulated as a set of slot-value pairs, and one concrete example is as follows:

User: Grandma wants Italian, any suggestions?
State: inform(food=Italian)
Agent: Would you prefer south or center?
User: It doesn't matter. Whichever is less expensive.
State: inform(food=Italian, price=cheap, area=don't care)

[Figure 1 shows the original example "I want some reasonably priced Chinese food.", the Contextual Bandit Generator selecting among candidate replacements for the span [reasonably priced] (e.g., "cost effective", "affordable", "too expensive") according to its policy, the new instance "I wanted to find some [affordable] Chinese food." added to the generated data, and the Dialogue State Tracker, whose rewards are fed back to the Generator.]

Figure 1: An overview of our framework. Given a dataset, we induce new instances using the RL-based Generator to improve the DST Tracker. The Generator is trained with the rewards from the Tracker. The learning process is performed in an alternate manner.

The state-of-the-art models for DST are based on neural networks (Henderson et al., 2014b; Mrksic et al., 2017; Zhong et al., 2018; Ren et al., 2018; Sharma et al., 2019). They typically predict the probabilities of the candidate slot-value pairs with the user utterance, previous system actions or other external information as inputs, and then determine the final value of each slot based on these probabilities. Although the neural network based methods are promising with advanced deep learning techniques such as gating and self-attention mechanisms (Lin et al., 2017; Vaswani et al., 2017), their data-hungry nature makes it difficult for them to generalize well to scenarios with limited or sparse training data.

To alleviate the data sparsity in DST, we propose a reinforced data augmentation (RDA) framework to increase both the amount and diversity of the training data. The RDA learns to generate high-quality labeled instances, which are used to re-train the neural state trackers to achieve better performance. As shown in Figure 1, the RDA consists of two primary modules: Generator and Tracker.

The two learnable modules alternately learn from each other during the training process. On one hand, the Generator module is responsible for generating new instances based on a parameterized generative policy, which is trained with the rewards from the Tracker module. On the other hand, the Tracker is refined with the newly generated instances from the Generator.

Data augmentation performs perturbation on the original dataset without actually collecting new data. It has been widely used in the fields of computer vision (Krizhevsky et al., 2012; Cubuk et al., 2018) and speech recognition (Ko et al., 2015), but remains relatively limited in natural language processing (Kobayashi, 2018). The reason is that, in contrast to image augmentation (e.g., rotating or flipping images), it is significantly more difficult to augment text, because doing so requires preserving the semantics and fluency of the newly augmented data. In this paper, to derive a more general and effective policy for text data augmentation, we adopt a coarse-to-fine strategy to model the generation process. Specifically, we initially use coarse-grained methods to obtain candidates (such as cost effective, affordable and too expensive in Figure 1), some of which are inevitably noisy or unreliable for the specific sentence context. We then adopt RL to learn policies for selecting high-quality candidates to generate new instances, where the total rewards are obtained from the Tracker. After learning the Generator, we use it to induce more training data to re-train the Tracker. Accordingly, the Tracker will further provide more reliable rewards to the Generator. With alternate learning, we can progressively improve the generative policies for data augmentation and at the same time learn a better Tracker with the augmented data.

To demonstrate the effectiveness of the proposed RDA framework in DST, we conduct extensive experiments with the WoZ (Wen et al., 2017) and MultiWoZ (restaurant) (Budzianowski et al., 2018) datasets. The results show that our model consistently outperforms the strong baselines and achieves new state-of-the-art results. In addition, the effects of the hyper-parameter choices on performance are analyzed and case studies on the policy network are performed.

The main contributions of this paper include:

• We propose a novel framework of data augmentation for dialog state tracking, which can generate high-quality labeled data to improve neural state trackers.

• We use RL for the Generator to produce effective text augmentation.

• We demonstrate the effectiveness of the proposed framework on two datasets, showing that the RDA can consistently boost the state-tracking performance and obtain new state-of-the-art results.

2 Reinforced Data Augmentation

We elaborate on our framework in three parts: the Tracker module, the Generator module, and the alternate learning algorithm.

2.1 Tracker Module

The dialog state tracker aims to track the user's goal during the dialog process. At each turn, given the user utterance and the system action/response [1], the tracker first estimates the probabilities of the candidate slot-value pairs [2], and then the pair with the maximum probability for each slot is chosen as the final prediction. To obtain the dialog state of the current turn, trackers typically use the newly predicted slot values to update the corresponding values in the state of the previous turn. A concrete example of the Tracker module is illustrated in Figure 2.
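To make the update step concrete, the following is a minimal sketch of the turn-level update described above; the slot-value scoring model is assumed to exist elsewhere, and all names are illustrative rather than taken from the authors' implementation.

```python
# A minimal sketch of the turn-level state update described above. The
# slot-value scoring model is assumed to exist elsewhere; all names here are
# illustrative, not taken from the authors' implementation.
def update_dialog_state(prev_state, slot_value_probs):
    """prev_state: dict mapping slot -> value carried over from the previous turn.
    slot_value_probs: dict mapping slot -> {value: probability} predicted for the
    current turn (each slot includes a 'none' candidate).
    Returns the updated dialog state of the current turn."""
    state = dict(prev_state)
    for slot, value_probs in slot_value_probs.items():
        value, _ = max(value_probs.items(), key=lambda kv: kv[1])
        if value != "none":          # only overwrite when a real value is predicted
            state[slot] = value
    return state
```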

Our RDA framework is generic and can be applied to different types of tracker models. To demonstrate its effectiveness, we experiment with two different trackers: the state-of-the-art GLAD* model and the classical NBT-CNN (Neural Belief Tracking - Convolutional Neural Networks) model (Mrksic et al., 2017). GLAD* is built on the recently proposed GLAD (Global-Locally Self-Attentive Dialogue State Tracker) (Zhong et al., 2018) by modifying the parameter sharing and attention mechanisms. Due to limited space, we detail the GLAD* model in the supplementary material. We use the Tracker to refer to GLAD* and NBT-CNN in the following sections.

2.2 Generator Module

We formulate data augmentation as an optimal text span replacement problem in a labeled sentence.

[1] If the system actions do not exist in the dataset, we use the system response as the input.

[2] For each slot, a none value is added as one candidate slot-value pair.

Figure 2: The Tracker module. (1) The system action or response and the user utterance are the input; (2) the tracker predicts the probabilities of all possible slot-value pairs; (3) the prediction and the state of the previous turn are used to update the state of the current turn.

Specifically, given the tuple of a sentence x, its label y, and a text span p of the sentence, the Generator aims to generate a new training instance (x′, y′) by substituting p in x with an optimal candidate p′ from a set of candidates for p, which we denote as C_p.

In span-based data augmentation, we can replace the text span with its paraphrases, derived either from existing paraphrase databases or from neural paraphrase generation models, e.g., (Zhao et al., 2009; Li et al., 2018). However, directly applying this coarse-grained approach can introduce ineffective or noisy instances for training and eventually hurt the performance of trackers. Therefore, we train the Generator to learn fine-grained generation policies to further improve the quality of the augmented data.

Generation Process. The problem of high-quality data generation is modeled as a contextual bandit (or one-step reinforcement learning) (Dudik et al., 2011). Formally, at each trial of a contextual bandit, the context, consisting of the sentence x and its text span p, is sampled and shown to the agent; the agent then selects a candidate p′ from C_p to generate a new instance x′ by replacing p with p′.

Policy Learning. The policy π_θ(s, p′) represents a probability distribution over the valid actions at the current trial, where the state vector s is extracted from the sentence x, the text span p and the candidate p′. C_p forms the action space of the agent given the state s, and the reward R is a scalar value function. The policy is learned to maximize the expected reward:

J(\theta) = \mathbb{E}_{\pi_\theta}[R],   (1)

where the expectation is taken over the state s and the action p′.

The policy π_θ(s, p′) decides which p′ ∈ C_p to take based on the state s, which is formulated as:

s = [\mathbf{p}, \mathbf{p}'_{emb}, \mathbf{p}'_{emb} - \mathbf{p}_{emb}, \mathbf{p}'_{emb} \circ \mathbf{p}_{emb}],   (2)

where \mathbf{p} is the contextual representation of p, which is derived from the hidden states in the encoder of the Tracker, and \mathbf{p}_{emb} and \mathbf{p}'_{emb} are the word embeddings of p and p′ respectively. For multi-word phrases, we use the average of the word representations as the phrase representation. We use a two-layer fully connected network followed by a sigmoid to compute the score function f(s, p′) of p being replaced by p′. As each p has multiple replacement choices C_p, we normalize the scores to obtain the final probabilities for the alternative phrases:

\pi_\theta(s, p') = \frac{f(s, p')}{\sum_{\hat{p} \in C_p} f(s, \hat{p})}.   (3)
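As a concrete illustration of Eqs. (2)-(3), the following PyTorch sketch builds the state vector from the contextual representation of p and the averaged embeddings of p and p′, scores it with a two-layer fully connected network followed by a sigmoid, and normalizes over the candidate set. Dimensions and module names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# A PyTorch sketch of the scoring network behind Eqs. (2)-(3). Dimensions and
# module names are illustrative assumptions, not the authors' code.
class ReplacementPolicy(nn.Module):
    def __init__(self, ctx_dim=200, emb_dim=300, hidden_dim=200):
        super().__init__()
        # state s = [p_ctx ; p'_emb ; p'_emb - p_emb ; p'_emb * p_emb]  (Eq. 2)
        state_dim = ctx_dim + 3 * emb_dim
        self.scorer = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),            # f(s, p') in (0, 1)
        )

    def forward(self, p_ctx, p_emb, cand_embs):
        """p_ctx: (ctx_dim,) contextual representation of the span p taken from
        the Tracker encoder; p_emb: (emb_dim,) averaged word embedding of p;
        cand_embs: (K, emb_dim) averaged embeddings of the K candidates in C_p.
        Returns pi_theta(s, p') over the candidates, shape (K,)."""
        k = cand_embs.size(0)
        states = torch.cat(
            [p_ctx.expand(k, -1),
             cand_embs,
             cand_embs - p_emb,
             cand_embs * p_emb], dim=-1)
        scores = self.scorer(states).squeeze(-1)   # f(s, p') for each candidate
        return scores / scores.sum()               # normalization of Eq. (3)
```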

The sampling-based policy gradient is used to approximate the gradient of the expected reward. To obtain more feedback and make the policy learning more stable, as illustrated in Figure 3, we propose a two-step sampling method: first, sample a bag of sentences B = {(x_i, y_i, p_i)}_{1 \le i \le T}; then iteratively sample a candidate p′_{i,j} for each instance in B according to the current policy, obtaining a new bag of instances B′_j = {(x′_{i,j}, y′_{i,j}, p′_{i,j})}_{1 \le i \le T}. After running the bag-level sampling M times, the gradient of the objective function can be estimated as:

\nabla J(\theta) \approx \frac{1}{M} \sum_{j=1}^{M} \sum_{i=1}^{T} \nabla_\theta \log \pi_\theta(s_{i,j}, p'_{i,j}) R_{i,j},   (4)

where s_{i,j} and p′_{i,j} denote the state and action of the i-th instance-level sampling from the j-th bag-level sampling, respectively, and R_{i,j} is the corresponding reward.
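A sketch of how the two-step sampling and the estimate in Eq. (4) could be wired together is shown below; it reuses the ReplacementPolicy sketch above, and compute_rewards is an assumed helper that re-trains and evaluates the Tracker to produce the rewards R_{i,j} defined next.

```python
import torch

# A sketch of the two-step sampling and the REINFORCE-style estimate of Eq. (4),
# reusing the ReplacementPolicy sketch above. `compute_rewards` is an assumed
# helper that re-trains/evaluates the Tracker to produce R_{i,j} (Eqs. 5-6).
def policy_gradient_step(policy, optimizer, tracker, bag, num_bag_samples):
    """bag: list of T tuples (x, y, p, p_ctx, p_emb, cand_embs)."""
    log_probs, actions = [], []
    for _ in range(num_bag_samples):                     # M bag-level samples
        bag_log_probs, bag_actions = [], []
        for (_, _, _, p_ctx, p_emb, cand_embs) in bag:   # T instances per bag
            probs = policy(p_ctx, p_emb, cand_embs)
            idx = torch.multinomial(probs, 1).item()     # sample p' ~ pi_theta
            bag_log_probs.append(torch.log(probs[idx]))
            bag_actions.append(idx)
        log_probs.append(bag_log_probs)
        actions.append(bag_actions)

    # rewards[j][i] = R_{i,j} = R^B_j + R^I_{i,j}
    rewards = compute_rewards(tracker, bag, actions)
    loss = -sum(lp * r
                for bag_lp, bag_r in zip(log_probs, rewards)
                for lp, r in zip(bag_lp, bag_r)) / num_bag_samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```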

Reward Design. One key problem is assigning suitable rewards to the various actions p′_{i,j} given states s_{i,j}. We design two kinds of rewards: a bag-level reward R^B_j and an instance-level reward R^I_{i,j}. The bag-level reward (Feng et al., 2018; Qin et al., 2018a) indicates whether the newly sampled bag is helpful for improving the Tracker, and all instances in the same bag receive the same reward value.

Figure 3: The algorithm flow of the reinforced data augmentation framework. The left is the Generator learning and the right is the Tracker learning. The two learning processes are performed in an alternate manner.

The instance-level reward, in contrast, assigns different reward values to the instances within the same sampled bag by checking whether an instance causes the Tracker to make an incorrect prediction (Kang et al., 2018; Ribeiro et al., 2018). We sum the two kinds of rewards as the final reward, R_{i,j} = R^B_j + R^I_{i,j}, for more reliable policy learning.

Bag-level reward R^B_j: we re-train the Tracker with each sampled bag and use its performance (e.g., joint goal accuracy (Henderson et al., 2014a)) on the validation set to indicate its reward. Suppose the performance of the j-th bag B′_j is denoted as U′_j; the bag-level reward is formulated as:

R^B_j = \frac{2(U'_j - \min(\{U'_{j^*}\}))}{\max(\{U'_{j^*}\}) - \min(\{U'_{j^*}\})} - 1,   (5)

where \{U'_{j^*}\} refers to the set \{U'_{j^*}\}_{1 \le j^* \le M}. Here we scale the value to be bounded in the range [-1, 1] to alleviate instability in RL training [3].
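A small sketch of the scaling in Eq. (5), assuming the joint goal accuracies U′_j of the M sampled bags have already been measured on the validation set; the handling of the degenerate case where all bags perform equally is our own assumption, since the paper does not specify it.

```python
# A small sketch of the scaling in Eq. (5). The handling of the degenerate case
# where all bags perform equally is our own assumption; the paper does not
# specify it.
def bag_rewards(bag_accuracies):
    """bag_accuracies: list of U'_j values (e.g., joint goal accuracy on the
    validation set) for the M sampled bags. Returns rewards scaled to [-1, 1]."""
    lo, hi = min(bag_accuracies), max(bag_accuracies)
    if hi == lo:
        return [0.0 for _ in bag_accuracies]
    return [2.0 * (u - lo) / (hi - lo) - 1.0 for u in bag_accuracies]
```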

Instance-level reward R^I_{i,j}: we evaluate each generated instance (x′_{i,j}, y′_{i,j}) in the bag and denote an instance that causes the Tracker to make a wrong prediction as a large-loss instance (LI) (Han et al., 2018). Compared to the non-LIs, the LIs are more informative and can induce a larger loss for training the Tracker. Thus, in the design of the instance-level rewards, an LI is encouraged more when its corresponding bag reward is positive, and punished more when its bag reward is negative.

[3] In this work, the original text span p is also used as one candidate in C_p, which actually acts as an implicit baseline (Sutton and Barto, 2018) in RL training.

Specifically, we define the instance-level reward as follows:

R^I_{i,j} = \begin{cases} c, & R^B_j \ge 0 \wedge I_{LI}(x'_{i,j}, y'_{i,j}) = 1 \\ c/2, & R^B_j \ge 0 \wedge I_{LI}(x'_{i,j}, y'_{i,j}) = 0 \\ -c, & R^B_j < 0 \wedge I_{LI}(x'_{i,j}, y'_{i,j}) = 1 \\ -c/2, & R^B_j < 0 \wedge I_{LI}(x'_{i,j}, y'_{i,j}) = 0, \end{cases}   (6)

where I_{LI}(x′_{i,j}, y′_{i,j}) is an indicator function of being an LI. We obtain the value of I_{LI}(x′_{i,j}, y′_{i,j}) by checking whether the pre-trained Tracker can correctly predict the label on the generated example. c is a hyper-parameter, which is set to 0.5 by running a grid search over the validation set.

2.3 Alternate Learning

In the framework of RDA, the learning of the Generator and the Tracker is conducted in an alternate manner, which is detailed in Algorithm 1 and Figure 3.

The text span p to be replaced has a different distribution in the training set. To make learning more efficient, we first sample one text span p, then sample one sentence (x, y) from the sentences containing p. This process is performed iteratively to obtain a bag B. To learn the Generator, we generate M bags of instances by running the policy, compute their rewards and update the policy network via the policy gradient method. To learn the Tracker, we augment the training data with the updated policy. In particular, for each (x, y, p), we generate a new instance (x′, y′, p′) by sampling based on the learned policies. To further reduce the effect of noisy augmented instances, we remove a new instance if its p′ has the minimum probability among C_p. We randomly initialize the policy at each epoch so that the Generator learns adaptively which policy is best for the current Tracker.

Algorithm 1 The Reinforced Data Augmentation

Input: Pre-trained Tracker with parameters θ_r; the randomly initialized Generator with parameters θ_π
Output: Re-trained Tracker
 1: Store θ_π
 2: for l = 1 → L do
 3:     Re-initialize the Generator with θ_π
 4:     for n = 1 → N do
 5:         Re-initialize the Tracker with θ_r
 6:         Sample a bag B
 7:         for j = 1 → M do
 8:             Sample a new bag B′_j
 9:         end for
10:         Compute the bag rewards with Eq. 5
11:         Compute the instance rewards with Eq. 6
12:         Update θ_π by the gradients in Eq. 4
13:     end for
14:     Obtain new data D′ by the Generator
15:     Re-train the Tracker on D + D′, update θ_r
16: end for
17: Save the Tracker with the θ_r that performs best on the validation set among the L epochs

The alternate learning is performed for multiple rounds, and the Tracker with the best performance on the validation set is saved.
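For readers who prefer code to pseudocode, here is a high-level Python sketch of Algorithm 1. All helpers (evaluate, sample_bag, train_policy_step, augment, retrain_tracker) are assumed wrappers around the components described above, and initializing the running best with the pre-trained Tracker is our own assumption.

```python
# A high-level Python sketch of Algorithm 1. All helpers (evaluate, sample_bag,
# train_policy_step, augment, retrain_tracker) are assumed wrappers around the
# components described above; initializing the running best with the pre-trained
# Tracker is our own assumption.
def reinforced_data_augmentation(tracker, make_generator, train_set, valid_set,
                                 L=5, N=200, M=2):
    best_tracker, best_acc = tracker, evaluate(tracker, valid_set)
    for _ in range(L):                                    # alternate-learning epochs
        generator = make_generator()                      # re-initialize the Generator
        for _ in range(N):                                # Generator learning
            bag = sample_bag(train_set)
            train_policy_step(generator, tracker, bag, M) # rewards come from the Tracker
        augmented = augment(train_set, generator)         # D'
        tracker = retrain_tracker(train_set + augmented)  # re-train on D + D'
        acc = evaluate(tracker, valid_set)
        if acc > best_acc:                                # keep the best Tracker
            best_tracker, best_acc = tracker, acc
    return best_tracker
```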

3 Experiment

In this section, we present experimental results that demonstrate the effectiveness of our framework.

3.1 Dataset and Evaluation

We use WoZ (Wen et al., 2017) and MultiWoZ (Budzianowski et al., 2018) to evaluate the proposed framework on the task of dialog state tracking [4]. Following Budzianowski et al. (2018), we extract the restaurant domain of MultiWoZ as the evaluation dataset, denoted as MultiWoZ (restaurant). Both WoZ and MultiWoZ (restaurant) are in the restaurant domain. In the experiments, we use the widely used joint goal accuracy (Henderson et al., 2014a) as the evaluation metric, which measures whether all slot values of the updated dialog state exactly match the ground truth values at every turn.

[4] The DSTC2 (Mrksic et al., 2017) dataset is not used because its clean version (http://mi.eng.cam.ac.uk/~nm480/dstc2-clean.zip) is no longer available.
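As a reference point, the metric described above can be sketched as follows, treating each dialog state as a dict of slot-value pairs; this is an illustrative implementation, not the official evaluation script.

```python
# An illustrative implementation of the metric described above (not the official
# evaluation script). Each dialog state is represented as a dict of slot-value
# pairs.
def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of turns whose full predicted state exactly matches the gold state."""
    correct = sum(1 for pred, gold in zip(pred_states, gold_states) if pred == gold)
    return correct / len(gold_states)
```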

3.2 Implementation Details

We implement the proposed model using PyTorch [5]. All hyper-parameters of our model are tuned on the validation set. To demonstrate the robustness of our model, we use similar hyper-parameter settings for both datasets. Following previous work (Ren et al., 2018; Zhong et al., 2018; Nouri and Hosseini-Asl, 2018), we concatenate the pre-trained GloVe embeddings (Pennington et al., 2014) and the character embeddings (Hashimoto et al., 2017) as the final word embeddings and keep them fixed during training. The epoch number of the alternate learning L, the epoch number of the Generator learning N and the number of sampling times M for each bag are set to 5, 200 and 2 respectively. We set the dimensions of all hidden states to 200 in both the Tracker and the Generator, and set the head number of the multi-head self-attention to 4 in the Tracker. All learnable parameters are optimized by the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3. The batch size is set to 16 in the Tracker learning, and the bag size in the Generator learning is set to 25.

To avoid over-fitting, we apply dropout to the word embedding layer with a rate of 0.2. We also assign rewards based on a subsampled validation set with a ratio of 0.3 to avoid over-fitting the policy network to the validation set.

In our experiments, the newly augmented dataset is n times the size of the original training data (n = 5 for WoZ and n = 3 for MultiWoZ). At each iteration, we randomly sample a subset of the augmented data to train the Tracker. The sampling ratios are 0.4 for WoZ and 0.3 for MultiWoZ.

For the coarse-grained data augmentation method, we have tried current neural paraphrase generation models. A preliminary experiment indicates that almost all of the generated sentences are not helpful for the task of DST. The reason is that most neural paraphrase generation models require an additional labeled paraphrase corpus, which may not always be available (Ray et al., 2018). In this work, we instead extract the unigrams, bigrams and trigrams in the training data as the text spans in the generation process. After that, we retrieve the paraphrases for each text span from the PPDB [6] database as the candidates. We also use the golden slot value in the sentence as a text span, with the other values of the same slot as its candidates; in this case the label is changed accordingly.
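The candidate construction described above might look as follows; ppdb_lookup stands in for a lookup into the PPDB paraphrase database and slot_values for the slot ontology, both assumed helpers, and the label update for slot-value swaps is only indicated in a comment.

```python
from itertools import chain

# A sketch of the coarse-grained candidate construction described above.
# `ppdb_lookup` stands in for a lookup into the PPDB paraphrase database and
# `slot_values` for the slot ontology; both are assumed helpers. The label update
# for slot-value swaps is only indicated in a comment.
def build_candidates(tokens, slot, value, slot_values, ppdb_lookup, max_n=3):
    """Return a dict mapping each text span to its candidate replacements C_p."""
    candidates = {}
    # 1) unigram/bigram/trigram spans paraphrased via PPDB
    ngrams = chain.from_iterable(
        (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1))
    for span in ngrams:
        paraphrases = ppdb_lookup(span)
        if paraphrases:
            # the original span is kept as one candidate (the implicit baseline)
            candidates[span] = [span] + list(paraphrases)
    # 2) the golden slot value can be swapped with other values of the same slot;
    #    when such a candidate is chosen, the label y must be changed accordingly.
    if value in " ".join(tokens):
        candidates[value] = list(slot_values[slot])
    return candidates
```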

[5] https://pytorch.org/
[6] http://paraphrase.org/

Model                  WoZ         Multi
Delexicalised Model    70.8        71.2
NBT-DNN                84.4        80.3
NBTKS                  85.5        80.9
StateNet               88.9        82.4
GLAD                   88.1        82.7
GCE                    88.5        83.5
NBT-CNN                84.0 ±0.6   79.8 ±1.0
  + DA                 84.2 ±0.5   79.7 ±0.7
  + RDA                87.9‡ ±0.3  83.4† ±0.6
GLAD*                  88.3 ±0.3   83.6 ±0.9
  + DA                 88.0 ±0.5   82.7 ±0.7
  + RDA                90.7‡ ±0.2  86.7† ±0.5

Table 1: Comparison of our model and other baselines. DA refers to the coarse-grained data augmentation without the reinforced framework, and Multi refers to the dataset MultiWoZ (restaurant). A t-test is conducted on our proposed models, with the original trackers (NBT-CNN and GLAD*) used as the comparison baselines. † and ‡: significant over the baseline trackers at 0.05/0.01. The mean and the standard deviation are also reported.


3.3 Baseline Methods

We compare our model with the following baselines. Delexicalised Model uses generic tags to replace the slot values and employs a CNN for turn-level feature extraction and a Jordan RNN for state updates (Henderson et al., 2014b; Wen et al., 2017). NBT-DNN and NBT-CNN respectively use summation and convolution filters to learn representations for the user utterance, the candidate slot-value pair and the system actions (Mrksic et al., 2017). They then fuse these representations with a gating mechanism for the final prediction. NBTKS has a structure similar to NBT-DNN and NBT-CNN, but with a more complicated gating mechanism (Ramadan et al., 2018). StateNet learns a representation from the dialog history, and then compares the distances between the learned representation and the vectors of the candidate slot-value pairs for the final prediction (Ren et al., 2018). GLAD is a global-locally self-attentive state tracker, which learns representations of the user utterance and previous system actions with global-local modules (Zhong et al., 2018). GCE is developed from GLAD by using global recurrent networks rather than the global-local modules (Nouri and Hosseini-Asl, 2018).

We also use the coarse-grained data augmentation (DA) without the reinforced framework as a baseline, which generates new instances by randomly choosing one of the candidates.

Dataset   Model     10%    20%    50%
WoZ       GLAD*     50.1   72.5   81.7
          + RDA     66.8   81.5   86.9
Multi     GLAD*     60.0   72.6   77.6
          + RDA     71.5   81.2   85.2

Table 2: The results with different sub-sampling ratios on WoZ and MultiWoZ (restaurant).

Setting              WoZ    Multi
RDA                  90.7   86.7
 - Bag Reward        89.1   84.3
 - Instance Reward   89.8   85.4
DA                   88.0   82.7

Table 3: Ablation study of performances on the test sets of WoZ and MultiWoZ.


3.4 Results and Analyses

We compare our model with the baselines, using joint goal accuracy as the evaluation metric. The results are shown in Table 1.

From the table, we observe that the proposed GLAD* achieves performance (88.3% and 83.6%) comparable to other state-of-the-art models on both datasets. The RDA framework can further boost the performance of the competitive GLAD* by margins of 2.4% and 3.1% on the two datasets respectively, achieving new state-of-the-art results (90.7% and 86.7%). Compared with GLAD*, the classical NBT-CNN with the RDA framework obtains even larger improvements: 3.9% and 3.6%. We also conduct a significance test (t-test), and the results show that the proposed RDA achieves significant improvements over the baseline models (p < 0.01 and p < 0.05 respectively for WoZ and MultiWoZ (restaurant)).

The table also shows that directly using coarse-grained data augmentation methods without the RDA is less effective, and can even degrade performance, as it may generate noisy instances. The results show that, using the RDA, GLAD* achieves improvements of 88.0%→90.7% and 82.7%→86.7% respectively on WoZ and MultiWoZ, while NBT-CNN obtains improvements of 84.2%→87.9% and 79.7%→83.4% respectively. Overall, the results indicate that the RDA framework offers an effective mechanism to improve the quality of the augmented data.

To further verify the effectiveness of the RDA when the training data is scarce, we conduct sub-sampling experiments with the GLAD* tracker trained on different ratios [10%, 20%, 50%] of the training set. The results on both datasets are shown in Table 2. We find that our proposed RDA method consistently improves the original tracker performance. Notably, we obtain ∼10% improvements with the [10%, 20%] ratios of the training set on both WoZ and MultiWoZ (restaurant), which indicates that the RDA framework is particularly useful when the training data is limited.

To evaluate the effect of the different levels of reward, we perform an ablation study with GLAD* on both the WoZ and MultiWoZ datasets. The results are shown in Table 3. From the table we can see that both rewards provide improvements of 1% to 2% on the datasets, and the bag-level reward achieves larger gains than the instance-level reward. Compared with the DA setting, RDA obtains improvements of 3% to 4% on the datasets by combining both rewards, which indicates that the summed reward is more reliable for policy learning than the individual ones.

3.5 Effects of Hyper-parameters

Figure 4: Results of different hyper-parameters. Top: different multiples of the size of the original data; middle: different epochs of the alternate learning; bottom: different epochs of the Generator learning. The solid circles at L = 0 and N = 0 in the figure refer to the model with coarse-grained data augmentation (DA).

In this subsection, we investigate the effects on performance of the amount of newly augmented data used in the Tracker learning, the epoch number of the alternate learning L, and the epoch number of the Generator learning N. We conduct experiments with the GLAD* tracker, which is evaluated on the validation set of WoZ; the joint goal accuracy is used as the evaluation metric.

Number of newly augmented data: we use 0 to 5 times the size of the original data in the Tracker learning. The performance is shown in Figure 4 (top). The model continues to improve when the number of newly added examples is less than 2 times the original data. When we add more than twice the amount of original data, the improvement is not significant.

Epoch number of the alternate learning: we vary L from 0 to 10, and the performance is shown in Figure 4 (middle). We can see that, with alternate learning, the model continues to improve when L ≤ 5, and becomes stable with no further improvement after L > 5.

Epoch number of the Generator learning: we vary N from 0 to 350, and the performance is shown in Figure 4 (bottom). We find that the performance increases dramatically when N ≤ 200, and shows no improvement after N > 200. This shows that the Generator needs a large N to ensure a good policy.

3.6 Case Study for Policy Network

We sample four sentences from WoZ to demonstrate the effectiveness of the Generator policy in a case study. Due to limited space, we present only the candidate phrases with the maximum and minimum probabilities derived from the policy network; the details are shown in Table 4.

We observe that both high-quality and low-quality replacements exist in the candidate set. The high-quality replacements will generate reliable instances, which can potentially improve the generalization ability of the Tracker. The low-quality ones will induce noisy instances and can reduce the performance of the Tracker. From the results of the policy network, we find that our Generator can automatically infer the quality of candidate replacements, assigning higher probabilities to the high-quality candidates and lower probabilities to the low-quality ones.

4 Related Work

Dialog State Tracking. DST has been studied extensively in the literature (Williams et al., 2016).

Sentence x and text span p                                            Candidates C_p
Thanks, [could you give] me the phone number for the restaurant?      i was wonder if you could provide
                                                                      are you able to
What restaurants are on the east side that are not [overpriced]?      too expensive
                                                                      cheap enough
What is a affordable restaurant in the [south side part] of town?     south end
                                                                      southern countries
I want Cuban food and i [do n't care] about the price range.          do n't worry
                                                                      do n't give a danm

Table 4: Case study for the Generator policy. The phrase with the maximum policy value is listed on the first line in each cell of Candidates C_p and the one with the minimum value is listed on the second line.

The methods can be classified into three categories: rule-based (Zue et al., 2000), generative (DeVault and Stone, 2007; Williams, 2008), and discriminative (Metallinou et al., 2013) methods. The discriminative methods (Metallinou et al., 2013) treat dialog state tracking as a classification problem, designing a large number of features and optimizing the model parameters on the annotated data. Recently, neural network based models with different architectures have been applied to DST (Henderson et al., 2014b; Zhong et al., 2018). These models initially employ CNNs (Wen et al., 2017), RNNs (Ramadan et al., 2018) or self-attention (Nouri and Hosseini-Asl, 2018) to learn representations for the user utterance and the system actions/response, and then various gating mechanisms (Ramadan et al., 2018) are used to fuse the learned representations for prediction. Another difference among these neural models is the way of parameter sharing: most of them use one shared global encoder for representation learning, while Zhong et al. (2018) pair each slot with a local encoder in addition to one shared global encoder. Although these neural network based trackers obtain state-of-the-art results, they are still limited by the insufficient amount and diversity of annotated data. To address this difficulty, we propose a data augmentation method to improve neural state trackers by adding high-quality generated instances as new training data.

Data Augmentation. Data augmentation aims to generate new training data by conducting transformations (e.g., rotating or flipping images, audio perturbation, etc.) on existing data. It has been widely used in computer vision (Krizhevsky et al., 2012; Cubuk et al., 2018) and speech recognition (Ko et al., 2015). In contrast to image or speech transformations, it is difficult to obtain effective transformation rules for text which can preserve the fluency and coherence of the newly generated text and be useful for specific tasks. There is prior work on data augmentation in NLP (Zhang et al., 2015; Kang et al., 2018; Kobayashi, 2018; Hou et al., 2018; Ray et al., 2018; Yoo et al., 2018). These approaches do not design specific mechanisms to filter out low-quality generated instances. In contrast, we propose a coarse-to-fine strategy for data augmentation, where the fine-grained generative policies learned by RL are used to automatically reduce the noisy instances and retain the effective ones.

Reinforcement Learning in NLP. RL is a general-purpose framework for decision making and has been applied to many NLP tasks such as information extraction (Narasimhan et al., 2016), relational reasoning (Xiong et al., 2017), sequence learning (Ranzato et al., 2015; Li et al., 2018; Celikyilmaz et al., 2018), summarization (Paulus et al., 2017; Dong et al., 2018), text classification (Wu et al., 2018; Feng et al., 2018) and dialog (Singh et al., 2000; Li et al., 2016). Previous works by Feng et al. (2018) and Qin et al. (2018b) design RL algorithms to learn how to filter out noisy instances. Our work differs significantly from these works, especially in the problem setting and the model framework. The previous works assume there are many distantly supervised sentences, whereas in our work we only know the possible replacements, and our RL algorithm must learn how to choose optimal replacements to generate new high-quality sentences. Moreover, the action space and reward design are different.

5 Conclusion and Future Work

We have proposed a reinforced data augmentation (RDA) method for dialog state tracking in order to improve its performance by generating high-quality training data. The Generator and the Tracker are learned in an alternate manner: the Generator is learned based on rewards from the Tracker, while the Tracker is re-trained and boosted with the new high-quality data augmented by the Generator. We conducted extensive experiments on the WoZ and MultiWoZ (restaurant) datasets; the results demonstrate the effectiveness of our framework. In future work, we plan to conduct experiments on more NLP tasks and to introduce neural network based paraphrasing methods into the RDA framework.

References

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In EMNLP, pages 5016–5026.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In NAACL, volume 1, pages 1662–1675.

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. 2018. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.

David DeVault and Matthew Stone. 2007. Managing ambiguities across utterances in dialogue. In Decalog, pages 49–56.

Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. BanditSum: Extractive summarization as a contextual bandit. In EMNLP, pages 3739–3748.

Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.

Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data.

Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267.

Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP, pages 1923–1933.

Matthew Henderson, Blaise Thomson, and Jason D Williams. 2014a. The second dialog state tracking challenge. In SIGDIAL, pages 263–272.

Matthew Henderson, Blaise Thomson, and Steve Young. 2014b. Word-based dialog state tracking with recurrent neural networks. In SIGDIAL, pages 292–299.

Yutai Hou, Yijia Liu, Wanxiang Che, and Ting Liu. 2018. Sequence-to-sequence data augmentation for dialogue language understanding. In COLING, pages 1234–1245.

Dongyeop Kang, Tushar Khot, Ashish Sabharwal, and Eduard Hovy. 2018. Adversarial training for textual entailment with knowledge-guided examples. In ACL.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Interspeech.

Sosuke Kobayashi. 2018. Contextual augmentation: Data augmentation by words with paradigmatic relations. In NAACL.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In EMNLP.

Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. 2018. Paraphrase generation with deep reinforcement learning. In EMNLP, pages 3865–3878.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. ICLR.

Angeliki Metallinou, Dan Bohus, and Jason Williams. 2013. Discriminative state tracking for spoken dialog systems. In ACL, pages 466–475.

Nikola Mrksic, Diarmuid O Seaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In ACL, pages 1777–1788.

Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In EMNLP, pages 2355–2365.

Elnaz Nouri and Ehsan Hosseini-Asl. 2018. Toward scalable neural dialogue state tracking model. arXiv preprint arXiv:1812.00899.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018a. Robust distant supervision relation extraction via deep reinforcement learning. In ACL, pages 2137–2147.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018b. DSGAN: Generative adversarial training for distant supervision relation extraction. In ACL.

Osman Ramadan, Paweł Budzianowski, and Milica Gasic. 2018. Large-scale multi-domain belief tracking with knowledge sharing. In ACL, pages 432–437.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Avik Ray, Yilin Shen, and Hongxia Jin. 2018. Robust spoken language understanding via paraphrasing. arXiv preprint arXiv:1809.06444.

Liliang Ren, Kaige Xie, Lu Chen, and Kai Yu. 2018. Towards universal dialogue state tracking. In EMNLP, pages 2780–2786.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In ACL, pages 856–865.

Sanuj Sharma, Prafulla Kumar Choubey, and Ruihong Huang. 2019. Improving dialogue state tracking by discerning the relevant context.

Satinder P Singh, Michael J Kearns, Diane J Litman, and Marilyn A Walker. 2000. Reinforcement learning for spoken dialogue systems. In NeurIPS, pages 956–962.

Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction, pages 329–331. MIT Press.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In EACL, pages 438–449.

Jason Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse, 7(3):4–33.

Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In SIGDIAL, pages 404–413.

Jason D Williams. 2008. Exploiting the ASR n-best by tracking multiple dialog state hypotheses. In Interspeech.

Jiawei Wu, Lei Li, and William Yang Wang. 2018. Reinforced co-training. In NAACL, pages 1252–1262.

Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017. DeepPath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, pages 564–573.

Kang Min Yoo, Youhyun Shin, and Sang-goo Lee. 2018. Data augmentation for spoken language understanding via joint variational generation. arXiv preprint arXiv:1809.02305.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In NeurIPS.

Shiqi Zhao, Xiang Lan, Ting Liu, and Sheng Li. 2009. Application-driven statistical paraphrase generation. In ACL, pages 834–842.

Victor Zhong, Caiming Xiong, and Richard Socher. 2018. Global-locally self-attentive encoder for dialogue state tracking. In ACL.

Victor Zue, Stephanie Seneff, James R Glass, Joseph Polifroni, Christine Pao, Timothy J Hazen, and Lee Hetherington. 2000. Jupiter: a telephone-based conversational interface for weather information. TASLP, pages 85–96.