A Scalable Reinforcement Learning Approach to Error Handling in Spoken Language Interfaces
Dan Bohus (www.cs.cmu.edu/~dbohus, [email protected])
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15217

TRANSCRIPT

Page 1: A Scalable Reinforcement Learning Approach to Error Handling in Spoken Language Interfaces

A Scalable Reinforcement Learning Approach to Error Handling in Spoken Language Interfaces

Dan Bohus
www.cs.cmu.edu/~dbohus
[email protected]

Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15217

Page 2

problem

spoken language interfaces lack robustness when faced with understanding errors.

Page 3

more concretely …

S: What city are you leaving from?
U: Urbana Champaign [OKAY IN THAT SAME PAY]
S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
U: Urbana Champaign [FOR MINUTE SINCE HEY]
S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
U: Chicago [CHICAGO]
S: traveling from Chicago. Where would you like to go?
U: Huntsville [SEOUL]
S: traveling to Seoul. What day did you need to travel?
U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
S: traveling in the afternoon. Okay, what day would you be departing Chicago?
U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
S: a flight tomorrow… I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
S: traveling on Saturday, August 12th … I have a flight departing Chicago at 1:40pm, arrives Seoul at ………

Page 4

problem source

stems mostly from speech recognition
spans most domains and interaction types
exacerbated by operating conditions:
  spontaneous speech
  medium / large vocabularies
  large, varied, and changing user populations

Page 5

speech recognition impact

typical word-error-rates:
  10-20% for native speakers (novice users)
  40% and above for non-native speakers

significant negative impact on performance [Walker, Sanders]

[plot: task success drops as word-error-rate increases]

Page 6

approaches for increasing robustness

fix recognition

gracefully handle errors through interaction:
  detect the problems
  develop a set of recovery strategies
  know how to choose between them (policy)

a closer look : RL in spoken dialog systems : current challenges : RL for error handling

Page 7

outline

a closer look at the problem
RL in spoken dialog systems
current challenges
a proposed RL approach for error handling


Page 8

non- and misunderstandings

[slide annotations: NONunderstanding and MISunderstanding labels point at turns in the example dialog]

(the example dialog from page 3 is repeated here, annotated to show which turns are non-understandings and which are misunderstandings)


Page 9

six not-so-easy pieces

misunderstandings:
  detection: recognition or semantic confidence scores
  strategies: explicit confirmation (Did you say 10am?); implicit confirmation (Starting at 10am… until what time?); accept; reject
  policy: confidence threshold model (thresholds on [0, 1]: reject | explicit confirm | implicit confirm | accept)

non-understandings:
  detection: typically trivial [some exceptions may apply]
  strategies: Sorry, I didn't catch that… Can you repeat that?; Can you rephrase that? You can say something like "at 10 a.m."; [MoveOn]
  policy: handcrafted heuristics (first notify, then ask repeat, then give help, then give up)
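The confidence threshold model above can be sketched as a simple mapping from a confidence score in [0, 1] to one of the four actions. The cut-point values below are purely illustrative, not thresholds used in any deployed system:

```python
# Hypothetical sketch of a confidence-threshold policy for misunderstandings.
# The three cut-points are assumed values for illustration only.
REJECT_T, EXPLICIT_T, IMPLICIT_T = 0.3, 0.6, 0.9

def handle_misunderstanding(confidence: float) -> str:
    """Map a recognition/semantic confidence score to an error-handling action."""
    if confidence < REJECT_T:
        return "reject"            # discard the hypothesis and re-ask
    elif confidence < EXPLICIT_T:
        return "explicit_confirm"  # "Did you say 10am?"
    elif confidence < IMPLICIT_T:
        return "implicit_confirm"  # "Starting at 10am... until what time?"
    else:
        return "accept"            # trust the hypothesis
```

Tuning where the cut-points fall is exactly the policy question the talk addresses: set them by hand, or learn them from data.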


Page 10

outline

a closer look at the problem
RL in spoken dialog systems
current challenges
a proposed RL approach for error handling


Page 11

spoken dialog system architecture

[diagram: pipeline — Speech Recognition → Language Understanding → Dialog Manager ↔ Domain Back-end; Dialog Manager → Language Generation → Speech Synthesis]


Page 12

reinforcement learning in dialog systems

[diagram: the same pipeline — Speech Recognition → Language Understanding → Dialog Manager ↔ Domain Back-end → Language Generation → Speech Synthesis — with the Dialog Manager receiving noisy semantic input and producing actions (semantic output)]

debate over design choices → learn choices using reinforcement learning

agent interacting with an environment:
  noisy inputs
  temporal / sequential aspect
  task success / failure


Page 13

NJFun

“Optimizing Dialog Management with Reinforcement Learning: Experiments with the NJFun System”

[Singh, Litman, Kearns, Walker]

provides information about “fun things to do in New Jersey”

slot-filling dialog: type-of-activity, location, time

provide information from a database


Page 14

NJFun as an MDP

define state-space
define action-space
define reward structure
collect data for training & learn policy
evaluate learned policy


Page 15

NJFun as an MDP: state-space

internal system state: 14 variables
state for RL → vector of 7 variables:

greet: has the system greeted the user
attribute: which attribute the system is currently querying
confidence: recognition confidence level (binned)
value: has a value been obtained for the current attribute
tries: how many times the current attribute was asked
grammar: non-restrictive or restrictive grammar was used
history: was there any trouble on previous attributes

62 different states
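The 7-variable state vector described above can be encoded as a small immutable record. This is an illustrative sketch: the field names follow the slide, but the value ranges are assumptions, not the exact binnings used in NJFun:

```python
from dataclasses import dataclass

# Illustrative encoding of NJFun's 7-variable RL state.
# Field names follow the slide; value ranges are assumed for illustration.
@dataclass(frozen=True)
class NJFunState:
    greet: int       # 0/1: has the system greeted the user
    attribute: int   # which attribute is currently being queried
    confidence: int  # binned recognition confidence level
    value: int       # 0/1: value obtained for the current attribute
    tries: int       # how many times the current attribute was asked
    grammar: int     # 0 = non-restrictive, 1 = restrictive grammar
    history: int     # 0/1: trouble on previous attributes

s = NJFunState(greet=1, attribute=2, confidence=0, value=0,
               tries=1, grammar=0, history=1)
```

Because each variable takes only a handful of values, the reachable state space stays tiny (62 states here), which is what makes tabular learning feasible.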


Page 16

NJFun as an MDP: actions & rewards

type of initiative (3 types):
  system initiative
  mixed initiative
  user initiative

confirmation strategy (2 types):
  explicit confirmation
  no confirmation

resulting MDP has only 2 action choices / state

reward: binary task success


Page 17

NJFun as an MDP: learning a policy

training data: 311 complete dialogs collected using an exploratory policy

learned the policy using value iteration:
  begin with user initiative
  back-off to mixed or system initiative when re-asking for an attribute
  specific type of back-off is different for different attributes
  confirm when confidence is low


Page 18

NJFun as an MDP: evaluation

evaluated policy on 124 testing dialogs:
  task success rate: 52% → 64%
  weak task completion: 1.72 → 2.18
  subjective evaluation: no significant improvements, but a move-to-the-mean effect

learned policy better than hand-crafted policies (policies comparatively evaluated on the learned MDP)


Page 19

outline

a closer look at the problem
RL in spoken dialog systems
current challenges
a proposed RL approach for error handling


Page 20

challenge 1: scalability

contrast NJFun with RoomLine:
  conference room reservation and scheduling
  mixed-initiative, task-oriented interaction
  system obtains a list of rooms matching initial constraints
  system negotiates with the user to identify the room that best matches their needs
  37 concepts (slots), 25 questions that can be asked

another example: LARRI

full-blown MDP is intractable
not clear how to do state-abstraction


Page 21

challenge 2: reusability

underlying MDP is system-specific
MDP design still requires a lot of human expertise
new MDP for each system
new training & new evaluation
are we really saving time & expertise?

maybe we’re asking for too much?


Page 22

addressing the scalability problem

approach 1: user models / simulations
  costly to obtain real data → simulate
  simplistic simulators [Eckert, Levin]
  more complex, task-specific simulators [Scheffler & Young]
  real-world evaluation becomes paramount

approach 2: value function approximation
  data-driven state abstraction / state aggregation [Denecke]


Page 23

outline

a closer look at the problem
RL in spoken dialog systems
current challenges
a proposed RL approach for error handling


Page 24

reinforcement learning in dialog systems

[diagram: the spoken dialog system pipeline again — Speech Recognition → Language Understanding → Dialog Manager ↔ Domain Back-end → Language Generation → Speech Synthesis — with semantic input and actions / semantic output at the Dialog Manager]

Focus RL only on the difficult decisions!


Page 25

task-decoupled approach

decouple:
  error handling decisions → use reinforcement learning
  domain-specific dialog control decisions → use your favorite DM framework

advantages:
  reduces the size of the learning problem
  favors reusability of learned policies
  lessens system authoring effort


Page 26

RavenClaw

[diagram: RavenClaw's two-tier architecture — a Dialogue Task (Specification) executed by a Domain-Independent Dialogue Engine. The RoomLine task tree: RoomLine → Login (Welcome, AskRegistered, AskName, GreetUser), GetQuery (DateTime, Location, Properties: Network, Projector, Whiteboard), GetResults, DiscussResults; concepts: user_name, registered, query, results. At runtime the engine maintains a Dialogue Stack (RoomLine, Login, AskRegistered) and an Expectation Agenda mapping grammar slots to concepts, e.g. registered: [No] → false, [Yes] → true; user_name: [UserName]; query.date_time: [DateTime]; query.location: [Location]; query.network: [Network]. An Error Handling Decision Process, driven by Error Indicators, invokes Strategies such as ExplicitConfirm.]


Page 27

decision process architecture

[diagram: the RoomLine task tree (RoomLine → Login → Welcome, AskRegistered, AskName, GreetUser; concepts user_name, registered) feeds a Gating Mechanism that consults one small MDP per concept (Concept-MDP) and per topic (Topic-MDP); each MDP suggests an action such as No Action or Explicit Confirmation, under an independence assumption between the models]

small-size models
parameters can be tied across models
accommodates dynamic task generation
favors reusability of policies
initial policies can be easily handcrafted
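The gating idea above can be sketched as follows: each concept gets its own small error-handling model, and a gate polls them and forwards the first non-trivial suggestion to the dialog engine. Class and function names here are illustrative, not RavenClaw's actual API:

```python
# Sketch of the gated, per-concept decision process described above.
# Names (ConceptMDP, gate, "no_action", ...) are hypothetical.
class ConceptMDP:
    """One small error-handling model scoped to a single concept."""
    def __init__(self, name, policy):
        self.name = name
        self.policy = policy  # maps a belief indicator to an action label

    def suggest(self, belief):
        return self.policy(belief)  # e.g. "explicit_confirm" or "no_action"

def gate(mdps, beliefs):
    """Poll each concept model in turn; forward the first non-trivial
    error-handling action, otherwise let the domain dialog manager proceed."""
    for m in mdps:
        action = m.suggest(beliefs[m.name])
        if action != "no_action":
            return m.name, action
    return None, "no_action"
```

Because each model sees only its own concept, the models stay tiny and their parameters (or whole policies) can be shared across concepts and across systems, which is exactly the reusability argument on this slide.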


Page 28

reward structure & learning

[diagram: two reward configurations for the gated set of MDPs]

option 1 — global, post-gate rewards:
  a single reward signal arrives after the gating mechanism
  an atypical, multi-agent reinforcement learning setting

option 2 — local rewards:
  each MDP receives its own reward signal
  multiple, standard RL problems
  risk: solving the local problems, but not the global one

rewards can be based on any dialogue performance metric
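Under the local-rewards option, each small MDP can be trained with an entirely standard update rule on its own reward signal. A minimal sketch, with illustrative state and action labels and default step-size/discount values:

```python
# One tabular Q-learning step for a single concept/topic MDP under the
# "local rewards" option. State/action labels and constants are illustrative.
def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    """Update Q[(s, a)] toward r + gamma * max_a' Q[(s2, a')]; return new value."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]
```

The appeal is that each model's learning problem is small and standard; the risk, as noted above, is that optimizing each local reward need not optimize the global dialogue outcome.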


Page 29

conclusion

reinforcement learning: a very appealing approach for dialog control

in practical systems, scalability is a big issue:
  how to leverage the knowledge we have?
  state-space design
  solutions that account for or handle sparse data
  bounds on policies
  hierarchical models

Page 30

thank you!

Page 31

Structure of Individual MDPs

[diagram: a concept MDP with belief states HC (high confidence), MC (medium confidence), LC (low confidence), and a 0/empty state; in each confidence state the available actions are ExplConf, ImplConf, and NoAct; in the 0/empty state, only NoAct]

concept MDPs:
  state-space: belief indicators
  action-space: concept-scoped system actions

topic MDPs:
  state-space: non-understanding, dialogue-on-track indicators
  action-space: non-understanding actions, topic-level actions
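The per-MDP structure above can be written down as a small state-to-actions table. The concept-MDP entries follow the diagram's labels (HC/MC/LC plus the empty state); the topic-MDP action names are assumptions drawn from the non-understanding strategies listed earlier in the talk, not names from the slide:

```python
# Illustrative encoding of the individual MDP structures sketched above.
# Concept-MDP state/action labels follow the diagram; the topic-MDP action
# names are assumed (based on the non-understanding strategies slide).
CONCEPT_MDP = {
    "HC":    ["ExplConf", "ImplConf", "NoAct"],  # high-confidence belief
    "MC":    ["ExplConf", "ImplConf", "NoAct"],  # medium-confidence belief
    "LC":    ["ExplConf", "ImplConf", "NoAct"],  # low-confidence belief
    "empty": ["NoAct"],                          # no value obtained yet (state 0)
}

TOPIC_MDP = {
    "non_understanding": ["AskRepeat", "AskRephrase", "GiveHelp", "MoveOn"],
    "on_track":          ["NoAct"],
}
```

Keeping every individual MDP this small is what makes the overall approach tractable: the policy for a 4-state, 3-action model can be learned from modest data, or simply handcrafted as a starting point.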