Characterizing Task-Oriented Dialog using a Simulated ASR Channel
Jason D. Williams
Machine Intelligence Laboratory, Cambridge University Engineering Department
• Motivation for the data collection
• Experimental set-up
• Transcription & Annotation
• Effects of ASR error rate on…
– Turn length / dialog length
– Perception of error rate
– Task completion
– “Initiative”
– Overall satisfaction (PARADISE)
SACTI-1 Corpus: Simulated ASR Channel – Tourist Information
ASR channel vs. HH channel
Properties of HH dialog:
• Frequent but brief overlaps
• 80% of utterances contain fewer than 12 words; 50% fewer than 5
• Approximately equal turn length
• Approximately equal balance of initiative
• About half of turns are ACK (often spliced)

Properties of the ASR channel:
• Few overlaps
• Longer system turns; shorter user turns
• Initiative more often with the system
• Virtually no turns are ACK
• Virtually no splicing
Observations
Are models of HC dialog/grounding appropriate in the presence of the ASR channel?
HH channel:
• “Instant” communication
• Effectively perfect recognition of words
• Prosodic information carries additional information

ASR channel:
• Turns explicitly segmented
• Barge-in, end-pointed
• Prosody virtually eliminated
• ASR & parsing errors
My approach
1. Study the ASR channel in the abstract
– WoZ experiments using a simulated ASR channel
– Understand how people behave with an “ideal” dialog manager
• For example, the grounding model
• Use these insights to inform state space and action set selection
– Note that the collected data has unique properties useful for:
• RL-based systems
• Hidden-state estimation
• User modeling
2. Formulate the dialog management problem as a POMDP (see the sketch below)
– Decompose the state into BN nodes, for example:
• Conversation state (grounding state)
• User action
• User belief (goal)
– Train using the collected data
– Solve using approximations
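The decomposition in step 2 can be made concrete in a few lines of code. The following is a minimal sketch of the factored belief update it implies: the three state factors come from the slide, while the toy value sets, the hand-set conditional models, and the assumption that the user's goal stays fixed within a dialog are illustrative assumptions, not the models trained on SACTI-1.

```python
# A minimal sketch of the factored POMDP belief update. The three
# factors (grounding state, user action, user goal) come from the
# slide; everything else here is an illustrative assumption.
import itertools

GOALS = ["hotel", "restaurant"]          # user belief (goal) node
USER_ACTS = ["inform", "other"]          # user action node
GROUNDING = ["grounded", "ungrounded"]   # conversation/grounding state node


def belief_update(belief, system_act, asr_obs, p_user_act, p_grounding, p_obs):
    """One Bayesian update of the belief over (goal, user_act, grounding).

    p_user_act(a, goal, system_act), p_grounding(g2, g1, a) and
    p_obs(obs, a) are the factored transition/observation models;
    the goal is assumed fixed within a dialog.
    """
    new_belief = {}
    for goal, act, grd in itertools.product(GOALS, USER_ACTS, GROUNDING):
        total = 0.0
        # Sum over the previous user action and grounding state.
        for prev_act, prev_grd in itertools.product(USER_ACTS, GROUNDING):
            total += (belief[(goal, prev_act, prev_grd)]
                      * p_user_act(act, goal, system_act)
                      * p_grounding(grd, prev_grd, act))
        new_belief[(goal, act, grd)] = p_obs(asr_obs, act) * total
    norm = sum(new_belief.values()) or 1.0
    return {k: v / norm for k, v in new_belief.items()}


# Toy usage: uniform prior, hand-set conditional models.
uniform = {s: 1 / 8 for s in itertools.product(GOALS, USER_ACTS, GROUNDING)}
b1 = belief_update(uniform, "ask_goal", "inform",
                   p_user_act=lambda a, g, sa: 0.8 if a == "inform" else 0.2,
                   p_grounding=lambda g2, g1, a: 0.5,
                   p_obs=lambda o, a: 0.7 if o == a else 0.3)
```

Factoring the state this way keeps each conditional distribution small enough to estimate from a corpus of this size.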
The paradox of “dialog data”
• To build a user model, we need to see the user’s reaction to all kinds of misunderstandings
• However, most systems use a fixed policy
– Systems typically do not take different actions in the same situation
– Taking random actions is clearly not an option!
– Constraining actions means building very complex systems…
• … and which actions should be in the system’s repertoire?
An ideal data collection…
• …would show users’ reactions to a variety of error handling strategies (no fixed policy), BUT would not produce nonsense dialogs!
• …would use the ASR channel
• …would explore a variety of operating conditions, e.g., WER
• …would not assume a particular state space
• …would somehow “discover” the set of system actions
Data collection set-up
[Diagram: the subject speaks through an end-pointer; a typist transcribes the subject’s speech; an ASR-noise generator corrupts the transcription before it reaches the wizard.]
• Wizard’s screen shows the corrupted input
• Wizard talks into a microphone
ASR simulation state machine
• Simple energy-based barge-in
• User interrupts wizard
[State diagram: states SILENCE, USER_TALKING, WIZARD_TALKING, TYPIST_TYPING; transitions: wizard starts/stops talking, user starts/stops talking (the user starting to talk interrupts the wizard), typist done / reco result displayed. A runnable sketch of this machine follows below.]
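Reconstructed as code, the diagram reduces to a small transition table. This is a sketch: the state names and the barge-in edge come from the slide, but the event names are paraphrases of the edge labels, and any edges not shown in the diagram are omitted.

```python
# Sketch of the four-state barge-in machine from the diagram above.
TRANSITIONS = {
    ("SILENCE", "wizard_starts_talking"): "WIZARD_TALKING",
    ("SILENCE", "user_starts_talking"): "USER_TALKING",
    ("WIZARD_TALKING", "wizard_stops_talking"): "SILENCE",
    # Energy-based barge-in: the user starting to talk cuts the wizard off.
    ("WIZARD_TALKING", "user_starts_talking"): "USER_TALKING",
    ("USER_TALKING", "user_stops_talking"): "TYPIST_TYPING",
    # Typist finishes; the (corrupted) recognition result is displayed.
    ("TYPIST_TYPING", "typist_done"): "SILENCE",
}


def step(state, event):
    """Advance the machine; events with no edge leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


assert step("WIZARD_TALKING", "user_starts_talking") == "USER_TALKING"
```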
ASR simulation
• Simplified FSM-based recognizer
– Weighted finite state transducer (WFST)
• Flow (sketched in code below):
– Reference input
– Spell-checked against full dictionary
– Converted to a phonetic string using the full dictionary
– Phonetic lattice generated based on a confusion model
– Word lattice produced
– Language model composed to re-score the lattice
– “Decoded” to produce word strings
– N-best list extracted
• Various free variables to induce random behavior
• Plumb the N-best list for variability
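To make the data flow concrete, here is a schematic sketch of that pipeline. Every stage is a deliberately trivial stand-in (a toy dictionary, substitution-only phone confusion, and greedy word matching in place of lattices, LM rescoring, and spell-checking), so it illustrates the flow rather than the actual WFST models.

```python
import random

# Toy pronunciation dictionary and phone confusion table (assumptions;
# the real system uses a full dictionary and a trained WFST model).
DICTIONARY = {"pizza": ["p", "iy", "t", "s", "ax"],
              "visa":  ["v", "iy", "z", "ax"],
              "need":  ["n", "iy", "d"]}
CONFUSABLE = {"p": ["p", "b", "t"], "t": ["t", "d"], "s": ["s", "z"]}


def to_phones(word):
    """Reference word -> phone string (spell-check step omitted)."""
    return DICTIONARY.get(word, [])


def confuse(phones, p_err=0.3):
    """Phone confusion model: substitutions only, for brevity; the real
    lattice also models insertions and deletions."""
    return [random.choice(CONFUSABLE.get(ph, [ph]))
            if random.random() < p_err else ph
            for ph in phones]


def n_best(phones, n=3):
    """Rank dictionary words against the corrupted phones: a crude
    stand-in for word-lattice generation plus LM rescoring."""
    def score(word):
        ref = DICTIONARY[word]
        return sum(a == b for a, b in zip(ref, phones)) / max(len(ref), len(phones))
    return sorted(DICTIONARY, key=score, reverse=True)[:n]


# One reference word through the whole flow:
print(n_best(confuse(to_phones("pizza"))))
```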
ASR simulation evaluation
• Hypothesis: the simulation produces errors similar to errors induced by additive noise, w.r.t. concept accuracy
• Assess concept accuracy as F-measure, using an automated, data-driven procedure (HVS model)
• Plot concept accuracy of:
– Real additive noise
– A naïve confusion model (simple insertion, substitution, deletion)
– The WFST confusion model
• The WFST model appears to follow the real data much more closely than the naïve model
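For reference, concept accuracy as F-measure reduces to set overlap once concepts are extracted. A minimal sketch follows; extraction is left as an input because the automated HVS-based extractor is not reproduced here, and the (slot, value) representation is an assumption.

```python
def concept_f_measure(reference, hypothesis):
    """F-measure between reference and hypothesised concept sets,
    e.g. sets of (slot, value) pairs."""
    ref, hyp = set(reference), set(hypothesis)
    if not ref and not hyp:
        return 1.0
    correct = len(ref & hyp)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


# e.g. one concept kept, one lost to the channel:
print(concept_f_measure({("food", "pizza"), ("area", "centre")},
                        {("area", "centre")}))  # -> 0.667
```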
Scenario & Tasks
• Tourist / tourist information scenario
– Intentionally goal-directed
– Intentionally simple tasks
– Mixtures of simple information gathering and basic planning
• Wizard (information giver)
– Access to bus times, tram times, restaurants, hotels, bars, tourist attraction information, etc.
• User given a series of tasks
• Likert scores collected at the end of each task
• 4 dialogs per user; 3 users per wizard
Example task: Finding the perfect hotel
You’re looking for a hotel for you and your travelling partner that meets a number of requirements. You’d like the following:
• En suite rooms
• Quiet rooms
• As close to the main square as possible
Given those desires, find the least expensive hotel. You’d prefer not to compromise on your requirements, but of course you will if you must! Please indicate the location of the hotel on the map and fill in the boxes below:
• Name of accommodation
• Cost per night for 2 people
[Figure: the user’s map and the wizard’s map of the fictional town. Landmarks: shopping area, tourist info, post office, cinema, park, fountain, tower, castle, main square, Art Square, museum. Streets: Middle Road, Red Bridge, Bridge Nine, Cascade Road, Park Road, Riverside Drive, South Bridge, Castle Loop, Fountain Road, West Loop, Tower Hill Road, Alexander Street, North Road, Nine Street, Main Street. Hotels H1–H6, restaurants R1–R6, and bars B1–B6 are marked.]
Likert-scale questions
• User and wizard were each given 6 questions after each task
• Subject example:
1. In this task, I accomplished the goal.
2. In this task, I thought the speech recognition was accurate.
3. In this task, I found it difficult to communicate because of the speech recognition.
4. In this task, I believe the other subject was very helpful.
5. In this task, the other subject found using the speech recognition difficult.
6. Overall, I was very satisfied with this past task.
• 7-point response scale: Disagree strongly (1), Disagree (2), Disagree somewhat (3), Neither agree nor disagree (4), Agree somewhat (5), Agree (6), Strongly agree (7)
Transcription
• User side transcribed during the experiments
– Prioritized for speed
– Example: "I NEED UH I’M LOOKING FOR A PIZZA"
• Wizard side transcribed using a subset of the LDC transcription guidelines – more detail
– Example: "ok %uh% (()) sure you -- i can" (epErrorEnd=true)
Annotation (acts)
• Each turn is a sequence of tags
• Inspired by Traum’s “Grounding Acts”
– More detailed / easier to infer from surface words

Tag                 Meaning
Request             Question/request requiring a response
Inform              Statement/provision of task information
Greet-Farewell      “Hello”, “How can I help”, “That’s all”, “Thanks”, “Goodbye”, etc.
ExplAck             Explicit statement of acknowledgement, showing the speaker’s understanding of the other speaker (OS)
Unsolicited-Affirm  Explicit statement of acknowledgement, showing the OS understands the speaker
HoldFloor           Explicit request for the OS to wait
ReqRepeat           Request for the OS to repeat their last turn
ReqAck              Request for the OS to show understanding
RspAffirm           Affirmative response to ReqAck
RspNegate           Negative response to ReqAck
StateInterp         A statement of the intention of the OS
DisAck              Show of lack of understanding of the OS
RejOther            Display of lack of understanding of the speaker’s intention or desire by the OS
Annotation (understanding)
• Each wizard turn was labeled to indicate whether the wizard understood the previous user turn

Label           Wizard’s understanding of previous user turn
Full            All intentions understood correctly.
Partial         Some intentions understood; none misunderstood.
Non             Wizard made no guess at user intention.
Flagged-Mis     Wizard formed an incorrect hypothesis of the user’s meaning, and signalled a dialog problem.
Un-Flagged-Mis  Wizard formed an incorrect hypothesis of the user’s meaning, accepted it as correct, and continued with the dialog.
Corpus summary

WER target   # Wizards   # Users   # Tasks   Completed in time limit   Per-turn WER   Per-dialog WER
None         2           6         24        83 %                      0 %            0 %
Low          4           12        48        83 %                      32 %           28 %
Med          4           12        48        77 %                      46 %           41 %
Hi           2           6         24        42 %                      63 %           60 %
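For reference, WER here is the standard word-level edit distance over reference length. The sketch below shows both aggregations; reading "per-turn" as the mean of turn-level rates and "per-dialog" as errors pooled over the whole dialog is my assumption about the table's two columns.

```python
def edits(r, h):
    """Word-level Levenshtein distance between token lists r and h."""
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = min(d[j] + 1,                       # deletion
                      d[j - 1] + 1,                   # insertion
                      prev + (r[i - 1] != h[j - 1]))  # substitution
            prev, d[j] = d[j], cur
    return d[len(h)]


def per_turn_wer(turns):
    """Mean of per-turn WERs; turns is a list of (ref, hyp) token lists."""
    return sum(edits(r, h) / max(len(r), 1) for r, h in turns) / len(turns)


def per_dialog_wer(turns):
    """Errors pooled over the whole dialog before dividing."""
    return (sum(edits(r, h) for r, h in turns)
            / max(sum(len(r) for r, _ in turns), 1))
```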
Perception of ASR accuracy
How accurately do users & wizards perceive WER?
• Perceptions of recognition quality broadly reflected actual performance, but users consistently gave higher quality scores than wizards for the same WER
[Chart: average Likert score for perceived recognition quality vs. WER target (Hi, Med, Low, None), wizard vs. user.]
Average turn length (words)
How does WER affect wizard & user turn length?
• Wizard turn length increases with WER
• User turn length stays relatively constant
[Chart: average words per turn vs. WER target (Hi, Med, Low, None), wizard vs. user.]
Grounding behavior
How does WER affect wizard grounding behavior?
• As WER increases, wizard grounding behaviors become increasingly prevalent
[Chart: breakdown of all tags (Inform, ExplAck, Request, ReqRepeat, ReqAck, StateInterp, DisAck, all others) vs. WER target (Hi, Med, Low, None).]
Wizard understanding
How does WER affect wizard understanding status?
• Misunderstanding increases with WER…
• …and task completion falls (83 %, 83 %, 77 %, 42 %)
[Chart: percentage of wizard turns with each understanding label (Full, Partial, Non, Flagged-Mis, Un-Flagged-Mis) vs. WER target (Hi, Med, Low, None).]
Wizard strategies
• Classify each wizard turn into one of 5 “strategies”

Label     Meaning                                   Wiz init?   Tags
REPAIR    Attempt to repair                         Yes         ReqAck, ReqRepeat, StateInterp, DisAck, RejOther
ASKQ      Ask task question                         Yes         Request
GIVEINFO  Provide task info                         No          Inform
RSPND     Non-initiative-taking grounding actions   No          ExplAck, RspAffirm, RspNegate, Unsolicited-Affirm
OTHER     Not included in analysis                  n/a         All others
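A minimal sketch of this tag-to-strategy mapping follows. The tag sets come from the table; how a turn carrying tags from several strategies is resolved is not stated on the slide, so the priority order below is an assumption.

```python
# Tag sets from the strategy table above; priority order is assumed.
STRATEGY_TAGS = {
    "REPAIR":   {"ReqAck", "ReqRepeat", "StateInterp", "DisAck", "RejOther"},
    "ASKQ":     {"Request"},
    "GIVEINFO": {"Inform"},
    "RSPND":    {"ExplAck", "RspAffirm", "RspNegate", "Unsolicited-Affirm"},
}
PRIORITY = ["REPAIR", "ASKQ", "GIVEINFO", "RSPND"]


def classify_turn(tags):
    """Map a wizard turn's tag sequence to one of the 5 strategies."""
    present = set(tags)
    for strategy in PRIORITY:
        if present & STRATEGY_TAGS[strategy]:
            return strategy
    return "OTHER"


print(classify_turn(["ExplAck", "Request"]))  # -> "ASKQ"
```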
Wizard strategies
Which strategy is most successful after known dialog trouble?
• This plot shows wizard understanding status one turn after known dialog trouble: the effect of REPAIR vs. ASKQ
• ‘S’ indicates significant differences
[Chart: percentage of wizard turns with each understanding label (Full, Partial, Non, Flagged-Mis, Un-Flagged-Mis) for REPAIR vs. ASKQ at the Hi and Med WER targets; significant differences marked ‘S’.]
User reactions to misunderstandings
How does a user respond after being misunderstood?
• Surprisingly little explicit indication!

User turns including tag:
WER target   DisAck   RejOther   Request
None         N/A      N/A        N/A
Low          0.0 %    3.8 %      92.3 %
Med          2.5 %    19.0 %     75.9 %
Hi           0.0 %    12.3 %     87.0 %
Level of wizard “initiative”
How does “initiative” vary with WER?
• Define wizard “initiative” using the strategies above
[Chart: breakdown of wizard turns by strategy (REPAIR, ASKQ, RSPND, GIVEINFO) vs. WER target (Hi, Med, Low, None).]
Reward measures / PARADISE
• Satisfaction = task completion + dialog cost metrics (a regression sketch follows below)
• 2 kinds of user satisfaction: Single, Combi
• 3 kinds of task completion: User, Obj, Hyb
• Cost metrics: PerDialogWER, %UnFlaggedMis, %FlaggedMis, %Non, Turns, %REPAIR, %ASKQ
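A sketch of the PARADISE-style fit these metrics feed into: satisfaction regressed on standardized predictors by least squares. This is my reconstruction of the standard PARADISE procedure, not the authors' scripts, and the stepwise significance testing that selects predictors is omitted.

```python
import numpy as np


def paradise_fit(satisfaction, features):
    """Least-squares fit of satisfaction on z-scored predictors.

    satisfaction: (n,) array of per-dialog satisfaction scores.
    features: dict name -> (n,) array (e.g. Task, Turns, %UnFlaggedMis).
    Returns {'intercept': w0, name: weight, ...}.
    """
    names = list(features)
    # z-score each predictor, as is usual in PARADISE analyses.
    cols = [(features[k] - features[k].mean()) / (features[k].std() or 1.0)
            for k in names]
    X = np.column_stack([np.ones(len(satisfaction))] + cols)
    w, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
    return dict(zip(["intercept"] + names, w))
```

The stepwise selection that decides which predictors are "significant" (the right-hand column of the next slide) would sit on top of this fit.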
Reward measures / PARADISE
• In almost all experiments using the User task completion metric, it was the only significant predictor
• The Single/Combi metrics almost always selected the same predictors

Dataset   Task & sat. metric   R²     Significant predictors
ALL       User-S               52 %   1.03 Task
ALL       User-C               60 %   5.29 Task – 1.54 %UnFlagMis
ALL       Obj-S                24 %   -0.49 Turns + 0.38 Task
ALL       Obj-C                27 %   -2.43 Turns – 1.45 %UnFlagMis + 1.35 Task
ALL       Hyb-S                41 %   0.74 Task – 0.36 Turns
Hi        Obj-S                40 %   0.98 Task
Hi        Hyb-S                48 %   1.07 Task
Med       Obj-S                16 %   -0.62 %Non
Med       Obj-C                37 %   -3.35 %Non – 2.94 Turns
Med       Hyb-S                38 %   0.97 Task
Low       Obj-S                28 %   -0.59 Turns
Low       Hyb-S                40 %   -0.49 Turns + 0.40 Task
Reward measures / PARADISE
What indicators best predict user satisfaction?
• When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction
– %UnFlaggedMis serves as a better measurement of understanding accuracy than WER alone, since it effectively combines recognition accuracy with a measure of confidence
• Broadly speaking:
– Task completion is most important at the high WER level
– Task completion and dialog quality are most important at the medium WER level
– Efficiency is most important at the low WER level
• These patterns mirror findings from other PARADISE experiments using human/computer data
• This gives us some confidence that this data set is valid for training human/computer systems
Conclusions / Next steps
• At moderate WER levels, asking task-related questions appears to be more successful than direct dialog repair
• Levels of expert “initiative” increase with WER, primarily as a result of grounding behavior
• Users infrequently give a direct indication of having been misunderstood, with no clear correlation to WER
• When run on all data, mixtures of Task, Turns, and %UnFlaggedMis best predict user satisfaction
• Task completion appears to be most predictive of user satisfaction; however, efficiency shows some influence at lower WERs
• Next: apply this corpus to statistical systems