Building & Evaluating Spoken Dialogue Systems
Discourse & Dialogue
CS 359
November 27, 2001
Agenda
• How to get started
  – System bootstrapping
    • “Wizard-of-Oz” design
    • Strengths & limitations
• How to tell if you succeeded
  – System evaluation
    • What you do & how you do it
    • Performance = Task success - Task cost
System Bootstrapping
• Question: How should we design a system?
  – What should it be able to understand?
• Key: How would people talk to it?
• Suggestion 1: Like people talk to each other?
  – Collect human-human interactions on the same task
  – But computers are NOT like people, so speakers act differently
  – Differences in politeness, assumed knowledge, style, complexity
  – Speakers adapt to the needs of the hearer
  – Balance the need for understanding against reducing effort
“Wizard-of-Oz” Studies
• Suggestion 2: Like people talk to computers!
  – Get application/domain-specific language
• But the system is NOT built yet!
  – Simulate the system, mediated through a human wizard
    • Fast, rigid/consistent, no small errors/typos
  – Structured simulations
    • Automate as much as possible
      – E.g. response editor: hierarchical menus/templates, access to different apps, query creator, time-stamped logging
Good Wizard Studies
• Requirements:
  – Background system:
    • Fully implemented or simulated
    • Allows some user initiative
  – Task:
    • Somewhat open “scenario”
    • Not too complex or private
  – Must be piloted:
    • Task scenario/simulation
Comparing Styles
• Human-human versus human-computer
  – H-H: more complex; H-C: simpler structure
  – Domain variability greater than individual variability
  – Vocabulary choice
  – Use of anaphora
• Question: Should you lie to the user?
  – Only way to get realistic behavior
  – Debrief: explain protocol, offer to destroy data
System Evaluation
• Question: Which design is better?
• Approach 1: Content-based measures
  – Task completion
  – Concept accuracy
  – Reference answer
    • Query result versus key
    • Limited: key reflects only one strategy; many alternatives exist
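As a rough illustration of the content-based measures above, the sketch below compares the attribute values a dialogue ended up with against a reference key. The attribute names and the travel scenario are invented for illustration; they are not from the lecture.

```python
def concept_accuracy(result, key):
    """Fraction of key attributes whose value in `result` matches the key."""
    if not key:
        raise ValueError("key must contain at least one attribute")
    correct = sum(1 for attr, value in key.items() if result.get(attr) == value)
    return correct / len(key)


def task_completed(result, key):
    """Strict task completion: every key attribute was captured correctly."""
    return all(result.get(attr) == value for attr, value in key.items())


# Hypothetical scenario key and the values one dialogue actually produced.
key = {"depart_city": "Chicago", "arrive_city": "Boston", "depart_time": "morning"}
result = {"depart_city": "Chicago", "arrive_city": "Austin", "depart_time": "morning"}

accuracy = concept_accuracy(result, key)
completed = task_completed(result, key)
print(f"concept accuracy = {accuracy:.2f}, task completed = {completed}")
```

Note that both measures score a dialogue only against the single key, which is the limitation the slide points out: a different but equally valid strategy would be scored as wrong.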
System Evaluation (cont’d)
• Not just accuracy, but efficiency
• Approach 2: Cost-based measures
  – Time to completion:
    • # of utterances
    • # of turns
    • Duration in seconds
  – Error measures:
    • # corrections, # repetitions
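The cost measures listed above can be tallied from a time-stamped dialogue log. The following is a minimal sketch, assuming a hypothetical log format in which each turn records its speaker, start and end times, utterance count, and hand-annotated correction/repetition flags; none of these field names come from the lecture.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str              # "user" or "system"
    start: float              # seconds from the start of the dialogue
    end: float
    utterances: int = 1       # utterances contained in this turn
    correction: bool = False  # annotated as a correction
    repetition: bool = False  # annotated as a repetition


def cost_measures(turns):
    """Aggregate the cost-based measures for one dialogue."""
    if not turns:
        raise ValueError("dialogue log is empty")
    return {
        "turns": len(turns),
        "utterances": sum(t.utterances for t in turns),
        "duration_s": max(t.end for t in turns) - min(t.start for t in turns),
        "corrections": sum(t.correction for t in turns),
        "repetitions": sum(t.repetition for t in turns),
    }


# Hypothetical five-turn dialogue.
log = [
    Turn("system", 0.0, 2.5),
    Turn("user", 3.0, 5.0),
    Turn("system", 5.5, 8.0, utterances=2),
    Turn("user", 8.5, 10.0, correction=True),
    Turn("system", 10.5, 12.0, repetition=True),
]

costs = cost_measures(log)
print(costs)
```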
Combining Measures
• Issues:
  – Generalization: factors affecting performance
  – Sub-dialogues: not just the WHOLE task
• PARADISE:
  – Separate what the agent does from how it does it
  – Performance = task success & dialogue costs
    • Performance => Usability => User satisfaction
    • Task success = operationalized as the Kappa (K) coefficient
    • Costs = efficiency, qualitative measures
Measuring Task Success
• AVM: Attribute Value Matrix
  – Captures info to be exchanged between user & system
  – “Key”: AVM instantiation for scenario
• K-coefficient calculated from confusion matrix
  – On-diagonal: match key; off-diagonal: misunderstood
• K = (P(A) - P(E)) / (1 - P(E))
  – P(A): proportion of agreement; P(E): proportion of agreement expected by chance
  – Actual agreement minus chance agreement, normalized by (1 - chance agreement)
• Pros: corrects for chance; can compare across tasks
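The K computation above can be sketched in a few lines. The slide does not say how P(E) is estimated; this sketch assumes it is taken from the column totals of the confusion matrix, as the sum of (column total / grand total) squared. Other kappa variants use both row and column marginals. The matrix values are made up.

```python
def kappa(confusion):
    """K = (P(A) - P(E)) / (1 - P(E)) for a square confusion matrix.

    confusion[i][j] counts how often key value i was understood as value j.
    P(E) is estimated from column totals (an assumption of this sketch).
    """
    n = len(confusion)
    if n == 0 or any(len(row) != n for row in confusion):
        raise ValueError("confusion matrix must be square and non-empty")
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix has no observations")
    p_agree = sum(confusion[i][i] for i in range(n)) / total
    col_totals = [sum(row[j] for row in confusion) for j in range(n)]
    p_chance = sum((t / total) ** 2 for t in col_totals)
    if p_chance == 1.0:
        raise ValueError("K is undefined when chance agreement is 1")
    return (p_agree - p_chance) / (1 - p_chance)


# Hypothetical counts for one attribute with two possible values.
matrix = [
    [20, 5],
    [10, 15],
]

k = kappa(matrix)
print(f"K = {k:.3f}")
```

Here P(A) = 35/50 = 0.70 and P(E) = 0.6² + 0.4² = 0.52, so raw agreement of 70% shrinks considerably once chance agreement is removed.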
Measuring Task Costs
• Define cost measures:
  – E.g. # utterances, # repairs
• Can compute across sub-dialogues
  – Match segment to purpose
    • Hierarchical structure: link to subtasks
    • Tag by AVM info goals
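To make the sub-dialogue idea concrete, the sketch below groups cost counts by the AVM attribute each utterance was tagged with. The flat (attribute, is_repair) tagging is a simplifying assumption; it ignores the hierarchical subtask structure mentioned above, and the attribute names are hypothetical.

```python
from collections import defaultdict


def costs_by_goal(tagged_utterances):
    """Sum utterance and repair counts per AVM attribute (information goal)."""
    totals = defaultdict(lambda: {"utterances": 0, "repairs": 0})
    for goal, is_repair in tagged_utterances:
        totals[goal]["utterances"] += 1
        totals[goal]["repairs"] += int(is_repair)
    return dict(totals)


# Hypothetical tagging: (AVM attribute the utterance serves, was it a repair?)
tagged = [
    ("depart_city", False),
    ("depart_city", False),
    ("arrive_city", False),
    ("arrive_city", True),
    ("arrive_city", False),
    ("depart_time", False),
]

per_goal = costs_by_goal(tagged)
for goal, costs in per_goal.items():
    print(goal, costs)
```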
Estimating Performance Fn
• Predicted measure: Performance
  – User satisfaction rating:
    • Rating: 1-6 on some question, or average over questions
• Predictor measures: Success & Costs
  – Normalize each to z-score
    • Handles varying scales
  – Apply multiple linear regression to compute weights
• Calculate for sub-dialogue: restrict K and costs to that sub-dialogue
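The estimation steps above (z-score normalization, then multiple linear regression against user satisfaction) might look like the following standard-library sketch. All data values are invented, the use of the population standard deviation is an arbitrary choice, and the small normal-equations solver stands in for whatever statistics package would be used in practice.

```python
from statistics import mean, pstdev


def zscores(values):
    """Normalize to mean 0 and standard deviation 1 (population std)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot normalize a constant predictor")
    return [(v - mu) / sigma for v in values]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_weights(predictors, satisfaction):
    """Least-squares weights: [intercept, one weight per z-scored predictor]."""
    columns = [[1.0] * len(satisfaction)] + [zscores(p) for p in predictors]
    xtx = [[sum(x * y for x, y in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(x * y for x, y in zip(ci, satisfaction)) for ci in columns]
    return solve(xtx, xty)


# Hypothetical per-dialogue measurements for eight dialogues.
kappa_scores = [0.9, 0.4, 0.7, 1.0, 0.2, 0.6, 0.8, 0.5]
utterances = [14, 25, 30, 12, 28, 16, 22, 20]
repairs = [1, 3, 0, 0, 4, 2, 3, 1]
satisfaction = [5, 3, 4, 6, 1, 4, 4, 3]  # ratings on the 1-6 scale

weights = fit_weights([kappa_scores, utterances, repairs], satisfaction)
for name, w in zip(["intercept", "K", "utterances", "repairs"], weights):
    print(f"{name}: {w:+.3f}")
```

Because every predictor is z-scored, the fitted weights are directly comparable in magnitude. One would expect a positive weight on K and negative weights on the cost measures, which matches the "task success minus cost" form of the performance function. Restricting the inputs to the utterances of one sub-dialogue gives the sub-dialogue version.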
Evaluation
• Applied to multiple tasks
  – Travel, Reservation/Purchase, Circuit-Fix-It
  – Define new AVM attributes
    • Match discourse structure
• Compare dialogue strategies
  – Explicit/Implicit confirmation
  – System/User/Mixed initiative
Summary
• Building for HCI
  – Human-human versus human-computer
  – Acquire vocabulary, structure, style
    • Base on “Wizard-of-Oz” simulation
• Evaluating strategies
  – Performance = task success - dialogue cost
  – Task success: agreement between response & key
    • Success level compensates for chance
  – Costs: number of repairs, utterances