Post on 20-Dec-2015
ITCS 6010
VUI Evaluation: PARADISE & SUM
PARADISE Paradigm for Dialogue System Evaluation
Goal: Maximize User Satisfaction
PARADISE Paradigm for Dialogue System Evaluation
Performance is modeled as a weighted function of a task-based success measure and dialogue-based cost measures, where weights are computed by correlating user satisfaction with performance.
Dialogue tasks are represented as Attribute Value Matrix (AVM) pairs.
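The weighted performance function described above can be sketched in a few lines. This is a minimal illustration, not the exact procedure from the slides: the function names, the z-score normalization, and the example numbers are assumptions, and the weights are passed in by hand rather than fitted against user satisfaction scores (the PARADISE authors fit them by linear regression).

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Return z-scores so measures on different scales can be combined."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def performance(success, costs, alpha, cost_weights):
    """Per-dialogue performance: alpha * N(success) - sum_i w_i * N(cost_i).

    success      -- list of task-success scores, one per dialogue
    costs        -- dict mapping a cost name to a list of values, one per dialogue
    alpha        -- weight on task success (hypothetical value, normally fitted)
    cost_weights -- dict mapping a cost name to its weight (also hypothetical)
    """
    n_success = z_normalize(success)
    n_costs = {name: z_normalize(vals) for name, vals in costs.items()}
    scores = []
    for d in range(len(success)):
        penalty = sum(cost_weights[name] * n_costs[name][d] for name in n_costs)
        scores.append(alpha * n_success[d] - penalty)
    return scores


# Three made-up dialogues: higher success and fewer turns should score better.
scores = performance(
    success=[1.0, 0.5, 0.0],
    costs={"turns": [4, 8, 12]},
    alpha=0.5,
    cost_weights={"turns": 0.5},
)
```

Normalizing each measure before weighting keeps a cost counted in turns from swamping a success score that lies between 0 and 1.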
Attribute Value Matrix (AVM)
The AVM represents the dialogue task
Information exchange between agent and user
Represented as ordered pairs of attributes and possible values
The AVM instantiation of a task is the same regardless of the dialogue
AVM (cont’d)
A1: Hello, this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to go from Torino to Milano. [DC, AC]
A2: Do you want to go from Trento to Milano? Yes or No? [DC, AC]
U2: No. [DC, AC]
A3: Do you want to leave from Trento? Yes or No? [DC]
U3: No. [DC]
A4: Where do you want to leave from? [DC]
U4: I want to leave from Torino. [DC]
A5: Do you want to leave from Torino? Yes or No? [DC]
U5: Yes. [DC]
A6: Do you want to go to Milano? [AC]
U6: Yes. [AC]
A7: At which time do you want to leave? [DR]
U7: I want to travel in the evening. [DR]
A8: Do you want to leave between 6 and 9pm? Yes or No? [DR]
U8: Yes. [DR]
A9: There is a train leaving at 8:00pm. [DT]
….
B1: Hello, this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to travel from Torino to Milano. [DC, AC]
B2: At which time do you want to leave from Merano to Milano? [DC, AC, DR]
U2: No, I want to leave from Torino in the evening. [DC, DR]
B3: Do you want to leave from Torino between 6 and 11pm? Please answer Yes or No. [DC, DR]
U3: Yes. [DC, DR]
B4: A train leaves at 8:00pm. [DT]
….
attribute       actual value
depart-city     Torino
arrival-city    Milano
depart-range    Evening
depart-time     8pm
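One way to work with the AVM above is to store it as a mapping and compare it with the values a system actually ended up with at the end of a dialogue. The sketch below is only an illustration: `attribute_match_rate` is a hypothetical helper that reports the plain fraction of matching attributes, which is a simpler stand-in and not the task success measure PARADISE itself defines. The `observed` values are made up.

```python
# Key AVM for the train enquiry task, taken from the table above.
KEY_AVM = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "Evening",
    "depart-time": "8pm",
}


def attribute_match_rate(key, observed):
    """Fraction of key attributes whose observed value equals the key value."""
    if not key:
        raise ValueError("key AVM must contain at least one attribute")
    matches = sum(1 for attr, value in key.items() if observed.get(attr) == value)
    return matches / len(key)


# Hypothetical outcome: the system misrecognized the departure city.
observed = {**KEY_AVM, "depart-city": "Trento"}
rate = attribute_match_rate(KEY_AVM, observed)  # 3 of 4 attributes match
```

Because the key AVM is the same regardless of the dialogue, the same comparison can be applied to the dialogues of both agents.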
PARADISE Paradigm for Dialogue System Evaluation
Advantages
The PARADISE approach addresses both performance and user satisfaction
Disadvantages
Too complex to compute
Needs a large sample size up front
Alternative Approaches
What’s important?
Maximize User Satisfaction
Maximize Task Success
User Satisfaction
How do we measure user satisfaction?
Questionnaires
Interviews
Focus Groups
Task Success
How do we measure task success?
Logging Actual Use
Performance Measurement
Walkthroughs
Pilot Testing
Task Success
Establish AVMs for each dialogue and for the entire conversation.
Measure task success with respect to:
Task completion time
Accuracy or errors (e.g. misinterpretations)
Conclusions
PARADISE is good, but too complex!
Measure user satisfaction and task success.
What if user satisfaction is not the most relevant aspect?
Speech Usability Metric (SUM)
Uses 3 metrics:
User satisfaction
Accuracy
Task completion time
Removes the restriction that a single factor determines usability
Speech Usability Metric (SUM)
SUM = X * User Satisfaction + Y * Accuracy + Z * Completion Time
where X + Y + Z = 1 and X, Y, Z > 0
Weights determined by evaluator
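The SUM formula above can be sketched as a small function. The function name `sum_score` is hypothetical, and the sketch assumes each of the three metrics has already been rescaled to the range 0 to 1 with higher meaning better; the slides do not say how that rescaling is done. The weights in the example are arbitrary evaluator choices.

```python
def sum_score(satisfaction, accuracy, completion, x, y, z):
    """Weighted sum of three usability metrics.

    satisfaction, accuracy, completion -- scores assumed to lie in [0, 1],
                                          higher is better (an assumption)
    x, y, z -- evaluator-chosen weights; each must be > 0 and they must sum to 1
    """
    weights = (x, y, z)
    if any(w <= 0 for w in weights):
        raise ValueError("X, Y and Z must all be greater than 0")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("X + Y + Z must equal 1")
    return x * satisfaction + y * accuracy + z * completion


# An evaluator who cares most about satisfaction might pick 0.5 / 0.3 / 0.2.
score = sum_score(satisfaction=0.8, accuracy=0.9, completion=0.5,
                  x=0.5, y=0.3, z=0.2)  # about 0.77
```

Checking the weight constraints up front keeps scores comparable across evaluators who choose different weightings.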
User Satisfaction
Surveys
Questionnaires
Interviews
Accuracy
Misinterpretations: the system recognizes the wrong word
Out-of-vocabulary errors: words not in the system grammar
Wrong choice: the correct word is recognized, but the wrong path is chosen
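The three error types above could be tallied from a labeled interaction log and turned into a single accuracy figure. The slides do not give a formula, so the sketch below assumes accuracy is one minus the share of user turns that ended in any of the three errors; the label strings and the `accuracy` helper are hypothetical.

```python
from collections import Counter

# Hypothetical labels for the three error types listed above.
ERROR_TYPES = ("misinterpretation", "out_of_vocabulary", "wrong_choice")


def accuracy(turn_labels):
    """Return (accuracy, per-label counts) for a list of per-turn labels.

    Each label is either "ok" or one of ERROR_TYPES.
    Accuracy is assumed to be 1 - errors / turns.
    """
    if not turn_labels:
        raise ValueError("at least one turn is required")
    counts = Counter(turn_labels)
    errors = sum(counts[e] for e in ERROR_TYPES)
    return 1 - errors / len(turn_labels), counts


# Made-up log of ten user turns with one error of each type.
labels = ["ok", "misinterpretation", "ok", "ok", "out_of_vocabulary",
          "ok", "ok", "ok", "wrong_choice", "ok"]
acc, counts = accuracy(labels)  # 3 errors in 10 turns
```

Keeping the per-type counts alongside the score shows whether the recognizer, the grammar, or the dialogue logic is the main source of errors.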
Task Completion Time
Time to complete task
Time for expert to complete task (ETCT)
Maximum time to complete task (MTCT)
Expected time to complete task (ExTCT)
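The slides list these reference times but not how they combine into a score. One possible reading, shown here purely as an assumption, is to place a user's time on a 0 to 1 scale between the expert time (ETCT, best case) and the maximum time (MTCT, worst case). The `time_score` helper is hypothetical, and ExTCT is left out because its role is not stated.

```python
def time_score(actual, etct, mtct):
    """Map a completion time to [0, 1]: 1 at or below ETCT, 0 at or above MTCT.

    actual -- observed time to complete the task (same unit as etct and mtct)
    etct   -- time for an expert to complete the task
    mtct   -- maximum time allowed to complete the task
    """
    if not 0 < etct < mtct:
        raise ValueError("expected 0 < ETCT < MTCT")
    raw = (mtct - actual) / (mtct - etct)
    return max(0.0, min(1.0, raw))  # clip times outside the ETCT..MTCT range


# Made-up values in seconds: expert needs 30 s, the limit is 120 s.
score = time_score(actual=60, etct=30, mtct=120)
```

A score built this way rises as the task gets faster, so it can be weighted together with satisfaction and accuracy scores that also treat higher as better.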
Conclusion
SUM determines usability of a speech application
Utilizes 3 pre-defined metrics
Allows for greater flexibility