
ITCS 6010

VUI Evaluation: PARADISE & SUM

PARADISE Paradigm for Dialogue System Evaluation

Goal: Maximize User Satisfaction


Performance is modeled as a weighted function of a task-based success measure and dialogue-based cost measures, where weights are computed by correlating user satisfaction with performance.
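The slide gives the performance model only in words. In the published PARADISE framework it is usually written as Performance = alpha * N(success) - sum_i w_i * N(cost_i), where N is z-score normalization. The sketch below illustrates that form under stated assumptions: the function names, the weights, and the three sample dialogues are hypothetical and do not come from the slides, and the weights are passed in rather than estimated from user-satisfaction data.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Normalize values to zero mean and unit variance (the N(.) in PARADISE)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def paradise_performance(success, costs, alpha, cost_weights):
    """Per-dialogue performance: alpha * N(success) - sum_i w_i * N(cost_i).

    success: one task-success score per dialogue
    costs: dict mapping a cost name to a list with one value per dialogue
    alpha, cost_weights: assumed inputs here; PARADISE derives them by
    relating user satisfaction to the normalized measures.
    """
    norm_success = z_scores(success)
    norm_costs = {name: z_scores(vals) for name, vals in costs.items()}
    scores = []
    for d in range(len(success)):
        penalty = sum(cost_weights[name] * norm_costs[name][d] for name in costs)
        scores.append(alpha * norm_success[d] - penalty)
    return scores

# Illustrative numbers only (not from the slides): three dialogues.
success = [1.0, 0.5, 0.0]
costs = {"turns": [4, 10, 16]}
performance = paradise_performance(
    success, costs, alpha=0.5, cost_weights={"turns": 0.3}
)
```

With these made-up inputs the first dialogue (high success, few turns) scores highest and the last scores lowest, which is the ordering the weighted model is meant to produce.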

Dialogue tasks are represented as Attribute Value Matrix (AVM) pairs.

Attribute Value Matrix (AVM)

AVM represents the dialogue task

Information exchange between agent and user

Represented as ordered pairs of attributes and possible values

AVM instantiation of the task

Same regardless of dialogue

AVM (cont’d)

A1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to go from Torino to Milano. [DC, AC]
A2: Do you want to go from Trento to Milano? Yes or No? [DC, AC]
U2: No. [DC, AC]
A3: Do you want to leave from Trento? Yes or No? [DC]
U3: No. [DC]
A4: Where do you want to leave from? [DC]
U4: I want to leave from Torino. [DC]
A5: Do you want to leave from Torino? Yes or No? [DC]
U5: Yes. [DC]
A6: Do you want to go to Milano? [AC]
U6: Yes. [AC]
A7: At which time do you want to leave? [DR]
U7: I want to travel in the evening. [DR]
A8: Do you want to leave between 6 and 9pm? Yes or No? [DR]
U8: Yes. [DR]
A9: There is a train leaving at 8:00pm. [DT]
….

B1: Hello this is Train Enquiry Service. Please speak after the tone. Which information do you need? [DC, AC, DR, DT]
U1: I want to travel from Torino to Milano. [DC, AC]
B2: At which time do you want to leave from Merano to Milano? [DC, AC, DR]
U2: No, I want to leave from Torino in the evening. [DC, DR]
B3: Do you want to leave from Torino between 6 and 11pm? Please answer Yes or No. [DC, DR]
U3: Yes. [DC, DR]
B4: A train leaves at 8:00pm. [DT]
….
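The attribute tags on each utterance (DC, AC, DR, DT, presumably abbreviating depart-city, arrival-city, depart-range and depart-time) make it possible to compare how much dialogue each system spends per attribute. The following is a minimal sketch, assuming utterance count is used as the cost measure; the tag assignments are transcribed from the two dialogues above, and the function name is hypothetical.

```python
from collections import Counter

# Attribute tags per utterance, transcribed from the two example dialogues.
DIALOGUE_A = {
    "A1": ["DC", "AC", "DR", "DT"], "U1": ["DC", "AC"],
    "A2": ["DC", "AC"], "U2": ["DC", "AC"],
    "A3": ["DC"], "U3": ["DC"],
    "A4": ["DC"], "U4": ["DC"],
    "A5": ["DC"], "U5": ["DC"],
    "A6": ["AC"], "U6": ["AC"],
    "A7": ["DR"], "U7": ["DR"],
    "A8": ["DR"], "U8": ["DR"],
    "A9": ["DT"],
}
DIALOGUE_B = {
    "B1": ["DC", "AC", "DR", "DT"], "U1": ["DC", "AC"],
    "B2": ["DC", "AC", "DR"], "U2": ["DC", "DR"],
    "B3": ["DC", "DR"], "U3": ["DC", "DR"],
    "B4": ["DT"],
}

def utterances_per_attribute(dialogue):
    """Count how many utterances are tagged with each attribute."""
    counts = Counter()
    for tags in dialogue.values():
        counts.update(tags)
    return counts

counts_a = utterances_per_attribute(DIALOGUE_A)
counts_b = utterances_per_attribute(DIALOGUE_B)
```

Under this reading, system A spends 10 utterances on the departure city against 6 for system B, and the shown excerpts run to 17 and 7 utterances respectively.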

attribute       actual value
depart-city     Torino
arrival-city    Milano
depart-range    Evening
depart-time     8pm
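The AVM above maps naturally onto a dictionary of attribute and value pairs. The sketch below assumes a simple matching score, the fraction of attributes whose value agrees with the key, as a stand-in for task success. PARADISE itself uses a chance-corrected agreement measure, so this is an illustration only; the function name and the misrecognition scenario are hypothetical.

```python
# AVM key for the train-enquiry task: attribute -> actual value (from the table).
AVM_KEY = {
    "depart-city": "Torino",
    "arrival-city": "Milano",
    "depart-range": "Evening",
    "depart-time": "8pm",
}

def attribute_accuracy(key, observed):
    """Fraction of key attributes whose observed value matches the key."""
    matches = sum(1 for attr, value in key.items() if observed.get(attr) == value)
    return matches / len(key)

# Hypothetical outcome: the system ends up with Trento as the departure city.
observed = {**AVM_KEY, "depart-city": "Trento"}
score = attribute_accuracy(AVM_KEY, observed)
```

Because the key is the same regardless of which dialogue strategy produced the values, the same comparison applies to both example systems.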


Advantages

PARADISE approach addresses both performance and user satisfaction

Disadvantages

Too complex to compute

Needs a large sample size up front

Alternative Approaches

What’s important?

Maximize User Satisfaction

Maximize Task Success

User Satisfaction

How do we measure user satisfaction?

Questionnaires

Interviews

Focus Groups

Task Success

How do we measure task success?

Logging Actual Use

Performance Measurement

Walkthroughs

Pilot Testing

Task Success

Establish AVMs for each dialogue and for the entire conversation.

Measure task success with respect to:

Task completion time

Accuracy or errors (e.g. misinterpretations)

Conclusions

PARADISE is good, but too complex!

Measure user satisfaction and task success.

What if user satisfaction is not the most relevant aspect?

Speech Usability Metric (SUM)

Uses 3 metrics:

User satisfaction

Accuracy

Task completion time

Removes the restriction of relying on a single factor to determine usability

Speech Usability Metric (SUM)

SUM = X * User Satisfaction + Y * Accuracy + Z * Completion Time

X + Y + Z = 1

X, Y, Z > 0

Weights determined by evaluator
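The SUM formula and its weight constraints can be sketched in a few lines. This example assumes each of the three component scores has already been scaled to the range 0 to 1 with higher meaning better; the slides do not specify the scaling, and the function name and sample values are hypothetical.

```python
import math

def speech_usability_metric(satisfaction, accuracy, completion_time, x, y, z):
    """SUM = X * satisfaction + Y * accuracy + Z * completion_time.

    The component scores are assumed to be pre-scaled to [0, 1].
    The weights must satisfy X + Y + Z = 1 and X, Y, Z > 0.
    """
    weights = (x, y, z)
    if any(w <= 0 for w in weights):
        raise ValueError("X, Y and Z must all be greater than 0")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("X + Y + Z must equal 1")
    return x * satisfaction + y * accuracy + z * completion_time

# Illustrative values only: an evaluator who weights satisfaction most heavily.
score = speech_usability_metric(0.8, 0.9, 0.6, x=0.5, y=0.3, z=0.2)
```

Changing the weights lets the evaluator shift emphasis, for example toward accuracy for a safety-critical application, without changing how the components are measured.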

User Satisfaction

Surveys

Questionnaires

Interviews

Accuracy

Misinterpretations: System recognizes wrong word

Out-of-vocabulary errors: Words not in system grammar

Wrong choice: Correct word recognized, wrong path chosen

Task Completion Time

Time to complete task

Time for expert to complete task (ETCT)

Maximum time to complete task (MTCT)

Expected time to complete task (ExTCT)
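The slides list these time measures but do not say how they are combined into a single completion-time score. One possible normalization, shown purely as an assumption, treats the expert time (ETCT) as the best case and the maximum time (MTCT) as the worst case and scales the observed time linearly between them. The function name and the choice of a linear scale are hypothetical, and ExTCT is not used in this sketch.

```python
def completion_time_score(task_time, etct, mtct):
    """Map an observed task time to a score in [0, 1].

    Assumed scheme (not from the slides): 1.0 at or below the expert
    time (ETCT), 0.0 at or above the maximum time (MTCT), linear between.
    """
    if mtct <= etct:
        raise ValueError("MTCT must be greater than ETCT")
    raw = (mtct - task_time) / (mtct - etct)
    return max(0.0, min(1.0, raw))

# Illustrative values in seconds: expert needs 30 s, cutoff is 120 s.
midpoint = completion_time_score(75, etct=30, mtct=120)
```

A score built this way is already on a 0 to 1 scale with higher meaning better, so it could feed directly into the weighted sum as the Completion Time term.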

Conclusion

SUM determines usability of a speech application

Utilizes 3 pre-defined metrics

Allows for greater flexibility
