Towards a Method For Evaluating Naturalness in Conversational Dialog Systems
Victor Hung, Miguel Elvir, Avelino Gonzalez & Ronald DeMara
Intelligent Systems Laboratory, University of Central Florida
IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, Texas, October 12, 2009
University of Central Florida www.ucf.edu
Agenda
Introduction
Background
Approach
Project LifeLike
Introduction
Interactive conversation agent evaluation
Cannot rely solely on quantitative methods
Subjectivity in 'naturalness'
No general method exists to judge how well a conversation agent performs
Pivotal focus is defining naturalness: how well a chatbot can maintain a natural conversation flow
The LifeLike virtual avatar project serves as a backdrop, providing a suitable validation and verification method
Background: Early Systems
Declarative knowledge to process data
Explicitly defined rules
Constrained knowledge
Limited capacity to assess and adapt
Goal-oriented and data-driven behavior
ALICEbot
Background: Naturalness
Automatic speech recognition
Context retrieval experimentation
Intelligent tutoring
Adaptive Control of Thought
Knowledge acquisition agents
Quality of the information received
Conversation length as a metric
ALICE-based bots
Background: Recent Advances
Sentence-based template matching
Simple conversational memory
CMU's Julia, Extempo's Erin
Interaction occurs in a reactive manner
Wlodzislaw et al.: development of cognitive modules and human interface realism
Ontologies, concept description vectors, semantic memory models, CYC
Background: Recent Advances
Becker and Wachsmuth: representation and actuation of coherent emotional states
Lars et al.: model for sustainable conversation
Awareness of the human users and the conversation topics
Relies on textual input, similar to ELIZA
Use of natural language processing for reasoning about human speech
Background: Conclusion
Breadth of research using chatbots
Focus on creating more sophisticated interpretative conversational modules
A need exists for generalizable metrics
Conversational agents are widely experimented with, but the field lacks a basic framework for universal performance comparison
Approach: Previous Approaches
Mix of quantitative and qualitative measures
Subjective matters employ a human user questionnaire
Semeraro et al.'s bookstore chatbot
Seven characteristics: impression, command, effectiveness, navigability, ability to learn, ability to aid, comprehension
Does not provide statistical conclusiveness; serves as a general indicator of performance
Approach: Previous Approaches
Shawar and Atwell's universal chatbot evaluation system
ALICE-based Afrikaans conversation agent
Dialog efficiency
Dialog quality: responses rated as reasonable, weird but understandable, or nonsensical
Users' satisfaction, qualitatively measured
Proper assessment ultimately rests on how successfully the agent accomplishes its intended goals
Approach: Previous Approaches
Evaluation of naturalness similar to general chatbot assessment
Rzepka et al.'s 1-to-10 scale metrics
Degree of naturalness
Degree of willingness to continue a conversation
Human judges used these measures to evaluate a conversation agent's utterances
No concrete baseline for naturalness, but relative measurements of naturalness between dialog agents are possible
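A minimal sketch of how such relative measurements could be aggregated, assuming hypothetical judge ratings on the two 1-to-10 scales (the agent names and scores below are illustrative, not from the paper):

```python
from statistics import mean

# Hypothetical judge ratings (1-10) for two dialog agents, in the spirit of
# Rzepka et al.'s scales: degree of naturalness and willingness to continue.
ratings = {
    "agent_a": {"naturalness": [6, 7, 5, 8], "willingness": [5, 6, 6, 7]},
    "agent_b": {"naturalness": [4, 5, 5, 3], "willingness": [4, 3, 5, 4]},
}

def score(agent):
    """Average each scale; with no concrete baseline, only the
    comparison between agents is meaningful, not the absolute values."""
    return {scale: mean(vals) for scale, vals in ratings[agent].items()}

# Relative comparison: agent_a averages 6.5 naturalness vs. 4.25 for agent_b.
print(score("agent_a"), score("agent_b"))
```

This captures the slide's caveat directly: the averages rank agents against each other but say nothing absolute about naturalness.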
Approach: Chatbot Objectives
Walker et al.'s PARAdigm for DIalogue System Evaluation (PARADISE)
Dialog performance relates to the experience of the interaction (means)
Task success is concerned with the utility of the dialog exchange (ends)
Objectives:
Better than other dialog system solutions
Similar to a human-to-human interaction (naturalness)
Approach: Task Success
Measure of goal satisfaction
Attribute-value matrix, derived from PARADISE
Expected vs. actual values
Task success (κ) computed as the percentage of correct responses
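The expected-vs.-actual comparison above can be sketched as a small function over an attribute-value matrix. The slot names below are hypothetical, and this follows the slide's simplification of κ as a percentage of correct responses (full PARADISE uses the kappa statistic, which also corrects for chance agreement):

```python
# Attribute-value matrix: expected vs. actual slot values for one task.
# Slot names and values are illustrative only.
expected = {"topic": "enrollment", "program": "EECS", "deadline": "fall"}
actual   = {"topic": "enrollment", "program": "EECS", "deadline": "spring"}

def task_success(expected, actual):
    """Fraction of attributes whose actual value matches the expected one."""
    correct = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return correct / len(expected)

kappa = task_success(expected, actual)  # 2 of 3 attributes match
```

The returned fraction is the task success score fed into the performance function on the next slide.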
Approach: Performance Function
Derived from PARADISE
Total effectiveness: Performance = α · N(κ) − Σᵢ wᵢ · N(cᵢ)
Task success (κ) weighted by α
Dialog costs (cᵢ) weighted by wᵢ
Normalization function N uses Z-score normalization to balance κ and the cᵢ
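A minimal sketch of this PARADISE-style performance function, with Z-score normalization computed over a set of observed dialogs. The weights and example values are illustrative assumptions, not figures from the paper:

```python
from statistics import mean, pstdev

def z(x, xs):
    """Z-score normalization N(x) relative to observed values xs."""
    s = pstdev(xs)
    return 0.0 if s == 0 else (x - mean(xs)) / s

def performance(kappa, kappas, costs, cost_histories, alpha, weights):
    """Performance = alpha * N(kappa) - sum_i w_i * N(c_i).

    kappa/costs: values for the dialog being scored.
    kappas/cost_histories: observed values used to normalize each term,
    so task success and dialog costs contribute on a comparable scale.
    alpha/weights: illustrative importance weights.
    """
    return alpha * z(kappa, kappas) - sum(
        w * z(c, hist) for w, c, hist in zip(weights, costs, cost_histories)
    )

# Example: high task success (0.9) and a low turn count (10) relative to
# the observed dialogs both push the performance score upward.
p = performance(0.9, [0.5, 0.7, 0.9], [10], [[10, 20, 30]], alpha=1.0, weights=[1.0])
```

The subtraction makes the trade-off explicit: normalized task success raises the score while each normalized dialog cost lowers it.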
Approach: Proposed System
Task success
Dialog costs:
Efficiency: resource consumption (quantitative)
Quality: actual conversational content (quantitative or qualitative)
Questions