Towards a Method For Evaluating Naturalness in Conversational Dialog Systems
Victor Hung, Miguel Elvir, Avelino Gonzalez & Ronald DeMara
Intelligent Systems Laboratory, University of Central Florida
IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, Texas, October 12, 2009
University of Central Florida www.ucf.edu
Agenda
Introduction
Background
Approach
Project LifeLike
Introduction
Interactive conversation agent evaluation
Cannot rely solely on quantitative methods
Subjectivity in 'naturalness'
No general method exists to judge how well a conversation agent performs
Pivotal focus is defining naturalness: how well a chatbot can maintain a natural conversation flow
The LifeLike virtual avatar project serves as a backdrop, providing a suitable validation and verification method
Background: Early Systems
Declarative knowledge to process data
Explicitly defined rules
Constrained knowledge
Limited capacity to assess and adapt
Goal-oriented and data-driven behavior
ALICEbot
Background: Naturalness
Automatic speech recognition
Context retrieval experimentation
Intelligent tutoring
Adaptive Control of Thought
Knowledge acquisition agents
Quality of the information received
Conversation length as a metric
ALICE-based bots
Background: Recent Advances
Sentence-based template matching
Simple conversational memory
CMU's Julia, Extempo's Erin
Interaction occurs in a reactive manner
Wlodzislaw et al.: development of cognitive modules and human interface realism
Ontologies, concept description vectors, semantic memory models, CYC
Background: Recent Advances
Becker and Wachsmuth: representation and actuation of coherent emotional states
Lars et al.: model for sustainable conversation
Awareness of the human users and the conversation topics
Relies on textual input, similar to ELIZA
Use of natural language processing for reasoning about human speech
Background: Conclusion
Breadth of research using chatbots
Focus on creating more sophisticated interpretative conversational modules
A need exists for generalizable metrics
Conversational agents are widely experimented with, but the field lacks a basic framework for universal performance comparison
Approach: Previous Approaches
Mix of quantitative and qualitative measures
Subjective matters employ a human user questionnaire
Semeraro et al.'s bookstore chatbot
Seven characteristics: impression, command, effectiveness, navigability, ability to learn, ability to aid, comprehension
Does not provide statistical conclusiveness; serves as a general indicator of performance
Approach: Previous Approaches
Shawar and Atwell's universal chatbot evaluation system
ALICE-based Afrikaans conversation agent
Dialog efficiency
Dialog quality: responses rated as reasonable, weird but understandable, or nonsensical
Users' satisfaction, qualitatively measured
Proper assessment ultimately rests on how successfully the agent accomplishes its intended goals
Approach: Previous Approaches
Evaluation of naturalness similar to general chatbot assessment
Rzepka et al.'s 1-to-10 scale metrics
Degree of naturalness
Degree of willingness to continue a conversation
Human judges used these measures to evaluate a conversation agent's utterances
No concrete baseline for naturalness, but relative measurements of naturalness between dialog agents are possible
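A minimal sketch of how such relative measurements could be aggregated, assuming hypothetical judge ratings on the two 1-to-10 scales (the agent names and scores below are illustrative, not from the paper):

```python
from statistics import mean

# Hypothetical judge ratings (1-10) for two dialog agents, in the spirit of
# Rzepka et al.'s scales: degree of naturalness and willingness to continue.
ratings = {
    "agent_a": {"naturalness": [6, 7, 5, 8], "willingness": [5, 6, 6, 7]},
    "agent_b": {"naturalness": [4, 5, 5, 3], "willingness": [4, 3, 5, 4]},
}

def score(agent):
    """Average each scale; with no concrete baseline, only the
    comparison between agents is meaningful, not the absolute values."""
    return {scale: mean(vals) for scale, vals in ratings[agent].items()}

# Relative comparison: agent_a averages 6.5 naturalness vs. 4.25 for agent_b.
print(score("agent_a"), score("agent_b"))
```

This captures the slide's caveat directly: the averages rank agents against each other but say nothing absolute about naturalness.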
Approach: Chatbot Objectives
Walker et al.'s PARAdigm for DIalogue System Evaluation (PARADISE)
Dialog performance relates to the experience of the interaction (means)
Task success is concerned with the utility of the dialog exchange (ends)
Objectives:
Better than other dialog system solutions
Similar to a human-to-human interaction (naturalness)
Approach: Task Success
Measure of goal satisfaction
Attribute-value matrix, derived from PARADISE
Expected vs. actual values
Task success (κ) computed as the percentage of correct responses
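The expected-vs.-actual comparison above can be sketched as a small function over an attribute-value matrix. The slot names below are hypothetical, and this follows the slide's simplification of κ as a percentage of correct responses (full PARADISE uses the kappa statistic, which also corrects for chance agreement):

```python
# Attribute-value matrix: expected vs. actual slot values for one task.
# Slot names and values are illustrative only.
expected = {"topic": "enrollment", "program": "EECS", "deadline": "fall"}
actual   = {"topic": "enrollment", "program": "EECS", "deadline": "spring"}

def task_success(expected, actual):
    """Fraction of attributes whose actual value matches the expected one."""
    correct = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return correct / len(expected)

kappa = task_success(expected, actual)  # 2 of 3 attributes match
```

The returned fraction is the task success score fed into the performance function on the next slide.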
Approach: Performance Function
Derived from PARADISE
Total effectiveness: Performance = α · N(κ) − Σᵢ wᵢ · N(cᵢ)
Task success (κ) weighted by α
Dialog costs (cᵢ) weighted by wᵢ
Normalization function N uses Z-score normalization to balance κ and the cᵢ
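A minimal sketch of this PARADISE-style performance function, with Z-score normalization computed over a set of observed dialogs. The weights and example values are illustrative assumptions, not figures from the paper:

```python
from statistics import mean, pstdev

def z(x, xs):
    """Z-score normalization N(x) relative to observed values xs."""
    s = pstdev(xs)
    return 0.0 if s == 0 else (x - mean(xs)) / s

def performance(kappa, kappas, costs, cost_histories, alpha, weights):
    """Performance = alpha * N(kappa) - sum_i w_i * N(c_i).

    kappa/costs: values for the dialog being scored.
    kappas/cost_histories: observed values used to normalize each term,
    so task success and dialog costs contribute on a comparable scale.
    alpha/weights: illustrative importance weights.
    """
    return alpha * z(kappa, kappas) - sum(
        w * z(c, hist) for w, c, hist in zip(weights, costs, cost_histories)
    )

# Example: high task success (0.9) and a low turn count (10) relative to
# the observed dialogs both push the performance score upward.
p = performance(0.9, [0.5, 0.7, 0.9], [10], [[10, 20, 30]], alpha=1.0, weights=[1.0])
```

The subtraction makes the trade-off explicit: normalized task success raises the score while each normalized dialog cost lowers it.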
Approach: Proposed System
Task success
Dialog costs:
Efficiency: resource consumption (quantitative)
Quality: actual conversational content (quantitative or qualitative)
Questions