Audiovisual prosody in problematic dialogue situations Marc Swerts Communication & Cognition Tilburg University

Post on 31-Mar-2015

Audiovisual prosody in problematic dialogue situations

Marc Swerts

Communication & Cognition, Tilburg University

General problem

Spoken dialogue systems (SDS) are prone to error, especially because of errors in the ASR component of such systems

Errors will remain a problem for future systems, e.g. when they have to operate in noisy conditions, with non-native speakers or when the domain of the system becomes larger

Therefore, a key task for most dialogue managers in SDS is error handling:

– Prevent errors (e.g. optimal dialogue strategies)

– Detect errors (e.g. acoustic and semantic confidence scores)

– Correct errors (e.g. feedback cues, system prompts)
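As a rough illustration of the detect and correct steps, the sketch below combines an acoustic and a semantic confidence score and picks a dialogue move. The function name, the equal weighting of the two scores and both thresholds are assumptions made for this example; they are not taken from the talk or from any particular system.

```python
def choose_dialogue_move(acoustic_conf: float, semantic_conf: float,
                         accept_at: float = 0.8,
                         reject_below: float = 0.4) -> str:
    """Pick a dialogue move from two confidence scores in [0, 1].

    The simple average and both thresholds are hypothetical.
    """
    combined = (acoustic_conf + semantic_conf) / 2
    if combined >= accept_at:
        return "accept"    # no error suspected: continue the dialogue
    if combined < reject_below:
        return "reprompt"  # likely misrecognition: ask the user to repeat
    return "verify"        # uncertain: ask an explicit verification question


print(choose_dialogue_move(0.9, 0.8))  # accept
print(choose_dialogue_move(0.6, 0.5))  # verify
print(choose_dialogue_move(0.3, 0.2))  # reprompt
```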

Prosody and error handling

Recent interest in the use of speech prosody for error handling:

– To detect misrecognized utterances, which have been shown to be prosodically different from correctly recognized utterances (e.g. Hirschberg et al. 2004)

– To distinguish positive from negative feedback cues about the smoothness of the interaction (e.g. Krahmer et al. 2002)

– To locate places where speakers try to correct a prior utterance (corrections tend to be hyperarticulated, which often leads to ‘spiral’ errors) (Oviatt et al. 1998)

Previous research only focused on verbal features; in this talk we concentrate on the effect of errors on visual features as well (audiovisual prosody)

This talk

Report on analyses of interactions between speakers and their dialogue partners (both human and machine)

Study audiovisual features of speakers:

– When speakers notice they themselves have a problem (Part 1)

– When speakers notice their dialogue partners have a problem (Part 2)

Part 1

What are audiovisual features of a speaker who experiences communication problems?

Uncertainty

Speakers are not always equally confident about or committed to what they are saying

Suppose someone asks a question (Who wrote Hamlet? What is the capital of Switzerland?):

– Speakers may be sure about their answer, or rather uncertain

– Speakers may not know the answer, though it may be on the tip of their tongue

These differences in confidence level are reflected in the way speakers present themselves; this is useful for their addressees

Questions to be addressed

How can visual cues from a speaker’s face be used as signals of level of uncertainty? How important are such cues compared to auditory cues?

Are there significant differences between different kinds of speakers in their use of visual cues for uncertainty? (here: age differences)

Experiment 1: Production of Uncertainty (based on Smith and Clark 1993)

Experiment in three stages (Hart 1965):

1. Answers to factual questions (WISC, WAISC, Trivial Pursuit).

2. Test how certain the subject is that (s)he would recognize the correct answer in a multiple-choice test (Feeling of Knowing (FOK) scores).

3. Recognition test (multiple-choice).

“Tip of the tongue”: non-answer (“I don’t know”) with a high FOK.
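The tip-of-the-tongue definition can be written down as a small predicate. This is only a sketch: the `Response` record, the 0 to 1 scaling of the FOK score and the 0.5 cut-off for "high" are assumptions made for illustration, not details given in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    question: str
    answer: Optional[str]  # None when the subject says "I don't know"
    fok: float             # Feeling of Knowing score, assumed scaled to 0..1


def is_tip_of_the_tongue(response: Response, high_fok: float = 0.5) -> bool:
    """True for a non-answer given with a high FOK (cut-off is hypothetical)."""
    return response.answer is None and response.fok >= high_fok


print(is_tip_of_the_tongue(Response("Who wrote Hamlet?", None, 0.9)))           # True
print(is_tip_of_the_tongue(Response("Who wrote Hamlet?", "Shakespeare", 0.9)))  # False
```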

Subjects were filmed during the first test; they could not see the experimenter.

Adults: person with highest score got a small reward.

Children all got a small reward.

Subjects and questions

20 adults: students and colleagues, aged 20–50; 40 questions each, n = 800

Example questions: Who wrote Hamlet? How many degrees in a circle? What is the capital of Switzerland? ...

20 children: Group 4, aged 7–8; 30 questions each, n = 600

Example questions: Who is the president of the U.S.? Where can you buy a Happy Meal? What is the color of peanut butter? ...

Labelling

All 1400 utterances were manually labelled by 4 independent judges.

Consensus labeling of presence/absence of different audio-visual features.

Verbal: high intonation, filled pauses, delay, number of words.

Visual: eyebrow, smile, “funny face”, gaze [adults only]

Eyebrow raising

Smile

Gaze (diverted)

Funny face

Results adults

Answers: Presence of a filled pause, delay, high intonation, eyebrow raising, smile, funny face and different gaze acts corresponds with a significantly lower FOK score.

Non-answers: Presence of a filled pause, delay, high intonation, eyebrow raising, smile, funny face and different gaze acts corresponds with a significantly higher FOK score.

FOK correlation    Answers    Non-answers
Words              -.344      .401
Gaze acts          -.309      .347
Marked features    -.422      .462
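To make the sign of these correlations concrete, the sketch below computes a correlation between the number of marked features in an utterance and its FOK score. It assumes a Pearson coefficient (the talk does not say which statistic was used), and the toy data are invented purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy data for answers: more marked features, lower FOK.
marked_features = [0, 1, 1, 2, 3, 4]
fok_scores = [1.0, 0.9, 0.8, 0.7, 0.4, 0.3]

r = pearson_r(marked_features, fok_scores)
print(round(r, 3))  # negative, as in the "Answers" column
```

For non-answers the same computation would be expected to give a positive value, matching the opposite pattern reported on the slide.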

Results children

Answers: Presence of eyebrow raising, funny face and delay corresponds with a significantly lower FOK score.

Non-answers: Presence of smile corresponds with a significantly higher FOK score.

Other than that, no significant findings.

In general: children are much less expressive than adults, occasionally use very long delays, and use hardly any filled pauses.

Conclusion experiment 1

Speakers express their level of uncertainty via various audiovisual cues.

Adults do this much more than children (‘self-presentation’)

Opposite findings for answers and non-answers.

How is uncertainty perceived? What are the important features?

– In different modalities?

– By different judges?

Experiment 2: Perception of uncertainty (based on Brennan and Williams 1995)

Stimuli: 60 adult responses from Experiment 1.

            Answers    Non-answers
High FOK    15         15
Low FOK     15         15

120 subjects participated:

Vision+sound    Sound only    Vision only
40              40            40

Task: judge level of uncertainty of speaker (Feeling of Another’s Knowing (FOAK) scores).

FOAK scores for answers and non-answers

[Figure: FOAK scores (vertical axis from 0 to 1) for answers and non-answers, with separate bars for high FOK and low FOK]

Different conditions

[Figure: FOAK scores (vertical axis from 0 to 1) for the Vision+Sound, Sound only and Vision only conditions, with separate bars for high FOK and low FOK]

Conclusion experiment 2

Observers can estimate a speaker’s level of uncertainty on the basis of audiovisual cues.

Answers are “easier” than non-answers.

Scores for unimodal stimuli are good (both sound only and vision only), but those for bimodal stimuli are best.

Experiment 3: Perception of uncertainty

For different speakers/judges: adults vs. children

Same task: judge level of (un)certainty

Stimuli: only answers, selected from Experiment 1.

            Child answers    Adult answers
High FOK    15               15
Low FOK     15               15

80 subjects participated:

               Adult speaker    Child speaker
Adult judge    20               20
Child judge    20               20

FOAK scores for children and adults

[Figure: FOAK scores (vertical axis from 0 to 1) for adult and child speakers, shown separately for adult judges and child judges, with separate bars for high FOK and low FOK]

Conclusion experiment 3

Adults are “better” judges than children. (Detecting behavior one does not display is more difficult.)

Adults are “better” judged than children. (What is not signalled cannot be detected.)

Part 2

What are audiovisual features of a speaker who notices that his/her dialogue partner has communication problems?

Feedback cues

Dialogue partners continuously send and receive signals on the status of the information which is being exchanged:

– Positive feedback cues (‘go on’) when there are no problems

– Negative feedback cues (‘go back’) when there are problems

Previous research revealed that negative feedback cues are prosodically ‘marked’ (e.g. higher, louder, longer) (e.g. Krahmer et al. 2002, Shimojima et al. 2002)

Here: series of experiments to investigate whether speakers use visual cues as well as auditory ones for distinguishing positive from negative cues

Data

Taken from an audiovisual corpus of 9 subjects engaged in telephone conversations with a speaker-independent train timetable information system; they had to query the system on 7 train journeys (63 interactions)

Subjects were video-taped during their interactions; they were led to believe the data collection was for the development of a new video-phone

76% of the dialogues were successfully completed; 374 out of 1183 speaker turns were misunderstood by the system (32%)
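The corpus figures on this slide can be checked with a few lines of arithmetic. The variable names are chosen for this sketch only, and the count of successful dialogues is an approximation derived from the rounded 76%.

```python
subjects = 9
journeys_per_subject = 7
interactions = subjects * journeys_per_subject
print(interactions)                    # 63

misunderstood_turns = 374
total_turns = 1183
misunderstanding_rate = misunderstood_turns / total_turns
print(f"{misunderstanding_rate:.1%}")  # 31.6%, reported as 32%

# 76% of 63 dialogues is not a whole number, so this is only approximate.
successful_dialogues = round(0.76 * interactions)
print(successful_dialogues)            # 48
```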

Set-up of perception experiment

We performed three perception experiments in which 66 subjects were shown selected video-clips from these recorded human-machine interactions

The clips constituted ‘minimal pairs’, in that they consisted of comparable utterances that had originally occurred either in a problematic or in an unproblematic dialogue exchange

The subjects’ task was to guess whether the presented clip came from a problematic or unproblematic context

Study 1: verification questions

Subjects saw users listening to verification questions from the system (so users are silent), which can be unproblematic (such as in 1), or problematic (such as in 2)

1. User: Amsterdam

System: So you want to travel to Amsterdam?

2. User: Amsterdam

System: So you want to travel to Rotterdam?

Users listening to system questions

[Panels: No problem | Problem]

Study 2: Destination utterances

Subjects saw speakers uttering a destination; this could be the speaker’s first attempt (unproblematic) (like in 1), or it could be a correction in response to a verification question of misrecognized or misunderstood information (like in 2)

1. System: To which station do you want to travel?

User: Rotterdam

2. System: So you want to travel to Amsterdam?

User: Rotterdam

Slot filling (speakers utter destination)

[Panels: No problem | Problem]

Study 3: negations

Subjects saw speakers uttering a negation (“nee”, Dutch for “no”), which could be a response to a general yes-no question (like in 1), or a response to a verification question which contains incorrect information (like in 2)

1. System: Do you want me to repeat the connection?

User: No

2. System: So you want to travel to Amsterdam?

User: No

Negations

[Panels: No problem | Problem]

Increasing level of frustration…

Findings

In all three studies, subjects were able to correctly distinguish problematic from unproblematic fragments above chance level (the task was easier for verification stimuli and slot fillers)

In order to gain insight into the audiovisual features that may have functioned as cues, we labeled the data in terms of level of hyperarticulation (6 levels) and presence or absence of a number of visual features (most important: smile, head movement, diverted gaze, frown, brow raise)

Both level of hyperarticulation and relative number of visual cues were correlated with perceived and actual problems
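For a two-way forced choice like this one (problematic vs. unproblematic), "above chance level" can be checked with an exact one-sided binomial test against a guessing rate of 0.5. The sketch below is one possible way to do this, not the analysis used in the talk, and the counts are invented.

```python
from math import comb


def p_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """Exact one-sided binomial p-value: P(X >= correct) under pure guessing."""
    return sum(
        comb(trials, k) * chance ** k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical judge: 28 correct classifications out of 40 clips.
p_value = p_at_least(correct=28, trials=40)
print(p_value < 0.05)  # True for these invented counts
```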

Degree of hyperarticulation

[Figure panels: Perceived problems | Actual problems]

Amount of visual variation

[Figure panels: Perceived problems | Actual problems]

General conclusion

Dialogue problems have been shown to have consequences for audiovisual characteristics of a speaker who experiences problems him/herself or who notices that the dialogue partner has communication problems

In general, it appears that problematic dialogue situations lead to more dynamic facial expressions and marked prosodic behaviour

More information

The research reported here is joint work with Emiel Krahmer, Pashiera Barkhuysen (PhD project) and Lennard van de Laar (technical assistant) within the FOAP (“Functions of audiovisual prosody”) project:

foap.uvt.nl

Other interests: audiovisual cues to end-of-utterance, focus, emotion, deceptive speech, and personality; incorporation of findings in ECAs through collaborations

Data collection

Adults
                     n      FOK
Correct answers      575    0.94
Incorrect answers    129    0.76
Non-answers          96     0.42

Children
                     n      FOK
Correct answers      371    0.96
Incorrect answers    125    0.74
Non-answers          131    0.50

Contrary to adults, children have few high FOK non-answers.
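The per-category counts and mean FOK scores on this slide can be combined into a total and an overall mean per group. The sketch assumes that the first block of numbers belongs to the adults and the second to the children. The children's counts as transcribed add up to 627 rather than the 600 given on the "Subjects and questions" slide; the numbers are used here unchanged.

```python
# (n, mean FOK) per response type, copied from the slide.
adults = {
    "correct": (575, 0.94),
    "incorrect": (129, 0.76),
    "non-answer": (96, 0.42),
}
children = {
    "correct": (371, 0.96),
    "incorrect": (125, 0.74),
    "non-answer": (131, 0.50),
}


def total_and_mean_fok(group):
    """Return the total number of responses and the n-weighted mean FOK."""
    total = sum(n for n, _ in group.values())
    mean_fok = sum(n * fok for n, fok in group.values()) / total
    return total, mean_fok


for name, group in (("adults", adults), ("children", children)):
    total, mean_fok = total_and_mean_fok(group)
    print(f"{name}: n = {total}, weighted mean FOK = {mean_fok:.2f}")
```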

Manipulated data

Aim: gain more insight into the relevance of visual and auditory cues; because of ceiling effects it was difficult to establish the relative strength of these two types of cues

Answers (1 HighFOK, 1 LowFOK) from 5 speakers were selected; words had to have a similar sound shape (e.g. Goethe-Goofy; Zurich-Zorro, …)

Sound and image were separated to create mixed stimuli (e.g. HighFOK vision combined with LowFOK sound)

Both original and mixed stimuli were presented to 120 subjects who had to rate the FOK level (7-point scale) of each stimulus
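The stimulus set described here amounts to crossing the vision track and the sound track within each speaker. A minimal sketch follows; the speaker labels and dictionary keys are hypothetical, and the resulting counts follow from the description above rather than from figures stated in the talk.

```python
from itertools import product

speakers = ["S1", "S2", "S3", "S4", "S5"]  # hypothetical speaker labels
levels = ["HighFOK", "LowFOK"]

# Every vision/sound combination per speaker; "mixed" marks incongruent pairs.
stimuli = [
    {"speaker": s, "vision": vision, "sound": sound, "mixed": vision != sound}
    for s, vision, sound in product(speakers, levels, levels)
]

originals = [x for x in stimuli if not x["mixed"]]
mixed = [x for x in stimuli if x["mixed"]]
print(len(originals), len(mixed))  # 10 10
```

Comparing ratings of the mixed stimuli shows which channel dominates: if a HighFOK face with a LowFOK voice is still rated as certain, the face carries more weight.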

Face: sure      Face: unsure
Voice: sure     Voice: unsure

Face: sure      Face: unsure
Voice: unsure   Voice: sure

Conclusions experiment 4

Overall: bias towards uncertainty in FOK ratings

FOK ratings are significantly influenced by verbal (intonation pattern) and visual cues from the face; some speaker effect

However, facial information has much stronger cue value

Series of studies

Production of uncertainty (based on Smith and Clark 1993):

“Feeling of Knowing” (FOK)

– Experiment 1: Adults + children

Perception of uncertainty (based on Brennan and Williams 1995):

“Feeling of Another’s Knowing” (FOAK)

– Experiment 2: Unimodal vs multimodal

– Experiment 3: Adults x children

Future goals

Integrate the findings in Embodied Conversational Agents in order to make these more natural and believable, in particular for error handling strategies (working hypothesis: Users are more likely to tolerate incorrect answers if the system signals its uncertainty)

Explore whether visual features can be used as an additional resource for error detection (growing interest in incorporating visual information in the automatic recognition process)

Audiovisual prosody

Prosody defined as those features that do not determine what a speaker says, but rather how he or she says it

– Verbal: intonation, tempo, loudness, voice quality, pauses, …

– Visual: facial expressions, hand and arm gestures, body language, …

Audiovisual prosody = verbal + visual prosody

Self-presentation

Auditory cues (Smith and Clark, 1993; Brennan and Williams, 1995):

– Linguistic hedges (“I am not sure, but…”, “I think…”)

– Filled pauses (uh and uhm)

– Prosody (question intonation)

This study: possible visual cues (which are a natural and important ingredient of daily conversations as well)