
Page 1

Conversational Behavior for Virtual Humans

Tim Bickmore, Justine Cassell, Hannes Vilhjalmsson MIT Media Lab

Page 2

Overview

• Phenomena of interest
• Common challenges
• Wouldn't it be great if...
• Some future paradigm shifts
• Towards standards and infrastructure
• Who's doing what
• Current GNL approaches to challenges

Page 3

Phenomena of interest

gaze, head movement, intonation, eyebrow movement, posture shifts, hand gesture

Page 4

Phenomena of interest

- Multi-party
- Social talk
- Reference to the physical world

Page 5

Common challenges: Theory-building / Behavior modeling

• What do we need to know about human body movements during talk?
• How do we get it?
  – e.g., vision, body suits, head/eye tracking
• How do we describe it?
  – Conversational functions
  – Speech acts
  – Nonverbal behavior
• Naturalistic data collection vs. controlled experiments

Page 6

Common technical challenges: Discourse & Dialogue

• Generation based on complete discourse plans
  – take speaker intentions into account
  – take listener knowledge and inference ability into account
• Verbal and nonverbal negotiation of conversational tasks (grounding)
• Intonation (re)generated with respect to discourse knowledge
• Very tight temporal coupling of nonverbal behavior
  – postural mirroring, turn-taking, grounding
• Generation and recognition of spontaneous iconics and metaphorics
• Cross-cultural/multi-cultural conversational behavior

Page 7

Common technical challenges: Input

• Multimodal recognition
  – speech, intonation, gesture, gaze, etc.
• Multimodal input fusion (see the sketch after this list)
  – redundant info should lead to an exponential increase in reliability, but complementary info leads to an exponential decrease in reliability
  – synchronization is a big problem
    • requires, at the very least, accurate time stamps
    • most ASRs do not provide word or phoneme timing information
• Incremental understanding
  – required for backchannel feedback
• Representation
  – user's intent (speech act categories?)
  – discourse, domain and global knowledge
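The time-stamp issue above is the crux of fusion. As a minimal illustration (not from the original slides; all class and field names are invented), the sketch below pairs speech and gesture events by temporal proximity, which is only possible when both recognizers report accurate time stamps:

// Hypothetical fusion sketch: align timestamped speech and gesture events.
import java.util.ArrayList;
import java.util.List;

class InputEvent {
    final String modality;  // e.g., "speech" or "gesture"
    final String content;   // recognized word or gesture label
    final double startMs;   // onset time stamp from the recognizer
    InputEvent(String modality, String content, double startMs) {
        this.modality = modality; this.content = content; this.startMs = startMs;
    }
}

class Fuser {
    static final double WINDOW_MS = 250.0; // assumed alignment tolerance

    // Pair each speech event with the nearest gesture event inside the window.
    static List<String> fuse(List<InputEvent> speech, List<InputEvent> gestures) {
        List<String> fused = new ArrayList<>();
        for (InputEvent s : speech) {
            InputEvent best = null;
            for (InputEvent g : gestures) {
                double d = Math.abs(g.startMs - s.startMs);
                if (d <= WINDOW_MS
                        && (best == null || d < Math.abs(best.startMs - s.startMs)))
                    best = g;
            }
            fused.add(best == null ? s.content : s.content + "+" + best.content);
        }
        return fused;
    }
}

Without word-level ASR timings, the startMs values for speech are unavailable, which is exactly the problem the slide points out.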

Page 8

Common technical challenges: Output Planning / Scheduling

• Planning
  – Integration of conversational behavior planning with action planning
    • can't gesture the same way when pouring tea or firing a gun
  – Generate-and-filter approach (BEAT) vs. full planning/search (PPP Persona)
  – Incremental generation and execution
• Scheduling / coordination (see the sketch after this list)
  – Must take available function-to-behavior mappings into account
    • initially explored in Gandalf (Thorisson)
  – TTS callbacks vs. phoneme timings
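To make the phoneme-timing alternative concrete, here is a minimal sketch (invented names, not the BEAT scheduler): given word onset times obtained from a TTS engine up front, behaviors anchored to word indices can be compiled into an absolute schedule before playback begins, rather than being dispatched reactively from TTS callbacks at run time.

// Hypothetical scheduling sketch: word-indexed behaviors -> absolute times.
class ScheduledBehavior {
    final String action;    // e.g., "EYEBROWS:RAISED"
    final double startSec;  // absolute start time within the utterance
    ScheduledBehavior(String action, double startSec) {
        this.action = action; this.startSec = startSec;
    }
}

class BehaviorScheduler {
    // wordStartSec[i] is the onset of word i as reported by the TTS engine.
    static ScheduledBehavior schedule(String action, int wordIndex,
                                      double[] wordStartSec) {
        return new ScheduledBehavior(action, wordStartSec[wordIndex]);
    }
}

With callbacks instead, the engine fires events as words are spoken, so behaviors must be triggered on the fly and animation latency becomes critical.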

Page 9

Common technical challenges: Animation

• Synchronization (see the sketch after this list)
  – The most effortful part of a gesture must precede or coincide with the pitch peak
  – Speech or gesture can be elongated to reach the synchronization goal
• Co-articulation
  – Some gestures can be overridden
  – Which phases are collapsed?
• Parameters
  – Gestural emphasis (EMOTE)
  – Idiosyncrasy of conversational gesture
• Animation
  – Key parts of a gesture (handshape, trajectory) must be maintained
  – Motion between segments must be smoothed for naturalness
    • Blending/transitioning of modular procedural and mocap animations
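A minimal sketch of the synchronization rule just stated (names invented): the gesture stroke is placed at the pitch peak and the preparation phase is scheduled backwards from it; if the hand cannot start early enough, the system would instead elongate speech or gesture, as the slide notes.

// Hypothetical synchronization sketch: schedule gesture preparation so the
// stroke coincides with (or slightly precedes) the pitch peak.
class GestureSync {
    // Returns the time at which gesture preparation must begin.
    static double preparationStart(double pitchPeakSec, double prepDurationSec,
                                   double earliestStartSec) {
        double start = pitchPeakSec - prepDurationSec;
        // If preparation cannot begin early enough, speech or the gesture
        // itself would be stretched to move the synchronization point.
        return Math.max(start, earliestStartSec);
    }
}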

Page 10

Common technical challenges: Multi-party Conversation

• Multiple humans and/or multiple agents
• Input processing and multimodal integration
• Must track the participation framework (Goffman) (see the sketch after this list)
  – Current speaker
  – Addressee
  – Hearers (ratified and others)
• Must generate behaviors for non-speaker(s)
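As a small illustration of tracking the participation framework (hypothetical API, not from the slides): the tracker below records the current speaker, addressee and other ratified hearers, so that listener behaviors can be generated for everyone who is not speaking.

// Hypothetical participation-framework tracker (Goffman's roles).
import java.util.HashSet;
import java.util.Set;

class ParticipationFramework {
    String speaker;
    String addressee;                                     // ratified, directly addressed
    final Set<String> ratifiedHearers = new HashSet<>();  // other ratified hearers

    void takeTurn(String newSpeaker, String newAddressee) {
        if (speaker != null) ratifiedHearers.add(speaker); // old speaker now hears
        speaker = newSpeaker;
        addressee = newAddressee;
        ratifiedHearers.remove(newSpeaker);
        ratifiedHearers.remove(newAddressee);
    }

    // Everyone but the current speaker needs listener behaviors generated.
    boolean needsListenerBehaviors(String participant) {
        return participant != null && !participant.equals(speaker);
    }
}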

Page 11

Common technical challenges: Evaluation

• Is the model of human behavior correct?
• Is the system correctly/efficiently/elegantly implemented?
• Does the interface succeed?
  – Do (which) people like/trust/enjoy the interface?
  – Does it make work easier/more efficient/better?
  – What aspects of embodiment are most powerful?

Page 12

Wouldn't it be great if… (assume someone else is working on)

• Tightly synchronized 3D animation and TTS modules
• Quality TTS with full intonation control and timing for output
• Speaker-independent conversational continuous ASR for children and adults
• Gesture recognition (including handshape…)
  – and other mythological "reliable" input modalities
• MSAgent allowed speech-coordinated animations

Page 13

Where we need help

• More studies showing when nonverbal behaviors are used, across a wide range of domains and contexts
• How communicative behavior is modulated by personality, mood, affect and other idiosyncrasies

Page 14

Some future paradigm shifts

• From speech-act-oriented pipeline architectures to dialogic models encompassing mutual beliefs and goals
  – when simulating conversation you can't just focus on the output
  – need a representation of dialogue games/joint projects, e.g., to signal dispreferred moves nonverbally
• From simplistic models of speech-act-induced obligations to more general models of accommodation
  – must work from the full discourse plan to generate all appropriate verbal and nonverbal behavior
• From Gricean "maximally informative" task-oriented conversation to more general models involving multiple frames of interaction
  – frame switches are often accomplished nonverbally (contextualization cues)
  – rules of interpretation and generation change in each frame
  – task talk is joined by relationship- and other kinds of social talk

Page 15

Towards standards and infrastructure

• XML representations of multimodal messages
  – BEAT
  – Jean-Claude Martin et al. (annotation framework; AAMAS 2001)
  – Working Group on Multimodal Meaning Representation (Dagstuhl workshop)
  – SIGSEM working group on meaning representations
  – VoiceXML (W3C)
  – Human Behavior Representation
• Representations of multimodal semantics
  – XSLT; Java rules (BEAT)
  – TAG trees (REA-SPUD, RUTH)

Page 16

Who's doing what (partial!)

• Conversational openings and closings (BodyChat)
• Head movements (Pelachaud, Stone & DeCarlo)
• Intonation (Steedman, Stone)
• Iconic gestures (Rea)
• Deictic references (gestures/verbal/locomotion): Cosmo, Steve
• Turn taking (Rea)
• Backchannels (Gandalf, Nakano, Rickel & Traum)
• Postures (Rea)
• Integration of conversation and physical tasks for collaboration in 3D worlds (Steve)

Page 17

Some GNL Virtual Humans

Sam, SPARK, Mack, GrandChair, Rea

Page 18

GNL Approach

• Where do NV (nonverbal behaviors) occur with respect to discourse structure?
  – Information structure
  – Topic/intentional structure
• Where do NV occur with respect to conversational structure?
• How to represent the functions in discourse and conversation played by the verbal & nonverbal modalities?
• Which semantic features tend to be conveyed by NV and which by speech?
• When do NV and speech convey the same features, and when do they convey different features?
• What is the morphology of NV?
• How do we evaluate under what conditions, and to whom, verbal & NV behaviors are useful in interactive systems?

Page 19

Current GNL approaches: Discourse

• FEMBOT model
• Generating speech & gesture from an underlying discourse representation
• Generating speech & facial, head and hand behavior from conversational structure
• RUTH (Stone & DeCarlo): eyes, intonation, eyebrows from discourse

[Image: REA]

Page 20

Current GNL approaches: Small Talk

• How to engage in social dialogue?
  – Implicature, nonverbal conveyance of meaning, and contextualization cues play a big role.
• How to represent, assess and update relationship status? (see the sketch after the diagram)

[Diagram: the user's and agent's conversational moves feed an interpretation and relationship-assessment process that updates a relational model (familiarity, solidarity, power, affect); the relational model and the agent's goals drive the discourse planner, which produces the agent's conversational moves.]
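As a sketch of how such a relational model might be represented and updated (the update rule below is invented for illustration; the slides do not specify one):

// Hypothetical relational model with the four dimensions from the diagram.
class RelationalModel {
    double familiarity, solidarity, power, affect;  // each assumed in [0, 1]

    // Relationship assessment: adjust the model after a conversational move.
    void assess(String move) {
        if (move.equals("SMALL_TALK")) {            // invented rule
            familiarity = Math.min(1.0, familiarity + 0.05);
            solidarity  = Math.min(1.0, solidarity + 0.03);
        }
    }

    // The discourse planner can gate face-threatening topics on the
    // current relationship status.
    boolean safeToDiscuss(double requiredFamiliarity) {
        return familiarity >= requiredFamiliarity;
    }
}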

Page 21

Current GNL approaches: Language in the Real World

• How to do direction-giving
  – speech only vs.
  – map deictics vs.
  – character-viewpoint spatial gestures vs.
  – observer-viewpoint spatial gestures
• How to do multimodal grounding
  – gaze cues (using IBM Blue Eyes)
  – backchannel feedback
  – contingent responses
• Shared Reality
  – collaborating with physical objects

Page 22

Current GNL approaches: Posture

• Posture shifts occur as a function of focus stack changes (discourse segments) and turns (see the sketch below)
• Use Collagen to track the discourse segment
  – but conversational structure was difficult to represent in Collagen

Cassell, J., Nakano, Y., Bickmore, T., Sidner, C., and Rich, C. (2001). "Non-Verbal Cues for Discourse Structure." Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.
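A minimal sketch of the posture rule (names invented; in the actual system Collagen maintains the discourse state): posture shifts are emitted at discourse segment boundaries, i.e., whenever the focus stack is pushed or popped.

// Hypothetical posture controller keyed to focus-stack changes.
import java.util.ArrayDeque;
import java.util.Deque;

class PostureController {
    private final Deque<String> focusStack = new ArrayDeque<>();

    // Entering a new discourse segment: a boundary, so shift posture.
    String pushSegment(String segment) {
        focusStack.push(segment);
        return "POSTURE_SHIFT";
    }

    // Leaving a discourse segment: also a boundary, so shift posture.
    String popSegment() {
        focusStack.pop();
        return "POSTURE_SHIFT";
    }
}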

Page 23

Current GNL approaches: Interactivity

• Maintain engaging, child-centered interaction with minimal input information

• Keyword spotting
• Shared Reality

Page 24

Current GNL approaches: Multi-party

• Referring expressions (map deictics & NPs)
• Conversational group behavior
• Dynamic participation status
• Integration of different control methods

Page 25

Current GNL approaches: Animation

• Homegrown humanoid animation: Pantomime
  – Plug-in behavior modules (keyframe/procedural/dynamic)
  – New authoring tool for defining gestures
• Solves some gesture co-articulation and blending problems (see the sketch after this list)
  – Beats are defined as relative motion, and thus can co-articulate with any other arm/hand motion.
  – Keyframes for hand and arm positions can be specified at the start of an utterance, with motion smoothed between frames.
    • Doesn't work with (TTS) event-based animation.
• Continuing debate regarding how much "intelligence" to put in the body (animator)
  – e.g., where are arm and hand "relax" commands issued?
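The two ideas above can be illustrated with a short sketch (invented names; Pantomime's actual interfaces are not shown in the slides): a beat defined as a relative offset can simply be added to whatever base arm pose is active, and motion between sparse keyframes is smoothed with ease-in/ease-out interpolation.

// Hypothetical sketch of beat co-articulation and keyframe smoothing.
class ArmAnimation {
    // Superimpose a relative beat offset on the current base pose, so the
    // beat co-articulates with any other arm/hand motion.
    static double[] applyBeat(double[] basePose, double[] beatOffset, double w) {
        double[] out = new double[basePose.length];
        for (int i = 0; i < basePose.length; i++)
            out[i] = basePose[i] + w * beatOffset[i];
        return out;
    }

    // Smoothed interpolation between two keyframes (smoothstep easing).
    static double[] interpolate(double[] a, double[] b, double t) {
        double s = t * t * (3 - 2 * t);
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++)
            out[i] = (1 - s) * a[i] + s * b[i];
        return out;
    }
}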

Page 26

BEAT

• An extensible framework for verbal and nonverbal conversational behavior
  – gaze, eyebrow movement, posture, intonation, head nods, and beat gestures
  – synchronization with turn and information structure
  – Input:
    • a prepared script
    • NLG
    • ongoing conversation among human users of a graphical chat system
  – Animation:
    • real-time or off-line
    • event-based or scheduled animation
    • animators can augment and override BEAT's choices

Page 27

BEAT Design

• XML pipeline architecture
  – supports extensibility and modularity
  – many extensions can be made in XSLT
• Separation of generation and filtering of nonverbal behaviors (see the sketch after this list)
  – provides a greater range of possible character behavior and allows multiple generation algorithms to be integrated
• Implemented in Java
  – supports portability
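A sketch of the generate-and-filter separation (hypothetical interfaces, not BEAT's actual API): generators independently propose nonverbal behaviors for a tagged utterance, and filters then thin the proposals, which is what lets multiple generation algorithms coexist.

// Hypothetical generate-and-filter pipeline.
import java.util.ArrayList;
import java.util.List;

interface Generator { List<String> generate(String taggedUtterance); }
interface BehaviorFilter { List<String> filter(List<String> proposals); }

class BehaviorPipeline {
    final List<Generator> generators = new ArrayList<>();
    final List<BehaviorFilter> filters = new ArrayList<>();

    List<String> run(String taggedUtterance) {
        List<String> proposals = new ArrayList<>();
        for (Generator g : generators)       // all generators propose freely
            proposals.addAll(g.generate(taggedUtterance));
        for (BehaviorFilter f : filters)     // filters prune conflicts/excess
            proposals = f.filter(proposals);
        return proposals;
    }
}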

Page 28

BEAT Processing: Script Input

[BEAT pipeline diagram: Script Feeder → Language Module → Generation Module (GEN 1 … GEN n) → Filtering Module (FIL 1 … FIL n) → Scheduling Module → Animation Compiler, supported by a Knowledge Base and speech timing data.]

Input script: "This is both good news and bad news"

Page 29

BEAT Processing: Language Tagging

[Pipeline diagram as above, with the Language Module stage active.]

[Tagged utterance: the CLAUSE "This is both good news and bad news" is divided into THEME ("this is") and RHEME ("both good news and bad news"); OBJECT and ACTION constituents are marked NEW, and "good news" / "bad news" are tagged as a contrast pair, CON(1).]

Page 30

BEAT Processing: Behavior Generation

[Pipeline diagram and tagged utterance as above, with the Generation Module stage active.]

Page 31

BEAT Processing: Behavior Generation

[As above; beat gestures (GEST:BEAT) have been proposed on the tagged utterance.]

Page 32

BEAT Processing: Behavior Generation

[Full set of proposed behaviors over "This is both good news and bad news": intonation (TONE:BREAK, TONE:ENDHI, TONE:ENDLO), pitch accents (ACCT:HI), raised eyebrows (EYEBROWS:RAISED), gaze (GAZE:AWAY, GAZE:HEARER), head nods (HD:NOD), contrast gestures (GEST:CON_R, GEST:CON_L), and beat gestures (GEST:BEAT).]

Page 33

BEAT Processing: Behavior Filtering

[Pipeline diagram and proposed behaviors as above, with the Filtering Module stage active.]

Page 34

BEAT Processing: Behavior Filtering

[After filtering: the redundant beat gestures (GEST:BEAT) have been removed; the contrast gestures, gaze, head nod, eyebrow and intonation behaviors remain.]

Page 35

BEAT Processing: Behavior Scheduling

[Pipeline diagram and surviving behaviors as above, with the Scheduling Module stage active.]

Page 36: Conversational Behavior for Virtual Humans Tim Bickmore, Justine Cassell, Hannes Vilhjalmsson MIT Media Lab

0.800.80

0.00 - 0.510.00 - 0.51 0.80 – 2.380.80 – 2.38

0.80 - 1.390.80 - 1.39 1.79 - 2.381.79 - 2.38

0.390.390.990.99 1.791.79

0.00 – 2.380.00 – 2.38

AnimationCompiler

SchedulingModule

FilteringModule

GenerationModule

LanguageModule

Script Feeder

Speech TimingKnowledge Base

GE

N 1

GE

N 2

GE

N n

FIL

1

FIL

2

FIL

n

TONE:BREAK

TONE:ENDHI TONE:ENDLO

EYEBROWS:RAISED

ACCT:HI

GEST:CON_R

This is good news and bad newsboth

ACCT:HI ACCT:HI ACCT:HI

EYEBROWS:RAISED

GAZE:AWAY GAZE:HEARER

HD:NOD HD:NOD HD:NOD

GEST:CON_L

BEAT Processing: Behavior Scheduling

Page 37

BEAT Processing: Animation Compilation

[Pipeline diagram as above, with the Animation Compiler stage active.]

<AnimationScript SPEAKER="AGENT" HEARER="USER">
  <START SPEECH="This is both good news and bad news">
  <START ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.0">
  <START ACTION="VISEME" TYPE="B" SRT="0.0">
  <STOP ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.511">
  <START ACTION="VISEME" TYPE="A" SRT="0.511">
  <START ACTION="GAZE" DIRECTION="TOWARDS_HEARER" SRT="0.801">
  <START ACTION="EYEBROWS" SRT="0.801">
  <START ACTION="GESTURE_RIGHT" TYPE="CONTRAST_1"
         RIGHT_TRAJECTORY="CONTRAST_TRAJECTORY" RIGHT_HANDSHAPE="CONTRAST" SRT="0.801">
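To show how such a script might be consumed (a hypothetical player, not the actual BEAT animation engine): every entry carries a speech-relative time (SRT), so playback amounts to dispatching START/STOP actions in time order once speech begins.

// Hypothetical animation-script player keyed to speech-relative times.
import java.util.List;

interface AnimationBody {
    void startAction(String action, double atSec);
    void stopAction(String action, double atSec);
}

class ScriptEntry {
    final boolean start;   // START vs. STOP
    final String action;   // e.g., "GAZE", "EYEBROWS", "GESTURE_RIGHT"
    final double srtSec;   // speech-relative time in seconds
    ScriptEntry(boolean start, String action, double srtSec) {
        this.start = start; this.action = action; this.srtSec = srtSec;
    }
}

class ScriptPlayer {
    // Entries are assumed sorted by srtSec.
    static void play(List<ScriptEntry> entries, double speechOnsetSec,
                     AnimationBody body) {
        for (ScriptEntry e : entries) {
            double at = speechOnsetSec + e.srtSec;
            if (e.start) body.startAction(e.action, at);
            else         body.stopAction(e.action, at);
        }
    }
}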

Page 38

BEAT Example: Maya Compilation

[Pipeline diagram as above, with the Animation Compiler replaced by a Maya MEL Compiler.]

Page 39

BEAT Tag Pass-Through

[Pipeline diagram as above.]

Input: "You <smile> make me smile! </smile>"

Output:
<AnimationScript SPEAKER="AGENT" HEARER="USER">
  <START SPEECH="You make me smile!">
  <START ACTION="VISEME" TYPE="A" SRT="0.511">
  …
  <START ACTION="SMILE" SRT="0.801">
  <START ACTION="GAZE" DIRECTION="TOWARDS_HEARER" SRT="0.801">
  <START ACTION="EYEBROWS" SRT="0.901">
  …
  <STOP ACTION="SMILE" SRT="1.5">

Page 40

BEAT Example: Animation Instructions

[Pipeline variant: Script Feeder → Language Module → Generation Module → Filtering Module → Text Annotation, supported by the Knowledge Base; no scheduler, since the output is annotated text for an animator.]

"You just need [1 to *type in ] { a [2 *line ] like }"

[1] ICONIC - typing action
[2] BEAT - emphasis

Notation: * pitch accent; { } raised eyebrows; . . gaze away; __ gaze towards

Page 41

BEAT Example: Dope Sheet Generation

Page 42

BEAT Recent Developments

• Infer topic shifts from discourse markers
  – Posture shift rules and character-specific posture state machines (see the sketch after this list)
• Gantt Chart Compiler
• In progress:
  – Participation framework for multi-party conversation
  – Level of Detail
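A sketch of a character-specific posture state machine (states and transitions invented for illustration): a topic shift inferred from a discourse marker drives a transition, and the transition determines which posture-shift animation to play.

// Hypothetical character-specific posture state machine.
import java.util.HashMap;
import java.util.Map;

class PostureStateMachine {
    private String state = "NEUTRAL";
    private final Map<String, String> onTopicShift = new HashMap<>();

    PostureStateMachine() {
        // Invented character-specific transition table.
        onTopicShift.put("NEUTRAL", "LEAN_FORWARD");
        onTopicShift.put("LEAN_FORWARD", "SHIFT_WEIGHT");
        onTopicShift.put("SHIFT_WEIGHT", "NEUTRAL");
    }

    // Called when a discourse marker (e.g., "anyway", "so") signals a topic shift.
    String topicShift() {
        state = onTopicShift.getOrDefault(state, "NEUTRAL");
        return state;  // posture to animate toward
    }
}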

Page 43

BEAT Recent Developments: Spatial BEAT

• How to describe arbitrary spatial relationships? (see the sketch below)
  – Requires explicit representation of the speaker's intent
  – e.g., to introduce a set of objects vs. describe their relative configuration

"There is a table with a glass of water on it."

• The VRML scene is parsed into a set of spatial primitives.
• BEAT is instructed to describe the configuration of objects.
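As an illustration of the pipeline just described (types invented; the actual spatial-primitive representation is not given in the slides), the parsed scene can be held as a set of relations over objects, from which a description like the example sentence is generated:

// Hypothetical spatial primitives and a trivial describer.
import java.util.List;

class SpatialRelation {
    final String relation, figure, ground;  // e.g., ON(glass of water, table)
    SpatialRelation(String relation, String figure, String ground) {
        this.relation = relation; this.figure = figure; this.ground = ground;
    }
}

class SpatialDescriber {
    static String describe(List<SpatialRelation> scene) {
        StringBuilder sb = new StringBuilder();
        for (SpatialRelation r : scene)
            if (r.relation.equals("ON"))
                sb.append("There is a ").append(r.ground)
                  .append(" with a ").append(r.figure).append(" on it. ");
        return sb.toString().trim();
    }
}

Given ON("glass of water", "table"), describe produces the slide's example sentence; in the full system, BEAT would also choose an accompanying spatial gesture based on the speaker's intent.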

Page 44

The VHuman Team at the MIT Media Lab

Sam: Yukiko Nakano, Yang Gao, Kimiko Ryokai, Ian Gouldstone
REA: Dilbert Doyle, Tim Bickmore, Cati Vaucelle
MACK: Hannes Vilhjalmsson, Tom Stocky, Justine Cassell