
Conversational Behavior for Virtual Humans

Tim Bickmore, Justine Cassell, Hannes Vilhjalmsson MIT Media Lab

Overview

• Phenomena of interest
• Common challenges
• Wouldn't it be great if...
• Some future paradigm shifts
• Towards standards and infrastructure
• Who’s doing what
• Current GNL approaches to challenges

Phenomena of interest

gaze, head movement, intonation, eyebrow movement

posture shifts, hand gesture

Phenomena of interest

- Multi-party
- Social Talk
- Reference to physical world

Common challenges: Theory-building / Behavior modeling

• What do we need to know about human body movements during talk?
• How do we get it?
  – e.g., vision, body suit, head/eye tracking
• How do we describe it?
  – Conversational functions
  – Speech acts
  – Nonverbal behavior
• Naturalistic data collection vs. controlled experiments

Common technical challenges: Discourse & Dialogue

• Generation based on complete discourse plans
  – take speaker intentions into account
  – take listener knowledge and inference ability into account
• Verbal and nonverbal negotiation of conversational tasks (grounding)
• Intonation (re)generated with respect to discourse knowledge
• Very tight temporal coupling of nonverbal behavior
  – postural mirroring, turn-taking, grounding
• Generation and recognition of spontaneous iconics and metaphorics
• Cross-cultural/multi-cultural conversational behavior

Common technical challenges: Input

• Multimodal recognition
  – Speech, intonation, gesture, gaze, etc.
• Multimodal input fusion
  – redundant info should lead to exponential increase in reliability, but complementary info leads to exponential decrease in reliability
  – synchronization is a big problem
    • requires, at the very least, accurate time stamps
    • most ASRs do not provide word or phoneme timing information
• Incremental understanding
  – Required for backchannel feedback
• Representation
  – User’s intent (speech act categories?)
  – Discourse, domain and global knowledge
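To illustrate why accurate time stamps matter for input fusion, here is a minimal Python sketch that pairs each recognized word with the gesture event closest to it in time. The `Event` structure, the labels, and the 0.5 s tolerance are assumptions made for this example; they are not taken from any system named in these slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    """A time-stamped recognizer output (hypothetical structure)."""
    label: str
    start: float  # seconds
    end: float    # seconds


def gap(a: Event, b: Event) -> float:
    """Temporal distance between two intervals; 0.0 if they overlap."""
    return max(0.0, max(a.start, b.start) - min(a.end, b.end))


def fuse(words: List[Event], gestures: List[Event],
         tolerance: float = 0.5) -> List[Tuple[Event, Optional[Event]]]:
    """Pair each word with the nearest gesture within `tolerance` seconds."""
    pairs = []
    for w in words:
        best = min(gestures, key=lambda g: gap(w, g), default=None)
        if best is not None and gap(w, best) > tolerance:
            best = None  # nothing close enough: leave the word unpaired
        pairs.append((w, best))
    return pairs


# Illustrative data only.
words = [Event("that", 0.3, 0.5), Event("there", 0.9, 1.2), Event("now", 3.0, 3.2)]
gestures = [Event("point_at_cup", 0.25, 0.6), Event("point_at_table", 1.0, 1.3)]
fused = fuse(words, gestures)
```

If the recognizer only reported utterance-level times, every word would share one interval and the two pointing gestures could not be told apart, which is the synchronization problem the slide describes.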

Common technical challenges: Output Planning / Scheduling

• Planning
  – Integration of conversational behavior planning with action planning
    • can't gesture the same way when pouring tea or firing a gun
  – Generate-and-filter approach (BEAT) vs. full planning (search) (PPP Persona)
  – Incremental generation and execution
• Scheduling / coordination
  – Must take available function-to-behavior mappings into account
    • Initially explored in Gandalf (Thorisson)
  – TTS callbacks vs. phoneme timings

Common technical challenges: Animation

• Synchronization
  – Most effortful part of gesture must precede or coincide with pitch peak
  – Speech or gesture can be elongated to reach the synchronization goal
• Co-articulation
  – Some gestures can be overridden
  – Which phases are collapsed?
• Parameters
  – Gestural emphasis (EMOTE)
  – Idiosyncrasy of conversational gesture
• Animation
  – Key parts of gesture (handshape, trajectory) must be maintained
  – Motion between segments must be smoothed for naturalness
    • Blending/transitioning of modular procedural and mocap animations
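The synchronization constraint (the most effortful part of the gesture at or before the pitch peak) can be sketched as a small timing calculation. This is an assumption-laden illustration: the function name, the single "preparation" duration, and the choice to delay speech when there is too little lead time are all hypothetical, and delaying is only one simple way of stretching the speech side.

```python
from typing import Tuple


def schedule_gesture(pitch_peak: float, prep: float,
                     earliest_start: float = 0.0) -> Tuple[float, float]:
    """Return (gesture_start, speech_delay) in seconds.

    pitch_peak: time of the pitch accent peak.
    prep: duration of the gesture's preparation phase, after which the
          effortful part of the gesture begins.
    speech_delay: how long speech must be held back when the preparation
          cannot finish before the pitch peak.
    """
    start = pitch_peak - prep
    if start >= earliest_start:
        return start, 0.0
    # Not enough lead time: start as early as allowed and delay speech
    # by the shortfall so the peak still lines up with the gesture.
    return earliest_start, earliest_start - start


on_time = schedule_gesture(pitch_peak=0.99, prep=0.4)
too_late = schedule_gesture(pitch_peak=0.2, prep=0.5)
```

In the second call the preparation would need to start before the utterance does, so the sketch reports a delay instead of a negative start time.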

Common technical challenges: Multi-party Conversation

• Multiple humans and/or multiple agents
• Input processing and multimodal integration
• Must track participation framework (Goffman)
  – Current speaker
  – Addressee
  – Hearers (ratified and others)
• Must generate behaviors for non-speaker(s)
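One way to picture the participation framework is as a small record of roles that is updated whenever the floor changes. The class below is a hypothetical sketch; the field names and the rule that the previous speaker and addressee become ratified hearers are assumptions for illustration, not a description of any GNL system.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ParticipationFramework:
    speaker: str
    addressee: str
    ratified_hearers: Set[str] = field(default_factory=set)
    other_hearers: Set[str] = field(default_factory=set)  # e.g. overhearers

    def take_turn(self, new_speaker: str, new_addressee: str) -> None:
        """Reassign roles when the floor changes."""
        ratified = self.ratified_hearers | {self.speaker, self.addressee}
        self.speaker = new_speaker
        self.addressee = new_addressee
        self.ratified_hearers = ratified - {new_speaker, new_addressee}
        self.other_hearers -= {new_speaker, new_addressee}

    def non_speakers(self) -> Set[str]:
        """Everyone who needs listener behavior generated for them."""
        return ({self.addressee} | self.ratified_hearers
                | self.other_hearers) - {self.speaker}


pf = ParticipationFramework("Rea", "Tim", {"Ann"}, {"Bob"})
pf.take_turn("Tim", "Ann")
```

Keeping the roles explicit makes the last bullet concrete: `non_speakers()` is the set of characters for which gaze, nods, or posture still have to be produced while someone else talks.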

Common technical challenges: Evaluation

• Is the model of human behavior correct?
• Is the system correctly/efficiently/elegantly implemented?
• Does the interface succeed:
  – Do (which) people like/trust/enjoy the interface?
  – Does it make work easier/more efficient/better?
  – What aspects of embodiment are most powerful?

Wouldn't it be great if… (assume someone else is working on)

• Tightly synchronized 3D animation and TTS module
• Quality TTS with full intonation control and timing for output
• Speaker-independent conversational continuous ASR for children and adults
• Gesture recognition (including handshape ...)
  – And other mythological “reliable” input modalities
• MSAgent allowed speech-coordinated animations

Where we need help

• More studies showing when nonverbal behaviors are used in a wide range of domains and contexts

• How communicative behavior is modulated based on personality, mood, affect and other idiosyncrasies

Some future paradigm shifts

• From speech-act-oriented pipeline architectures to dialogic models encompassing mutual beliefs and goals
  – when simulating conversation you can't just focus on the output
  – need representation of dialogue games/joint projects, e.g., to signal dispreferred moves nonverbally
• From simplistic models of speech-act-induced obligations to more general models of accommodation
  – must work from full discourse plan to generate all appropriate verbal and nonverbal behavior
• From Gricean "maximally-informative" task-oriented conversation to more general models involving multiple frames of interaction
  – frame switches are often accomplished nonverbally (contextualization cues)
  – rules of interpretation and generation change in each frame
  – task talk is joined by relationship- and other kinds of social talk

Towards standards and infrastructure

• XML representations of multimodal messages
  – BEAT
  – Jean-Claude Martin, et al. (annotation framework; AAMAS01)
  – Working Group on Multimodal Meaning Representation (Dagstuhl workshop)
  – SIGSEM working group on meaning representations
  – VoiceXML (W3C)
  – Human Behavior Representation
• Representations of multimodal semantics
  – XSLT; Java rules (BEAT)
  – TAG trees (REA-SPUD, RUTH)

Who’s doing what (partial!)

• Conversational openings and closings (Body Chat)
• Head movements (Pelachaud, Stone & deCarlo)
• Intonation (Steedman, Stone)
• Iconic gestures (Rea)
• Deictic references (gestures/verbal/locomotion): Cosmo, Steve
• Turn taking (Rea)
• Backchannels (Gandalf, Nakano, Rickel & Traum)
• Postures (Rea)
• Integration of conversation and physical tasks for collaboration in 3D worlds (Steve)

Some GNL Virtual Humans

Sam

SPARK

Mack

Grandchair

Rea

GNL Approach

• Where do NV occur WRT discourse structure?
  – Information structure
  – Topic/intentional structure

• Where do NV occur WRT conversational structure?

• How to represent the functions in discourse and conversation played by the verbal & non-verbal modalities?

• Which semantic features tend to be conveyed by NV and which by speech?

• When do NV and speech convey the same features and when do they convey different features?

• What is the morphology of NV?

• How do we evaluate under what conditions & to whom verbal & NV are useful to interactive systems?

Current GNL approaches: Discourse

• FEMBOT model
• Generating speech & gesture from underlying discourse representation
• Generating speech & facial, head, hands from conversational structure
• RUTH (Stone & deCarlo): eyes, intonation, eyebrows from discourse

[Figure: REA]

Current GNL approaches: Small Talk

• How to engage in social dialogue?
  – Implicature, nonverbal conveyance of meaning, contextualization cues play a big role.
  – How to represent, assess and update relationship status?

[Figure: User and Agent; Relational Model (Familiarity, Solidarity, Power, Affect); Discourse Planner; Interpretation; Conversational Moves; Relationship Assessment; Agent Goals]

Current GNL approaches: Language in the Real World

• How to do direction-giving
  – speech only vs.
  – map deictics vs.
  – character-viewpoint-based spatial gestures vs.
  – observer-viewpoint-based spatial gestures
• How to do multimodal grounding
  – gaze cues (using IBM Blue Eyes)
  – backchannel feedback
  – contingent responses
• Shared Reality
  – collaborate with physical objects

Cassell, J., Nakano, Y., Bickmore, T., Sidner, C., Rich, C. (2001). “Non-Verbal Cues for Discourse Structure.” Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.

Current GNL approaches: Posture

• Posture shifts occur as a function of focus stack changes (discourse segment) and turns
• Use Collagen to track discourse segment
  – but conversational structure was difficult in Collagen
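The link between focus stack changes and posture shifts can be sketched with a plain stack that logs a suggested shift whenever a discourse segment is opened or closed. This is a hypothetical illustration in Python: it does not use Collagen's API, and the deterministic "one shift per stack change" rule is an assumption made to keep the example small.

```python
from typing import List


class FocusStack:
    """Toy focus stack that suggests posture shifts at segment boundaries."""

    def __init__(self) -> None:
        self.segments: List[str] = []
        self.behaviors: List[str] = []  # log of suggested nonverbal behaviors

    def push(self, segment: str) -> None:
        """Open a new discourse segment and suggest a posture shift."""
        self.segments.append(segment)
        self.behaviors.append(f"POSTURE_SHIFT(start:{segment})")

    def pop(self) -> str:
        """Close the current discourse segment and suggest a posture shift."""
        segment = self.segments.pop()
        self.behaviors.append(f"POSTURE_SHIFT(end:{segment})")
        return segment


fs = FocusStack()
fs.push("greeting")
fs.push("directions")
fs.pop()
```

A fuller model would also react to turn changes, which the slide names as the second trigger.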

Current GNL approaches: Interactivity

• Maintain engaging, child-centered interaction with minimal input information

• Keyword spotting
• Shared Reality

Current GNL approaches: Multi-party

• Referring expressions (Map deictics & NPs)
• Conversational group behavior
• Dynamic participation status
• Integration of different control methods

Current GNL approaches: Animation

• Homegrown humanoid animation: Pantomime
  – Plug-in behavior modules (keyframe/procedural/dynamic)
  – New authoring tool for defining gestures
• Solves some gesture co-articulation and blending problems
  – Beats defined as relative motion; thus can co-articulate with any other arm/hand motion.
  – Keyframes for hand and arm positions can be specified at start of utterance, with motion smoothed between frames.
    • Doesn’t work with (TTS) event-based animation.
• Continuing debate regarding how much “intelligence” to put in the body (animator)
  – e.g., where are arm and hand ‘relax’ commands issued?
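The idea that a beat defined as relative motion can co-articulate with any other arm motion can be shown with a one-dimensional sketch: a base pose interpolated between keyframes, plus a beat expressed as an offset added on top. Everything here is an assumption for illustration (a single hand-height value, linear interpolation standing in for whatever smoothing Pantomime applies, a triangular beat shape); it is not Pantomime's implementation.

```python
from typing import List, Tuple

Keyframe = Tuple[float, float]  # (time in seconds, hand height in arbitrary units)


def base_pose(keyframes: List[Keyframe], t: float) -> float:
    """Linearly interpolate hand height between keyframes, clamped at the ends."""
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    for (t0, y0), (t1, y1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            return y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    return keyframes[-1][1]


def beat_offset(t: float, beat_time: float,
                width: float = 0.2, depth: float = 0.1) -> float:
    """Downward triangular displacement centred on beat_time, zero elsewhere."""
    distance = abs(t - beat_time)
    return -depth * (1 - distance / width) if distance < width else 0.0


def hand_height(keyframes: List[Keyframe], t: float, beat_time: float) -> float:
    """Base motion plus the relative beat motion."""
    return base_pose(keyframes, t) + beat_offset(t, beat_time)


rising_arm = [(0.0, 0.0), (1.0, 1.0)]
```

Because the beat is an offset rather than an absolute pose, the same `beat_offset` works unchanged whatever the underlying keyframes are.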

BEAT

• Extensible framework for verbal and nonverbal conversational behavior
  – gaze, eyebrow, posture, intonation, head nods, and beat gesture
  – synchronization with turn and information structure
  – Input:
    • prepared script
    • NLG
    • ongoing conversation among human users of a graphical chat system
  – Animation:
    • real-time or off-line
    • event-based or scheduled animation
  – animators can augment and override BEAT’s choices
• XML pipeline architecture
  – supports extensibility and modularity
  – many extensions can be made in XSLT
• Separation of generation and filtering of nonverbal behaviors
  – provides greater range of possible character behavior and allows multiple generation algorithms to be integrated
• Implemented in Java
  – supports portability
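The separation of generation and filtering can be sketched as two lists of functions: generators that each over-suggest behaviors, and filters that prune the pooled suggestions. This is an illustrative Python sketch, not BEAT's Java or XSLT code; the rule contents, the `*` marking for new words, and the priority values are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Behavior = Dict[str, object]  # e.g. {"type": "GEST:BEAT", "word": 3, "priority": 1}
Generator = Callable[[List[str]], List[Behavior]]
Filter = Callable[[List[Behavior]], List[Behavior]]


def beat_generator(words: List[str]) -> List[Behavior]:
    """Hypothetical rule: suggest a beat on every word marked new with '*'."""
    return [{"type": "GEST:BEAT", "word": i, "priority": 1}
            for i, w in enumerate(words) if w.startswith("*")]


def contrast_generator(words: List[str]) -> List[Behavior]:
    """Hypothetical rule: suggest a contrast gesture on 'good' and 'bad'."""
    return [{"type": "GEST:CONTRAST", "word": i, "priority": 2}
            for i, w in enumerate(words) if w.lstrip("*") in ("good", "bad")]


def conflict_filter(behaviors: List[Behavior]) -> List[Behavior]:
    """Keep only the highest-priority gesture suggested for each word."""
    best: Dict[object, Behavior] = {}
    for b in behaviors:
        current = best.get(b["word"])
        if current is None or b["priority"] > current["priority"]:
            best[b["word"]] = b
    return [best[k] for k in sorted(best)]


def run_pipeline(words: List[str], generators: List[Generator],
                 filters: List[Filter]) -> List[Behavior]:
    """Pool every generator's suggestions, then apply the filters in order."""
    suggestions = [b for g in generators for b in g(words)]
    for f in filters:
        suggestions = f(suggestions)
    return suggestions


words = "This is both *good news and *bad news".split()
result = run_pipeline(words, [beat_generator, contrast_generator], [conflict_filter])
```

Adding a new generator or filter means appending one function to a list, which is the extensibility argument the slide makes for keeping the two stages separate.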

BEAT Design

[Figure: BEAT pipeline. Script Feeder, Language Module, Generation Module (generators GEN 1 ... GEN n), Filtering Module (filters FIL 1 ... FIL n), Scheduling Module, Animation Compiler; Knowledge Base and Speech Timing shown as supporting resources.]

BEAT Processing: Script input

“This is both good news and bad news”

BEAT Processing: Language Tagging

[Figure: tag tree over “This is both good news and bad news”. CLAUSE at the top; THEME and RHEME below it; then OBJECT, ACTION, OBJECT, OBJECT; four NEW tags; two CON(1) tags.]

BEAT Processing: Behavior Generation

[Figure: the tagged tree (CLAUSE; THEME and RHEME; NEW; CON(1)) with suggested behaviors added in steps. First step: two beat gestures (GEST:BEAT, GEST:BEAT). Full set: TONE:BREAK, TONE:ENDHI, TONE:ENDLO; ACCT:HI (four); EYEBROWS:RAISED (two); GAZE:AWAY, GAZE:HEARER; HD:NOD (three); GEST:CON_R, GEST:CON_L; GEST:BEAT (two).]

BEAT Processing: Behavior Filtering

[Figure: behaviors during filtering. First step: TONE:BREAK; EYEBROWS:RAISED (two); ACCT:HI; GEST:CON_R, GEST:CON_L; GEST:BEAT (two). Second step: the same set without the two GEST:BEAT gestures.]

BEAT Processing: Behavior Scheduling

[Figure: the filtered behaviors (TONE:BREAK; EYEBROWS:RAISED (two); ACCT:HI; GEST:CON_R, GEST:CON_L) annotated with times in seconds. Intervals: 0.00 - 2.38; 0.00 - 0.51 and 0.80 - 2.38; 0.80 - 1.39 and 1.79 - 2.38. Time points: 0.39, 0.80, 0.99, 1.79.]

BEAT Processing: Animation Compilation

<AnimationScript SPEAKER="AGENT" HEARER="USER">
  <START SPEECH="This is both good news and bad news">
  <START ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.0">
  <START ACTION="VISEME" TYPE="B" SRT="0.0">
  <STOP ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.511">
  <START ACTION="VISEME" TYPE="A" SRT="0.511">
  <START ACTION="GAZE" DIRECTION="TOWARDS_HEARER" SRT="0.801">
  <START ACTION="EYEBROWS" SRT="0.801">
  <START ACTION="GESTURE_RIGHT" TYPE="CONTRAST_1" RIGHT_TRAJECTORY="CONTRAST_TRAJECTORY" RIGHT_HANDSHAPE="CONTRAST" SRT="0.801">
  …

[Figure: BEAT pipeline with a Maya MEL Compiler as the final stage.]
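As a rough sketch of how scheduled behaviors could be turned into a time-ordered script like the AnimationScript excerpt, the Python example below maps each behavior's word span onto word timings and emits START and STOP events sorted by time. It is a simplification under stated assumptions: the word timings and behaviors are made-up values, the output uses flat self-closing elements, and none of this is BEAT's actual compiler.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# (action, extra attributes, first word index, last word index)
Behavior = Tuple[str, Dict[str, str], int, int]


def compile_script(speech: str, word_times: List[Tuple[float, float]],
                   behaviors: List[Behavior]) -> ET.Element:
    """Build an AnimationScript-like element from behaviors and word timings."""
    root = ET.Element("AnimationScript", SPEAKER="AGENT", HEARER="USER")
    ET.SubElement(root, "START", SPEECH=speech)
    events = []
    for action, attrs, first, last in behaviors:
        events.append((word_times[first][0], "START", action, attrs))
        events.append((word_times[last][1], "STOP", action, attrs))
    # Sort by time only; the sort is stable, so ties keep insertion order.
    for time, tag, action, attrs in sorted(events, key=lambda e: e[0]):
        ET.SubElement(root, tag, ACTION=action, SRT=f"{time:.3f}", **attrs)
    return root


# Hypothetical per-word (start, end) times in seconds for the eight words.
word_times = [(0.0, 0.2), (0.2, 0.5), (0.8, 1.0), (1.0, 1.4),
              (1.4, 1.8), (1.8, 2.0), (2.0, 2.2), (2.2, 2.4)]
behaviors = [
    ("GAZE", {"DIRECTION": "AWAY_FROM_HEARER"}, 0, 1),
    ("GAZE", {"DIRECTION": "TOWARDS_HEARER"}, 2, 7),
]
root = compile_script("This is both good news and bad news", word_times, behaviors)
xml_text = ET.tostring(root, encoding="unicode")
```

The sketch depends on per-word timing being available, which ties back to the earlier point that many ASR and TTS components do not expose it.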

BEAT Example: Maya Compilation


BEAT Tag Pass-Through

“You <smile> make me smile! </smile>”

<AnimationScript SPEAKER="AGENT" HEARER="USER">
  <START SPEECH="You make me smile!">
  <START ACTION="VISEME" TYPE="A" SRT="0.511">
  …
  <START ACTION="SMILE" SRT="0.801">
  <START ACTION="GAZE" DIRECTION="TOWARDS_HEARER" SRT="0.801">
  <START ACTION="EYEBROWS" SRT="0.901">
  …
  <STOP ACTION="SMILE" SRT="1.5">

BEAT Example: Animation Instructions

[Figure: BEAT pipeline variant ending in a Text Annotation stage. Script Feeder, Language Module, Generation Module (GEN 1 ... GEN n), Filtering Module (FIL 1 ... FIL n), Text Annotation; Knowledge Base.]

“You just need [1 to *type in ] { a [2 *line ] like }”

[1] ICONIC - Typing action
[2] BEAT - Emphasis

* Pitch accent
{} Raised eyebrows
. . Gaze away
__ Gaze towards

BEAT Example: Dope Sheet Generation

BEAT Recent Developments

• Infer topic shifts from discourse markers
  – Posture shift rules and character-specific posture state machines
• Gantt Chart Compiler
• In Progress
  – Participation framework for multi-party conversation
  – Level of Detail
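Inferring a topic shift from a discourse marker can be as simple as checking how an utterance opens. The sketch below is hypothetical: the marker list is an assumption chosen for illustration and is not taken from BEAT, and a real rule set would need more than a prefix test.

```python
import re

# Assumed marker list for illustration; not taken from BEAT.
TOPIC_SHIFT_MARKERS = ("anyway", "so", "now", "by the way", "ok")


def signals_topic_shift(utterance: str) -> bool:
    """True if the utterance opens with one of the assumed discourse markers."""
    text = utterance.strip().lower()
    # \b prevents "so" from matching "some" or "now" from matching "nowhere".
    return any(re.match(rf"{re.escape(m)}\b", text) for m in TOPIC_SHIFT_MARKERS)
```

A positive result could then feed a posture shift rule, with the character-specific state machine deciding which shift to perform.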

BEAT Recent Developments: Spatial BEAT

• How to describe arbitrary spatial relationships?
  – Requires explicit representation of speaker's intent
  – e.g., to introduce a set of objects vs. describe their relative configuration

“There is a table with a glass of water on it.”

• VRML scene is parsed into a set of spatial primitives.
• BEAT instructed to describe the configuration of objects.
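To make the step from spatial primitives to a description concrete, here is a minimal sketch that tests an "on" relation between two bounding boxes and produces the example sentence. The `Box` type, the two-dimensional simplification, and the sentence templates are assumptions for illustration; no VRML parsing is shown and this is not Spatial BEAT's representation.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box standing in for a parsed scene object (assumed)."""
    name: str
    x_min: float
    x_max: float
    y_min: float  # y is up
    y_max: float


def is_on(a: Box, b: Box, eps: float = 0.01) -> bool:
    """True if a rests on b: a's bottom touches b's top and they overlap in x."""
    touching = abs(a.y_min - b.y_max) <= eps
    overlapping = a.x_min < b.x_max and b.x_min < a.x_max
    return touching and overlapping


def describe(a: Box, b: Box) -> str:
    """Describe the configuration if a is on b, otherwise just introduce both."""
    if is_on(a, b):
        return f"There is a {b.name} with a {a.name} on it."
    return f"There is a {b.name} and a {a.name}."


table = Box("table", 0.0, 2.0, 0.0, 1.0)
glass = Box("glass of water", 0.5, 0.7, 1.0, 1.2)
sentence = describe(glass, table)
```

The two templates mirror the distinction on the slide between introducing a set of objects and describing their relative configuration.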

The VHuman Team at the MIT Media Lab

Yukiko Nakano, Yang Gao, Kimiko Ryokai, Ian Gouldstone, Dilbert Doyle, Tim Bickmore, Cati Vaucelle, Hannes Vilhjalmsson, Tom Stocky, Justine Cassell

Virtual humans pictured: Sam, REA, MACK
