Conversational Behavior for Virtual Humans
Tim Bickmore, Justine Cassell, Hannes Vilhjalmsson MIT Media Lab
Overview
• Phenomena of interest
• Common challenges
• Wouldn't it be great if...
• Some future paradigm shifts
• Towards standards and infrastructure
• Who’s doing what
• Current GNL approaches to challenges
Phenomena of interest
gaze, head movement, intonation, eyebrow movement
posture shifts, hand gesture
Phenomena of interest
- Multi-party
- Social Talk
- Reference to physical world
Common challenges: Theory-building / Behavior modeling
• What do we need to know about human body movements during talk?
• How do we get it?
  – e.g., vision, body suit, head/eye tracking
• How do we describe it?
  – Conversational functions
  – Speech acts
  – Nonverbal behavior
• Naturalistic data collection vs. controlled experiments
Common technical challenges: Discourse & Dialogue
• Generation based on complete discourse plans
  – take speaker intentions into account
  – take listener knowledge and inference ability into account
• Verbal and nonverbal negotiation of conversational tasks (grounding)
• Intonation (re)generated wrt discourse knowledge
• Very tight temporal coupling of nonverbal behavior
  – postural mirroring, turn-taking, grounding
• Generation and recognition of spontaneous iconics and metaphorics
• Cross-cultural/multi-cultural conversational behavior
Common technical challenges: Input
• Multimodal recognition
  – Speech, Intonation, Gesture, Gaze etc.
• Multimodal input fusion
  – redundant info should lead to exponential increase in reliability, but complementary info leads to exponential decrease in reliability
  – synchronization is a big problem
    • requires, at least, very accurate time stamps
    • most ASRs do not provide word or phoneme timing information
• Incremental understanding
  – Required for backchannel feedback
• Representation
  – User’s intent (speech act categories?)
  – Discourse, domain and global knowledge
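The synchronization point under multimodal input fusion can be illustrated with a minimal sketch. It assumes every recognizer delivers events with start and end time stamps in seconds, and it pairs each speech event with the gestures that overlap it in time. The `Event` class, the `fuse` function, the `slack` parameter, and the sample utterance are all hypothetical names and data chosen for illustration; they are not taken from any system named in these slides.

```python
from dataclasses import dataclass


@dataclass
class Event:
    modality: str  # e.g. "speech" or "gesture"
    label: str     # recognized word or gesture type
    start: float   # seconds
    end: float     # seconds


def overlaps(a: Event, b: Event, slack: float = 0.0) -> bool:
    """True if the two intervals overlap, tolerating `slack` seconds of gap."""
    return a.start <= b.end + slack and b.start <= a.end + slack


def fuse(speech, gestures, slack=0.0):
    """Pair each speech event with the gestures that co-occur with it."""
    return [(s, [g for g in gestures if overlaps(s, g, slack)]) for s in speech]


# Illustrative, made-up input: three words and two pointing gestures.
speech = [
    Event("speech", "put", 0.0, 0.3),
    Event("speech", "that", 0.35, 0.6),
    Event("speech", "there", 0.9, 1.2),
]
gestures = [
    Event("gesture", "point", 0.4, 0.7),
    Event("gesture", "point", 1.0, 1.3),
]
pairs = fuse(speech, gestures)
for word, co_gestures in pairs:
    print(word.label, [g.label for g in co_gestures])
```

Without word-level timing from the speech recognizer, the `start` and `end` fields of the speech events cannot be filled in, which is why missing ASR timing information blocks this kind of fusion.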
Common technical challenges: Output Planning / Scheduling
• Planning
  – Integration of conversational behavior planning with action planning
    • can't gesture the same way when pouring tea or firing a gun
  – Generate and filter approach (BEAT) vs. Full planning (search) (PPP Persona)
  – Incremental generation and execution
• Scheduling / coordination
  – Must take available function-to-behavior mappings into account
    • Initially explored in Gandalf (Thorisson)
  – TTS callbacks vs. phoneme timings
Common technical challenges: Animation
• Synchronization
  – Most effortful part of gesture must precede or coincide with pitch peak
  – Speech or gesture can be elongated to reach the synchronization goal
• Co-Articulation
  – Some gestures can be over-ridden
  – What phases are collapsed
• Parameters
  – Gestural emphasis (EMOTE)
  – Idiosyncrasy of conversational gesture
• Animation
  – Key parts of gesture (handshape, trajectory) must be maintained
  – Motion between segments must be smoothed for naturalness
    • Blending/transitioning of modular procedural and mocap animations
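The synchronization constraint (the most effortful part of the gesture must precede or coincide with the pitch peak, and speech can be elongated to get there) can be sketched as a small timing calculation. This is only one possible reading: the function name `schedule_gesture`, its parameters, and the choice to delay speech rather than shorten the gesture are assumptions made for illustration.

```python
def schedule_gesture(pitch_peak: float, preparation: float,
                     earliest_start: float = 0.0):
    """Return (gesture_start, speech_delay) in seconds.

    The gesture needs `preparation` seconds before its most effortful
    part begins. Ideally it starts exactly that long before the pitch
    peak. If that would be earlier than `earliest_start`, the gesture
    starts as early as allowed and speech is delayed by the shortfall.
    """
    ideal_start = pitch_peak - preparation
    if ideal_start >= earliest_start:
        return ideal_start, 0.0
    return earliest_start, earliest_start - ideal_start


# Enough lead time: the gesture simply starts early.
print(schedule_gesture(pitch_peak=1.0, preparation=0.25))
# Not enough lead time: speech is delayed so the peak still lines up.
print(schedule_gesture(pitch_peak=0.25, preparation=0.5))
```

Either timing source mentioned under scheduling (TTS callbacks or phoneme timings) could supply `pitch_peak`; the sketch leaves that choice open.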
Common technical challenges: Multi-party Conversation
• Multiple humans and/or multiple agents
• Input processing and multimodal integration
• Must track participation framework (Goffman)
  – Current speaker
  – Addressee
  – Hearers (ratified and others)
• Must generate behaviors for non-speaker(s)
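A participation framework of this kind can be held in a small data structure that records who is speaking, who is addressed, and who else is present. The sketch below is a hypothetical representation, not the one used in any GNL system: the class name, field names, and the turn-change rule are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ParticipationFrame:
    speaker: str
    addressees: set = field(default_factory=set)
    ratified_hearers: set = field(default_factory=set)
    overhearers: set = field(default_factory=set)  # non-ratified hearers

    def role_of(self, who: str) -> str:
        """Look up the current role of one participant."""
        if who == self.speaker:
            return "speaker"
        if who in self.addressees:
            return "addressee"
        if who in self.ratified_hearers:
            return "ratified hearer"
        if who in self.overhearers:
            return "overhearer"
        return "absent"

    def take_turn(self, new_speaker: str, addressees):
        """Return the frame after a turn change.

        Assumed rule: previously ratified participants who are neither
        the new speaker nor newly addressed become ratified hearers.
        """
        addressed = set(addressees)
        ratified = ({self.speaker} | self.addressees | self.ratified_hearers)
        ratified = ratified - addressed - {new_speaker}
        others = self.overhearers - addressed - {new_speaker}
        return ParticipationFrame(new_speaker, addressed, ratified, others)


frame = ParticipationFrame("agent", {"alice"}, {"bob"}, {"passerby"})
nxt = frame.take_turn("alice", ["bob"])
for person in ("agent", "alice", "bob", "passerby"):
    print(person, "->", nxt.role_of(person))
```

A behavior generator for non-speakers could then branch on `role_of`, for example choosing different gaze behavior for an addressee than for a ratified hearer.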
Common technical challenges: Evaluation
• Is the model of human behavior correct?
• Is the system correctly/efficiently/elegantly implemented?
• Does the interface succeed:
  – Do (which) people like/trust/enjoy the interface?
  – Does it make work easier/more efficient/better?
  – What aspects of embodiment are most powerful?
Wouldn't it be great if… (assume someone else is working on)
• Tightly synchronized 3D animation and TTS module
• Quality TTS with full intonation control and timing for output
• Speaker-independent conversational continuous ASR for children and adults
• Gesture recognition (including handshape...)
  – And other mythological “reliable” input modalities
• MSAgent allowed speech-coordinated animations
Where we need help
• More studies showing when nonverbal behaviors are used in a wide range of domains and contexts
• How communicative behavior is modulated based on personality, mood, affect and other idiosyncrasies
Some future paradigm shifts
• From speech-act-oriented pipeline architectures to dialogic models encompassing mutual beliefs and goals
  – when simulating conversation you can't just focus on the output
  – need representation of dialogue games/joint projects, e.g., to signal dispreferred moves nonverbally
• From simplistic models of speech-act-induced obligations to more general models of accommodation
  – must work from full discourse plan to generate all appropriate verbal and nonverbal behavior
• From Gricean "maximally-informative" task-oriented conversation to more general models involving multiple frames of interaction
  – frame switches are often accomplished nonverbally (contextualization cues)
  – rules of interpretation and generation change in each frame
  – task talk is joined by relationship- and other kinds of social talk
Towards standards and infrastructure
• XML representations of multimodal messages
  – BEAT
  – Jean-Claude Martin, et al. (annotation framework; AAMAS01)
  – Working Group on Multimodal Meaning Representation (Dagstuhl workshop)
  – SIGSEM working group on meaning representations
  – VoiceXML (W3C)
  – Human Behavior Representation
• Representations of multimodal semantics
  – XSLT; Java rules (BEAT)
  – TAG trees (REA-SPUD, RUTH)
Who’s doing what (partial!)
• Conversational openings and closings (Body Chat)
• Head movements (Pelachaud, Stone & deCarlo)
• Intonation (Steedman, Stone)
• Iconic gestures (Rea)
• Deictic references (gestures/verbal/locomotion): Cosmo, Steve
• Turn taking (Rea)
• Backchannels (Gandalf, Nakano, Rickel & Traum)
• Postures (Rea)
• Integration of conversation and physical tasks for collaboration in 3D worlds (Steve)
Some GNL Virtual Humans
Sam
SPARK
Mack
Grandchair
Rea
GNL Approach
• Where do NV occur WRT discourse structure?
  – Information structure
  – Topic/intentional structure
• Where do NV occur WRT conversational structure?
• How to represent the functions in discourse and conversation played by the verbal & non-verbal modalities?
• Which semantic features tend to be conveyed by NV and which by speech?
• When do NV and speech convey the same features and when do they convey different features?
• What is the morphology of NV?
• How do we evaluate under what conditions & to whom verbal & NV are useful to interactive systems?
• FEMBOT model
• Generating speech & gesture from underlying discourse representation
• Generating speech & facial, head, hands from conversational structure
• RUTH (Stone & deCarlo): eyes, intonation, eyebrows from discourse
Current GNL approaches: Discourse
REA
Current GNL approaches: Small Talk
• How to engage in social dialogue?
  – Implicature, nonverbal conveyance of meaning, contextualization cues play a big role.
  – How to represent, assess and update relationship status?
[Figure: relational model diagram. Labels: User, Agent; Relational Model (Familiarity, Solidarity, Power, Affect); Discourse Planner; Interpretation; Conversational Moves; Relationship Assessment; Agent Goals]
Current GNL approaches: Language in Real World
• How to do direction-giving
  – speech only vs.
  – map deictics vs.
  – character-viewpoint-based spatial gestures vs.
  – observer-viewpoint-based spatial gestures
• How to do multimodal grounding
  – gaze cues (using IBM Blue Eyes)
  – backchannel feedback
  – contingent responses
• Shared Reality
  – collaborate with physical objects
Cassell, J., Nakano, Y., Bickmore, T., Sidner, C., Rich, C. (2001). “Non-Verbal Cues for Discourse Structure.” Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics.
Current GNL approaches: Posture
• Posture shifts occur as a function of focus stack changes (discourse segment) and turns
• Use Collagen to track discourse segment
  – but conversational structure was difficult in Collagen
Current GNL approaches: Interactivity
• Maintain engaging, child-centered interaction with minimal input information
• Keyword spotting
• Shared Reality
Current GNL approaches: Multi-party
• Referring expressions (Map deictics & NPs)
• Conversational group behavior
• Dynamic participation status
• Integration of different control methods
Current GNL approaches: Animation
• Homegrown humanoid animation: Pantomime
  – Plug-in behavior modules (keyframe/procedural/dynamic)
  – New authoring tool for defining gestures
• Solves some gesture co-articulation and blending problems
  – Beats defined as relative motion; thus can co-articulate with any other arm/hand motion.
  – Keyframes for hand and arm positions can be specified at start of utterance, with motion smoothed between frames.
• Doesn’t work with (TTS) event-based animation.
• Continuing debate regarding how much “intelligence” to put in the body (animator)
  – e.g., where are arm and hand ‘relax’ commands issued?
BEAT
• extensible framework for verbal and nonverbal conversational behavior
  – gaze, eyebrow, posture, intonation, head nods, and beat gesture
  – synchronization with turn and information structure
  – Input:
    • prepared script
    • NLG
    • ongoing conversation among human users of a graphical chat system
  – Animation:
    • real-time or off-line
    • event-based or scheduled animation
  – animators can augment and override BEAT’s choices
• XML Pipeline Architecture
  – supports extensibility and modularity
  – many extensions can be made in XSLT
• Separation of generation and filtering of nonverbal behaviors
  – provides greater range of possible character behavior and allows multiple generation algorithms to be integrated
• Implemented in Java
  – supports portability
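The separation of generation and filtering can be sketched in a few lines. The example below is written in Python for brevity (BEAT itself is implemented in Java with XML and XSLT), and it is not BEAT's actual code or API. The `Suggestion` fields, the `channel` and `priority` notions, and the conflict rule are assumptions made for illustration: independent generators each propose behaviors, and a filter then drops lower-priority proposals that compete for the same body channel on the same words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    behavior: str  # e.g. "GEST:BEAT"
    channel: str   # body resource used, e.g. "hands" (assumed field)
    start: int     # index of first word covered
    end: int       # index of last word covered
    priority: int  # higher wins conflicts (assumed field)


def beat_generator(words, new_words):
    """Suggest a beat gesture on every word marked as new."""
    return [Suggestion("GEST:BEAT", "hands", i, i, 1)
            for i, w in enumerate(words) if w in new_words]


def contrast_generator(words, contrast_words):
    """Suggest a contrast gesture on every contrasted word."""
    return [Suggestion("GEST:CONTRAST", "hands", i, i, 2)
            for i, w in enumerate(words) if w in contrast_words]


def conflict_filter(suggestions):
    """Keep the highest-priority suggestion wherever a channel is contested."""
    kept = []
    for s in sorted(suggestions, key=lambda s: -s.priority):
        clash = any(k.channel == s.channel and s.start <= k.end
                    and k.start <= s.end for k in kept)
        if not clash:
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def run_pipeline(words, generators, filters):
    """Collect suggestions from every generator, then apply each filter."""
    suggestions = [s for gen in generators for s in gen(words)]
    for flt in filters:
        suggestions = flt(suggestions)
    return suggestions


words = "This is both good news and bad news".split()
generators = [
    lambda w: beat_generator(w, {"good", "bad"}),
    lambda w: contrast_generator(w, {"good", "bad"}),
]
result = run_pipeline(words, generators, [conflict_filter])
for s in result:
    print(s.behavior, words[s.start])
```

Because generators only propose and filters only prune, a new generation algorithm can be added to the list without touching the others, which is the extensibility argument made on this slide.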
BEAT Design
[Figure: BEAT pipeline diagram. Modules: Script Feeder, Language Module, Generation Module (generators GEN 1, GEN 2, … GEN n), Filtering Module (filters FIL 1, FIL 2, … FIL n), Scheduling Module, Animation Compiler; resources: Knowledge Base, Speech Timing]
“This is both good news and bad news”
BEAT Processing: Script input
[Figure: tagged tree for “This is both good news and bad news”. Node labels by level: CLAUSE; THEME, RHEME; OBJECT, ACTION, OBJECT, OBJECT; NEW ×4; CON(1) ×2]
BEAT Processing: Language Tagging
BEAT Processing: Behavior Generation
[Figure: tagged tree for “This is both good news and bad news” (CLAUSE; THEME, RHEME; OBJECT, ACTION, OBJECT, OBJECT; NEW ×4; CON(1) ×2) with two suggested beat gestures: GEST:BEAT, GEST:BEAT]
BEAT Processing: Behavior Generation
[Figure: behaviors suggested for “This is both good news and bad news”. Intonation: TONE:BREAK, TONE:ENDHI, TONE:ENDLO, ACCT:HI ×4; eyebrows: EYEBROWS:RAISED ×2; gaze: GAZE:AWAY, GAZE:HEARER; head: HD:NOD ×3; gesture: GEST:CON_R, GEST:CON_L, GEST:BEAT ×2]
BEAT Processing: Behavior Generation
BEAT Processing: Behavior Filtering
[Figure: behaviors for “This is both good news and bad news” without the beat gestures. Intonation: TONE:BREAK, TONE:ENDHI, TONE:ENDLO, ACCT:HI ×4; eyebrows: EYEBROWS:RAISED ×2; gaze: GAZE:AWAY, GAZE:HEARER; head: HD:NOD ×3; gesture: GEST:CON_R, GEST:CON_L]
BEAT Processing: Behavior Filtering
BEAT Processing: Behavior Scheduling
[Figure: scheduled behavior timings, in seconds: 0.80; 0.00-0.51; 0.80-2.38; 0.80-1.39; 1.79-2.38; 0.39; 0.99; 1.79; 0.00-2.38]
BEAT Processing: Behavior Scheduling
<AnimationScript SPEAKER="AGENT" HEARER="USER">
<START SPEECH="This is both good news and bad news">
<START ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.0">
<START ACTION="VISEME" TYPE="B" SRT="0.0">
<STOP ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.511">
<START ACTION="VISEME" TYPE="A" SRT="0.511">
<START ACTION="GAZE" DIRECTION="TOWARDS_HEARER" SRT="0.801">
<START ACTION="EYEBROWS" SRT="0.801">
<START ACTION="GESTURE_RIGHT" TYPE="CONTRAST_1" RIGHT_TRAJECTORY="CONTRAST_TRAJECTORY" RIGHT_HANDSHAPE="CONTRAST" SRT="0.801">
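An animation script in this shape can be turned into a time-ordered event list with a small reader. The sketch below is hypothetical and is not part of BEAT: it uses regular expressions rather than an XML parser because the START and STOP tags on the slide are not closed, and it assumes that SRT is a time offset in seconds, which the slides do not state. The function name `parse_script` and the tuple layout are illustrative choices.

```python
import re

# Matches <START ...> and <STOP ...> tags and captures their attribute text.
TAG = re.compile(r"<(START|STOP)\s+([^>]*)>")
# Matches NAME="value" pairs inside a tag.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_script(text):
    """Return (time, kind, action, attributes) tuples sorted by time.

    Tags without an SRT attribute (such as the SPEECH tag) are skipped.
    """
    events = []
    for kind, body in TAG.findall(text):
        attrs = dict(ATTR.findall(body))
        if "SRT" in attrs:
            events.append((float(attrs["SRT"]), kind,
                           attrs.get("ACTION", ""), attrs))
    return sorted(events, key=lambda e: e[0])


script = """<AnimationScript SPEAKER="AGENT" HEARER="USER">
<START SPEECH="This is both good news and bad news">
<START ACTION="GAZE" DIRECTION="TOWARDS_HEARER" SRT="0.801">
<START ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.0">
<STOP ACTION="GAZE" DIRECTION="AWAY_FROM_HEARER" SRT="0.511">"""

events = parse_script(script)
for time, kind, action, attrs in events:
    print(f"{time:5.3f} {kind:5} {action} {attrs.get('DIRECTION', '')}")
```

A scheduled animation back end could walk this list in order, while an event-based back end would instead wait for speech callbacks; the sketch only covers the scheduled reading.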
BEAT Processing: Animation Compilation
[Figure: BEAT pipeline diagram. Modules: Script Feeder, Language Module, Generation Module (generators GEN 1, GEN 2, … GEN n), Filtering Module (filters FIL 1, FIL 2, … FIL n), Scheduling Module, Maya MEL Compiler; resources: Knowledge Base, Speech Timing]
BEAT Example: Maya Compilation
“You <smile> make me smile! </smile>”
<AnimationScript SPEAKER="AGENT" HEARER="USER">
<START SPEECH="You make me smile!">
<START ACTION="VISEME" TYPE="A" SRT="0.511">
…
<START ACTION="SMILE" SRT="0.801">
<START ACTION="GAZE" DIRECTION="TOWARDS_HEARER" SRT="0.801">
<START ACTION="EYEBROWS" SRT="0.901">
…
<STOP ACTION="SMILE" SRT="1.5">
BEAT Tag Pass-Through
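[Figure: BEAT pipeline diagram for text annotation. Modules: Script Feeder, Language Module, Generation Module (generators GEN 1, GEN 2, … GEN n), Filtering Module (filters FIL 1, FIL 2, … FIL n), Text Annotation; resource: Knowledge Base]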
“You just need [1 to *type in ] { a [2 *line ] like }”
[1] ICONIC - Typing action
[2] BEAT - Emphasis
* Pitch accent
{} Raised eyebrows
. . Gaze away
__ Gaze towards
BEAT Example: Animation Instructions
BEAT Example: Dope Sheet Generation
BEAT Recent Developments
• Infer topic shifts from discourse markers
  – Posture shift rules and character-specific posture state machines
• Gantt Chart Compiler
• In Progress
  – Participation framework for multi-party conversation
  – Level of Detail
BEAT Recent Developments: Spatial BEAT
• How to describe arbitrary spatial relationships?
  – Requires explicit representation of speaker's intent
  – e.g., to introduce a set of objects vs. describe their relative configuration
“There is a table with a glass of water on it.”
• VRML scene is parsed into a set of spatial primitives.
• BEAT instructed to describe the configuration of objects.
Yukiko Nakano, Yang Gao, Kimiko Ryokai, Ian Gouldstone, Sam
REA, Dilbert Doyle, Tim Bickmore, Cati Vaucelle
Hannes Vilhjalmsson, Tom Stocky, Justine Cassell, MACK
The VHuman Team at the MIT Media Lab