Multimodal Expressive Embodied Conversational Agents. Université Paris 8. Catherine Pelachaud, Elisabetta Bevacqua, Nicolas Ech Chafai (FT), Maurizio Mancini, Magalie Ochs (FT), Christopher Peters, Radek Niewiadomski

Author: anissa-stephens

Post on 01-Jan-2016






  • Multimodal Expressive Embodied Conversational Agents

    Université Paris 8

    Catherine Pelachaud, Elisabetta Bevacqua, Nicolas Ech Chafai (FT), Maurizio Mancini, Magalie Ochs (FT), Christopher Peters, Radek Niewiadomski

  • ECA Capabilities

    Anthropomorphic autonomous figures
    New form of human-machine interaction
    Study of human communication, human-human interaction
    ECAs ought to be endowed with dialogic and expressive capabilities
    Perception: an ECA must be able to pay attention to and perceive the user and the context she is placed in.

  • ECA Capabilities

    Interaction:
      speaker and addressee emit signals
      speaker perceives feedback from addressee
      speaker may decide to adapt to addressee's feedback
      consider social context
    Generation: expressive, synchronized visual and acoustic behaviors
      produce expressive behaviours: words, voice, intonation, gaze, facial expression, gesture, body movements, body posture

  • Synchrony tool - BEAT Cassell et al, Media Lab MIT

    Decomposition of text into theme and rheme
    Linked to WordNet
    Computation of: intonation, gaze, gesture

  • Virtual Training Environments: MRE (J. Gratch, L. Johnson, S. Marsella, USC)

  • Interactive System

    Real estate agent
    Gesture synchronized with speech and intonation
    Small talk
    Dialog partner

  • MAX, S. Kopp, U of Bielefeld

    Gesture understanding and imitation

  • Gilbert and George at the Bank (Upenn, 1994)

  • Greta

  • Problem to Be Solved

    Human communication is endowed with three devices to express communicative intention:
      Verbs and formulas
      Intonation and paralinguistics
      Facial expression, gaze, gesture, body movement, posture
    Problem: for any communicative act, the Speaker has to decide:
      Which nonverbal behaviors to show
      How to execute them

  • Verbal and Nonverbal Communication

    Suppose I want to advise a friend to take her umbrella because it is raining. Which signals do I use?
    Verbal signal: use of a syntactically complex sentence: "Take your umbrella because it is raining"
    Verbal + nonverbal signals: "Take your umbrella" + point to the window to show the rain by a gesture or by gaze

  • Multimodal Signals

    The whole body communicates by using:
      Verbal acts (words and sentences)
      Prosody, intonation (nonverbal vocal signals)
      Gesture (hand and arm movements)
      Facial action (smile, frown)
      Gaze (eyes and head movements)
      Body orientation and posture (trunk and leg movements)
    All these systems of signals have to cooperate in expressing the overall meaning of the communicative act.

  • Multimodal Signals

    Accompany flow of speech
    Synchronized at the verbal level
    Punctuate accented phonemic segments and pauses
    Substitute for word(s)
    Emphasize what is being said
    Regulate the exchange of speaking turn

  • Synchronization

    There exists an isomorphism between patterns of speech, intonation and facial actions
    Different levels of synchrony:
      Phoneme level (blink)
      Word level (eyebrow)
      Phrase level (hand gesture)
    Interactional synchrony: synchrony between speaker and addressee

  • Taxonomy of Communicative Functions (I. Poggi)

    The speaker may provide three broad types of information:
      Information about the world: deictic, iconic (adjectival), ...
      Information about the speaker's mind:
        belief (certainty, adjectival)
        goal (performative, rheme/theme, turn-system, belief relation)
        emotion
        meta-cognitive
      Information about the speaker's identity (sex, culture, age)

  • Multimodal Signals (Isabella Poggi)

    Characterization of multimodal signals by their placement with respect to the linguistic utterance and their significance in transmitting information. E.g.:
      Raised eyebrow may signal surprise, emphasis, question mark, suggestion
      Smile may express happiness, be a polite greeting, be a backchannel signal
    Two pieces of information are needed to characterize a multimodal signal:
      Its meaning
      Its visual action

  • Lexicon = (meaning, signal)

    Expression meaning
      deictic: this, that, here, there
      adjectival: small, difficult
      certainty: certain, uncertain
      performative: greet, request
      topic comment: emphasis
      belief relation: contrast, ...
      turn allocation: take/give turn
      affective: anger, fear, happy-for, sorry-for, envy, relief, ...
    Expression signal
      deictic: gaze direction
      certainty: certain: palm up open hand; uncertain: raised eyebrow
      adjectival: small eye aperture
      belief relation: contrast: raised eyebrow
      performative: suggest: small raised eyebrow, head aside; assert: horizontal ring
      emotion: sorry-for: head aside, inner eyebrow up; joy: raising fist up
      emphasis: raised eyebrows, head nod, beat

  • Representation Language

    Affective Presentation Markup Language (APML)
    Describes the communicative functions
    Works at the meaning level, not at the signal level

    Good Morning, Angela. It is so wonderful to see you again. I was sure we would do so, one day!

  • Facial Description Language

    Facial expressions defined as (meaning, signal) pairs stored in a library
    Hierarchical set of classes:
      Facial basis (FB) class: basic facial movement
      An FB may be represented as a set of MPEG-4 compliant FAPs or, recursively, as a combination of other FBs using the `+' operator:
        FB = {fap3 = v1, ..., fap69 = vk};
        FB' = c1*FB1 + c2*FB2;
      where c1 and c2 are constants and FB1 and FB2 can be:
        Previously defined FBs
        FBs of the form {fap3 = v1, ..., fap69 = vk}

  • Facial basis class

    Examples of facial basis class:
      Eyebrow: small_frown, left_raise, right_raise
      Eyelid: upper_lid_raise
      Mouth: left_corner_stretch, left_corner_raise

  • Facial Displays

    Every facial display (FD) is made up of one or more FBs:
      FD = FB1 + FB2 + FB3 + ... + FBn;
      surprise = raise_eyebrow + raise_lid + open_mouth;
      worried = (surprise*0.7) + sadness;
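
The FB and FD definitions above can be read as a small algebra over FAP values. The following is a minimal Python sketch of that reading, not the authors' implementation: the class name `FacialBasis`, the FAP indices and all displacement values are placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class FacialBasis:
    """A facial basis (FB): maps an MPEG-4 FAP index to a displacement value."""
    faps: dict

    def __add__(self, other):
        # Combine two FBs; displacements on a shared FAP are summed (assumption).
        merged = dict(self.faps)
        for fap, value in other.faps.items():
            merged[fap] = merged.get(fap, 0.0) + value
        return FacialBasis(merged)

    def __mul__(self, c):
        # Scale every FAP displacement by the constant c.
        return FacialBasis({fap: c * value for fap, value in self.faps.items()})

    __rmul__ = __mul__


# Placeholder FBs: indices and values are illustrative only.
raise_eyebrow = FacialBasis({31: 100.0, 32: 100.0})
raise_lid = FacialBasis({19: -60.0, 20: -60.0})
open_mouth = FacialBasis({3: 200.0})
sadness = FacialBasis({31: 50.0, 32: 50.0, 12: -40.0, 13: -40.0})

# The two facial displays given on the slide.
surprise = raise_eyebrow + raise_lid + open_mouth
worried = (surprise * 0.7) + sadness
```

Summing displacements on shared FAPs is only one possible meaning of `+'; the slides do not say how overlapping FAPs are combined.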

  • Facial Displays

    Probabilistic mapping between the tags and signals:
      E.g.: happy_for = (smile*0.5, 0.3) + (smile*0.25) + (smile*2 + raised_eyebrow, 0.35) + (nothing, 0.1)
    Definition of a function class for addressee association (meaning, signal)
    Class communicative function:
      Certainty
      Adjectival
      Performative
      Affective
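
One way to read the probabilistic mapping is as a weighted choice among alternative signals for the same tag. The sketch below assumes that reading. The slide gives no probability for the `smile*0.25` alternative; the value 0.25 used here is an assumption, chosen only so that the weights sum to 1. The function name `pick_signal` is hypothetical.

```python
import random

# (signal, probability) alternatives for the happy_for tag.
# The 0.25 weight of the second entry is NOT in the source; it is assumed.
HAPPY_FOR = [
    ("smile*0.5", 0.3),
    ("smile*0.25", 0.25),
    ("smile*2 + raised_eyebrow", 0.35),
    ("nothing", 0.1),
]


def pick_signal(alternatives, rng):
    """Draw one signal according to the listed probabilities."""
    signals = [signal for signal, _ in alternatives]
    weights = [weight for _, weight in alternatives]
    return rng.choices(signals, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so that a run can be repeated
chosen = pick_signal(HAPPY_FOR, rng)
```

Each occurrence of the tag would trigger a fresh draw, so the same meaning is not always rendered with the same facial display.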

  • Facial Temporal Course

  • Gestural Lexicon

    Certainty:
      Certain: palm up open hand
      Uncertain: showing empty hands while lowering forearms
    Belief-relation:
      List of items of same class: numbering on fingers
      Temporal relation: fist with extended hand moves back and forth behind one's shoulder
    Turn-taking:
      Hold the floor: raise hand, palm toward hearer
    Performative:
      Assert: horizontal ring
      Reproach: extended index, palm to left, rotating up & down on wrist
    Emphasis: beat

  • Gesture Specification Language

    Scripting language for hand-arm gestures, based on formational parameters [Stokoe]:
      Hand shape specified using HamNoSys [Prillwitz et al.]
      Arm position: concentric squares in front of agent [McNeill]
      Wrist orientation: palm and finger base orientation
    Gestures are defined by a sequence of timed key poses: gesture frames
    Gestures are broken down temporally into distinct (optional) phases:
      Gesture phase: preparation, stroke, hold, retraction
      Change of formational components over time

  • Gesture specification example: Certain

  • Gesture Temporal Course

    rest position → preparation → stroke start → stroke end → retraction → rest position
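
A gesture defined as a sequence of timed key poses with phases can be sketched as a small data structure. This is an illustrative Python sketch under stated assumptions: the class `GestureFrame`, its field names, the rule that each key pose marks the start of its phase, and all timing values are hypothetical, not taken from the gesture specification language itself.

```python
import bisect
from dataclasses import dataclass


@dataclass
class GestureFrame:
    """One timed key pose of a gesture (field names are illustrative)."""
    time: float            # seconds from the start of the gesture
    phase: str             # phase that starts at this key pose
    hand_shape: str        # placeholder for a HamNoSys hand-shape symbol
    palm_orientation: str  # placeholder for wrist orientation


def phase_at(frames, t):
    """Return the phase active at time t, or 'rest' before the first key pose."""
    times = [frame.time for frame in frames]
    index = bisect.bisect_right(times, t) - 1
    return "rest" if index < 0 else frames[index].phase


# Hypothetical timing for one gesture; the hold phase is optional and omitted.
certain = [
    GestureFrame(0.0, "preparation", "open_hand", "palm_up"),
    GestureFrame(0.4, "stroke", "open_hand", "palm_up"),
    GestureFrame(0.7, "retraction", "relaxed", "palm_down"),
    GestureFrame(1.2, "rest", "relaxed", "palm_down"),
]
```

For example, `phase_at(certain, 0.5)` returns `"stroke"`, the phase a planner would align with the stressed word.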

  • ECA architecture

  • ECA Architecture

    Input to the system: APML annotated text
    Output of the system: animation files and a WAV file for the audio
    System:
      Interprets APML tagged dialogs, i.e. all communicative functions
      Looks up in a library the mapping between the meaning (specified by the XML tag) and signals
      Decides which signals to convey on which modalities
      Synchronizes the signals with speech at different levels (word, phoneme or utterance)
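
The first two processing steps (interpret the tagged text, then look up each meaning in a library) can be sketched as follows. This is a hedged Python sketch: the tag and attribute names in the fragment are illustrative and are not guaranteed to match the actual APML definition, and `LEXICON` holds only a few (meaning, signal) pairs copied from the lexicon slides.

```python
import xml.etree.ElementTree as ET

# A few (meaning, signal) pairs taken from the lexicon slides.
LEXICON = {
    ("certainty", "certain"): ["palm up open hand"],
    ("affective", "joy"): ["raising fist up"],
    ("performative", "suggest"): ["small raised eyebrow", "head aside"],
}

# Illustrative APML-like input; the tag names are assumptions.
TAGGED_TEXT = """
<apml>
  <affective type="joy">It is so wonderful to see you again.</affective>
  <certainty type="certain">I was sure we would do so, one day!</certainty>
</apml>
"""


def plan_signals(tagged_text):
    """Return (text span, signals) pairs for every tag found in the lexicon."""
    root = ET.fromstring(tagged_text)
    plan = []
    for element in root.iter():
        key = (element.tag, element.get("type"))
        if key in LEXICON:
            span = "".join(element.itertext()).strip()
            plan.append((span, LEXICON[key]))
    return plan


plan = plan_signals(TAGGED_TEXT)
```

Choosing modalities and synchronizing with speech would operate on the resulting plan; those steps are not sketched here.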

  • Behavioral Engine

  • Modules

    APML Parser: XML parser
    TTS Festival: manages the speech synthesis and provides the list of phonemes and their durations
    Expr2Signal Converter: given a communicative function and its meaning, returns the list of facial signals
    Conflicts Resolver: resolves the conflicts that may arise when more than one facial signal should be activated on the same facial part
    Face Generator: converts the facial signals into MPEG-4 FAP values
    Viseme Generator: converts each phoneme, given by Festival, into a set of FAPs
    MPEG4 FAP Decoder: an MPEG-4 compliant Facial Animation Engine
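
The slides do not say how the Conflicts Resolver chooses between competing signals. As one possible illustration, the Python sketch below keeps the highest-priority signal per facial part; the `priority` field and the rule itself are assumptions, not the method used in the system.

```python
def resolve_conflicts(signals):
    """Keep one signal per facial part: the one with the highest priority.

    Each signal is a dict with 'name', 'part' and 'priority' keys
    (a hypothetical representation). On ties, the earlier signal wins.
    """
    winners = {}
    for signal in signals:
        current = winners.get(signal["part"])
        if current is None or signal["priority"] > current["priority"]:
            winners[signal["part"]] = signal
    return list(winners.values())


requested = [
    {"name": "raised_eyebrow", "part": "eyebrow", "priority": 1},
    {"name": "smile", "part": "mouth", "priority": 1},
    {"name": "small_frown", "part": "eyebrow", "priority": 2},
]
resolved = resolve_conflicts(requested)
```

Here `raised_eyebrow` and `small_frown` compete for the eyebrow region, so only `small_frown` survives alongside `smile`.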

  • TTS Festival

    Drives the synchronization of facial expressions
    Synchronization implemented at word level
    Timing of facial expression connected to the text embedded between the markers
    Use of the tree structure of Festival to compute the duration of expressions

  • Expr2Signal Converter

    Instantiation of APML tags: meaning of a given communicative function
    Converts markers into facial signals
    Use of a library containing the lexicon of the type (meaning, facial expressions)

  • Gaze Model

    Based on the communicative functions model of Isabella Poggi
    This model predicts what the value of gaze should be in order to convey a given meaning in a given conversational context.
    For example: if the agent wants to emphasize a given word, the model will output that the agent should gaze at her conversant.

  • Gaze Model

    Very deterministic behavior model: every Communicative Function associated with a meaning corresponds to the same signal (with probabilistic changes)
    Event-driven model:
      only when a Communicative Function is specified are the associated signals computed
      only when a Communicative Function is specified may the corresponding behavior vary

  • Gaze Model

    Several drawbacks, as there is no temporal consideration:
      No consideration of past and current gaze behavior to compute the new one
      No consideration of how long the current gaze state of S and L has lasted

  • Gaze Algorithm

    Two steps:
      Communicative prediction: apply the communicative function model to compute the gaze behavior that conveys a given meaning for S and L
      Statistical prediction: the communicative gaze model is probabilistically modified by a statistical model defined with constraints:
        what the communicative gaze behavior of S and L is
        which gaze behavior S and L were in
        the duration of the current state of S and L

  • Temporal Gaze Parameters

    Gaze behaviors depend on the communicative functions, the general purpose of the conversation (persuasive discourse, teaching...), personality, cultural roots, social relations...
    A full model would be far too complex; instead, propose parameters that control the overall gaze behavior:
      $T^{\max}_{S=1,L=1}$: maximum duration the mutual gaze state may remain active.
      $T^{\max}_{S=1}$: maximum duration of gaze state S=1.
      $T^{\max}_{L=1}$: maximum duration of gaze state L=1.
      $T^{\max}_{S=0}$: maximum duration of gaze state S=0.
      $T^{\max}_{L=0}$: maximum duration of gaze state L=0.
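
The five maximum-duration parameters can be illustrated with a simplified, deterministic sketch. It reads S and L as speaker and listener and state 1 as "looking at the other", which are assumptions; it also replaces the statistical model with hard limits, and the rule that the speaker is the one who breaks mutual gaze is invented for the example. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GazeLimits:
    """Maximum durations in seconds (values used below are placeholders)."""
    max_mutual: float  # mutual gaze, S=1 and L=1
    max_s1: float
    max_l1: float
    max_s0: float
    max_l0: float


def constrain(proposed, current, elapsed, max_on, max_off):
    """Force a change when one agent has stayed in its state too long."""
    limit = max_on if current == 1 else max_off
    if proposed == current and elapsed >= limit:
        return 1 - current
    return proposed


def next_gaze(proposed, current, elapsed, limits):
    """proposed/current are (S, L) pairs; elapsed has 'S', 'L', 'mutual' keys."""
    s = constrain(proposed[0], current[0], elapsed["S"], limits.max_s1, limits.max_s0)
    l = constrain(proposed[1], current[1], elapsed["L"], limits.max_l1, limits.max_l0)
    if (s, l) == (1, 1) and current == (1, 1) and elapsed["mutual"] >= limits.max_mutual:
        s = 0  # assumption: the speaker breaks mutual gaze
    return (s, l)


limits = GazeLimits(max_mutual=3.0, max_s1=5.0, max_l1=8.0, max_s0=2.0, max_l0=2.0)
```

The `proposed` pair stands for the output of the communicative prediction step; the limits then play the role of the temporal constraints.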

  • Mutual Gaze

  • Gaze Aversion

  • Gesture Planner

    Adaptive instantiation:
      Preparation and retraction phase adjustments
      Transition key and rest gesture insertion
    Joint-chain follow-through:
      Forward time shifting of children joints in time
    Stroke of gesture on stressed word
    Stroke expansion:
      During the planning phase, identify rheme clauses with closely repeated emphases/pitch accents
      Indicate secondary accents by repeating the stroke of the primary gesture with decreasing amplitude

  • Gesture Planner

    Determination of gesture:
      Look in dictionary
    Selection of gesture:
      Gestures associated with the most embedded tags have priority (except beat): adjectival, deictic
    Duration of gesture:
      Coarticulation between successive gestures close in time
      Hold for gestures belonging to tags higher up in the hierarchy (e.g. performative, belief-relation)
      Otherwise go to rest position

  • Behavior Expressivity

    Behavior is related to (Wallbott, 1998):
      the quality of the mental state (e.g. emotion) it refers to
      its quantity (somehow linked to the intensity factor of the mental state)
    Behaviors encode:
      content information (the "what" is communicated)
      expressive information (the "how" it is communicated)
    Behavior expressivity refers to the manner of execution of the behavior

  • Expressivity Dimensions

    Spatial: amplitude of movement
    Temporal: duration of movement
    Power: dynamic property of movement
    Fluidity: smoothness and continuity of movement
    Repetitiveness: tendency to rhythmic repeats
    Overall Activation: quantity of movement across modalities

  • Overall Activation

    Threshold filter on atomic behaviors during APML tag matching
    Determines the number of nonverbal signals to be executed.
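
A threshold filter of this kind can be sketched in a few lines. The Python sketch below assumes that every candidate atomic behavior carries its own activation threshold and is executed only when the overall activation reaches it; the thresholds and the behavior names are invented for illustration.

```python
def filter_behaviors(candidates, overall_activation):
    """Keep the behaviors whose threshold is reached by the activation level.

    candidates: list of (behavior name, threshold) pairs (hypothetical values).
    """
    return [name for name, threshold in candidates if overall_activation >= threshold]


candidates = [("head_nod", 0.2), ("raised_eyebrows", 0.5), ("beat", 0.8)]
quiet = filter_behaviors(candidates, 0.3)
lively = filter_behaviors(candidates, 1.0)
```

Under these assumed thresholds a low activation lets only the head nod through, while the maximum value executes all three signals.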

  • Spatial Parameter

    Amplitude of movement controlled through asymmetric scaling of the reach space that is used to find IK goal positions
    Expands or condenses the entire space in front of the agent
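
Scaling the reach space can be pictured as moving each IK goal towards or away from a neutral centre, with a different gain per axis. This Python sketch is an assumption-laden illustration: the centre point, the per-axis gains and the [-1, 1] range of the spatial parameter are not given in the slides.

```python
def scale_goal(goal, centre, spatial, gains=(0.5, 0.3, 0.2)):
    """Scale an IK goal position about a centre point.

    spatial: assumed to lie in [-1, 1]; 0 leaves the goal unchanged.
    gains: hypothetical per-axis factors that make the scaling asymmetric.
    """
    return tuple(
        c + (g - c) * (1.0 + spatial * k)
        for g, c, k in zip(goal, centre, gains)
    )


centre = (0.0, 0.0, 30.0)      # placeholder neutral point in front of the agent [cm]
wrist_goal = (40.0, 20.0, 30.0)
expanded = scale_goal(wrist_goal, centre, spatial=1.0)
condensed = scale_goal(wrist_goal, centre, spatial=-1.0)
```

Positive values push goals outward (wider gestures) and negative values pull them in, more strongly along the axes with larger gains.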

  • Temporal Parameter

    Determines the speed of the arm movement of a gesture's meaning-carrying stroke phase
    Modifies the speed of the stroke
    [Figure: stroke shift / velocity control of a beat gesture; Y position of wrist w.r.t. shoulder [cm] against frame #]

  • Fluidity

    Continuity control of TCB interpolation splines and gesture-to-gesture coarticulation
    Continuity of the arms' trajectory paths
    Controls the velocity profiles of an action
    [Figure: X position of wrist w.r.t. shoulder [cm] against frame #]

  • Power

    Tension and Bias control of TCB splines; overshoot reduction
    Acceleration and deceleration of limbs
    Hand shape control for gestures that do not need hand configuration to convey their meaning (beats).
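
TCB splines (also called Kochanek-Bartels splines) expose exactly the three knobs named on the Fluidity and Power slides. The Python sketch below computes the incoming and outgoing tangents at one key pose using the standard Kochanek-Bartels formulas; how the expressivity parameters are mapped onto tension, continuity and bias is not specified in the slides and is not modelled here.

```python
def tcb_tangents(p_prev, p, p_next, tension=0.0, continuity=0.0, bias=0.0):
    """Incoming and outgoing tangents at key value p (one coordinate).

    With tension = continuity = bias = 0 this reduces to a Catmull-Rom tangent.
    """
    t, c, b = tension, continuity, bias
    d_in, d_out = p - p_prev, p_next - p
    incoming = 0.5 * (1 - t) * ((1 + b) * (1 - c) * d_in + (1 - b) * (1 + c) * d_out)
    outgoing = 0.5 * (1 - t) * ((1 + b) * (1 + c) * d_in + (1 - b) * (1 - c) * d_out)
    return incoming, outgoing


smooth = tcb_tangents(0.0, 1.0, 4.0)                   # equal tangents
tight = tcb_tangents(0.0, 1.0, 4.0, tension=1.0)       # zero tangents
corner = tcb_tangents(0.0, 1.0, 4.0, continuity=-1.0)  # tangents differ
```

Lowering continuity makes the two tangents diverge, which breaks the smoothness of the wrist path at the key pose; raising tension shortens both tangents and tightens the curve.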

  • Repetitivity

    Technique of stroke expansion: consecutive emphases are realized gesturally by repeating the stroke of the first gesture.
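
Stroke expansion, as described here and on the Gesture Planner slide (repeated strokes with decreasing amplitude), can be sketched as a simple geometric decay. The decay factor is an assumption; the slides only state that the amplitude decreases.

```python
def expand_stroke(amplitude, n_emphases, decay=0.5):
    """Amplitudes of the repeated strokes for n consecutive emphases.

    decay: hypothetical factor applied at each repetition (0 < decay < 1).
    """
    return [amplitude * decay ** k for k in range(n_emphases)]


strokes = expand_stroke(1.0, 3)
```

With three consecutive emphases and a decay of 0.5, the primary stroke keeps its full amplitude and the two repetitions are played at a half and a quarter of it.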

  • Multiple Modality Ex: Abrupt

    Overall Activity = 0.6
    Spatial = 0
    Temporal = 1
    Fluidity = -1
    Power = 1
    Repetition = -1

  • Multiple Modality Ex: Vigorous

    Overall Activity = 1
    Spatial = 1
    Temporal = 1
    Fluidity = 1
    Power = 0
    Repetition = 1
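
The two example settings can be held in a single record type, which makes it easy to see which dimensions separate them. This is a small Python sketch; the class and field names are hypothetical, and only the numeric values come from the two slides above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Expressivity:
    """The six expressivity dimensions as one profile."""
    overall_activity: float
    spatial: float
    temporal: float
    fluidity: float
    power: float
    repetition: float


# Values copied from the "Abrupt" and "Vigorous" example slides.
ABRUPT = Expressivity(0.6, 0, 1, -1, 1, -1)
VIGOROUS = Expressivity(1, 1, 1, 1, 0, 1)


def differing_dimensions(a, b):
    """Names of the dimensions on which two profiles differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


difference = differing_dimensions(ABRUPT, VIGOROUS)
```

The two profiles share only the temporal value; every other dimension differs.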

  • Evaluation of Expressive Gesture

    (H1) The chosen implementation for mapping single dimensions of expressivity onto animation parameters is appropriate: a change in a single dimension can be recognized and correctly attributed by users.

    (H2) Combining parameters in such a way that they reflect a given communicative intent will result in a more believable overall impression of the agent.

    106 subjects from 17 to 26 years old

  • Perceptual Test Studies

    Evaluation of the adequacy of the implementation of each parameter:
      check whether subjects could perceive and distinguish the six different expressivity parameters and indicate their direction of change.
      Result: good recognition for spatial and temporal parameters; lower recognition for fluidity and power parameters, as they are inter-dependent.
    Evaluation task: does setting appropriate values for the expressivity parameters create behaviors that are judged as exhibiting the corresponding expressivity?
      3 different types of behaviors: abrupt, sluggish, vigorous
      Result: users prefer the coherent performance for vigorous and abrupt

  • Interaction

    Interaction: two or more parties exchange messages.
    Interaction is by no means a one-way communication channel between parties.
    Within an interaction, parties take turns in playing the roles of the speaker and of the addressee.

  • Interaction

    Speaker and addressee adapt their behaviors to each other
    Speaker monitors addressee's attention and interest in what he has to say
    Addressee selects feedback behaviors to show the speaker that he is paying attention

  • Interaction

    Speaker:
      Pointless for a speaker to engage in an act of communication if the addressee does not pay or intend to pay attention
      Important for the speaker to assess the addressee's engagement:
        when starting an interaction: assess the possibility of engagement in interaction (establish phase)
        when the interaction is going on: check that engagement is lasting and sustaining the conversation (maintain phase)

  • Interaction

    Addressee:
      attention: pay attention to the signals produced by the speaker in order to perceive, process and memorize them
      perception: of signals
      comprehension: understand the meaning attached to signals
      internal reaction: the comprehension of the meaning may create a cognitive and emotional reaction
      decision: whether or not to communicate the internal reaction
      generation: display behaviors

  • Backchannel

    Types of backchannels (I. Poggi):
      attention
      comprehension
      belief
      interest
      agreement
      positive/negative
      any combination of the above: pay attention but not understand; understand but not believe, etc.

  • Backchannel

    Depending on the type of speech act it responds to, a signal will be interpreted as a backchannel or not.
      backchannel: a signal of agreement / disagreement that follows the expression of opinions, evaluations, planning
      not a backchannel: a signal of comprehension / incomprehension after an explicit question "Did you understand?"

  • Backchannel

    Polysemy of backchannel signals: a signal may provide different types of information
      a frown: negative feedback for understanding, believing and agreeing

  • Backchannel signals of gaze

    Gaze:
      shows direction of attention
      informs on level of engagement or on intention to maintain engagement
      indicates degree of intimacy
    But also:
      monitor the gaze behavior of others to establish their intention to engage or remain engaged
      a shared attention situation involves mutual gaze between partners or joint gaze at the same object

  • Backchannel modelling

    Reactive model:
      generates an instinctive feedback without reasoning
      simple backchannel or mimicry
      spontaneous, sincere
    Cognitive model:
      conscious decision to provide a backchannel to provoke a particular effect on the speaker or to reach a specific goal
      deliberate, possibly pretended
      it can be shifted to automatic (e.g. when listening to a bore)

  • Backchannel Demo

  • A reactive backchannel

    Currently, our model is reactive in nature
    Dependent on perception:
      Speaker interprets addressee's behavior
      Speaker generates or alters its own behavior
    Our focus: interest and attention on a signal level (not on a cognitive level)

  • Organization of the communication: Attraction of attention

    Communicative agents: the agents provide information to the user, and should guarantee that the user pays attention
    Animation expressivity: principle of staging, so that a single idea is clearly expressed at each instant of time
    Animation specificity: animators' creativity, no realistic constraints for animators
    What types of gesture properties could guarantee users' attention?
    France Telecom

  • Organization of the communication: Attraction of attention

    Corpus: videos from traditional animation that illustrate different types of conversational interaction
    The modulations of gesture expressivity over time play a role in managing communication, thus serving as a pragmatic tool

  • Emotion

    elicited by the evaluation of events, objects, actions
    integration of emotions in a dialog system (Artimis, FT)
    identify under which circumstances a dialog agent should express emotions

  • Emotion

    BDI representation based on the OCC model
    Appraisal variables [Ortony et al. 1988]:
      Desirability/Undesirability: achievement of, or threat to, the agent's choice
      Degree of realization: degree of certainty of the choice's achievement
      Probability of an event: probability of feasibility of an event
      Agency: the agent who is the actor of the event

  • Emotion

    Complex emotions:
      superposition of 2 emotions: evaluation of an event can happen under different angles
      masking of an emotion by another one: consideration of social context

    joy + deception = masking

  • Video

    Masking of Deception by Joy

  • Conclusion

    Creation of a virtual agent able to:
      communicate nonverbally
      show emotions
      use expressive gestures
      perceive and be attentive
      maintain the attention
    Two studies on expressivity:
      from manual annotation of video corpus
      from mimicry of movement analysis

    I investigate how algorithms can control the behavior of characters in virtual worlds, endowing them with an ability to plan, engage in social interactions (with humans or other virtual characters), and exhibit some measure of personality and emotion. A goal of the research is to create a general and domain-independent architecture that supports human behavioral modeling and which can then be "embodied" in a variety of virtual environments. Emotions have a pervasive impact over our lives. They influence how we perceive the world, how we make decisions, and play a key role in social communication. In the context of virtual training environments, emotions also play a key role in the believability of the simulation and the extent to which a student will feel immersed in the experience. In the Emotion Project, we develop models that allow synthetic characters to derive an emotional response to events in the world and respond with behaviors consistent with that emotional state. Unlike work in "believable agents" for entertainment, the focus is more constrained (modeling typical human behavior); there is much greater need for generality and much less tolerance for domain-specific knowledge.