Multimodal Expressive Embodied Conversational Agents


  • Multimodal Expressive Embodied Conversational Agents

    Université Paris 8

    Catherine Pelachaud
    Elisabetta Bevacqua
    Nicolas Ech Chafai, FT
    Maurizio Mancini
    Magalie Ochs, FT
    Christopher Peters
    Radek Niewiadomski

  • ECA Capabilities

    Anthropomorphic autonomous figures
    A new form of human-machine interaction
    Study of human communication, human-human interaction
    ECAs ought to be endowed with dialogic and expressive capabilities
    Perception: an ECA must be able to pay attention to and perceive the user and the context she is placed in

  • ECA Capabilities

    Interaction:
    speaker and addressee emit signals
    speaker perceives feedback from the addressee
    speaker may decide to adapt to the addressee's feedback
    consider the social context
    Generation: expressive, synchronized visual and acoustic behaviours:
    words, voice, intonation
    gaze, facial expression, gesture
    body movements, body posture

  • Synchrony Tool - BEAT (Cassell et al., MIT Media Lab)

    Decomposition of text into theme and rheme
    Linked to WordNet
    Computation of: intonation, gaze, gesture

  • Virtual Training Environments: MRE (J. Gratch, L. Johnson, S. Marsella, USC)

  • Interactive System

    Real estate agent
    Gesture synchronized with speech and intonation
    Small talk
    Dialog partner

  • MAX (S. Kopp, U. of Bielefeld)

    Gesture understanding and imitation

  • Gilbert and George at the Bank (UPenn, 1994)

  • Greta

  • Problem to Be Solved

    Human communication is endowed with three devices to express communicative intention:
    verbs and formulas
    intonation and paralinguistics
    facial expression, gaze, gesture, body movement, posture
    Problem: for any communicative act, the speaker has to decide:
    which nonverbal behaviors to show
    how to execute them

  • Verbal and Nonverbal Communication

    Suppose I want to advise a friend to take her umbrella because it is raining. Which signals do I use?
    Verbal signal: use of a syntactically complex sentence: "Take your umbrella because it is raining"
    Verbal + nonverbal signals: "Take your umbrella" + pointing to the window to show the rain by a gesture or by gaze

  • Multimodal Signals

    The whole body communicates by using:
    verbal acts (words and sentences)
    prosody, intonation (nonverbal vocal signals)
    gesture (hand and arm movements)
    facial action (smile, frown)
    gaze (eye and head movements)
    body orientation and posture (trunk and leg movements)
    All these systems of signals have to cooperate in expressing the overall meaning of the communicative act.

  • Multimodal Signals

    Accompany the flow of speech
    Synchronized at the verbal level
    Punctuate accented phonemic segments and pauses
    Substitute for word(s)
    Emphasize what is being said
    Regulate the exchange of speaking turns

  • Synchronization

    There exists an isomorphism between patterns of speech, intonation and facial actions
    Different levels of synchrony:
    phoneme level (blink)
    word level (eyebrow)
    phrase level (hand gesture)
    Interactional synchrony: synchrony between speaker and addressee

  • Taxonomy of Communicative Functions (I. Poggi)

    The speaker may provide three broad types of information:
    Information about the world: deictic, iconic (adjectival)
    Information about the speaker's mind:
    belief (certainty, adjectival)
    goal (performative, rheme/theme, turn-system, belief relation)
    emotion
    meta-cognitive
    Information about the speaker's identity (sex, culture, age)

  • Multimodal Signals (Isabella Poggi)

    Multimodal signals are characterized by their placement with respect to the linguistic utterance and their significance in transmitting information. E.g.:
    a raised eyebrow may signal surprise, emphasis, a question mark, a suggestion
    a smile may express happiness, be a polite greeting, or be a backchannel signal
    Two pieces of information are needed to characterize a multimodal signal:
    its meaning
    its visual action

  • Lexicon = (meaning, signal)

    Expression meaning:
    deictic: this, that, here, there
    adjectival: small, difficult
    certainty: certain, uncertain
    performative: greet, request
    topic comment: emphasis
    belief relation: contrast
    turn allocation: take/give turn
    affective: anger, fear, happy-for, sorry-for, envy, relief, …

    Expression signal:
    deictic: gaze direction
    certainty: certain: palm-up open hand; uncertain: raised eyebrow
    adjectival: small: small eye aperture
    belief relation: contrast: raised eyebrow
    performative: suggest: small raised eyebrow, head aside; assert: horizontal ring
    emotion: sorry-for: head aside, inner eyebrow up; joy: raising fist up
    emphasis: raised eyebrows, head nod, beat
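A lexicon of (meaning, signal) pairs like the one above can be sketched as a lookup table. The entries below follow the slide; the key structure and identifier names are assumptions for illustration, not the authors' actual data format.

```python
# Toy (meaning, signal) lexicon: keys are (function, meaning) pairs,
# values are lists of signal descriptions. Entries mirror the slide.
lexicon = {
    ("certainty", "certain"):        ["palm-up open hand"],
    ("certainty", "uncertain"):      ["raised eyebrow"],
    ("adjectival", "small"):         ["small eye aperture"],
    ("belief relation", "contrast"): ["raised eyebrow"],
    ("performative", "suggest"):     ["small raised eyebrow", "head aside"],
    ("performative", "assert"):      ["horizontal ring"],
    ("affective", "sorry-for"):      ["head aside", "inner eyebrow up"],
    ("affective", "joy"):            ["raising fist up"],
    ("topic comment", "emphasis"):   ["raised eyebrows", "head nod", "beat"],
}

def signals_for(function, meaning):
    # Unknown meanings map to no signals rather than raising an error.
    return lexicon.get((function, meaning), [])
```

A generation module can then ask for `signals_for("certainty", "uncertain")` and receive the behaviors to schedule.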

  • Representation Language

    The Affective Presentation Markup Language (APML) describes communicative functions; it works at the meaning level, not at the signal level

    Good morning, Angela. It is so wonderful to see you again. I was sure we would do so, one day!
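The utterance above is the sample text of the APML-annotated dialog with its tags stripped. As a sketch of how such markup can be interpreted, the snippet below wraps the sentences in APML-style tags and extracts (function, meaning, text) triples; the exact tag and attribute names are illustrative assumptions, not the official APML DTD.

```python
# Extract communicative functions from an APML-like annotated utterance.
# Tag names (performative, affective, certainty) and the "type" attribute
# are assumptions modeled on the lexicon described in the slides.
import xml.etree.ElementTree as ET

apml_text = """
<apml>
  <performative type="greet">Good morning, Angela.</performative>
  <affective type="joy">It is so wonderful to see you again.</affective>
  <certainty type="certain">I was sure we would do so, one day!</certainty>
</apml>
"""

def extract_functions(doc):
    """Return (function, meaning, text) triples from the markup."""
    root = ET.fromstring(doc)
    return [(child.tag, child.get("type"), child.text.strip())
            for child in root]

triples = extract_functions(apml_text)
```

Each triple can then be looked up in the (meaning, signal) lexicon to choose nonverbal behaviors.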

  • Facial Description Language

    Facial expressions are defined as (meaning, signal) pairs stored in a library
    Hierarchical set of classes:
    Facial basis (FB) class: basic facial movement
    An FB may be represented as a set of MPEG-4 compliant FAPs or, recursively, as a combination of other FBs using the '+' operator:
    FB = {fap3=v1, …, fap69=vk};
    FB' = c1*FB1 + c2*FB2;
    where c1 and c2 are constants and FB1 and FB2 can be previously defined FBs or FBs of the form {fap3=v1, …, fap69=vk}

  • Facial Basis Class

    Examples of facial basis classes:
    eyebrow: small_frown, left_raise, right_raise
    eyelid: upper_lid_raise
    mouth: left_corner_stretch, left_corner_raise

  • Facial Displays

    Every facial display (FD) is made up of one or more FBs:
    FD = FB1 + FB2 + FB3 + … + FBn;
    surprise = raise_eyebrow + raise_lid + open_mouth;
    worried = (surprise*0.7) + sadness;
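The FB algebra above (sets of FAP values combined with `+` and scaled with `*`) can be sketched directly as operator overloading. The FAP numbers and values below are placeholders, not real MPEG-4 expression data.

```python
# Minimal sketch of the facial-basis algebra: an FB holds MPEG-4 FAP
# values; '+' merges two FBs by summing values, '*' scales intensity.

class FB:
    def __init__(self, faps):
        self.faps = dict(faps)              # e.g. {31: 100} = FAP 31 at 100

    def __add__(self, other):               # FB1 + FB2: sum FAP values
        out = dict(self.faps)
        for fap, v in other.faps.items():
            out[fap] = out.get(fap, 0) + v
        return FB(out)

    def __mul__(self, c):                   # FB * c: scale all FAP values
        return FB({fap: v * c for fap, v in self.faps.items()})
    __rmul__ = __mul__

# Placeholder basic movements (FAP ids and amplitudes are illustrative)
raise_eyebrow = FB({31: 100, 32: 100})
raise_lid     = FB({19: 80, 20: 80})
open_mouth    = FB({3: 150})
sadness       = FB({31: -60, 3: 20})

# The compound displays from the slide:
surprise = raise_eyebrow + raise_lid + open_mouth
worried  = surprise * 0.7 + sadness
```

This mirrors the slide's `worried=(surprise*0.7)+sadness` definition: scaling weakens the surprise component before blending it with sadness.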

  • Facial Displays

    Probabilistic mapping between the tags and signals, e.g.:
    happy_for = (smile*0.5, 0.3) + (smile*0.25) + (smile*2 + raised_eyebrow, 0.35) + (nothing, 0.1)
    Definition of a function class for the (meaning, signal) association
    Class communicative function:
    certainty
    adjectival
    performative
    affective
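The probabilistic mapping can be sketched as a weighted draw over alternative signal combinations. The slide omits the weight of the second alternative, so the weights below are illustrative, not the authors' exact values.

```python
# Draw one signal variant for a meaning according to the weights of its
# alternatives, loosely following the happy_for example above.
import random

happy_for = [
    ("smile*0.5",                0.3),
    ("smile*0.25",               0.25),  # assumed weight (missing in slide)
    ("smile*2 + raised_eyebrow", 0.35),
    ("nothing",                  0.1),
]

def draw_signal(alternatives, rng=random):
    signals, weights = zip(*alternatives)
    return rng.choices(signals, weights=weights, k=1)[0]

chosen = draw_signal(happy_for, random.Random(42))
```

With such a mapping the agent does not always show the same expression for the same meaning, which avoids robotic repetition.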

  • Facial Temporal Course

  • Gestural Lexicon

    Certainty:
    certain: palm-up open hand
    uncertain: showing empty hands while lowering the forearms
    Belief relation:
    list of items of the same class: numbering on fingers
    temporal relation: fist with extended hand moves back and forth behind one's shoulder
    Turn-taking:
    hold the floor: raise hand, palm toward hearer
    Performative:
    assert: horizontal ring
    reproach: extended index, palm to the left, rotating up and down on the wrist
    Emphasis: beat

  • Gesture Specification Language

    Scripting language for hand-arm gestures, based on formational parameters [Stokoe]:
    hand shape specified using HamNoSys [Prillwitz et al.]
    arm position: concentric squares in front of the agent [McNeill]
    wrist orientation: palm and finger-base orientation
    Gestures are defined by a sequence of timed key poses: gesture frames
    Gestures are broken down temporally into distinct (optional) phases:
    gesture phases: preparation, stroke, hold, retraction
    change of formational components over time
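A gesture in this style can be sketched as a list of timed key poses over the formational parameters, each tagged with a phase. All field values below are illustrative placeholders, not the actual "certain" gesture definition.

```python
# A gesture frame records the formational parameters at one key pose;
# a gesture is an ordered sequence of frames covering its phases.
from dataclasses import dataclass

@dataclass
class GestureFrame:
    time: float         # seconds from gesture start
    phase: str          # preparation | stroke | hold | retraction
    hand_shape: str     # HamNoSys-style hand-shape label (placeholder)
    arm_position: str   # sector of the concentric-squares space (McNeill)
    palm: str           # wrist orientation: palm direction

certain = [
    GestureFrame(0.0, "preparation", "flat_hand", "center_center", "up"),
    GestureFrame(0.4, "stroke",      "flat_hand", "center_center", "up"),
    GestureFrame(0.7, "hold",        "flat_hand", "center_center", "up"),
    GestureFrame(1.0, "retraction",  "flat_hand", "lower_center",  "down"),
]

phases = [f.phase for f in certain]
```

An animation engine would interpolate the formational components between consecutive frames, which is the "change of formational components over time" the slide refers to.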

  • Gesture specification example: Certain

  • Gesture Temporal Course

    rest position, preparation, stroke start, stroke end, retraction, rest position

  • ECA architecture

  • ECA Architecture

    Input to the system: APML-annotated text
    Output of the system: animation files and a WAV file for the audio
    The system:
    interprets APML-tagged dialogs, i.e. all communicative functions
    looks up in a library the mapping between the meaning (specified by the XML tag) and signals
    decides which signals to convey on which modalities
    synchronizes the signals with speech at different levels (word, phoneme or utterance)

  • Behavioral Engine

  • Modules

    APML Parser: XML parser
    TTS Festival: manages the speech synthesis and gives the list of phonemes and their durations
    Expr2Signal Converter: given a communicative function and its meaning, returns the list of facial signals
    Conflicts Resolver: resolves the conflicts that may happen when more than one facial signal should be activated on the same facial parts
    Face Generator: converts the facial signals into MPEG-4 FAP values
    Viseme Generator: converts each phoneme, given by Festival, into a set of FAPs
    MPEG-4 FAP Decoder: an MPEG-4 compliant facial animation engine
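The middle of this chain (Expr2Signal Converter feeding the Conflicts Resolver) can be sketched as below. The lexicon entries and the priority rule (later signals override earlier ones on the same facial part) are assumptions; the slides do not specify the resolution strategy.

```python
# Toy sketch of two modules: look up facial signals for communicative
# meanings, then keep a single action per facial part.

def expr2signal(function, meaning, lexicon):
    # Expr2Signal Converter: (function, meaning) -> list of (part, action)
    return lexicon.get((function, meaning), [])

def resolve_conflicts(signals):
    # Conflicts Resolver: one action per facial part (last one wins;
    # the real resolver's priority scheme is not given in the slides)
    by_part = {}
    for part, action in signals:
        by_part[part] = action
    return sorted(by_part.items())

toy_lexicon = {
    ("affective", "joy"):        [("mouth", "smile"), ("eyebrow", "raise")],
    ("performative", "suggest"): [("eyebrow", "small_raise"), ("head", "aside")],
}

signals = (expr2signal("affective", "joy", toy_lexicon)
           + expr2signal("performative", "suggest", toy_lexicon))
resolved = resolve_conflicts(signals)
```

Here both meanings want the eyebrows, so the resolver keeps only one eyebrow action instead of activating two at once.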

  • TTS Festival

    Drives the synchronization of facial expressions
    Synchronization implemented at the word level
    Timing of facial expressions connected to the text embedded between the markers
    Uses the tree structure of Festival to compute expression durations

  • Expr2Signal Converter

    Instantiation of APML tags: meaning of a given communicative function
    Converts markers into facial signals
    Uses a library containing the lexicon of the type (meaning, facial expressions)

  • Gaze Model

    Based on the communicative function model of Isabella Poggi
    The model predicts the value gaze should take in order to convey a given meaning in a given conversational context. For example: if the agent wants to emphasize a given word, the model will output that the agent should gaze at her conversant.

  • Gaze Model

    Fairly deterministic behavior model: to every communicative function associated with a meaning corresponds the same signal (with probabilistic variations)
    Event-driven model: the associated signals are computed only when a communicative function is specified, so the corresponding behavior may vary

  • Gaze Model

    Several drawbacks, as there is no temporal consideration:
    no consideration of the past and current gaze behavior to compute the new one
    no consideration of how long the current gaze state of speaker (S) and listener (L) has lasted

  • Gaze Algorithm

    Two steps:
    Communicative prediction: apply the communicative function model to compute the gaze behavior conveying a given meaning for S and L
    Statistical prediction: the communicative gaze model is probabilistically modified by a statistical model defined w
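The two steps above can be sketched as a deterministic communicative prediction followed by a probabilistic perturbation. The gaze states and the override probability are assumptions; the slides do not give the statistical model's parameters.

```python
# Two-step gaze sketch: step 1 fixes the gaze required by the meaning,
# step 2 statistically keeps or overrides that prediction.
import random

def communicative_gaze(function, meaning):
    # Step 1: e.g. emphasizing a word -> gaze at the conversant
    if (function, meaning) == ("topic-comment", "emphasis"):
        return "at_conversant"
    return "away"

def statistical_gaze(predicted, p_keep=0.8, rng=random):
    # Step 2: keep the communicative prediction with probability p_keep
    # (p_keep is an assumed value, not taken from the slides)
    if rng.random() < p_keep:
        return predicted
    return "away" if predicted == "at_conversant" else "at_conversant"

gaze = statistical_gaze(communicative_gaze("topic-comment", "emphasis"),
                        rng=random.Random(0))
```

The statistical step is what keeps the gaze from being fully deterministic, addressing the drawback noted on the previous slide.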
