
A Formal Semantic Analysis of Gesture

Alex Lascarides
School of Informatics, University of Edinburgh

Matthew Stone
Computer Science, Rutgers University

Abstract

The gestures that speakers use in tandem with speech include not only conventionalised actions with identifiable meanings (so-called narrow gloss gestures or emblems) but also productive iconic and deictic gestures whose form and meanings seem largely improvised in context. In this paper, we bridge the descriptive tradition with formal models of reference and discourse structure so as to articulate an approach to the interpretation of these productive gestures. Our model captures gestures' partial and incomplete meanings as derived from form, and accounts for the more specific interpretations they derive in context. Our work emphasises the commonality of the pragmatic mechanisms for interpreting both language and gesture, and the place of formal methods in discovering the principles and knowledge that those mechanisms rely on.

1 Introduction

Face-to-face dialogue is the primary setting for language use, and there is increasing evidence that theories of semantics and pragmatics are best formulated directly for dialogue. For example, many accounts of semantic content see extended patterns of interaction rather than individual sentences as primary (Kamp, 1981, Asher and Lascarides, 2003, Ginzburg and Cooper, 2004, Cumming, 2007). Likewise, many pragmatic theories derive their principles from cognitive models of interlocutors who must coordinate their interactions while advancing their own interests (Lewis, 1969, Grice, 1975, Clark, 1996, Asher and Lascarides, 2003, Benz, Jäger and van Rooij, 2005). But face-to-face dialogue is not just words. Speakers can use facial expressions, eye gaze, hand and arm movements and body posture intentionally to convey meaning; see e.g. (McNeill, 1992). This raises the challenge of fitting a much broader range of behaviours into formal semantic and pragmatic models. We take up this challenge in this paper.

We focus on a broad class of improvised, coverbal, communicative actions, which seem both particularly important and particularly challenging for models of meaning in face-to-face dialogue. We distinguish communicative actions from other behaviours that people do in conversation, such as practical actions and incidental 'nervous' movements, following a long descriptive tradition (Goffman, 1963, Ekman and Friesen, 1969, Kendon, 2004). This allows us to focus on a core set of behaviours—which we call gestures following Kendon (2004)—that untrained viewers are sensitive to (Kendon, 1978), that linguists can reliably annotate (Carletta, 2007), and that any interpretive theory must account for. Gestures have what Kendon calls "features of manifest deliberate expressiveness" (Kendon, 2004, p. 15), including the kinematic profile of the movement, as an excursion from and back to a rest position; its dynamics, including pronounced onset and release; and the attention and treatment interlocutors afford it.

Coverbal gestures are those that are performed in synchrony with simultaneous speech. Gestures can also be performed without speech (see (Kendon, 2004, Ch. 14) for examples), in the pauses between spoken phrases (see (Engle, 2000, Ch. 3) for examples), or over extended spans that include both speech and silence (see Oviatt, DeAngeli and Kuhn (1997) for examples). Coverbal gestures show a fine-grained alignment with the prosodic structure of speech (Kendon (1972); (Kendon, 2004, Ch. 7)). The gesture typically begins with a preparatory phase where the agent moves the hands into position for the gesture. It continues with a stroke (which can involve motion or not), which is that part of the gesture that is designed to convey meaning—we focus in this paper on interpreting strokes. Finally, it can conclude with a post-stroke phase where the hands retract to rest. Speakers coordinate gestures with speech so that the phases of gesture performance align with intonational phrases in speech and so that strokes in particular are performed in time with nuclear accents in speech. This coordination may involve brief pauses in one or the other modality, orchestrated to maintain synchrony between temporally extended behaviours (Kendon, 2004, Ch. 7). The active alignment between speech and gesture is indicative of the close semantic and pragmatic relationship between them.

Finally, we contrast improvised gestures both with other gestures whose content is emblematic and completely conventionalised, such as the 'thumbs up' gesture, and with beat gestures, which merely emphasise important moments in the delivery of an utterance. Improvised gestures may involve deixis, where an agent designates a real, virtual or abstract object, and iconicity, where the gesture's form or manner of execution mimics its content. Deixis and iconicity sometimes involve the creative introduction of correspondences between the body and depicted space. Nevertheless, as Kendon's (2004) fieldwork shows, even in deixis and iconic representation, speakers recruit specific features of form consistently to convey specific kinds of content. The partial conventionalisation involved in these correspondences is revealed not only in consistent patterns of use by individual speakers but also in cross-cultural differences in gesture form and meaning.

Researchers have long argued that speakers use language and gesture as an integrated ensemble to negotiate a single contribution to conversation—to "express a single thought" (McNeill, 1992, Engle, 2000, Kendon, 2004). We begin with a collection of attested examples which lets us develop this idea precisely (Section 2). We show that characterising the interpretation of such examples demands a fine-grained semantic and pragmatic representation, which must encompass content from language and gesture and formalise scope relationships, speech acts, and contextually-inferred referential connections. Our approach adopts representations that use dynamic semantics to capture the evolving structure of salient objects and spatial relationships in the discourse and a segmented structure organised by rhetorical connections to characterise the semantic and pragmatic connections between gesture and its communicative context. We motivate and describe these logical forms in Section 3.

We then follow up our earlier programmatic suggestion (Lascarides and Stone, in press) that such logical forms should be derivable from underspecified semantic representations that capture constraints on meaning imposed by linguistic and gestural form, via constrained inference which reconstructs how language and gesture are rhetorically connected. Our underspecified semantic representations, described in Section 4, capture the incompleteness of meaning that's revealed by gestural form while also capturing, very abstractly, what a gesture must convey given its particular pattern of shape and movement. We describe the resolution from underspecified meaning to specific interpretation in Section 5: a glue logic composes the logical form of discourse from underspecified semantic representations via default rules for inferring rhetorical connections; and as a byproduct of this reasoning underspecified aspects of meaning are disambiguated to pragmatically preferred, specific values. These formal resources parallel those required for recognising how sentences in spoken discourse are coherent, as developed by Asher and Lascarides (2003) inter alia.

The distinctive contribution of our work, then, is to meet the challenge, implicit in descriptive work on non-verbal communication, of handling gesture within a framework that's continuous with and complementary to purely linguistic theories. This is both a theoretical and methodological contribution. In formalising the principles of coherence that guide the interpretation of gesture, we go beyond previous work—whether descriptive (McNeill, 1992, Kendon, 2004), psychological (So, Kita and Goldin-Meadow, in press, Goldin-Meadow, 2003), or applied to embodied agents (Cassell, 2001, Kopp, Tepper and Cassell, 2004). Such a logically precise model is crucial to substantiating the theoretical claim that speech and gesture convey an integrated message.

Such a model is also crucial to inform future empirical research. The new data afforded by gesture calls for refinements to theories of purely linguistic discourse, potentially resulting in more general and deeper models of semantics and pragmatics. But a formal model is often indispensable for formulating precise hypotheses to guide empirical work. For instance, our framework raises for the first time a set of logically precise constraints characterising reference and content across speech and gesture and sequences of gesture in embodied discourse. Testing these constraints empirically will have a direct influence not only on the development of formal theory but on our understanding of the fundamental pragmatic principles underlying multimodal communication. A hybrid research methodology, combining empirical research and logically precise models to mutual benefit, has proved highly successful in analysing language. We hope the same will be true for analysing gesture.

2 Dimensions of Gesture Meaning in Interaction with Speech

We begin with an overview of the possible interpretations of improvised coverbal gestures. We emphasise that the precise reference and content of these gestures is typically indeterminate, so that multiple consistent interpretations are often available. It is the details and commonalities of these alternative interpretations that we aim to explain. We argue that they reveal three key generalisations about gesture and its relationship to speech.

1. Gestures can depict the referents of expressions in the synchronous speech, inferentially related individuals, or salient individuals from the prior context.

2. Gestures can show what speech describes, or they can complement speech with distinct but semantically related information.

3. Gesture and speech combine into integrated overarching speech acts with a uniform force in the dialogue and consistent assignments of scope relationships.

These principles—which we defend in more detail in Lascarides and Stone (2006, in press)—underpin the formalism we present in the rest of the paper.

Our discussion follows McNeill (2005, p. 41) in characterising deixis and iconicity as two dimensions of gesture meaning, rather than two kinds of gesture. Deixis is that dimension of gesture meaning that locates objects and events with respect to a consistent spatial reference frame; our first examples highlight the semantic interaction of this spatial reference with the words that accompany coverbal gestures. Iconicity, meanwhile, is the dimension of gesture meaning which depicts aspects of form and motion in a described situation through a natural correspondence with the form and motion of the gesture itself; we consider iconicity in relatively 'pure' examples later in this section. So characterised, of course, deixis and iconicity are not mutually exclusive, so our formalism must allow us to regiment and combine the deictic and iconic contributions to gesture interpretation.

We start with utterance (1), which is taken from Kopp et al.'s (2004) corpus of face-to-face direction-giving dialogues and visualised in Figure 1 (in this paper, we use square brackets to indicate the temporal alignment of speech and gesture, and where relevant small caps to mark pitch accents):1

(1) And [Norris]1 [is exactly across from the library.]2
First: The left arm is extended into the left periphery; the left palm faces right so that it and the fingers are aligned with the forearm in a flat, open shape. Meanwhile, the right hand is also held flat, in line with the forearm; the arm is held forward, with elbow high and bent, so that the fingers are directly in front of the shoulder.
Second: The left hand remains in its position while the right hand is extended to the extreme upper right periphery.

Figure 1: Hand gestures place landmarks on a virtual map. (Left panel: "And Norris..."; right panel: "...is exactly across from the library.")

The utterance concludes a direction-giving episode in which the speaker has already used gestures to create a virtual map of the Northwestern University campus (more of this episode appears later as examples (10) and (11)). Throughout the utterance, the speaker's left hand is positioned to mark the most salient landmark of this episode—the large lagoon at the centre of the campus, which she has introduced in previous speech and gesture. The gesture on the left of Figure 1 positions the right hand at a location that was initially established for Norris Hall, while the next gesture moves the right hand to resume an earlier demonstration of the library. One could interpret these gestures as demonstrating the buildings, their locations, or even the eventualities mentioned in the clause. But any of these interpretations results in a multimodal communicative act where the right hand indexes entities referenced in the accompanying speech and indicates the spatial relationship that the sentence describes. To capture the deictic dimension of gesture form and meaning here, we characterise the form of the gesture as targeting a specific region of space, and then use a reference to that region of space to characterise the content that the gesture conveys.

1 Video is available at homepages.inf.ed.ac.uk/alex/Gesture/norris-eg.mov.

Example (2), from Engle (2000, p. 37), illustrates a less constrained relationship between the contents conveyed by speech and gesture (pitch accents are shown with small caps).

(2) They [have springs.]
Speaker places right pinched hand (that seems to be holding a small vertical object) just above left pinched hand (that seems to be holding another small vertical thing).

The speaker here describes how the cotter pins on a lock are held in position. The utterance refers to the set of pins with they and the whole set of corresponding springs with springs. The gesture, however, depicts a single spring and pin in combination, highlighting the vertical relationship through which a spring pushes its corresponding pin into the key cylinder to help hold the cylinder in place. As is common, the gesture remains ambiguous; it is not clear which hand represents the spring and which the pin.2 But even allowing for this ambiguity, we know that the gesture elaborates on the speech by showing the vertical spatial relationship maintained in a prototypical case of the relationship described in speech. Furthermore, as Engle notes, the gesture serves to disambiguate the plural predication in the accompanying sentence to a distributive interpretation.
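To make Engle's last observation concrete, one way to spell out the distributive reading that the gesture selects (a schematic first-order gloss of ours, not the representation we develop later in the paper) is:

    ∀x(pin(x) → ∃y(spring(y) ∧ have(x, y)))

that is, each pin has its own spring, as opposed to a collective reading on which the pins jointly have some springs.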

Gestures maintain semantic links to questions and commands, as well as assertions. Such examples underscore the need to integrate the reference and content of gesture precisely with semantic and pragmatic representations of linguistic units. Consider the following example, taken from the AMI corpus (dialogue ES2002b, Carletta (2007)), in which a group of four people are tasked with designing a remote control:

(3) C: [Do you want to switch places?]
While C speaks, her right hand has its index finger extended; it starts at her waist and moves horizontally to the right towards D and then back again to C's waist, and this movement is repeated, as if to depict the motion of C moving to D's location and D moving to C's location.

Intuitively, the gesture in (3) adds content to C's question: it is not about whether D wants to switch places with someone unspecified, but rather switch places with C (the speaker). So overall, the multimodal action means "Do you want to switch places with me?". A different gesture, involving C's hand moving between agents D and A, would have resulted in a different overall question: Do you want to switch places with A? To capture this interaction in logical form, the interrogative operator associated with the question must have a referential or scopal dependence which allows its contribution to covary with the content of the gesture.
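Anticipating the machinery of Sections 3-5, and using illustrative predicate names and a generic rhetorical connection R rather than our final notation, the logical form of (3) pairs a speech unit whose content is underspecified with a gesture unit that resolves it:

    π1: ?(want(d, switch-places(d, x)))    (speech: the second party x is underspecified)
    π2: the gesture designates a back-and-forth path between v(~pC) and v(~pD)
    R(π1, π2)

Resolving R so that the two units cohere forces x = c, the speaker, yielding "Do you want to switch places with me?".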

The following example is taken from the same dialogue as (1):

(4) [You walk out the doors]
The gesture is one with a flat hand shape and vertical palm, with the fingers pointing right and palm facing outward.

The linguistic component expresses an instruction. And intuitively, the interpretation of the gesture is also an instruction: "and then immediately turn right". The inferential connection here must integrate both the semantic and pragmatic relationships between gesture and speech. Semantically, the two modalities respect a general presentational constraint that the time and place where the first event ends (in this case, where the addressee would be once he walks out the door) overlaps with where the second event starts (turning right). Pragmatically, they are interpreted as presenting an integrated instruction. (Note that if we were to replace the clause in (4) with an indicative such as "John walked out the doors", then although the content of the utterance and the gesture would exhibit the same semantic relationship, the gesture would now have the illocutionary effects of an assertion.) Both aspects can be characterised by representing the content of the two as connected by a rhetorical relation: this reflects both their integrated content and the integrated speech act that the speaker accomplishes with the two components of the utterance (Asher and Lascarides, 2003). In this case, that relation is Narration, the prototypical way to present an integrated description of successive events in discourse.

2 In fact, springs push pins down into the cylinder, as diagrammed in the explanation this subject relied on, and as required by the need to make locks robust against gravity.
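In the notation we develop in Section 3, and with illustrative predicate names of our own, the resulting logical form would include something like:

    π1: walk-out(e1, you, doors)
    π2: turn-right(e2, you)
    Narration(π1, π2)

where the semantics of Narration guarantees that e1 and e2 stand in contingent succession, with the spatiotemporal endpoint of e1 overlapping the start of e2, and where π1 and π2 inherit the directive force of the overarching speech act.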

Our examples thus far show how the hands can establish a consistent deictic frame to indicate objects and actions in a real or virtual space. Many gestures use space more abstractly, to depict aspects of form and motion in a described situation. A representative example is the series of gestures in (5)—Example 6 and Figure 8.4 in (Kendon, 2004, p. 136)—which is an extract of the story of Little Red Riding Hood:

(5) a. and [took his] [hatchet and
First: Speaker's right hand grasps left hand, with wrists bent.
Second: Speaker lifts poised hands above right shoulder.

b. with] [a mighty sweep]
First: Hands, held above right shoulder, move back then forward slightly.

c. [(pause 0.4 sec)] [sliced the wolf's stomach open]
First: Speaker turns head.
Second: Arms swing in time with sliced; then are held horizontally at the left.

In this example, the speaker assumes what McNeill (1992, p. 118) calls character viewpoint: she mirrors the actions of the woodsman as he kills the wolf. In (5a) and (5b), the speaker's hands, in coming together, depict the action of grabbing the handle of the hatchet and then the action of drawing the hatchet overhead, ready to strike; in (5c), the speaker's broad swinging motion depicts the woodsman's effort in delivering his blow to the wolf with the hatchet. The whole discourse thus exhibits a consistent dramatisation of the successive events, with the speaker understood to act out the part of the woodsman, and her hands in particular understood as holding and wielding the woodsman's hatchet. However, there seems to be no implication that these actions are demonstrated in the same spatial frame as previous or subsequent depictions of events in the story. Cross-cultural studies, as in the work of Haviland (2000) among others, suggest that narrative traditions differ in how freely gestures can depart from a presupposed anchoring to a consistent real or virtual space, with examples like (5) in English representing the most liberal case. Following the descriptive literature on gesture, we represent the form of such gestures with qualitative features that mirror elements of the English descriptions we give for these gestures. In (5a), for example, we indicate that the speaker's hands are held above the right shoulder, that the right is grabbing the left, that both hands are in a fist shape as though grabbing something. Iconicity is captured by the relationship between these gesture elements and a naturally-related predication that each element contributes to the interpretation of the gesture as a whole.

Gestures with iconic meanings, like those with deictic meanings, must be interpreted in tight semantic and pragmatic integration with the accompanying utterances. Consider utterance (6) from the AMI corpus (dialogue ES2005b):


(6) D: And um I thought not too edgy and like a box, more kind of hand-held more um... not as uh [computery] and organic, yeah, more organic shape I think.
When D says computery, her right hand has fingers and thumb curled downward (in a 5-claw shape), palm also facing down, and she moves the fingers as if to depict typing.

The content of D's gesture is presented in semantic interaction with the scope-bearing elements introduced in the sentence. Intuitively, the gesture depicts a keyboard, not anchored to any specific virtual space. There is nothing within the form of the gesture itself that depicts negation. Nevertheless, D's overall multimodal act entails a negation: the design should not be like a keyboard. This requires the negation that's introduced by the word not to semantically outscope the content depicted by the gesture. This scope relation can be achieved only via a unified semantic framework for representing verbal and gestural content. In fact, we will argue in Section 4.3 that example (6) calls for an integrated, compositional description of utterance form and meaning that captures both linguistic and gestural components. That way, established and well-understood methods for composing semantic representations from descriptions of the part–whole structure of communicative actions can determine the semantic scope between gestured content and linguistically-introduced negation.
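Schematically, and borrowing the [G] operator that Section 3.2 introduces to mark gesture-internal content (the predicate names are our illustrative glosses), the required reading is:

    ¬(computery(x) ∧ [G](keyboard(k) ∧ typing(e, k)))

with the gesture's content inside the scope of the linguistically-introduced negation, rather than conjoined to the utterance outside it.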

Iconicity gives rise to the same interpretive underspecification that we saw with deixis. For example, consider the depiction of 'trashing' in the following example from a psychology lecture:3

(7) I can give you other books that would totally trash experimentalism.
When the speaker says trash, both hands are in an open flat handshape (ASL 5), with the palms facing each other and index fingers pointing forward. The palms are at a 45 degree angle to the horizontal. The hands start at the central torso, and move in parallel upwards and to the left.

While there is ambiguity in what the speaker's hands denote—they could be hands metaphorically holding experimentalism itself, or a representation of a statement of experimentalism such as a book, or the content of such a book—the gesture is clearly coherent and depicts experimentalism being thrown away.

The following example (8) illustrates the possibility for deictic and iconic imagery to combine in complex gestures. It is extracted from Kendon's Christmas cake narrative (2004, Fig. 15.6, pp. 321–322), where a speaker describes how his father, a small-town grocer, would sell pieces of a giant cake at Christmas time:

(8) a. and it was [pause 1.02 sec] this sort of [pause 0.4 sec] size
During the pauses, the speaker frames a large horizontal square using both hands; his index fingers are extended, but other fingers are drawn in, palms down.

b. and [he'd cut it off in bits]
The speaker lowers his right hand, held open, palm facing to his left, in one corner of the virtual square established in the previous gesture.

The gesture in (8b) involves both iconic and deictic meaning. The iconicity comes in the configuration and orientation of the speaker's right hand, which mirrors a flat vertical surface involved in cutting: perhaps the knife used to cut the cake; or the path it follows through the cake; or the boundary thereby created. (We are by now familiar with such underspecification.) The deixis comes in the position of the speaker's hand, which is interpreted by reference to the virtual space occupied by the cake, as established by the previous gesture.

3 See www.talkbank.org/media/ClassTalk/Lecture-unlinked/feb07/feb07-1.mov


We finish with an example, taken from a lecture on speech,4 that—like all our examples—underscores how gesture interpretation is dependent on both its form and its coherent links to accompanying linguistic context:

(9) So there are these very low level phonological errors that tend to not get reported.
The hand is in a fist with the thumb to the side (ASL A) and moves iteratively in the sagittal plane in clockwise circles (as viewed from left), below the mouth.

Figure 2: Hand gestures depicting speech errors. (The speaker says: "There are these very low level phonological errors that tend not to get reported.")

One salient interpretation is that the gesture depicts the iterative processes that cause low-level phonological errors, slipping beneath everyone's awareness. In previous utterances, the speaker used both words and gestures to show that anecdotal methods for studying speech errors are biased towards noticeable errors like Spoonerisms. Those noticeable errors were depicted with the hand emerging upward from the mouth into a prominent space between the speaker and his audience. If we take the different position of the gesture in (9), below the mouth, nearer the speaker, as intended to signal a contrast with this earlier case, then we derive an interpretation of this gesture as depicting low-level phonological errors as less noticeable. At the same time, as in (8b), we might understand the hand shape iconically, with the fist shape suggesting the action of processes of production in bringing forth phonological material. This ensemble represents a coherent interpretation because it allows us to understand the gesture as providing information that directly supports what's said in the accompanying speech—the fact that these errors are less noticeable explains why anecdotal methods would not naturally detect them.

Of course, this interpretation is just one of several coherent alternatives for this gesture. Another plausible interpretation of the gesture in (9) is that it depicts the low level of the phonological errors, rather than the fact that these errors are less noticeable. This alternative interpretation is also coherently related to the linguistic content: like (1) it depicts objects that are denoted in the sentence. In fact, this interpretation would be supported by a distinct view of the gesture's form, where instead of conceiving it as a single stroke (as our prior interpretation requires, since the repeated movement was taken to depict an iterative process), it is several strokes—a sequence of identical gestures, each consisting of a fist moving in exactly one circle, and each circle demonstrating a distinct low-level phonological error. This alternative interpretation demonstrates how ambiguity can persist in a coherent discourse at all levels, from form to interpretation. But crucially, all plausible interpretations must satisfy the dual constraints that (a) the interpretation offer a plausible iconic rendition of the gesture's form; and (b) the interpretation be coherently related to the content conveyed by its synchronous speech. Accordingly, while computing the interpretation of gesture via unification with the content of synchronous speech may suffice for examples where gesture coherence is achieved through conveying the same content as speech (Kopp, Tepper and Cassell, 2004), on its own it cannot account for the gestures in examples such as (4) and (6) that evoke distinct, but related, objects and properties to those in the speech. Rather, computing an interpretation of the gesture that is coherently related to the content conveyed in speech will involve commonsense reasoning.

4 http://www.talkbank.org/media/Class/Lecture-unlinked/feb02/feb02-8.mov

Whether the full inventory of rhetorical relations that are attested in linguistic discourse is also attested for relating a gesture to its synchronous speech is an empirical matter. But we rather suspect that certain relations are excluded—for instance, interpreting a gesture so that it connects to its synchronous speech with Disjunction seems implausible (although Disjunction could relate one multimodal discourse unit that includes a gestural element to another). But this doesn't undermine the role of coherence relations in interpreting gesture any more than it does for interpreting other purely linguistic constructions that signal the presence of one of a strict subset of rhetorical connections. For example, sense-ambiguous discourse connectives such as and are like this: and signals the presence of a rhetorical relation between its complements; it underspecifies its value, but it cannot be Disjunction (Carston, 2002). Similarly, Kortmann (1991) argues that the interpretation of free adjuncts (e.g., "opening the drawer, John found a revolver") involves inferring coherence relationships between the subordinate and main clauses, but certain relationships such as Disjunction are ruled out. It isn't surprising that synchronous speech and gesture likewise signal the presence of a coherence relation whose value is not fully determined by form, although certain relations are ruled out.

3 The Logical Form of Multimodal Communication

The overall architecture of our formalism responds to the claim, substantiated in Section 2, that gesture and speech present complementary, inferentially-related information as part of an integrated, overarching speech act with a uniform force and consistent assignments of scope relationships. We formalise this integration of gesture and speech by developing an integrated logical form (LF) for multimodal discourse, which, like the LF of purely linguistic discourse, makes explicit the illocutionary content that the speaker is committed to in the conversation. As in theories of linguistic discourse, we give a central place in LF to rhetorical relations between discourse units. Here rhetorical relations must not only link linguistic material together, but also gestures to synchronous speech and to other material in the ongoing discourse.

A rhetorical relation represents a type of (relational) speech act (Asher and Lascarides, 2003). Examples include Narration (describing one eventuality and then another that is in contingent succession); Background (a strategy like Narration's save that the eventualities temporally overlap); and Contrast (presenting related information about two entities, using parallelism of syntax and semantics to call attention to their differences). The inventory also includes metatalk relations, that relate units at the level of the speech acts rather than content. For instance, you might follow "Chris is impulsive" with "I have to admit it"—an explanation of why you said Chris is impulsive, not an explanation of why Chris is impulsive. These are symbolised with a subscript star—Explanation∗ for this example.

To extend the account to gesture, we highlight an additional set of connections which specify the distinctive ways embodied communicative actions connect together. The examples from Section 2 provide evidence for three such relations. First, Depiction is the strategy of using a gesture to visualise exactly the content conveyed in speech. Example (1) is an illustrative case. The speaker says that Norris is across from the library at the same time as she depicts their relative locations across from one another. Technically, Depiction might be formalised as a special case of Elaboration, where the gesture does not present additional information to that in the speech. We distinguish Depiction, however, because Depiction does not carry the implicatures normally associated with redundancy in purely spoken discourse (Walker, 1993)—it is helpful, not marked.

Second, Overlay relates one gesture to another when the latter continues to develop the same virtual space. Example (1), which is preceded in the discourse by (10), illustrates this:

(10) a. Norris is like up here—
The right arm is extended directly forward from the shoulder with forearm slightly raised; the right palm is flat and faces up and to the left.

b. And then the library is over here.
After returning the right hand to rest, the right hand is re-extended now to the extreme upper right periphery, with palm held left.

The speaker evokes the same virtual space in (1) as in the preceding (10) by designating the same physical positions when naming the buildings. The rhetorical connection Overlay captures the intuition that commonalities in the use of space mark the coherent use of gesture. Here, a logical form that features Overlay between the successive gestures in (10ab) and (1) captures the correct spatial aspects of the discourse's content.
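Schematically (our gloss; Section 3.1 supplies the spatial mappings v), Overlay constrains adjacent gestures to deploy a common mapping from physical to virtual space:

    Overlay(π1, π2) → vπ1 = vπ2

so that positions designated in the second gesture are interpreted on the same virtual map as those in the first.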

Our third new relation is Replication, which relates successive gestures that use the body in the same way to depict the same entities. The gestures of example (5) illustrate Replication. The initial gesture adopts a figuration in which the speaker represents the woodsman, with her hands modelling his grip on the handle of the hatchet. While the subsequent speech no longer explicitly mentions the hatchet or even the woodsman, subsequent gestures continue with the imagery adopted in the earlier gesture. Connecting subsequent gestures to earlier ones by Replication captures the coherence of this consistent imagery.

Our plan for this section is to formalise this programmatic outline, by developing representations of logical form that combine these rhetorical relations with appropriate models of spatial content (Section 3.1), dynamic semantics (Section 3.2), and discourse structure (Section 3.3). We then show that these logical forms allow for the interpretive links in reference, content and scope that we observed in Section 2. The section culminates in the presentation of a language Lsdrs (Section 3.4) for describing the content of multimodal discourse that is based on that of SDRT (Asher and Lascarides, 2003).

3.1 Spatial Content

We begin by formalising the spatial reference that underpins deixis as a dimension of gesture meaning. Our formalisation adds symbols for places and paths through space, variables that map physical space to virtual or abstract spaces, and predicates that record the propositional information that gestures offer in locating entities in real, virtual and abstract spaces. This section presents each of these innovations in turn.

We begin by adding to the model a spatiotemporal locality L ⊂ R^4 within which individuals and events can be located. We also add to the language a set of constants ~p1, ~p2, . . ., which are mapped to a subset of L by the model's interpretation function I—i.e., [[~p]]M = IM(~p) ⊆ LM. Whenever we need to formalise the place or path in physical space designated by the gesture, we use a suitable constant ~p. We'll return shortly to how physical locations map to locations in the situation the speaker describes.

In Section 2, we used the gestures in (1) as representative examples of spatial reference, since the position of the right hand in space signals the (relative) locations of Norris Hall and the library in interpretation. To formalise this, we can use a spatial reference ~pn denoting a place in front of the speaker's shoulder, and ~pl denoting a place up and to her right. The contrast is now evident between the use of space in (1) and the mimicry in examples such as (5) and (7). Whereas (5) and (7) portray 'non-spatial' content through qualitative aspects of movement, (1) expresses intrinsically spatial information through explicit spatial reference.

For now, we remain relatively eclectic about how a speaker uses movement to indicate a spatiotemporal region. In (1) the speaker designates the position of the hand. But in (11)—a description of the library from the discourse preceding (1)—the speaker designates the trajectory followed by the hand:

(11) It's the weird-looking building over here.
The left hand shape is ASL 5 open, the palm facing right and fingers facing forward; the hand sweeps around to the right as though tracing the surface of a cylinder.

This trajectory is meant to represent the cylindrical exterior of the library—a fact that must be captured in LF via a suitable constant ~ps. As we saw in Section 2, such alternative methods of spatial reference give rise to ambiguities in the form and interpretation of gestures—ambiguities that may never be fully resolved. Accordingly, it may not be possible or desirable to draw inferences from LFs involving spatial constants such as ~ps that depend on the constants' exact values.

A further crosscutting distinction is whether the speaker indicates the location of the hand itself, as in (1) and (11), or uses the hand to designate a distant region. The typical pointing gesture, with the index finger extended (the ASL 1-index hand shape), is often used in this way. Take the case of deferred reference illustrated in (12), after (Engle, 2000, Table 8, p. 38):

(12) [These things] push up the pins.
The speaker points closely at the frontmost wedge of the line of jagged wedges that runs along the top of a key as it enters the cylinder of a lock.

It seems clear that the speaker aims to call attention to the spatial location ~pw of the first wedge, not the spatial location of the finger.5

Utterance (12) also contrasts with (1) and (11) in whether they link up with the real places or establish a virtual space that models the real world. In (12) the speaker locates the actual wedge on the key. In (1) and (11), however, the speaker is not actually pointing at the buildings—she marks their imagined locations in the space in front of her. The information that (12) gives about the real world can therefore be characterised directly in terms of the real-world region ~pw that the speaker's gesture designates. By contrast, the content of (1) and (11) can only be described in terms of context-dependent mappings vc and vs from the space in front of the speaker to (in this case) the Northwestern University campus. The relationship between the real positions of the speaker's hands during her demonstrations ~pn and ~pl in (1) thus serves to signal the corresponding relationship between the actual location of Norris Hall vc(~pn) and the actual location of the library vc(~pl). A related (perhaps identical) mapping vs is at play in (11) when the speaker characterises the actual shape of the library facade vs(~ps) in terms of the curved path ~ps of her hand.

5 This demonstrative noun phrase and accompanying demonstration is attested in Engle's data. Unfortunately Engle does not report the entire sentence context in which the gesture is used; the continuation is based on other examples she reports. Further examples of the variety of spatial reference in gesture are provided by Johnston et al. (1997), Johnston (1998), Lücking, Rieser and Staudacher (2006).

These spatiotemporal mappings are the second formal innovation of the language to represent gesture meaning. Formally, variables such as vc and vs are part of an infinite family of variables that denote transformations over L. They simplify the relationship between the form of a gesture and its semantics considerably. We do not have to assume an ambiguity between reference to physical space vs. virtual space. Rather, gesture always refers to physical space and always invokes a mapping between this physical space and the described situation—e.g., the gesture in (12) makes the relevant mapping the identity function vI.

The values of these variables v1, v2, . . . are determined by context. Some continuations of discourse are coherent only when the speaker continues to use the space in his current gesture in the same way as his previous gestures. Other continuations are coherent even though the current gesture uses space in a different way. The values of v1, v2, . . . are therefore provided by assignment functions, which in our dynamic semantics mediate the formal treatment of context dependence and context change, since they are a part of the context of evaluation (see Section 3.4). To respect the iconicity, the possible values for a mapping v are tightly constrained: they can rotate and re-scale space but not effect a mirroring transformation. At the same time (given their origin in human cognition and bodily action), we would not expect mappings to realise spatial relationships exactly. Here we simply assume that there is a suitably constrained set of mappings T in any model, and so where f is an assignment function, f(v) ∈ T.
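One concrete way to delimit T, offered as a sketch since the text deliberately leaves T abstract, is as the orientation-preserving similarity transformations of the spatial component of L:

    v(~x) = sR~x + ~t,   with scale s > 0, R a rotation (det R = 1), and translation ~t

Mirrorings, which would require det R = −1, are excluded, while rotation, translation and re-scaling are permitted; an approximate variant could further tolerate bounded distortion, reflecting the imprecision of bodily action.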

As is standard in dynamic semantics (Williamson, 1994, Kyburg and Morreau, 2000, Barker, 2002), we understand the context-dependence of mapping variables to offer a solution to the problem of vagueness in spatial mappings. These variables take on precise values, given a precise context. However, interlocutors don't typically nail down a precise context, and if the denotation of a variable v is not determined uniquely, the LF will be correspondingly vague about spatial reference. A range of spatial reference will be possible and the interpretation of the gesture will fit a range of real-world layouts. For instance, utterances (1) and (11) can either exemplify a statement at a particular time, or extend their spatiotemporal reference throughout a wider temporal interval (Talmy, 1996). We also discussed in Section 2 that the gestures in (1) can be interpreted as demonstrating the buildings or the locations of the buildings. These alternatives correspond to distinct values for the mapping v that might both be supported by the discourse context. Interlocutors can converse to resolve these vagaries (Kyburg and Morreau, 2000, Barker, 2002). But eventually, even if some vagueness in interpretation persists, interlocutors understand each other well enough for the purposes of the conversation (Clark, 1996).

We complete this formalisation of spatial reference in gesture by introducing two new predicates: loc and classify respectively describe the literal and metaphorical use of space to locate an entity. loc is a 3-place predicate, and loc(e, x, ~p) is true just in case at each moment spanned by the temporal interval e, x is spatially contained in the region specified by ~p. For example, (13a) represents the interpretation of the gestures in (1):

(13) a. loc(e1, n, vc(~pn)) ∧ loc(e2, l, vc(~pl))

b. loc(e3, f, vs(~ps)) ∧ facade(l, f)

c. loc(e4, w, vI(~pw))

In words, e1 is the state of n, the discourse referent for Norris Hall that's introduced in the clause, being contained in the spatiotemporal image vc(~pn) on the speaker's virtual map of a point ~pn in front of her left shoulder; e2 is the state of the library l being contained in the spatiotemporal image vc(~pl) of the designated point ~pl further up and to the right. (13b) states that the facade f of the library lies in the real-world cylindrical shell vs(~ps) determined by the speaker's hand movement ~ps in (11). The predication facade(l, f) is not contributed by an explicit element of gesture morphology but is, as we describe in more detail in Section 3.2, the result of pragmatic inference that connects individuals introduced in the gesture to antecedents from the verbal content (i.e., the library). The logical form (13c) of the deictic gesture in (12) locates the frontmost wedge w at the (distant) location ~pw where the speaker is pointing. The deferred reference from that one wedge w to the entire set of wedges at the top of the key as denoted by the linguistic phrase these things is obtained via pragmatics: context resolves to a specific value an underspecified relation between the deictic referent and the NP's referent, this underspecified relation being a part of the compositional semantics of the multimodal act (see Section 4.2). Indeed, identifying the gesture's referent and spatial mapping as w and vI, resolving the referent of these things, and resolving this underspecified relation between them (to exemplifies) are logically co-dependent (see Section 5).
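Spelling out the truth conditions of loc slightly more formally (our gloss; τ(e) for e's temporal span and region(x, t) for the region x occupies at t are illustrative notation):

    loc(e, x, ~p) is true in M relative to an assignment   iff   for every t ∈ τ(e), region(x, t) × {t} ⊆ [[~p]]M

so that ~p, as a spatiotemporal subset of L, constrains x's location throughout the eventuality e.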

Speakers can also use positions in space as a proxy for abstract predications. Such metaphorical uses are formalised through the predicate classify. A representative illustration can be found in the conduit metaphor for communication (Reddy, 1993), where information contributed by a particular dialogue agent is metaphorically located with that agent in space, as shown in the naturally-occurring utterance (14):

(14) We have this one ball, as you said, Susan.
The speaker sits leaning forward, with the right elbow resting on his knee and the right hand held straight ahead, in a loose ASL-L gesture (thumb and index finger extended, other fingers curled) pointing at his addressee.

(14) is part of an extended explanation of the solution to a problem in physics. The speaker's explanation describes the results of a thought experiment that his addressee Susan had already introduced into the dialogue, and he works to acknowledge this in both speech and gesture. More precisely, both the gesture in (14) and the adjunct clause "as you said" function as meta-comments characterising the original source of "We have this one ball". The gesture, by being performed close to Susan and away from the speaker's body, shows that his contribution here is metaphorically located with Susan; that is, it recapitulates content she contributed.

Formally, we handle such metaphorical reference to space by assuming a background of meaning postulates that link spatial coordinates with corresponding properties. The predicate classify is used to express instantiations of this link. For example, corresponding to the conduit metaphor is a virtual space vm that associates people with their contributions to conversation. If ~pi is the location of any interlocutor i, then classify(e, u, vm(~pi)) is true exactly when utterance u presents content that is originally due to a contribution by i. The generative mapping vm between interlocutors' locations and contributed content shows why the metaphor depends on spatial reference. The LF of (14) will therefore use the formula classify(e6, u′, vm(~pS)), where u′ denotes an utterance whose content entails "We have this one ball", and ~pS denotes Susan's location.

This treatment of metaphor is continuous with models of linguistic metaphor as context-dependent indirect reference (Stern, 2000, Glucksberg and McGlone, 2001). This indirect reference depends on the conventional referent (here a point in space) and a generating principle taken from context that maps the conventional referent into another domain (here the domain of contributing to conversation). As revealed within cognitive linguistics (Lakoff and Johnson, 1981, Gibbs, 1994, Fauconnier, 1997), such mappings are typically richly-structured but flexible and open-ended. Thus, interlocutors will no more fix a unique way to understand the metaphor than they will fix other aspects of context-dependent interpretation. So metaphorical interpretations on our account—though represented precisely in terms of context-dependent reference—will remain vague.

3.2 Dynamic Semantics

We proceed by formalising a shared set of constraints on co-reference in discourse that describes both deictic and iconic gesture meaning. Our formalisation distinguishes between entities introduced in speech and those introduced in gesture. As a provisional account of our empirical data and linguistic intuitions, we articulate a model in which entities introduced in gesture must be bridging-related to entities introduced explicitly in accompanying speech (Clark, 1977). These inferred entities can figure in the interpretation of subsequent gestures but do not license pronominal anaphora in subsequent speech. This provisional model offers a lens with which to focus future empirical and theoretical research.

In Section 2, we presented a range of examples in which gestures seem most naturally understood as depicting entities that are not directly referenced in the accompanying speech, but which stand in close relations to them given commonsense knowledge. For instance in (2), the prototypical spring and the prototypical pin depicted in gesture are inferable from the set of springs and the set of pins explicitly referenced in words. Conversely in (12), the referent of "these things" is inferable from but not identical to the specific individual wedge that's denoted by the speaker's gesture. The grip of the woodsman's hands on the handle of the hatchet, we suggested, is inferable but not explicitly referenced in the words "took his hatchet" of (5a). The knife-edge depicted by the hand in (8b) is also inferable but not explicitly referenced in the accompanying statement about the grocer's work with the cake, "he'd cut it off into bits".

While we do not believe gestures refer only to entities evoked in speech, we do think some inferential connection to prior discourse is necessary. Not only does this characterise all the examples we have investigated, but to go beyond inference would place an uncharacteristically heavy burden on gesture meaning, which is typically ambiguous and open-ended except when interpreted in close connection to its discourse context. Thus, we will formalise initial references to new entities in gesture by analogy to definite references to inferable entities in discourse—what is known as bridging in the discourse literature (Clark, 1977). Our techniques are standard; see e.g. Chierchia (1995). But nevertheless they offer the chance to engage future empirical work on reference in gesture with a related formal and empirical approach to linguistic discourse.

Entities depicted initially in gesture remain available for the interpretation of subsequent gestures. For example, the lagoon, located as a landmark on the speaker's left in the initial segment of the direction-giving episode excerpted in (1), serves as the referent demonstrated by the speaker's left hand in both gestures of (1)—despite the fact that the speaker does not continue to reference the lagoon in speech. Similarly, once introduced in (5a), the grip of the woodsman's hands on the handle of the hatchet continues to guide the interpretation of the gestures of (5b), even though the speaker does not mention the hatchet again.

By contrast, inferable entities evoked only in gesture seem not to be brought to prominence in such a way as to license the use of a pronoun in subsequent speech. Such examples would be very surprising in light of the tradition in discourse semantics—see e.g., Heim (1982)—that sees pronominal reference as a reflex of a formal link to a linguistic antecedent. In fact, we have found no such examples. And our intuitions find hypothetical attempts at such reference quite unnatural. Try, for example, following (8b) "he'd cut it off into bits" with #"and it would get frosting all over it", with it understood to pick out the cutting edge demonstrated in gesture in (8b). We show in this section how to formalise such a constraint.

Our formalism builds on an existing dynamic semantic model of anaphora in discourse, since with dynamic semantics—unlike alternatives such as Centering Theory (Grosz, Joshi and Weinstein, 1995) and Optimality Theory (Buchwald et al., 2002)—we can build on previous work that integrates anaphoric interpretation with rhetorical relations and discourse structure (Asher and Lascarides, 2003). We use a dynamic semantics where context is represented as a partial variable assignment function (van Eijck and Kamp, 1997). As a discourse is interpreted, the model remains fixed but the input assignment function changes into a different output one. The basic operations are to test that the input context satisfies certain conditions (with respect to the model), to extend the input variable assignment function by defining a value for a new variable, and to sequence two actions together, thereby composing their individual effects on their input contexts. These primitives are already used to model the anaphoric dependencies across sentence boundaries; here we will use them to model anaphoric dependencies between spoken utterances and gesture and between sequences of gestures.6
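In the style of van Eijck and Kamp (1997), and suppressing the details of our final definitions in Section 3.4, the three primitive operations can be pictured as relations between an input assignment i and an output assignment o:

    i [φ?] o       iff  i = o and φ is satisfied in M relative to i       (test)
    i [x] o        iff  o extends i with a value for the new variable x   (extension)
    i [K1 ; K2] o  iff  i [K1] h and h [K2] o for some assignment h       (sequencing)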

To distinguish between a set of prominent entities introduced explicitly in speech and a set of background entities that are depicted only in gesture, we take our cue from Bittner's (2001) formalisation of centering morphology as distinguishing foreground and background entities. This involves splitting the context into two assignment functions 〈f, g〉 to distinguish referents of different status. For us, the first function f records the entities that can be used to interpret pronouns and other anaphors in speech; the second one g records the entities that are the basis for referential depiction in gesture. To ensure that linguistic indefinites can license depiction in gesture (e.g., the indefinite springs in (2) licenses the depiction of a prototypical spring in gesture), existential quantifiers in the logic will trigger an update to both f and g. Meanwhile, to delimit the scope of gesture, we introduce an operator [G] over formulae, which semantically restricts the context-updating operations that happen within its scope to only the second function g (see Section 3.4 for details). [G] is extensional, not modal. It allows us to capture the different status of referents introduced in different ways, while allowing us to treat gesture and speech within a common logical representation—as we must, given the inferential and scopal dependencies we observed in Section 2. The need for this formal device is independent of the use of rhetorical relations to integrate content from different modalities together. In particular, the need for a suitable dynamic semantics does not undermine our claim that gesture and speech are rhetorically connected. Modal subordination and grammatically-marked focus systems in language also block individuals in one clause from being an antecedent to anaphora in another, even when there's a clear rationale for a rhetorical connection. So the anaphoric constraints across speech and gesture are no more a counterargument to rhetorical connections than modal subordination and focus are counterarguments to using rhetorical relations to model linguistic discourse.

6 Our framework aims for the simplest possible dynamic semantic formalisation. This involves taking up generalisations from linguistic discourse as a provisional guide to the behaviour of gesture. For example, we assume that a gesture that's outscoped by a negation, like the keyboard gesture of (6), does not license subsequent anaphora, just as an indefinite in speech that's outscoped by a negation does not license a subsequent pronoun (Groenendijk and Stokhof, 1991). Other analyses are possible, with more complex dynamic semantics, modelled for example after treatments of so-called specific indefinites in language; see Farkas (2002). More generally, our developments are compatible with more complex architectures for dynamic semantics. But the formalism we provide is already sufficient for interpreting the key examples we consider here.

3.3 Rhetorical Relations and Discourse Structure

There are several existing frameworks that use rhetorical relations; e.g., Mann and Thompson(1987), Hobbs et al. (1993). We will use Segmented Discourse Representation Theory (SDRT,Asher and Lascarides (2003)) as our starting point, for three main reasons. First, SDRT fullysupports semantic underspecification. This is useful because the meaning of a gesture asrevealed by its form is highly underspecified—we can re-use SDRT’s existing techniques forresolving underspecified content to pragmatically preferred values in our model of gestureinterpretation (see Section 5). Secondly, SDRT acknowledges that ambiguity can persist in acoherent discourse. Its motivation for doing so stems originally from observing that there maybe more than one maximally coherent interpretation of linguistic discourse. We have seen thatthe same is true of gesture, and so a framework where coherence constrains interpretation butdoesn’t necessarily resolve it uniquely is essential. Finally, SDRT offers a dynamic semanticinterpretation of logical forms. We have seen already in Section 3.2 that dynamic semanticsoffers an elegant way of modelling both the vagueness in gesture interpretation and constraintson co-reference.

In SDRT, logical forms consist of labels π1, π2, . . . that each represent a unit of discourse, and a function that associates each label with a formula that represents the unit's interpretation—these formulae can be rhetorical relations between labels. We will treat individual clauses and gestures as units of discourse, and so they each receive a label.

Rhetorical connections among units of discourse create discourse segments: π1 immediately outscopes π2 if π1's formula includes R(π, π2) or R(π2, π) for some R and π. While a segment may consist of (or outscope) a continuous set of discourse units, this isn't necessary; see Asher and Lascarides (2003) for many examples. Gestures likewise structure discourse in flexible but constrained ways. As we have seen, gestures like those in (10ab) followed by (1) will bear a rhetorical relation both to simultaneous speech and to previous gestures. In all cases, however, the outscopes relation over labels in an LF cannot contain cycles and must have a single root—i.e., the unique segment that is the entire discourse.

Each rhetorical relation symbol receives a semantic interpretation that's defined in terms of the semantics of its arguments. For instance, a discourse unit that is formed by connecting together smaller units π1 and π2 with a veridical rhetorical relation R entails the content of the smaller units, as interpreted in dynamic succession, and goes on to add a set of conditions ϕR(π1,π2) that encode the particular illocutionary effects of R. For example, Explanation(π1, π2) transforms an input context C into an output one C′ only if Kπ1 ∧ Kπ2 ∧ ϕExplanation(π1,π2) also does this, where ∧ is dynamic conjunction, and Kπ1 and Kπ2 are the contents of labels π1 and π2 respectively. The formula ϕExplanation(π1,π2) is a test on its input context. Meaning postulates constrain its interpretation: e.g., Kπ2 must be an answer to the question Why Kπ1? The formalisation of the three new rhetorical relations is straightforward in this setting; see Section 3.4. The content of the entire discourse is then interpreted in a compositional manner, by recursively unpacking the truth conditions of the formula that's associated with the unique root label. This natural extension of the formal tools for describing discourse coherence fits what we see as the fundamental commonality in mechanisms for representing and establishing coherence across all modalities.

The structure induced by the labels and their rhetorical connections imposes constraints and preferences over interpretations: in other words, discourse structure guides the resolution of ambiguity and semantic underspecification that's induced by form. For example, SDRT gives a characterisation of attachment that limits the rhetorical links interpreters should consider for a new discourse unit, based on the pattern of links in preceding discourse. Rhetorical connections also restrict the availability of referents as antecedents to anaphora. Both of these ingredients of SDRT carry over to embodied discourse (Lascarides and Stone, in press).

Anticipating the arguments of Section 4, we assume a distinction between identifying gestures, which simply demonstrate objects, and more general visualising gestures, which depict some aspect of the world. Interpreting a gesture may involve resolving an ambiguity in its form, as to whether it is identifying or visualising. Identifying gestures are interpreted in construction with a suitable linguistic constituent. (We remain agnostic about the temporal and structural constraints between speech and gesture that may apply here.) The joint interpretation introduces an underspecified predicate symbol—call it id_rel—that relates the referent of the identifying gesture to the semantic index of the corresponding linguistic unit. Constructing the discourse's LF then involves resolving the underspecified relation id_rel to a specific value. The complex demonstrative in (12) represents such a case. The speaker's identifying gesture refers to the frontmost wedge on the key. That referent exemplifies the set denoted by the demonstrative NP these things; so in this case id_rel resolves to exemplifies. Discourse structure and commonsense reasoning guide this process of resolution.

Similarly, visualising gestures are also interpreted in construction with a suitable linguistic constituent (often a clause). The joint interpretation introduces an underspecified rhetorical connection vis_rel(πs, πg) between the spoken part πs and the gesture part πg. Thus for a visualising gesture, constructing the LF of the discourse involves achieving (at least) four logically co-dependent tasks that are all designed to make the underspecified logical form that's derived from its form more specific. First, one resolves the underspecified content of the gesture to specific values. Second, the underspecified rhetorical connection vis_rel is resolved to one of a set of constrained values (e.g., Narration is OK, Disjunction is not). Third, one identifies a label for this rhetorical connection: if it's a new label which in turn attaches to some label in the context, then πs and πg start a new discourse segment; otherwise it's an existing label and πs and πg continue that existing segment. And finally, one computes whether πg and/or πs are also rhetorically connected to other labels.

Discourse structure imposes constraints on all of these decisions. In SDRT, the available labels in the context for connections to new ones are either (i) the last label πl that was added, or (ii) any label that dominates πl via a sequence of the outscopes relation and/or subordinating relations (e.g., Elaboration, Explanation, Background are subordinating, while Narration is not). This corresponds to the right frontier when the discourse structure is diagrammed as a graph with outscoping and subordinating relations depicted with downward arcs.7 However, extending SDRT to handle gesture introduces a complication, because the last label is not unique: while a linguistic discourse imposes a linear order on its minimal discourse units (one must say or write one clause at a time), this linear order breaks down when one gestures and speaks at the same time. As the most conservative possible working hypothesis, we assume that new attachments remain limited to the right frontier, the only difference being that instead of one last label there are two: the label π^s_l for the last minimal spoken unit, and the label π^g_l for its synchronous gesture (if there was one). Since there are two last labels there are now two right frontiers. The 'spoken' right frontier Πs is the set of all labels that dominate π^s_l via outscopes and subordinating relations, and the 'gesture' frontier Πg is the set of all labels that dominate π^g_l. Thus the available labels are Πs ∪ Πg (note that Πs ∩ Πg is non-empty and contains at least the root π0). In other words, when an utterance attaches to its context, the dependencies of its speech and gesture are satisfied either through the connection to the discourse as a whole, or to one another, or through the continued organisation of communication in their respective modalities. Given this definition of attachment, antecedents to anaphora can be controlled as in Asher and Lascarides 2003—roughly, the antecedent is in the same discourse unit or in one that's rhetorically connected to it.

7 Parallel, Contrast and discourse subordination relax this right-frontier constraint, but we ignore this here.

This definition combines with the dynamic semantics from Section 3.2 to make precise predictions about how logical structure and discourse structure constrain co-reference. For instance, the co-reference across gestures that we observed in the extended discourse (10ab) followed by (1) satisfies these constraints—each spoken clause is connected to its prior one with the subordinating relation Background, making all the spoken labels on the 'spoken' right frontier; and each gesture is connected to the prior one with the subordinating relation Overlay (this entails that space is used in the same way in the gestures), placing them on the 'gesture' right frontier. So all labels remain available. Each gesture also connects to its synchronous clause with Depiction—a coordinating relation, because the content of one argument is not more fine-grained or 'backgrounded' relative to the other. Nevertheless, all labels are on the right frontier because of the Background and Overlay connections.

Conversely, availability constrains the interpretation of the gestures in (15) given in Figure 3—which is taken from the same dialogue as (10) and (1)—in ways that match intuitions. To see this, we discuss the interpretation of each multimodal action in turn. Intuitively, the gesture in (15a) is related to its synchronous clause with Depiction, since it demonstrates the direction to turn into. (15b) attaches to the spoken and gesture labels of (15a) with the subordinating relation Acknowledgement—thus A's utterance (15c) can connect to (15a). And indeed it does: the speech unit in (15c) is related to that in (15a) with the coordinating relation Narration—so it is interpreted as a command to keep walking after turning right. Furthermore, the gesture connects to (15a)'s gesture with Overlay so that it conveys keep walking rightward from the position where you turned.

(15) a. A: So then, once you get to that parking lot you turn [right].
When A says the word right, her right hand is held in a flat open shape (ASL-5) with the palm facing forward and fingers pointing right, and the hand is held far out to the right of her torso.

b. B: Right

c. A: And you [keep walking].
When A says keep walking, she repeats the gesture from (15a).

d. A: And then there's like a [road]
When A says road, both her hands are brought back towards the centre of her torso, with flat open hand shapes (ASL-5) and palms held vertically and facing each other, with fingers facing forwards.

e. B: U-huh

f. A: [It will just kinda like consolidate, you know, like come into a road].
A's hands are in ASL-5, and they start with the palms facing each other at an angle (as if the hands form two sides of an equilateral triangle), very close to her central torso. They then sweep out towards B, and the hands go from being at an angle to being parallel to each other.

g. A: [Just stay on the road and then walk for a little bit].
A's right hand starts at the centre of her torso and sweeps out to the right.

h. A: [There are buildings over here]
A's right hand goes to her right, to the same place where her hand was in (15a)'s gesture. Her hand is in a loose claw shape with the palm facing down.

Figure 3: An extract from a direction-giving dialogue that features Narration.

But (15d) marks a change in how the physical location of the hands maps to the locations of landmarks on the university campus. The discourse cue phrase and then in (15d) implicates that the linguistic unit attaches to the prior one with Narration, and its gesture attaches to the content of this clause with Depiction to capture the intuition that it depicts the road—the object introduced in the clause. But crucially, the physical location of this depiction bears no relation to the spatial setup given by the prior two gestures: technically, it does not attach to (15c) with Overlay, reflecting the fact that the mapping v′ from physical space to virtual space that is a part of its interpretation is different from the mapping v that was used earlier. To put this another way, the road that is demonstrated in (15d) is not to the left of the walking path that's demonstrated in (15c), even though the hands in (15d) are to the left of where they were in (15c). These rhetorical connections mean that the gestures in (15ac) are no longer on the right frontier. And thus, according to our model, the mapping v that's evoked by (15ac) is no longer available for interpreting subsequent gestures (while the mapping v′ that's used in (15d) is available). Interestingly, this prediction matches our intuitions about how the gestures in (15f–h) are interpreted. In particular, even though the gesture in (15h) places the right hand in the same place as it was in (15a), it does not demonstrate that the buildings it denotes are co-located with the place where the agent is to turn right (i.e., at the parking lot). The right frontier constraint likewise predicts that the clause in (15h) cannot connect to the clause in (15a); it cannot be interpreted in the same way as the discourse "So then, once you get to the parking lot you turn right; there are buildings over here".

3.4 Summary: Formalism and Examples

We now complete our presentation of the logical form for embodied discourse by giving formally precise definitions. We start with the syntax of the language Lsdrs for expressing logical forms, which is based on that of SDRT. It is extended to include spatial expressions (see Section 3.1), and the two last labels (see Section 3.3). The dynamic semantics of Lsdrs is similar to that in Asher and Lascarides (2003), except that a context of evaluation is refined to include two partial variable assignment functions rather than one; these track salient entities for interpreting language and gesture respectively (see Section 3.2). We close with worked examples to illustrate the formalism.

Definition 1 Vocabulary and Terms
The following vocabulary provides the syntactic atoms of the language Lsdrs:

• A set P of predicate symbols (P1, P2, . . .), each with a specified sort giving its arity and the type of term in each argument position;
• A set R of (2-place) rhetorical relation symbols over labels (e.g., Contrast, Explanation, Narration, Overlay, . . .);
• Individual variables (x1, x2, y, z, . . .); eventuality variables (e1, e2, . . .); and constants for spatiotemporal regions (~p1, ~p2, . . .);
• Variables over mappings between spatiotemporal regions (v1, v2, . . .);
• The boolean operators (¬, ∧); the operator [G]; and the quantifiers ∀ and ∃;
• Labels π1, π2, . . ..

We also define terms from this vocabulary, each with a corresponding sort. Individual variables are individual terms; eventuality variables are eventuality terms; and if ~p is a constant for a spatiotemporal region and v is a variable over mappings, then ~p and v(~p) are place terms.

A logical form (LF) for discourse is a Segmented Discourse Representation Structure (SDRS). This is constructed from SDRS-formulae in Lsdrs as defined in Definition 2:

Definition 2 SDRS-Formulae
The definition of the SDRS-formulae Lsdrs starts with a definition of a subset Lbase ⊂ Lsdrs of SDRS-formulae that feature no rhetorical relations:

1. If P ∈ P is an n-place predicate and i1, i2, . . . , in are terms of the appropriate sort for P, then P(i1, . . . , in) ∈ Lbase.
2. If φ, ψ ∈ Lbase and u is a variable, then ∃uφ, ∀uφ ∈ Lbase.
3. If R is a rhetorical relation and π1 and π2 are labels, then R(π1, π2) ∈ Lsdrs.
4. If φ, ψ ∈ Lsdrs, then φ ∧ ψ, ¬φ, [G]φ ∈ Lsdrs.

An SDRS is a set of labels (at most two of which are designated as last), together with an SDRS-formula associated with each label:

Definition 3 SDRS
An SDRS is a triple 〈A, F, last〉, where:

• A is a set of labels;
• F is a mapping from A to Lsdrs; and
• last is a set containing at most two labels {πs, πg} ⊆ A, where πs labels the content of a token linguistic unit, and πg the content of a token gesture (intuitively, this is the last multimodal act, and last will contain no gesture label if the act had no gesture).


We say that π immediately outscopes π′ iff F (π) contains π′ as a literal. Itstransitive closure � must be a well-founded partial order with a unique root (thatis, there is a unique π0 ∈ A such that ∀π ∈ A, π0 � π).

The unique root makes an SDRS the logical form of a single discourse: the segment that theroot label corresponds to is the entire discourse. The outscopes relation need not form a tree,reflecting the fact that a single communicative act can play multiple illocutionary roles in itscontext (see Section 3.3). When there is no confusion we may omit last from the specificationof an SDRS, writing it 〈A,F 〉. We may also write F (π) = φ as π : φ. And we will continueoccasionally to use Kπ as notation for the the content F (π). An example SDRS is shown in(9′); this represents one of the plausible interpretations of (9) (see Section 2)—the gesturedepicts the subconscious nature of the processes that sustain low-level phonological errors:

(9) So there are these very low level phonological errors that tend to not get reported.
The hand is in a fist with the thumb to the side (ASL A) and moves iteratively in the sagittal plane in clockwise circles (as viewed from left), below the mouth.

(9′) π1 : ∃y(low-level(y) ∧ phonological(y) ∧ errors(y) ∧ go-unreported(e, y))
π2 : [G]∃x(continuous(x) ∧ below-awareness(x) ∧ process(x) ∧ sustain(e′, x, y))
π0 : Explanation(π1, π2)
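It may help to see (9′) rendered as a concrete data structure in the shape required by Definition 3. The sketch below (ours; formula bodies are kept as opaque strings, and the ASCII label names stand in for π0, π1, π2) also checks the unique-root condition on outscopes:

# A sketch (ours) of an SDRS as in Definition 3: labels A, a mapping F, and
# last. Formula bodies are opaque strings; only the rhetorical relation
# literals matter for the structural check below.

SDRS_9 = {
    "A": ["pi0", "pi1", "pi2"],
    "F": {
        "pi1": "∃y(low-level(y) ∧ phonological(y) ∧ errors(y) ∧ go-unreported(e, y))",
        "pi2": "[G]∃x(continuous(x) ∧ below-awareness(x) ∧ process(x) ∧ sustain(e′, x, y))",
        "pi0": [("Explanation", "pi1", "pi2")],   # relation literals R(pi_a, pi_b)
    },
    "last": {"pi1", "pi2"},
}

def immediately_outscopes(sdrs, pi):
    """Labels occurring as arguments of relation literals in F(pi)."""
    content = sdrs["F"].get(pi, "")
    if isinstance(content, list):
        return {arg for (_rel, a, b) in content for arg in (a, b)}
    return set()

def root_labels(sdrs):
    """Labels not outscoped by any other label; an SDRS must have exactly one."""
    outscoped = {q for p in sdrs["A"] for q in immediately_outscopes(sdrs, p)}
    return [p for p in sdrs["A"] if p not in outscoped]

print(root_labels(SDRS_9))   # ['pi0'] -- the unique root, as Definition 3 requires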

We will shortly discuss the much more incomplete representation of meaning that is revealed by (9)'s form, and how commonsense reasoning uses that together with contextual information to construct the SDRS (9′). But first, we give details of the model theory of SDRSs, ensuring in particular that the dynamic semantics of (9′) is as intended.

Definition 4 Model
A model is a tuple 〈D, L, T, I〉 where:

• D consists of eventualities (DE) and individuals (DI).
• L ⊂ R4 is a spatiotemporal locality.
• T is a set of constrained mappings from L to L (i.e., they can expand, contract and rotate space, but not invert it).
• I is an interpretation function that maps non-logical constants from Lbase to denotations of appropriate type (e.g., I(~p) ⊆ L).

Note that I does not assign denotations to rhetorical relations; we'll return to them shortly. But the semantics of all SDRS-formulae φ relative to a model M will specify a context-change potential that characterises exactly when φ relates an input context to an output one. A context is a pair of partial variable assignment functions 〈f, g〉 (see Section 3.2 for motivation); these define values for individual variables (f(x) ∈ DI), eventuality variables (f(e) ∈ DE) and spatial mappings (f(v) ∈ T).

As is usual in dynamic semantics, all atomic formulae and ¬φ are tests on the input context. The existential quantifier ∃x extends the input functions 〈f, g〉 to be defined for x, and dynamic conjunction is composition. Hence ∃xφ is equivalent to ∃x ∧ φ. The operator [G] for gesture ensures that all formulae in its scope act as tests or updates only on the function g in the input context 〈f, g〉, but leave f unchanged. This means that the denotations for each occurrence of x in ([G]∃xP(x)) ∧ ([G]Q(x)) are identical, but they do not co-refer in ([G]∃xP(x)) ∧ Q(x). This matches intuitions about co-reference in discourse (across gestures vs. from gesture to subsequent speech respectively) that we discussed in Section 3.2.


Definition 5 Semantics of SDRS-formulae without rhetorical relations

1. Where i is a constant term, 〈f, g〉[[i]]M = I(i).
2. Where i is a variable, 〈f, g〉[[i]]M = f(i).
3. Where v(~p) is a spatial term, 〈f, g〉[[v(~p)]]M = f(v)(I(~p)).
4. For a formula Pn(i1, . . . , in), 〈f, g〉[[Pn(i1, . . . , in)]]M 〈f′, g′〉 iff 〈f, g〉 = 〈f′, g′〉 and 〈〈f, g〉[[i1]]M, . . . , 〈f, g〉[[in]]M〉 ∈ I(Pn).
5. 〈f, g〉[[∃x]]M 〈f′, g′〉 iff:
(a) dom(f′) = dom(f) ∪ {x}, and ∀y ∈ dom(f), f′(y) = f(y) (i.e., f ⊆x f′);
(b) dom(g′) = dom(g) ∪ {x}, and ∀y ∈ dom(g), g′(y) = g(y) (i.e., g ⊆x g′);
(c) f′(x) = g′(x).
6. 〈f, g〉[[φ ∧ ψ]]M 〈f′, g′〉 iff 〈f, g〉([[φ]]M ◦ [[ψ]]M)〈f′, g′〉.
7. 〈f, g〉[[¬φ]]M 〈f′, g′〉 iff 〈f, g〉 = 〈f′, g′〉 and for no 〈f′′, g′′〉, 〈f, g〉[[φ]]M 〈f′′, g′′〉.
8. 〈f, g〉[[[G](φ)]]M 〈f′, g′〉 iff f = f′ and ∃g′′ such that 〈g, g〉[[φ]]M 〈g′′, g′〉.
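The interaction of these clauses can be checked mechanically. Here is an executable sketch (ours; the two-element domain, the predicates P and Q, and the encoding of formulae as Python functions are all invented for illustration) of clauses 4–8, reproducing the co-reference pattern just described for ([G]∃xP(x)) ∧ ([G]Q(x)) versus ([G]∃xP(x)) ∧ Q(x):

# A sketch (ours, over a toy model) of clauses 4-8 of Definition 5. A context
# is a pair (f, g) of partial assignments (dicts); an update maps a context
# to a list of output contexts, with [] representing failure of a test.

DOMAIN = ["w1", "h1"]

def exists(x):
    """Clause 5: extend both f and g at x, with f'(x) = g'(x)."""
    def update(ctx):
        f, g = ctx
        return [({**f, x: d}, {**g, x: d}) for d in DOMAIN]
    return update

def conj(phi, psi):
    """Clause 6: dynamic conjunction is composition of updates."""
    return lambda ctx: [out for mid in phi(ctx) for out in psi(mid)]

def neg(phi):
    """Clause 7: a test that succeeds iff phi has no output."""
    return lambda ctx: [] if phi(ctx) else [ctx]

def gesture(phi):
    """Clause 8: run phi from <g, g>; keep f fixed and keep only the g-updates."""
    def update(ctx):
        f, g = ctx
        return [(f, g2) for (_g1, g2) in phi((g, g))]
    return update

def pred(x, extension):
    """Clause 4 (with clause 2): an atomic test on the value that f gives x."""
    return lambda ctx: [ctx] if ctx[0].get(x) in extension else []

# ([G]∃x P(x)) ∧ Q(x): x is defined only on g, so the final test on f fails.
P, Q = pred("x", {"w1"}), pred("x", {"w1"})
print(conj(gesture(conj(exists("x"), P)), Q)(({}, {})))
# []  -- no co-reference from gesture into subsequent speech
print(conj(gesture(conj(exists("x"), P)), gesture(Q))(({}, {})))
# [({}, {'x': 'w1'})]  -- co-reference across gestures succeeds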

Finally, we address the semantics of rhetorical relations. Unlike the predicate symbols in P, these do not impose tests on the input context. As speech acts, they change the context just like actions generally do. We emphasise veridical relations:

Definition 6 Semantic Schema for Rhetorical Relations
Let R be a veridical rhetorical relation (i.e., Narration, Background, Elaboration, Explanation, Contrast, Parallel, Depiction, Overlay, Replication). Then:

〈f, g〉[[R(π1, π2)]]M 〈f′, g′〉 iff 〈f, g〉[[Kπ1 ∧ Kπ2 ∧ ϕR(π1,π2)]]M 〈f′, g′〉

In words, R(π1, π2) transforms an input context 〈f, g〉 into an output one 〈f′, g′〉 if and only if the contents Kπ1 followed by Kπ2 followed by some particular illocutionary effects ϕR(π1,π2) also do this. Meaning postulates then impose constraints on the illocutionary effects ϕR(π1,π2) for various relations R. For instance, the meaning postulate for ϕNarration(π1,π2) stipulates that individuals are in the same spatio-temporal location at the end of the first described event eπ1 as they are at the start of the second described event eπ2, and so eπ1 temporally precedes eπ2 (we assume prestate and poststate are functions that map an eventuality to the spatiotemporal regions in L of its prestate and poststate respectively):8

• Meaning Postulate for Narration
ϕNarration(π1,π2) → overlap(poststate(eπ1), prestate(eπ2))

So, for instance, representing (16) with π0 : Narration(π1, π2), where π1 and π2 label the contents of the clauses, ensures its dynamic interpretation matches intuitions: John went out the door, and then from the other side of the door he turned right.

(16) π1. John went out the door.
π2. He turned right.

8 In fact, this axiom is stated here in simplified form; for details, see Asher and Lascarides (2003).


It's important to stress, however, that these interpretations of rhetorical relations are defined only with respect to complete interpretations: Kπ1 and Kπ2 must be SDRS-formulae, with all underspecified aspects of content that are revealed by form fully resolved. Accordingly, the type of speech act that is performed is a property of a contextually resolved interpretation of an utterance (or a gesture), rather than a property of its form. This contrasts with the fact that in linguistic discourse it is possible to align certain linguistic forms with certain types of speech acts: e.g., indicatives tend to be assertions while interrogatives tend to be questions. Such alignments are not possible with gesture, and our theory reflects this: the form of a gesture on its own is insufficient for inferring anything about its illocutionary effects; it is only when it is combined with context that clues about the speech act are revealed. In short, the syntax and model theory of Lsdrs are designed only for representing the pragmatically preferred interpretations. And when there is more than one pragmatically plausible interpretation, there is more than one logical form expressed in Lsdrs. We will examine shortly how these pragmatic interpretations are inferred from form and context.

Rhetorical relations that are already a part of SDRT—like Explanation—can now relate contents of gestures. We also argued earlier for three relations whose arguments are restricted to gesture: Depiction, Overlay and Replication. These are all veridical relations, and the meaning postulates that define their illocutionary effects match the informal definitions given earlier. For instance, Depiction(π1, π2) holds only if π1 labels the content of a spoken unit, π2 labels the content of a gesture, and Kπ1 and Kπ2 are nonmonotonically equivalent; we omit formal details because this requires a modal model theory for Lsdrs, as described in Asher and Lascarides (2003). We assume that a discourse unit labels speech only if all the minimal units outscoped by it label speech; similarly for gesture. Overlay(π1, π2) holds only if π1 and π2 are gestures, and Kπ2 continues to develop the same virtual space as Kπ1: in other words, Kπ1 and Kπ2 entail contingent formulae containing the same mapping v. Finally, ϕReplication(π1,π2) holds only if π1 and π2 are gestures, and they depict common entities in the same way. More formally, there is a partial isomorphic mapping µ from the constructors in Kπ1 to those in Kπ2 such that for all constructors c from Kπ1, c and µ(c) are semantically similar. We forgo a formal definition of semantic similarity here.

Definition 7 now formalises the interpretation of an SDRS:

Definition 7 The Dynamic Interpretation of an SDRS
Let S = 〈A, F, last〉 be an SDRS, and let π0 ∈ A be its unique root. Then:

〈f, g〉[[S]]M 〈f′, g′〉 iff 〈f, g〉[[F(π0)]]M 〈f′, g′〉

If the SDRS features only veridical rhetorical relations, then it transforms an input context into an output one only if the contents of each clause and gesture also do this.

As discussed in Section 3.3, we minimise the changes to SDRT's original constraints on the parts of a discourse context to which new material can be rhetorically connected—the so-called notion of availability. In other words, the available labels in embodied discourse are those on the right frontier of at least one last label:

Definition 8 Availability for Multimodal Discourse
Let S = 〈A, F, last〉 be an SDRS for multimodal discourse (and so by Definition 3, last is a non-empty set of at most two labels). Where π, π′ ∈ A, we say that π > π′ iff either π immediately outscopes π′ or there is a label π′′ ∈ A such that F(π′′) contains the literal R(π, π′) for some subordinating rhetorical relation R (e.g., Elaboration or Explanation, but not Narration). Let >∗ be the transitive closure of >. Then π ∈ A is available in S iff π ≥∗ l (that is, π >∗ l or π = l), where l ∈ last.
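Definition 8 is directly computable. The sketch below (ours; the toy discourse structure, the label names and the particular set of subordinating relations are all illustrative) computes the available labels as the union of the right frontiers of the last labels:

# Sketch (ours) of Definition 8: availability as the union of the right
# frontiers of the (at most two) last labels.

SUBORDINATING = {"Elaboration", "Explanation", "Background", "Overlay"}

def dominates(F):
    """The relation > of Definition 8, as a set of (parent, child) pairs:
    immediate outscoping, plus R(pi, pi') for subordinating R."""
    edges = set()
    for pi, content in F.items():
        for (rel, a, b) in content:
            edges |= {(pi, a), (pi, b)}        # pi immediately outscopes a and b
            if rel in SUBORDINATING:
                edges.add((a, b))              # a > b via a subordinating relation
    return edges

def available(A, F, last):
    """Labels pi with pi >* l (or pi = l) for some l in last."""
    edges = dominates(F)
    avail = set()
    for l in last:
        frontier, changed = {l}, True
        while changed:                         # close upwards under >
            bigger = frontier | {p for (p, c) in edges if c in frontier}
            changed, frontier = bigger != frontier, bigger
        avail |= frontier
    return avail

# Toy structure in the spirit of (10ab)+(1): spoken units p1, p2 linked by
# Background, gesture units g1, g2 linked by Overlay, plus a Depiction link.
F = {"p0": [("Background", "p1", "p2"), ("Overlay", "g1", "g2"),
            ("Depiction", "p2", "g2")]}
print(sorted(available({"p0", "p1", "p2", "g1", "g2"}, F, {"p2", "g2"})))
# ['g1', 'g2', 'p0', 'p1', 'p2'] -- all labels sit on one of the two frontiers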

In keeping with our strategy for minimising changes that are engendered by gesture, the constraints on anaphora match exactly SDRT's constraints for purely linguistic discourse:

Definition 9 Antecedents to Anaphora
Suppose that Kβ contains an anaphoric condition ϕ. Then the available antecedents to ϕ are terms that are:

1. in Kβ and DRS-accessible to ϕ (as defined in Kamp and Reyle (1993)); or
2. in Kα, DRS-accessible to any sub-formula in Kα, where there is a formula R(α, γ) in the SDRS such that γ ≥ β.

In other words, an antecedent must be in the same discourse unit as the anaphor, or accessible in a distinct unit that is rhetorically connected to a unit that contains the anaphor.
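Clause 2 of Definition 9 also admits a direct computational reading. The sketch below (ours; DRS-internal accessibility is treated as already resolved, and the toy structure abbreviates part of the logical form (5′) given below) collects the units whose referents can antecede an anaphoric condition:

# A sketch (ours) of clause 2 of Definition 9: referents of a unit K_alpha are
# available to an anaphor in unit beta if some R(alpha, gamma) holds in the
# SDRS with gamma outscoping beta. Clause 1 (DRS-accessibility inside K_beta)
# is assumed to be handled separately.

def outscopes_star(F, top, target):
    """Does `top` outscope `target` (reflexively) via the literals in F?"""
    if top == target:
        return True
    kids = {x for (_r, a, b) in F.get(top, []) for x in (a, b)}
    return any(outscopes_star(F, k, target) for k in kids)

def antecedent_units(F, beta):
    """Units alpha with some R(alpha, gamma) in the SDRS where gamma >= beta."""
    return {a for content in F.values() for (_r, a, g) in content
            if outscopes_star(F, g, beta)}

# Abbreviating part of (5'): Elaboration(pi1, pi) and Replication(pi2, pi3).
F = {"pi0": [("Elaboration", "pi1", "pi")],
     "pi":  [("Replication", "pi2", "pi3")]}
print(sorted(antecedent_units(F, "pi3")))
# ['pi1', 'pi2'] -- e.g. the hatchet introduced in pi1 can antecede the
# bridging references in the gesture pi3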

To illustrate this formalism, we give the precise semantics for some key discourses that guided its development. We start with the discourse (5); its SDRS is shown in (5′)—with some simplification, since we have ignored tense and presuppositions:

(5) a. π1: and took his hatchet
π2: Speaker's right hand grasps left hand, with wrists bent.
π3: Speaker lifts poised hands above right shoulder.

b. π4: and with a mighty sweep
π5: Hands, held above right shoulder, move back then forward slightly.

c. π6: sliced the wolf's stomach open
π7: Arms swing in time with sliced; then are held horizontally at the left.

(5′) 〈{π0, π, π1, π2, π3, π4, π5, π6, π7}, F, {π6, π7}〉, where F is as follows:
π1: ∃hw[took(e1, w, h) ∧ hatchet(h)]
π2: [G](∃lra[left hand(l, w) ∧ right hand(r, w) ∧ handle(h, a) ∧ grab(e2, w, a) ∧ instrument(e2, l) ∧ instrument(e2, r)])
π3: [G](∃d[right shoulder(d, w) ∧ lift(e3, w, h) ∧ goal location(e3, d) ∧ instrument(e3, l) ∧ instrument(e3, r)])
π4: ∃s[sweep(s) ∧ mighty(s) ∧ with(e4, s)]
π5: [G](coiled backswing(e5, w) ∧ instrument(e5, l) ∧ instrument(e5, r))
π6: ∃ft[slice-open(e4, w, t) ∧ stomach(t, f) ∧ wolf(f)]
π7: [G](slice-open(e4, w, t))
π0: Elaboration(π1, π) ∧ Narration(π1, π4) ∧ Explanation(π4, π5) ∧ Replication(π, π5) ∧ Narration(π, π5) ∧ Background(π4, π6) ∧ Depiction(π6, π7) ∧ Replication(π5, π7) ∧ Narration(π5, π7)
π: Replication(π2, π3)

This logical form is built from the underspecified content that's revealed by linguistic and gestural form via complex pragmatic inference. In fact, resolving the underspecified content and identifying the rhetorical connections are logically co-dependent. We will discuss such inference in Section 5. Here, we simply explain (5′)'s dynamic semantic interpretation. First, observe the referential connection between the initial grasping gesture and the clause in (5a): the woodsman w and the hatchet h are bound by quantifiers in the content F(π1) of the linguistic component of (5a). But the dynamic semantics of Elaboration(π1, π) (and the value of F(π), which outscopes the content F(π2) of the gesture) then ensures that the functions f and g in the input context 〈f, g〉 for interpreting the gesture F(π2) assign the same values to w and h as are used to satisfy the body of the formula F(π1)—the speech and gesture are about the same woodsman and hatchet. Furthermore, by Definition 9, w and h are available antecedents for the bridging references to the woodsman's hands l and r and the handle a of the hatchet that form part of the content F(π2) of the gesture. Similarly, the continued references to these individuals throughout the rest of the gestures are licensed by the sequence of Replication relations connecting π2 to π3 and then to π5 and finally to π7 (and these rhetorical connections are licensed by Definition 8). These connections also entail that all the gestures depict the same mimicry—here, the woodsman's embodied actions in attacking the wolf with the hatchet. The fragments of speech also rhetorically connect together: the first clause π1 describes a first event; the next adjunct π4 continues the narrative, indicating that the sweep immediately follows the taking described in π1; the Background relation between π4 and π6 entails that the sweep is part of the action that accomplishes the slicing. Finally, we have an additional layer of rhetorical connections that describe the interaction of gesture and speech. We assume that the two gestures in π2 and π3 show how the woodsman takes his hatchet: by grabbing the handle with his hands and hoisting it over his right shoulder. Then, we assume that the coiled backswing demonstrated in gesture π5 shows how the woodsman is able to deliver such a mighty swing—so this gesture serves as an Explanation of the synchronous speech. The final "slicing" gesture of π7 is a direct Depiction of the event described in the utterance segment π6 that accompanies it, and the Narration connection to the gesture π5 entails that the slicing happens after the coiled backswing. Again, Definition 8 makes this rhetorical structure possible.

Now consider an example with identifying spatial reference:

(12) [These things] push up the pins.
The speaker points closely at the frontmost wedge of the line of jagged wedges that runs along the top of a key as it enters the cylinder of a lock.

(12′) π0 : ∃sp(things(s) ∧ pins(p) ∧ push up(e, s, p)) ∧ [G]∃w(exemplifies(w, s) ∧ loc(e, w, vI(~pw)))

The construction rules for multimodal utterances introduce an underspecified anaphoric condition for these things, and an underspecified condition id_rel for relating the denotation of these things to the referent of the synchronous identifying gesture. Resolving id_rel to exemplifies, identifying the denotation of these things with the set of wedges s on the top surface of the key, identifying the demonstrated object w as the frontmost wedge, and identifying the gesture as directly demonstrating a copresent object in real space (so that the spatial mapping is vI) are all logically co-dependent tasks. The individual w that's referenced in the identifying gesture but not in synchronous speech is outscoped by [G]; this predicts the anomaly of continuing (12) with, e.g., ??It has the right height, via Definition 5.

The logical form (14′) of example (14) formalises the metaphorical use of spatial reference. We divide the speech into two segments: π1 labels "we have one ball"; and π2 elaborates the speaker's act in providing this information—it's something Susan has already said.

(14) We have this one ball, as you said, Susan.
The speaker sits leaning forward, with the right elbow resting on his knee and the right hand held straight ahead, in a loose ASL-L gesture (thumb and index finger extended, other fingers curled) pointing at his addressee.


(14′) π1 : ∃wb(we(w) ∧ have(e, w, b) ∧ one(b) ∧ ball(b))
π2 : ∃us(susan(s) ∧ said(e′, u, s))
π3 : [G]classify(e′′, u, vm(~pi))
π : Depiction(π2, π3)
π0 : Elaboration∗(π1, π)

The gesture π3 offers a metaphorical depiction of "as you said Susan": it classifies the speaker's utterance u as associated with the virtual space of Susan's contributions. In fact, given the illocutionary effects of Elaboration∗(π1, π) (formal details of which we omit here), it is satisfied only if the content u of what Susan said entails Kπ1.

4 Underspecified Meaning for Gesture

The logical forms presented in Section 3 capture specific interpretations in context. These result from inference that reconciles the abstract meanings that are revealed by linguistic and gestural forms with overarching constraints on coherent communication and commonsense background knowledge. In this section, we formalise the abstract level of gesture meaning that's revealed by its form.

The formalisation follows Kendon (2004), Kopp, Tepper and Cassell (2004) and McNeill (2005) in locating iconicity and deixis within individual aspects of the form of a gesture. For example, Kendon (2004) finds interpretive generalisations across related gestures with similar handshapes or hand orientations. Kopp et al. (2004) describe image description features that offer an abstract representation of the iconic significance of a wider range of form features. Section 4.1 reviews how the descriptive literature analyses gestures as complexes of form features, and gives a formal realisation.

Section 4.2, meanwhile, formalises the significance of these form features as constraints on interpretation. We emphasise that principles of gesture meaning such as iconicity cannot be formalised transparently at the level of truth-conditional content. Consider, for example, the interpretive effect of holding the right hand in a fist while performing a gesture. The fist itself might depict a roughly spherical object located in a virtual space. For example, McNeill (1992, Ex. 8.3, p. 224) offers a case where a speaker narrating the events of a cartoon uses a fist to depict a bowling ball. Alternatively, the fist might mirror the described pose of a character's hand as a fist. Threatening a punch is such a case, as when Jackie Gleason, playing the role of Ralph on The Honeymooners, holds up a fist to his wife and announces "You're going to the moon!". Finally, the fist can depict a grip on a (perhaps abstract) object, as in the woodsman's grip on the hatchet in (5) or our metaphorical understanding of low-level processes as carrying speech errors with them in (9). Logically, the different cases involve qualitatively different relationships, with different numbers and sorts of participants; so the iconicity shared by all these examples must reside in an abstract description of iconic meaning, rather than any specific iconic content shared by all the cases.

Finally, in Section 4.3, we formalise the additional constraints on interpretation that emerge when gesture and speech are used together in synchrony. Our formalism represents these constraints through abstract, underspecified relationships that connect content across modalities.

The semantic constraints contributed by iconicity, deixis, and synchrony describe logical form and provide input to the processes of establishing discourse coherence described in Section 5. The semantic constraints identify a range of alternative possible specific interpretations. Recognising why the communicative action is coherent and identifying which of the possible specific interpretations are pragmatically preferred are then logically co-dependent tasks.

4.1 The Form of Gesture

By the form of gesture, we mean the cognitive organisation that underlies interlocutors' generative abilities to produce and recognise gestures in an unbounded array. This definition shows a clear parallel to natural language grammar, and we build on that parallel throughout. But the differences between gesture form and natural language syntax are also important. In particular, gesture seems to lack the arbitrary correspondence between the order of actions and their interpretation as an ensemble, as mediated by a hierarchical formal structure, which is characteristic of natural language syntax (McNeill, 1992).

Instead, gesture form is at heart multidimensional. A gesture involves various features of performance—the hand shape, the orientations of the palm and finger, the position of the hands relative to the speaker's torso, the paths of the hands and the direction of movement. These form features are interpreted jointly: not through arbitrary 'syntactic' conventions, but through creative reasoning about the principles of iconicity, deixis, and coherence. Following Kopp, Tepper and Cassell (2004), we represent this multidimensionality by describing the form of each gesture stroke with a feature structure. The feature structure contains a list of attribute–value pairs characterising the physical makeup of the performance of the gesture. For example, we represent the form of the right-hand gesture identifying Norris Hall in (1) with the feature structure (17).

(17) identifying-gesture
right-hand-shape : loose-asl-5-thumb-open
right-finger-direction : forward
right-palm-direction : up-left
right-location : ~c

Here ~c is the spatio-temporal coordinate in R4 of the tip of the right index finger (i.e., up and to the right of the speaker's shoulder) that, together with the values of the other attributes, serves to identify the region ~pn in space that is designated by the gesture. Unlike Kopp, Tepper and Cassell (2004), our representations are typed: e.g., (17) is typed identifying-gesture. We particularly use this to distinguish between form features that are interpreted in terms of spatial reference, like the feature right-location in (17), and those that are interpreted via iconicity, perhaps like the feature right-hand-shape in (17). Kendon (2004, Ch. 11) observes that hand shape in pointing gestures often serves as an indication of the speaker's communicative goal in demonstrating an object—distinguishing, presenting, orienting, directing attention—through a broadly metaphorical kind of meaning-making.
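Such typed feature structures have an immediate computational rendering. The sketch below (ours; the hard division of attributes into spatial and iconic is an illustrative simplification of the discussion above) encodes (17):

# Sketch (ours) of gesture form as a typed feature structure, as in (17).
from dataclasses import dataclass, field

SPATIAL_ATTRS = {"right-location"}    # interpreted by spatial reference;
                                      # the rest are interpreted via iconicity

@dataclass
class GestureForm:
    type: str                         # e.g. "identifying-gesture"
    features: dict = field(default_factory=dict)

    def spatial_features(self):
        return {a: v for a, v in self.features.items() if a in SPATIAL_ATTRS}

    def iconic_features(self):
        return {a: v for a, v in self.features.items() if a not in SPATIAL_ATTRS}

norris_hall = GestureForm("identifying-gesture", {
    "right-hand-shape": "loose-asl-5-thumb-open",
    "right-finger-direction": "forward",
    "right-palm-direction": "up-left",
    "right-location": "c",            # stands in for the coordinate ~c
})
print(norris_hall.spatial_features())   # {'right-location': 'c'}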

The organisation of gesture is recognised or constructed by our perceptual system. So parsing a non-verbal signal into a contextually appropriate description of its underlying form is a highly complex task where many ambiguities must be resolved—or left open—just as in parsing language. We can see this by recapitulating the ambiguity in form associated with the iterative circling gesture of (9).9 One account of its form is (18):

9 Note that some of the values are expressed as sets (e.g., the movement direction). This allows us to capture generalisations over clockwise movements on the one hand (iterative or not), and iterative movements on the other (where iterative can represent a finite repetition of movement, as in this case). More generally, if features change during a stroke, we can specify feature values as sequences as well.

(18) qualitative-characterising-gesture
right-hand-shape : asl-a
right-finger-direction : down
right-palm-direction : left
right-trajectory : sagittal-circle
right-movement-direction : {iterative, clockwise}
right-location : central-right

This treats the hand movement in (9) as one stroke, and it abstracts away from the number and exact spatial trajectory followed in each circling of the hand. We might also analyse this gesture as a composition of several identical strokes. Furthermore, the repetition of the movement is captured in (18) via the value iterative; another licensed representation lacks this value but instead makes the trajectory be exactly two sagittal circles. Finally, this gesture has a licensed representation whose values are spatiotemporal coordinates as in (17) rather than qualitative as in (18) (and accordingly it will have a distinct root type); this form, in contrast to (18), yields spatial constants in semantics. Our theory tolerates and indeed welcomes such ambiguities.

Similarly, we regard synchrony as an underlying perceptual judgement about the relationship between gesture and speech. Because we regard form as an aspect of perceptual organisation, we do not need to assume that perceived synchrony between speech and gesture necessarily involves strict temporal constraints. In fact, there is no clear answer about the conditions required for a gesture to be perceived as synchronous with a linguistic phrase (Oviatt, DeAngeli and Kuhn, 1997, Sowa and Wachsmuth, 2000, Quek et al., 2002). Interlocutors' judgements are influenced by the relative time of performance of the gesture to speech, the type of syntactic constituent of the linguistic phrase (and possibly the type of gesture), prosody, and perhaps other factors. We remain neutral about these details.

4.2 The Meaning of Gesture

We formalise gesture meaning using the technique of underspecification from computational semantics (Egg, Koller and Niehren, 2001). With underspecification, knowledge of meaning determines a partial description of the LF of an utterance. This partial description is expressed in a distinct language Lulf from the language Lsdrs of LFs. Each model M for Lulf corresponds to a unique formula in Lsdrs, and M satisfies φ ∈ Lulf if and only if φ (partially) describes the unique formula corresponding to M. Semantic underspecification languages are typically able to express partial information about semantic scope, anaphora, ellipsis and lexical sense; e.g., Koller, Mehlhorn and Niehren (2000). For example, pronoun meaning will stipulate that the LF must include an equality between the discourse referent that interprets the pronoun and some other discourse referent that's present in the LF of the context, but it will not stipulate which contextual discourse referent this is. Thus the partial description is compatible with several LFs: one for each available discourse referent in the LF of the context. Such partial descriptions are useful for ensuring that the grammar neither under-determines nor over-determines content that's revealed by form.


We illustrate the technical resources of underspecification with a sentence (19a) whose syntax underdetermines semantic scope. In a typical underspecified representation, so-called Minimal Recursion Semantics (MRS, Copestake et al. (1999)), the description given in (19b) underspecifies semantic scope: (i) each predication is labelled (l1, etc.); (ii) scopal arguments are holes (h1, h2, etc.); (iii) there are scope constraints (h ≥ l means that h outscopes l's predication); and (iv) the constraints admit two ways of equating holes with labels.

(19) a. Every black cat loved some dog.

b. l1 : every_q(x, h1, h2)
l2 : black_a_1(e1, x)
l2 : cat_n_1(x)
l3 : loved_v_1(e2, x, y)
l4 : some_q(y, h3, h4)
l5 : dog_n_1(y)
h1 ≥ l2, h3 ≥ l5

c. l1 : a1 : every_q(x), RESTR(a1, h1), BODY(a1, h2)
l21 : a21 : black_a_1(e1), ARG1(a21, x1)
l22 : a22 : cat_n_1(x2)
l3 : a3 : loved_v_1(e2), ARG1(a3, x3), ARG2(a3, y1)
l4 : a4 : some_q(y), RESTR(a4, h3), BODY(a4, h4)
l5 : a5 : dog_n_1(y2)
h1 ≥ l2, h3 ≥ l5
x = x1, x = x2, x = x3, x1 = x2, x2 = x3, y = y1, y = y2, y1 = y2, l21 = l22

Intersective modification is achieved by sharing labels across predications (e.g., l2 in (19b)), meaning that in any fully specific LF black_a_1(e1, x) and cat_n_1(x) are connected with logical conjunction. Observe also the naming convention for the predicate symbols, based on word lemmas, part-of-speech tags and sense numbers. Our approach to gesture also leverages this ability to regiment the links between form and meaning.
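The way such a partial description delimits a space of logical forms can be made concrete. The sketch below (ours; it hard-codes the hole structure of (19b) rather than parsing it) enumerates the pluggings of holes with labels that respect the scope constraints, recovering exactly the two readings of (19a):

# A sketch (ours): enumerate the scope readings left open by (19b). ARGS
# records which labels take holes as arguments; a plugging assigns each hole
# a distinct label, and is admissible iff it yields a tree with a single
# root that satisfies the outscoping constraints h1 >= l2 and h3 >= l5.
from itertools import permutations

HOLES = ["h1", "h2", "h3", "h4"]
ARGS = {"l1": ["h1", "h2"], "l3": [], "l4": ["h3", "h4"]}
CONSTRAINTS = [("h1", "l2"), ("h3", "l5")]

def outscopes(plug, top, label):
    """Does `top` outscope `label` once every hole is plugged?"""
    if top == label:
        return True
    return any(outscopes(plug, plug[h], label) for h in ARGS.get(top, []))

readings = []
for labels in permutations(["l1", "l2", "l3", "l4", "l5"], 4):
    plug = dict(zip(HOLES, labels))
    roots = [l for l in ARGS if l not in plug.values()]
    if (len(roots) == 1
            and all(outscopes(plug, roots[0], l) for l in ["l2", "l3", "l5"])
            and all(outscopes(plug, plug[h], l) for (h, l) in CONSTRAINTS)):
        readings.append(plug)

for r in readings:
    print(r)
# {'h1': 'l2', 'h2': 'l3', 'h3': 'l5', 'h4': 'l1'}  -- wide-scope some
# {'h1': 'l2', 'h2': 'l4', 'h3': 'l5', 'h4': 'l3'}  -- wide-scope every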

An extension of these ideas is explored in Robust Minimal Recursion Semantics (RMRS, Copestake (2003))—RMRS can also underspecify the arity of the predicate symbols and what sorts of arguments they take. Since the iconic meaning of gesture constrains, but doesn't fully determine, all these aspects of interpretation, we adopt RMRS as the underlying semantic formalism Lulf. RMRS is fully compatible with the language Lsdrs of SDRT (Asher and Lascarides, 2003, p. 122)—indeed, SDRT's existing glue logic supports any description language, and so can construct from the RMRSs of discourse units an SDRS (or a set of them if ambiguities persist) that captures the pragmatic interpretation (we don't use the specific description language from Asher and Lascarides (2003) here because it doesn't underspecify arity).

The RMRS corresponding to (19b) is (19c): RMRSs offer a more factorised representation where the base predicates are unary and the other arguments are represented by separate binary relations on the unique anchor of the relevant predicate symbols (a1, a2, . . .), together with variable and label equalities (e.g., x = x1, l21 = l22). This factored representation allows one to build semantic components to shallow parsers, where lexical or syntactic information that contributes to meaning is absent. An extreme example would be a part-of-speech (POS) tagger: one can build its semantic component simply by deriving lexical predicate symbols from the word lemmas and their POS tags, as given in (20):


(20) a. Every_AT1 black_JJ cat_NN1 loved_VVD some_DD dog_NN1

b. l1 : a1 : every_q(x)
l21 : a21 : black_a(e1)
l22 : a22 : cat_n(x2)
l3 : a3 : loved_v(e2)
l4 : a4 : some_q(y)
l5 : a5 : dog_n(y2)

Semantic relations, sense numbers and the arity of the predicates are missing from (20b), because the POS tagger doesn't reveal information about syntactic constituency, word sense or lexical subcategorisation. But the RMRSs (19c) and (20b) are entirely compatible, the former being more specific than the latter. In particular, the model theory of RMRS restricts the possible denotations of lemma_tag_sense (which are all constructors in the fully specific language Lsdrs) to being a subset of those of lemma_tag.

To regiment the interpretation of gesture formally in RMRS, we assume that interpretive constraints apply at the level of form features. Each attribute–value element yields an RMRS predication, which must be resolved to a formula in the LF of gesture in context.10 If the element is interpreted by iconicity, we constrain the resolution to respect possibilities for depiction. If the element is interpreted by spatial reference, we interpret it as locating an underspecified individual via an underspecified correspondence to the physical point in space-time that is designated by that feature of the gesture. We treat attributes with set values (e.g., the movement-direction attribute in (18)) like intersective modification in language (e.g., black cat in (19b)). This captures the intuition that the different aspects of the direction of movement in (18) must all depict properties of the same thing in interpretation.

10 This suggests that the form of an iconic gesture is like a bag of words. Kopp, Tepper and Cassell (2004) liken it to a bag of morphemes, on the grounds that the resolved interpretations of the features can't be finitely enumerated. But word senses can't be enumerated either (Pustejovsky, 1995); hence our analogy with words is also legitimate.

So, more formally, each iconic attribute–value pair introduces an underspecified predication that corresponds directly to it; for instance, the hand shape in (18) introduces the predication (21) to the RMRS for this gesture:

(21) l1 : a1 : right_hand_shape_asl-a(i1)

Here, l1 is a unique label that underspecifies the scope of the predication; a1 is the unique anchor that provides the locus for specifying the predicate's arguments; i1 is a unique metavariable that underspecifies the sort of the main argument of the predication (it could be an individual object or an eventuality); and hand_shape_asl-a underspecifies reference to a property that the entity i1 has and that can be depicted through the gesture's fist shape.

We then represent the possible resolutions of the underspecified predicates via a hierarchy of increasingly specific properties, as in Figure 4. The hierarchy of Figure 4 captures the metaphorical contribution of the fist to the depiction of the process in (9), by allowing right_hand_shape_asl-a to depict a holding event, metaphorically interpreted as the event e of a process x sustaining errors y in speech production ("bearing them with it", as it were). Following Copestake and Briscoe (1995), this treats metaphorical interpretations as a specialisation of the underspecified predicate symbol that's produced by form, as opposed to coercion on a specific literal interpretation of a word that in turn contradicts information in the context; e.g., Hays and Bayer (2001). We have argued elsewhere for treating metaphor in linguistic discourse in this way (Asher and Lascarides, 1995), and choose to treat metaphor in gesture in the same way so as to maintain as uniform a pragmatics for speech and gesture as possible.

l : a : hand_shape_asl-a(i)
    l : a : right_hand_shape_asl-a(i)
        ⊤
        l : a : something_held(x)
            l : a : marker-point(x)
            . . .
        l : a : event_of_holding(e)
            l : a : literal_holding(e)
            . . .
            l : a : metaphorical_holding(e)
                l : a : carry*(e)
                    l : a : sustain(e)
                    . . .

Figure 4: Possible resolutions for hand_shape_asl-a (the original tree diagram is rendered here as an indentation hierarchy; '. . .' marks further resolutions).

At the same time, we can capture the contribution of a fist as depicting something held by resolving right_hand_shape_asl-a accordingly; e.g., if the gesture in (9) were to accompany the utterance "the mouse ran round the wheel", then the underspecified predicate symbol would resolve to a marker-point x indicating a designated location on the mouse's spinning wheel. Finally, all underspecified predications are resolvable to validity (⊤), since any form feature in a gesture may contribute no meaning in context (e.g., the clockwise motion in (9)). We assume that resolving all predications to ⊤ is pragmatically dispreferred compared with logically contingent resolutions (see Section 5). Underspecified predicates may also share certain specific resolutions: e.g., marker-point is also one way of resolving the underspecified predicate corresponding to a flat hand—hand_shape_asl-5.

Figure 4 reflects the fact that, like all dimensions of iconic gesture, the fist shape underspecifies how many entities it relates in its specific semantic interpretation. The predicates in Figure 4 vary in the number of arguments they take, and the factorised notation of RMRS lets us express this. For example, sustain is a 3-place relation and so l : a : sustain(e) entails l : a : sustain(e), ARG1(a, x), ARG2(a, y) for some x and y, while marker-point is a 1-place property, and therefore l : a : marker-point(x), ARG1(a, y) is unsatisfiable.
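
To make this concrete, here is a small Python sketch of how the hierarchy in Figure 4 and the arity facts might be encoded; the dictionary contents and names are illustrative simplifications of the figure, not an implementation from the literature.

    # A toy encoding of Figure 4: each predicate maps to its immediate
    # specialisations, and ARITY records how many arguments a resolved
    # predicate takes (so an arity check can rule out, e.g., giving
    # marker_point a second argument).
    HIERARCHY = {
        'hand_shape_asl-a': ['right_hand_shape_asl-a'],
        'right_hand_shape_asl-a': ['TOP', 'something_held', 'event_of_holding'],
        'something_held': ['marker_point'],
        'event_of_holding': ['literal_holding', 'metaphorical_holding'],
        'metaphorical_holding': ['carry*'],
        'carry*': ['sustain'],
    }
    ARITY = {'marker_point': 1, 'sustain': 3}

    def resolutions(pred):
        """Yield every specialisation of pred licensed by the hierarchy."""
        for child in HIERARCHY.get(pred, []):
            yield child
            yield from resolutions(child)

    assert 'sustain' in set(resolutions('right_hand_shape_asl-a'))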

Figure 4 represents a special kind of commonsense background knowledge; namely, general possibilities for iconic representation. Technically, the interpretation of hand shape asl-a is not defined at all with respect to the dynamic semantics given in Definition 5 because it is not a part of the language Lsdrs of fully specific logical forms; rather, the distinct and static model theory for RMRS ensures that hand shape asl-a denotes a constructor from Lsdrs or a combination thereof that is compatible with the hierarchy in Figure 4. More


informally, you can compare predications like hand shape asl-a to Kopp et al.’s (2004) image description features—an abstract representation that captures gesture meaning. While some of the leaves in this hierarchy correspond to fully specific interpretations, others represent vague ones. (Following Kopp, Tepper and Cassell (2004), we do not believe that the specific interpretations that are licensed by a (unique) underspecified semantic representation can be finitely enumerated.) We envisage that the speaker and hearer sometimes settle on a coherent but vague interpretation, and that at other times additional logical axioms resolve a vague interpretation to a more specific one in the particular discourse context.

Let’s now consider the compositional semantics of a spatially-locating component of gesture meaning. In words, (22) states that x denotes an individual which is spatially located at the coordinates v(~p), where ~p is the physical location actually designated by the gesture, and v is a mapping from this physical point to the space depicted in meaning (with the value of v being resolved through discourse interpretation).

(22) l2 : a2 : sp ref(e), ARG1(a2, x), ARG2(a2, v(~p))

The predicate sp ref in (22) can resolve to the constructor loc or to the constructor classify in Lsdrs, with its dynamic semantics defined in Section 3.1. The constant ~p in (22) is determined as a complex function of the grounding of gesture form in physical space. For instance, a pointing gesture (with hand shape 1-index) will make ~p denote a cone whose tip is the hand-location coordinate ~c, with the cone expanding out from ~c in the same direction as the value of finger-direction (Kranstedt et al., 2006).
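
The geometry involved is straightforward; below is a hedged Python sketch of a pointing-cone membership test in the spirit of Kranstedt et al. (2006), where the half-angle and the vector representation are our own illustrative simplifications.

    import numpy as np

    def in_pointing_cone(point, hand_loc, finger_dir, half_angle_deg=15.0):
        """Test whether `point` lies in the cone whose tip is `hand_loc` and
        which opens in direction `finger_dir`; the half-angle is illustrative."""
        v = np.asarray(point, dtype=float) - np.asarray(hand_loc, dtype=float)
        if not v.any():
            return True  # the tip itself counts as inside
        d = np.asarray(finger_dir, dtype=float)
        d = d / np.linalg.norm(d)
        cos_angle = v.dot(d) / np.linalg.norm(v)
        return cos_angle >= np.cos(np.radians(half_angle_deg))

    # A referent one metre out along the finger direction is demonstrated:
    assert in_pointing_cone([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])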

Finally, in Section 3.2 we motivated the introduction of an operator [G] for each stroke, which must outscope the content conveyed by the gesture’s form features. This was necessary for constraining co-reference. We translate each gesture overall using an instance of this operator, constrained to outscope each predication that’s contributed by its form features. Thus the (underspecified) content arising from the form of the visualising gesture in (18) is the RMRS (23):

(23) l0 : a0 : [G](h),
     l1 : a1 : right hand shape asl-a(i1),
     l2 : a2 : right finger dir down(i2),
     l3 : a3 : right palm dir left(i3),
     l4 : a4 : right traj sagittal circle(i4),
     l5 : a51 : right move dir iterative(i5),
     l5 : a52 : right move dir clockwise(i5),
     l6 : a6 : right loc central-right(i6),
     h ≥ lj, for 1 ≤ j ≤ 6

To handle identifying gestures, we add an overall layer of quantificational structure as in (24), so that we model the gesture as identifying an appropriate entity x.

(24) l0 : [G](h1),
     l1 : a1 : deictic q(x), RESTR(a1, h2), BODY(a1, h3),
     l2 : a2 : sp ref(e), ARG1(a2, x), ARG2(a2, v(~p)),
     h1 ≥ l2, h2 ≥ l2

More generally, anywhere the context-resolved interpretation of a gesture introduces an individual that is not co-referent with any individual in the synchronous speech (by the bridging


inference described in Section 3.2), then we must have a quantifier and inference relation to introduce this individual, outscoped by [G] so that availability (Definition 8) and semantic interpretation (Definition 5) constrain anaphoric dependencies correctly—i.e., the gestured individual cannot be an antecedent to a pronoun in subsequent speech. This scopal constraint can be expressed as part of discourse update—the logic that builds a specific LF of discourse from the underspecified logical forms of its units—although we forgo details here.

Mapping the syntactic representation of gestures such as (18) to their unique RMRS (23) is very simple. Computing the predications from each attribute-value pair in (18) involves exactly the same techniques as used by Copestake (2003) to build the semantic component of a POS tagger. Adding the scopal predication [G] and its ≥-conditions is triggered by the gestural-type of the feature structure: e.g., qualitative-characterising-gesture introduces [G] via scopal modification as defined in the semantic algebra for RMRS (Copestake, Lascarides and Flickinger, 2001).
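
Since the mapping is a deterministic lookup over attribute-value pairs, it can be sketched in a few lines of Python; the feature names and the tuple encoding below are illustrative, not drawn from an existing system.

    # Each attribute-value pair in the gesture's feature structure yields one
    # underspecified predication with a fresh label, anchor and metavariable.
    def gesture_to_rmrs(avm):
        predications = []
        for j, (attr, val) in enumerate(sorted(avm.items()), start=1):
            predications.append((f'l{j}', f'a{j}', f'{attr}_{val}', f'i{j}'))
        return predications

    avm_18 = {'right_hand_shape': 'asl-a', 'right_palm_dir': 'left'}
    print(gesture_to_rmrs(avm_18))
    # [('l1', 'a1', 'right_hand_shape_asl-a', 'i1'),
    #  ('l2', 'a2', 'right_palm_dir_left', 'i2')]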

Our representations of gesture meaning are analogous, both formally and substantively, to the underspecified meanings that computational semanticists already use to represent language. In particular, as we show in Section 5, we can therefore build reasoning mechanisms that combine information from language and gesture to derive integrated logical forms for multimodal discourse. Differences remain across modalities, however, in the kinds of semantic underspecification that are present in the RMRS representation of a phrase versus that of a gesture. A complete and disambiguated syntactic representation of a linguistic phrase fully specifies predicate argument structure. But gestures lack hierarchical syntactic structure, and individual form features, unlike words, don’t fix their subcategorisation frames. Consequently, a complete and disambiguated syntactic analysis of gesture underspecifies all aspects of content, including predicate argument structure.

4.3 Combining Speech and Gesture in the Grammar

Like Kopp, Tepper and Cassell (2004), we believe that we need a formal account of the integration of speech and gesture that directly describes the organisation of multimodal communicative actions into complex units that contribute to discourse. Kopp et al. argue for this on the grounds of generation; our motivation stems from issues in interpretation. First, our observations about the relative semantic scope of negation and the content depicted by the gesture in (6) suggest that scope-bearing elements introduced by language can outscope gestured content. It is very straightforward to derive such a semantics from a single derivation of the structure of a multimodal utterance: use a construction rule to combine the gesture with computery and a further construction rule to combine the result with not. Standard methods for composing semantic representations from syntax—e.g., Copestake, Lascarides and Flickinger (2001)—would then make the (scopal) argument of the negation outscope both the atomic formula computery(x) and the gesture modality [G], as required. Of course, other analyses may be licensed by the grammar: for instance, the gesture might combine with the phrase not computery. (We don’t address the resolution of such ambiguities of form here.)

Secondly, we assume, as is standard in dynamic semantic theories, that discourse update has access to semantic representations but no direct access to form. But synchrony is an aspect of form that conveys meaning, and consequently we need to give a formal description of this form–meaning relation as part of utterance structure. For instance, we suggested in Section 2 that the content of a characterising gesture must be related to its synchronous linguistic phrase with one of a subset of the full inventory of rhetorical connections (for instance, Disjunction is


excluded). This means that the synchrony that connects speech and gesture conveys semantic information that’s similar to that conveyed by a highly sense-ambiguous discourse connective or a free adjunct in language: they both introduce rhetorical connections between their syntactic complements, but do not fully specify the value of the relation. We must represent this semantic contribution that’s revealed by form.

For identifying gestures, meanwhile, synchrony identifies which verbally-introduced individual y is placed in correspondence with the individual x designated in gesture. Moreover, synchrony serves to constrain the relationship between x and y. Sometimes the relationship is equality, but not always—as in these things said while pointing to an exemplar (see (12)). We treat the semantic relationship as underspecified but not arbitrary (Nunberg, 1978).

While we don’t give details here, we assume a unification- or constraint-based representation of utterance structure, following Johnston (1998). Construction rules in this specification describe the form and meaning of complex communicative actions including both speech and gesture. We assume that the construction rule for combining a characterising gesture and a linguistic phrase contributes its own semantics. In other words, as well as the daughters’ RMRSs being a part of the RMRS of the mother node (as is always the case), the construction rule introduces the new predication (25) to the mother’s RMRS, where hs is the top-most label of the content of the ‘speech’ daughter, and hg the top-most label of the gesture:

(25) l : a : vis rel(hs), ARG1(a, hg)

The underspecified predicate vis rel must then be resolved via discourse update to a specific rhetorical relation, where we assume at least that Disjunction is not an option.
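
The semantic contribution of this construction rule can be pictured with a short Python sketch; the tuple encoding of predications and the function name are our own illustrative assumptions.

    # A construction rule combining a characterising gesture with a phrase:
    # the mother's RMRS contains both daughters' predications plus the new
    # underspecified vis_rel predication (25) relating their top labels.
    def combine_characterising(speech_rmrs, gesture_rmrs, hs, hg):
        mother = list(speech_rmrs) + list(gesture_rmrs)
        mother.append(('l', 'a', 'vis_rel', hs, ('ARG1', hg)))
        return mother

    # Discourse update must later resolve 'vis_rel' to a rhetorical
    # relation other than Disjunction.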

Similarly, we assume that the construction rule that combines an identifying gesture with an NP contributes its own semantics: the labelled predication (26), where l2 is the label of the spatial condition sp ref introduced by the RMRS of the gesture (see (24)) and y is the semantic index of the NP.

(26) l2 : a21 : id rel(x), ARG1(a21, y)

So, for instance, the RMRS for a multimodal constituent consisting of the NP these things combined with the pointing gesture in (12) is the following:

(27) l0 : [G](h1),
     l1 : a1 : deictic q(x), RESTR(a1, h2), BODY(a1, h3),
     l2 : a2 : sp ref(e), ARG1(a2, x), ARG2(a2, v(~p)),
     l2 : a21 : id rel(x), ARG1(a21, y),
     l3 : a3 : these q(y), RESTR(a3, h4), BODY(a3, h5),
     l4 : a4 : things n 1(y),
     h1 ≥ l2, h2 ≥ l2, h4 ≥ l4

Identifying gestures that combine with other kinds of linguistic syntactic categories, such as PPs and VPs, are also possible in principle, although we leave the details to future work.

5 Establishing Coherence through Default Inference

SDRT describes which possible ways of resolving an underspecified semantics are pragmatically preferred. This occurs as a byproduct of discourse update: the process by which one constructs the logical form of discourse from the (underspecified) compositional semantics


of its units. So far, SDRT’s discourse update has been used to model linguistic phenomena. Here, we indicate how it can resolve the underspecified meaning of both language and gesture.

Discourse update in SDRT involves nonmonotonic reasoning and is defined in a constraint-based way: where φ ∈ Lulf represents the (underspecified) old content of the context, which is to be updated with the (underspecified) new content ψ ∈ Lulf of the current utterance, the result of update will be a formula χ ∈ Lulf that is entailed by φ and ψ together with the consequences of a nonmonotonic logic—known as the glue logic—when φ and ψ are premises in it. This constraint-based approach allows an updated interpretation of discourse to exhibit ambiguities. The literature establishes that this is necessary for linguistic discourse; here, we observed via examples like (1), (7) and (9) that it is necessary for gesture as well.

The glue logic axioms specify which speech act λ : R(α, β) was performed, given the content and context of utterances. And being a nonmonotonic consequence of the glue logic, λ : R(α, β) becomes part of the updated LF. Formally, the glue logic axioms typically have the shape schematised below, where A > B means If A then normally B (note that, without loss of generality, we omit anchors when the arguments are specified; so for instance λ : R(α, β) is a notational variant of λ : a : R(α), ARG1(a, β)):

• Glue Logic Schema: (λ :?(α, β) ∧ some stuff) > λ : R(α, β)

In words, this axiom says: if β is to be connected to α with a rhetorical relation, and the result is to appear as part of the logical scope labelled λ, but we don’t know what the value of that relation is yet, and moreover “some stuff” holds of the content labelled by α and β, then normally the rhetorical relation is R. The “some stuff” is derived from the (underspecified) logical forms (expressed in Lulf) that α and β label (in our case, this language is that of RMRS), and the rules are justified on the basis of underlying linguistic knowledge, world knowledge, or knowledge of the cognitive states of the dialogue agents.

For example, the glue logic axiom Narration stipulates that one can normally infer Narration if the constituents that are to be rhetorically connected describe eventualities which are in an occasion relation. That is, there is a ‘natural event sequence’ such that events of the sort described by α lead to events of the sort described by β:

• Narration: (λ :?(α, β) ∧ occasion(α, β)) > λ : Narration(α, β).

Schank and Abelson’s (1977) scripts attempted to capture information about which eventualities occasion which others; in SDRT such scripts are default axioms. For example, we assume that the underspecified logical forms of the clauses in (16) verify the antecedents of a default axiom whose consequent is occasion(π1, π2), yielding π0 : Narration(π1, π2) via Narration:

(16) π1. John went out the door.

π2. He turned right.
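
To illustrate the mechanics, here is a toy Python sketch of Narration firing for (16); the clause contents, the script-style test for occasion, and the naive control regime are all illustrative stand-ins for SDRT’s glue logic, not an implementation of it.

    # Toy contents for the clauses of (16), plus a script-style occasion test.
    alpha = {'event': 'go_out_door', 'agent': 'john'}
    beta = {'event': 'turn_right', 'agent': 'john'}

    SCRIPTS = {('go_out_door', 'turn_right')}  # 'natural event sequences'

    def occasion(a, b):
        return (a['event'], b['event']) in SCRIPTS

    GLUE_AXIOMS = [(occasion, 'Narration')]

    def apply_glue(a, b, axioms):
        """Naive default application: collect every relation whose test holds;
        the relation stays underspecified ('?') if no axiom fires."""
        fired = {rel for test, rel in axioms if test(a, b)}
        return fired or {'?'}

    assert apply_glue(alpha, beta, GLUE_AXIOMS) == {'Narration'}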

Indeed, the glue logic axioms that do this should be neutral about sentence mood, so that they also predict that the imperatives in (28) are connected by Narration:

(28) Go out the door. Turn right.

In the model theory of SDRT, such a logical form for (28) entails that (a) both imperatives are commanded (because Narration is veridical), and (b) the command overall is to turn right


immediately after going out the door.11 This is exactly the discourse interpretation we desire for the multimodal act (4) from the NUMACK corpus:

(4) You walk out the doors
The gesture is one with a flat hand shape and vertical palm, with the fingers pointing right, and palm facing outward.

So our aim now is to ensure that discourse update in SDRT supports the following two co-dependent inferences in constructing the LF of (4): (a) the contents of the clause and gesture are related by Narration; and (b) the (underspecified) content of the gesture as revealed by its form resolves to turn right in this context. We’ll see how shortly.

Explanation is inferred on the basis of evidence in the discourse for a causal relation:

• Explanation: (λ :?(α, β) ∧ causeD(β, α)) > λ : Explanation(α, β)

Note that causeD(β, α) does not entail that β actually did cause α; the latter causal relation would be inferred if Explanation is inferred. The formula causeD(β, α) is inferred on the basis of monotonic axioms (monotonic because the evidence for causation is present in the discourse, or it’s not), where the antecedents of these axioms are expressed in terms of the (underspecified) content of α and β. We assume that there will be such a monotonic axiom for inferring causeD(π2, π1) for the discourse (29) (we omit details), which bears analogies to the embodied utterance (9).

(29) π1. There are low-level phonological errors which tend not to get reported.

π2. They are created via subconscious processes.

In SDRT the inferences can flow in one of several directions. If the premises of a glue logic axiom are satisfied by the underspecified semantics derived from the grammar, then a particular rhetorical relation follows, and its semantics yields inferences about how underspecified parts of the utterance and gesture contents are resolved. Alternatively, there are cases where the underspecified compositional semantics is insufficient for inferring any rhetorical relation. In this case, discourse update allows inference to flow in the opposite direction: one can resolve the underspecified content to a more specific interpretation that supports an inference to a rhetorical relation. If there is a choice of which way to resolve the underspecified content so as to infer a rhetorical relation from it, then one chooses an interpretation which maximises the quality and quantity of the rhetorical relations; see Asher and Lascarides (2003) for details. There may be more than one such interpretation, in which case ambiguity persists in the updated discourse representation.
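
The selection step at the end can be pictured with a toy Python sketch: candidate resolved interpretations are scored, and all maximal ones are kept, with ties modelling persistent ambiguity. The scoring function here is entirely illustrative; SDRT’s actual preference ordering over interpretations is richer.

    # Keep every candidate interpretation whose score is maximal; a tie means
    # the updated discourse representation remains ambiguous.
    def best_interpretations(candidates, score):
        top = max(score(c) for c in candidates)
        return [c for c in candidates if score(c) == top]

    candidates = [
        {'gesture': 'sustaining process', 'rels': {'Explanation', 'Contrast'}},
        {'gesture': 'low level of errors', 'rels': {'Depiction'}},
    ]
    # Toy score: the number of rhetorical relations the resolution supports.
    print(best_interpretations(candidates, lambda c: len(c['rels'])))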

Of course, this inferential flow from possible resolved interpretations of speech and gesture to rhetorical connections represents a competence model of discourse interpretation only. Any implementation of SDRT’s discourse update would have to restrict drastically the massive search space of fully resolved interpretations that are licensed by an underspecified logical form for discourse. We have just begun to explore such issues in related work (Schlangen and Lascarides, 2002).

Let’s illustrate the inference from underspecified content to complete LF with the example (9). As described in Section 4.2, the grammar yields (23) for the content of the gesture, an RMRS for the compositional semantics of the clause (which is omitted here for

11 We did not give the model theory for imperatives in order to stay within the confines of the extensional version of SDRT in this paper. See Asher and Lascarides (2003) for details.


reasons of space), and the construction rule that combines them contributes the predication l : vis rel(hs, l0), where hs outscopes all labels in the RMRS of the clause, and l0 labels the scopal modifier [G] in (23). Producing a fully specific LF from this therefore involves, among other things, resolving the underspecified predications in (23), and the underspecified predicate vis rel must resolve to a rhetorical relation that’s licensed by it—so not Disjunction.

Even though the RMRS (23) fails to satisfy the antecedent of any axiom for inferring a particular rhetorical relation, one can consider alternative ways of resolving it so as to support a rhetorical connection. One alternative is to resolve it to denote a continuous, subconscious process which sustains the phonological errors, as shown in (30), where y is the low-level phonological errors introduced in the clause:

(30) [G]∃x(continuous(x) ∧ below-awareness(x) ∧ process(x) ∧ sustain(e′, x, y))

This particular interpretation is licensed by the predications in the RMRS (23), via hierarchies such as the one shown in Figure 4 (using y in (30) is also licensed by Definitions 8 and 9). And similarly to discourse (29), this and the compositional semantics of the clause satisfy the antecedent of an axiom whose consequent is causeD(β, α). And so an Explanation relation is inferred via Explanation, resulting in the logical form (9′) shown earlier. As stated earlier, an alternative specific interpretation of the gesture (in fact, one that can stem from an alternative analysis of its form) entails that the gesture depicts the low level of the phonological errors. This specific interpretation would validate an inference in the glue logic that the gesture and speech are connected with Depiction (this would be on the general grounds in the glue logic that the necessary semantic consequences of a rhetorical connection are normally sufficient for inferring it). If both of these interpretations are equally coherent, then discourse update predicts that the multimodal utterance, while coherent, is ambiguous. If, on the other hand, the interpretation given in (9′) yields a more coherent interpretation (and we believe that it does because it supports additional Contrast relations with prior gestures that are in the context), then discourse update predicts this fully specific interpretation.

Finally, let’s examine deictic gesture: discourse update must support inferences which resolve the underspecified relation id rel between the denotations of an NP and its accompanying deictic gesture to a specific value. This is easily achieved via default axioms such as the following (we have omitted labels and anchors for simplicity):

• Co-reference: (id rel(x, y) ∧ loc(e, y, ~p) ∧ P(x) ∧ P(y)) > x = y

In words, Co-reference stipulates that if x and y are related by id rel, and moreover the individual y that is physically located at ~p shares a property with x, then normally x and y are co-referent. Other default axioms can be articulated for inferring exemplifies(x, y) rather than x = y in the interpretation of (12).
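
A toy Python rendering of this default makes the shared-property test explicit; the property sets and the fallback value are illustrative assumptions, and, as with the earlier sketches, the defaults-as-functions regime stands in for the real nonmonotonic logic.

    # Co-reference default: if the gestured individual x and the NP's
    # individual y stand in id_rel and share at least one property, default
    # to x = y; otherwise id_rel stays unresolved (or another axiom, e.g.
    # one inferring exemplifies(x, y), may apply).
    def resolve_id_rel(x_props, y_props):
        return 'x = y' if x_props & y_props else '?'

    assert resolve_id_rel({'thing', 'located_at_p'}, {'thing'}) == 'x = y'
    assert resolve_id_rel({'exemplar'}, {'kind'}) == '?'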

6 Conclusion

We have provided a formal semantic analysis of co-verbal iconic and deictic gesture which captures several observations from the descriptive literature. For instance, three features in our analysis encapsulate the observation that speech and gesture together form a ‘single thought’. First, the contents of language and gesture are represented jointly in the same logical language. Secondly, rhetorical relations connect the content of iconic gesture to that of its synchronous speech. And finally, language and gesture are interpreted jointly within an


integrated architecture for linking utterance form and meaning. Our theory also substantiates the observation that iconic gesture on its own doesn’t receive a coherent interpretation. Its form produces a very underspecified semantic representation; this must be resolved by reasoning about how it is coherently related to its context. Finally, we exploited discourse structure and dynamic semantics to account for co-reference across speech and gesture and across sequences of gestures.

One major advantage of our approach is that all aspects of our framework are already established for modelling purely linguistic discourse, and consequently we demonstrate that existing mechanisms for representing language can be exploited to model gesture as well. We showed that they suffice for a wide variety of spontaneous and naturally occurring co-verbal gestures, ranging from simple deictic ones to much more complex iconic ones with metaphorical interpretations. Furthermore, our model is sufficiently formal, but also sufficiently flexible, that one can articulate specific hypotheses that can then guide empirical investigations that deepen our understanding of the phenomena.

Ultimately, we hope that the empirical and theoretical enquiries that this work enables will support a broader perspective on the form, meaning and use of nonverbal communication. This will require describing the organisation of gesture in relationship to speech, where theoretical and empirical work must interact to characterise ambiguities of form and describe how they are resolved in context. It also requires further research into other kinds of communicative action. For example, we believe that formalisms for modelling intonation and focus—e.g., Steedman (2000)—offer a useful starting point for an analysis of beat gestures. We have also ignored body posture and facial expressions. But as Krahmer and Swerts (in press) demonstrate, they can interact in complex ways not only with speech but also with hand gestures. Finally, we have focussed here almost entirely on contributions from single speakers. But in conversation social aspects of meaning are important: we need to explore how gestures affect and are affected by grounding and disputes, for instance. This is another place where empirical research such as (Emmorey, Tversky and Taylor, 2000) and formal methods such as (Asher and Lascarides, 2008) have been pursued independently and can benefit from being brought into rapport.

Authors’ Addresses

Alex Lascarides,
School of Informatics,
University of Edinburgh,
10 Crichton Street,
Edinburgh, EH8 9AB,
Scotland, UK.
[email protected]

Matthew Stone,
Computer Science,
Rutgers University,
110 Frelinghuysen Road,
Piscataway NJ 08854-8019.
[email protected]

Acknowledgements. This work has been presented at several workshops and conferences: the Workshop on Embodied Communication (Bielefeld, March 2006); Constraints in Discourse (Maynooth, June 2006); the Pragmatics Workshop (Cambridge, September 2006); Brandial (Potsdam, September 2006); the Workshop on Dynamic Semantics (Oslo, September 2006); Dialogue Matters (London, February 2008); the Rutgers Semantics Workshop (New Jersey, November 2008). We would like to thank the participants of these for their very useful comments and feedback. Finally, we would like to thank the many individuals who have


influenced this work through discussion and feedback: Nicholas Asher, Susan Brennan and her colleagues at Stony Brook, Justine Cassell, Herb Clark, Susan Duncan, Jacob Eisenstein, Dan Flickinger, Jerry Hobbs, Michael Johnston, Adam Kendon, Hannes Rieser and his colleagues at Bielefeld, Candy Sidner, and Rich Thomason. We also owe a special thanks to Jean Carletta, who provided support in our search of the AMI corpus, including writing search scripts that helped us to find what we were looking for. Finally, we would like to thank two anonymous reviewers for this journal, for their very detailed and thoughtful comments on earlier drafts of this paper, and the editors Anna Szabolcsi and Bart Geurts. Any mistakes that remain are our own.

Much of this research was done while Matthew Stone held a fellowship in Edinburgh, funded by the Leverhulme Trust. We also gratefully acknowledge financial support from the following National Science Foundation (NSF) grants: HLC-0308121, CCF-0541185, HSD-0624191.

References

Asher, N., and A. Lascarides. 1995. “Metaphor in Discourse.” In Proceedings of the AAAI Spring Symposium Series: Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity and Generativity, 3–7. Stanford.

Asher, N., and A. Lascarides. 2003. Logics of Conversation. Cambridge, UK: Cambridge University Press.

Asher, N., and A. Lascarides. 2008. “Commitments, Beliefs and Intentions in Dialogue.” In Proceedings of the 12th Workshop on the Semantics and Pragmatics of Dialogue (Londial), 35–42. London.

Barker, C. 2002. “The Dynamics of Vagueness.” Linguistics and Philosophy, 25, 1, 1–36.

Benz, A., G. Jäger, and R. van Rooij, eds. 2005. Game Theory and Pragmatics. Basingstoke, UK: Palgrave Macmillan.

Bittner, M. 2001. “Surface composition as bridging.” Journal of Semantics, 18, 127–177.

Buchwald, A., O. Schwartz, A. Seidl, and P. Smolensky. 2002. “Recoverability Optimality Theory: Discourse Anaphora in a Bidirectional Framework.” In Proceedings of the International Workshop on the Semantics and Pragmatics of Dialogue (EDILOG), 37–44. Edinburgh.

Carletta, J. 2007. “Unleashing the Killer Corpus: Experiences in Creating the Multi-Everything AMI Meeting Corpus.” Language Resources and Evaluation Journal, 41, 2, 181–190.

Carston, R. 2002. Thoughts and Utterances: The Pragmatics of Explicit Communication. Oxford: Blackwell.

Cassell, J. 2001. “Embodied Conversational Agents: Representation and Intelligence in User Interface.” AI Magazine, 22, 3, 67–83.


Chierchia, G. 1995. Dynamics of Meaning: Anaphora, Presupposition and the Theory of Grammar. Chicago: University of Chicago Press.

Clark, H. 1977. “Bridging.” In P. N. Johnson-Laird and P. C. Wason, eds., Thinking: Readings in Cognitive Science, 411–420. New York: Cambridge University Press.

Clark, H. 1996. Using Language. Cambridge, England: Cambridge University Press.

Copestake, A. 2003. “Report on the Design of RMRS.” Tech. Rep. EU Deliverable for Project number IST-2001-37836, WP1a, Computer Laboratory, University of Cambridge.

Copestake, A., and E. J. Briscoe. 1995. “Semi-Productive Polysemy and Sense Extension.” Journal of Semantics, 12, 1, 15–67.

Copestake, A., D. Flickinger, I. A. Sag, and C. Pollard. 1999. “Minimal Recursion Semantics: An Introduction.” Available from http://www-csli.stanford.edu/~aac.

Copestake, A., A. Lascarides, and D. Flickinger. 2001. “An Algebra for Semantic Construction in Constraint-based Grammars.” In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL/EACL 2001), 132–139. Toulouse.

Cumming, S. 2007. Proper Nouns. Ph.D. thesis, Rutgers University.

Egg, M., A. Koller, and J. Niehren. 2001. “The Constraint Language for Lambda Structures.” Journal of Logic, Language, and Information, 10, 457–485.

van Eijck, J., and H. Kamp. 1997. “Representing Discourse in Context.” In Johan van Benthem and Alice ter Meulen, eds., Handbook of Logic and Linguistics, 179–237. Amsterdam: Elsevier.

Ekman, P., and W. V. Friesen. 1969. “The repertoire of nonverbal behavior: Categories, origins, usage, and coding.” Semiotica, 1, 1, 49–98.

Emmorey, K., B. Tversky, and H. Taylor. 2000. “Using Space to Describe Space: Perspective in Speech, Sign and Gesture.” Spatial Cognition and Computation, 2, 3, 157–180.

Engle, R. 2000. Toward a Theory of Multimodal Communication: Combining Speech, Gestures, Diagrams and Demonstrations in Structural Explanations. Ph.D. thesis, Stanford University.

Farkas, D. 2002. “Specificity Distinctions.” Journal of Semantics, 19, 1–31.

Fauconnier, G. 1997. Mappings in Thought and Language. Cambridge, UK: Cambridge University Press.

Gibbs, R. W. 1994. The Poetics of Mind: Figurative Thought, Language, and Understanding. Cambridge, UK: Cambridge University Press.

Ginzburg, J., and R. Cooper. 2004. “Clarification, Ellipsis and the Nature of Contextual Updates in Dialogue.” Linguistics and Philosophy, 27, 297–366.

Glucksberg, S., and M. S. McGlone. 2001. Understanding Figurative Language: From Metaphors to Idioms. Oxford: Oxford University Press.


Goffman, E. 1963. Behavior in Public Places. The Free Press.

Goldin-Meadow, S. 2003. Hearing Gesture: The Gestures we Produce when we Talk. Cambridge, Mass: Harvard University Press.

Grice, H. P. 1975. “Logic and Conversation.” In P. Cole and J. L. Morgan, eds., Syntax and Semantics Volume 3: Speech Acts, 41–58. New York: Academic Press.

Groenendijk, J., and M. Stokhof. 1991. “Dynamic Predicate Logic.” Linguistics and Philosophy, 14, 39–100.

Grosz, B., A. Joshi, and S. Weinstein. 1995. “Centering: A Framework for Modelling the Local Coherence of Discourse.” Computational Linguistics, 21, 2, 203–226.

Haviland, J. 2000. “Pointing, gesture spaces, and mental maps.” In David McNeill, ed., Language and Gesture, 13–46. New York: Cambridge University Press.

Hays, E., and S. Bayer. 2001. “Metaphoric Generalization through Sort Coercion.” In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), 222–228. Berkeley, California.

Heim, I. 1982. The Semantics of Definite and Indefinite Noun Phrases. Ph.D. thesis, University of Massachusetts.

Hobbs, J. R., M. Stickel, D. Appelt, and P. Martin. 1993. “Interpretation as Abduction.” Artificial Intelligence, 63, 1–2, 69–142.

Johnston, M. 1998. “Unification-based Multimodal Parsing.” In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and International Conference in Computational Linguistics (ACL/COLING). Montreal, Canada.

Johnston, M., P. R. Cohen, D. McGee, J. Pittman, S. L. Oviatt, and I. Smith. 1997. “Unification-based multimodal integration.” In ACL/EACL 97: Proceedings of the Annual Meeting of the Association for Computational Linguistics. Madrid.

Kamp, H. 1981. “A Theory of Truth and Semantic Representation.” In J. Groenendijk, T. Janssen, and M. Stokhof, eds., Formal Methods in the Study of Language, 277–322. Amsterdam: Mathematisch Centrum.

Kamp, H., and U. Reyle. 1993. From Discourse to Logic: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Dordrecht, NL: Kluwer Academic Publishers.

Kendon, A. 1972. “Some Relationships between Body Motion and Speech: An Analysis of an Example.” In W. Siegman and B. Pope, eds., Studies in Dyadic Communication, 177–210. Oxford: Pergamon Press.

Kendon, A. 1978. “Differential perception and attentional frame: two problems for investigation.” Semiotica, 24, 305–315.

Kendon, A. 2004. Gesture: Visible Action as Utterance. Cambridge, UK: Cambridge University Press.


Koller, A., K. Mehlhorn, and J. Niehren. 2000. “A Polynomial-time Fragment of Dominance Constraints.” In Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics (ACL2000). Hong Kong.

Kopp, S., P. Tepper, and J. Cassell. 2004. “Towards Integrated Microplanning of Language and Iconic Gesture for Multimodal Output.” In Proceedings of ICMI. State College, PA.

Kortmann, B. 1991. Free Adjuncts and Absolutes in English: Problems in Control and Interpretation. Abingdon, UK: Routledge.

Krahmer, E., and M. Swerts. in press. “The Effects of Visual Beats on Prosodic Prominence: Acoustic Analyses, Auditory Perception and Visual Perception.” To appear in Journal of Memory and Language.

Kranstedt, A., A. Lücking, T. Pfeiffer, H. Rieser, and I. Wachsmuth. 2006. “Deixis: How to Determine Demonstrated Objects Using a Pointing Cone.” In Gestures in Human-Computer Interaction and Simulation, 300–311. Berlin: Springer.

Kyburg, A., and M. Morreau. 2000. “Fitting words: Vague words in context.” Linguistics and Philosophy, 23, 6, 577–597.

Lakoff, G., and M. Johnson. 1981. Metaphors We Live by. Chicago: University of Chicago Press.

Lascarides, A., and M. Stone. 2006. “Formal Semantics for Iconic Gesture.” In Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue (Brandial). Potsdam.

Lascarides, A., and M. Stone. in press. “Discourse Coherence and Gesture Interpretation.” To appear in Gesture.

Lewis, D. 1969. Convention: A Philosophical Study. Cambridge, Mass: Harvard University Press.

Lücking, A., H. Rieser, and M. Staudacher. 2006. “SDRT and Multi-Modal Situated Communication.” In Proceedings of BRANDIAL. Potsdam.

Mann, W. C., and S. A. Thompson. 1987. “Rhetorical Structure Theory: A Framework for the Analysis of Texts.” International Pragmatics Association Papers in Pragmatics, 1, 79–105.

McNeill, D. 1992. Hand and Mind: What Gestures Reveal about Thought. Chicago: University of Chicago Press.

McNeill, D. 2005. Gesture and Thought. Chicago: University of Chicago Press.

Nunberg, G. 1978. The Pragmatics of Reference. Bloomington, Indiana: Indiana University Linguistics Club.

Oviatt, S., A. DeAngeli, and K. Kuhn. 1997. “Integration and Synchronization of Input Modes during Multimodal Human-Computer Interaction.” In Proceedings of the Conference on Human Factors in Computing Systems: CHI ’97. Los Angeles.


Pustejovsky, J. 1995. The Generative Lexicon. Cambridge, Mass: MIT Press.

Quek, F., D. McNeill, R. Bryll, S. Duncan, X. Ma, C. Kirbas, K. McCullough, and R. Ansari. 2002. “Multimodal Human Discourse: Gesture and Speech.” ACM Transactions on Computer-Human Interaction, 9, 3, 171–193.

Reddy, M. J. 1993. “The conduit metaphor: A case of frame conflict in our language about language.” In Andrew Ortony, ed., Metaphor and Thought, 164–201. New York: Cambridge University Press.

Schank, R. C., and R. Abelson. 1977. Scripts, Plans, Goals, and Understanding. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Schlangen, D., and A. Lascarides. 2002. “Resolving Fragments using Discourse Information.” In Proceedings of the 6th International Workshop on the Semantics and Pragmatics of Dialogue (Edilog). Edinburgh.

So, W., S. Kita, and S. Goldin-Meadow. in press. “Using the Hands to Keep Track of Who Does What to Whom.” To appear in Cognitive Science.

Sowa, T., and I. Wachsmuth. 2000. “Coverbal Iconic Gestures for Object Descriptions in Virtual Environments: an Empirical Study.” In Post-Proceedings of the Conference of Gestures: Meaning and Use. Portugal.

Steedman, M. 2000. The Syntactic Process. Cambridge, Mass: MIT Press.

Stern, J. 2000. Metaphor in Context. Cambridge, Mass: MIT Press.

Talmy, L. 1996. “Fictive motion in language and ‘ception’.” In Paul Bloom, Mary Peterson, Lynn Nadel, and Merrill Garrett, eds., Language and Space, 211–276. Cambridge, Mass: MIT Press.

Walker, M. 1993. Informational redundancy and resource bounds in dialogue. Ph.D. thesis, Department of Computer & Information Science, University of Pennsylvania.

Williamson, T. 1994. Vagueness. London: Routledge.
