
Towards Multimodal Coreference Resolution for Exploratory Data Visualization Dialogue: Context-Based Annotation and Gesture Identification∗

Abhinav Kumar and Barbara Di Eugenio and Jillian Aurisano and Andrew Johnson
Abeer Alsaiari and Nigel Flowers
Computer Science, University of Illinois at Chicago
Chicago, IL, USA
{akumar34/jauris2/bdieugen/aej}@uic.edu

Alberto Gonzalez and Jason Leigh
Information & Computer Sciences, University of Hawai’i at Manoa
Honolulu, HI, USA
{agon/leighj}@hawaii.edu

Abstract

The goals of our work are twofold: gain insight into how humans interact with complex data and visualizations thereof in order to make discoveries; and use our findings to develop a dialogue system for exploring data visualizations. Crucial to both goals is understanding and modeling of multimodal referential expressions, in particular those that include deictic gestures. In this paper, we discuss how context information affects the interpretation of requests and their attendant referring expressions in our data. To this end, we have annotated our multimodal dialogue corpus for context and both utterance and gesture information; we have analyzed whether a gesture co-occurs with a specific request or with the context surrounding the request; we have started addressing multimodal co-reference resolution by using Kinect to detect deictic gestures; and we have started identifying themes found in the annotated context, especially in what follows the request.

1 Introduction

The goals of our work are twofold. The first is to gain insight into how humans interact with complex data in order to make discoveries. It is well known that visualization is very effective for exploring large datasets and gaining insight into the underlying phenomena. However, users (particularly visualization novices) struggle with translating higher-level natural language queries to appropriate visualizations that could assist in answering their questions. Our first step has been to collect and analyze naturalistic dialogues in which novices explore such datasets. Based on the insights from the data collection, our second goal is that of developing a conversational interface that will automatically generate the appropriate visualizations by participating in a natural interaction with users. We already have a pipeline in place that creates visualizations in response to a limited type of spoken requests.

∗ Supported by NSF awards IIS-1445751 and IIS-1445796.

In this paper, we focus on the role that context and gestures play in the interpretation of both requests and referring expressions. We are certainly not the first ones to suggest that the context of a request and multimodality, specifically deictic gestures, are essential to providing a more natural interactive system. Already (Sinclair, 1992) showed that having knowledge of utterances prior to the current one helps the human better interpret the utterance, which can lead to improved disambiguation. Similarly, multimodal systems have been shown to be advantageous over unimodal systems (Jaimes and Sebe, 2007). One reason is that receiving multiple input signals rather than just speech can reduce the chances of misunderstandings as well as resolve ambiguities. Also, humans are able to interact more naturally by using gestures along with speech, making the experience more effective and natural.

This paper builds on our previous work (Aurisano et al., 2015; Aurisano et al., 2016; Kumar et al., 2016) and focuses on the following new contributions: annotation of context and gestures on our multimodal corpus; gesture detection using Kinect; and classification of contextual themes.

2 Related Work

There are several areas of research that are relevant to our work, the first one being the vast literature on multimodality – we will just focus on multimodal referring expressions in this paper. As is well known (Sinclair, 1992; Kehler, 2000; Goldin-Meadow, 2005; Landragin, 2006; Navarretta, 2011), in natural dialogue, the antecedents of linguistic referring expressions are often introduced via gestures; for example, in our environment, the user can point to a street intersection on a map yet never have mentioned it earlier. Crucially, from a computational point of view, including hand gesture information improves the performance of the reference resolution module (Eisenstein and Davis, 2006; Baldwin et al., 2009). Other sources of multimodal information are important as well, including eye gaze (Prasov and Chai, 2008; Iida et al., 2011; Liu et al., 2013), or haptic (force exchange) information (Foster et al., 2008; Chen et al., 2015), but we will not address those in this paper.

Several additional challenges concerning resolving referring expressions arise when humans interact with graphical representations. First, the user will likely expect that any visible object can be discussed (Byron, 2003). Second, the same expression can be used to refer to an entity in the domain or in the visualization (Qu and Chai, 2008). For example, in our domain, users can refer to a type of crime in the world (Look how much theft around UIC), or to the visual elements, e.g. dots, that represent theft (Can you color theft red?). As far as we know, only (LuperFoy, 1992) tried to account for different perspectives on a referent, by linking them to a so-called discourse peg; interestingly, she applied her approach to an interface for manipulating visualizations (Hollan et al., 1988).

If the graphical representation is presented on a large display, as in our case, yet additional challenges arise as concerns how humans interact with it, including window management problems (Robertson et al., 2005). Closer to our interests, not much work exists on interpreting deictic gestures directed at large displays, especially as concerns recognizing the target at a semantic level (Kim et al., 2017).

Finally, as regards interactive systems that generate data visualizations more in general, the vast majority of those are not focused on natural, conversational interaction: (Gao et al., 2015) does not provide two-way communication; the number of supported query types is limited in both (Cox et al., 2001) and (Reithinger et al., 2005), while (Sun et al., 2013) uses simple NLP methods that limit the extent of natural language understanding possible. EVIZA (Setlur et al., 2016), perhaps the closest project to our own, does provide a dialogue interface for users to explore visualizations; however, EVIZA focuses on supporting a user interacting with one existing visualization, and doesn't cover creating a new visualization, modifying the existing one, or interacting with more than one visualization at a time.

3 Foundational Work

As we describe in previously published work (Aurisano et al., 2015; Aurisano et al., 2016; Kumar et al., 2016) and briefly summarize here, our work rests on a new multimodal corpus that we collected, transcribed and started annotating, and on an NLP pipeline that can currently interpret a subset of the requests we observed in our data.

3.1 Corpus and Initial Annotations

The corpus was built by collecting spoken conversations from 15 subjects. Each subject interacted with a remote Data Analysis Expert (DAE) in a Wizard-of-Oz setup, to explore data visualizations on Chicago crime data to understand when and where to deploy police officers. In each session users went through multiple cycles of visualization construction, interaction and interpretation; these sessions lasted between 45 and 90 minutes. Users were invited to interact with the DAE as naturally as possible, and to think aloud about their reasoning. They viewed visualizations and limited communications from the DAE on a large, tiled-display wall. The DAE viewed the subject through two high-resolution, direct video feeds, and also had a mirrored copy of the tiled-display wall on two 4K displays. The DAE generated responses to questions using Tableau,1 and used SAGE2 (Marrinan et al., 2014), a collaborative large-display middleware, to drive the display wall.

1 http://www.tableau.com

Page 3: Towards Multimodal Coreference Resolution for Exploratory ...about their reasoning. They viewed visualizations and limited communications from the DAE on a large, tiled-display wall

Words     Utterances     Directly Actionable Utts.
38,105    3,179          490

Table 1: Corpus size

The DAE could also communicate via a chat window, but tried to behave like a system with limited dialogue capabilities would. Apart from greetings and status messages (sorry, it's taking long), the DAE would occasionally ask for clarifications, e.g. Did you ask for thefts or batteries.2 However, the DAE never responded with a message if the query could be directly visualized; neither did the DAE engage in multi-turn elicitations of the user requirements.

The dialogues were transcribed in their entirety: some basic distributional statistics are presented in Table 1, which includes directly actionable utterances, the focus of our initial annotation effort. Three coders identified the directly actionable utterances, namely, those utterances3 which directly affect what the DAE is doing; the rest are non-actionable think-aloud utterances (during which the user was expressing out loud what he or she was thinking at the time). This was achieved by leaving an utterance unlabelled or labeling it with one of six directly actionable request types: 1. create new visualization (Can I see number of crimes by day of the week?); 2. modify existing visualization (Umm, yeah, I want to take a look closer to the metro right here, umm, a little bit eastward of Greektown); 3. window management operations (on windows on the screen) (If you want you can close these graphs as I won't be needing it anymore); 4. fact-based requests that don't need a visualization to be answered (During what time is the crime rate maximum, during the day or the night?); 5. clarification questions (Okay, so is this statistics from all 5 years? Or is this for a particular year?); 6. expressing preferences (The first graph is a better way to visualize rather than these four separately). After annotation, it was found that only 15% of the dialogue consisted of actionable requests while the remaining 85% were non-actionable think-aloud. We obtained an excellent intercoder agreement κ = 0.84 (Cohen, 1960) on labeling an utterance or leaving it unlabeled; and κ = 0.74 on the six types of actionable requests.

2 Batteries in this context means an offensive touching or use of force on a person without the person's consent (Merriam-Webster).

3 What counts as an utterance was defined at transcription.
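As a side note, agreement figures such as the κ values above can be reproduced with standard tooling; the following is a minimal sketch using scikit-learn, where the two coders' label sequences are invented examples rather than our corpus data.

```python
# Minimal sketch: Cohen's kappa between two coders' utterance labels.
# The label sequences below are hypothetical, not our corpus annotations.
from sklearn.metrics import cohen_kappa_score

coder1 = ["create", "none", "modify", "none", "window", "none"]
coder2 = ["create", "none", "modify", "create", "window", "none"]

print(f"kappa = {cohen_kappa_score(coder1, coder2):.2f}")
```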

3.2 The Articulate2 dialogue architecture

The current system's (Articulate2) process flow can be seen within the rectangular box in Figure 1. It begins by translating the request to logical form using the Google Speech API and NLP parsing. Three NLP structures are obtained: ClearNLP (Choi and McCallum, 2013) is used to obtain PropBank (Palmer et al., 2005) semantic role labels (SRLs), which are then mapped to VerbNet (Kipper et al., 2008) and WordNet using SemLink (Palmer, 2009). The Stanford Parser is used to obtain the remaining two structures, i.e. the syntactic parse tree and the dependency tree. On the basis of these three structures, a standard logical form is obtained. Then, a classifier determines the type of the request among the six request types we just described. At this point in time, Articulate2 can process the first three types of requests we discussed earlier: it will transform the logical form to SQL for create new visualization and modify existing visualization requests, or skip this step for window management operations (since data retrieval is not needed in this case). Finally, the system generates an appropriate visualization specification which is then executed by the Visualization Executor on the data returned by the execution of the SQL query. The system also stores each generated visualization specification in its dialogue history. At the moment, Articulate2 is limited by its inability to resolve referring expressions (e.g., it closes the most recently created visualization without checking if the user was referring to a different window on the screen). In this paper, we discuss what our data tells us about multimodal referring expressions, and discuss the first steps we have taken to model those computationally.

Figure 1: Articulate2 dialogue processing architecture
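To make the logical-form-to-SQL step concrete, the sketch below shows how a simplified logical form for a create new visualization request could be mapped to a query. The flat dictionary format and the table and column names (chicago_crimes, day_of_week) are hypothetical illustrations, not the system's actual representation.

```python
# Minimal sketch of mapping a (simplified) logical form to SQL for a
# "create new visualization" request such as
# "Can I see number of crimes by day of the week?".
# The dictionary format and table/column names are hypothetical.

def logical_form_to_sql(lf: dict) -> str:
    select = ", ".join(lf["group_by"] + [f"{lf['aggregate']}(*) AS value"])
    sql = f"SELECT {select} FROM {lf['table']}"
    if lf.get("filters"):
        sql += " WHERE " + " AND ".join(lf["filters"])
    if lf.get("group_by"):
        sql += " GROUP BY " + ", ".join(lf["group_by"])
    return sql

lf = {
    "table": "chicago_crimes",
    "aggregate": "COUNT",
    "group_by": ["day_of_week"],
    "filters": [],
}
print(logical_form_to_sql(lf))
# SELECT day_of_week, COUNT(*) AS value FROM chicago_crimes GROUP BY day_of_week
```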

Page 4: Towards Multimodal Coreference Resolution for Exploratory ...about their reasoning. They viewed visualizations and limited communications from the DAE on a large, tiled-display wall

4 Corpus Analysis: Multimodal references in context

Preliminary analysis of the dialogue data showed that users referred to visualizations through speech, gestures, or both. In addition, sometimes clues about identifying the object referred to by the referring expression (in the form of speech or gesture) were found as part of the think-aloud nearby rather than temporally aligned with the actionable request. This is why we decided to extend our analysis to the context surrounding an actionable request (contextual utterance annotation) as well as any gestures that occurred during that context (contextual gesture annotation).

The context is comprised of three parts: setup, request, and conclusion. For the purpose of this work, we start from one single utterance annotated as an actionable request, and look at its preceding and following context. The setup includes utterances that come prior to the request while the conclusion includes utterances after the request. Since often the utterances just prior to or after the request are part of a larger contiguous thought process that can be captured, all utterances up to and including the mention of a data attribute are included.
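The segmentation rule just described can be summarized procedurally as in the sketch below. It is only an approximation under stated assumptions: the attribute list is a toy subset, utterances are plain strings, and the decision to also stop at another actionable request follows the example discussed next.

```python
# Sketch of the context-segmentation rule: starting from an actionable
# request, collect preceding (setup) and following (conclusion) utterances
# until an utterance mentioning a data attribute is included, stopping early
# at another actionable request. The attribute set and utterance format are
# simplified for illustration.

DATA_ATTRIBUTES = {"june", "july", "august", "theft", "battery"}  # toy subset

def mentions_attribute(utterance: str) -> bool:
    return any(attr in utterance.lower().split() for attr in DATA_ATTRIBUTES)

def extract_context(utterances, request_idx, is_request):
    setup, conclusion = [], []
    # Walk backwards for the setup.
    for i in range(request_idx - 1, -1, -1):
        if is_request[i]:
            break
        setup.insert(0, utterances[i])
        if mentions_attribute(utterances[i]):
            break
    # Walk forwards for the conclusion.
    for i in range(request_idx + 1, len(utterances)):
        if is_request[i]:
            break
        conclusion.append(utterances[i])
        if mentions_attribute(utterances[i]):
            break
    return setup, utterances[request_idx], conclusion

utts = ["in june crime spikes", "can you show a map of thefts", "it's that one right there"]
print(extract_context(utts, 1, [False, True, False]))
```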

One example is shown in Figure 2 along with the corresponding annotation in ANVIL (Kipp, 2001) in Figure 3. The setup component includes just one utterance because "June", "July", and "August" are part of our data attribute set. The request component is always just the request utterance itself. Finally, in this example the conclusion part is also a single utterance, however not because it mentions data attributes, but rather because it is followed by another request, which signals the start of a new context. Also note that U_R in Figure 2 mentions a deictic referring expression "that map"; in the conclusion utterance, clues are provided about the referent by means of language ("It's like that one right there or maybe it's that one") and gesture (the user points to multiple visualizations). We believe that the interplay between different components of the context, the referring expressions and the deictic gestures is crucial to properly resolving a referential expression, and to interpreting a request.

4.1 Context and Gesture Annotation

We performed two separate annotations on the corpus, one to determine which utterances belong to the set-up and conclusion for a certain utterance U_R, the second for gestures and their context.

Figure 2: Context is comprised of setup, request, and conclusion utterances.

Utterances. We use the label Timestep to apportion utterances to the three components of a context. By default, we start with an utterance U_R previously marked as an actionable request, which will be assigned the default Timestep value of Current. As we noted when discussing what type of requests Articulate2 currently processes, also here we focus only on the first three types of actionable requests (1. create new visualization; 2. modify existing visualization; 3. window management operations). This is a total of 449 requests out of the 490 in Table 1.

The utterances preceding the Current utterance, i.e. U_R, are coded as Previous; and those that follow the request and pertain to it as conclusion are coded as Next. For an actionable utterance U_R then, the context includes all preceding utterances marked as Previous (the context set-up) and all following utterances marked as Next (the context conclusion). We obtained a very good κ = 0.783 on Timestep annotations for utterances.

Figure 4 shows the distribution of the coded Timestep values, and then two derived distributions, Type and Context. In all three plots the "anchor", so to speak, is the Current utterance, i.e. U_R, the request of interest; hence, to the 449 Current utterances in Figure 4(a) correspond the 449 Requests in Figures 4(b) and 4(c).

Figure 4(a) shows the distribution of utterances preceding and following U_R, whereas Figure 4(b) shows how those utterances are apportioned within the context, i.e. as set-up or conclusions. By comparing Figures 4(a) and 4(b) we can conclude that the set-up includes about 1.8 utterances on average, and conclusions about 2 utterances on average. Finally, Figure 4(c) simply confirms that no utterance either in the set-up or the conclusion is a directly actionable utterance. Whereas this follows by construction, the data confirms that no human errors occurred during annotation.

Figure 3: Annotation of a context in ANVIL (Kipp, 2001).

Gestures. The annotation for gestures includes various components, as shown in Figure 5. First, we mark the gesture with Timestep, as described above: the value for Timestep will be Previous/Current/Next depending on where the gesture occurs within the context of utterance U_R. Second, Mode is used to encode whether the gesture is Deictic (that is, whether the gesture is pointing to objects on the screen). If not, then it is Non-Deictic and the Space is assigned to Peripheral or Screen. The Screen value for Space pertains to gestures that the user makes in front of him- or herself while interacting with the screen, while Peripheral Space is used if the gesture is made without screen interaction (Wagner et al., 2014). Note that for Deictic gestures, the Space will always be assigned to Screen, since pointing to objects on the screen is clearly interactive. Finally, if the gesture is Deictic, then its Type and Target are also assigned; the values for these two labels will be discussed shortly.
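One way to read the coding scheme is as a small data structure; the sketch below is our own rendering of the labels in Figure 5 (the field names and the string values for Type and Target are ours), including the constraint that deictic gestures are always Screen-space.

```python
# Sketch of the gesture coding scheme as a data structure (field names ours).
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Timestep(Enum):
    PREVIOUS = "Previous"
    CURRENT = "Current"
    NEXT = "Next"

class Mode(Enum):
    DEICTIC = "Deictic"
    NON_DEICTIC = "Non-Deictic"

class Space(Enum):
    SCREEN = "Screen"
    PERIPHERAL = "Peripheral"

@dataclass
class GestureAnnotation:
    timestep: Timestep                    # where the gesture falls relative to U_R
    mode: Mode
    space: Space                          # always SCREEN when mode is DEICTIC
    gesture_type: Optional[str] = None    # only for deictic gestures (e.g., pointing)
    target: Optional[str] = None          # only for deictic gestures (e.g., one visualization)

    def __post_init__(self):
        if self.mode is Mode.DEICTIC and self.space is not Space.SCREEN:
            raise ValueError("Deictic gestures always interact with the screen")
```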

Table 2 provides intercoder agreement for the various labels associated with gestures. Whereas the κ values for Timestep, Mode and Space are substantial, the values for Type and Target are lower. This is not surprising: it is difficult to determine if the user is moving the hand while pointing, or keeping it stationary; and even more so, to distinguish between the four values the Target label can have, including deciding whether the user is pointing to a visualization, or to objects within a visualization.

Code        κ
Timestep    0.718
Mode        0.748
Space       0.764
Type        0.659
Target      0.639

Table 2: Intercoder agreement for gestures

Figure 6 provides distributions for all the labels that are included in the gesture annotation. Figures 6(a) and 6(b) provide information about where gestures fall with respect to request U_R. These two graphs show that only about 50% of gestures are aligned with the actual request (Current in Figure 6(a) and Request in Figure 6(b)); about 17% of gestures co-occur with an utterance preceding U_R (Previous/Setup), and the remaining 33% co-occur with an utterance following U_R (Next/Conclusion).

Figure 6(c) shows that about 70% of gestures are deictic; and Figure 6(d) shows that subjects used gestures to interact with the screen far more than peripherally, since apart from the 380 deictic gestures, 38 non-deictic gestures also interact with the screen. Finally, Figures 6(e) and 6(f) focus on deictic gestures.4

4 The attentive reader will note that totals in Figures 6(e) and 6(f) are slightly lower (378 and 373 respectively) than 380, the number of deictic gestures in Figure 6(c). In both cases a very small number of gestures has been assigned an Other Type or Target. For the sake of brevity, we will not discuss the Other categories.


Figure 4: Utterance contextual labels: (a) Timestep frequency; (b) context components frequency; (c) type frequency.

Figure 5: Coding scheme for gestures

Figure 6(e) shows that most deictic gestures are exclusively pointing, or pointing while also moving the hand. Finally, Figure 6(f) provides the distributions of the targets for the deictic gestures. Three targets occur with similar frequencies: it is not surprising that users point to either individual visualizations, or individual objects within a visualization; it is less expected that they point to more than one individual object within a visualization so frequently. On the other hand, pointing to more than one visualization at the same time is not as common.

4.2 Lessons from Context Annotation

The most important lesson is that U_R does not occur in a vacuum: as demonstrated by Figure 4(b), about half of the time an actionable request U_R is preceded by contextual information directly relevant to U_R itself; and even more frequently, about 80% of the time, U_R is followed by pertinent information. The second important lesson is that about half of the gestures relevant to the interpretation of referring expressions contained within U_R are not aligned with U_R either. This is a crucial insight for coreference resolution.

5 Towards Multimodal Coreference Resolution

Our coreference resolution approach begins with the spoken contextual utterances from the user. If no referring expressions are detected, then the process flow described in Section 3.2 will be followed.



Figure 6: Gesture feature distributions: (a) Timestep frequency; (b) context frequency; (c) mode frequency; (d) space frequency; (e) deictic gesture type; (f) deictic gesture target.

Otherwise, if a gesture has been detected by the gesture detection process we will discuss in the next section, information about any objects pointed to by the user will be provided to the Matcher. The Matcher will then be invoked and attempt to find a best match between properties of each of the relevant entities. A difficulty we still need to address is to select the properties of visualizations and objects we will keep track of. A first inventory of good properties to extract from a visualization includes statistics in the data (e.g., neighborhoods with lowest and highest crime rates), trends in the data (e.g., top 5 and bottom 5 crime and location types), the title, plot type, and any more prominent objects within the visualization, such as hot-spots, street names, bus stops, and so on.
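Since the Matcher is still to be developed, the following is only a minimal sketch of the kind of matching we envision: score each candidate visualization by the overlap between clues drawn from the referring expression and its context, and boost candidates the user pointed at. The property inventory, the scoring, and all names here are illustrative assumptions, not the implemented algorithm.

```python
# Minimal sketch of the envisioned Matcher: overlap between clues from the
# referring expression/context and each candidate visualization's properties,
# with a boost for candidates the user pointed to. Illustrative only.

def match_referent(clues: set, candidates: list, pointed_ids: set) -> dict:
    best, best_score = None, float("-inf")
    for vis in candidates:
        properties = {vis["plot_type"], vis["title"].lower()} | set(vis["attributes"])
        score = len(clues & properties)
        if vis["id"] in pointed_ids:      # deictic gesture evidence
            score += 2
        if score > best_score:
            best, best_score = vis, score
    return best

candidates = [
    {"id": 1, "plot_type": "map", "title": "Thefts near UIC", "attributes": {"theft"}},
    {"id": 2, "plot_type": "bar", "title": "Crimes by day", "attributes": {"day_of_week"}},
]
print(match_referent({"map", "theft"}, candidates, pointed_ids={1})["id"])  # -> 1
```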

As noted earlier, when users are faced with a graphical representation, any object in the representation can become a referent. However, an additional difficulty is that we do not have a declarative representation for all these potential referents. For example, in a map representation of crime occurrences, each crime is represented by a dot; however, the dot is procedurally generated by the graphics software to render one data point in the data; that individual dot does not exist as an individuated object in some declarative representation of the visualization. The reason for this lack of representation is that the language we used for generating visualizations (Vega (Trifacta, 2014)) abstractly performs behind-the-scenes operations on the data when producing graphs and does not directly provide access to individual objects.

5.1 Deictic Gesture Recognition

Whereas the Matcher still needs to be developed, we have made considerable progress on recognizing deictic gestures, to which we turn now.

Several approaches have been proposed to estimate the pointing direction using Computer Vision techniques. One common method is to model the pointing direction as the line of sight that connects the joints of head and hand (Kehl and Van Gool, 2004). Using regular cameras to detect body joints is still a challenging task in Computer Vision since they lack information about the depth of the user's body and surrounding environment.

Since its release in 2011, the Microsoft Kinect camera has provided the capability of depth detection at a low cost. It combines depth and infrared cameras with a regular RGB camera for depth stream acquisition and skeletal tracking. The Kinect camera has the ability to track 24 distinct joints of the human body, from which the 3D coordinates of body joints can be obtained. Using the 3D information from the Kinect camera, we constructed a virtual touch screen, originally defined by (Cheng and Takatsuka, 2006) and adapted later by (Jing and Ye-peng, 2013), to enable efficient pointing gesture interaction with the large display. The user interacts with the large display through the constructed virtual touch screen to point to a specific visualization on the display.

5.1.1 Virtual Touch Screen Construction

First, we set up the interaction space by defining the physical space that models the Kinect position and orientation in relation to the large display position. Each joint position acquired by the Kinect is rotated and translated so that the center of the display represents the origin of the world coordinate system. We receive data from the Kinect camera as a stream of 3D positions of body joints per frame. Although we can track all body joints, we focused only on the head and the fingertip of the right hand as the dominant hand.

We created a virtual touch screen using head-fingertip positions to estimate the pointing target. As shown in Figure 7, the virtual screen is assumed to be at the position of the fingertip from the large display. Since the large display and the Kinect are in the same plane, the z coordinate of the large display is zero. Each point (x, y) on the large display is mapped to a point (x', y') on the virtual screen through a line from the large display to the head joint position (x_h, y_h, z_h). Therefore:

(x_h − x) / (x_h − x') = (y_h − y) / (y_h − y') = (z_h − z) / (z_h − z')    (1)

Hence, we can estimate any point (x, y) on the large display by calculating x and y from Equation 1:

x = z_h (x' − x_h) / (z_h − z') + x_h    (2)

y = z_h (y' − y_h) / (z_h − z') + y_h    (3)

The user interacts with the large display as if it were brought forward in front of him or her, and we can map any point on the virtual screen to its corresponding point on the large display using the above equations. The position and dimensions of the virtual screen are calculated based on the positions of the head and fingertip, and consequently the virtual screen adapts to the positions of the user's head and fingertip. Using pointing data, it is possible to infer which visualization the user is pointing to; in particular, we are now also able to identify the window or windows that point (x, y) belongs to.
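Equations (2) and (3) can be applied directly once the head and fingertip joints are expressed in display-centred coordinates; the sketch below shows that computation. The example coordinates, units, and the remark about window lookup are illustrative assumptions.

```python
# Sketch of the virtual-touch-screen mapping in Equations (2) and (3):
# given the head joint (x_h, y_h, z_h) and the fingertip point (x', y', z')
# on the virtual screen, in display-centred coordinates with the display
# plane at z = 0, recover the pointed-at display point (x, y).

def virtual_to_display(head, fingertip):
    xh, yh, zh = head
    xp, yp, zp = fingertip                 # (x', y', z') at the fingertip
    x = zh * (xp - xh) / (zh - zp) + xh    # Equation (2)
    y = zh * (yp - yh) / (zh - zp) + yh    # Equation (3)
    return x, y

# Example (metres): head 2 m from the display, fingertip 0.5 m in front of it,
# slightly to the right of and below the head.
head = (0.10, 0.20, 2.00)
fingertip = (0.25, 0.10, 1.50)
x, y = virtual_to_display(head, fingertip)
print(f"pointing at display point ({x:.2f}, {y:.2f})")
# The enclosing window can then be found by testing (x, y) against the
# bounds of each visualization window on the wall.
```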

Figure 7: User interaction with large display through constructed virtual touch screen at user's fingertip.

6 Towards interpreting requests in context

As we noted earlier, requests don't occur in isolation: they are preceded by a set-up in 50% of the cases, and followed by a conclusion in 80% of the cases. The conversational interface clearly needs to take this information into account: the set-up in order to further refine the request, and the conclusion in order to further the task itself. As a first step towards these goals, we focused on analyzing the conclusion component of a context, and specifically, on uncovering any relevant themes that may occur. In the conclusion part of the context, via additional annotation, it was found that the user would either (1) discuss the resulting graphs produced from the current request (e.g., "ok so it shows that the theft, battery, deceptive-practice and criminal-damage have the highest rate of *uh* crime."), (2) refine the current request (e.g., "thank you, i shall take a look at these, by the hour."), (3) provide some insights (e.g., "so then maybe if may–, if it gets cold, crime goes down at least the cops can go where its warm, maybe take their vacations in the winter."), or (4) produce some unrelated utterances (e.g., "ok, thank you. ok, thank you.").

Figure 8 shows the distribution of these themes: 66% of conclusions discuss what the user gleaned from the request; of these, about 60% discuss the results directly, whereas an additional 6% discuss more general insights into the phenomenon at hand. About 20% represent a further refinement of the request, which sets the stage for the next request.

Figure 8: Conclusion utterance label frequency.

Given these annotations, we trained a supervised classification model to predict the overall theme of a set of conclusion utterances. The model used three different categories of feature types: syntactic, semantic, and miscellaneous. The syntactic feature types include unigrams, bigrams, and trigrams for words, part-of-speech, and tagged part-of-speech. The semantic category is based on the Word2Vec word embedding representation. Specifically, the utterances within a conclusion were added together by their corresponding Word2Vec vectors and then normalized. Finally, the remaining feature types include the total number of words across a given conclusion, the total number of Chicago crime data attributes mentioned across a given conclusion, and the total number of utterances in the conclusion that ended with a question mark (because such utterances were observed to occur in the conclusions). The feature vector dimension was 17,904 (feature selection was applied to reduce the dimensionality). Accuracy results when using different classifiers are shown in Table 3. Apart from Multinomial Naive Bayes, the other three classifiers all perform similarly. We will further investigate sources of confusion in classification to improve their performance.
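For illustration, the sketch below shows a stripped-down version of such a classifier using only the word n-gram portion of the features; the training conclusions and theme labels are invented, and the full model additionally uses POS n-grams, averaged Word2Vec vectors, and the miscellaneous counts described above.

```python
# Minimal sketch of a conclusion-theme classifier using only word n-grams.
# The training examples and labels below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

conclusions = [
    "ok so it shows that theft has the highest rate of crime",
    "thank you, i shall take a look at these, by the hour",
    "maybe if it gets cold, crime goes down",
    "ok, thank you. ok, thank you.",
]
themes = ["discuss_results", "refine_request", "insight", "unrelated"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),   # word unigrams, bigrams, trigrams
    LinearSVC(),
)
model.fit(conclusions, themes)
print(model.predict(["so it shows battery has the highest rate"]))
```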

7 Conclusions and Future Work

In this paper we presented our work on investigating the role context plays in interpreting requests and referential expressions in task-oriented dialogues about exploring complex data via visualizations. This work takes place in the context of our Articulate2 project. Our goals are both to gain insight into how people use visualizations to make discoveries about a domain, and to use our findings in developing an intelligent conversational interface to a visualization system. In previous work, we had collected a new corpus of dialogues, started annotating and analyzing it, and set up the NLP pipeline for the Articulate2 system.

Classifier                 Accuracy
Support Vector Machine     74%
Decision Tree              74%
Random Forest              73%
Multinomial Naive Bayes    64%

Table 3: Thematic conclusion classification accuracy.

Specifically as concerns context, in this paper we have presented how we annotated the context surrounding each of our directly actionable requests, and how we annotated gestures, also in context. We found that indeed an actionable request is preceded by a set-up 50% of the time, and followed by a conclusion 80% of the time. As concerns gestures, we found that (not surprisingly) the majority of them are interactional with respect to the screen and in fact deictic; however, we also found that half of the gestures relevant to the interpretation of referring expressions contained within the request are not aligned with the request, but with the setup or, more often, with the conclusion.

As concerns the computational modeling of our findings, so far we have focused on recognizing deictic gestures via Kinect, and on learning classifiers for the themes contained in the conclusion component of a context.

Much work remains to be done. Apart from taking advantage of the context to refine and disambiguate requests, our most pressing work regards resolving referring expressions. As we noted, we still need to understand what specific properties of visualizations and objects within visualizations are the most useful for resolving referring expressions in our domain. From our findings on gestures and where they occur in the context, it is clear that our algorithm must be incremental. We also need to analyze the referring expressions that users use in our data, to assess how prevalent the phenomenon of a single referent playing a dual role (in the domain, or as a graphical element) is.


References

Jillian Aurisano, Abhinav Kumar, Alberto Gonzales, Khairi Reda, Jason Leigh, Barbara Di Eugenio, and Andrew Johnson. 2015. 'Show me data': observational study of a conversational interface in visual data exploration. In Information Visualization Conference, IEEE VisWeek, Chicago, IL.

Jillian Aurisano, Abhinav Kumar, Alberto Gonzales, Khairi Reda, Jason Leigh, Barbara Di Eugenio, and Andrew Johnson. 2016. Articulate2: Toward a conversational interface for visual data exploration. In Information Visualization Conference, IEEE VisWeek, Baltimore, MD.

Tyler Baldwin, Joyce Y. Chai, and Katrin Kirchhoff. 2009. Communicative gestures in coreference identification in multiparty meetings. In Proceedings of the 2009 International Conference on Multimodal Interfaces, pages 211–218. ACM.

Donna K. Byron. 2003. Understanding referring expressions in situated language: Some challenges for real-world agents. In Proceedings of the First International Workshop on Language Understanding and Agents for the Real World, pages 80–87.

Lin Chen, Maria Javaid, Barbara Di Eugenio, and Milos Zefran. 2015. The roles and recognition of haptic-ostensive actions in collaborative multimodal human-human dialogues. Computer Speech & Language, 32:201–231, Nov.

Kelvin Cheng and Masahiro Takatsuka. 2006. Estimating virtual touchscreen for fingertip interaction with large displays. In Proceedings of the 18th Australia Conference on Computer-Human Interaction: Design: Activities, Artefacts and Environments, pages 397–400. ACM.

Jinho D. Choi and Andrew McCallum. 2013. Transition-based dependency parsing with selectional branching. In ACL, pages 1052–1062.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46.

Kenneth Cox, Rebecca E. Grinter, Stacie L. Hibino, Lalita Jategaonkar Jagadeesan, and David Mantilla. 2001. A multi-modal natural language interface to an information visualization environment. International Journal of Speech Technology, 4(3):297–314.

Jacob Eisenstein and Randall Davis. 2006. Gesture improves coreference resolution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 37–40.

M. E. Foster, E. G. Bard, M. Guhe, R. L. Hill, J. Oberlander, and A. Knoll. 2008. The roles of haptic-ostensive referring expressions in cooperative, task-based human-robot dialogue. In Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, pages 295–302. ACM.

Tong Gao, Mira Dontcheva, Eytan Adar, Zhicheng Liu, and Karrie G. Karahalios. 2015. DataTone: Managing ambiguity in natural language interfaces for data visualization. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, pages 489–500. ACM.

S. Goldin-Meadow. 2005. Hearing gesture: How our hands help us think. Harvard University Press.

James Hollan, Elaine Rich, William Hill, David Wroblewski, Wayne Wilner, Kent Wittenburg, Jonathan Grudin, and Members Human Interface Laboratory. 1988. An introduction to HITS: Human Interface Tool Suite. Technical report, MCC.

Ryu Iida, Masaaki Yasuhara, and Takenobu Tokunaga. 2011. Multi-modal reference resolution in situated dialogue by integrating linguistic and extra-linguistic clues. In IJCNLP, pages 84–92.

Alejandro Jaimes and Nicu Sebe. 2007. Multimodal human-computer interaction: A survey. Computer Vision and Image Understanding, 108(1):116–134.

Pan Jing and Guan Ye-peng. 2013. Human-computer interaction using pointing gesture based on an adaptive virtual touch screen. International Journal of Signal Processing, Image Processing and Pattern Recognition, 6(4):81–92.

Roland Kehl and Luc Van Gool. 2004. Real-time pointing gesture recognition for an immersive environment. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 577–582. IEEE.

Andrew Kehler. 2000. Cognitive Status and Form of Reference in Multimodal Human-Computer Interaction. In AAAI 00, The 15th Annual Conference of the American Association for Artificial Intelligence, pages 685–689.

Hansol Kim, Kun Ha Suh, and Eui Chul Lee. 2017. Multi-modal user interface combining eye tracking and hand gesture recognition. Journal on Multimodal User Interfaces, pages 1–10, March. Published online.

Michael Kipp. 2001. Anvil - a generic annotation tool for multimodal dialogue. In Seventh European Conference on Speech Communication and Technology.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A large-scale classification of English verbs. Language Resources and Evaluation, 42(1):21–40.

Abhinav Kumar, Jillian Aurisano, Barbara Di Eugenio, Andrew E. Johnson, Alberto Gonzalez, and Jason Leigh. 2016. Towards a dialogue system that supports rich visualizations of data. In SIGDIAL Conference, pages 304–309, Los Angeles, CA.


F. Landragin. 2006. Visual perception, language and gesture: A model for their understanding in multimodal dialogue systems. Signal Processing, 86(12):3578–3595.

Changsong Liu, Rui Fang, and Joyce Y. Chai. 2013. Shared gaze in situated referential grounding: An empirical study. In Eye Gaze in Intelligent User Interfaces, pages 23–39. Springer.

Susann LuperFoy. 1992. The representation of multimodal user interface dialogues using discourse pegs. In ACL, pages 22–31.

Thomas Marrinan, Jillian Aurisano, Arthur Nishimoto, Krishna Bharadwaj, Victor Mateevitsi, Luc Renambot, Lance Long, Andrew Johnson, and Jason Leigh. 2014. SAGE2: A new approach for data intensive collaboration using scalable resolution shared displays. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), pages 177–186. IEEE.

Costanza Navarretta. 2011. Anaphora and gestures in multimodal communication. In Proceedings of the 8th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2011), Faro, Portugal, Edicoes Colibri, pages 171–181.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1):71–105, March.

Martha Palmer. 2009. SemLink: Linking PropBank, VerbNet and FrameNet. In Proceedings of the Generative Lexicon Conference, pages 9–15.

Zahar Prasov and Joyce Y. Chai. 2008. What's in a gaze?: The role of eye-gaze in reference resolution in multimodal conversational interfaces. In Proceedings of the 13th International Conference on Intelligent User Interfaces, IUI '08, pages 20–29, New York, NY, USA. ACM.

S. Qu and J. Chai. 2008. Beyond attention: The role of deictic gesture in intention recognition in multimodal conversational interfaces. In ACM 12th International Conference on Intelligent User Interfaces (IUI).

Norbert Reithinger, Dirk Fedeler, Ashwani Kumar, Christoph Lauer, Elsa Pecourt, and Laurent Romary. 2005. MIAMM - a multimodal dialogue system using haptics. In Advances in Natural Multimodal Dialogue Systems, pages 307–332. Springer.

George Robertson, Mary Czerwinski, Patrick Baudisch, Brian Meyers, Daniel Robbins, Greg Smith, and Desney Tan. 2005. The large-display user experience. IEEE Computer Graphics and Applications, 25(4):44–51.

Vidya Setlur, Sarah E. Battersby, Melanie Tory, Rich Gossweiler, and Angel X. Chang. 2016. Eviza: A natural language interface for visual analysis. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pages 365–377. ACM.

Melinda Sinclair. 1992. The effects of context on utterance interpretation: Some questions and some answers. Stellenbosch Papers in Linguistics, 25.

Yiwen Sun, Jason Leigh, Andrew Johnson, and Barbara Di Eugenio. 2013. Articulate: Creating meaningful visualizations. Innovative Approaches of Data Visualization and Visual Analytics, page 218.

Trifacta. 2014. Vega: A Visualization Grammar. https://vega.github.io/vega/.

Petra Wagner, Zofia Malisz, and Stefan Kopp. 2014. Gesture and speech in interaction: An overview. Speech Communication, 57:209–232.