
Saliency-based Identification and Recognition of Pointed-at Objects

Boris Schauerte, Jan Richarz and Gernot A. Fink

Abstract— When persons interact, non-verbal cues are used to direct the attention of persons towards objects of interest. Achieving joint attention this way is an important aspect of natural communication. Most importantly, it allows coupling verbal descriptions with the visual appearance of objects if the referred-to object is non-verbally indicated. In this contribution, we present a system that utilizes bottom-up saliency and pointing gestures to efficiently identify pointed-at objects. Furthermore, the system focuses the visual attention by steering a pan-tilt-zoom camera towards the object of interest and thus provides a suitable model view for SIFT-based recognition and learning. We demonstrate the practical applicability of the proposed system through an experimental evaluation in different environments with multiple pointers and objects.

Index Terms— Saliency; Joint Attention; Pointing Gestures; Object Detection and Learning; Active Pan-Tilt-Zoom Camera

I. INTRODUCTION

Attention is the cognitive process of focusing the processing of sensory information onto salient data, i.e. data that is likely to render objects of interest. When persons interact, non-verbal attentional signals – most importantly pointing (cf. [1]) and gazing (cf. [2]) – are used in order to establish a joint focus of attention. Human infants develop the ability to interpret such non-verbal signals around the age of one year, enabling them to associate verbal descriptions with the visual appearance of objects (cf. [3], [4]). There is strong evidence “that joint attention reflects mental and behavioral processes in human learning and development” [5]. Accordingly, interpreting attentional signals to establish a joint focus of attention is an important aspect of human-machine interaction (HMI), e.g. in human-robot interaction or smart environments.

In HMI, non-verbal attentional signals can be used to influence either the attention of the human or the machine (e.g. [6] and [7], respectively). In interactive scenarios, non-verbal signals can be used to explicitly direct the attention either towards a general direction, e.g. “(go) there”, or towards a specific object, e.g. “(take) that”. In this contribution, we first introduce a model that combines bottom-up saliency with top-down deictic information to identify non-verbally referred-to objects. Then, we present an implementation that is able to efficiently identify and recognize pointed-at objects. Since information about the object appearance may not be available from the conversational context, specialized object detectors and object recognition cannot be applied in general to identify the referred-to object. Therefore, we base our model on a computational bottom-up model of attention to identify regions in the image that are likely to render objects of interest (cf. e.g. [8], [9]). In order to estimate the position of the referred-to object, we combine the visual saliency with the directional information obtained from a pointing gesture. The foundation of this model is the assumption that body posture, non-verbal signals, and deictic expressions are (subconsciously) chosen so as to maximize the expected saliency of the referred-to object in the perception of the interaction partners. In contrast to pointing gestures, the recognition of more inconspicuous non-verbal signals such as gaze and facial expressions requires close-up views of the respective person's face, which might not be available in realistic scenarios. Therefore, in this contribution we focus on pointing gestures, as these can be robustly identified even in low-resolution images taken from a distance.

B. Schauerte and J. Richarz are with the Robotics Research Institute, TU Dortmund, 44221 Dortmund, Germany {boris.schauerte,jan.richarz}@tu-dortmund.de

G. A. Fink is with the Department of Computer Science, TU Dortmund, 44221 Dortmund, Germany [email protected]

The work of B. Schauerte was supported by a fellowship of the TU Dortmund excellence programme. He is now affiliated with the Institute for Anthropomatics, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany [email protected]

Fig. 1. A person points toward a specific region of interest. Due to the low resolution of the image, the content of the referred-to region can neither be reliably recognized nor learned. Using bottom-up saliency and the top-down information of the pointing gesture, it is possible to estimate the referred-to region. By foveating the region with a pan-tilt-zoom camera, an appropriate view for recognition and learning tasks can be acquired.

II. RELATED WORK

In recent years, computational attention models have attracted increasing interest in the field of robotics (e.g. [8], [10]) and various other application areas (e.g. [11]–[13]). Assuming that interesting objects are visually salient (cf. [9]), the aim of these models is to concentrate the available computational resources by directing the attention towards potentially relevant information, e.g. to achieve efficient scene analysis [14]. In general, saliency models can be distinguished as either object-based (e.g. [15], [16]) or space-based models (cf. [15], [8, Sec. II]). In contrast to the (traditional) theory of space-based attention, object-based attention suggests that visual attention can directly select distinct objects rather than only continuous spatial locations within the visual field. Recently, saliency models based on the phase spectrum [17], [18] have attracted increasing interest (e.g. applied in [14]). These models exploit the well-known effect that spectral whitening of signals will “accentuate lines, edges and other narrow events without modifying their position” [19, Sec. III]. The use of saliency models for robotics applications – i.e. shifting the focus of attention for efficient scene exploration (e.g. [8], [10]) and analysis (e.g. [14], [20], [21]) – has received growing attention in recent years. [20] applied the attentional shift to detect, recognize and learn objects using SIFT (cf. [22]; see [23]) in static images. For active scene exploration, saliency can be used to steer the sensors towards salient – thus potentially relevant – regions to detect objects of interest (e.g. [8], [10], [12]). Combining these methods, [14] utilized bottom-up attention, stereo vision and SIFT to perform robust and efficient scene analysis on a mobile robot. Similarly, salient objects are detected and foveated for recognition in [21]. However, the latter systems rely on an object database for recognition and do not learn new objects in a natural way as our proposed system does. Most similar to our system, [24] used bottom-up attention to detect and SIFT to recognize objects, but new objects have to be placed directly in front of the (static) stereo vision camera in order to learn them.

Since establishing joint attention is an important factor in human-human and human-machine interaction, it has been an active research area throughout the years. Accordingly, we can only present a brief overview of the state-of-the-art psychological and computational literature. In general, two main research areas can be distinguished: the development of joint attention (e.g. [4], [25]–[27]) and its role in natural communication (e.g. [1], [2], [28]–[30]). In (spoken) human-robot interaction (HRI), computational models of joint attention can be used to direct the attention of human (e.g. [6], [25], [30]) or artificial dialog partners (e.g. [27], [30]). As we are mainly interested in computational methods to control the attention of a robot, we focus on that aspect in the following. Gaze following in particular has been researched intensively as a means to achieve joint attention (e.g. [26], [27], [31]). For instance, [26] and [27] used constructive models to learn gaze-based joint attention. Furthermore, [31] developed a saliency-based probabilistic model of gaze imitation and shared attention. A specialized scenario is considered in [32], in which the addressee – i.e. the focus of attention – in multi-person interaction scenarios is identified through head pose estimation. Although it has been shown “that pointing helps establish a joint focus of attention” [1], surprisingly little work exists that explicitly addresses shared attention with pointing gestures. Notably, Sugiyama et al. (see, e.g., [29] and [30]) developed a model to draw the attention of humans as well as robots in HRI using pointing gestures and verbal cues. However, they rely on motion capturing techniques that require markers attached to the persons in order to recognize human pointing gestures, which limits the practical applicability of their system.

Visually recognizing pointing gestures (e.g. [33]–[35]) and estimating the referred-to target has been addressed by several authors in recent years, aiming at applications in robotics (e.g. [36]–[39]), smart environments (e.g. [40]) and wearable visual interfaces (e.g. [7], [41]). Object references may be established by proximity in the image space (e.g. [42]) or by tracing an estimate of the pointing direction. An often utilized approach is to calculate the direction as the line of sight between the eyes and the pointing hand or finger (e.g. [37], [40]). In [39], three different approaches were evaluated, and the line-of-sight model was reported to be the best. It achieves a good approximation when the pointing arm is extended, but is less suitable for finger pointing or pointing with a bent arm. In fact, pointing is inherently inaccurate, as discussed in [43]. The authors estimated that a human pointing gesture has an angular uncertainty of approximately 10°, which should be taken into account when inferring the pointing target. This naturally leads to the definition of a corridor of attention in which referred-to targets are expected, which also allows searching for targets outside the camera's initial field of view. However, all mentioned systems have in common that the positions and/or the (simplified) appearance of the target objects in the scene are known beforehand. Thus, these systems can hardly handle environments in which the objects, their locations, or the viewpoint change (cf. [30, Sec. II]). In contrast, we aim at using pointing to steer the attention towards arbitrary – including unknown or unspecified – objects.

III. MODEL

Analyzing the process of identifying referred-to objects and establishing a shared focus of attention using non-verbal signals only leads to a sequence of subtasks. Accordingly, we model this process as a sequence of stages, which we describe in the following. First, the (attentional) behavior of the observed person has to be interpreted in order to detect the occurrence of non-verbal attentional signals (cf. [4, Sec. 2.2.2]), e.g. pointing gestures as in our implementation.

Next, the referred-to object has to be identified. To this end, we determine the indicated direction, e.g. the pointing direction, and estimate the probability that a specific image region has been addressed. The probability distribution has to reflect the ambiguity and vagueness of object references, which may be caused by the natural imprecision of the indicated direction. As the addressed object may be unknown to the observer, or no further knowledge about its appearance may be available from the (conversational) context, specialized object detectors cannot be applied in general. Therefore, we apply bottom-up saliency to detect potential regions of interest without prior knowledge about objects. By combining the bottom-up saliency with the top-down information acquired from the pointing gesture, we are then able to infer a hypothetical position of the referred-to object. In order to identify the addressed object, its shape has to be extracted via figure-ground segmentation, which is implicit in object-based saliency models. In our experience, this two-phase model, which decouples pointing gesture recognition from the identification of the referred-to object, is advantageous because the object may lie outside the field of view.

Finally, the referred-to object has to be visually focused. In natural systems, this serves two purposes: firstly, the change in body posture and gaze direction acts as feedback to the indicating person, who can use this feedback to refine the attentional signal. Secondly, the foveation enhances the visual perception of the object, which supports object recognition and learning. In artificial systems using pan-tilt-zoom cameras, we emulate the former by actively steering the camera to center the referred-to object in the field of view. This also forms the basis for acquiring detailed views of the object via zooming. Furthermore, if the object cannot be identified unambiguously due to the presence of distracting objects, an iterative shift of attention can be applied, sequentially focusing on different objects of interest.

Fig. 2. Left to right: the input image, bottom-up saliency map, top-down pointing saliency map, and attended object.

IV. REALIZATION

In this section, we present our implementation of the described model with pointing gestures as the attentional signal. First, we describe the attention process, consisting of pointing detection, (space-based as well as object-based) saliency calculation, and object selection. Then, we describe how we foveate the referred-to objects, and how we recognize known and learn new objects with SIFT using the foveated views.

Note that we do not discuss the problem of providing appropriate views for recognizing the attentional signals of the interacting person. Instead, we assume that the available views are sufficient to detect pointing gestures. Nevertheless, in our experience, steering the cameras towards visually (e.g. [8]) or audio-visually (e.g. [10], [12]) salient regions is a natural and efficient method to detect and focus on interacting persons.

A. Saliency-based Object Detection and Selection

1) In order to detect pointing gestures, we use a modified version of the system presented in [40]. Most importantly, we replaced the face detector with a head-shoulder detector based on histograms of oriented gradients, which – in practice – is more robust and less view-dependent. This system robustly estimates a pointing direction from an image sequence. Furthermore, it offers real-time responsiveness, user independence and robustness against changing environmental conditions. The output for each frame i is a detection rectangle d_i around the person's head and a list of hand position hypotheses h_{i,j}. From these, using the well-established line-of-sight model, the pointing direction hypotheses ō_{i,j} can easily be calculated as ō_{i,j} = o_{i,j}/||o_{i,j}|| with o_{i,j} = h_{i,j} − d̄_i, where d̄_i is the center point of d_i. To detect the occurrence of pointing gestures, we characterize them through the inherent holding phase of the pointing hand. Accordingly, temporally stable pointing hypotheses have to be identified. To this end, we calculate the angle difference Δα^{j,k}_{i,i+1} = arccos(ō_{i,j} · ō_{i+1,k}) between pairs of pointing hypotheses ō_{i,j}, ō_{i+1,k} in succeeding frames i and i+1, as well as the length of their difference vector Δl^{j,k}_{i,i+1} = ||ō_{i,j} − ō_{i+1,k}||. The angle differences and vector lengths are used to group the pointing hypotheses over time. Sufficiently large temporal clusters are selected as pointing occurrences ō_i. In this process, we exclude pointing hypotheses alongside the body, because – without further assumptions about the pointing person and viewpoint – pointing gestures cannot be distinguished reliably from idle arms hanging alongside the body.
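As an illustration of this step, the following Python sketch computes line-of-sight pointing hypotheses and checks the temporal stability of a pair of hypotheses from succeeding frames; the function names and thresholds are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pointing_hypotheses(head_center, hand_positions):
    """Line-of-sight pointing directions o_{i,j} = h_{i,j} - d_i (normalized).

    head_center: (2,) center of the head-shoulder detection rectangle d_i.
    hand_positions: (J, 2) hand position hypotheses h_{i,j} in image coordinates.
    Returns an array of unit direction vectors.
    """
    o = np.asarray(hand_positions, float) - np.asarray(head_center, float)
    return o / np.linalg.norm(o, axis=1, keepdims=True)

def temporally_stable(o_a, o_b, max_angle_deg=5.0, max_diff_len=0.1):
    """Check whether two unit hypotheses from succeeding frames belong to the
    same holding phase: small angle difference and small length of the
    difference vector. Thresholds are illustrative assumptions."""
    angle = np.degrees(np.arccos(np.clip(np.dot(o_a, o_b), -1.0, 1.0)))
    return angle <= max_angle_deg and np.linalg.norm(o_a - o_b) <= max_diff_len
```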

As pointing is inherently inaccurate and further influenced by the positional uncertainty of the head and hand hypotheses, we calculate a corridor of attention around the mean direction of a selected group as follows: the head-shoulder detection windows tend to shift slightly – horizontally, vertically, and in size – due to image noise and detection jitter. We therefore calculate a running mean s of the detection rectangle's size over the last frames (we omit the frame indices in the following for better readability) and model the positional uncertainty of the eyes by a Normal density around the detection center, with one quarter of s being covered by 2σ_e, hence p_e(x|d) = N(d̄, σ_e²) with σ_e = s/8. This models the system-inherent uncertainty caused by estimating the eye position from the detection rectangle. Furthermore, two additional sources of uncertainty can be identified: the variation in size of the head detection rectangle, and the uncertainty of the estimated pointing direction o due to shifts in the head and hand detection centers. We treat these as independent Gaussian noise components and estimate their variances σ_s² and σ_o², respectively, from the observed data. Note that σ_e² and σ_s² are variances over positions, whereas σ_o² is a variance over angles, so they cannot be combined directly. However, we can approximately transform a positional into an angular variance by normalizing by the length r = ||o|| of the pointing vector. Thus, the combined distribution becomes a distribution over angles: p(α(x, o)|d, o) = N(0, σ_c²), with α(x, o) being the angle between the pointing direction o and the vector from the pointing origin to the point x, and σ_c² ≈ σ_e²/r² + σ_s²/r² + σ_o². This distribution represents the probability p_G(x) that a point x in the image plane was referred to by the pointing gesture, given the current head-shoulder detection d and the pointing direction o, and thus defines our corridor of attention. To account for the findings in [43], we set a lower bound of 3°, i.e. σ_c = max(3°, σ_c), so that 99.7% (corresponding to 3σ) of the distribution's probability mass covers at least a corridor of 9°. Modeling the corridor of attention as a distribution over angles takes into account that the positional uncertainty increases with the distance to the pointer.
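A minimal sketch of this corridor of attention, assuming the standard deviations σ_e, σ_s (positional, in pixels) and σ_o (angular) have already been estimated as described above; the function evaluates an unnormalized p_G(x) for a single image point, and the names are chosen for illustration only.

```python
import numpy as np

def corridor_probability(x, origin, o, sigma_e, sigma_s, sigma_o_rad):
    """Unnormalized p_G(x): Normal density over the angle between the pointing
    direction o and the vector from the pointing origin to image point x.

    origin: pointing origin (e.g. the estimated eye position), shape (2,).
    o: pointing vector in the image plane (not normalized), shape (2,).
    sigma_e, sigma_s: positional standard deviations (pixels).
    sigma_o_rad: angular standard deviation of the pointing direction (radians).
    """
    r = np.linalg.norm(o)                          # length of the pointing vector
    sigma_c = np.sqrt((sigma_e**2 + sigma_s**2) / r**2 + sigma_o_rad**2)
    sigma_c = max(np.radians(3.0), sigma_c)        # lower bound of 3 degrees
    v = np.asarray(x, float) - np.asarray(origin, float)
    cos_a = np.dot(v, o) / (np.linalg.norm(v) * r)
    alpha = np.arccos(np.clip(cos_a, -1.0, 1.0))   # angle alpha(x, o)
    return np.exp(-0.5 * (alpha / sigma_c) ** 2)   # N(0, sigma_c^2), unnormalized
```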

2) The bottom-up saliency calculation is inspired by the phase-based approach presented in [17]. However, we perform pure spectral whitening (cf. [19]), because the spectral residual is negligible in most situations (cf. [18]). An advantage of this approach is that it can be implemented efficiently on multi-core CPUs and modern GPUs. The spectral whitening saliency is calculated for intensity, red-green opponency, and blue-yellow opponency. The corresponding conspicuity maps C_i are normalized to [0, 1] and interpreted as probabilities that a pixel attracts the focus of attention (FoA). We then calculate the bottom-up saliency map S_b as the mixture density S_b = Σ_{i=1}^{N} w_i·C_i with uniform weights w_i = 1/N.
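The following sketch illustrates the spectral whitening step for a single feature channel and the uniform mixture of the conspicuity maps; it assumes NumPy/SciPy as the implementation substrate, and the smoothing bandwidth is an illustrative choice rather than a value from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spectral_whitening_saliency(channel, blur_sigma=3.0):
    """Saliency of one feature channel (intensity or a color opponency map):
    keep only the phase spectrum, discard the amplitude, transform back,
    square, smooth, and normalize to [0, 1]."""
    spectrum = np.fft.fft2(channel)
    whitened = np.fft.ifft2(np.exp(1j * np.angle(spectrum)))   # unit amplitude
    conspicuity = gaussian_filter(np.abs(whitened) ** 2, blur_sigma)
    conspicuity -= conspicuity.min()
    return conspicuity / (conspicuity.max() + 1e-12)

def bottom_up_saliency(channels):
    """Mixture of the conspicuity maps C_i with uniform weights w_i = 1/N."""
    maps = [spectral_whitening_saliency(c) for c in channels]
    return sum(maps) / len(maps)
```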

In our space-based saliency model, we first calculate a top-down saliency map S_t based on the pointing gesture, which is defined as the probability p_G(x) for each pixel x in the image. This, in effect, defines a blurred cone emitted from the hand along the pointing direction in the image plane. The final saliency map S is obtained by calculating the joint probability S(x) = S_b(x)·S_t(x). The position of the FoA is then determined as x_FoA = argmax_x S(x). To determine the underlying object O, we segment the image with the maximally stable extremal regions (MSER) algorithm [44] and select the segment closest to the FoA. In the selection process, we exclude segments with a high spatial variance, because these segments are likely to represent background.
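A compact sketch of this space-based fusion, assuming the bottom-up map S_b and the pointing map S_t have already been computed on the same pixel grid (e.g. with the sketches above):

```python
import numpy as np

def space_based_foa(S_b, S_t):
    """Combine bottom-up and top-down (pointing) saliency as a per-pixel joint
    probability S(x) = S_b(x) * S_t(x) and return the focus of attention."""
    S = S_b * S_t
    y, x = np.unravel_index(np.argmax(S), S.shape)
    return (x, y), S   # FoA position (column, row) and the combined map
```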

For our object-based saliency model, a saliency value is calculated for each object. To this end, we apply the MSER algorithm (as above) to segment the scene and then calculate a saliency value – based on the bottom-up saliency – for each segment O_k as the average probability that a pixel in the segment attracts the focus of attention, i.e. S_b(O_k) = (1/|O_k|) Σ_{x∈O_k} S_b(x). The top-down saliency of each object is calculated as the probability that the object's center x_c is referred to by the pointing gesture, i.e. S_t(O_k) = p_G(x_c). The combined saliency of each object is then calculated according to the joint probability S(O_k) = S_b(O_k)·S_t(O_k). Finally, the object with the maximal saliency is selected, i.e. argmax_k S(O_k).
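The object-based selection can be sketched as follows; boolean segment masks stand in for the filled MSER regions, and p_G is the pointing probability from the corridor-of-attention sketch above (the data structures are assumptions for illustration):

```python
import numpy as np

def object_based_selection(S_b, segment_masks, p_G):
    """Select the referred-to segment: the bottom-up score is the mean pixel
    saliency inside the segment, the top-down score is p_G evaluated at the
    segment center, and both are combined as a joint probability."""
    best_k, best_score = -1, -np.inf
    for k, mask in enumerate(segment_masks):        # one boolean mask per MSER
        ys, xs = np.nonzero(mask)
        s_b = S_b[ys, xs].mean()                    # S_b(O_k)
        s_t = p_G((xs.mean(), ys.mean()))           # S_t(O_k) = p_G(x_c)
        if s_b * s_t > best_score:
            best_k, best_score = k, s_b * s_t
    return best_k
```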

To allow an iterative shift of attention, we implemented an inhibition-of-return mechanism for each model. For the space-based saliency, a 2-D Gaussian weight function is subtracted from the saliency map. The center of the Gaussian is located at the position of the selected FoA, and the variance is estimated from the spatial dimensions of the selected object. For the object-based model, we inhibit the selection of already attended objects by setting their saliency to 0.
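A sketch of the space-based inhibition-of-return step, assuming the Gaussian's standard deviations are derived from the attended object's width and height (the exact scaling is an assumption):

```python
import numpy as np

def inhibit_return(S, foa_xy, sigma_x, sigma_y):
    """Suppress the attended region: subtract a 2-D Gaussian centered at the
    selected FoA from the saliency map and clip at zero."""
    h, w = S.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - foa_xy[0]) ** 2 / (2 * sigma_x ** 2)
                 + (ys - foa_xy[1]) ** 2 / (2 * sigma_y ** 2)))
    return np.clip(S - g, 0.0, None)
```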

B. Object Foveation, Recognition and Learning

To attend the selected object, we center it in the view and use the camera's zoom to acquire a model view in which the object fills most of the image (cf. Fig. 2). We estimate the necessary zoom factor based on the approximate object dimensions obtained through the segmentation.

Fig. 3. The objects used in the evaluation.

Fig. 4. Examples of pointing gestures performed in the evaluation.

In order to recognize and learn objects, we calculate SIFT features (cf. [22]; see [23]) for the acquired model views. These features are matched against a database, which is built by storing previous model views and their associated SIFT features. The generalized Hough transform is applied to decide upon the presence of an object (cf. [23]). To this end, the matches are grouped into accumulator cells according to an object pose hypothesis, i.e. position, rotation and scale, which can be estimated from the matched SIFT features. Finally, for accumulator cells with sufficient matches to estimate the assumed transformation, RANSAC (cf. [23]) is applied to determine the pose of the detected object. New model views, i.e. objects, and their SIFT features are stored in the database, which additionally links specified object identifiers, e.g. text or speech signals, to the models.
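As an illustration of this matching pipeline, the following OpenCV-based sketch recognizes a stored model view in a foveated query view. For brevity it replaces the generalized-Hough pose grouping with Lowe's ratio test followed by a RANSAC homography fit, so it is a simplified stand-in rather than the authors' exact procedure (it assumes opencv-python with SIFT available).

```python
import cv2
import numpy as np

def recognize_model(query_img, model_img, min_matches=10):
    """Match SIFT features of a foveated query view against one stored model
    view and verify the object pose with RANSAC. Returns the estimated
    homography, or None if the model is not found."""
    sift = cv2.SIFT_create()
    kq, dq = sift.detectAndCompute(query_img, None)
    km, dm = sift.detectAndCompute(model_img, None)
    if dq is None or dm is None:
        return None
    knn = cv2.BFMatcher().knnMatch(dq, dm, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]  # ratio test
    if len(good) < min_matches:
        return None
    src = np.float32([kq[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([km[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None or inlier_mask.sum() < min_matches:
        return None
    return H
```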

V. EXPERIMENTAL EVALUATION

A. Setup

Since joint attention is especially important for natural HRI with humanoid robots, we mounted a monocular Sony EVI-D70P pan-tilt-zoom camera at the eye height of an averagely tall human to reflect a human-like point of view (cf. [45]). The pan-tilt unit has an angular resolution of roughly 0.75°, and the camera offers up to 18× optical zoom.

To assess the performance of the proposed system, we collected a data set containing 3 persons pointing at 27 objects. We limited the number of pointing persons because we do not evaluate the performance of the pointing detector; instead, we focus on the object detection and recognition capabilities. Accordingly, we tested the system in two natural environments with a large set of objects of different categories, shapes, and textures (cf. Fig. 3). As evaluation scenarios, we chose a conference room and a cluttered office environment.

B. Procedure and Measures

Each person performed several pointing sequences, with varying numbers and types of objects present in the scene.


Fig. 5. Some objects with their database matches.

We neither restricted the body posture in which the subjects had to perform the pointing gestures, nor did we define fixed positions for the objects and persons. The only restriction imposed was that the subjects were instructed to point with their arms extended, so that the pointing gestures would comply with the employed line-of-sight model. In order to evaluate the ability of the iterative shift of attention to focus the correct object in the presence of distractors, we occasionally arranged clusters of objects so that the object reference would be ambiguous. The data set accordingly contains a wide variety of pointing references (see Fig. 4). Since we do not specifically evaluate the pointing detector (cf. [40]), we discard cases with erroneous pointing gesture detections. In total, our evaluation set thus contains 220 object references.

In order to evaluate the saliency-based identification of referred-to objects, we calculate the proportion of true (True Ref.), false (False Ref.), and missed object references (Missed Ref.). In addition, to evaluate the iterative shift of attention, we calculate 10 shifts of the FoA. Thus, we are able to identify several object reference hypotheses for each pointing gesture, sorted by the selection order. As evaluation measure, we calculate the cumulative percentage of correctly identified object references after N attentional shifts (Nth FoA Shift). We report these measures separately for the two chosen environments (Conference, Office) and for the object-based (obj-b.) and space-based (spc-b.) saliency models.

To assess the performance of the object foveation for recognition and learning, we use the foveated object views (which were acquired with the object-based saliency model) to build a SIFT database for each sequence. We then match the acquired images of each sequence with the SIFT databases of all other sequences, and report the percentage of true, false, and missed object matches. However, we do not perform an in-depth evaluation of the SIFT-based object recognition, because it has already been reported to work well in various systems (cf. Sec. II).

C. Results

Tab. I summarizes the results of the identification of pointed-at objects. In both environments, the space-based saliency model yields superior results (83.2% overall accuracy compared to 78.6%). Interestingly, the advantage of the space-based over the object-based saliency is considerably larger in the office (84.5% vs. 76.7%) than in the conference room (81.7% vs. 80.8%). This can be explained by the fact that the office environment is more challenging due to a higher amount of background clutter. Accordingly, the increased complexity of the scene has a considerable influence on the phase-based saliency and the segmentation, which are both fundamental for the object-based saliency. Nevertheless, in all scenarios most errors result from false references, i.e. wrong objects being selected first. In many cases, this happens because of ambiguous references due to closely neighboring objects and/or the missing depth information of the monocular camera. In fact, many of these ambiguities are hard to resolve from the images even for human observers. Consequently, we performed a complementary experiment to assess the influence of ambiguous references, asking several humans to identify the referred-to object in the recorded images. Interestingly, the participants were only able to identify the correct object in about 87% of the images.

TABLE I
DETECTION RESULTS OF REFERRED-TO OBJECTS (IN %)

                  Conference        Office           Overall
                  obj-b.  spc-b.    obj-b.  spc-b.   obj-b.  spc-b.
True Ref.          80.8    81.7      76.7    84.5     78.6    83.2
False Ref.         16.4    18.3      19.8    13.8     18.2    15.9
Missed Ref.         2.9     0.0       3.5     1.7      3.2     0.9
1st FoA Shift      95.2    93.3      90.5    94.8     92.7    94.1
2nd FoA Shift      95.2    97.1      93.1    97.4     94.1    97.3
3rd FoA Shift      95.2    98.1      94.0    97.4     94.6    97.7

In human-human interaction, such ambiguities are mostly resolved using the feedback of the interaction partner to adapt and shift the FoA. In the absence of this feedback, we have to rely on the implemented inhibition-of-return mechanisms to focus the referred-to object. On average, 0.22 shifts for the object-based saliency and 0.14 shifts for the space-based saliency were needed until the correct object was selected. The cumulative percentages of correct references for up to three attentional shifts are shown in Tab. I. Accordingly, the correct object was almost always among the first two candidates. Using more than three shifts did not improve the results further; after three shifts, 94.6% and 97.7% of the references were detected correctly for the object-based and space-based model, respectively. These results are promising for unsupervised learning tasks, because the probability that the addressed object is contained in a limited set of attended objects is very high.

The pairwise matching of the SIFT databases yielded an overall percentage of 93.1% correct matches. For 5.5% of the images, no matching object was found, and 1.3% of the object matches were false matches. These misses are mainly caused by inappropriate viewing angles and glossy object surfaces in combination with unfavorable lighting. Since the acquired model views are detailed and usually contain a low amount of background clutter (cf. Fig. 5), the object recognition is robust and reliable. Furthermore, we also performed informal experiments to assess the recognition rates of the learned objects without foveation and, as expected, obtained considerably lower recognition rates.


VI. CONCLUSION

In this contribution, we introduced a saliency-based model to focus the attention on referred-to objects using non-verbal cues only. As a consequence of the saliency-based approach, no prior knowledge about the referred-to object is required, and we are thus able to identify unspecified or even unknown objects. In order to identify the referred-to objects, we combine the top-down information of a pointing gesture with bottom-up saliency. To this end, we apply a space-based as well as an object-based saliency model based on spectral whitening. Furthermore, we foveate the identified object by exploiting the pan-tilt-zoom capabilities of our monocular camera setup. In doing so, we obtain detailed model views and are able to build a high-quality SIFT object database for object recognition.

Experiments in different real-world environments, with multiple pointing persons and a variety of referenced objects, demonstrate the feasibility of the overall approach. Additionally, the accuracy achieved in detecting the correct referred-to objects and the quality of the learned recognition models show that our approach can be successfully applied to the challenging problem of identifying and learning previously unknown objects on the basis of bottom-up saliency and gesture information alone.

REFERENCES

[1] M. Louwerse and A. Bangerter, “Focusing attention with deictic gestures and linguistic expressions,” in Proc. Ann. Conf. Cog. Sci. Soc., 2005.
[2] S. Bock, P. Dicke, and P. Thier, “How precise is gaze following in humans?” Vis. Res., vol. 48, pp. 946–957, 2008.
[3] M. Tomasello, Constructing a Language: A Usage-Based Theory of Language Acquisition. Harvard University Press, 2003.
[4] F. Kaplan and V. Hafner, “The challenges of joint attention,” Interaction Studies, vol. 7, pp. 135–169, 2006.
[5] P. Mundy and L. Newell, “Attention, joint attention, and social cognition,” Curr. Dir. Psychol. Sci., vol. 16, pp. 269–274, 2007.
[6] M. Staudte and M. W. Crocker, “Visual attention in spoken human-robot interaction,” in HRI, 2009.
[7] G. Heidemann, R. Rae, et al., “Integrating context-free and context-dependent attentional mechanisms for gestural object reference,” Mach. Vis. Appl., vol. 16, pp. 64–73, 2004.
[8] N. Butko, L. Zhang, et al., “Visual saliency model for robot cameras,” in ICRA, 2008.
[9] L. Elazary and L. Itti, “Interesting objects are visually salient,” J. Vis., vol. 8, pp. 1–15, 2008.
[10] J. Ruesch, M. Lopes, et al., “Multimodal saliency-based bottom-up attention: A framework for the humanoid robot iCub,” in ICRA, 2008.
[11] K. Gao, S. Lin, et al., “Attention model based SIFT keypoints filtration for image retrieval,” in Proc. Int. Conf. Comp. Inf. Sci., 2008.
[12] B. Schauerte, J. Richarz, et al., “Multi-modal and multi-camera attention in smart environments,” in Proc. Int. Conf. Multimodal Interfaces, 2009.
[13] S. Frintrop, E. Rome, and H. I. Christensen, “Computational visual attention systems and their cognitive foundation: A survey,” ACM Trans. Applied Perception, vol. 7, 2010.
[14] D. Meger, P.-E. Forssen, et al., “Curious George: An attentive semantic robot,” in IROS Workshop: From Sensors to Human Spatial Concepts, 2007.
[15] Y. Sun and R. Fisher, “Object-based visual attention for computer vision,” Artificial Intelligence, vol. 146, pp. 77–123, 2003.
[16] F. Orabona, G. Metta, and G. Sandini, “Object-based visual attention: a model for a behaving robot,” in CVPR Workshop: Attention and Performance in Computational Vision, 2005.
[17] X. Hou and L. Zhang, “Saliency detection: A spectral residual approach,” in CVPR, 2007.
[18] C. Guo, Q. Ma, and L. Zhang, “Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform,” in CVPR, 2008.
[19] A. Oppenheim and J. Lim, “The importance of phase in signals,” Proceedings of the IEEE, vol. 69, pp. 529–541, 1981.
[20] D. Walther, U. Rutishauser, et al., “Selective visual attention enables learning and recognition of multiple objects in cluttered scenes,” Comp. Vis. Image Understand., vol. 100, pp. 41–63, 2005.
[21] K. Welke, T. Asfour, and R. Dillmann, “Active multi-view object search on a humanoid head,” in ICRA, 2009.
[22] T. Tuytelaars and K. Mikolajczyk, “Local invariant feature detectors: A survey,” Found. Trends Comp. Graph. Vis., vol. 3, pp. 177–280, 2007.
[23] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comp. Vis., vol. 60, pp. 91–110, 2004.
[24] D. Figueira, M. Lopes, et al., “From pixels to objects: Enabling a spatial model for humanoid social robots,” in ICRA, 2009.
[25] M. Doniec, G. Sun, and B. Scassellati, “Active learning of joint attention,” in Humanoids, 2006.
[26] Y. Nagai, K. Hosoda, et al., “A constructive model for the development of joint attention,” Connection Sci., vol. 15, pp. 211–229, 2003.
[27] J. Triesch, C. Teuscher, et al., “Gaze following: why (not) learn it?” Dev. Sci., vol. 9, pp. 125–147, 2006.
[28] H. Kozima and A. Ito, “Towards language acquisition by an attention-sharing robot,” in New Methods in Language Processing and Computational Natural Language Learning, D. Powers, Ed. ACL, 1998, pp. 245–246.
[29] O. Sugiyama, T. Kanda, et al., “Three-layered draw-attention model for communication robots with pointing gesture and verbal cues,” J. Robot. Soc. Japan, vol. 24, pp. 964–975, 2006.
[30] ——, “Natural deictic communication with humanoid robots,” in IROS, 2007.
[31] M. W. Hoffman, D. B. Grimes, et al., “A probabilistic model of gaze imitation and shared attention,” Neural Networks, vol. 19, pp. 299–310, 2006.
[32] M. Katzenmaier, R. Stiefelhagen, and T. Schultz, “Identifying the addressee in human-human-robot interactions based on head pose and speech,” in Proc. Int. Conf. Multimodal Interfaces, 2004.
[33] R. Kehl and L. Van Gool, “Real-time pointing gesture recognition for an immersive environment,” in Proc. Int. Conf. Automatic Face and Gesture Rec., 2004.
[34] C.-Y. Chien, C.-L. Huang, and C.-M. Fu, “A vision-based real-time pointing arm gesture tracking and recognition system,” in Proc. Int. Conf. Multimedia and Expo, 2007.
[35] C.-B. Park, M.-C. Roh, and S.-W. Lee, “Real-time 3D pointing gesture recognition in mobile space,” in Proc. Int. Conf. Automatic Face and Gesture Rec., 2008.
[36] E. Sato, T. Yamaguchi, and F. Harashima, “Natural interface using pointing behavior for human-robot gestural interaction,” IEEE Trans. Ind. Electron., vol. 54, pp. 1105–1112, 2007.
[37] J. Schmidt, N. Hofemann, et al., “Interacting with a mobile robot: Evaluating gestural object references,” in IROS, 2008.
[38] Y. Tamura, M. Sugi, et al., “Target identification through human pointing gesture based on human-adaptive approach,” J. Robot. Mechatron., vol. 20, pp. 515–525, 2008.
[39] K. Nickel and R. Stiefelhagen, “Visual recognition of pointing gestures for human-robot interaction,” Image Vis. Comp., vol. 25, pp. 1875–1884, 2007.
[40] J. Richarz, T. Plotz, and G. A. Fink, “Real-time detection and interpretation of 3D deictic gestures for interaction with an intelligent environment,” in ICPR, 2008.
[41] Y. Jia, S. Li, and Y. Liu, “Tracking pointing gesture in 3D space for wearable visual interfaces,” in Proc. Int. Workshop on Human-Centered Multimedia, 2007.
[42] N. Hofemann, J. Fritsch, and G. Sagerer, “Recognition of deictic gestures with context,” in Proc. DAGM Symp. Pat. Rec., 2004.
[43] A. Kranstedt, A. Lucking, et al., “Deixis: How to determine demonstrated objects using a pointing cone,” in Proc. Int. Gesture Workshop, 2006.
[44] J. Matas, O. Chum, et al., “Robust wide-baseline stereo from maximally stable extremal regions,” Image Vis. Comp., vol. 22, pp. 761–767, 2004.
[45] M. A. McDowell, C. D. Fryar, et al., “Anthropometric reference data for children and adults: United States, 2003–2006,” National Health Statistics Reports, Tech. Rep., 2008.
