

Combining Environmental Cues & Head Gestures to Interact with Wearable Devices

M. Hanheide
Applied Computer Science, Bielefeld University
P.O. Box 100131, 33501 Bielefeld, Germany
[email protected]

C. Bauckhage
Centre for Vision Research, York University
4700 Keele St., Toronto, ON, M3J 1P3, Canada
[email protected]

G. Sagerer
Applied Computer Science, Bielefeld University
P.O. Box 100131, 33501 Bielefeld, Germany
[email protected]

ABSTRACT

As wearable sensors and computing hardware are becoming a reality, new and unorthodox approaches to seamless human-computer interaction can be explored. This paper presents the prototype of a wearable, head-mounted device for advanced human-machine interaction that integrates speech recognition and computer vision with head gesture analysis based on inertial sensor data. We will focus on the innovative idea of integrating visual and inertial data processing for interaction. Fusing head gestures with results from visual analysis of the environment provides rich vocabularies for human-machine communication because it renders the environment into an interface: if objects or items in the surroundings are associated with system activities, head gestures can trigger commands whenever the corresponding object is being looked at. We will explain the algorithmic approaches applied in our prototype and present experiments that highlight its potential for assistive technology. Apart from pointing out a new direction for seamless interaction in general, our approach provides a new and easy-to-use interface for disabled and paralyzed users in particular.

Categories and Subject Descriptors

H.5.2 [Information Interfaces and Presentation]: User Interfaces—Input devices; I.5.4 [Pattern Recognition]: Applications—Signal processing; I.5.5 [Pattern Recognition]: Implementation—Interactive systems

Keywords

wearable intelligent interfaces; inertial and visual sensors; seamless interaction

1. INTRODUCTION

Figure 1: Prototype of a pair of memory spectacles. Wearing a head-mounted device equipped with multiple sensors and an augmented reality display, the user can virtually interact with objects in the environment.

With computing hardware and sensors constantly shrinking in size and price, ever more objects of our everyday life are being



equipped with computing and sensing power. Simultaneously, recent years have seen increased efforts in research on advanced human-machine interaction. Numerous approaches to transcending the traditional keyboard/mouse/monitor setup have been devised and technologies for seamless interaction have been developed. Ubiquitous or wearable computers and sensors as well as multimodal interfaces will soon revolutionize the way we interact with appliances at work and at home. Already, advanced prototypes are taking the step from controlled and contrived laboratory environments to real-world applications. Systems have become available that reliably integrate speech and image understanding for natural interaction [1, 8, 10, 17, 19]. Other advanced projects on intuitive interaction deal with tangible interfaces [3, 28], smart rooms [2, 5, 27, 30], as well as wearable intelligent devices where space is becoming the interface [4, 11, 13, 21, 25].

A closer look at the cited contributions reveals two major directions in intelligent, multimodal interface research: it either explores the use of handheld input and feedback devices such as sophisticated electronic pens, or it concentrates on approaches to mimicking human sensing as a basis for seamless interaction. Regarding the latter, one can note a clear bias towards visual, acoustic and haptic sensors. This appears to reflect the common view that sight, hearing and touch are the predominant human senses. However, the human sensorium processes more modalities than visual, acoustic and tactile signals. Consider, for instance, inertial sensing.

In mammals, the vestibular system in the inner ear provides inertial information essential for orientation, ego motion and equilibrium.


Figure 2: Apart from speech input, hand gestures can be used to trigger system activities; moving a finger up and down in front of the cameras allows selecting activities from a menu displayed on the right-hand side of the field of view.

As the vestibular system is responsible for head stabilization, inertial sensing plays a crucial role in producing head gestures and consequently impacts interpersonal communication. Head nods and head shakes, for example, are important non-verbal gestures commonly used to communicate intent and emotion or to signal agreement or denial.

Since head gestures perform a wide range of conversational functions, they have of course been considered for human-machine interaction [6, 16, 18, 26]. These contributions present interfaces where head movements are recovered from video data. Using computer vision, human users are monitored and their head pose is tracked in order to detect and discern head gestures. Though it mimics the way a human communication partner would perceive head gestures, this approach has obvious limitations. First, the required external sensing (i.e. the static camera) restricts applications relying on head gestures to prepared environments, for instance smart rooms. Second, the user's head must not leave the field of view of the system; this constrains his or her mobility and therefore neutralizes an important advantage of advanced interfaces. Finally, automatic visual head gesture recognition is not a trivial task; it requires considerable computational effort, yet robust pose estimation cannot be guaranteed.

With respect to effort and robustness, sensor technology offers an alternative for head gesture recognition. Inertial motion trackers accurately measure velocities and accelerations at high temporal resolution. Progress in miniaturization has led to low-cost single-chip inertial sensors whose performance is similar to that of their biological archetypes. Integrated into head-mounted devices, inertial sensors thus provide a convenient means for reliably registering head motion.

Note that the human sensorium constantly integrates visual and inertial cues. Information provided by the vestibular system is crucial in planning eye movements for gaze holding or visual tracking. It is thus no surprise to find reports on integrated visual and inertial sensing for human-machine interaction. So far, however, most efforts were spent on hybrid tracking (cf. e.g. [22]), where the complementary features of visual and inertial sensors are used to improve 3D localization of human users.

In this paper, we introduce a novel application of integrated visual and inertial signal processing. We will present a wearable interface for user assistance tasks that extends the capabilities of an earlier prototype [11]. While our last demonstrator was constrained to interaction centered on a table in front of the user, the new system extends the interface into the surrounding space. It recognizes objects and activities in its environment and stores this information in order to act as an artificial memory for its user. Wearing a head-mounted device equipped with a microphone, an inertial sensor, two cameras and an augmented reality display, the user

perceives the environment augmented with information generated by the system. Using speech, hand or head gestures, the system can be instructed to retrieve data or it can be taught about the surroundings. Moreover, fusing visual and inertial signal processing leads to a novel paradigm in human-machine interaction: items observable in the user's surroundings may be assigned various communication semantics. Whenever these items appear in the field of view, performing corresponding head gestures can trigger different operation modes or system events.

With regard to human-machine interaction, the main contributions and key features of our system are thus: (i) as the interface is wearable, users experience themselves as the acting subject, not as a monitored object; (ii) as the interface is mobile, interaction with the system is neither restricted to a certain location nor need the user's movements be constrained to a visual frustum; (iii) since inertial sensing copes with fast and accelerated motion, the speed of head movements need not be adapted to the temporal resolution of a camera; (iv) the visual recognition of static environmental objects is simpler and more robust than recognizing and tracking articulated objects like the human body; in combination with object recognition, head gestures can thus assume a more important role than in usual setups.

In order to convey a general impression of the potential of wearable multimodal interfaces, the next section will summarize our application scenario. We will present examples of possible interactions with the current prototype and discuss how to fuse visual and inertial signal processing to incorporate cues from the environment. As advanced interaction technology heavily draws on algorithms and techniques developed in the fields of machine learning and artificial perception, section 3 will roughly outline the methods applied in our system. In section 4, we will present an experimental evaluation of inertial sensor based head gesture analysis for interaction and, finally, a summary will close this contribution.

2. MEMORY SPECTACLES

Developing multimodal, intelligent interfaces is not an end in itself. Rather, there are application areas where multimodality is a clear asset; we believe that assistive technology is such an area. The mobile device discussed in this paper results from research on cognitive systems for personal assistance. It represents the first prototype of a pair of memory spectacles. Our long-term goal is to proceed towards memory prosthetic devices that may assist the memory impaired. Memory prosthetic devices blend into everyday life and help answering questions such as 'Where did I leave my keys?'. Such a device obviously requires sensors and techniques that allow for interpreting the environment the user dwells in.


(a) A switch has been detected in the environment.

(b) Detecting the cup has triggered a confirmation dialog.

Figure 3: If the 'object recognition' mode is active, labels of recognized objects are shown in the augmented reality display; visual object recognition enables advanced interaction based on environmental cues: the display of status windows or feedback dialogs may be coupled with certain items and can be triggered if these appear in the field of view.

Also, to be of any use, a memory prosthetic device will have to have a simple and easy-to-use interface. Therefore, allocating some of the built-in intelligence for the interface is an obvious idea.

The hardware components of our current demonstrator system comprise a Pentium IV 1.8 GHz laptop with 512 MB RAM, an iVisor 3D display, two Unibrain Fire-i FireWire cameras, a Sennheiser wireless microphone system 1083-U connected to the laptop's sound card for digitization, and an Xsens MT9 inertial 3D motion tracker. Except for the laptop, the hardware is mounted on a helmet and thus becomes wearable (see Fig. 1).

Next, we will roughly sketch the interplay of hand gestures and augmented reality in operating our memory spectacles. Then, the use of head gestures and their integration with visual object recognition shall be discussed in detail.

2.1 Hand Gestures for Interaction

Figures 2 and 3 show impressions from experimenting with our prototype. Working in a natural, unconstrained office environment, the user can choose among different system functionalities.

Figure 4: The Xsens MT9 inertial sensor registers 3D rate-of-turn and acceleration as input data for head gesture recognition.

Speech recognition based on the ESMERALDA software package for speaker-independent speech recognition [7] enables verbal operation of the system. Visual skin color detection allows activating or deactivating system functions by means of hand gestures (see Fig. 2). Using the augmented reality (AR) display, a command menu is cast into the scene observed by the cameras. Up and down movements of a finger on the right-hand side of the field of view cause the system to highlight the menu buttons. Highlighted buttons can be selected by slightly moving the finger sideways.
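To make the hand-gesture control concrete, the following is a minimal sketch of skin-color blob detection and menu mapping in Python with OpenCV (version 4 or later assumed). The HSV thresholds, function names and the mapping from fingertip position to menu index are illustrative assumptions and do not reproduce the segmentation actually used in our demonstrator.

    import cv2
    import numpy as np

    def detect_finger(frame_bgr, lower=(0, 40, 60), upper=(25, 180, 255)):
        """Return the centroid (x, y) of the largest skin-colored blob, or None.
        The HSV thresholds are rough, illustrative values."""
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        blob = max(contours, key=cv2.contourArea)
        m = cv2.moments(blob)
        if m["m00"] == 0:
            return None
        return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])

    def highlighted_item(centroid_y, n_items, frame_height):
        """Map the fingertip's vertical position to one of n_items menu buttons."""
        return min(int(centroid_y * n_items / frame_height), n_items - 1)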

If required, information about recognized objects and results of user queries are visualized in the AR display. Applying computer vision algorithms and contextual reasoning, the system is able to identify different objects and activities (see Fig. 3). It copes with varying lighting conditions as well as cluttered video signals and has capabilities for online object learning [11, 12].

2.2 Head Gestures for Interaction

The digital motion tracker that is part of the prototype registers 3D rate-of-turn, acceleration and earth-magnetic field (see Fig. 4). It provides real-time 3D orientation data at frequencies up to 512 Hz and with a noise level of less than 1°. It therefore provides a convenient means for accurately measuring head motions in real time and thus enables determining a wide range of head gestures useful for human-machine interaction. Currently, we are considering three different types of head gestures to operate the system:

Spatial head gestures are head motions of moderate speed where the head steadily moves in one of the general directions left, right, up and down. Spatial gestures allow for spatial references and indication of directions. As Fig. 5 exemplifies, they may accomplish the selection of items from the menu displayed on the right-hand side of the user's field of view. Performing slow but constant up or down movements of the head, the user can toggle between menu items. Once the highlighting cursor is positioned over the desired item, up and down movements of higher frequency, i.e. nodding, select the item. This example naturally leads to the second category of gestures.

Semantic head gestures like nodding and shaking the head are used in many cultures to express agreement or denial. Of course this also applies to our scenario. In the example presented in Fig. 5, nodding the head signals the wish to select the currently highlighted menu item. A substantial extension of this straightforward application of nodding and shaking the head results from combining head movement analysis with computer vision. Then, the simple semantic gestures of nodding and shaking provide a far richer vocabulary for interaction with our system in particular or with intelligent rooms in general. Figure 3 shows two examples. In Fig. 3(a) a light switch was detected in the user's field of view. Consequently, a corresponding object label ('light') is cast into the augmented display of the scene. Exploiting the metaphorical meaning of a switch, the object label is accompanied by a virtual toggle button ('on').


Figure 5: Screenshots showing the navigation of the menu bar of the memory spectacles by using head gestures. By means of the gestures 'up' or 'down' the user switches between menu items. Selection is done by performing a faster 'nod' gesture.

Depending on the current dialog state of the system, nodding the head when a switch has been detected may trigger activities such as increasing the brightness of the display.

Figure 3(b) shows another example where visual context is incorporated into interaction. Here, windows containing additional information or confirmation dialogs are displayed because corresponding objects have been detected in the scene. Detecting the cup, for instance, triggered the system to display the admittedly contrived question whether the user desires another cup of coffee. Obviously, if the current scene is cluttered with objects for which there is additional information in the system's database, a mechanism is needed that prevents messages or dialogs from popping up in an out-of-control fashion.

Pseudo head gestures provide a means to control the interaction between visual object recognition and head gesture analysis. Apart from a rejection class of gestures whose role will be discussed in section 4, our system currently considers the pseudo gesture of steadying one's head movements after focusing on an intended object. If an environmental object has been recognized in the center of the field of view and remains there because the user does not move her head for a short while, additional information is displayed if available for the object in question. Once additional information or dialog windows appear in the display, the user may perform spatial or semantic head gestures to communicate an appropriate reaction to the system. Furthermore, recognizing gestures such as steadying allows stabilizing and judging results from object recognition: these results can only be considered reliable in reasonably stable images and should be discarded when the head is moving too fast.
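The interplay of object recognition, the steadying pseudo gesture and semantic gestures can be summarized by the following minimal sketch. The object labels, gesture names, stability threshold and bound actions are hypothetical illustrations; the actual dialog management of the system is more involved.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, Optional, Tuple

    @dataclass
    class EnvironmentInterface:
        # maps (object label, gesture) pairs to system activities
        bindings: Dict[Tuple[str, str], Callable[[], None]] = field(default_factory=dict)
        focused_object: Optional[str] = None
        stable_frames: int = 0

        def on_object(self, label: Optional[str]) -> None:
            """Track whether the same recognized object stays centered ('steadying')."""
            if label is not None and label == self.focused_object:
                self.stable_frames += 1
            else:
                self.focused_object, self.stable_frames = label, 0

        def on_gesture(self, gesture: str, min_stable: int = 10) -> None:
            """Trigger the bound activity only if an object is stably in focus."""
            if self.focused_object and self.stable_frames >= min_stable:
                action = self.bindings.get((self.focused_object, gesture))
                if action:
                    action()

    # usage: nodding while steadily looking at the light switch brightens the display
    ui = EnvironmentInterface()
    ui.bindings[("light", "nod")] = lambda: print("increase display brightness")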

Given these examples of interactive experiments with the current prototype of memory spectacles, it is tempting to anticipate numerous applications of integrated visual and inertial signal processing in the rapidly emerging field of intelligent assistive technologies. As they are literally user-centered, wearable, head-mounted cameras and motion trackers offer tremendous benefits especially for the paralyzed and people with impaired speech production. The system we introduce here is easy to handle; its operation requires only minimal head motion. Combined with visual analysis of the environment at medium range from the user, slight movements of the head provide a robust means to express a variety of intentions. Figure 6 illustrates how, at a distance, even slight motions result in large shifts of visual focus.¹ Large shifts of focus facilitate the task of visual object recognition.

¹ This fact is exploited in research on computer vision based gaze tracking with applications in driver assistance/alertness systems (cf. e.g. [14, 15, 23]). Research in this area also produced evidence that for focus recognition user-centered, head-mounted devices outperform methods based on cameras observing the user [24].

Figure 6: At a distance, even slight head motions lead to large shifts of focus; far from the head, the distance d_f between focused points is much larger than the distance d_n near the head.
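To make the geometric relation of Fig. 6 concrete: a small head rotation by an angle θ shifts the focused point by approximately d = D tan θ at viewing distance D. The following numbers are purely illustrative and not measurements from our setup:

    d_n \approx 0.5\,\mathrm{m} \cdot \tan 5^\circ \approx 4.4\,\mathrm{cm},
    \qquad
    d_f \approx 3\,\mathrm{m} \cdot \tan 5^\circ \approx 26\,\mathrm{cm}

Hence the same 5° head motion that barely moves the focus near the head sweeps it across several object widths at room-scale distances.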

Objects typically found at medium distances (e.g. light switches) are less likely to occlude each other. Thus, assigning communication semantics to objects that naturally occur in the user's environment but are conveniently far apart from each other provides a robust means for human-machine interaction even for users whose agility is considerably limited. Furthermore, in automated, intelligent rooms the interface mechanism introduced in the examples above can of course be applied to operate the environment. A typical application would be switching TV channels; for the time being, however, the latter remains a topic of future work.

3. ALGORITHMIC BASIS

As interactions among people rely on the human sensory-motor system, seamless machine interfaces should accommodate human habits and multimodal sensing. Although it seems a triviality, it is important to note that multimodal interfaces which mimic human sensing require techniques for pattern recognition and machine learning. Digital signals recorded by cameras, microphones or inertial sensors have to be analyzed and interpreted in order to enable meaningful interaction. Therefore, and in accordance with panel discussions at earlier ICMI conferences stressing the need for a better understanding and embedding of pattern recognition in interface research, this section will briefly outline the machine intelligence methods applied in our system. First, we shall discuss robust visual object recognition; afterwards, we will present the techniques we use for head gesture recognition from inertial sensor data.



3.1 Visual Object Recognition

Our current demonstrator applies boosted cascaded weak classifiers for visual object recognition. Originating from machine learning, this technique was introduced to computer vision by Viola and Jones [29].

A weak classifier performs only slightly better than random guessing. Boosting algorithms such as AdaBoost [9] sequentially train weak classifiers with repeatedly re-weighted versions of the training data and thereby produce an ensemble C_i, i = 1, ..., n of classifiers. Since in iteration i the data that were misclassified by the classifier C_{i-1} have their weights increased, each classifier in the sequence concentrates on the data misclassified in the previous steps. It can be shown that combining the predictions of such an ensemble yields very accurate results. Viola and Jones proposed the use of boosting to identify the most discriminative features for an object recognition task. Given many views of the intended (class of) object(s) and even more counterexamples, AdaBoost learns an ensemble of classifiers, each of which considers a different feature. Combining these classifiers in a cascaded scheme results in the following procedure: a window is slid over the input image and at each location the most discriminative feature is computed from the pixels within the window. If the corresponding classifier predicts the intended object, the second most discriminative feature is computed and analyzed by the second classifier. Each location where this process reaches the bottom of the cascade is said to depict the object. Usually, however, the topmost level of the cascade already rejects the vast majority of the tested image locations. Due to the resulting efficiency, this algorithm yields a rapid yet reliable approach to object recognition.
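The early-rejection logic of the cascade can be sketched as follows; the stage classifiers, feature functions and thresholds below are hypothetical placeholders for the boosted Haar-like feature classifiers and are not taken from our implementation.

    from dataclasses import dataclass
    from typing import Callable, List, Tuple
    import numpy as np

    @dataclass
    class Stage:
        feature: Callable[[np.ndarray], float]  # computes one (Haar-like) feature value
        threshold: float                        # stage acceptance threshold

    def cascade_detect(window: np.ndarray, stages: List[Stage]) -> bool:
        """A window depicts the object only if it passes every stage; most windows
        are rejected by the first, most discriminative stage, which makes the
        detector fast."""
        for stage in stages:
            if stage.feature(window) < stage.threshold:
                return False  # early rejection: no further features are computed
        return True           # reached the bottom of the cascade

    def sliding_window_detect(image: np.ndarray, stages: List[Stage],
                              size: int = 24, step: int = 4) -> List[Tuple[int, int, int, int]]:
        """Slide a fixed-size window over the image and collect accepted locations."""
        h, w = image.shape[:2]
        hits = []
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                if cascade_detect(image[y:y + size, x:x + size], stages):
                    hits.append((x, y, size, size))
        return hits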

In order to enable interaction based on environmental cues, we trained cascaded weak classifiers for different views of objects like switches, keyboards, cups or books. Using several hundred positive training examples and several thousand randomly chosen negative ones for each considered view of an object resulted in fast object detectors, each with a recognition accuracy of more than 90%. The small number of false positive detections does not pose a problem: inconsistent recognition results are filtered out using temporal context. Due to lack of space, this mechanism is not described here; the interested reader may refer to [31].

3.2 Head Gesture Recognition

As outlined before, our experimental setup applies head gesture recognition based on inertial data to enable interaction in an additional natural way besides pointing and speech. Our approach is based on semi-continuous linear Hidden Markov Models (HMMs) with Gaussian mixtures to parameterize observations, since these are known to be well suited as probabilistic classifiers for sequential data. For this purpose, we utilized the ESMERALDA software package [7], which is already applied in our demonstrator for speech recognition, and adapted it for the classification of head gestures.
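As a rough illustration of the classification scheme (one HMM per gesture class, decision by maximum likelihood), the sketch below uses the hmmlearn toolkit. Note that hmmlearn provides continuous Gaussian HMMs rather than the semi-continuous models trained with ESMERALDA in our system; the number of states, iteration count and data layout are illustrative assumptions.

    import numpy as np
    from hmmlearn import hmm

    GESTURES = ["nod", "shake", "up", "down", "quiet", "noise", "jump"]

    def train_models(training_data, n_states=5, n_iter=15):
        """training_data: dict mapping gesture name -> list of feature sequences,
        each an array of shape (T, 24) holding the Fourier-descriptor features
        described below."""
        models = {}
        for gesture in GESTURES:
            seqs = training_data[gesture]
            X = np.concatenate(seqs)             # stack all sequences of this class
            lengths = [len(s) for s in seqs]     # remember the sequence boundaries
            m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=n_iter)   # Baum-Welch re-estimation
            m.fit(X, lengths)
            models[gesture] = m
        return models

    def classify(models, sequence):
        """Assign the gesture whose HMM yields the highest log-likelihood."""
        return max(models, key=lambda g: models[g].score(sequence))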

The employed inertial sensor supplies nine-dimensional raw data samples with three attributes (each three-dimensional) at a rate of 100 Hz: rate of turn [deg/s], acceleration [m/s²] and magnetic field [mGauss]. Since the magnetic field provides no useful information for head gesture recognition, we selected the remaining two attributes as input data for our classification approach. Although these samples could be fed directly into the HMM-based classification algorithm, previous experiments have shown that features describing the characteristics of the dynamics allow for much better recognition performance. In our approach we use Fourier descriptors

    F_n(t) = \frac{1}{K}\,\left|\,\sum_{k=0}^{w-1} s(t+k)\, e^{-2\pi i n k / w}\right|

as features computed from sliding, overlapping windows of the original six-dimensional inertial data s(t). In the above equation, the window size is denoted by w. Best results were achieved with a window size of w = 10 (100 ms), overlapping by δ = 6 (60 ms), to characterize the dynamics of head movements. This allows computing a maximum number of w Fourier descriptors, but descriptors with higher n are bound to jitter, since they reflect high frequencies in head movement. Accordingly, we choose only the n = 1, ..., 4 lowest descriptors. Feature extraction with the above values results in a 24-dimensional feature vector. To establish Gaussian distributions for the required Gaussian mixture models, k-means vector quantization [20] is carried out on a set of training features. Afterwards, HMM parameters are trained from manually annotated data using the Baum-Welch algorithm. Results obtained with this approach are reported in the next section.
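A minimal NumPy sketch of this feature extraction is given below, assuming the reconstructed formula above and a normalization constant K equal to the window size w (the exact value of K is not spelled out here); names and defaults are illustrative.

    import numpy as np

    def fourier_descriptors(signal, w=10, overlap=6, n_descriptors=4, K=None):
        """Compute Fourier-descriptor features from the six-dimensional inertial stream.

        signal: array of shape (T, 6) holding rate-of-turn (3) and acceleration (3)
                samples at 100 Hz.
        Returns an array of shape (num_windows, 6 * n_descriptors), i.e. 24-dimensional
        feature vectors for the default settings."""
        K = K if K is not None else w          # assumed normalization constant
        step = w - overlap                     # 100 ms windows overlapping by 60 ms
        features = []
        for start in range(0, len(signal) - w + 1, step):
            window = signal[start:start + w]               # shape (w, 6)
            spectrum = np.fft.fft(window, axis=0)          # DFT along the time axis
            F = np.abs(spectrum[1:n_descriptors + 1]) / K  # descriptors n = 1..4 per channel
            features.append(F.T.reshape(-1))               # flatten to a 24-D vector
        return np.array(features)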

4. EXPERIMENTS

A quantitative performance evaluation of the head movement classifier in our prototype was carried out by means of a series of user experiments. Altogether, 26 subjects (14 female, 12 male) were asked to perform the gestures nod, shake, up, down, quiet, noise, and jump while wearing the memory spectacles with the attached inertial sensor.

4.1 Data Acquisition

In order to acquire data that would represent gestures as they appear naturally, the users were not directly instructed to perform a certain movement. Rather, different images were displayed to them and they were asked questions or given tasks they should answer or accomplish via head gestures (e.g. "Is there a car in the image?"). While the rationale for considering the gestures nod, shake, up, and down was motivated in the discussion of the interaction examples in section 2.2, the reason for considering the gestures quiet, noise and jump is not immediately apparent. The terms we use to name these gestures were chosen according to frequently reappearing patterns in the inertial sensor data which, however, do not carry a meaning. The pattern we termed jump, for instance, occurs whenever the user performs actions such as standing up or sitting down, which involve the whole body and therefore yield large oscillations in the motion tracker data. Recognizing these motion patterns reduces the false positive rate in head gesture classification. These pseudo gestures thus assume the role of rejection classes in our demonstrator; they are not associated with any semantics and thus do not trigger any specific action.

A static camera was used to monitor our subjects during the experiments. This was necessary because interpreting and distinguishing raw inertial sensor data proved to be difficult. Recording videos of the experiments, however, allows for a manual labeling of the data acquired by the motion tracker. For example, if the video footage shows a subject shaking the head, it is easy to decide that the corresponding motion tracker data represents a shake gesture.

For the experimental results reported in the following, we evaluated head gesture classification separately from the integrated demonstration system.


Figure 7: Best recognition results (measured in terms of error rate) obtained from training sets of various sizes (7, 13, 17, and 19 users); the 95% confidence intervals are drawn at each bar. [Plot: error rate (%) vs. number of users in the training set.]

4.2 Evaluation

For the evaluation of our HMM-based head gesture recognition technique, we split the users into two groups and used the data from one group for training the classifier; the data acquired from the other group was used for testing.

The system was trained with the chosen training data as outlined in section 3.2. Figure 7 displays our results in terms of error rates for single gesture recognition for different partitions of our data into training and test groups. Obviously, even training with the data recorded from just 13 different subjects still results in reliable recognition performance (error rate: 6.2%). If more data was used for training, even better results could be achieved: for training based on the data gathered from 19 users, the error rate drops to 2.8%. These results are encouraging and hint that even better performance might be achieved if the gesture classifier was trained with yet more examples.

An interesting observation becomes apparent from Fig. 8. The figure shows that for all experiments the lowest error rates are achieved after 15 HMM training iterations. Training beyond 15 iterations resulted in over-fitted classifiers which exactly reproduced the training data but showed limited generalization capabilities and are thus of little use in practice.

From these results we can conclude that HMM-based classification of inertial motion tracker data using Fourier descriptors provides a convenient and reliable method for head gesture recognition. Though training is straightforward and inexpensive, the resulting classifiers yield low error rates. Together with the reliable performance of visual object recognition reported in section 3.1, integrated head gesture and visual context analysis thus provides a rich and robust means for human-machine interaction.

5. SUMMARY

Although inertial sensing is an important part of the human sensorium, so far only little attention has been paid to inertial motion tracking as a possible addition to human-machine interfaces. However, recent advances in sensor miniaturization may change this, as technology has become available that can easily be incorporated into wearable devices for advanced interaction.

Figure 8: Error rates with respect to the number of training iterations on four different training sets (19, 16, 13, and 7 users). [Plot: error rate (%) vs. iteration.]

This paper presented the prototype of a mobile, head-mounted interface that integrates inertial sensing along with other perceptual modalities. In contrast to purely vision-based methods for head motion tracking, inertial data based head gesture analysis is less involved and less error prone. Computing Fourier descriptors from the recorded data and applying Hidden Markov Models for classification yields very reliable real-time head gesture recognition. We also illustrated the potential of integrating visual and inertial signal processing. Combining head gesture analysis with visual context information provides a rich vocabulary for interaction. As our device is head-mounted, it is quite literally user-centered. It therefore offers interesting perspectives for research in assistive technologies. Even small movements of the head can be reliably distinguished. Moreover, at a distance, they lead to larger shifts of focus, so that visual object recognition is unlikely to be hampered by occlusion and clutter. Integrated head gesture recognition and visual context analysis is therefore of considerable interest for advanced human-machine interaction for the bodily impaired or paralyzed.

A promising direction for future work is extending integrated visual and inertial data processing towards smart room technology for the disabled. Applications of integrated visual and inertial analysis are of course not limited to the operation of the head-mounted device itself. Appliances in fully automated rooms will be remotely controllable. If operation information and dialogs for remote-controlled devices are stored in the database of the memory spectacles, such devices will be operable by just looking at them and performing the respective sequence of nods.

6. ACKNOWLEDGMENTS

The work reported here results from the project Visual Active Memory Processes and Interactive REtrieval (VAMPIRE), which is being funded within the European Union IST Program (IST-2001-34401). We want to express our gratitude to Jan Schafer for designing and implementing the graphical user interface for the augmented reality display of our system.

7. REFERENCES

[1] C. Bauckhage, J. Fritsch, K.J. Rohlfing, S. Wachsmuth, and G. Sagerer. Evaluating integrated speech- and image understanding. In Proc. Int. Conf. on Multimodal Interfaces, pages 9–14, 2002.


[2] C. Bauckhage, M. Hanheide, S. Wrede, and G. Sagerer. A Cognitive Vision System for Action Recognition in Office Environments. In Proc. Conf. on Computer Vision and Pattern Recognition, volume II, pages 827–833, 2004.

[3] P.R. Cohen and D.R. McGee. Tangible multimodal interfaces for safety-critical applications. Communications of the ACM, 47(1):41–46, 2004.

[4] P.R. Cohen, D.R. McGee, S.L. Oviatt, L. Wu, J. Clow, R. King, S. Julier, and L. Rosenblum. Multimodal Interactions for 2D and 3D Environments. IEEE Computer Graphics and Applications, 19(4):10–13, 1999.

[5] J.L. Crowley, J. Coutaz, and F. Bérard. Things that see. Communications of the ACM, 43(3):54–64, 2000.

[6] J.W. Davis and S. Vaks. A Perceptual User Interface for Recognizing Head Gesture Acknowledgements. In Proc. Workshop on Perceptive User Interfaces, 2001.

[7] G.A. Fink. Developing HMM-based Recognizers with ESMERALDA. In Text, Speech and Dialogue, volume 1692 of LNAI, pages 229–234. Springer, 1999.

[8] G.A. Fink, J. Fritsch, S. Hohenner, M. Kleinehagenbrock, S. Lang, and G. Sagerer. Towards multi-modal interaction with a mobile robot. Pattern Recognition and Image Analysis, 14(2):173–184, 2004.

[9] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proc. Int. Conf. on Machine Learning, pages 148–156, 1996.

[10] J. Fritsch, M. Kleinehagenbrock, S. Lang, T. Plötz, G.A. Fink, and G. Sagerer. Multi-Modal Anchoring for Human-Robot-Interaction. Robotics and Autonomous Systems, 43(2–3):133–147, 2003.

[11] G. Heidemann, I. Bax, H. Bekel, C. Bauckhage, S. Wachsmuth, G. Fink, A. Pinz, H. Ritter, and G. Sagerer. Multimodal Interaction in an Augmented Reality Scenario. In Proc. Int. Conf. on Multimodal Interfaces, pages 52–60, 2004.

[12] G. Heidemann, H. Bekel, I. Bax, and H. Ritter. Interactive online learning. Pattern Recognition and Image Analysis, 15(1):55–58, 2005.

[13] N. Hofemann, J. Fritsch, and G. Sagerer. Recognition of Deictic Gestures with Context. In Pattern Recognition, volume 3175 of LNCS, pages 334–341. Springer, 2004.

[14] T. Ishikawa, S. Baker, I. Matthews, and T. Kanade. Passive Driver Gaze Tracking with Active Appearance Models. In Proc. World Congress on Intelligent Transportation Systems, 2004.

[15] Q. Ji and Z. Jang. Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver Vigilance. Real Time Imaging, 8(5):357–377, 2002.

[16] A. Kapoor and R.W. Picard. A real-time head nod and shake detector. In Proc. Workshop on Perceptive User Interfaces, 2001.

[17] T. Käster, M. Pfeiffer, C. Bauckhage, and G. Sagerer. Combining Speech and Haptics for Intuitive and Efficient Navigation through Image Databases. In Proc. Int. Conf. on Multimodal Interfaces, pages 180–187, 2003.

[18] R. Kjeldsen. Head Gestures for Computer Control. In IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pages 61–68, 2001.

[19] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G.A. Fink, and G. Sagerer. Providing the Basis for Human-Robot-Interaction: A Multi-Modal Attention System for a Mobile Robot. In Proc. Int. Conf. on Multimodal Interfaces, pages 28–35, 2003.

[20] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Lucien M. Le Cam and Jerzy Neyman, editors, Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–296, 1967.

[21] A. Pentland. Perceptual intelligence. Communications of the ACM, 43(3):35–44, 2000.

[22] M. Ribo, P. Lang, H. Ganster, M. Brandner, C. Stock, and A. Pinz. Hybrid Tracking for Outdoor Augmented Reality Applications. IEEE Computer Graphics and Applications, 22(6):54–63, 2002.

[23] R. Smith, M. Shah, and N. da Vitoria Lobo. Monitoring Head/Eye Motion for Driver Alertness with One Camera. In Proc. Int. Conf. on Pattern Recognition, volume IV, pages 636–642, 2000.

[24] M. Sodhi, B. Reimer, J.L. Cohen, E. Vastenburg, R. Kaars, and S. Kirchenbaum. On-Road Driver Eye Movement Tracking Using Head-Mounted Devices. In Proc. Eye Tracking Research and Applications Symposium, pages 61–68, 2002.

[25] T.E. Starner. Wearable computers: No longer science fiction. IEEE Pervasive Computing, 1(1):86–88, 2002.

[26] R. Stiefelhagen, C. Fuegen, P. Giesemann, H. Holzapfel, K. Nickel, and A. Waibel. Natural Human-Robot Interaction using Speech, Gaze and Gestures. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 2422–2427, 2004.

[27] S. Stillman and I. Essa. Towards reliable multimodal sensing in aware environments. In Proc. Workshop on Perceptive User Interfaces, 2001.

[28] B. Ullmer and H. Ishii. Emerging Frameworks for Tangible User Interfaces. In J.M. Carroll, editor, Human-Computer Interaction in the New Millennium. Addison-Wesley, 2001.

[29] P. Viola and M. Jones. Rapid Object Detection using a Boosted Cascade of Simple Features. In Proc. CVPR, volume I, pages 511–518, 2001.

[30] C. Wisneski, H. Ishii, A. Dahley, M. Gobert, S. Brave, B. Ullmer, and P. Yarin. Ambient displays: Turning architectural space into an interface between people and digital information. In Proc. Int. Workshop on Cooperative Buildings, pages 22–32, 1998.

[31] S. Wrede, M. Hanheide, C. Bauckhage, and G. Sagerer. An active memory as a model for information fusion. In Proc. Int. Conf. on Information Fusion, pages 198–205, 2004.