combining voice and gesture for human computer interaction · interaction of users with computers...

Combining Voice and Gesturefor Human Computer

Interaction

Haleh Chizari1

Supervisors:

Dr. Denis Lalanne

Matthias Schwaller

November 2013

Department of Informatics - Master Project

Report

Departement d’Informatique - Departement fur Informatik • Universite de Fribourg -Universitat Freiburg • Boulevard de Perolles 90 • 1700 Fribourg • Switzerland

phone +41 (26) 300 84 65 fax +41 (26) 300 97 31 [email protected]://diuf.unifr.ch

[email protected], DIVA group, DIUF, University of Fribourg

Abstract

Recently, there has been a great deal of interest in multimodal interfaces thanksto their potential in providing more natural user-machine interactions, particularlyin applications where the use of mouse or keyboards is tedious and inappropriate.Among other, two types of inputs are increasingly integrated in multimodal inter-faces: voice and gesture. There are many ways for temporal fusing of speech andgesture inputs. However, user-friendly ones are more interesting for use in real-world applications. In this context, this research project focuses on the user effortand perceived quality as two important criteria of user-friendliness in the interactionbetween the user and the computer.

The project starts with study and design of multimodal commands for severaltypical operations such as selection, dragging and dropping, rotation and resizing.Depending on the use of gesture or voice for initiating or parameterizing the oper-ation, different sets of combined voice-gesture commands are proposed. Then, onemultimodal set which seems more suitable for the majority of users is selected to beimplemented in C# using Microsoft Kinect SDK.

In the next step, the functionality of the implemented multimodal design is ex-amined compared to that of the existing set in the laboratory which uses gestureunimodal commands. To do this, we asked ten people to evaluate both unimodaland multimodal sets in terms of qualitative and quantitative criteria, such as accu-racy, speed, efficiency, perceived quality, effort, and error susceptibility. Statisticalanalyses show that the proposed multimodal set has quantitatively equivalent per-formance in selection, rotation, and mixed activities as the existing unimodal set,but underperforms in move and resizing tasks. Furthermore, the t-test analysis onthe number of the user errors demonstrates that using the multimodal technologysignificantly reduces the user errors in all considered activities. On the other hand,the results obtained from the qualitative questionnaire show that the multimodalset is slightly favored by the users with respect to its unimodal counterpart due tobetter overall performance as well as less cognitive load and effort.

Keywords: Kinect, Multimodal, Speech, Gesture

Acknowledgment

I would like to express my sincerest thanks and appreciation to Prof. Denis Lalanneas my thesis supervisor who guided me through the project with patience and helpedme to cope with the problems I encountered during the thesis. I did learn a lot fromhim in the teaching courses as well.

On the other hand, I would like to extend my gratitude to Matthias Schwallerwho helped me accomplish this study. Thank you for the feedback, direction, andassistance when I needed it.

I am also delighted to thank my Iranian friends who supported me during theevaluation phase of this project. Finally, special recognition goes out to my spousefor his support during my pursuit of the Master’s program.

2

Contents

1 Introduction 51.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3 Report Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Design Space 92.1 Multimodal Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.1 CARE Properties . . . . . . . . . . . . . . . . . . . . . . . . . 102.1.2 CASE Properties . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Command Components . . . . . . . . . . . . . . . . . . . . . . . . . . 112.3 System Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 Selection Design . . . . . . . . . . . . . . . . . . . . . . . . . . 142.3.2 Rotation Design . . . . . . . . . . . . . . . . . . . . . . . . . . 152.3.3 Zoom Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.3.4 Design Space Analysis . . . . . . . . . . . . . . . . . . . . . . 26

3 Implementation 283.1 Kinect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1.1 PC Driver Support . . . . . . . . . . . . . . . . . . . . . . . . 293.2 Development Kit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2.1 Using Candescent NUI for Gesture . . . . . . . . . . . . . . . 313.2.2 Using Microsoft SDK for Speech . . . . . . . . . . . . . . . . . 31

3.3 Speech Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 333.3.1 Speech Recognition Guidelines . . . . . . . . . . . . . . . . . 333.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 343.3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.4 Gesture Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 363.4.1 Extra Gesture Implementations . . . . . . . . . . . . . . . . . 37

3.5 Integration Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 383.5.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.5.2 Move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.5.3 Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403.5.4 Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3

4 Evaluation 434.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444.2 Test Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444.3 Range of testers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.4 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474.5 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.5.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484.5.2 Move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544.5.3 Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584.5.4 Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.5.5 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654.5.6 Summary of Speech Recognition Performance . . . . . . . . . 694.5.7 Quantitative Result Summary . . . . . . . . . . . . . . . . . . 70

4.6 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.7.1 General Points . . . . . . . . . . . . . . . . . . . . . . . . . . 754.7.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754.7.3 Move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764.7.4 Rotation and Resizing . . . . . . . . . . . . . . . . . . . . . . 764.7.5 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774.7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5 Conclusions and Future Work 795.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Appendix 82

Bibliography 84

4

Chapter 1

Introduction

Contents1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 Report Structure . . . . . . . . . . . . . . . . . . . . . . . 7

In this chapter, we give a brief literature review on multimodal human-machineinteraction. Next, we present the thesis motivation and objectives. Finally, thereport structure is provided.

5

1.1 Background

Nowadays multimodal interaction has received a great deal of attention since it canprovide the user with more than one mode for interfacing with a system in such away that the weakness of one modality therein can be covered by the strengths of theother. In other words, multimodal interfaces are designed to exploit the capabilitiesof multiple modalities. In this way, recognition errors may be corrected either byexploiting mutual disambiguation or by taking advantage of real-time error handlingprovided [1]. As a result, multimodal interfaces offer better usability, flexibilityand reliability compared to their unimodal counterparts. When using multimodalinteraction is desired for input of data, various user input modes - such as speech,touch, gesture, gaze, and pen are combined in a unified manner rather than typicalinput unimodal interfaces like keyboard and mouse.

The concept of multimodal interaction has been first introduced in 1980 in MIT[2]. The researchers there presented a system so-called put that there with thecapability of processing speech and pointing gesture during object manipulation.For instance, the system was able to create or move some geometrical shapes havingdifferent color through voice commands, and the location was chosen by the pointinggesture. Another example is the system proposed in [3] where deictic gestures,speech and eye gaze were combined for manipulating spatial objects on a map.A more recent application of multimodal interaction has been introduced in [4]where hierarchical menus for implementing controls in the car were replaced by acombination of speech and gesture in order to lower the visual demand. Speechand gesture were used for identification of functions and manipulation, respectively.In [5], multimodal interfaces based on co-speech spatial gestures have been usedon smart devices in an assisted living environment to help people with physical orcognitive impairments.

Among other, two types of inputs are increasingly integrated in multimodalinterfaces: voice and gesture. Both gestures and natural human conversation areconsidered as rich interaction tools. There exist different definitions of gesture inliterature. On the one hand, Kendon has defined gesture as voluntary and expressivemovements of the body used together with speech and perceived by the observeras a meaning part of the speech [6]. This definition only includes natural userinterfaces rather than elements of gesture languages. On the other hand, gesturecan be defined as body movements used for transmitting data from one person toanother [7]. Based on this definition, gestures in HCI must have explicit meanings,even they are implicit in nature. In this project, we stick to the second definition ofgesture.

A widely-used device for human-machine interaction is Microsoft Kinect whichprovides the user interface designer with two input modalities of gestures and speech[8]. The Kinect sensor is composed of four components, namely, depth camera, colorcamera, microphone array, and tilting mechanism. Kinect has recently found manyapplications including game consoles, virtual MIDI controller [9], health care ap-plications [10][11], physical therapy [12], education [13], and training [14]. Amongother recent interesting applications of Kinect, we can name Kinoogle, a Kinect in-terface for natural interaction with Google Earth [15] where gesture is proposed to

6

perform span, zoom, rotation and tilt operation, while the advanced speech recogni-tion will be used as an alternative option for the people who don’t have the necessaryagility and dexterity. In [16], an approach for multimodal multi-application naturalinteraction of users with computers using Kinect device as a two modal has beendescribed.

In spite of human communications where the use of speech and gesture is co-ordinated, in human-computer interaction there is a large difference between timeresponses of the recognizers to gesture and speech. Typically, speech recognition ofa word requires more time than computing the coordinates of a gesture. This showsthat the combination of gesture and speech inputs constitutes a major challenge.

Although, there are many ways for temporal fusing of speech and gesture inputs,user-friendly ones are more interesting for use in real-world applications. Differ-ent parameters should be taken into account, including information redundancy ofmodalities, time required for integrating speech and gesture, properties and techno-logical constraints of each modality, user state and operating environment.

1.2 Objectives

In this context, the goal of this research is to study, design, and evaluate multimodalinterfaces as a means of interaction between the user and the computer. The projectaims at performing common tasks on objects such as selection, dragging and drop-ping, rotation and resizing. Depending on the use of gesture or voice for initiatingor parameterizing the operation, different sets of combined voice-gesture commandsare proposed and developed. Then, one multimodal set which seems more suitablefor the majority of users is selected to be implemented using C#. To detect andrecognize speech and gesture inputs, Microsoft Kinect sensor device is employed. Inthe next step, the performance of the implemented multimodal design is evaluatedcompared to that of the existing set [17] in the laboratory which uses iconic gestureunimodal commands, see Fig. 1.1.

Figure 1.1: Unimodal set using iconic gestures [17].

1.3 Report Structure

The rest of the report is organized as follows. In Chapter 2, we discuss the gestureand voice fusion in human-machine interaction and propose a design space for main

7

object manipulation commands. In this context, different solutions for each com-mand are explored. Then, a set of solutions is chosen to be implemented. In chapter3, different features of Microsoft Kinect are explained and the use of Kinect SDKand Candescent library is described. Then, the implementation of the multimodalsolutions and its challenges are presented. Chapter 4 presents the evaluation of theimplemented multimodal and the existing unimodal solutions and discusses the ob-tained results. Finally, Chapter 5 concludes the report and points out some futureworks on the topic.

8

Chapter 2

Design Space

Contents2.1 Multimodal Fusion . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.1 CARE Properties . . . . . . . . . . . . . . . . . . . . . . . 10

2.1.2 CASE Properties . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Command Components . . . . . . . . . . . . . . . . . . . . 11

2.3 System Design Space . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 Selection Design . . . . . . . . . . . . . . . . . . . . . . . 14

2.3.2 Rotation Design . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.3 Zoom Design . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.3.4 Design Space Analysis . . . . . . . . . . . . . . . . . . . 26

In this chapter, we present an analysis of the integration of voice and gesturein human-machine interaction. Existing different ways of multimodal data fusion,we propose a design space to classify multimodal systems within a framework andto compare them in terms of the user performance. To this aim, we consider fourmain commands for manipulating objects, namely, selection, dragging and dropping,rotation and zoom.

9

2.1 Multimodal Fusion

2.1.1 CARE Properties

Within a multimodal system, the availability of several modalities demands thattheir combined usage is characterized. Several frameworks have been proposed so farto address the issue of the combined usage of multiple modalities. In the frameworkcalled TYCOON [18] six types of cooperation between the modalities are considered:

• Equivalence

• Specialization

• Redundancy

• Complementarity

• Transfer

• Concurrency

Typically, equivalence describes the choice between several modalities havingalmost equal capability in performing the task. On the other hand, specializationshows that a given modality is always used for one particular task. Redundancyimplies that more than one modality are used in performing the same task whilecomplementarity describes the case that several modalities must be used to reach agiven state, namely none of them is not individually sufficient. When two modalitiescooperate by transfer, this implies that an interaction produced by a modality isused by another modality. Finally, concurrency describes the situation where eachmodality is used independently but in a parallel manner.

Another framework for addressing the issue of relationships between modalitieswas proposed by Coutaz [19]. This framework so-called CARE includes four prop-erties than can be occurred between several modalities within a multimodal userinterface:

• Complementarity

• Assignment which indicates that only one modality is used to reach a givenstate.

• Redundancy

• Equivalence

While equivalence and assignment describe the option available at a given case,redundancy and complementarity consider the combined usage of several modalitiesaccording to temporal constraints. In this work, we consider both redundancy andcomplementarity aspects of CARE properties.

10

2.1.2 CASE Properties

Fusion strategy for multiple modalities can be different from one system to another.The CASE framework describes different aspects of the modality fusion from themachine point of view, as shown in Fig. 2.1 [20]. The term CASE is the abbreviationof concurrent, alternate, synergistic and exclusive. More specifically, this frameworkaddresses the temporal availability of multiple modalities in terms of absence orpresence of parallelism. In a system that allows the user simultaneous use of multiplemodalities, parallel use of modalities is supported. On the contrary, in a system thatallows only sequential use of modalities, the users have to employ the modalities oneby one. In this context, the alternate feature is chosen during this work, i.e. onlythe sequential use of speech and gesture modalities is taken into account.

Figure 2.1: The multi-feature system design

2.2 Command Components

In general, a command in human-machine interaction consists of two components:the command type such as selection, rotation and zoom, and the command parame-ter such as direction, orientation and angle. Similarly, two different input types arerequired to perform each command: one input used to determine the command type(invoking the command) called function input, and the other employed to charac-terize the invoked command (specifying the parameters) known as parameter input.For instance, in order to execute the command of rotation, one input should con-vey the command type (rotation) and the other one is required to determine theparameters (the rotation orientation and/or the angle of rotation).

Table 2.1 summarizes the function and parameter components for each commandof our interest. It is worth noting that in the case of some commands such as rotationand zoom, a part of the parameter may be directly specified using the function input.For instance, for the command of rotation, the rotation orientation can be eitherdetermined by the parameter input or defined directly using the function input.

11

Table 2.1: Command Components

Selection• Functions: selection• Parameters: x,y position, object

Move• Functions: drag, drop• Parameters: x,y position, object

Rotation

• Function: rotate, rotate right androtate left

• Parameters: rotate amount, ro-tate right and rotate left

Zoom

• Functions: zoom, zoom in andzoom out

• Parameters: zoom amount, zoomin and zoom out

2.3 System Design Space

As mentioned earlier, we aim at using gesture and speech modalities for manipulatingobjects using four main command of selection, dragging and dropping, rotation andzoom. Moreover, we stated that these modalities can be either independently usedor fused regarding the CARE properties. Furthermore, we noted that in the caseof combined usage of speech and gesture, we are interested in the sequential use ofthese modalities.

Within a multimodal user interface, each command can be performed either usingone modality or using combined modalities. Since each command itself generallyrequires two input types for the execution, it can be concluded that each of thespeech and gesture modalities can be employed either to initiate a given commandas the function input or to characterize an ongoing command as the parameterinput. Regarding the gesture modality, although this project focuses on only handgestures, we can consider two possibilities: the use of single-hand gestures and theuse of two-hand gestures. Different kinds of information can be extracted from handgesture modality during the system design such as the hand(s) position, movementand posture.

On the other side, speech modality can be employed at least in two different waysdepending on which feature of the speech is used. In the first approach, linguisticfeatures of the speech are extracted using word recognition algorithms. In the secondapproach, acoustic features of the input speech are used. Among different acousticfeatures, we can name prosody, energy, voicing probabilities, spectrum and duration[21].

To sum up the discussion, Table 2.2 shows all possible ways to use gesture andspeech modalities to perform a command. Table includes both unimodal and mul-timodal cases. In unimodal scenarios, either hand gestures or speech is employedto perform a command, namely only one modality is used for both function andparameter inputs. On the contrary, in multimodal cases the modality used for giv-

12

ing the function input differs from the modality providing the parameters. In thefollowing, we propose solutions for the multimodal cases, indicated in Table by X.

Table 2.2: Design Space

ParameterGesture Speech

One Hand Two Hands Word Recog. Voice Prop.

FunctionGesture

One Hand × ×Two Hands × ×

SpeechWord Recog. × ×Voice Prop. × ×

13

2.3.1 Selection Design

In this section, we propose different multimodal solutions for performing the com-mand of selection. As mentioned before, there are two general ways to integratethe speech and the gesture modalities. In Table 2.3, speech modality is used toconvey the command type (i.e. to define the function) while gesture modality deter-mines the parameters of the command. Depending on which feature of the speechis used (word recognition/linguistic features or voice properties/acoustic features)and the hand gesture type (one-hand or two-hand), there exist four solutions. Thefirst column describes the gesture proposed for determining the command parame-ter where a drawing per solution is also added. The second column determines thespeech input (above the microphone icon) and its type (below the microphone icon)for activating the command. Finally, the last column summarizes positive pointsand drawbacks of the proposed solutions. It is worth noting that we are always inpointing mode, therefore there is not required to say “pointing” to activate selectioncommand.

Table 2.3: Selection Design (Speech command - Gesture parameter)

Gestureparameters

Speech commands Cons and Pros

S1

Point to the Object

One Hand

Tonguing

Property of Voice

+ Easy to detect gesture+ Natural− Hard to recognize sound− Not reliable

S2

close Lhand &ponit Rhand while

tonguing

Two Hands

Tonguing

Property of Voice

+ Easy to detect gesture+ More reliable than previ-

ous one− Hard to recognize sound− Can be tiring (use of both

hands & speech)

S3

close Lhand &point Rhand whilesaying “Selection”

Two Hands

“Selection”

Word Recognition

+ Easy to detect gesture &speech

+ Reliable+ Accurate− Can be tiring(use of both

hands & speech)

S4

Point to the object

One Hand

“Selection”

Word Recognition

+ Easy to detect gesture &speech

+ Easy to use for users/lesstiring

+ Natural+ Reliable+ Accurate

14

On the other hand, in Table 2.4 gesture modality is employed to activate thecommand by the function input whereas speech modality specifies the parametersof the command. Similar to the previous table, four different types of solutions canbe considered depending on speech and gesture implementation. The first columnshows the gesture input for activating the command. The second column describesthe speech input proposed for conveying the command parameter.

Table 2.4: Selection Design (Gesture command - Speech parameter)

Gesture commands Speech parameters Cons and Pros

S5

close Lhand &point Rhand

Two Hands

Call the name ofthe object

Word Recognition

+ Easy to detect speech+ Accurate+ Reliable+ Natural− can be confusing for users

due to call object− Can be tiring (use of both

hands & speech)

S6

Hand pinch

One Hand

Call the name ofthe object

Word Recognition

+ Easy to detect speech− Hand should be in front of

detector− can be confusing for users

due to call object

S7

Hand grab

One Hand

Say numberassignment for

object

Word Recognition

+ Easy to recognize gesture& speech

+ Fast & easy to use+ Accurate

2.3.2 Rotation Design

In this section, we propose several multimodal solutions for executing the commandof rotation. It is obvious that two general ways for integrating speech and gesturemodalities exist. In Table 2.10, speech modality is used to determine the commandtype (i.e. to define the function) while gesture modality specifies the parametersof the command. Depending on the implementation of the speech and gesturemodalities, four different kinds of solutions are proposed. The first and secondcolumns of table determine the speech and gesture inputs, respectively. As with theproposed solutions for the command of selection, the command of rotation is alsodeactivated using speech modality, i.e. the same modality used for the activation ofthe command.

Table 2.6 shows the alternate multimodal solutions where the gesture and speechmodalities are used to convey the function and parameter inputs, respectively. It isworth noting that the command is released using gesture modality.

15

Table 2.5: Rotation Design (Speech command - Gesture parameter)

Speech CommandsGesture

ParametersRelease Cons and Pros

R1

“Rotation”

Word Recognition

Turn Rhand to theleft or right more

than 30 degreerepeatly(10 degreesper each gesture).

One Hand

“Release”

Word Recognition

+ Easy to detect ges-ture & speech

+ Wide range+ Capability of cor-

rectness− Can be tiring− Not Continuous

R2

“Rotation”

Word Recognition

Turn Rhand to theR or Lhand to the

L repeatly(10degrees per each

gesture).

Two Hands

“Release”

Word Recognition



rectness+ Less tiring compare

to previous one− Not Continuous

R3

“Rotation”

Word Recognition

Turn Rhand to theR or L & keep.

duration specifiesrotation amount

One Hand

“Release”

Word Recognition


+ Wide range+ Continuous+ Capability of cor-

rectness− Can be tiring for

user

R4

“Rotation”

Word Recognition

Turn Rhand to theR or Lhnad to theL & keep. durationspecifies rotation

amount

Two Hands

“Release”

Word Recognition




to previous one

R5

“Rotation”

Word Recognition

Move hand with 1finger

clockwise(RR),counterclock-

wise(RL)

One Hand

“Release”

Word Recognition



rectness

16

Speech commandsGesture

parametersRelease Cons and Pros

R6

Tonguing Twice

Voice Properties

Turn Rhand to theleft or right

repeatly(10 degreesper each gesture).

One Hand

“ShSh...”

Voice Properties

+ Easy to detect ges-ture

+ Wide range+ Not Continuous+ Capability of cor-

rectness− Can be tiring− Hard to detect

sounds

R7

Tonguing Twice

Voice Properties

Turn Rhand to theR or Lhand to the

L repeatly(10degrees per each

gesture).

Two Hands

“ShSh...”

Voice Properties


+ Wide range+ Not Continuous+ Capability of cor-


to previous one− Hard to detect

sounds

R8

Tounging Twice

Voice Properties

Turn Rhand to theR or L & keep.

duration specifiesrotation amount

One Hand

“ShSh...”

Voice Properties




user− Hard to detect

sounds

R9

Tounging Twice

Voice Properties

Turn Rhand to theR or Lhnad to theL & keep. durationspecifies rotation

amount

Two Hands

“ShSh...”

Voice Properties




to previous one− Hard to detect

sounds

R10

Tounging Twice

Voice Properties

Move hand with 1finger

clockwise(RR),counterclock-

wise(RL)

one Hand

“ShSh...”

Voice Properties



rectness− Hard to detect

sounds

17

Table 2.6: Rotation Design (Gesture command - Speech parameter)

GestureCommands

Speech Parameters Release Cons and Pros

R11

Turn 2 indexessmoothly

Two Hands

Say amount ofrotation

Word Recognition

Open Hands

Two Hands

+ Accurate+ Natural+ Reliable− Hard to detect ges-

ture− Need more time to

execute− Can be tiring for

users

R12

Clench your handsand rotate morethan 30 degree

Two Hands


Word Recognition

Open Hands

Two Hands


+ Accurate+ Reliable+ Natural− Need more time to


users

R13

Swipe finger to therightward or

leftward

One Hand


Word Recognition

Open Hand

One Hand

+ Easy to use for userand developers

+ Accurate+ Reliable+ Less tiring compare

to previous one− Need more time to

execute− Not capable of cor-

rectness (differentcommands for RRand RL)

R14

Make a circle withone finger

one hand


Word Recognition

Open Hand

One Hand


+ Accurate+ Natural+ Less effort

R15

Turn 2 indexessmoothly

Two Hands

Change the tone ofvoice increasing“rotate − left”,

decreasing“rotate − right”

Voice Properties

Open Hands

Two Hands

+ Accurate+ Natural+ Reliable− Hard to detect ges-

ture & sound− Need more time to


users

18

GestureCommands


R16

Clench your handsand rotate morethan 30 degree

Two Hands

Change the tone ofvoice increasing“rotate − right”,

decreasing“rotate − left”

Voice Properties

Open Hands

Two Hands


+ Accurate+ Reliable+ Natural− Hard to detect

sound− Need more time to


users

R17

Swipe finger to therightward or

leftward

One Hand

Say “Mmmm...”

Voice Properties

Open Hand

One Hand


+ Accurate+ Reliable+ Less tiring compare

to previous one− Need more time to



− Hard to detectsound

R18

Make a circle withone finger

One Hand

“Mmmm...”

Voice Properties

Open Hand

One Hand


+ Accurate+ Natural+ Less effort− Hard to detect

sound− Need more time to



19

2.3.3 Zoom Design

In this section, multimodal solutions for performing the command of ‘zoom’ are pro-posed. In a first approach for integrating speech and gesture modalities, as shownin Table 2.7, speech modality is used to define the function while gesture modalitydetermines the parameters of the command. Since the speech and gesture modalitiescan be implemented in several ways, i.e. using word recognition or acoustic prop-erties algorithms as well as one-hand or two-hand gestures, different solutions areincluded in the table. As always, the command is released using the same modalitywhich is used for the function input, i.e. the speech modality.

On the other hand, it is also possible to use gesture for conveying the functioninput and then to use speech for determining the parameter input. Table 2.8 in-cludes such solutions where the first and second columns describe the function andparameter inputs, respectively. The command of ‘zoom’ is deactivated using thegesture modality in this combination case.

20

Table 2.7: Zoom Design (Speech command - Gesture parameter)



Z1

“zooming”

Word Recognition

Take a distancetwo handsvertically

“zoom − out”, visea verse

“Zoom − in”

Two Hand

“Release”

Word Recognition

+ Easy to detectspeech

+ Capability of cor-rectness

− Hard to detect ges-ture

− Limited range ifhands are touching

− Can be tiring (us-ing two hands &speech)

Z2

“zooming”

Word Recognition

Spread two hands,get close

“zoom − in”, getfar “zoom − out”

Two Hands

“Release”

Word Recognition



rectness− Can be tiring (us-

ing two hands &speech)

− Limited range ifhands are touching

Z3

“zooming”

Word Recognition

Move handbackward

“zoom − out”,forward

“zoom − in” &keep. duration

specifies amount ofzoom

One Hand

“Release”

Word Recognition



rectness+ Natural− Can be tiring for

user

21



Z4

“zooming”

Word Recognition

Move hand withone finger arround

of initial pointclockwise

“zoom − in” orcounterclockwise

“zoom − out”

One hand

“Release”

Word Recognition



rectness− Not natural

Z5

“Zooming”

Word Recognition

Turn Rhand to theR “zoom − in”,Lhand to the L“zoom − out”

Two Hands

“Release”

Word Recognition



rectness+ Easy to use+ Continuous− Not natural

Z6

Tonguing threetimes

Voice Properties

Take a distancetwo handsvertically

“zoom − out”, visea verse

“Zoom − in”

Two Hand

“ShSh...”

Voice Properties


− Hard to detect ges-ture & sounds

− Limited range− Can be tiring (us-


Z7

Tonguing threetimes

Voice Properties

Spread two hands,get close

“zoom − in”, getfar “zoom − out”

Two Hands

“ShSh...”

Voice Properties



− Limited range− Hard to detect

sound− Can be tiring (us-


Z8

Tonguing threetimes

Voice Properties

Move handbackward

“zoom − out”,forward

“zoom − in” &keep. duration

specifies amount ofzoom

One Hand

“ShSh...”

Voice Properties




user− Hard to detect

sound

22



Z9

Tonguing threetimes

Voice Properties




“zoom − out”

One Hand

“ShSh...”

Voice Properties



rectness− Not natural− Hard to detect

sound

Z10

Tonguing threetimes

Voice Properties

Turn Rhand to theR “zoom − in”,Lhand to the L“zoom − out”

Two Hands

“ShSh...”

Voice Properties



rectness+ Easy to use− Not natural− Hard to detect

sound

23

Table 2.8: Zoom Design (Gesture command - Speech parameter)

GestureCommands


Z11

Take distancebetween index &

thumb“zoom − in”, get

close “zoom − out”

0ne Hand

Say amount ofzoom

Word Recognition

Open Hand

Two Hands


+ Natural+ Reliable− Posture detection

is difficult− Not enough space

for accuracy− Not capable of cor-

rectness(differentcommand for zoomin & zoom out)

− Need more time toexecute

Z12

Clench your handsand Take

distance“zoom − in”,get close

“zoom − out”

Two Hands

Say amount ofzoom

Word Recognition

Open Hands

Two Hands




users− Not capable of cor-


Z13

Double-tap in theair with two fingers

one hand

Say amount ofzoom

Word Recognition

Open Hand

one hand

+ Natural+ Easy to detect

speech+ Less tiring− Not accurate due to

gesture recognition

Z14

Take distancebetween index &

thumb“zoom − in”, get

close “zoom − out”

One Hand

“Mmmm...”

Voice Properties

open hand

One Hand


+ Natural+ Reliable− Posture detection

is difficult− Not enough space

for accuracy− Not capable of cor-


− Need more time toexecute

24

GestureCommands


Z15

Clench your handsand Take

distance“zoom − in”,get close

“zoom − out”

Two Hands

“Mmmm...”

Voice Recognition

Open Hands

Two Hands




users− Not capable of cor-


Z16

Double-tap in theair with two fingers

One Hand

Change the tone ofvoice increasing“zoom − out”,

decreasing“zoom − in”

Voice Properties

Open Hand

One Hand

+ Natural+ Easy to detect

sound+ Less tiring− Not accurate due to

gesture recognition

Table 2.9: Classification of the proposed solutions

Command Complementarity Redundancy

Selection S1,S2,S4,S6,S7 S3,S5

RotationR6,R7,R8,R9,R10,R11,R12,R13

R14,R15,R16,R17,R18R1,R2,R3,R4,R5

ResizingZ1,Z2,Z3,Z4,Z5,Z6,Z7,Z8,Z9

Z10,Z11,Z12,Z13,Z14,Z15,Z16

25

2.3.4 Design Space Analysis

So far in this section we have proposed different types of multimodal solutions usinggesture and speech modalities for three main command of selection, rotation andzoom. These solutions can be divided into two main categories:

• function input by speech modality and parameter input using gesture modality.

• function input by gesture modality and parameter input using speech .

Furthermore, we considered two different ways of the implementation for eachmodality. Gestures can be defined by means of one hand or two hands. Similarly,speech modality can be used based on word recognition or acoustic features. There-fore, altogether eight different sets of multimodal solutions have been suggested. Inprevious sections, we compared these solutions in terms of different criteria such asuser effort, ease of recognition, continuity and dynamic range of operation. As statedin the introduction, the best multimodal solution in terms of the user effort will beselected to be implemented. In the following, we discuss guidelines for choosing theproper solution.

Regarding how the speech modality is applied in the human-machine interaction,we refer to work reported in [22], where the authors compared three categories ofusing speech modality, namely one-word recognition, multiple-word recognition andnon-speech inputs (equivalent to acoustic properties in this project). The resultsobtained from the user evaluation of these three options showed that speech modalityby one-word recognition is significantly high ranked for acceptability than two other.Accordingly, we decided to drop the solutions which are based on voice properties.

On the other hand, since the project is founded on the approaches with reduceduser effort, it is intuitively logical to give the preference to one-hand gestures ratherthan its two-hand counterparts. Consequently, the solutions based on two-handgestures were eliminated from the candidate solution list for the implementation.

After eliminating two-hand gesture options as well as voice-properties-basedspeech solutions, only two possible ways for modality fusion remain valid:

• function input by word-recognition-based speech modality and parameter in-put using one-hand gesture.

• function input by one-hand gesture modality and parameter input using word-recognition-based speech .

Typically, gesture recognition requires less time than word recognition. On theother hand, gesture can provide the user with a better range of variations. Therefore,gesture modality can be in general a more flexible tool than speech in definingand adjusting the parameters of a given command. For these reasons, we decidedto implement the multimodal solution which uses word-recognition-based speechand one-hand gesture modalities for conveying the function and parameter inputs,respectively. In the next section, we discuss the implementation procedure, and thenthe implemented multimodal set will be compared to an existing unimodal set interms of efficiency and user effort.

26

Table 2.10: Selected proposed Design


ParametersRelease

“Selection”

Word Recognition

Point right fingeron the object.

One Hand

No Release Needed

“Dragging”

Word Recognition

Point right fingeron the object andmove right hand

One Hand

“Dropping”

Word Recognition

“Rotation”

Word Recognition



“Rotate − right” orcounterclockwise“rotate − left”

One hand

“Release”

Word Recognition

“Zooming”

Word Recognition




“zoom − out”

One hand

“Release”

Word Recognition

27

Chapter 3

Implementation

Contents3.1 Kinect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1.1 PC Driver Support . . . . . . . . . . . . . . . . . . . . . . 29

3.2 Development Kit . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2.1 Using Candescent NUI for Gesture . . . . . . . . . . . . . 31

3.2.2 Using Microsoft SDK for Speech . . . . . . . . . . . . . . 31

3.3 Speech Implementation . . . . . . . . . . . . . . . . . . . . 33

3.3.1 Speech Recognition Guidelines . . . . . . . . . . . . . . . 33

3.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.4 Gesture Implementation . . . . . . . . . . . . . . . . . . . 36

3.4.1 Extra Gesture Implementations . . . . . . . . . . . . . . . 37

3.5 Integration Summary . . . . . . . . . . . . . . . . . . . . . 38

3.5.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.5.2 Move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5.3 Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5.4 Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

This chapter begins with a brief introduction on Microsoft Kinect hardwareand features as well as the available development kits. Then, gesture and speechimplementation is presented and the challenges faced during the development arediscussed.

28

3.1 Kinect

Some of Kinect features have been briefly described in the introduction. In thissection, Kinect hardware and its features are given in more detail.

The Kinect sensor generally includes an RGB camera, a depth sensor and amulti-array microphone, and is capable of providing the user with full-body 3Dmotion capture, facial recognition and voice recognition. Accordingly, the users donot need to wear or hold something special. Since the depth sensor uses the infraredtechnology, the recognition can be performed even in the darkness. On the otherhand, microphone array can be used for acoustic source localization and ambientnoise suppression. It is worth noting that the accuracy and usability of Kinectdepend completely on the associated software running on the host machine as wellas its drivers[15].

Figure 3.1: Anatomy of Microsoft Kinect [10]

The sensing range of the depth camera (Shown in figure 3.2) can be adjusted bythe software. The physical limits of the sensing range are typically between 80 cmto 400 cm, while the sweet spot is often defined in the range 120 cm to 350 cm. Infarther distances, the device is still capable of retrieving the depth data, but withhigher noise levels, so the reliability is decreased [23].

In addition to default modes (Shown figure 3.3 ), Kinect also benefits from anear mode depth range to provide depth data in the distances closer to the device,namely in the range of 40 cm to 300 cm. The sweet spot of the near mode lies in thedistances between 80 cm and 250 cm. The horizontal and vertical angles of vision forcolor and depth sensors are 57.5 degrees and 43.5 degrees, respectively. Moreover,the devices can be tilted in the vertical plane by -27 degrees to +27 degrees [23].

3.1.1 PC Driver Support

First, two companies of OpenNI and PrimeSense released their Kinect open sourcedrivers. Then, Microsoft released PC drivers for the Kinect along with its non-commercial Software Development Kit (SDK). Using Microsofts SDK, developers are

29

Figure 3.2: Vision’s angle of Mirosoft Kinect [21]

Figure 3.3: Near mode and default mode of Mirosoft Kinect [21]

able to build Kinect-enabled applications in Microsoft Visual Studio using differentlanguages such as C++ and C# [15].

Another set of drivers for Kinect is provided by the OpenKinect (libFreeNect)open source project. Affordances of these three sets of drivers are different. AlthoughMicrosoft SDK for Windows does not require a calibration pose and provides speechrecognition, it is restricted to skeletal tracking and finger and hand gesture recogni-tion or hands-only tracking is not supported. On the other hand, although OpenNIneeds a calibration pose and lacks advanced audio processing, its drivers are moreflexible in hand/finger gesture recognition [15].

30

3.2 Development Kit

In this project, we use Microsoft Kinect SDK 1.7 for both voice and gesture recogni-tion since it is faster and smoother compared to its earlier versions. Since a goal ofthe project is the use of simple hand gestures, data of the hand position and its ges-ture suffices and recognition of all fingers is not necessary. Accordingly, CandescentNUI library is used which allows easy access to such information.

Candescent NUI library has been developed by Stefan Stegmueller in C#, work-ing with both OpenNI and Microsoft Kinect SDK. This library provides developerswith useful information about the hand and fingers, and also supports two-handrecognition. This library is based on the recognition of the hand. Therefore, whenan object other than hand is placed in front of the device, the library fails in recog-nizing its features and slows down the program.

3.2.1 Using Candescent NUI for Gesture

In order to access the features of Candescent NUI library in gesture recognition, thefollowing dynamic link libraries (DLLs) are required to be added in the referencesof the project [24]:

• CT.NUI.Core

• CCT.NUI.HandTracking

• CCT.NUI.KinectSDK

• CCT.NUI.Visual

To use the SDK, the code snippet is:IDataSourceFactory dataSourceFactory = new SDKDataSourceFactory();Then the thread of the hand data source should be created and started. The

minimal code to create a HandDataSource is:var handDataSource = new HandDataSource(dataSourceFactory.CreateShapeDataSource());

handDataSource.Start();Own parameter objects are passed:var handDataSource = new HandDataSource(dataSourceFactory.CreateShapeDataSource(new

ClusterDataSourceSettings(), new ShapeDataSourceSettings()), new HandDataSource-Settings());

3.2.2 Using Microsoft SDK for Speech

To add speech recognition capability to any application, it is preferred that thefollowing packages are installed in this order :

• Microsoft Speech Platform - Software Development Kit (SDK) (Version 10.2)

• Microsoft Speech Platform - Server Runtime (Version 10.2)

• Kinect for Windows Runtime Language Pack

31

After installing the Kinect SDK for Windows and plugging in the Kinect to aUSB port on the PC, it can be ensured from the installation by checking DeviceManager in Control Panel:

SpeechRecognitionEngine Class Provides the means to access and manage aspeech recognition engine.

SpeechRecognitionEngine SpeechEngine;To get information about recognizer which are instlled on the current system,

should use InstalledRecognizer method:SpeechRecognitionEngine.InstalledRecognizers();using LoadGrammar(g) method to load grammer to the application:SpeechEngine.LoadGrammar(g);

32

3.3 Speech Implementation

3.3.1 Speech Recognition Guidelines

There are two listening models for speech recognition with Kinect:

• Active listening: the sensor is always listening for all the defined words orphrases. It works well in the case of very small numbers of distinct words orphrases. However if this condition is not satisfied, the probability of havingfalse activation becomes high. In this mode, the recognizer functionality alsodepends on how frequent the user is speaking during the application running.If the application needs to listen to the environment continuously, its RAMrequirement grows. Therefore, it is advised that for long recognition sessionsthe sensor should be recreated every 2 minutes. In this project, we use thislistening model. For improving the recognition performance, speech enginewill be recreated at the beginning of each level [23].

• Keyword/trigger: the sensor only listens for a single word. When it hears thatword, it listens for additional given words or phrases. This model is consideredas the best way for reducing false activations. The keyword should be verydistinct so that it is not easily misinterpreted.

On the other hand, generally the developer should take into account severalfactors in defining words to be used in the application [23].

• Distinct sounds: Avoid alliteration, same rhythm, common syllable lengths,common vowel sounds, and using the same word in different phrase.

• Brevity: Try to keep phrase short, less than 6 words.

• Word length: Use of one syllable word cause error recognition, because theyare likely to overlap with others.

• Simple vocabulary: Use common words where possible for a more naturalfeeling experience and for easier memorization.

• Reduce false activation: if a specific word always is falsely recognized and needto repeat several times, try to find a new way to describe it.

Regarding the ambient noise, it is worth noting that the sensor focuses on theloudest sound source and attempts to eliminate other ambient noise. Therefore,other conversation around can reduce the accuracy of the speech recognition. More-over, Kinect just cancel out the monophonic sounds, such as a system beep, but notstereophonic. So, if there is a stereophonic sound like music, the speech recognizermay be unsuccessful in proper recognition [23].

Finally, it should be mentioned that there are four microphone arrays withhardware-based audio processing. It includes multichannel echo cancellation (MEC),sounds position tracking and other signal processing (noise suppression and reduc-tion). The audio level depends exponentially on the distance of the user to Kinect

33

(show in figure 3.4). Therefore, in farther distances the audio level fall below theacceptable threshold, resulting in unreliable recognition and demanding significantlyloader speech.

level.png

Figure 3.4: Distance of user agains audio level[21]

Kinect sensor can detect audio input from -50 degree to +50 degree from right toleft in the horizontal plane (shown in figure 3.5).The microphone array can point at10 degree incrementing steps within 100 degree range of detection. This informationis useful to find the angle of the source. It can help to focuse on speific user butcannot reduce the effect of ambient noise. Any voice out of this range cannot bedetected [23].

Figure 3.5: Audio input [21]

3.3.2 Implementation

As mentioned before, after initializing the speech recognition, it is required to createand load grammar. The syntax of the grammar format can be presented in twoforms: an Augmented BNF (ABNF) Form and an XML Form [ref]. These twoforms are specified in such a way that semantic performances of the grammars areidentical, therefore, the automatic conversion between the two forms is allowed. Inthis project, the XML form syntax (version 1.0) is used which uses XML elementsfor the representation of the grammar. It constructs and adapts designs from thePipeBeach grammar, TalkML and a research XML variant of the JSpeech GrammarFormat [25].

In general, in order to make a raw text transcription of the detected input, aspeech recognizer matches the audio input against the grammar. Speech recogniz-

34

ers can also perform subsequent processing of the raw text to produce a semanticinterpretation of the input. In this grammar semantic interpretation is used as well.

Once the recognizer detects an input speech, the detection event is raised up.The processing of the input takes usually about 1-2 seconds where the recognizercompares the input with each defined phrase in the grammar. When the recog-nizer receives input that matches a grammar, the speech recognizer can raise itsSpeechRecognized event. However, if the recognizer determines that the input doesnot match with sufficient confidence any of its loaded and enabled Grammar objects,it raises the SpeechRecognitionRejected event.

One of the speech Recognitions properties is confidence which do not indicate theabsolute likelihood that a phrase was recognized correctly. Instead, confidence scoresprovide a mechanism for comparing the relative accuracy of multiple recognitionalternates for a given input. This facilitates returning the most accurate recognitionresult. Its value is between 0 to 1.

The processing result is a confidence value associated with each defined phrase.Generally, the recognizer also returns the recognized phrase with the highest confi-dence as the recognition result. A confidence threshold level can be determined bythe developer within the range of 0 and 1 to limit recognition of the input with lowconfidence values. The choice of this threshold is in practice a trade-off between therate of false positive results and the user convenience. In the present application, aconfidence value of 0.3 is selected as the minimum required value for the approvedrecognition. Fig. 3.6 schematically depicts the speech recognition process.

Figure 3.6: Speech Recognition

3.3.3 Challenges

One of the challenges when using the Kinect speech recognition engine is relatedto the difficulty in proper recognition of some commands, especially the ones withone or two syllables such as zoom. To tackle this problem, two possible ways havebeen proposed: using the alternate phrases with more syllables for instance adding aprefix or word, and adding some words with almost the same pronunciation (pseudo-homophones) to the grammar. For instance, in the case of zoom, one can useperform zoom or zooming to increase the syllables of the speech function and thereby

35

enhance the recognition confidence. On the other hand, one may add some pseudo-homophones for the function in the grammar such as zum,zoum,zoon,zoo,etc.

Moreover, it seems that the default gain setting of the Kinect microphone array(with a level of 100 on a scale of 100) is not optimal. This causes that the proba-bility of false positives is increased while resulting in clipping input audio signals.Therefore speech recognition quality is severely affected when users are speakingclose to the Kinect sensor. Setting the microphone gain level to a much lower valuecan improve the speech recognition performance to some extent:

• In the Windows Control Panel, select the Recording devices tab of Sound.

• Select Kinect Microphone Array as the default recorder, then open its Prop-erties

• In the Levels tab, set the Microphone Array gain level to the preferred value(about 20-30 in the current application)

There are some other recommendations to improve the recognition quality. Asan example, it is suggested to disable the automatic gain control property of therecognizer, i.e. AutomaticGainControlEnabled = false (in the last SDK version, thisproperty is by default disabled). Other properties affecting the speech recognitionquality include noise suppression and Echo Cancellation; both are recommended tobe disabled[26].

By default, the AdaptationOn flag in the speech engine is active, which meansthe speech engine is actively adapting its speech models to the current speaker.This can cause problems over time in noisy environments or where there are a greatnumber of speakers (in a kiosk environment, for example). Therefore we recommendthat the AdaptationOn flag be set to OFF in such applications[26].

Moreover, if the speech engine is run continuously, its RAM requirement grows.It is better , for long recognition sessions, that the SpeechRecognitionEngine berecreated every 2 minutes. So, to improve code quality, speech engine recreated atthe beginning of each level[26].

3.4 Gesture Implementation

As stated earlier, the selected multimodal set uses gestures for conveying the com-mand parameter inputs. These gestures are chosen to be very close to the iconicgestures in the previous work [17]. The first reason is that the iconic gestures aremore natural, so are generally more understandable and easier to remember for typ-ical users. On the other side, the technological gestures look closer to the machineside. Since in the previous work it has been shown that the iconic gestures aremore intuitive for the users, we decided to select this type of gestures. Moreover,since the multimodal implementation is compared to the existing unimodal set inthe laboratory (i.e. the work in [17]), it is reasonable to focus on replacing thegesture activating commands with the speech ones. Therefore, we used the sameiconic gestures for parameterizing as the previous work, but adapted them to one-hand gesture implementation. For the selection command, the index finger tip is

36

used for hovering the object which the user intends to select/deselect. For the movecommand, the index finger tip is used for hovering the object as well as for changingthe object position. For the rotation and resizing commands, the index finger tipis used for hovering the object. In addition, we use step by step rope technique forthe rotation and the resizing, discussed completely in [17], using the index fingertip which is simple to understand and to operate. In Section 3.5, we show how thespeech and gesture input are combined to perform commands.

3.4.1 Extra Gesture Implementations

3.4.1.1 Finger Type Detection

Candescent library provides basic information of the hands including the fingertippoints, but the finger type is missing. In this project, we developed a techniquewhich reliably assigns the finger type to each finger. When the system recognizes allfive fingers, the positions of the fingertips and the palm are provided. By connectingthe fingertips to the palm, one can calculate for each finger the relative angle of otherfour fingers. Therefore, four angles αij are assigned to each finger i. The polarity ofsin(αij) determines the finger type. The thumb is the finger i such that sin(αij) isnegative for all j. the type of other fingers is also determined in the same manner:

• Thumb: sin(αij) is negative for all j

• Index: sin(αij) is negative for 3 values of j

• Middle: sin(αij) is negative for 2 values of j

• Ring: sin(αij) is positive for 3 values of j

• small: sin(αij) is positive for all j

3.4.1.2 Selection

A gesture for the command of selection is to hover the hand on the object for 1second. This gesture is easy to implement and easy to recognize as it needs thatthe recognizer detects all five fingers and calculates a hovering time for this gesture.The hand gesture should be upward as the angled hand gesture cannot activate thecommand. Closing the hand release the command.

3.4.1.3 Rotation

Similar to the command of selection, an intuitive gesture for the command of rotationis turning the hand to the desired direction. Turning the hand to the right orleft direction by more than 30 degree activates the rotation to the right or left,respectively. This gesture is also easy to implement.

37

3.4.1.4 Zoom

Similar to the hand gesture for zooming in touchscreen devices, we propose to use thegesture of two fingers (thumb and pointing finger) for the command of zoom. If twofingers approach each other or get farther away, zoom out or zoom in are performed,respectively. The lack of enough range for spreading or closing the fingers is one ofthe drawbacks.

3.4.1.5 Summary

Table 3.1: Extra Gesture Summary

Function Number of finger Command Release

Selection Five fingers

Hold Hand 1 Sec Close Hand

Rotation Five fingers

Turn hand to theright/left Close Hand

Rotation Two fingers

Take distancebetween thumb

and index finger orget close

Close Hand

During the implementation step, we realized that the above-mentioned gestures(summarized in Table 3.1 ) may be tiring for the users. Especially, in the case ofperforming rotation to the left since turning the right hand to the left for a whileneeds high effort of the user. Furthermore, the gesture for zoom is not flexibleand accurate due to lack of sufficient space between two fingers. Moreover, anotherissue is raised since the gesture for zoom in and zoom out are different. In thissituation, generally switching between these two functions will be time-consuming.Consider the case where a user is interested in performing zoom out up to a desiredamount. If the user passes the desired value, he has to release the current function,and use the function of zoom in to reach the goal, resulting in a tiring and time-consuming procedure. Therefore, it is often wiser to convey the direction of zoomin the parameter input rather than in the function one.

3.5 Integration Summary

Fig. 3.7 illustrates the implementation architecture for the fusion of speech andgesture inputs. In general, the fusion of different modalities can be performed attwo levels: feature level or early fusion and decision level or late fusion where multiple

38

modalities are fused in the semantic space. In the late fusion approach, the analysisunit of each modality first provides the local decision based on individual features,then the local decisions are combined using a decision feature unit. Generally, thelate fusion strategy is easier and more scalable and flexible. Therefore, we decidedto use this strategy in the implementation of the multimodal set. As mentionedbefore, speech and gesture inputs are received using the Kinect sensors. Kinectprovides raw image frames with 30 frames per second and audio stream to the KinectSDK. The SDK processes the raw data. In the gesture path, the hand and fingerdata is obtained from Candescent NUI. The index finger tip information is used forpointing as well as for conveying the parameter inputs. To provide smoothness ofthe gesture, the polling model every 20 milliseconds is used. In the speech path theevents provided by the recognizer are used for handling speech recognition.

Figure 3.7: Implementation architecture for integrating speech and gesture.

In the following, we discuss the integration of speech and gesture inputs in theimplementation of different commands.

3.5.1 Selection

Fig. 3.8 shows the state diagram for integrating the gesture and the speech inputsin the selection command. The gesture and speech transitions are shown using blueand red arrows, respectively. The diagram is simplified by removing the states forerror counting.

Assuming that there is no hand detection yet, the system starts in state A. Assoon as the hand is recognized by Candescent NUI, the system goes to state B andwaits until the user hovers the object to proceed to state C. In this state, if the speechinput is detected, the system goes to the state D, waiting for the results of the speech

39

recognition. If the speech ‘selection’ command is successfully recognized and a highconfidence level is obtained, the selection procedure is completed by reaching to stateF. However, in the case of failed, false and low confidence recognition, the programcomes back to state C and waits for the next speech input. As shown, during thespeech recognition, if the user points outside the object area, the program goes backto state B regardless of the speech recognition result. To release the object, theprocedure is the same, but instead of the term ‘selection’, ‘release’ must be used.

Figure 3.8: Fusion of speech and gesture: State diagram of selection

3.5.2 Move

The state diagram for the move command is shown in Fig. 3.9. The procedureup to the state D is the same as the selection. By successful speech recognition of‘dragging’, the program goes to state E and the move command is activated. In thisstate, the pointing gesture i.e. the movement of the fingertip causes the movementof the object. By successful recognition of the word ‘dropping’, the move commandis disabled and the object is dropped at its last position.

Figure 3.9: Fusion of speech and gesture: State diagram of Move

3.5.3 Rotation

The state diagram shown in Fig. 3.10 shows the fusion of two modalities for therotation command. After successful word recognition of ‘rotation’, the rotation com-

40

mand is activated in state E and a screen for parameterizing appears. As mentionedbefore, the rope technique is used for conveying the angle data where a connectingline between the fingertip and the fixed point is drawn to have a better visual under-standing. As soon as the rope is created, the system goes to state F, therein turningthe index finger tip around the fix central point determines the rotation amount. Ateach polling step, depending on the turning direction (clockwise or anticlockwise)and the turning amount the object remains unchanged or rotates ±10 degrees. Theoperation is terminated by using speech command of ‘Release’.

Figure 3.10: Fusion of speech and gesture: State diagram of rotation

3.5.4 Resizing

Fig. 3.11 shows the state diagram for the fusion of two modalities in the resizingcommand. It is very similar to the rotation state diagram, but there are two dif-ferences. First, successful word recognition of ‘zooming’ activates the command.Second, at each polling step, depending on the turning direction and amount, theobject remains unchanged or resized by ±6.6%. The operation is terminated byusing speech command of ‘Release’.

41

Figure 3.11: Fusion of speech and gesture: State diagram of resizing

42

Chapter 4

Evaluation

Contents4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.2 Test Application . . . . . . . . . . . . . . . . . . . . . . . . 44

4.3 Range of testers . . . . . . . . . . . . . . . . . . . . . . . . 47

4.4 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.5 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . 48

4.5.1 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.5.2 Move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.5.3 Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.5.4 Resizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.5.5 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.5.6 Summary of Speech Recognition Performance . . . . . . . 69

4.5.7 Quantitative Result Summary . . . . . . . . . . . . . . . . 70

4.6 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . 72

4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.7.1 General Points . . . . . . . . . . . . . . . . . . . . . . . . 75

4.7.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.7.3 Move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.7.4 Rotation and Resizing . . . . . . . . . . . . . . . . . . . . 76

4.7.5 Final . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

This chapter first provides necessary information about the test application,testers and the questionnaire for qualitative assessment. Then, the data obtainedfrom the unimodal and multimodal experiments are presented, analyzed and dis-cussed.

43

4.1 Introduction

In this chapter, we evaluate the performance of the developed multimodal set com-pared to the existing unimodal one. The user evaluation in this work includes bothqualitative and quantitative assessments. Paired t-test is used for quantitative eval-uation due to its greater power compared to other kinds of t-tests. This impliesthat the test is performed twice by each user, once using the unimodal set, i.e.using only hand gestures, and other time using the multimodal set, i.e. using thecombination of speech and gesture. Therefore, to prevent the learning effect on theresults and to control user bias impact, the users participating in the evaluation aredivided into two groups. The first group starts the test with the multimodal set andthen performs the test on the unimodal set. The second group does the opposite,namely, performs the test first on the unimodal set and then on the multimodalset, see Table 4.1. After carrying out the test on each of the two sets, the usersqualitatively assess the set through the provided questionnaire. The questionnaireincludes several items about the overall performance, fatigue, and cognitive load.

The time required for completing the whole test is 45-60 minutes per user, de-pending on how much the user is familiar with Kinect.

Table 4.1: Testers grouping

Groups Number of user First experimentSecond

experiment

Group 1 5 Multimodal Unimodal

Group 2 5 Unimodal Multimodal

4.2 Test Application

The test application has been fully described in Chapter 7 of [17]. For the sake ofcompleteness, we here summarize its main features. The test application includes6 levels of activities: training, selection, move, rotation, resizing, and final. At theend of each level, a log file is created to store the test data. The structure of thelog differs between the two sets. For the unimodal set, the recording data includesthe history of commands and the time spent on each task. On the other hand, inthe log of the multimodal set for each task at each level the following informationare additionally stored:

• Number of speech detection

• Number of speech recognition

• Number of low confidence recognition

• Number of failed recognition

• Number of user error in using the speech command before pointing the object.

44

The test application begins with a set-up form where the users enter their names.By clicking on the Start button, the test will start.

Figure 4.1: Progression of level path

TrainingIn the training level,users are allowed to employ all commands on multiple objectsprovided in the test screen in order to get familiar with the environment. Theminimum duration of this level is ninety seconds, while users can spend more timefor practicing the commands.

SelectionThere are twelve consecutive tasks in this level. In the unimodal set the user mustunselect the selected object, whereas in the multimodal set selecting the objectsuffices.

MoveThe move level provides twelve tasks. The user should drag the object, move anddrop it to blue areas which its position is varied at each task.

Figure 4.2: move level demonstration

RotationThere are eight tasks where only the rotation command is evaluated. The user mustrotate the object and match it into the predefined target. At each task, the objectand the target rotation angle is changed. Users are able to rotate the objects inclockwise or counterclockwise directions.

ResizingThe resizing level works exactly like the rotation level. There are eight tasks, rangingfrom some little zooms in and out for testing the accuracy and several large ones forevaluating the speed.

45

Figure 4.3: rotation level demonstration

Figure 4.4: resizing level demonstration

FinalAll commands are employed together in this level. There are three tasks. At eachtask, to match the object into its target,it must be moved, rotated and resized. Thethree commands can be used by the user in any arbitrary order.

Figure 4.5: Final level demonstration

FeedbackThe feedback structure of the test application has been described in Section 7.8 of[17]. In this work, the feedback in the unimodal set remains unchanged compared tothe original set, whereas in the multimodal set we add some additional feedback toinform the user of the speech recognition state. When the speech input is recognizedwith an appropriate confidence level, the recognized function is shown on the object.When the confidence level is less than the given threshold, the expression of “Repeatagain...” is displayed on the object.

Figure 4.6: Feedback demonstration

46

Fusion of ModalitiesIn the move, rotation, resizing, and final tasks, we have to modify the state diagramsshown in the previous section in order to insert the concept of reaching goal at eachtask. Fig. 4.7 shows how the speech and gesture inputs are combined in the testapplication tasks. It can be divided into two parts. The first part (i.e. states A,B, C, D, E) is the command activating part which is very similar to what alreadyshown in the previous section. The second part (i.e. states E, G, K, J, F) relatesto parameterizing and releasing the command, and shows how speech and gestureare combined to satisfy the goal at each task. The main point, here, is that onlywhen the goal is reached, the high confidence recognition of ‘release’ can completethe task. Otherwise, if the goal is not reached, the system goes back to state Cwhere the command is not activated.

Figure 4.7: Fusion of speech and gesture: State diagram for move, rotation, resizing and finaltasks.

4.3 Range of testers

The people who are familiar with Kinect and Xbox are selected for the evaluation.All the users are male, working in scientific areas, and between 27 and 32 years old.To avoid the effect of different accents on the results, all users are selected from thesame country, in this work, from Iran.

4.4 Questionnaire

At the end of each experiment, users fill the questionnaire for assessing the qualita-tive performance of the set. The questionnaire is composed of three parts. The firstpart is related to the overall performance of the commands. The second part focuseson the fatigue felt. The last part discusses the cognitive load of the commands toclarify the user perception in interacting with computer using each of the two sets.In other words, this parameter shows how much the user has to mentally concentrate

47

in order to perform a given task. Each item in the questionnaire is rated from 1 to7, see Appendix for more detail.

4.5 Quantitative Analysis

The quantitative analysis is performed using the data recorded in the log file of eachset. As stated earlier, the log file contains the necessary information for interpretingthe performance such as the time spent, number and type of recognized functionsfor performing each task. For the multimodal set, the user errors and number ofspeech detection and recognition is also stored in the log file. MATLAB has beenused for extracting the data from the log files and exporting to Microsoft Excel.

In the following, we statistically analyze the data obtained from performing testson unimodal and multimodal sets. The Student’s t-test is used to see if the differencebetween two technologies is significant or not.

4.5.1 Selection

4.5.1.1 Task Timing

Fig. 4.8 and Fig. 4.9 show the average time spent by the users for each task in twotechnologies. In the case of the multimodal selection, the variations of the averagetime are limited, i.e. in the average sense the time taken for the completion of eachselection task of the multimodal set is approximately constant. On the other hand,in the case of the unimodal set, the average time follows a decreasing trend in thefirst four tasks and then remains approximately constant over the next tasks. Thisimplies that the learning of the users has a significant effect on the execution speedof the unimodal selection command, while the multimodal selection command is notaffected by user experience.

Figure 4.8: Unimodal selection: Average time vs. task.

Figure 4.9: Multimodal selection: Average time vs. task.

48

Figure 4.10: Comparison of unimodal and multimodal selection: Average time vs task.

4.5.1.2 Tester Timing

It is also informative to compare the total average time spent by each tester forcompleting the unimodal and multimodal selection activities. No specific correlationis seen between the two curves. More specifically, the difference between two averagetimes is small for all users except for the second and forth users. The testers 1 to5 were of the group 1 and accordingly started the experiment with the multimodal,while the testers 6 to 10 were of group 2 and first performed the unimodal selectionactivity. As seen, the multimodal selection is slightly faster than the unimodal onein both groups, perhaps due to the fact that it does not require object unselection,see Fig. 4.11.

Figure 4.11: Comparison of unimodal and multimodal selection: Average time vs tester.

49

4.5.1.3 Statistical Comparison

The statistical parameters related to the average time of the multimodal selectionare as follows:

Mean=4.309 Low=3.015 High=7.989First Quartile=3.461 Median=3.952

Third Quartile=4.594727065 STD=1.402Variance=1.966 95% confidence interval=1.003

In the following, the statistical parameters related to the average time of theunimodal selection are given:



The mean and the median of the average time for the multimodal set are slightlyless than those of the unimodal set, whereas the standard deviation of the unimodalaverage time data is lower.

Fig. 4.12 compares the box plot of the average time for the unimodal and multi-modal selection commands. It is observed that the multimodal technology statisti-cally performed better than its unimodal counterpart in terms of the median. Bothunimodal and multimodal data sets have the same degree of dispersion (spread).There is one outlier in both data sets.

Figure 4.12: Comparison of unimodal and multimodal selection: Average time box plot.

Fig. 4.13 depicts the box plots of the average time for each group of the users.It seems that the users in group 2 generally performed faster than the other usersin group 1. This difference is more pronounced in the multimodal case where thegroup 2 had the experience of the unimodal selection.

50

Figure 4.13: Comparison of unimodal (left) and multimodal (right) selection: Average time boxplot per group.

The t-test analysis is applied to the average time data of the users in order todetermine if there is a significant difference between the two technologies. Applyinga two-tail paired t-test, a p-value of 0.49 is obtained. Therefore,the null hypothesisis rejected. The p-value is too big for claiming that there is a significant differencebetween the unimodal and multimodal selection commands, see Table 4.2.

Table 4.2: Selection: Times t-test table

Type/Testers 1 2 3 4 5 6 7 8 9 10 t-test

Multimodal 4.54 3.90 3.24 7.98 4.29 3.99 3.01 3.53 4.75 3.81 0.49Unimodal 4.98 7.36 4.43 4.58 4.20 4.27 3.38 5.43 3.38 5.21

4.5.1.4 Errors

In the multimodal experiment, the errors can be divided into two groups. First, theuser errors which include pointing out of the object during speech recognition, usinginappropriate or incorrect speech command, and additional tries for completingthe task. Second, the system errors which include false and low confidence speechrecognition. Of course, the user also can result in low confidence recognition (forinstance by improper pronunciation), but is is less likely. Similarly, in the unimodalexperiment there are user and system errors. Pointing out of the object during theleft-hand gesture, the use of inappropriate gesture command, and additional triesfor completing the task are considered as the user errors, while the false gesturerecognition is a system error. Since separating the system and user errors of theunimodal experiment is not possible from the log (particularly distinguishing thefalse recognition and inappropriate command is not feasible using the log data), wedecided to compare two experiments in terms of the total error.

Fig. 4.14 compares the number of the errors for each user in the two experiments.It is obvious from the graph that the performance of the multimodal technology is

51

better in terms of the errors. The t-test analysis on the errors results in a p-value of0.061, very close to the threshold of 0.05. Thus it can be somehow concluded thatthe multimodal selection causes less errors compared to the unimodal selection, seeTable 4.3.

Figure 4.14: Comparison of unimodal and multimodal selection: Total error vs. user.

Table 4.3: Selection: Errors t-test table

Type/Testers 1 2 3 4 5 6 7 8 9 10 mean std t-test

Multimodal 2 0 13 0 0 0 0 1 1 2 1.9 3.99 0.061Unimodal 11 5 4 4 2 3 4 5 4 15 5.7 4.05

4.5.1.5 Effective Index of Difficulty

For evaluating the performance of computer pointing devices, uniform guidelines andtesting procedures have been established by ISO 9241-9. Specifically, this standardintroduces a useful metric referred to as Fitt’s Index of performance (or through-put) for comparing both the speed and the accuracy of the user performance. Thethroughput in bits per second is defined as

Throughput =IDe

MT=

log2( DWe

+ 1)

MT=

log2( D4.133SD

+ 1)

MT

where MT is the mean movement time (in seconds) for all trials within the samecondition, IDe is the effective index of difficulty (in bits), D is the distance to thetarget, We is the effective width of the target, and SD is the standard deviationof the selection distances to the target center. In the test application, the D hasbeen chosen equal to 400 pixels. Using the data in the log file, we can calculatethe standard deviation of the selection distances SD as well as the effective index ofdifficulty IDe. Table 4.4 and Fig. 4.15 show IDe values obtained for each task of theunimodal and multimodal experiments. The variations of the multimodal IDe is lesscompared to that of the unimodal experiment. Accordingly, Fig. 4.16 depicts thethroughput of the multimodal and unimodal experiments as a function of the tasknumber. It is clearly observed that in the case of the unimodal set the throughput

52

shows an increasing trend. Table 4.5 gives the average throughput values of bothexperiments. There is a small difference between the two values.

Table 4.4: Effective index of difficulty (bits)

Type/Task 1 2 3 4 5 6 7 8 9 10 11 12

Multimodal 3.5557 3.1019 3.5982 2.9897 3.9480 3.0641 3.8913 3.7310 3.6476 3.2326 3.8272 3.3467Unimodal 3.2715 3.2561 3.6132 4.2315 3.7186 3.8268 2.9283 3.7712 4.4026 3.3622 4.4422 4.1783

Table 4.5: Average Throughput (bits/s)

Type Throuthput Avg.

Multimodal 0.8201Unimodal 0.8700

Figure 4.15: Comparison of unimodal and multimodal selection: IDe vs. task.

Figure 4.16: Comparison of unimodal and multimodal selection: Throughput vs. task.

53

4.5.2 Move

4.5.2.1 Task Timing

The average time spent for each task of the move level using the unimodal andmultimodal technologies are shown in Fig. 4.17 and Fig. 4.18, respectively. Sim-ilar to the multimodal selection, the multimodal move has small variations of theaverage time for different tasks. On the other hand, in the case of the unimodalset, the average time follows a decreasing behavior in the first four tasks and thenits variations are nearly limited over the next tasks. This somehow demonstratesthe learning effect during the consecutive tasks on the speed of the unimodal movecommand, the effect which is not seen for the multimodal move command.

Figure 4.17: Unimodal move: Average time vs. task.

Figure 4.18: Multimodal move: Average time vs. task.

Fig. 4.19 compares the unimodal and multimodal move commands in termsof the average time for each task. It is obvious that after the first two tasks theunimodal move command outperforms the other one over the following tasks.


Fig. 4.20 shows the average time of each tester for completing the move activitylevel. It is observed that except one tester all other testers were faster using theunimodal technology than using the multimodal set, probably because of the timerequired for speech recognizer to analyze the speech input. Therefore, it seems thatthe unimodal move is the winner in terms of the execution speed.

54

Figure 4.19: Comparison of unimodal and multimodal move: Average time vs task.

Figure 4.20: Comparison of unimodal and multimodal move: Average time vs user.


In the following, the statistical parameters related to the average time of the multi-modal move are given:



55

The statistical parameters related to the average time of the unimodal move areas follows:



Obviously, the unimodal data has lower mean and median values but higherstandard deviation. In the case of multimodal set, it can be estimated that about68% of the testers completed each task of the move level in the average sense between5.89 and 7.41 seconds. For the unimodal set, this range is from 4.22 seconds to 6.84seconds. Fig. 4.21 depicts the box plot of the average time values for the twomove commands. The differences between the standard deviation and the medianvalues are obvious: the unimodal median is lower while the multimodal data is moreconcentrated.

Figure 4.21: Comparison of unimodal and multimodal move: Average time box plot.

Fig. 4.22 compares the performance of the users in two groups. In the case of theunimodal set, the second group significantly performed better than the first groupdue to lower median and standard deviation values. In the case of the multimodalset, the differences between two groups are less pronounced.

The t-test analysis on the average time values results in a small p-value of 0.011,which is small enough to claim that there is a significant difference between the twomove commands in terms of the operation speed, see Table 4.6. In other words,the t-test shows that the unimodal move command is performed faster than itsmultimodal counterpart.

56

Figure 4.22: Comparison of unimodal (left) and multimodal (right) selection: Average time boxplot per group.

Table 4.6: Move: Times t-test table


Multimodal 7.17 6.65 5.56 7.20 7.20 7.06 5.15 6.70 7.5 6.32 0.01unimodal 5.17 7.90 4.10 6.71 6.80 5.35 4.11 3.89 5.70 5.61

4.5.2.4 Errors

Fig. 4.23 compares the number of the user errors in both experiments. As shownin Fig. 4.23, less errors occurred when the multimodal move command rather thanthe unimodal move command was used.

Figure 4.23: Comparison of unimodal and multimodal move: Total error vs. user.

The t-test analysis on the errors returns a small p-value of 0.016, see Table 4.7.Thus we can conclude the multimodal technology provides better results in termsof the error.

57

Table 4.7: Move: Errors t-test table

Type/Testers 1 2 3 4 5 6 7 8 9 10 mean STD t-test


4.5.3 Rotation

4.5.3.1 Task Timing

Fig. 4.24 and Fig. 4.25 show the average time spent for performing each task of therotation activity in the unimodal and multimodal technologies, respectively. Sincethe desired rotation angle differs for different tasks, the analysis of the average timevalues cannot demonstrate the existence or the lack of the learning effect. Fig. 4.26compares the average time values of the two rotation commands. For the first tasks,the multimodal rotation is faster, but at the end of the activity, i.e. the seventh andeighth tasks the unimodal rotation shows better performance probably owing to thelearning effect in the unimodal approach.

Figure 4.24: Unimodal rotation: Average time vs. task.

Figure 4.25: Multimodal rotation: Average time vs. task.

58

Figure 4.26: Comparison of unimodal and multimodal rotation: Average time vs task.


Fig. 4.27 compares the total average time spent by each tester for completing theunimodal and multimodal rotation activities. As seen, the difference between twocurves is not significant. The variations of the average time values for the multimodalrotation are less compared to the unimodal one. Moreover, while 70% of the testerscould finish each multimodal task in less than 16 seconds (in the average sense),60% of them completed the unimodal tasks in more than 16 seconds.

Figure 4.27: Comparison of unimodal and multimodal rotation: Average time vs user.


The statistical parameters related to the average time of the multimodal rotationare as follows:

59



In the following, the statistical parameters related to the average time of theunimodal rotation are given:

Mean=15.598 Low=9.528 High=20.710First Quartile = 11.805 Median= 16.389Third Quartile= 20.074 STD=4.283

Variance=18.349 95% confidence interval = 3.064

Comparing the statistical data shows that the data associated with the mul-timodal technology has lower mean, median and standard deviation values. The95% confidence interval of the multimodal rotation command is 1.575 seconds, i.e.approximately the half of the confidence interval of the unimodal rotation.

Fig. 4.28 shows the box plot of the average time values of the rotation activity.It is observed that the data of the multimodal set is quite concentrated aroundthe mean value, has the lower median. On the contrary, the data associated withthe unimodal set are spread about the median. There is no outlier in the rotationactivity. The performance of the first and second groups are compared in Fig. 4.29.In the unimodal experiment, the data of the first group has lower median but is quitespread around its median, while the data of the second group is more concentrated.In the multimodal experiment, both groups show the same degree of dispersion, butthe second group resulted in a lower median.

Figure 4.28: Comparison of unimodal (left) and multimodal (right) rotation: Average time boxplot.

60

Figure 4.29: Comparison of unimodal (left) and multimodal (right) rotation: Average time boxplot per group.

The t-test analysis on the average time values returns a p-value of 0.41, see Table4.8. Therefore, the null hypothesis is rejected, i.e. there is no significant differencebetween the mean average time values of the unimodal and multimodal experiments.

Table 4.8: Rotation: Times t-test table



4.5.3.4 Errors

Fig. 4.30 compares the number of the errors occurred in both experiments. As shownin Fig. 4.30, the multimodal technology has a better error handling capability as thenumber of the errors associated with this technology is significantly smaller than thatof the unimodal experiment. The t-test analysis verifies this observation, returninga small p-value of 0.004, see Table 4.9. Thus the null hypothesis is accepted.

Table 4.9: Rotation:Errors t-test table



61

Figure 4.30: Comparison of unimodal and multimodal rotation: Total error vs. user.

4.5.4 Resizing

4.5.4.1 Task Timing

Fig. 4.31 compares the average time values of the two technologies for each resizingtask. Like the rotation activity, there is no link between the time values associatedwith different tasks, because the desired zoom amount of each task differs from theothers. The figure shows that the unimodal resizing is faster specially after firstthree tasks.

Figure 4.31: Comparison of unimodal and multimodal resizing: Average time vs task.


As shown in Fig. 4.32, the unimodal technology is a clear winner when it comesto the comparison of the tester’s average time values. More specifically, all thetesters performed the unimodal resizing tasks in less time than they carried out themultimodal resizing tasks.

62

Figure 4.32: Comparison of unimodal and multimodal resizing: Average time vs tester.


In the following, the statistical parameters related to the average time of the multi-modal resizing are given:



The statistical parameters related to the average time of the unimodal resizingare as follows:

Mean=13.025 Low=9.556 High=16.328First Quartile = 11.816 Median= 13.148Third Quartile= 14.319 STD=2.025

Variance=4.102 95% confidence interval = 1.448

Obviously, the unimodal experiment shows better performance: the mean andthe median of the average time are smaller by about 3.7 seconds and 3.8 seconds,respectively. These observations are confirmed by the box plot shown in Fig. 4.33.As seen, in the unimodal experiment the median is lower and the data are moreconcentrated around the median.

Fig. 4.34 shows that in both experiments the data associated with the secondgroup is more spread and has higher median. However, the superiority of the uni-modal resizing command in terms of the speed is obvious for the both first andsecond groups.

63

Figure 4.33: Comparison of unimodal and multimodal resizing: Average time box plot.

Figure 4.34: Comparison of unimodal (left) and multimodal (right) resizing: Average time boxplot per group.

The t-test analysis on the average time values returns a very small p-value of0.001, see Table 4.10. This indicates that the null hypothesis is true. Therefore, itis confirmed that the unimodal technology performs faster in the resizing activitycompared to the multimodal technology.

Table 4.10: Resizing: Times t-test tables



64

4.5.4.4 Errors

The error definition in the resizing activity is the same as the definition alreadydiscussed in the rotation activity. Fig. 4.35 shows that like previous commands thenumber of the errors in the resizing activity reduces when the multimodal technologyrather than the unimodal one is used. The t-test analysis here returns a p-value of0.007, see Table 4.11. Thus, the superiority of the multimodal resizing in errorreduction is confirmed.

Figure 4.35: Comparison of unimodal and multimodal resizing: Errors vs. tester.

Table 4.11: Resizing: Errors t-test table



4.5.5 Final

4.5.5.1 Timing

Fig. 4.36 shows the time spent by each tester for the three final tasks. The top andbottom figures are related to the unimodal and multimodal experiments, respec-tively. It seems that the time variations of the unimodal technology with respect tothe testers is higher for all three tasks. On the other hand, for each task of the finalactivity, the minimum time is obtained when the unimodal interaction is used.

65

Figure 4.36: Comparison of multimodal (top) and unimodal (bottom) final activity: Time vs taskand tester.

Fig. 4.37 depicts the average time spent by each tester for completing the uni-modal and multimodal final activities. It is observed that two curves are very closeto each other for eight of ten testers. No specific trend is seen between the results.

Figure 4.37: Comparison of unimodal (top) and multimodal (bottom) final activity: Average timevs tester.


In the following, the statistical parameters related to the average time in the multi-modal experiment are given:



66

The statistical parameters related to the average time in the unimodal experi-ment are as follows:



It is clear that the two experiments resulted in very close mean and medianvalues, however the unimodal values are slightly lower. On the other hand, thestandard deviation of the data obtained form the multimodal experiment is signifi-cantly lower, showing that the average time in the final activity is less dependent onthe tester when the multimodal technology is used. These results are in agreementwith the box plots depicted in Fig. 4.38.

Figure 4.38: Comparison of unimodal and multimodal final activity: Average time box plot.

Fig. 4.39 shows the box plots of the average time for each group. Both groups fol-low the same approach when comparing the uinmodal and multimodal experiments:the unimodal median is lower while the multimodal data are more concentratedaround the median. Comparing the box plots of the two groups, it is observed thatthe data of the first group in both experiments has lower median and is less spread.

67

Figure 4.39: Comparison of unimodal (left) and multimodal (right) final activity: Average timebox plot per group.

The t-test analysis on the average time data shows that there is no significantdifference between the two technologies in performing the final activity since a p-value of 0.68 is obtained, see Table 4.12. the returned p-value is big and indicatesperformance time are equal.

Table 4.12: Final: Times t-test table



4.5.5.3 Errors

Fig. 4.40 compares the number of the errors occurred in both experiments. As shownin Fig. 4.40, the multimodal technology has a better error handling capability asthe number of the errors associated with this technology is smaller than that of theunimodal experiment for most users. The t-test analysis rejects this observation,returning a p-value of 0.084, see Table 4.13. The obtained p-value is not smallenough to claim there is a significant difference between the two technologies interms of the number of errors.

68

Figure 4.40: Comparison of unimodal and multimodal final activity: Total error vs. tester.

Table 4.13: Final: Errors t-test table



4.5.6 Summary of Speech Recognition Performance

Tables 4.14, 4.15 and 4.16 summarize the speech recognition performance of themultimodal experiment. Table 4.14 gives the number of failed speech recognitionper tester at each level of the experiment. As seen, failed recognition occurred onlyonce (tester 4, rotation level) where the recognizer could not recognize the speechcommand. It should be noted here that all evaluation experiments were carriedout in nearly silent and noise-free environment. The number of low confidencerecognition cases (i.e. the false negative cases) are summarized in Table 4.15. Weused a threshold of 0.3 for the recognition confidence level. Therefore, the commandsrecognized with a confidence level less than this threshold are considered as the lowconfidence recognition cases. Finally, Table 4.16 shows the false positive cases,namely, the number of times the recognizer has recognized an unrelated word fromthe grammar.

Table 4.14: Failed speech recognition

Type/Tester 1 2 3 4 5 6 7 8 9 10 Average

Selection 0 0 0 0 0 0 0 0 0 0 0Move 0 0 0 0 0 0 0 0 0 0 0Rotation 0 0 0 1 0 0 0 0 0 0 0.1Resizing 0 0 0 0 0 0 0 0 0 0 0Final 0 0 0 0 0 0 0 0 0 0 0

69

Table 4.15: Low confidence speech recognition - false negative


Selection 2 0 0 13 0 0 0 0 0 1 1.6Move 2 0 0 1 4 0 0 0 1 1 0.9Rotation 6 1 5 0 1 1 0 0 0 1 1.7Resizing 1 0 6 4 0 0 0 1 2 2 1.6Final 7 2 5 1 2 2 3 0 1 5 2.8

Table 4.16: false positive


Selection 0 0 0 0 0 0 0 0 0 0 0Move 0 0 0 0 0 0 0 0 0 0 0Rotation 0 0 0 0 0 0 0 0 0 0 0Resizing 0 0 0 0 0 0 1 0 2 0 0.3Final 0 0 0 0 0 2 0 0 0 0 0.2

4.5.7 Quantitative Result Summary

To help the reader in analyzing different results presented so far, we summarize thedata in Tables 4.17 and 4.18. Table 4.17 gives the statistical data (such as mean,median, standard deviation, and p-value) on the average time per activity for bothunimodal and multimodal experiments. Table 4.18 summarizes the error perfor-mance of the unimodal and multimodal sets and provides the relevant statisticaldata. Moreover, for comparison the summary tables of the previous work [17] areshown in Table 4.19.

Table 4.17: Activities Summaries Table

Mean Median Low High Std dev t-value df Paired t-test

SelectionUnimodal 4.72 4.50 3.38 7.36 1.15 -0.712 9 0.49 ...Multimodal 4.30 3.95 3.01 7.98 1.40

MoveUnimodal 5.53 5.43 3.89 7.90 1.31 3.158 9 0.01 ∗Multimodal 6.65 6.88 5.15 7.5 0.76

RotationUnimodal 15.59 16.38 9.52 20.71 4.28 -0.856 9 0.41 ...Multimodal 14.61 14.36 11.24 18.31 2.20

ResizingUnimodal 13.02 13.14 9.55 16.32 2.02 4.779 9 0.001 ∗∗Multimodal 16.69 16.92 12.33 20.79 2.73

FinalUnimodal 48.84 44.37 26.67 92.12 21.09 0.424 9 0.68 ...Multimodal 50.87 44.37 26.67 92.12 21.09

null hypothesis significant level: ∗ =< 0.05; ∗∗ =< 0.01; ∗ ∗ ∗ =< 0.001; ... => 0.05df (degrees of freedom): N (number of samples) −1

t-value is related to the size of the difference between the means of two samples to be compared

70

Table 4.18: Errors Summaries Table

1 2 3 4 5 6 7 8 9 10 total % of errors sd t-test

SelectionUnimodal 11 5 4 4 2 3 4 5 4 15 57 0.475 4.05 0.061 ...Multimodal 2 0 13 0 0 0 0 1 1 1 18 0.15 3.99

MoveUnimodal 6 4 1 14 5 4 7 0 3 14 58 0.483 4.80 0.016 ∗Multimodal 2 0 1 2 3 2 0 1 2 2 15 0.125 0.98

RotationUnimodal 11 11 23 6 23 40 3 7 20 30 174 2.175 11.86 0.004 ∗∗Multimodal 8 5 8 1 2 3 1 0 1 5 34 0.425 2.98

ResizingUnimodal 8 5 12 5 8 15 6 2 1 19 81 1.012 5.70 0.007 ∗∗Multimodal 1 0 8 4 0 3 1 2 3 6 28 0.35 2.61

FinalUnimodal 2 3 22 19 4 51 7 5 20 31 164 1.82 15.70 0.084 ...Multimodal 10 3 13 4 5 8 4 7 3 18 75 0.83 4.92

Selection:12 tasks of 1 command, Move:12 tasks of 1 command, Rotation:8 tasks of 1 command,Resizing:8 tasks of 1 command, Final: 3 tasks of 3 command

null hypothesis significant level: ∗ =< 0.05; ∗∗ =< 0.01; ∗ ∗ ∗ =< 0.001; ... => 0.05

Table 4.19: Summary tables of the previous work [17]

Iconic: to be closer to natural human gestures (Selection: grab hand, Rotation: one finger, Zoom:two fingers)- the technological gestures is closer to the machine side(Selection: simple move in a

predefined area, Rotation:Using a horizontal slide, Zoom: using a vertical slide)

71

4.6 Qualitative Results

Table 4.20 summarizes the results obtained from the questionnaires distributedamong the testers. The scores assigned to each response in a scaling from 1 to7 have been used in survey response analysis. Table 4.20 shows three columns foreach experiment: first and second columns show the average of the scores for thefirst and second groups respectively, while the third column determines the totalaverage of the scores for each technology.

Table 4.20: Questionnaire results

Unimodal Multimodal

average G1 average G2 average average G1 average G2 averageoverall Commands:Smoothness during operations 4.4 4.8 4.6 4.8 6 5.4Effort Required during operations 4 5.2 4.6 5.4 6.4 5.9Accuracy 4.8 4.8 4.8 5.4 4.8 5.1Rotation ease of use 4.8 5 4.9 4.8 6 5.4Resizing ease of use 4.8 5.4 5.1 4.6 6.2 5.4Selection (move) ease of use 6.8 6.6 6.7 5.8 6.8 6.3Operation speed 6 5.4 5.7 5.4 5.8 5.6General comfort 4.4 5 4.7 5 6.2 5.6Feedback quality 5.6 6 5.8 6 6 6Overall quality 6 5.8 5.9 5.6 6 5.8Fatigue:Fingers fatigue 2.8 1.2 2 2.2 1.2 1.7Wrist fatigue 2 1.8 1.9 1.4 1.4 1.4Arms fatigue 4 4.2 4.1 3.4 3.8 3.6Shoulders fatigue 4.2 4 4.1 3.2 2.8 3Back fatigue 1.2 1.2 1.2 1.2 1.2 1.2Overall fatigue 4.2 2.8 3.5 3 2.2 2.6Cognitive load:Effort 4.4 4.6 4.5 4.4 5.4 4.9Naturalness 6 5.2 5.6 5.8 5.8 5.8

Scores for Overall Commands: 1 (worst) - 7 (best); Scores for Fatigue: 1 (none) - 7 (highest);Scores for Cognitive Load: 1 (worst) - 7 (best)

Regarding the overall performance, the comparison between the total averagesshows that the multimodal technology has obtained a marginally better scores thanthe unimodal set. More specifically, for each question we have:

• Smoothness: The multimodal technology has received a higher score than theunimodal one. Therefore, the testers believe that the multimodal commandsshow smoother behavior during the manipulating operations. For both groupsof the testers, the same trend is observed, but more pronounced for the secondgroup.

• Required effort: Obviously, the multimodal technology is a clear winnerwhere the testers has selected this technology as the one with less effort re-quired during the operations. Both groups of the testers follow the same trend.

• Accuracy: The testers believe that the multimodal technology has better(group 1) or at least similar (group 2) accuracy compared to the unimodal set.

72

• Rotation ease of use: The results show that the multimodal rotation iseasier (group 2) or at least has the same easiness (group 1) compared to theunimodal rotation.

• Resizing ease of use: The total multimodal and unimodal scores are close toeach other, but the second group has found the multimodal resizing commandmore easier.

• Selection ease of use: While the first group clearly choose the unimodalselection (or move) command, the scores of the second group are very close.

• Operation speed: While the first group believes that the unimodal technol-ogy provides faster interaction, the second group has an opposition opinion.This can be justified due to the fact that for both groups the learning phe-nomenon affects the opinion on the operation speed. Please note that the firstgroup performed the unimodal experiment after the multimodal experimentwhile the second group did the opposite.

• General comfort: The multimodal technology is the winner in terms of thegeneral comfort, particularly for the testers in the second group.

• Feedback quality: The unimodal and multimodal scores are very close.

• Overall quality: The unimodal and multimodal scores are very close, how-ever the learning effect on the scores is somehow seen.

On the other hand, the comparison of the fatigue scores shows that the multi-modal technology in general leads to less fatigue on the testers.

• Fingers fatigue: The testers believes that the multimodal commands causeless fatigue (group 1) or at least the same level of fatigue (group 2).

• Wrist fatigue: The testers in both groups have the same opinion that themultimodal technology leads to less wrist fatigue.

• Arms fatigue: Both groups believe that the multimodal commands causeless arm fatigue.

• Shoulders fatigue: Obviously, the multimodal technology is the winner asits shoulders fatigue scores is significantly lower than those of the unimodaltechnology.

• Back fatigue: The scores of both technologies are equal and very low.

• Overall fatigue: It is clear that the multimodal commands have better per-formance in terms of the overall fatigue. The score difference is more pro-nounced for the first group.

Regarding the cognitive load, although the multimodal technology has receivedbetter scores, the difference is not significant.

73

• Effort: The testers in the second group believe that the multimodal technol-ogy has better performance in terms of the cognitive effort, while the first groupsees both technologies at the same level. The difference can be attributed tothe learning effect.

• Naturalness: The scores are close to each other. The first group marginallyprefer the unimodal naturalness, but the second group believes that the mul-timodal commands are more natural.

The Wilcoxon signed-rank test is a non-parametric analogue of dependent t-testfor paired samples for testing hypothesis on median. This test does not require anyassumptions about the shape of the distribution. The null hypothesis is that themedian difference between the pairs is zero. To perform the Wilcoxon signed-ranktest, first the absolute value of the differences between the pairs are ranked fromthe smallest (rank 1) to the largest, noting that ties are given average ranks. Then,the ranks of all differences in each direction are separately summed. The smallerof these two sums is the test statistic W , used for p-value calculation. Table 4.21gives the results obtained from the Wilcoxon signed-rank test on the questionnairedata. It is observed that the null hypothesis is rejected for five cases of smoothness,required effort, wrist fatigue, overall fatigue, and general comfort. Therefore, thesuperiority of the multimodal technology is confirmed by the Wilcoxon signed-ranktest in the above-mentioned five features.

Table 4.21: Wilcoxon signed-rank test on questionnaire results

Wilcoxon’s W n (reduced sample size) p-value

overall Commands:Smoothness during operations 0 6 p < 0.001Effort Required during operations 0 9 p < 0.001Accuracy 2 4 0.20 < pRotation ease of use 2 5 0.10 < p < 0.20Resizing ease of use 3 5 0.20 < pSelection (move) ease of use 1.5 4 0.20 < pOperation speed 9 6 0.20 < pGeneral comfort 0 8 p < 0.001Feedback quality 5 5 0.20 < pOverall quality 2 3 0.20 < pFatigue:Fingers fatigue 0 3 0.20 < pWrist fatigue 0 4 p < 0.001Arms fatigue 5 6 0.20 < pShoulders fatigue 3 7 0.05 < p < 0.10Back fatigue 0 0 0.20 < pOverall fatigue 0 8 p < 0.001Cognitive load:Effort 9 8 0.20 < pNaturalness 7 6 0.20 < p

74

4.7 Discussion

In this section, first we describe the interaction of the users with the test applicationduring the evaluation. Then, for each command we analyze the quantitative andqualitative results, and compare the two technologies. Finally, we conclude thesection with a discussion on the overall performance.

4.7.1 General Points

Since the users were not familiar with the environment of the test application, atraining level at the beginning of the evaluation was put, allowing the users practic-ing the commands. However, the users sometimes had difficulties in rememberingand correctly using the gestures and the words. More specifically, in the case ofunimodal technology some users had difficulties in remembering which hand mustbe used for conveying the command parameters. For instance, it happened that inthe move level some of the users used their left hands for changing the location of theobjects. It was very common for both technologies that the users held their handsout of the safe recognition range, thus degrading the smoothness and accuracy ofthe application. However, since in the unimodal set two-hand gestures are simulta-neously employed, this issue was more pronounced. Several users commented thatthe use of the right-hand gestures for specifying the parameters as in the multimodaltechnology is more comfortable than using the left hand. Moreover, the unimodaltechnology particularly in the final level suffered from unwanted switching betweenresizing and rotation commands as the recognizer was prone to fail in detection ofthe number of fingers. Several users complained about the necessity of keeping theright hand on the object during the rotation and resizing levels of the unimodaltechnology. One of the users commented that he prefers the unimodal set becausewaiting for the result of the speech recognition made him stressed. Some of the usersproposed to display the detection area of resizing and rotation commands aroundthe selected object rather than a pre-defined fix position.

4.7.2 Selection

The mechanism of the selection is somehow different in both technologies. In uni-modal technology, the users had to select the object (using the right-hand gestureas the pointer and the left-hand gesture for performing the selection command) andthen release it (using the left-hand gesture while the pointer of the right hand wasstill on the object). For this reason, it happened that some of the users changed thepointer before unselecting the object and waited for the next selection task. On theother hand, in multimodal technology the users had the obligation to only selectthe object (using the right-hand gesture as the pointer and the speech for perform-ing the selection command). Once the pointer was on the object and the user saidthe correct speech command, the task was accomplished and there was no need tothe object unselection. However, the pointer had to be held on the object all theduration of speech recognition process. On the quantitative side, the multimodaltechnology has 10% lower mean and median than the unimodal technology. Both

75

technologies include one outlier, while the standard deviation of the unimodal tech-nology is lower by 20%. However, the t-test on the average time shows that there isno difference between the mean values of two methods.

Obviously, the multimodal technology resulted in lower errors compared to theunimodal counterpart due to its higher accuracy, but not verified by the t-test onthe user error data. Using the unimodal technology 70% of the users finished theactivity quicker, while 58% of tasks were performed faster.

4.7.3 Move

For the move activity, both the unimodal and multimodal technologies used thesame procedure. The user had to drag the object and drop it in a given area. Wedid not notice anything special in the case of multimodal technology. However, asmentioned before, in the unimodal technology it happened that the user used thewrong hand gesture for moving the object. From a statistical point of view, the uni-modal technology has lower mean and median values compared to the multimodalmethod: the mean and median value has been decreased by 17% and 20%, respec-tively. However, the standard deviation of the unimodal set is almost twice thatof the multimodal set. The t-test also confirms that the unimodal technology hasbetter performance compared to the multimodal technology in terms of the speed.Using the unimodal technology 90% of the users finished the activity quicker, while87% of tasks were accomplished faster.

On the contrary, the multimodal technology has obtained much better perfor-mance in terms of the accuracy according to its very low number of errors. Onthe qualitative side, the testers in the first group preferred the unimodal methodto the multimodal technique in terms of ease of use. However, the scores given bythe testers in the second group caused that the Wilcoxon test does not confirm thesignificant qualitative difference between two technologies.

4.7.4 Rotation and Resizing

For the rotation and resizing activities, the procedure in two technologies differs.In the first step of the multimodal technology, right-hand gesture is used as thepointer. Being the pointer on the object, the speech input enables the rotation orresizing command. In the third step, the right-hand gesture is used for determiningthe rotation/resizing direction and amount. Finally, in the last step the speechinput releases the command. On the other hand, in the unimodal case when thepointer associated with the right-hand gesture is on the object, the left-hand gestureinitiates and then parameterizes the rotation or resizing command. For the rotationactivity, in both technologies the users had the right to reach the goal either byrotating the object in clockwise or counter-clockwise directions. The quantitativeresults show that the multimodal rotation performs better. Its average time and themedian time are about 1 sec and 1.7 sec less than those of the unimodal technology,respectively. More interestingly, the standard deviation of the multimodal rotationis 2.2 sec, while this parameter for the unimodal approach is 4.3 sec. 60% of the userswere able to finish the activity quicker using the multimodal technique, while 75% of

76

the 8 rotation tasks were performed faster with the same technology. However, thet-test analysis shows that there is no evidence on the superiority of the multimodalmethod in terms of the operation speed.

On the other hand, the t-test result demonstrates that the multimodal technologyhas better performance in terms of the error handling. The low number of the errorsis pronounced when the multimodal set is compared to the unimodal counterpart.On the qualitative side, the users preferred the multimodal approach by 0.5 scorein terms of the ease of use.

Despite the rotation activity, the statistical results show the superiority of theunimodal resizing according to the operation speed. The average time is better by3.7 sec, the median wins by 3.8 sec, the standard deviation is less by 27%, all theusers performed quicker with the unimodal, and finally at least in 75% of tasks theunimodal technology is faster. The t-test also confirms the null hypothesis. Asalways, the multimodal resulted in much fewer errors, demonstrated by the t-testanalysis. According to the qualitative results, the users marginally believed thatusing the multimodal technology is easier for resizing activities, perhaps becauseof inaccuracy and instability of the unimodal approach. However, the Wilcoxonsigned-rank test does not confirm a significant difference between two methods.

As stated above, the user evaluation showed that the unimodal resizing unlikethe unimodal rotation has a better timing performance compared to its multimodalcounterpart. Actually, it is reasonable that multimodal commands take more timethan unimodal ones due to the time required for the speech recognition. However,in the case of the rotation tasks we observed that there is no significant differencebetween the two technologies. Two reasons can be noted to justify this observation.First, we observed that keeping the pointer (with right-hand gesture) on the objectcould be challenging during the rotation tasks. It frequently happened that theusers pointed off the object center while most objects had asymmetrical geometries(L or T shape). Therefore, while they were focusing on the left-hand gesture forrotating the object, the pointer went outside of the object, causing time wasting.Second, we noticed that number of tries in the unimodal rotation activity is abouttwice that of the unimodal resizing tasks. Obviously, increasing the number of triesdecreases the command speed.

4.7.5 Final

As stated before, in the final activity the user had to use all three commands ofmove, rotation, and resizing in order to fit the object to the target. To do so, thebest way is first to move the object to the target position, then to rotate the objectby the proper angle, and finally to resize the object. However, the users were freeto use their own strategies and choose the order of the commands. In the unimodaltechnology, it frequently happened that the commands were wrongly recognized dueto the sensitivity of the recognizer or the user errors. Based on the quantitativeanalysis, the unimodal set resulted in lower mean and median values, while itsstandard deviation is higher. The t-test on the average times shows that none oftwo approaches had the better performance compared to the other one. Moreover,the multimodal technology decreased the number of errors, but not verified by the

77

t-test analysis.

4.7.6 Summary

To summarize the discussion, the t-test on the number of errors showed that themultimodal technology indeed had better accuracy and results in fewer user errors.In terms of the operation speed, the unimodal technology obviously had betterperformance in the move and resizing activities and nearly equal performance inthe selection, rotation, and final activities. However, according to the qualitativeresults, the multimodal approach had marginally better scores in general, perhapsowing to its higher accuracy.

78

Chapter 5

Conclusions and Future Work

Contents5.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 80

79

5.1 Conclusions

Multimodal interaction has found its way into modern human-computer interfacesthanks to its potential in increasing the usability (making use of more senses), flex-ibility, robustness and error handling. Among all modalities, gesture and voicefusion has received more attention in recent years. In this context, it is of greatimportance to find appropriate modality combinations which are well adapted tothe system purposes. To this aim, in this project we have carried out a study on theuse of speech and gesture combined data for emulating common manipulating com-mands on objects. Different sets of speech-gesture combined commands have beenproposed depending on the use of modalities, level of abstraction, and fusion type.Among all proposed solutions, we selected the set which can be result in lower usereffort, i.e. the one uses one-hand gesture and word recognition for defining functionand parameter components of the commands, respectively. This multimodal set wasimplemented in C# and compared to the existing unimodal set.

The quantitative analysis resulted in two main conclusions. First, the multimodaltechnology definitely has better error handling capability compared to the unimodalcounterpart, as in the all activity level it caused lower errors. Second, the unimodalmove and resizing commands are faster than their multimodal rivals verified by thet-test.

On the other hand, the qualitative analysis on the questionnaire data demon-strates that the multimodal technology is the winner of quality at least in fivefeatures, namely, in smoothness, required effort, general comfort, wrist fatigue, andoverall fatigue.

5.2 Future Work

The results of this thesis point to several interesting directions for future work:

• For user evaluation of the multimodal and unimodal sets, all the testers weremale. The reason is as follows. The implemented speech recognizer had lowersensitivity to the input speech commands of female testers, thus resultingin lower recognition confidence levels. Therefore, it frequently happened inpreliminary tests that the female testers had to repeat the speech commands.A possible research can be directed towards addressing such a problem.

• In the current project, the input speech is recognized through the definedgrammar. Therefore, the user accent may affect the recognition performance,for instance on the confidence level of the recognized input. One possiblesolution consists in pre-recording speech commands for each user before thetest. Speech recognition is then performed during the test by comparing theinputs to the user samples. In this way, the speech recognition will be adaptedto the user accent.

• Instead of speech and gesture fusion, a research path may focus on fusion ofspeech and touch. A touch screen is provided to the user to determine theparameters, while speech is used for activating and releasing the commands.

80

• A unimodal one-hand gesture system can be designed. For instance, for themove command the index finger tip points to the object while the thumb isopen. Closing the thumb, the object is selected and the command is activated.While the thumb is still closed, the index finger tip can be used for movingthe object. As soon as the thumb is opened, the command is released.

• Using gaze modality for performing move and selection commands can be alsoof interest.

81

Appendix

The questionnaire discussed in Section 4.4 is as follows:

USER’S SATICTIFICATION SURVEY

The goal of this survey is to extend previous evaluations. The questionnaire tests the user

perception of the commands as well as fatigue.

Modality: combination of speech and gesture / two hands gesture

Name of user: _____________________ Age of user: __________

Mother tongue: ______________________ Female ☐ Male ☐

1st walkthrough ☐ or 2nd walkthrough ☐ Group Number: ________

Overall Commands Worst Best

1. Smoothness during operations: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

2. Effort Required during operations: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

3. Accuracy 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

4. Rotation ease of use: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

5. Resizing ease of use: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

6. Selection (move) ease of use: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

7. Operation speed: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

8. General comfort: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

9. Feedback quality: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

10. Overall quality: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

Comments:

82

Fatigue None High

1. Fingers fatigue: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

2. Wrist fatigue: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

3. Arms fatigue: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

4. Shoulders fatigue: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

5. Back fatigue: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

6. Overall fatigue: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

Comments:

Cognitive load Worst Best

1. Effort: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

2. Naturalness: 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐

Comments:

83

Bibliography

[1] Sharon Oviatt and Philip Cohen. Perceptual user interfaces: Multimodal inter-faces that process what comes naturally. Commun. ACM, 43(3):45–53, March2000.

[2] Richard A. Bolt. “put-that-there”: Voice and gesture at thegraphics interface. SIGGRAPH Comput. Graph., 14(3):262–270, July 1980.

[3] David B. Koons, Carlton J. Sparrell, and Kristinn R. Thorisson. Intelligentmultimedia interfaces. chapter Integrating Simultaneous Input from Speech,Gaze, and Hand Gestures, pages 257–276. American Association for ArtificialIntelligence, Menlo Park, CA, USA, 1993.

[4] Bastian Pfleging, Stefan Schneegas, and Albrecht Schmidt. Multimodal inter-action in the car - combining speech and gestures on the steering wheel. InProceedings of the 4th International Conference on Automotive User Interfacesand Interactive Vehicular Applications, pages 155–162, New York, NY, USA,2012. ACM.

[5] Dimitra Anastasiou, Cui Jian, and Desislava Zhekova. Speech and gesture in-teraction in an ambient assisted living lab. In Proceedings of the 1st Workshopon Speech and Multimodal Interaction in Assistive Environments, SMIAE ’12,pages 18–27, Stroudsburg, PA, USA, 2012. Association for Computational Lin-guistics.

[6] Adam Kendon. Gesture. Annual Review of Anthropology, 1997.

[7] Virtual Reality Systems, chapter Gesture Driven Interaction as a Human Factorin Virtual Environments. Academic Press Ltd., 1993.

[8] Kinect. http://en.wikipedia.org/wiki/Kinect.

[9] Kinectar. http://ethnotekh.com/portfolio/kinectar/.

[10] Clean’move. http://apps.after-mouse.com/fiche-produit/clean-move.

html.

[11] L. Gallo, A.P. Placitelli, and M. Ciampi. Controller-free exploration of medicalimage data: Experiencing the kinect. In Computer-Based Medical Systems(CBMS), 2011 24th International Symposium on, pages 1–6, 2011.

84

[12] Kinect physical therapy. http://kinectpt.net/.

[13] Kinect education. http://www.kinecteducation.com/.

[14] Nike+ kinect training. http://www.nike.com/us/en_us/c/training/

nike-plus-kinect-training.

[15] Maged Kamel Boulos, Bryan Blanchard, Cory Walker, Julio Montero, AalapTripathy, and Ricardo Gutierrez-Osuna. Web gis in practice x: a microsoftkinect natural user interface for google earth navigation. International Journalof Health Geographics, 10(1):45, 2011.

[16] N. Vidakis, M. Syntychakis, G. Triantafyllidis, and D. Akoumianakis. Mul-timodal natural user interaction for multiple applications: The gesture-voiceexample. In Telecommunications and Multimedia (TEMU), 2012 InternationalConference on, pages 208–213, 2012.

[17] Simon Brunner. Using microsoft kinect to perform commands on virtual ob-jects. Master’s thesis, University of Fribourg, 2012.

[18] Jean claude Martin. Tycoon: Theoretical framework and software tools formultimodal interfaces. In In John Lee (Ed.), Intelligence and Multimodality inMultimedia Interfaces. AAAI Press, 1998.

[19] Jolle Coutaz, Laurence Nigay, Daniel Salber, Ann Blandford, Jon May, andRichard M. Young. Four easy pieces for assessing the usability of multimodalinteraction: the care properties. In Knut Nordby, Per H. Helmersen, David J.Gilmore, and Svein A. Arnesen, editors, INTERACT, IFIP Conference Pro-ceedings, pages 115–120. Chapman and Hall, 1995.

[20] Bruno Dumas, Denis Lalanne, and Sharon Oviatt. Multimodal interfaces: Asurvey of principles, models and frameworks. In Denis Lalanne and Jrg Kohlas,editors, Human Machine Interaction, volume 5440 of Lecture Notes in Com-puter Science, pages 3–26. Springer Berlin Heidelberg, 2009.

[21] V. Perez-Rosas, R. Mihalcea, and L.-P. Morency. Utterance-level multimodalsentiment analysis. In The 51st Annual Meeting of the Association for Compu-tational Linguistics (ACL 2013).

[22] Julie Rico and Stephen Brewster. Gesture and voice prototyping for earlyevaluations of social acceptability in multimodal interfaces. In InternationalConference on Multimodal Interfaces and the Workshop on Machine Learningfor Multimodal Interaction, ICMI-MLMI ’10, pages 16:1–16:9, New York, NY,USA, 2010. ACM.

[23] Kinect for windows human interface guidelines v1.7.0. http://www.microsoft.com/.

[24] Candescent nui. http://candescentnui.codeplex.com.

85

[25] Speech recognition grammar specification version 1.0. http://www.w3.org/

TR/speech-grammar/.

[26] Microsoft developer network. http://msdn.microsoft.com/.

86

combining voice and gesture for human computer interaction · interaction of users with computers...

Documents