

Multimodal Interaction with Internet of Things and Augmented Reality

Foundations, Systems and Challenges

Joo Chan Kim


Author
Joo Chan Kim ([email protected])

Supervisors
Teemu H. Laine ([email protected])
Christer Åhlund ([email protected])

Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering
Division of Computer Science


ISSN 1402-1536
ISBN 978-91-7790-562-2 (pdf)

Luleå 2020

www.ltu.se


Abstract

The development of technology has enabled diverse modalities that can be used by humans or machines to interact with computer systems. In particular, the Internet of Things (IoT) and Augmented Reality (AR) are explored in this report due to the new modalities offered by these two innovations, which could be used to build multimodal interaction systems. Researchers have utilized multiple modalities in interaction systems to provide better usability. However, the employment of multiple modalities introduces challenges that need to be considered in the development of multimodal interaction systems in order to achieve high usability. To identify a number of remaining challenges in the research area of multimodal interaction systems with IoT and AR, we analyzed a body of literature on multimodal interaction systems from the perspectives of system architecture, input and output modalities, data processing methodology and use cases. The identified challenges concern (i) multidisciplinary knowledge, (ii) reusability, scalability and security of multimodal interaction system architecture, (iii) usability of multimodal interaction interfaces, (iv) adaptivity of multimodal interface design, (v) limitations of current technology, and (vi) the advent of new modalities. We expect that the findings of this report and future research can be used to nurture the multimodal interaction system research area, which is still in its infancy.


Table of Contents

1 Human-computer Interaction
2 Foundations
   2.1 Multimodal Interaction
   2.2 Internet of Things
   2.3 Augmented Reality
3 Multimodal Interaction - Modality
   3.1 Input (Human → Computer)
       3.1.1 Visual signal
       3.1.2 Sound
       3.1.3 Biosignals
       3.1.4 Inertia & Location
       3.1.5 Tangible objects
   3.2 Output (Computer → Human)
       3.2.1 Visual representation
       3.2.2 Sound
       3.2.3 Haptics
       3.2.4 Others
4 Multimodal Interaction - System Modeling
   4.1 Integration (Fusion)
       4.1.1 Data level integration
       4.1.2 Feature level integration
       4.1.3 Decision level integration
   4.2 Presentation (Fission)
5 Multimodal Interaction using Internet of Things & Augmented Reality
   5.1 Internet of Things
       5.1.1 Visual signal
       5.1.2 Sound
       5.1.3 Biosignal
       5.1.4 Inertia & Location
   5.2 Augmented Reality
   5.3 IoT with AR
       5.3.1 AR for user interaction
       5.3.2 AR for interactive data representation
6 Discussion
7 Conclusion


List of Figures

1 Multimodal interaction framework
2 Internal framework of interaction system
3 Visual input signal types
4 Sound categories
5 Biosignals and corresponding body positions
6 The architecture of three integration types
7 Interaction types in ARIoT

List of Tables

1 Interaction type classification
2 Challenges and Research questions


Abbreviations

AR: Augmented Reality
BCI: Brain-Computer Interface
ECG: Electrocardiogram
EDA: Electrodermal Activity
EEG: Electroencephalography
EMG: Electromyography
EOG: Electrooculography
FPS: First-person Shooter
HCI: Human-Computer Interaction
HMD: Head-Mounted Display
IMU: Inertial Measurement Unit
IoT: Internet of Things
ISO: International Organization for Standardization
ITU: International Telecommunication Union
MI: Multimodal Interaction
MR: Mixed Reality
PPG: Pulse Pattern Generator
RFID: Radio-frequency Identification
SCR: Skin Conductance Response


1 Human-computer Interaction

Human-Computer Interaction (HCI) is a research field that mainly focuses on design methods for a human to interact with a computer. The discipline started to grow from the 1980s [1], and the term HCI was popularized by Stuart K. Card [2]. Since then, apart from the ordinary interaction system (i.e., mouse and keyboard), researchers have started to design new interaction systems based on multimodal interaction by combining more than one interaction method, such as speech and hand gestures [3].

Nowadays, the development of technology enables ubiquitous computing in real life, and this development makes it possible to utilize many new interaction technologies, such as head-mounted displays [4], gesture recognition sensors [5], brain-computer interfaces [6], augmented reality [7], and smart objects [8]. Given these developments, an increase in the complexity, as well as the potential, of multimodal interaction is inevitable. However, improving the usability of multimodal interaction for the user remains a challenge in HCI research [9], [10]. A related challenge is that the designer of multimodal interaction systems must have multidisciplinary knowledge in diverse fields to understand the user, the system and the interaction in order to achieve high usability [9], [11]. The term ‘usability’ is used when a study aims to evaluate the interaction system from the user’s perspective. According to the ISO 9241-11:2018 standard [12], usability consists of effectiveness, efficiency, and user satisfaction, and these aspects are typically covered by usability measurement instruments.

In this report, we give an overview of multimodal interaction and focus on two of its aspects: the system and the interaction. In particular, this report provides an overview of state-of-the-art research on two innovations, the Internet of Things (IoT) [8] and Augmented Reality (AR) [13], due to the new modalities offered by these innovations, which could be used to build multimodal interaction systems. In this report, multimodal interaction refers to multiple inputs and/or outputs from the system’s perspective. We review state-of-the-art research on multimodal interaction systems with AR and/or IoT technologies published from 2014 to 2019. Through this comprehensive review, this report gives general knowledge of multimodal interaction to parties interested in the subject and thereby helps to identify challenges for future research and development.


2 Foundations

In this section, we explain the definitions of the terms Multimodal Interaction (MI), Internet of Things (IoT) and Augmented Reality (AR), and elaborate upon them in relation to previous research. The goals of this section are to: (i) show the ambiguity of the terms due to multiple existing definitions, and (ii) formulate the definitions to be used in this study.

2.1 Multimodal Interaction

One of the key terms in multimodal interaction is modality, the method by which data is delivered between a sender and a receiver. The sender (as well as the receiver) of the data can be either a human user or a machine. Consequently, the delivery method can incorporate both analogue (e.g., human body parts) and digital (e.g., a digital image) components. We distinguish state from intent because the response created from the received data depends on whether the data represents the state or the intent of the sender. For example, a system can understand the user’s intent when the user uses a finger as a tool of modality to point at (i.e., gesture) or select an object (i.e., touch), whereas another modality, the heartbeat, can be used to interpret the user’s state. In this report, modality is defined as follows:

Definition 2.1. Modality is a delivery method to transfer data which can be used for interpretation of the sender’s state [9] or intent [11].

Consequently, modalities in a multimodal interaction system are often connected to voluntary actions (e.g., touch, spoken words), but they can also represent signals of an involuntary or autonomous nature (e.g., heartbeat, sweating, external monitoring by a third party).

The modality used by an interaction system can vary depending on the devices or technology available to the system and the purpose of the system. Not only a mouse and a keyboard, but also computer vision, sound, biosignals, and human behaviour can be utilized as modalities. In this sense, when an interaction system employs more than one modality for input or output, it is referred to as a multimodal interaction system. Figure 1 depicts our framework of multimodal interaction, which represents the relationship between the agent and the system through modalities. We use ‘agent’ in this report to refer to both humans and machines. While Figure 1 illustrates the fundamental case where one agent interacts with a system, it can also be extended to cover multi-agent interaction systems by adding more agents. In Section 3, the various modalities that have been used for transmitting input or output data between an agent and the interaction system are described in detail.

Figure 1: Multimodal interaction framework

The term ‘multimodal interaction’ has been used and defined in other studies [6], [9]. However, we formulate our own definition based on our framework:

Definition 2.2. Multimodal interaction is the process between the agent and the interaction system that allows the agent to send data by combining and organizing more than one modality (see "INPUT" in Figure 1), while output data can also be provided by the interaction system in multiple modalities (see "OUTPUT" in Figure 1).

The primary purpose of multimodal interaction is to provide better usability and user experience by using multiple modalities rather than a single modality. As an example, Kefi et al. [14] compared user satisfaction with two different input modality sets for controlling a three-dimensional (3D) user interface: (i) mouse only, and (ii) mouse with voice. The results of this study showed that the use of two input modalities (mouse with voice) could provide better user satisfaction than one input modality (mouse only).

To provide a detailed analysis of the process of multimodal interaction and how input and output data are handled, we divide it into two steps: integration and presentation. In the integration step, the data from different modalities are merged and interpreted for the presentation step. Based on the interpretation, a presentation is made to the agent through one or more output modalities. Figure 2 visualizes the internal framework of the interaction system. Each step has a unique process for data handling, and the details of both steps are explained in Section 4.
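To make this two-step view concrete, the following Python sketch outlines a skeletal interaction system with an integration (fusion) step and a presentation (fission) step. All class names, fields and selection rules here are illustrative assumptions, not taken from the report or any cited framework.

```python
# A skeletal interaction system with the two steps of Figure 2:
# integration (fusion) of input modalities, then presentation (fission)
# through output modalities. Names and rules are illustrative only.

from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ModalityInput:
    modality: str        # e.g. "speech", "gesture", "heart_rate"
    data: Any            # raw or preprocessed payload
    timestamp: float     # used to synchronize modalities


class InteractionSystem:
    def integrate(self, inputs: List[ModalityInput]) -> Dict[str, Any]:
        """Fusion step: merge data from several input modalities into one
        interpretation of the agent's state or intent."""
        interpretation: Dict[str, Any] = {"modalities": [i.modality for i in inputs]}
        # ...apply a fusion strategy (data, feature or decision level)...
        return interpretation

    def present(self, interpretation: Dict[str, Any]) -> List[str]:
        """Fission step: choose one or more output modalities to deliver
        the interpretation back to the agent."""
        outputs = ["visual"]                     # default output channel
        if "speech" in interpretation["modalities"]:
            outputs.append("audio")              # respond in kind to spoken input
        return outputs


if __name__ == "__main__":
    system = InteractionSystem()
    inputs = [ModalityInput("speech", "turn on the lights", 0.0),
              ModalityInput("gesture", "point_at_lamp", 0.1)]
    print(system.present(system.integrate(inputs)))   # ['visual', 'audio']
```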


Figure 2: Internal framework of interaction system

2.2 Internet of Things

The Internet of Things (IoT) is a technology powered by ubiquitous sensing, which was realized with the advancement of networking and computing power [15]. This advancement makes it possible to create small connected sensor devices with enough power to gather real-time data from users or the environment and transfer them over a network. With these sensors, IoT has become a notable technology in multimodal interaction due to its ability to provide valuable data for understanding the context of users or the environment.

The International Telecommunication Union (ITU) defines IoT as follows [16]:

Definition 2.3. Internet of Things (IoT) is a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies.

However, our interest in this report is the type of data that IoT devices can collect. Since the ITU definition lacks a description of the ‘interoperable information’ involved, it is not directly applicable to our report; thus, we define the term IoT in our own words.

Definition 2.4. Internet of Things (IoT) is the collection of interconnected objects equipped with sensors which are able to receive data from a real-world entity, such as a human body, environment, and other physical entities, and transfer them to a destination through a network.

As an example, Jo et al. [10] showed that, by utilizing IoT, a system could provide the current state of merchandise in a shop to the user through a mobile device to improve the shopping experience. More IoT use cases in multimodal interaction are presented in Section 5.
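As a minimal sketch of an IoT device in the sense of Definition 2.4, the following Python snippet reads a value from a placeholder sensor and transfers it to a destination over the network. The endpoint URL, device identifier and sensor function are hypothetical.

```python
# A minimal IoT sensor node: read a value from a real-world entity and
# transfer it over a network (Definition 2.4). The collector endpoint,
# device id and read_temperature() are hypothetical placeholders.

import json
import time
import urllib.request


def read_temperature() -> float:
    """Placeholder for a real sensor driver (e.g. an I2C temperature chip)."""
    return 21.5


def publish(reading: float, device_id: str = "sensor-01") -> None:
    payload = json.dumps({
        "device": device_id,
        "temperature_c": reading,
        "timestamp": time.time(),
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://example.org/iot/readings",       # hypothetical collector endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)       # fire-and-forget upload


if __name__ == "__main__":
    publish(read_temperature())
```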

2.3 Augmented Reality

Augmented Reality (AR) is another notable technology that helps the user to interact with the system or the environment. AR is a technology that visualizes computer-generated graphical objects on a view of the real world, where the view is typically represented by a real-time camera feed on a device or a head-mounted display (HMD). According to our definition of modality, AR is a user interface rather than a modality. In this report, we use the well-established definition of AR by Paul Milgram and Fumio Kishino [17]:

Definition 2.5. As an operational definition of Augmented Reality, we take the term to refer to any case in which an otherwise real environment is ‘augmented’ by means of virtual (computer graphic) objects.

The input modality for interacting with AR can be a hand gesture, voice, face, or body gesture [13], and these are often combined in multimodal interaction AR systems. For example, Lee et al. [7] built a multimodal interaction system with speech recognition and hand gesture recognition techniques to control an AR object. The results showed that the multimodal interaction method produced better usability than a single-modality interaction method.

AR is implementable on devices which have a camera to capture a view of the real world and enough computing power to present virtual objects on the screen. Additionally, there is a similar technology called Mixed Reality. Mixed Reality (MR) is an advanced form of AR which diminishes the gap between the real world and the virtual world by creating virtual objects that are aware of the properties of real-world objects and can thereby act as if they were part of the real world. Microsoft’s HoloLens is a well-known HMD that supports MR. Section 5 presents more use cases of AR in multimodal interaction which are also applicable to MR technology.


3 Multimodal Interaction - Modality

In this section, we present the feasible input and output modalities that have been used in other studies. Each modality is categorized based on its type, and each category includes information on (i) instruments that can record/capture data for the modality, (ii) the type of modality, (iii) the technology for processing the captured data, and (iv) identified limitations and open challenges.

3.1 Input (Human → Computer)

3.1.1 Visual signal

Vision is one of the key functions for humans to obtain information, and interaction systems can also accept visual information. In order to get input from an agent via a visual signal, the interaction system commonly uses a camera to receive the input in the form of a single image, a sequence of images, or a video clip. There are two types of visual signals that can be used as a modality by an agent. One type is a feasible input modality when the agent is capable of expressing humanlike gestures or has a humanlike form, whereas the other type is available when the agent has a non-humanlike form. Figure 3 illustrates the categories of visual input signal types.

Figure 3: Visual input signal types

When the agent has a humanlike form, the agent’s body can be used as a visual signal by posing or moving it. In this case, the interaction system recognizes body gestures and interprets them according to their purpose. For example, Kinect from Microsoft is a motion-sensing device for reading human body movement as input data. ‘Just Dance’ is a well-known game franchise developed by Ubisoft that utilizes Kinect to obtain the player’s body gestures as an input modality [18]. Players have to follow given dancing gestures by moving their bodies to beat the level. Kinect has been widely used not only in games, but also in research, such as emotion recognition [19], physical training [20]–[22], and smart environments [23], in order to receive body gestures as an input modality.

While the interaction system can recognize the entire body of an agent, a specific part of the body can also be used as a tool for input modality, depending on the purpose. We divide the body into two parts at the waist. Above the waist is the ‘upper body’ (e.g., arms, hands [4], [5], [24]), and beneath the waist is the ‘lower body’ (e.g., legs, feet). In the upper body, three parts have drawn attention from researchers as sources of a visual input signal: the ‘head’, the ‘arms’, and the ‘hands’. The head is the part that shows facial expressions, formed by the eyes, eyebrows, lips, and nose. Facial expressions have been used for detecting emotion [25], whereas facial features have been used for identifying a person [26], [27]. Facial recognition systems use several facial features to compare against models in a database. For example, iris recognition systems, such as the one proposed by Elrafaei et al. [28], work with a built-in camera in a smartphone that reads the iris to identify the user. Additionally, these facial features have been used not only for identifying the user, but also for accessing the controls of the system or for other purposes. For example, the eyes have features, such as gaze movement and eye-blinks, which can be used to manipulate a system [29] or to imply the intentions of a user [30].

Arms are among the upper body parts used to display most humanlike behaviours. For example, Luo et al. [31] built a robot that uses Kinect to imitate human arm movements with its mechanical arms. In such human-robot interaction systems, a Kinect camera serves as the interface between human arms and mechanical arms [32].

Hands are the last upper body part used as a tool to achieve natural and intuitive communication in HCI [33]. With fingers and hands, an agent is capable of providing various forms of visual signals to a camera, from static gestures to dynamic gestures [34]. For example, sign language is an example of static gestures that can be captured through a camera [35], while hand tracking for manipulating a system is a use case of dynamic gestures [36], [37]. For more information regarding vision-based hand gesture recognition, see the comprehensive survey by Rautaray and Agrawal [5].

Whereas the upper body offers three distinct sources of visual input signals, the lower body has two: feet and legs. According to our study, the feet, as the lowermost part of the human body, are less commonly detected by vision than other body parts. Although many studies proposed systems that use motion-detecting sensors attached to the feet or ankles [38], [39] (described in detail in Section 3.1.4), some studies tried to detect feet from visual information. For example, Hashem and Ghali [40] developed a system that extracts foot features to identify the user, and Lv et al. [41] used the foot’s shape, tracked from an orthogonal view through a built-in camera of a smartphone, to control a soccer ball in a game. Legs are another lower body part used in multimodal interaction systems, and they have mainly been used to understand human activity, such as gait for identification of the user [42], leg movement for controlling a character in virtual reality [43], and operating a humanoid robot by imitating the user’s dancing movements [44].

When the agent has a form other than humanlike, an image captured through an optical device can be used to trigger interactions between the agent and the interaction system. According to our findings, there are two types of visual input signals to be used when the agent has a non-humanlike form. One type is the marker, a plain 2D image that is pre-stored in a system. When the system recognizes the printed image scanned by an optical device, an interaction happens. There are various use cases for this type of visual input signal, such as barcodes [45], [46], QR codes [47], [48], AR markers [49], [50], and plain printed images which need to be scanned in order to activate an interaction [49], [51]. The other type of visual signal that belongs to the non-humanlike category is the markerless signal, whereby the interaction system can identify the target by using object detection or image recognition techniques. Unlike the marker type, the markerless type does not need a plain 2D image in order to trigger an interaction. Instead, the interaction system uses physical objects (e.g., buildings [52], the environment [53] or a car’s registration plate [54]) rather than plain printed images. Several factors can influence the performance of the interaction between an interaction system and an agent when the system uses a visual signal as the input modality. These factors include, but are not limited to, objects overlaying markers, different lighting conditions (especially in the dark), and target placement outside the line of sight [24], [50].
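As a concrete illustration of the marker type, the following Python sketch watches a camera feed for a QR code and treats a successful decode as an interaction trigger. It assumes OpenCV (cv2) and an available webcam; the mapping from marker content to an action is left hypothetical.

```python
# Marker-type visual input: a QR code captured by a camera triggers an
# interaction. Assumes OpenCV and a webcam; the triggered action is a stub.

import cv2


def watch_for_markers(camera_index: int = 0) -> None:
    detector = cv2.QRCodeDetector()
    capture = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            data, points, _ = detector.detectAndDecode(frame)
            if data:                          # non-empty string -> marker recognized
                print(f"Marker content: {data}")
                # ...trigger the interaction mapped to this marker...
                break
    finally:
        capture.release()


if __name__ == "__main__":
    watch_for_markers()
```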

3.1.2 Sound

Sound is another input modality used for interaction between a system and an agent [55]. It is also one of the most natural and intuitive communication methods for sending and receiving information. Sound is a wave that passes through a transmission medium such as air, liquid, or a solid object, and devices that can capture this wave and turn it into an electrical signal are used to record sound. Sound is divided into two categories depending on whether it is speech or not. Figure 4 describes these two categories of sound.

Figure 4: Sound categories (sneezing in ‘vocal sound’ by Luis Prado from the Noun Project)

According to our research, many studies used speech to communicate with a system by giving verbal commands. In this case, the system needs to understand what the agent says; thus, the system employs speech recognition in order to form a response based on the agent’s spoken request. Modern mobile devices such as the HoloLens and smartphones have speech recognition systems installed, and they are used in various forms. For example, Microsoft’s Cognitive Speech Services enables the use of voice commands in Microsoft’s products [4], [56], and Apple’s Speech API is used in their products to add the capability of providing a natural-language user interface [57]. Moreover, when speech recognition is combined with machine learning technologies, it enables humanlike virtual assistants such as Cortana by Microsoft [58], Siri by Apple [59] and Google Assistant by Google [60]. Additionally, just as facial expressions are used to recognize emotions, voices are used as well [61]. For example, Amandine et al. [62] validated over 2,000 emotional voice stimuli in an evaluation with 1,739 participants. The participants scored the presented emotional voice stimuli based on voice range, valence, and arousal. Their validation results provide 20 emotions and a neutral state in three different languages (English, Swedish, and Hebrew), which can be used in other voice emotion recognition studies.
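A small sketch of speech as an input modality follows: a spoken command is captured and mapped to a system action. It assumes the third-party SpeechRecognition package (with a microphone backend such as PyAudio); the command table and the use of the Google web recognizer are illustrative choices, not the approach of any cited study.

```python
# Capture a spoken command and map it to a system action. The command
# table and recognizer choice are illustrative assumptions.

from typing import Optional

import speech_recognition as sr

COMMANDS = {"lights on": "turn_on_lights", "lights off": "turn_off_lights"}


def listen_for_command() -> Optional[str]:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)   # calibrate to room noise
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio).lower()
    except (sr.UnknownValueError, sr.RequestError):
        return None                  # speech not understood or service unavailable
    return COMMANDS.get(text)        # None if no matching command


if __name__ == "__main__":
    print(listen_for_command())
```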


The other category of sound is nonspeech, i.e., sound that is not speech. Nonspeech sounds are any type of sound, including vocal sounds, that does not represent anything in a language (see sneezing in the ‘vocal sound’ area in Figure 4). Researchers have used nonspeech sounds to understand events [63] or environments [64], or even to unveil hidden problems. As an example of the latter, Joyanta et al. [65] developed a system that listens to cardiac sounds to identify abnormal conditions that may indicate disease.

3.1.3 Biosignals

In this report, we categorize a biosignal as an input modality when it is collected through body-attached sensors. There are many different biosignals of the human body that could be used for interaction; however, we present only the biosignals that have been used in interaction systems. Our analysis identified brainwaves, heart rate, muscle activity (Electromyography: EMG), skin conductance (Electrodermal Activity: EDA), skin temperature, and blood pressure. Each biosignal and the corresponding body position for capturing it is illustrated in Figure 5.

Figure 5: Biosignals and corresponding body positions

Firstly, brainwaves are electrical signals of brain activity that can be captured by Electroencephalography (EEG). In brain-computer interface (BCI) studies, researchers have used brainwaves to infer the user’s state, intentions, or even emotions in order to understand and predict the user’s needs [6]. There are six frequency bands (i.e., delta, theta, alpha, beta, gamma, mu) in EEG that represent different types of brainwaves. An EEG sensor can collect these brainwaves from different positions on the head. These frequency bands should be collected and analyzed carefully based on the purpose of the research, because each frequency band carries different information about the brain activity of users [66], [67]. Because the EEG sensor needs stable contact with specific parts of the head to obtain good data quality, the movements of a user wearing an EEG sensor are usually restricted during data collection [68]; however, the deep level of understanding of the user that EEG affords is an attractive property for researchers. Zao et al. [69] developed an HMD that collects the six frequency bands and two additional channels for the user’s eye activity by combining EEG and Electrooculography (EOG) sensors. To reduce the interference of sensor cables with the user’s behaviour, the collected data were sent to the system over a wireless network. The authors verified that an HMD with EEG/EOG sensors is capable of monitoring users’ brain activity when they are exposed to visual stimulation. More use cases of EEG are documented in another study by Gang et al. [70], which focused on people with functional disabilities.
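As a small illustration of how the frequency bands above can be separated from a raw EEG channel, the following Python sketch applies zero-phase band-pass filters with SciPy. The band edges are commonly cited approximate ranges and the sampling rate is an assumption, not values taken from the cited studies; the mu band, which overlaps alpha, is omitted for brevity.

```python
# Split a raw EEG channel into frequency bands with zero-phase band-pass
# filters. Band edges and sampling rate are illustrative assumptions.

import numpy as np
from scipy.signal import butter, filtfilt

BANDS_HZ = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 45.0),
}


def split_into_bands(eeg: np.ndarray, fs: float) -> dict:
    """Return one band-pass filtered copy of the signal per frequency band."""
    out = {}
    for name, (low, high) in BANDS_HZ.items():
        b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
        out[name] = filtfilt(b, a, eeg)          # zero-phase filtering
    return out


if __name__ == "__main__":
    fs = 256.0                                    # assumed sampling rate (Hz)
    t = np.arange(0, 5, 1 / fs)
    fake_eeg = np.sin(2 * np.pi * 10 * t)         # synthetic 10 Hz (alpha) signal
    bands = split_into_bands(fake_eeg, fs)
    print({name: float(np.std(sig)) for name, sig in bands.items()})
```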

Secondly, the heart is an internal organ that directly relates to human life. Therefore, it is no surprise that heart activity has become one of the biosignals that attract researchers’ academic curiosity about humans’ bodily states. A commonly used method to observe heart activity is dedicated sensor hardware, such as an Electrocardiogram (ECG), which is attached to the human body and can capture the electrical activity and heart rate. There are, however, several studies that detected heart activity without attached sensors in order to overcome their intrusiveness. Examples include a smartphone camera that reads blood pressure from a fingertip by analyzing the captured image of the finger [71] and a system that uses a web camera to recognize blood circulation from facial skin color [72]. Due to the strong relationship between health condition and the state of the heart, heart rate has been used for monitoring people’s health state [73], for detecting critical health issues [74], and even for measuring mental stress level [75]. Moreover, studies have demonstrated the effectiveness of heart rate as a method for engaging users in interaction with games [76], [77].

Thirdly, since most parts of the human body are composed of muscles, skin, and bones, muscle activity is another biosignal source. The most common method to acquire data on muscular activity is the use of EMG [78]. Unlike some other biosignals collected through attached sensors, such as EEG, EMG data is less prone to contamination by noise. This is because of how the attached sensors detect the biosignal. For example, an EEG sensor reads brainwaves that are propagated from the brain through the pericranial muscles to the skin where the sensor can read them. EEG sensors amplify the signal due to the low amplitude of the original signal in order to successfully capture the data. During the propagation and amplification of brainwave signals, other muscular activity may add unwanted noise to the EEG signals. In contrast, the EMG signal has a relatively high amplitude compared to the EEG signal, which means that the EMG signal is less likely to be contaminated by noise [79], [80]. With this characteristic and its capability of capturing human physical activity, EMG has been used for various purposes, such as controlling a robotic finger by muscle activity on the lower arm [81], controlling a game character by muscle activity on specific body parts (e.g., hand gestures [82], mouth movements [83]), and controlling a serious game used for the rehabilitation of people who have disabilities [84], [85].

Fourthly, skin, like muscle, is another part of the body that covers the entire human body. The skin conductance response (SCR) is electrical activity on the skin caused by internal or external stimuli, and it can be interpreted as a clue to the cognitive state in order to identify stress levels [86] and emotions [87]. Using this characteristic, SCR has been used to interact with systems. For example, Shigeru et al. [88] developed a 2D game where the number of obstacles, which the player’s game character should avoid, is controlled by the player’s SCR. Yi et al. [89] also made a game, but in virtual reality (VR) and with a different composition of biosignals, such as SCR and heart rate variability, to detect emotions in order to adjust the game environment depending on the player’s emotional state.

Fifthly, not only the electrical activity but also the temperature of the skin can be a biosignal for the interaction system. The variability of skin temperature during physical activity is a useful biosignal for understanding which body part was used [90]. In that sense, skin temperature has been proposed as a cue to achieve better efficiency of physical exercise for rehabilitation [91]. Additionally, skin temperature has also been used for identifying people’s health state [92] and mental state (e.g., stress level [93], emotions [94], [95]) due to its strong relation to bodily condition. In an interactive system, this identified health or mental state can be used as an input. For example, Chang et al. [96] developed a game that monitors the player’s facial skin temperature in order to detect the player’s stress status. The game automatically adjusts the difficulty level based on the player’s stress status to improve the gaming experience. To capture skin temperature, an attachable sensor [97], [98] or an infrared camera [99] is used.

Lastly, blood pressure, another basis of life activity along with heart rate, is a biosignal that can be measured by a sphygmomanometer, an attachable device/sensor on the body. It reads the pressure of the blood, which is caused by the heart pumping blood through the vessels of the body. Since blood pressure is driven by the heart, it is another biosignal source for monitoring people’s health or mental state and identifying critical issues, such as cardiovascular diseases [100], hypertension [101] and stress [102]. Furthermore, systolic and diastolic blood pressures have been used for emotion detection, whereby participants’ reactions were categorized into positive and negative [103], [104]. Additionally, games are a common subject in blood pressure measurement studies that look for correlations between players and games, such as the effects of games on health [105] and stress [106], cardiovascular reactivity depending on gender [107] and the type of game (e.g., M-rated versus E-rated) [108], and the efficiency of games for controlling blood pressure [109]. A study by Gbenga and Thomas [110] provides a well-documented account of other blood pressure measurement methods, whereas a study by William et al. [111] surveys the literature on home blood pressure measurement in order to verify its applicability from the perspective of clinical practice.

3.1.4 Inertia & Location

Regarding other types of sensors for input modality, an Inertial Measurement Unit (IMU) is an electronic device composed of accelerometers, gyroscopes, and magnetometers that measure force, angular velocity, and even orientation [112], [113]. By using an IMU on the human body, various human motions become recognizable by systems that produce useful information for people. Examples include a fall detector for preventing injuries [114], a swimming motion analyzer for achieving better performance for swimmers [115], and a gait recognizer for identifying people in security-related applications [113]. Since an IMU can be installed on any target object, not only human motion but also the motion of other physical objects can be measured. As an example, Wu et al. [116] attached an IMU to a broomstick in order to control the direction of a virtual character in their game. In general, smartphones are commonly used devices with integrated IMUs that provide inertial information to applications, and IMUs are used as part of controllers for interaction in gaming gear, such as the Wii, HTC Vive, and Oculus Rift [117], [118]. Moreover, an IMU can be used to provide positioning as an alternative to satellite-based positioning in places where a satellite signal is unavailable [119], [120].
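To make the fall detection use case concrete, the following Python sketch applies a simple two-stage threshold (a free-fall dip followed by an impact spike) to accelerometer magnitudes. The thresholds and the synthetic sample stream are illustrative assumptions and are not taken from the cited studies.

```python
# Simple fall detection from IMU accelerometer magnitudes: a free-fall dip
# followed by an impact spike. Thresholds and data are illustrative only.

import math
from typing import Iterable, Tuple

FREE_FALL_G = 0.4      # magnitude well below 1 g suggests free fall
IMPACT_G = 2.5         # a sharp spike afterwards suggests an impact


def detect_fall(samples: Iterable[Tuple[float, float, float]]) -> bool:
    """samples: (ax, ay, az) accelerometer readings in units of g."""
    in_free_fall = False
    for ax, ay, az in samples:
        magnitude = math.sqrt(ax * ax + ay * ay + az * az)
        if magnitude < FREE_FALL_G:
            in_free_fall = True                  # device (and wearer) dropping
        elif in_free_fall and magnitude > IMPACT_G:
            return True                          # impact after free fall
    return False


if __name__ == "__main__":
    stream = [(0.0, 0.0, 1.0)] * 10 + [(0.0, 0.0, 0.2)] * 5 + [(0.0, 2.8, 1.0)]
    print(detect_fall(stream))                   # True for this synthetic fall
```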

Geolocation is another type of input modality, referring to information about the geographical position of a user or a system entity. The Global Positioning System (GPS), developed by the United States, is a satellite-based positioning system that provides geolocation data from a network of satellites. There are also a number of other satellite-based navigation systems run by other countries and regions, such as the GLObalnaya NAvigatsionnaya Sputnikovaya Sistema (GLONASS) by Russia, the Indian Regional Navigational Satellite System (IRNSS) by India, Galileo by the European Union, the Quasi-Zenith Satellite System (QZSS) by Japan, and the BeiDou Navigation Satellite System (BDS) by China. Geolocation data is used in diverse ways, from pathfinding services (i.e., navigation) to physical activity analysis. For example, Ehrmann et al. [121] measured various data from soccer players regarding the distance they covered during matches. Using the measured data, they analyzed the relationship between soft tissue injuries and the intensity of the players’ physical activity and identified some variables that can be used to predict players’ injuries. Geolocation data is used in systems for interaction as well, such as in Pokémon GO for accessing game content at specific locations [122], and in monitoring systems that track a specific target in order to alert about potential threats to safety [123].
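As a small illustration of how raw geolocation samples can be turned into a distance-covered measure of the kind used in the soccer study above, the following Python sketch accumulates great-circle (haversine) distances along a GPS trace. The coordinates in the example are synthetic.

```python
# Accumulate the distance covered along a GPS trace using the haversine
# great-circle formula. The trace below is synthetic example data.

import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Distance in metres between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def distance_covered_m(trace: List[Tuple[float, float]]) -> float:
    return sum(haversine_m(p, q) for p, q in zip(trace, trace[1:]))


if __name__ == "__main__":
    trace = [(65.6170, 22.1370), (65.6172, 22.1374), (65.6175, 22.1380)]
    print(f"{distance_covered_m(trace):.1f} m")
```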

3.1.5 Tangible objects

From the perspective of input modality, a tangible object refers to any object that can be touched and used by agents to provide input data in order to manipulate digital information [124]. The mouse and keyboard are commonly known tangible objects for providing input data. When any object is combined with sensors or other devices so that it is capable of creating input signals, the object becomes a potential tool for input modality. For example, Jacques et al. [125] developed a glove combined with an IMU to read the user’s hand gestures for controlling a virtual object. Andrea et al. [126] created a small wheel-shaped controller that can be used to enter a password by rotating the controller. Additionally, while some tangible objects create input data by using sensors attached to the objects, others rely on an external device to provide input data. For example, Varsha et al. [127] created a block programming platform using tangible blocks, each of which carries an image that refers to a specific programming construct. When users align the blocks to form a certain statement, the system reads the blocks’ images with a camera and publishes the result in the form of narration.

3.2 Output (Computer → Human)

3.2.1 Visual representation

Since vision is one of the key methods of receiving information for both human and machine agents, interaction systems can utilise visual representation as an output modality in order to react to inputs captured from agents. As used by Filsecker et al. [128] and Hwang et al. [129] in their applications for presenting information, still images, animations, text, and 2D/3D objects are commonly used forms of visual representation on a screen. Video is another form that can be displayed on a screen as a visual representation [130]. AR and MR, which combine video and 2D/3D objects, are included in the visual representation category [131], [132]. Additionally, visual representation is not limited to information presented on a screen. For example, Horeman et al. [133] developed a training device that indicates the result of a suture in real time by using a red or green light in order to improve suturing skill. The strength of visual representation is its intuitiveness in delivering information, whereas its weakness is the increased time cost of creating high-quality content.

3.2.2 Sound

Similar to the input modality described in Section 3.1.2, there are two fundamental choices that a system can utilise when sound is used as an output modality: speech and non-speech.

An interaction system can use speech output created by two different techniques: either a human voice or a speech synthesizer [134]. The human voice technique can convey the emotional state when the speaker modifies the way of speaking according to the experienced emotion [135]. This is one of the reasons why the human voice is commonly used for virtual characters [136], [137], thus making them seem alive. However, the cost of developing human voice output is high when the system requires a large number of recorded voice lines. Therefore, the speech synthesizer arose as an alternative.

A speech synthesizer is an artificial voice generator that can produce speech at a relatively low cost in a short time [134]. However, it still cannot convey emotional cues in speech the way a human voice does; thus, a mixture of these two techniques is sometimes used in order to compensate for this shortcoming [138]. As examples of the use of speech synthesizers, Elouali et al. [55] and Bhargava et al. [139] developed applications that can read out text on a device’s screen by using speech synthesis.
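As a small illustration of this screen-reading style of speech output, the following Python sketch uses the third-party pyttsx3 package, which wraps the platform’s text-to-speech engine; the package choice, speaking rate and example sentence are assumptions for illustration.

```python
# Speech as an output modality via a speech synthesizer. Assumes the
# third-party pyttsx3 package; rate and text are illustrative.

import pyttsx3


def speak(text: str, words_per_minute: int = 170) -> None:
    engine = pyttsx3.init()
    engine.setProperty("rate", words_per_minute)   # speaking speed
    engine.say(text)
    engine.runAndWait()                            # block until playback ends


if __name__ == "__main__":
    speak("Two new notifications on your screen.")
```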

In contrast to speech, any type of sound that does not represent anything in a human language is regarded as non-speech. Similar to speech, non-speech can be composed using two different techniques: either with a digital sound synthesizer or by recording real-world sounds with a microphone. Many devices, applications, and games widely use sound effects created by digital sound synthesizers. For example, Nguyen et al. [140] developed a drowsiness detection system that plays an alarm sound when the driver of a car looks sleepy. Koons and Haungs [141] implemented every sound effect in their game by using a digital sound synthesizer. Additionally, there are various types of synthesizers for musical instruments that amplify and modify the style of the output sound.

In some cases, recorded real-world sounds are used to provide more immersive experiences. EA Digital Illusions CE AB (DICE) used real gunshot sounds in the first-person shooter (FPS) game Battlefield 3 in order to provide a realistic game environment to players [142]–[144]. Foley sound effects are another example of recorded real-world sounds, usually used in films. However, a Foley sound effect is a re-creation of a sound that models a realistic impression of an event by using various objects and techniques, rather than a reproduction made with the same objects that caused the original sound [145]. Additionally, some vocal sounds are examples of non-speech sounds, such as the groans and screams of virtual characters in a game [146].


3.2.3 Haptics

A sense of touch is one of the ways for the human body to perceive stimulation. From the perspective of output modality, the term haptics refers to any method that can provide an experience of touch by exerting force [147]. Foottit et al. [148] identified two categories of feedback that mainly use haptics: (i) providing an experience of touching real-world objects, and (ii) producing a stimulus that conveys information not related to real-world forces. For example, representations of material texture [149], [150], weight [151], and the shape of objects [152] belong to the first category, whereas a vibrotactile sensation for notification of specific events [153]–[155] is an example of the second category. More use cases of tactile interaction involving fingers and hands are documented in another study by Pacchierotti et al. [156].

3.2.4 Others

We have presented examples related to three of the senses that humans use to detect diverse stimuli: visual representation through the eyes, sound heard by the ears, and haptics for tactile sensations. Several studies have targeted the remaining two senses that a human uses to obtain information: gustation and olfaction. The tongue is the organ through which humans taste (gustation), and the nose is the organ humans use to smell (olfaction). Therefore, when an interaction system is able to provide gustatory and olfactory outputs to agents, there are more opportunities for developing novel interaction methods [157]. For example, Risso et al. [158] developed a portable device that can deliver odours by combining up to eight fragrances, whereas Ranasinghe and Do [159] created a device, the ‘Digital Lollipop’, that can produce four different tastes (i.e., sweet, sour, bitter, salty) by evoking electrical stimulation on an agent’s tongue.

Additionally, thermoception is another sense that humans use to recognise the temperature of an object or an environment. From the perspective of output modality, the sense of temperature for an agent can be achieved by utilising either a heating system or a thermoelectric cooler. There are two ways to perceive temperature: one is from a device that is attached to the human body [160], and the other is from an ambient temperature that is controlled by a heating system or a device [161].

As a last type of output modality, there are several cases where the system controls physical objects or the environment as a reaction to the agent’s input. For example, Jiang et al. [162] implemented a robotic arm control system operated by hand gestures and voice commands while the agent is sitting in a wheelchair. Khan et al. [163] built a smart home management system that adjusts the light level of a room depending on the intensity of ambient light in order to reduce energy consumption.


4 Multimodal Interaction - System Modeling

4.1 Integration (Fusion)

Since the interaction system receives input data through various modalities, it requires a step to process the input data. This process is called ‘integration’ or ‘fusion’ [6], [9], [55], [164]. During this process, the input data is synchronized and/or combined with other data based on models or algorithms that the system uses in order to produce an output. In this report, we use three different integration types that were mentioned in several studies [6], [165]–[167]. Figure 6 depicts the architecture of the three integration types as proposed by Sharma, Pavlovic and Huang [167], who based their work on the original design of Hall and Llinas [165]. The three integration types are explored further in the following sections.

(a) Data level integration

(b) Feature level integration

(c) Decision level integration

Figure 6: The architecture of the three integration types redesigned by Sharma, Pavlovic and Huang [167]; the original architecture was designed by Hall and Llinas [165]

4.1.1 Data level integration

Data level integration is a type of fusion that happens when raw data from multiple modalities are merged. The raw data must be based on observations of the same object delivered through the same type of modality, such as audio streams (i.e., sound) from a camera and a microphone [6] or images (i.e., visual signal) from multiple cameras [167]. As depicted in Figure 6a, data level integration is accomplished right after obtaining data from input modalities. Thus, the collected data contains abundant information due to the absence of preprocessing, but this absence also leads to potential data reliability issues such as vulnerability to noise and data loss on sensor failure [6], [166], [167]. After the integration, the data is processed to extract features that are used for making a final decision to form a result. Data level integration has been used in cases such as pattern recognition [165] and biosignal processing [6]. However, it is not commonly used in multimodal interaction systems, since such systems typically employ several different modalities. Thus, other multimodal interaction studies have employed the two other integration types, which are presented next [9], [11], [13], [168], [169].
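A minimal Python sketch of data level integration is given below: raw frames of the same scene from two synchronized cameras are merged pixel-wise before any feature extraction. Averaging is only one illustrative merging rule, and the random frames stand in for real camera data.

```python
# Data level integration: raw frames of the same scene from two synchronized
# cameras are merged before feature extraction. Averaging is illustrative.

import numpy as np


def fuse_raw_frames(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Pixel-wise average of two aligned, equally sized grayscale frames."""
    if frame_a.shape != frame_b.shape:
        raise ValueError("frames must be spatially aligned and equally sized")
    stacked = np.stack([frame_a, frame_b]).astype(np.float32)
    return stacked.mean(axis=0)          # fused frame goes on to feature extraction


if __name__ == "__main__":
    cam_a = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    cam_b = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    print(fuse_raw_frames(cam_a, cam_b).shape)    # (480, 640)
```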

4.1.2 Feature level integration

When the input data from each sensor provide extracted features, feature level integration can take place (Figure 6b). Feature level integration is also called ‘early fusion’ [9], [11]. In this integration process, features are obtained from the input data when the data go through data processing designed for each sensor [165], [167]. The features do not need to come from the same type of modality. Features extracted from closely coupled and synchronized modalities (e.g., the sound of speech and images of lip movement) can be integrated in order to produce another feature [167], [168]. Due to the data processing that occurs before integration, feature level integration has a relatively high data loss rate compared to data level integration, while it is less vulnerable to noise [6], [167]. However, features from sensor data can result in a large quantity of data, which requires a high computational cost in order to obtain results [6], [166], [167], [169]. Feature level integration has been used, for instance, for feature extraction from biosignals in affective computing [6]. Additionally, Dasarathy categorized two additional types of integration in a refined version of the data integration categorization, one of which is the ‘data in-feature out’ type [166]. In this integration type, raw data from input sensors are preprocessed before they are merged into one data set for feature extraction. Depending on the viewpoint, this integration type can be labelled as either ‘data level integration’ or ‘feature level integration’. In other studies, this integration type is called ‘intermediate fusion’ (mid-level integration), and some probabilistic graphical models for pattern recognition are categorized into this type [9], [11].
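A minimal Python sketch of feature level (early) fusion follows: features extracted independently from two coupled modalities are concatenated into one vector before a single classifier produces the decision. The feature extractors, the weight vector and the threshold are placeholders, not a trained model.

```python
# Feature level (early) fusion: per-modality features are concatenated and a
# single classifier decides. Extractors, weights and threshold are placeholders.

import numpy as np


def audio_features(audio_frame: np.ndarray) -> np.ndarray:
    return np.array([audio_frame.mean(), audio_frame.std()])     # placeholder

def lip_features(lip_frame: np.ndarray) -> np.ndarray:
    return np.array([lip_frame.mean(), lip_frame.max()])         # placeholder


def early_fusion_decision(audio_frame: np.ndarray, lip_frame: np.ndarray) -> int:
    fused = np.concatenate([audio_features(audio_frame), lip_features(lip_frame)])
    weights = np.array([0.4, 0.1, 0.3, 0.2])     # stand-in for a trained model
    score = float(fused @ weights)
    return int(score > 0.5)                      # one decision from the fused vector


if __name__ == "__main__":
    print(early_fusion_decision(np.random.rand(1024), np.random.rand(64, 64)))
```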

4.1.3 Decision level integration

Decision level integration is the last type; it takes a number of alternative decisions as inputs and produces a final decision as an output (Figure 6c). This integration type is also called ‘late fusion’ [9], [11] or ‘semantic fusion’ [168], [169]. In decision level integration, data derived from different modalities are processed independently until the respective decisions have been made. A final decision or interpretation regarding the agent’s input is made when all the decisions are ready to be merged [9], [55]. Decision level integration is used in multimodal systems in order to integrate multiple modalities that are not tightly coupled but carry complementary information [6], [169]. Due to the individual decision-making process for each modality, the computational cost at this stage is relatively lower than that of feature level integration [11], [167]. Dasarathy’s categorization also has the ‘feature in-decision out’ type of integration, which can be labelled as either ‘feature level integration’ or ‘decision level integration’ [166]. In this integration type, the system classifies input features based on trained knowledge to produce an output decision. Some pattern recognition systems utilized this type of integration process [6], [9], [11].
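A minimal Python sketch of decision level (late) fusion follows: each modality delivers its own decision with a confidence, and a weighted vote yields the final decision. The modality names, confidences and weights are illustrative assumptions.

```python
# Decision level (late) fusion: per-modality decisions are combined by a
# weighted vote. Modalities, confidences and weights are illustrative.

from typing import Dict, Tuple


def late_fusion(decisions: Dict[str, Tuple[str, float]],
                weights: Dict[str, float]) -> str:
    """decisions: modality -> (label, confidence in [0, 1])."""
    scores: Dict[str, float] = {}
    for modality, (label, confidence) in decisions.items():
        scores[label] = scores.get(label, 0.0) + weights.get(modality, 1.0) * confidence
    return max(scores, key=scores.get)           # label with the highest weighted score


if __name__ == "__main__":
    decisions = {
        "speech":  ("open_door", 0.80),
        "gesture": ("open_door", 0.60),
        "gaze":    ("ignore", 0.55),
    }
    print(late_fusion(decisions, weights={"speech": 1.0, "gesture": 0.8, "gaze": 0.5}))
```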

4.2 Presentation (Fission)

After the integration of input data is completed, a decision or interpretation regarding the input data needs to be delivered to the agent through one or more modalities. This stage is called ‘presentation’ or ‘fission’. The output from the multimodal interaction system becomes another stimulus that can affect the agent’s behaviour [6]. Consequently, the outcome of the interaction depends on how the multimodal interaction system chooses to present the decision to the agent. Thus, the way in which output is presented to agents through modalities should be carefully considered when designing a multimodal interaction system. To mitigate this challenge, Foster [170] proposed three important tasks in the presentation process (content selection and structuring, modality selection, and output coordination), whereas Rousseau et al. [171] presented a model consisting of four questions (What-Which-How-Then) that must be taken into account in design processes for multimodal presentation: (i) What is the information to present? (ii) Which modalities should we use to present this information? (iii) How should the information be presented using these modalities? (iv) and Then, how should the evolution of the resulting presentation be handled? Some studies employed these models to improve the efficiency of the presentation. For example, Grifoni [172] used Foster’s three tasks for analyzing the features of visual, auditory, and haptic modalities when these are used as output channels. Cost and Duarte [173] used the WWHT model to design an adaptive user interface based on user profiles, with a focus on older adults. Another example of adaptive systems was provided by Honold et al. [174], who designed an adaptive presentation system that provides visual information to agents based on their contexts.
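To make the modality selection ('Which') step concrete, the following Python sketch chooses output modalities for a piece of information given a simple agent context. The rules, context fields and modality names are illustrative assumptions rather than part of the WWHT model or any cited system.

```python
# Fission sketch: choose output modalities for a piece of information given
# the agent's context. Rules and field names are illustrative assumptions.

from typing import Dict, List


def select_output_modalities(info: Dict, context: Dict) -> List[str]:
    """'What' is `info`; this function answers 'Which'; rendering ('How') and
    updates ('Then') would follow in a full system."""
    chosen: List[str] = []
    if not context.get("user_is_driving", False):
        chosen.append("visual")                  # screen output is safe to read
    if info.get("urgent", False) or context.get("user_is_driving", False):
        chosen.append("audio")                   # speech/alert for urgent or hands-busy cases
    if context.get("environment_noisy", False):
        chosen.append("haptic")                  # vibration survives noise
    return chosen or ["visual"]


if __name__ == "__main__":
    print(select_output_modalities({"urgent": True},
                                   {"user_is_driving": True, "environment_noisy": True}))
```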

5 Multimodal Interaction using Internet of Things & Augmented Reality

Before we dive into use cases of multimodal interaction with AR and IoT together, we present several studies that built multimodal interaction with AR or IoT individually. In order to identify the modalities that have been used in AR/IoT multimodal interaction systems, we analyzed studies based on the modalities that each system used to provide its main functionality to the agent, which could be a human or another system.

5.1 Internet of Things

The data collected through IoT devices have been used by multimodal interaction systems to understand the agent's state, intent or context. As technology advances and more types of data are collected, systems become more capable of understanding and helping the agent. Furthermore, when a multimodal interaction system is combined with other technologies rather than relying on a single sensor, it can collect and analyze large quantities of data from diverse sources for a better understanding of the agent. In this section, we analyze studies that developed multimodal interaction using IoT. The analyzed studies are categorized by the primary modality used.

Some studies were excluded from in-depth analysis due to a lack of information regarding technical aspects, although they exemplify use cases of IoT. As an example, Farhan et al. [130] developed a Learning Management System (LMS) that uses attention scoring assessment. Students follow a video lecture shown on a computer while being recorded by a webcam, and their attention level is evaluated from the real-time images based on the position of the face and whether the eyes are closed. This assessment can be used by instructors to understand the status of their students and to improve the quality of their learning experiences. The system gathers some information (e.g. student location) from the students via an IoT infrastructure to improve the learning experience; however, the IoT infrastructure was not described in detail. We therefore excluded such studies with major gaps in technical details from our analysis.

Additionally, it is important to acknowledge the existence of important challenges and knowledge on different research topics within IoT regardless of their use of multimodality. Therefore, we point to several survey papers that focus on specific issues and use cases, such as IoT systems in mobile phone computing [175], IoT for medical applications [176], [177], occupancy monitoring for smart buildings [178], data fusion methods for smart environments [179] and brain-computer interfaces [6].

5.1.1 Visual signal

The visual signal is defined here as anything that can be captured as a set of images by an optical instrument. The captured images can be processed with computer vision technology to recognize the movements of a target (e.g. hand gesture, body gesture, eye gaze, facial expression) or physical objects. This is a powerful modality that allows a system to perceive the visual richness of the surrounding world. As an example, Kubler et al. [180] used this approach when they developed and tested a smart parking manager using IoT. The user can reserve a parking spot through an online booking system, and when the user tries to enter the parking lot, the car's plate number is optically detected at a gateway. The interaction system actuates the gate to open when it finds a matching number in the booking list. After the user enters the parking lot, the user's location is saved in the interaction system. In the case of an accident, the interaction system grants an emergency vehicle access to the parking lot based on the latest updated user locations.
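A stripped-down sketch of the gate-control logic in such a system might look as follows; it assumes OpenCV and Tesseract OCR are available, omits plate localization entirely, and uses a hypothetical booking list, so it only illustrates the flow from visual signal to actuation.

```python
import cv2                     # OpenCV for image handling
import pytesseract             # Tesseract OCR wrapper

BOOKED_PLATES = {"ABC123", "XYZ789"}        # hypothetical booking list

def read_plate(image_path):
    """Return the OCR'd text of an image assumed to show only the plate region."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray)
    return "".join(ch for ch in text if ch.isalnum()).upper()

def handle_arrival(image_path, open_gate):
    """Open the gate (an IoT actuation callback) if the plate is in the booking list."""
    plate = read_plate(image_path)
    if plate in BOOKED_PLATES:
        open_gate()
        return True
    return False
```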

5.1.2 Sound

Sound is a wave signal that can be captured by a sensor device. Kranz et al. [181] used the sound modality when they developed a context-aware kitchen that consists of a microphone, a camera and several IoT sensors on a knife and a cutting board. The interaction system can recognize the type of ingredient being cut by analyzing the sound of the cut. While the microphone detects the type of ingredient, the three-axis force and three-axis torque sensors on the knife also collect data that are used to determine the type of food. Another use case of the sound modality is the air and sound pollution monitoring system by Saha et al. [182], which uses a sound sensor and a gas sensor to measure sound intensity and air quality in a specific area. When the system detects abnormal levels of noise or air pollution, an alarm is raised to prompt agents to take action. As an extension of such monitoring systems, cardiac sounds have been used to monitor an agent's health condition [183], [184]; when the monitoring system detects an abnormality in the cardiac sound stream, suitable services can be offered to the agent [185].
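The core of such sound-based monitoring is often a simple level check. The snippet below, with a made-up threshold and synthetic samples, estimates the RMS level of one audio frame and raises an alert when it exceeds a limit; mapping the computed dBFS value to actual sound pressure levels would require microphone calibration.

```python
import numpy as np

NOISE_LIMIT_DBFS = -20.0          # illustrative threshold, not taken from the cited studies

def rms_level_dbfs(samples):
    """RMS level of an audio frame in dB relative to full scale (1.0)."""
    rms = np.sqrt(np.mean(np.square(samples)))
    return 20.0 * np.log10(max(rms, 1e-12))

def check_frame(samples):
    """Print an alert when the frame is louder than the configured limit."""
    level = rms_level_dbfs(samples)
    if level > NOISE_LIMIT_DBFS:
        print(f"ALERT: noise level {level:.1f} dBFS exceeds limit")
    return level

# Synthetic frame standing in for one second of microphone input at 44.1 kHz.
check_frame(np.random.default_rng(1).normal(scale=0.5, size=44100))
```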

5.1.3 Biosignal

The human body is a source of various biosignals which can be used for understanding the physical and psychological state of a person. This offers diverse ways of interacting with a system by using body signals as a medium, thereby promoting natural interaction. Biosignal sensors are attached to the human body to record data and send them to an interaction system. For instance, Wai et al. [186] used EEG, photoplethysmography (PPG) and an eye tracker to understand the user's neural, physical and physiological states. The EEG and PPG sensors were attached to a headband, and the sensors sent the recorded data to a cloud server for real-time processing and access. Whereas the EEG sensors were used for recording the user's brainwave signals, the PPG sensors were used for recording the heart rate and heart rate variability. With these data, together with gaze data, the authors tried to detect fatigue in a driving scenario, and their results show that combining the EEG and PPG modalities yields higher detection accuracy than using either modality alone.
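As an example of the kind of feature such a system derives from PPG data, the sketch below computes the mean heart rate and RMSSD, a standard heart rate variability measure, from a series of inter-beat intervals; the interval values are invented for illustration.

```python
import numpy as np

def heart_rate_bpm(rr_intervals_s):
    """Mean heart rate in beats per minute from inter-beat (RR) intervals in seconds."""
    return 60.0 / np.mean(rr_intervals_s)

def rmssd_ms(rr_intervals_s):
    """RMSSD: root mean square of successive RR interval differences, in milliseconds."""
    diffs_ms = np.diff(np.asarray(rr_intervals_s) * 1000.0)
    return float(np.sqrt(np.mean(np.square(diffs_ms))))

# Invented inter-beat intervals (seconds) standing in for a short PPG recording.
rr = [0.82, 0.85, 0.79, 0.88, 0.84, 0.81, 0.86]
print(f"HR = {heart_rate_bpm(rr):.1f} bpm, RMSSD = {rmssd_ms(rr):.1f} ms")
```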

5.1.4 Inertia & Location

An accelerometer is a sensor which can be used to determine the user's physical activity from the acceleration of their body parts. Most recent smartphones have an accelerometer, and many smart accident detectors have been implemented using the built-in accelerometers of smartphones [187]–[189]. However, in the case of smartphone-based accident detectors, the smartphone's built-in accelerometer may not sense the user's movement in the desired manner. To overcome this issue, Chandran et al. [190] developed a helmet that detects accidents and sends notifications for user safety. A three-axis accelerometer was attached to the helmet to collect data on the acceleration of the user's head. The data were sent to a cloud server, which detects an accident by analysing rapid changes in acceleration. When the system detects an accident, it places a phone call and sends a message that the user is expected to answer within a certain amount of time. If the system does not receive an answer from the user, it places an emergency call and sends a message to another number registered by the user beforehand. In another example, He et al. [191] attached a gyroscope and an accelerometer to a vest to obtain more accurate measurements of the user's physical activity for fall detection. The sensors send data to a fall detection app on a smartphone, and when the app detects a fall, it notifies registered users through different modalities, such as a message, a phone call, an alarm and vibration. Location is another type of information that can be given as input to an interaction system. Nazari Shirehjini and Semsar [192] created an app that shows information about controllable objects around the user on their smartphone, depending on the room the user has entered. Objects identified through IoT devices are presented on the screen as 3D objects. The app shows their status, such as power state and location, and the user can turn them on and off by tapping the screen with their fingers.
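The detection step in these accelerometer-based systems typically reduces to monitoring the magnitude of the acceleration vector. A minimal sketch, with an arbitrary threshold and a placeholder notification instead of an actual phone call, could look like this:

```python
import math

IMPACT_THRESHOLD_G = 3.0        # arbitrary example value, not taken from the cited papers

def acceleration_magnitude_g(ax, ay, az):
    """Magnitude of the acceleration vector in g, given per-axis readings in g."""
    return math.sqrt(ax * ax + ay * ay + az * az)

def process_sample(ax, ay, az, notify):
    """Call `notify` (e.g. queue an emergency message) when an impact-like spike occurs."""
    if acceleration_magnitude_g(ax, ay, az) > IMPACT_THRESHOLD_G:
        notify("Possible accident detected - awaiting user confirmation")
        return True
    return False

process_sample(0.1, 0.2, 5.8, notify=print)   # simulated hard impact
```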

5.2 Augmented Reality

According to our literature survey, all the studies that utilize AR in a multimodal interaction system use the visual signal as the main input modality, due to the visualization of virtual content. One of the known advantages of AR is that it attracts the user's attention and motivates the user to focus on the content, as demonstrated in particular by a study on education [193]. Due to these advantages, AR has been used in different studies either to visualize information or as part of a user interface that provides interactable objects for controlling the system.

In order to present AR content on the screen, the AR device requires knowledge of the position where the content should be placed in the real-world view. Visual markers are a commonly used method of providing a positional anchor to the AR device for displaying AR content. However, this method has the limitation that a marker must be placed on the target object or location. Some researchers have therefore combined AR with other computer vision techniques, such as object recognition, image recognition and object detection, to overcome the limitations of marker-based AR visualization. Bhargava et al. [139] used AR for text recognition and translation between English and Hindi, and vice versa. The mobile device displays the translated text on the screen and also uses sound to speak the text out loud. In 2015, when Bhargava's team developed their application, Google Translate was able to translate text only from English to Hindi [60].

Multiple input and output modalities were applied in the study by Zhu et al. [194], who developed an AR mentoring system on an HMD that helps with inspection and repair tasks. When the user asks a question by voice, the mentoring system responds by voice with answers drawn from its stored data. The mentoring system also uses a camera, an IMU and object recognition to identify the target object and the direction of the user's gaze, and presents the tasks on the user's screen in AR. Al-Jabi and Sammaneh [195] used AR to present the information a user needs when parking a car. Their system uses deep learning to interpret the camera view while the user navigates the space. When the system recognizes the user's location from the camera view, an AR arrow appears on the screen to navigate the user to the closest available parking slot. If the system cannot find any available parking slot, it visualizes the remaining parking time of the other cars on the roof of each car.

5.3 IoT with AR

Researchers have focused on establishing efficient interaction methods that use both AR and IoT to increase usability. This combination of the two technologies, AR and IoT, is sometimes called “ARIoT” [196]. According to our literature study, there are four types of interaction in ARIoT systems; Figure 7 describes each type. The first two types are for interaction through AR interfaces (Figure 7 (a) and (b)). In these types, the agent can control a real-world object by manipulating an AR interface on a mobile device such as a smartphone, tablet PC or HMD. Thus, AR is used not only for presenting information but also for providing inputs for interaction. The difference between Figure 7 (a) and Figure 7 (b) is whether the marker (with its IoT sensor) is attached to the IoT device itself or to a plain object. Conversely, the other two types, depicted in Figure 7 (c) and (d), use AR only for displaying data on the screen. The agent can still interact with the augmented objects, but only to request which data should be presented. As with Figure 7 (a) and (b), the only difference between these two types is whether the marker is on the IoT device or on a plain object. In the case of Figure 7 (c) and (d), the agent can manipulate the real-world object directly, while AR is only used to display the data. Based on the four interaction types illustrated in Figure 7, we categorize them into two groups: AR for interface and AR for data presentation.

Figure 7: Interaction types in ARIoT. (a) AR for interface, marker on IoT; (b) AR for interface, marker on object; (c) AR for data presentation, marker on IoT; (d) AR for data presentation, marker on object.
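The four interaction types can also be read as a simple two-axis classification (AR role versus marker placement); the small sketch below encodes that taxonomy, with class and field names of our own choosing introduced only to make the categorization explicit.

```python
from dataclasses import dataclass

@dataclass
class ARIoTSystem:
    ar_role: str          # "interface" (AR used for control) or "presentation" (display only)
    marker_on_iot: bool   # True: marker/anchor sits on the IoT device; False: on a plain object

    def interaction_type(self):
        """Map the two axes onto the types (a)-(d) of Figure 7."""
        if self.ar_role == "interface":
            return "a" if self.marker_on_iot else "b"
        return "c" if self.marker_on_iot else "d"

print(ARIoTSystem("interface", True).interaction_type())      # -> "a"
print(ARIoTSystem("presentation", False).interaction_type())  # -> "d"
```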

Table 1 presents a summary of ARIoT studies with information on input and output modalities, features of the implemented systems, and interaction types based on Figure 7. 'Visual graphics' in the output modality column indicates any form of computer-generated graphical object that can be displayed on the screen, such as images, text, video, animations and 2D/3D objects. Unless the system uses a fixed-point camera, as in [197], [198], 'camera' in Table 1 denotes a tool for the visual signal modality, since it gives the agent freedom to search for objects.

Table 1: Interaction type classification

| Input modality (tool) | Output modality (tool) | Feature | Interaction type |
|---|---|---|---|
| Touch (finger), vision (camera), sensor value (IoT sensor: power state) | Visual graphics (screen), sensor command (IoT sensor: power control) | Manipulate the merchandise in a shop [10] | (a) |
| Touch (finger), vision (camera), sensor value (IoT sensor: microcontroller) | Visual graphics (screen), sensor command (IoT sensor: power control) | Present the status and manipulate real-world objects [196] | (a) |
| Touch (finger), vision (camera), sensor value (IoT sensor: RFID) | Visual graphics (screen) | Check and order books from a smart shelf [199] | (a) |
| Touch (finger), vision (camera), sensor value (IoT sensor: power state) | Visual graphics (screen), sensor command (IoT sensor: power control) | In-house energy management application [200] | (b) |
| Touch (finger), vision (camera), sensor value (IoT sensor: temperature) | Visual graphics (screen) | Sensor data visualization [201] | (c) |
| Vision (camera), sensor value (IoT sensors: microcontroller, battery level, electrical connection, power consumption) | Visual graphics (screen) | Present the current status of an electrical panel [202] | (d) |
| Vision (camera), sensor value (IoT sensors: acceleration, O2, CO2, humidity, pressure, temperature) | Visual graphics (screen), sound (audio) | Task procedure visualization [203] | (c) |
| Click (mouse), sensor value (IoT sensors: soil moisture, temperature, water level) | Visual graphics (screen) | Farm manager that presents information about planted crops [197] | (d) |
| Vision (camera), sensor value (IoT sensor: location) | Visual graphics (screen) | Locate IoT devices by using azimuth and elevation angles between a wireless transmitter and a mobile device [204] | (c) |
| Vision (camera), sensor value (IoT sensor: air pollution) | Visual graphics (screen) | Markerless AR game for informing the agent about air pollution issues [205] | (b) |
| Click (mouse), vision (camera), sensor value (IoT sensors: temperature, humidity) | Visual graphics (screen) | Present information about monuments in AR [206] | (d) |
| Vision (camera), sensor value (IoT sensors: temperature, humidity, soil moisture) | Visual graphics (screen) | Measure the response time for getting data from a cloud server to an AR application [207] | (d) |
| Touch (finger), vision (camera), sensor value (IoT sensors: temperature, power state, location) | Visual graphics (screen), sensor command (IoT sensor: command) | Application that runs within a smart city with smart objects [208] | (b) |
| Vision (camera), sensor value (IoT sensors: temperature, dust, gas concentration, carbon monoxide) | Visual graphics (screen) | Visualize IoT sensor data in AR for miner safety [209] | (d) |
| Vision (camera), sensor value (IoT sensors: temperature, battery level) | Visual graphics (screen) | Visualize IoT sensor data related to Quality of Service in AR [210] | (c) |
| Vision (camera), sensor value (IoT sensor: power state) | Visual graphics (screen) | Web-based AR application for monitoring the power state of the IoT device [211] | (d) |
| Touch (finger), vision (camera), sensor value (IoT sensor: location) | Visual graphics (screen) | Public transport application that provides information about the state of a bus [212] | (d) |
| Touch (finger), vision (camera), sensor value (IoT sensors: temperature, CO2) | Visual graphics (screen) | Serious game to raise awareness on air pollution issues [213] | (b) |
| Sensor value (IoT sensors: light level, orientation, position), physical object (cube) | Visual graphics (screen) | Evaluation system for upper limb function by using a physical cube on an IoT board [198] | (c) |
| Touch (finger), vision (camera), sound (voice), sensor value (IoT sensor: power state) | Visual graphics (screen), sensor command (IoT sensor: power control) | Programming a smart environment by using AR bricks [214] | (b) |
| Touch (finger), vision (camera), sensor value (IoT sensors: temperature, power consumption) | Visual graphics (screen), sensor command (IoT sensor: power control) | Application that helps changing people's behaviour for better energy efficiency at schools [215] | (b) |
| Vision (camera), sensor value (IoT sensor: heartbeat rate) | Visual graphics (screen), sound (audio) | Visualize a virtual heart that pulsates with an actual heartbeat [216] | (d) |
| Vision (camera), gesture (hand), sensor value (IoT sensors: power state, air pressure, temperature, humidity) | Visual graphics (screen), sensor command (IoT sensor: command) | Application that can build reaction rules between IoT devices through an AR interface on an HMD [217] | (a) |
| Touch (finger), vision (camera), sensor value (IoT sensors: power voltage, gas, PM2.5, PM10, temperature, pressure, humidity) | Visual graphics (screen), sensor command (IoT sensor: power control) | Visualize energy consumption data of a smart plug [218] | (a) |
| Vision (camera), sensor value (IoT sensors: strain, voltage) | Visual graphics (screen) | Monitor the stress value of metal shelving based on strain gauges [219] | (c) |
| Vision (camera), gesture (hand) | Visual graphics (screen), sensor command (IoT sensors: sound, light) | Control IoT devices by hand gestures [220] | (a) |
| Touch (finger), vision (camera), sensor value (IoT sensors: battery, moisture) | Visual graphics (screen), sensor command (IoT sensor: robot movement) | Design robot actions through an AR interface [221] | (b) |
| Vision (camera), sensor value (IoT sensors: acceleration, distance) | Visual graphics (screen) | Guide the agent to a destination by using virtual text and virtual arrows [222] | (d) |
| Touch (finger), vision (camera), sensor value (IoT sensor: motor value) | Visual graphics (screen), sensor command (IoT sensor: motor control) | Control the LEGO centrifuge [223] | (a) |
| Vision (camera), sensor value (IoT sensors: temperature, humidity, light, noise, presence of movement data) | Visual graphics (screen) | Present real-time data from IoT sensors by using AR [224] | (d) |
| Vision (camera), sensor value (IoT sensors: distance, microcontroller) | Visual graphics (screen), sensor command (IoT sensor: command) | Spatial mapping of a smart object's location and display on the screen based on the distance between the AR device and the smart object [225] | (a) |
| Touch (haptic device), vision (camera), sensor value (IoT sensor: city data), gesture (hand), sound (voice) | Visual graphics (screen) | Visualize a city model in AR on a HoloLens to navigate the city data [226] | (b) |

5.3.1 AR for user interaction

An AR interface visualizing IoT data can be used to inform the agent about real-world objects. Pokrić et al. [205] developed an AR game which presents the air quality level collected from IoT sensors in real time. In order to raise awareness of air pollution, the game lets users guess and then check the actual air quality level of their city. The authors of [199] developed an app that assists people with motor disabilities in selecting desired books from shelves. Through the AR interface, the user can check the status of the listed books on shelves that have IoT sensors attached to each section. When the user selects a book by touching the screen, a real-world assistant brings the selected book from the shelf to the user.

In addition, AR can be used to carry out actuation that affects real-world objects. Jo et al. [10], [196] used AR to present, on a mobile device, a virtual object that refers to a real-world object (e.g., a lamp). IoT was used to let the user manipulate the real-world object, such as turning the lamp on and off, by touching the AR object with a finger. However, the AR interface can be used not only for direct control of a real-world object but also for indirect control, by programming the behaviours of each device depending on the situation and the relations between devices. Stefanidi et al. [214] developed a platform for programming the behaviours of in-house IoT products. The user can set up the interaction between the products and human activity by combining virtual bricks on an AR interface. For example, if the user programs an alarm to ring when the door opens, this happens in the real world, and behaviours in the real world (e.g., turning off the alarm) are conversely reflected in the AR interface. As an example of utilizing an AR interface as a remote controller for IoT objects, Cho et al. [200] tested their energy management system, which enables the control of real-world entities through an AR interface, in a larger space: a building. Their system collects the overall status of energy consumption in the building and represents it on an AR map. To manage energy consumption, the user can turn real-world entities on and off by pressing the corresponding buttons on the AR map with a finger.
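The control path in these AR-for-interaction systems is conceptually simple: a touch on the virtual object is translated into a command for the corresponding device. The sketch below abstracts the transport behind a placeholder function, since the cited systems use different protocols (e.g., MQTT, HTTP or vendor APIs); the device identifier and topic scheme are hypothetical.

```python
def publish_command(topic, payload):
    """Placeholder for the IoT transport (e.g. an MQTT publish or a REST call)."""
    print(f"-> {topic}: {payload}")

def on_ar_object_tapped(device_id, current_state):
    """Called by the AR layer when the agent taps the virtual twin of a device."""
    new_state = "OFF" if current_state == "ON" else "ON"
    publish_command(f"home/{device_id}/power", new_state)   # hypothetical topic scheme
    return new_state

print(on_ar_object_tapped("lamp-livingroom", "OFF"))         # -> "ON"
```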

5.3.2 AR for interactive data representation

Unlike the aforementioned cases, here AR is used only as a tool to visualize data; with interactive data representation, the agent can select which data to visualize. Seitz et al. [201] combined AR with IoT devices attached to machines in industrial factories; the main purpose of AR was to visualize the machines' states through interactive representation of the IoT data. Chaves-Diéguez et al. [202] used interactive data representation on an electrical panel. The authors of [203] developed an HMD system that uses AR to guide staff in carrying out maintenance tasks; to identify the target object, detection techniques were used instead of markers. Finally, in order to achieve precision farming, Phupattanasilp and Tong [197] developed a farm manager which presents crop information through AR. The crop-related information, such as soil moisture, temperature and water level, was gathered by IoT sensors.
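For AR used as interactive data representation, the essential step is fetching the latest sensor values on demand and turning them into an overlay label anchored to the recognized object. The sketch below assumes a hypothetical REST endpoint and JSON fields; it is only meant to show the data path, not any of the cited systems' actual APIs.

```python
import requests

def fetch_sensor_reading(device_id):
    """Query a (hypothetical) REST endpoint exposing the latest IoT reading."""
    response = requests.get(
        f"https://iot.example.org/devices/{device_id}/latest", timeout=5
    )
    response.raise_for_status()
    return response.json()          # e.g. {"temperature": 21.4, "humidity": 48}

def build_ar_label(device_id):
    """Format the reading as the text an AR layer would attach to the object."""
    data = fetch_sensor_reading(device_id)
    return "\n".join(f"{key}: {value}" for key, value in data.items())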

6 Discussion

From the literature survey, we found 36 studies on multimodal interaction with AR and IoT covering various subjects, such as smart environments, energy management, assistance for people with special needs, tourism, mobile gaming, agriculture and shopping. Despite this diversity, most of the studies used a unique framework and/or architecture during the development process, designed for their specific research purpose and modality requirements. Thus, only a few studies proposed a complete or partial framework and/or architecture that others could use to create multimodal interaction systems with AR and IoT. This rarity is due to the following challenges that previous research on multimodal interaction systems has identified.

Multidisciplinary

Because multimodal interaction systems utilize various modalities in many ways depending on their purpose, each modality requires an expert to implement the respective part of the system with the intended level of usability and performance [201], [214]. Otherwise, developers must have multidisciplinary knowledge in diverse fields to achieve a certain degree of usability and performance in the multimodal interaction system [9], [11]. This requirement increases the difficulty of system development and becomes an obstacle to the establishment of a general framework/architecture for multimodal interaction systems.

Reusability

According to our literature survey, most of the proposed multimodal interaction systems were developed for a specific purpose and only evaluated within dedicated testbeds rather than implemented and evaluated in practical environments. In these systems, it is necessary to configure all devices beforehand in order to provide the intended services. Different multimodal interaction systems may have varying requirements regarding the types of devices and interaction modalities, and new device types may need to be added later; however, the previous systems do not provide a reusable framework/architecture to cater for these requirements. Additionally, a multimodal interaction system should be evaluated in a practical environment, preferably with different use cases, in order to verify its user experience [200], [201], [206], [211]. Most of the reviewed systems failed to do this because they were only evaluated in testbeds.

Technology

Some features cannot be served effectively due to the limits of current technology. For example, a voice recognition system may work well for English speakers, while non-English speakers might have issues when using it. This can be solved by using another voice recognition system designed for a specific language; however, the use of a dedicated language recognition system may reduce reusability. Another example is sensors for the biosignal modality: a BCI delivers brainwaves as input signals through EEG, but EEG has issues regarding ease of use and data reliability [6], [227]. An EEG device requires many setup steps, and several precautions must be taken while wearing it to ensure reliable data collection. Still within the technology challenge, AR requires further improvement of its target recognition and tracking features. In marker-based tracking, when the camera cannot read enough information about the target (e.g., a marker) while it is moving rapidly, the AR system fails to track the target on which the AR content should be placed. Cao et al. [221], for example, encountered this challenge and planned, as future work, to employ a new protocol in their system for recovering from lost tracking. In markerless tracking, when the multimodal interaction system analyzes a camera image to estimate the location of the target, there is a high chance of incorrect coordinates due to the inability to read precise z-axis values [197].

Scalability

The establishment of a reusable framework and/or architecture that achieves high scalability is one of the most significant challenges for multimodal interaction system developers. The reason for this difficulty is that a multimodal interaction system needs to cover numerous types of input/output modalities and IoT devices [167], [208], as well as cater for increased performance requirements [207], [228]. In other words, this challenge would be solved if the multimodal interaction system could be easily extended to efficiently manage current and future input/output modalities and IoT devices. Many partial solutions have been proposed to address the scalability challenge, such as distributed computing that reduces the burden on a central server by using metadata [196], managing the number of connected devices based on agent profiles or the distance between agents and devices [208], and a hierarchical cloud service that reduces the response time for delivering IoT sensor data from a cloud server to an AR application [207]. However, most of these solutions were only proposed and partially verified in testbeds rather than in real environments.

Security

Ensuring a proper level of security between agents and a multimodal interaction system is another challenge. Due to the diverse connections between various agents and a number of devices through multiple modalities, any violation of privacy or security would have a huge impact on the credibility of the multimodal interaction system. Thus, it is important to provide reliable security to every connected agent and device of a multimodal interaction system.

New modalities

The development of new technologies and devices that can be adopted in multimodal interaction systems raises the chance of discovering new modalities [229]. For example, further improvement of foldable screens on mobile devices might bring a new type of modality that could be used in multimodal interaction, such as a screen attachable to human skin or interactive clothing made of foldable screens. New modalities may be introduced not only by the emergence of new technology but also by improvements to existing technology that enable more complicated and precise tasks at performance levels sufficient for use in a multimodal interaction system.

Interface design

In order to provide a sufficient level of usability to agents while managing diverse input/output modalities and various types of IoT devices, a multimodal interaction system needs a properly designed interface. The usability requirements depend on several factors, such as the purpose of the multimodal interaction system and the available modalities [217], [225]. In particular, interface design is affected not only by what is implementable but also by the performance level of each modality; developers should thus consider the performance of each available modality before merging them into an interface [11], [167]. Moreover, multiple modalities in an interface should not simply be combined without considering the agent's cognitive aspects. Since there is no evidence that an agent, especially a human, processes incoming information separately for each modality, all modalities should be integrated in a complementary manner to provide fully contextualized information [9], [149]. This issue is also important when the agent is a machine; human-robot interaction research, for instance, focuses on this cognitive aspect when building robot systems that must understand human actions and behave in a human-like way [230]. Another interface design challenge is flexibility in modality selection and combination [11]. Since a multimodal interaction system may serve various types of agents, the interface should be flexible enough to offer different combinations of modalities depending on an agent's preferences, while the selected combination should still be presented in a natural manner with a proper level of intuitiveness [167], [208].

Adaptivity of Multimodal Interface Design

Extending the interface design challenge above, considering cognitive aspects and modality flexibility could enable an interface to adapt to an agent's affective/cognitive state [9], [172]. The development of adaptive interfaces is a demanding challenge that could provide agents with a personalized experience by offering customized interaction modalities depending on the situation [167], [229], [231]. Such a situation could comprise, for example, an agent's preferences or the type of information to be presented. Related to interface adaptivity, there is also the challenge of producing personalized rather than uniform feedback for agents in order to achieve better usability [214].

Usability of Multimodal Interaction Interface

The last challenge is related to the properties of multimodality itself. One of the key reasons for adding multiple modalities to an interaction system is to improve usability compared to a unimodal interaction system. However, multimodality does not always provide better usability; instead, it may provide a flexible interface whereby an agent can choose different modalities depending on their needs [232]. This lack of certainty about usability improvement may be caused by missing knowledge about the relationships between agents' cognitive processes and each modality [9], [11]. Therefore, resolving this challenge would require in-depth usability analysis of a number of multimodal interaction systems with heterogeneous modalities.

From our findings, we could identify research questions that have the potential to resolve the known challenges.These questions are illustrated in Table 2 together with their respective challenges.

Table 2: Challenges and Research questions

| Challenge | Research question |
|---|---|
| Multidisciplinary | Why has the combination of IoT and AR been employed less in multimodal interaction systems than the individual use of each technology? |
| Multidisciplinary | How can the use of IoT and AR in multimodal interaction systems be facilitated? |
| New modalities | What modalities are missing from the four identified modality categories that are feasible to use but not yet implemented? |
| Reusability | What framework/architecture design could be applied to a multimodal interaction system that uses any combination of the four modalities (visual signal, audio, inertia, and location) with IoT and AR in order to improve reusability? |
| Scalability | What design could be applied to increase the scalability of a multimodal interaction system that uses any combination of the four modalities (visual signal, audio, inertia, and location) with AR and IoT, in terms of both device management (modalities, agents) and system performance? |
| Reusability | How can the reusability of a multimodal interaction system framework/architecture be verified? |
| Interface design | What framework/architecture design could be used for applying newly discovered modalities alongside existing modalities? |
| Adaptivity of Multimodal Interface Design | In the design of an adaptive/personalized interface for a multimodal interaction system, what aspects (agent-centric vs. information-centric) are important for selecting input/output modalities in a given context? |
| Adaptivity of Multimodal Interface Design | How can the adaptivity of a designed multimodal interaction system framework/architecture be verified in terms of usability (satisfaction, efficiency, effectiveness)? |
| Usability of Multimodal Interaction Interface | How can usability be ensured in an increasingly complex multimodal interaction environment? |

Since it is not easy to create a general architecture/framework that covers every possible modality in multimodal interaction with AR and IoT, we propose instead to work towards a reusable architecture/framework that uses combinations of the four modalities (visual signal, audio, and inertia & location). These modalities were chosen because they are the ones most commonly used in multimodal interaction systems that utilize AR and IoT: AR and IoT are usually implemented using visual signals (e.g., markers, image/object recognition) and location (e.g., geolocation delivered through a wireless network) as input modalities, while visual representation (e.g., on screen), sound (e.g., audio) and haptic feedback (e.g., vibration) are delivered to agents as output modalities. The main objective of this research is to establish a reusable framework/architecture that can be taken into account during the development process of multimodal interaction systems with IoT and AR. This framework/architecture should be verified through a number of practical development projects outside of testbed environments.

7 Conclusion

In this report, we described the state-of-the-art knowledge regarding multimodal interaction systems that utilize IoT and AR. We classified the available input and output modalities based on their types. These modalities can convey information between agents and a multimodal interaction system. We listed not only modalities that are already used in multimodal interaction systems but also modalities used in unimodal interaction systems that have the potential to be adopted in multimodal ones.

Since multimodal interaction systems utilize various modalities, there must be a process for handling the incoming and outgoing data between agents and the system. We summarized the proposed theories and ideas regarding the architecture of data management in multimodal interaction systems into two categories: the integration process, which merges the data sent by agents or acquired from other sources, and the presentation process, which delivers the outcome of the interaction to an agent through multiple modalities.

We also investigated how IoT and AR are each used in multimodal interaction systems. We then searched for multimodal interaction studies that utilized both IoT and AR and thereby enabled new forms of multimodal interaction. From the results of this literature survey, we identified four interaction types formed by the combination of IoT and AR:

1. AR for interface and marker on IoT, Figure 7 (a)

2. AR for interface and marker on object, Figure 7 (b)

3. AR for data presentation and marker on IoT, Figure 7 (c)

4. AR for data presentation and marker on object, Figure 7 (d)

Moreover, we listed the input and output modalities that have been used in multimodal interaction systems with IoT and AR in other studies, to identify what kinds of modalities are employed for enabling interaction between agents and a multimodal interaction system.

In conclusion, we have identified a number of remaining challenges in the research area of multimodal interaction systems with IoT and AR. The employment of multiple modalities requires knowledge from a variety of fields in order to combine the selected modalities into an interface that offers a sufficient level of usability for a wide range of agents.

To overcome the obstacles in developing multimodal interaction systems with IoT and AR, some researchers have built frameworks/architectures for such systems. However, research on this subject is still in its infancy due to several challenges that have made previously proposed frameworks/architectures case-specific and thereby hard to reuse in other situations. Examples of these challenges are the diversity of available modalities, which raises scalability issues, and the requirement for situation-aware heterogeneous interfaces with appropriate levels of usability.

Furthermore, some researchers have stressed the importance of a personalized experience for agents while they are using a multimodal interaction system. An adaptive interface that provides customized interaction modalities depending on the agent's situation is one potential way to achieve such personalized experiences. Making the system aware of the cognitive state of an agent is one element that can realize adaptive multimodal interfaces; developers should therefore take this into account in order to build multimodal interaction systems with appropriate levels of personalization and usability.

Based on the findings of the state-of-the-art research documented in this report, we foresee that important contributions to the research and development of multimodal interaction with IoT and AR will be (i) the design of novel multimodal interactions with good usability, and (ii) the creation of a reusable framework/architecture for multimodal interaction systems that use IoT and AR. We believe this will be a valuable contribution to the field of multimodal interaction research and development, which is still in its infancy.

References

[1] L. Liu and M. T. Özsu, Eds., Encyclopedia of database systems, ser. Springer Reference. New York: Springer, 2009.

[2] S. K. Card, T. P. Moran, and A. Newell, The psychology of human-computer interaction. 2018.

[3] R. A. Bolt, “Put-that-there”: Voice and Gesture at the Graphics Interface”, in Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, ser. SIGGRAPH ’80, New York, NY, USA: ACM, 1980.

[4] C. Zimmer, M. Bertram, F. Büntig, D. Drochtert, and C. Geiger, “Mobile augmented reality illustrations thatentertain and inform: Design and implementation issues with the hololens”, in SIGGRAPH Asia 2017 MobileGraphics & Interactive Applications on - SA ’17, Bangkok, Thailand: ACM Press, 2017.

[5] S. S. Rautaray and A. Agrawal, “Vision based hand gesture recognition for human computer interaction: Asurvey”, Artificial Intelligence Review, vol. 43, no. 1, Jan. 2015.

[6] H. Gürkök and A. Nijholt, “Brain–Computer Interfaces for Multimodal Interaction: A Survey and Principles”,International Journal of Human-Computer Interaction, vol. 28, no. 5, May 2012.

[7] M. Lee, M. Billinghurst, W. Baek, R. Green, and W. Woo, “A usability study of multimodal input in anaugmented reality environment”, Virtual Reality, vol. 17, no. 4, Nov. 2013.

[8] M.-A. Nguyen, “Designing Smart Interactions for Smart Objects”, in Human Computer Interaction in theInternet of Things Era, University of Munich Department of Computer Science Media Informatics Group,2015.

[9] A. Jaimes and N. Sebe, “Multimodal human–computer interaction: A survey”, Computer Vision and ImageUnderstanding, vol. 108, no. 1-2, Oct. 2007.

[10] D. Jo and G. J. Kim, “IoT + AR: Pervasive and augmented environments for “Digi-log” shopping experience”,Human-centric Computing and Information Sciences, vol. 9, no. 1, Dec. 2019.

[11] M. Turk, “Multimodal interaction: A review”, Pattern Recognition Letters, vol. 36, Jan. 2014.

[12] ISO 9241-11:2018(en), Ergonomics of human-system interaction — Part 11: Usability: Definitions andconcepts. [Online]. Available: https://www.iso.org/obp/ui/#iso:std:iso:9241:-11:ed-2:v1:en.

[13] S. S. M. Nizam, R. Z. Abidin, N. C. Hashim, M. Chun, H. Arshad, and N. A. A. Majid, “A Review ofMultimodal Interaction Technique in Augmented Reality Environment”,

[14] M. Kefi, T. N. Hoang, P. Richard, and E. Verhulst, “An evaluation of multimodal interaction techniques for3d layout constraint solver in a desktop-based virtual environment”, Virtual Reality, vol. 22, no. 4, Nov. 2018.

[15] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of Things (IoT): A vision, architecturalelements, and future directions”, Future Generation Computer Systems, vol. 29, no. 7, Sep. 2013.

[16] ITU, ITU-T Y.4000/Y.2060, Jun. 2012. [Online]. Available: http://handle.itu.int/11.1002/1000/11559.

[17] P. Milgram and F. Kishino, “A taxonomy of mixed reality visual displays”, vol. 77, no. 12, 1994.

[18] Just Dance 2019. [Online]. Available: https://www.ubisoft.com/en-us/game/just-dance-2019.

[19] S. Piana, A. Staglianò, F. Odone, and A. Camurri, “Adaptive Body Gesture Representation for AutomaticEmotion Recognition”, ACM Transactions on Interactive Intelligent Systems, vol. 6, no. 1, Mar. 2016.

[20] J. E. Pompeu, C. Torriani-Pasin, F. Doná, F. F. Ganança, K. G. da Silva, and H. B. Ferraz, “Effect of Kinectgames on postural control of patients with Parkinson’s disease”, in Proceedings of the 3rd 2015 Workshop onICTs for improving Patients Rehabilitation Research Techniques - REHAB ’15, Lisbon, Portugal: ACM Press,2015.

[21] N. Vernadakis, V. Derri, E. Tsitskari, and P. Antoniou, “The effect of Xbox Kinect intervention on balanceability for previously injured young competitive male athletes: A preliminary study”, Physical Therapy inSport, vol. 15, no. 3, Aug. 2014.

[22] H. Kayama, K. Okamoto, S. Nishiguchi, M. Yamada, T. Kuroda, and T. Aoyama, “Effect of a Kinect-BasedExercise Game on Improving Executive Cognitive Performance in Community-Dwelling Elderly: Case ControlStudy”, Journal of Medical Internet Research, vol. 16, no. 2, Feb. 2014.

[23] T. Mizumoto, A. Fornaser, H. Suwa, K. Yasumoto, and M. De Cecco, “Kinect-Based Micro-Behavior SensingSystem for Learning the Smart Assistance with Human Subjects Inside Their Homes”, in 2018 Workshop onMetrology for Industry 4.0 and IoT, Brescia: IEEE, Apr. 2018.

[24] S. Deng, N. Jiang, J. Chang, S. Guo, and J. J. Zhang, “Understanding the impact of multimodal interactionusing gaze informed mid-air gesture control in 3d virtual objects manipulation”, International Journal ofHuman-Computer Studies, vol. 105, Sep. 2017.

[25] N. Jain, S. Kumar, A. Kumar, P. Shamsolmoali, and M. Zareapoor, “Hybrid deep neural networks for faceemotion recognition”, Pattern Recognition Letters, vol. 115, Nov. 2018.

[26] M. Galterio, S. Shavit, and T. Hayajneh, “A Review of Facial Biometrics Security for Smart Devices”,Computers, vol. 7, no. 3, Jun. 2018.

[27] P. B. Balla and K. T. Jadhao, “IoT Based Facial Recognition Security System”, in 2018 InternationalConference on Smart City and Emerging Technology (ICSCET), Mumbai: IEEE, Jan. 2018.

[28] L. A. Elrefaei, D. H. Hamid, A. A. Bayazed, S. S. Bushnak, and S. Y. Maasher, “Developing Iris RecognitionSystem for Smartphone Security”, Multimedia Tools and Applications, vol. 77, no. 12, Jun. 2018.

[29] G. Pangestu, F. Utaminingrum, and F. Bachtiar, “Eye State Recognition Using Multiple Methods for Applied to Control Smart Wheelchair”, International Journal of Intelligent Engineering and Systems, vol. 12, no. 1, Feb. 2019.

[30] F. Koochaki and L. Najafizadeh, “Predicting Intention Through Eye Gaze Patterns”, in 2018 IEEEBiomedical Circuits and Systems Conference (BioCAS), Cleveland, OH: IEEE, Oct. 2018.

[31] R. C. Luo, B.-H. Shih, and T.-W. Lin, “Real time human motion imitation of anthropomorphic dual armrobot based on Cartesian impedance control”, in 2013 IEEE International Symposium on Robotic and SensorsEnvironments (ROSE), Washington, DC, USA: IEEE, Oct. 2013.

[32] K. Qian, J. Niu, and H. Yang, “Developing a Gesture Based Remote Human-Robot Interaction System UsingKinect”, International Journal of Smart Home, vol. 7, no. 4, 2013.

[33] J. P. Wachs, M. Kölsch, H. Stern, and Y. Edan, “Vision-based hand-gesture applications”, Communications ofthe ACM, vol. 54, no. 2, Feb. 2011.

[34] M. Kaâniche, “Gesture recognition from video sequences”,

[35] J. R. Pansare, S. H. Gawande, and M. Ingle, “Real-Time Static Hand Gesture Recognition for American SignLanguage (ASL) in Complex Background”, Journal of Signal and Information Processing, vol. 03, no. 03, 2012.

[36] J. Wachs, H. Stern, Y. Edan, M. Gillam, C. Feied, M. Smithd, and J. Handler, “Real-Time Hand GestureInterface for Browsing Medical Images”, International Journal of Intelligent Computing in Medical Sciences &Image Processing, vol. 2, no. 1, Jan. 2008.

[37] Y. A. Yusoff, A. H. Basori, and F. Mohamed, “Interactive Hand and Arm Gesture Control for 2d MedicalImage and 3d Volumetric Medical Visualization”, Procedia - Social and Behavioral Sciences, vol. 97, Nov. 2013.

[38] Zelun Zhang and S. Poslad, “Improved Use of Foot Force Sensors and Mobile Phone GPS for MobilityActivity Recognition”, IEEE Sensors Journal, vol. 14, no. 12, Dec. 2014.

[39] J. Scott, D. Dearman, K. Yatani, and K. N. Truong, “Sensing foot gestures from the pocket”, in Proceedings of the 23rd annual ACM symposium on User interface software and technology - UIST ’10, New York, New York, USA: ACM Press, 2010.

[40] K. M. Hashem and F. Ghali, “Human Identification Using Foot Features”, International Journal of Engineering and Manufacturing, vol. 6, no. 4, Jul. 2016.

[41] Z. Lv, A. Halawani, S. Feng, H. Li, and S. U. Réhman, “Multimodal Hand and Foot Gesture Interaction forHandheld Devices”, ACM Transactions on Multimedia Computing, Communications, and Applications,vol. 11, no. 1s, Oct. 2014.

[42] F. Tafazzoli and R. Safabakhsh, “Model-based human gait recognition using leg and arm movements”,Engineering Applications of Artificial Intelligence, vol. 23, no. 8, Dec. 2010.

[43] Chin-Chun Chang and Wen-Hsiang Tsai, “Vision-based tracking and interpretation of human leg movementfor virtual reality applications”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11,no. 1, Jan. 2001.

[44] S. Nakaoka, A. Nakazawa, K. Yokoi, and K. Ikeuchi, “Leg motion primitives for a dancing humanoid robot”,in IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004, NewOrleans, LA, USA: IEEE, 2004.

[45] R. Puri and V. Jain, “Barcode Detection using OpenCV-Python”, vol. 4, no. 1,

[46] J. Gao, V. Kulkarni, H. Ranavat, L. Chang, and H. Mei, “A 2d Barcode-Based Mobile Payment System”, in2009 Third International Conference on Multimedia and Ubiquitous Engineering, Qingdao, China: IEEE, Jun.2009.

[47] S. L. Fong, D. C. W. Yung, F. Y. H. Ahmed, and A. Jamal, “Smart City Bus Application with QuickResponse (QR) Code Payment”, in Proceedings of the 2019 8th International Conference on Software andComputer Applications - ICSCA ’19, Penang, Malaysia: ACM Press, 2019.

[48] Z. Ayop, C. Yee, S. Anawar, E. Hamid, and M. Syahrul, “Location-aware Event Attendance System using QRCode and GPS Technology”, International Journal of Advanced Computer Science and Applications, vol. 9,no. 9, 2018.

[49] S. Ćuković, M. Gattullo, F. Pankratz, G. Devedžić, E. Carrabba, and K. Baizid, “Marker Based vs. NaturalFeature Tracking Augmented Reality Visualization of the 3d Foot Phantom”,

[50] P. Q. Brito and J. Stoyanova, “Marker versus Markerless Augmented Reality. Which Has More Impact onUsers?”, International Journal of Human–Computer Interaction, vol. 34, no. 9, Sep. 2018.

[51] T. Frantz, B. Jansen, J. Duerinck, and J. Vandemeulebroucke, “Augmenting Microsoft’s HoloLens withvuforia tracking for neuronavigation”, Healthcare Technology Letters, vol. 5, no. 5, Oct. 2018.

[52] S. Blanco-Pons, B. Carrión-Ruiz, M. Duong, J. Chartrand, S. Fai, and J. L. Lerma, “Augmented RealityMarkerless Multi-Image Outdoor Tracking System for the Historical Buildings on Parliament Hill”,Sustainability, vol. 11, no. 16, Aug. 2019.

[53] Z. Balint, B. Kiss, B. Magyari, and K. Simon, “Augmented reality and image recognition based framework fortreasure hunt games”, in 2012 IEEE 10th Jubilee International Symposium on Intelligent Systems andInformatics, Subotica, Serbia: IEEE, Sep. 2012.

[54] S. H. Kasaei, S. M. Kasaei, and S. A. Kasaei, “New Morphology-Based Method for RobustIranian Car PlateDetection and Recognition”, International Journal of Computer Theory and Engineering, 2010.

[55] N. Elouali, J. Rouillard, X. Le Pallec, and J.-C. Tarby, “Multimodal interaction: A survey from model drivenengineering and mobile perspectives”, Journal on Multimodal User Interfaces, vol. 7, no. 4, Dec. 2013.

[56] Cognitive Speech Services | Microsoft Azure. [Online]. Available:https://azure.microsoft.com/en-us/services/cognitive-services/speech-services/.

[57] Speech | Apple Developer Documentation. [Online]. Available:https://developer.apple.com/documentation/speech.

[58] Cortana - Your personal productivity assistant. [Online]. Available:https://www.microsoft.com/en-us/cortana.

[59] Siri - Apple. [Online]. Available: https://www.apple.com/siri/.

[60] Google Assistant is now available on Android and iPhone mobiles. [Online]. Available: https://assistant.google.com/platforms/phones/.

[61] M. F. Alghifari, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, and Z. Janin, “On the use of voice activity detection in speech emotion recognition”, vol. 8, no. 4, 2019.

[62] A. Lassalle, D. Pigat, H. O’Reilly, S. Berggen, S. Fridenson-Hayo, S. Tal, S. Elfström, A. Råde, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist, “The EU-Emotion Voice Database”, Behavior Research Methods, vol. 51, no. 2, Apr. 2019.

[63] A. Mesaros, A. Diment, B. Elizalde, T. Heittola, E. Vincent, B. Raj, and T. Virtanen, “Sound EventDetection in the DCASE 2017 Challenge”, IEEE/ACM Transactions on Audio, Speech, and LanguageProcessing, vol. 27, no. 6, Jun. 2019.

[64] S. Boutamine, D. Istrate, J. Boudy, and H. Tannous, “Smart Sound Sensor to Detect the Number of People ina Room”, in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and BiologySociety (EMBC), Berlin, Germany: IEEE, Jul. 2019.

[65] J. K. Roy, T. S. Roy, and S. C. Mukhopadhyay, “Heart Sound: Detection and Analytical Approach TowardsDiseases”, in Modern Sensing Technologies, S. C. Mukhopadhyay, K. P. Jayasundera, and O. A. Postolache,Eds., vol. 29, Cham: Springer International Publishing, 2019.

[66] J. N. Demos, Getting started with neurofeedback, 1st ed. New York: W.W. Norton, 2005.

[67] D. P. Subha, P. K. Joseph, R. Acharya U, and C. M. Lim, “EEG Signal Analysis: A Survey”, Journal of Medical Systems, vol. 34, no. 2, Apr. 2010.

[68] M. R. Lakshmi, V. P. T, and C. P. V, “Survey on EEG signal processing methods”, International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 1, 2014.

[69] J. K. Zao, T.-P. Jung, H.-M. Chang, T.-T. Gan, Y.-T. Wang, Y.-P. Lin, W.-H. Liu, G.-Y. Zheng, C.-K. Lin, C.-H. Lin, Y.-Y. Chien, F.-C. Lin, Y.-P. Huang, S. J. Rodríguez Méndez, and F. A. Medeiros, “Augmenting VR/AR Applications with EEG/EOG Monitoring and Oculo-Vestibular Recoupling”, in Foundations of Augmented Cognition: Neuroergonomics and Operational Neuroscience, D. D. Schmorrow and C. M. Fidopiastis, Eds., vol. 9743, Cham: Springer International Publishing, 2016.

[70] P. Gang, J. Hui, S. Stirenko, Y. Gordienko, T. Shemsedinov, O. Alienin, Y. Kochura, N. Gordienko, A. Rojbi,J. R. López Benito, and E. Artetxe González, “User-Driven Intelligent Interface on the Basis of MultimodalAugmented Reality and Brain-Computer Interaction for People with Functional Disabilities”, in Advances inInformation and Communication Networks, K. Arai, S. Kapoor, and R. Bhatia, Eds., vol. 886, Cham:Springer International Publishing, 2019.

[71] P. Pelegris, K. Banitsas, T. Orbach, and K. Marias, “A novel method to detect Heart Beat Rate using amobile phone”, in 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology,Buenos Aires: IEEE, Aug. 2010.
