• Article •

Multimodal interaction design and application in augmented reality for chemical experiment

Mengting XIAO1,2, Zhiquan FENG1,2*, Xiaohui YANG1,2, Tao XU1,2, Qingbei GUO1,2

1. Department of Information Science and Engineering, University of Jinan, Jinan 250022, China

2. Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan 250022, China

* Corresponding author, [email protected]

Supported by the National Key R&D Program of China (No. 2018YFB1004901) and the Independent Innovation Team Project of Jinan City (No. 2019GXRC013).

Abstract Background The augmented reality (AR) classroom is becoming a research hotspot in education, but current systems have two limitations. First, most researchers use marker cards to operate experiments, and handling a large number of cards is difficult and inconvenient for users. Second, most users operate experiments in the visual modality only, and such single-modal interaction greatly reduces the user’s real sense of interaction. To solve these problems, this paper proposes a multimodal interaction method (ARGEV) based on vision and touch in AR, and designs a Virtual and Real Fusion Interactive Tool Suite (VRFITS) that combines gesture recognition with intelligent equipment. Methods The ARGEV method fuses gestures, intelligent equipment and virtual models. We use a gesture recognition model trained with a convolutional neural network to recognize gestures in AR, and trigger vibration feedback when the five-finger grab gesture is recognized. We establish a coordinate mapping between the real hand and the virtual model to fuse gestures with the virtual model. Results The average gesture recognition accuracy is 99.04%. We verify and apply VRFITS in the Augmented Reality Chemistry Lab (ARCL); the overall operation load of ARCL is 27.42% lower than that of a traditional simulation-based virtual experiment. Conclusions We achieve real-time fusion and interaction of gestures, virtual models and intelligent equipment in ARCL. Compared with the simulation-based virtual experiment, ARCL improves the user’s real sense of operation and interaction efficiency.

Keywords Augmented reality; Gesture recognition; Intelligent equipment; Multimodal Interaction;

Augmented Reality Chemistry Lab

1 Introduction

Virtual experiment teaching is an important field of information intelligence [1] and an important research area of Human-Computer Interaction (HCI). Virtual teaching is gradually adopting Augmented Reality (AR) technology on top of simulation technology. This moves the experience from two-dimensional to three-dimensional (3D) space, which helps raise students’ interest and enriches their sense of experience [2]. AR technology includes Simultaneous Localization and Mapping (SLAM) [3], card marker recognition [4] and gesture interaction [5], among others. In education, most AR research runs on mobile phones, iPads or computers and does not require a Head Mounted Display (HMD), which makes the systems easy to use. Sun Chenxi [6] used the Vuforia SDK to identify marker cards and built an AR learning environment with sound and animation, gesture interaction, particle effects, real-time color mapping and game interaction; the work showed that learning with AR helps stimulate users’ imagination. Fidan et al. [7] developed the FenAR software, based on marker-based AR, to support learning activities in the classroom; their study showed that incorporating AR into learning activities can improve students’ academic performance. However, users must handle multiple cards in a single experiment. On the one hand, the operation is complicated. On the other hand, users can interact with the virtual model only through external objects, so direct interactivity is lost. This matters especially for middle school chemistry experiments, which students cannot learn and understand without actual hands-on operation.

To strengthen users’ hands-on operation, AR technologies based on gesture recognition have emerged. Dave et al. [8] combined AR and gestures, using adaptive hand segmentation and pose estimation to recognize gestures in virtual experiments, so that users could simulate experiments with their real hands. İbili et al. [9] developed an AR geometry teaching system with gesture operations and verified that individual differences in the 3D thinking skills of middle school students, and a personalized dynamic intelligent learning environment, are particularly important. Cao et al. [10] proposed an interactive AR system that enables students to manipulate 3D objects naturally and directly through gestures. These studies show that systems based on gesture and AR interaction are easy to use and attractive. However, they rely on single-modal gesture interaction, which makes the user’s operation load too heavy and the interaction efficiency too low. For example, if a chemical experiment uses gesture interaction alone, many gesture types are needed and similar gestures are easily confused, which leads to inaccurate recognition, a heavy operation load and reduced interaction efficiency.

Gesture recognition is an essential part of achieving AR effects. Gesture recognition methods include deep learning [11,12], Hidden Markov Models (HMM) [13,14] and geometric features [15,16], all of which are used to transfer information between gesture and vision. Wu D et al. [17] fused a Deep Belief Network (DBN) with a 3D Convolutional Neural Network (3DCNN) and fed the gesture classification probabilities into an HMM to complete gesture recognition. Elmezain et al. [18] used an HMM to recognize the dynamic trajectory of a gesture. Priyal et al. [19] used matrix feature normalization to identify the geometry of a gesture. Liang H et al. [20] used a random forest to classify gestures and applied the gestures to 3D virtual objects, achieving a seamless fusion of real scenes and virtual objects. This research shows that deep learning methods perform best in gesture recognition owing to their strong fitting ability, among other advantages, but their recognition efficiency in practical applications is relatively low.

In virtual experiments, the virtual effects presented by AR technology provide a strong sense of immersion, and the fusion of physical and virtual objects together with multimodal interaction can improve interactivity. However, in existing research the experiments are not easy to operate, gestures are easily confused during recognition, and the operation load is too heavy. To address these challenges together, our work makes the following contributions:

1. We use gesture recognition instead of card marker recognition in AR and propose a multimodal interaction method (ARGEV) based on gestures and sensors. Using Kinect, the method reduces the complex AR problem to a simple coordinate transformation and fuses real hands, virtual scenes and sensors. It enables real-time interaction between gestures and virtual objects and improves the user’s interaction efficiency;

2. We combine gestures and intelligent equipment to design a virtual and real fusion interaction tool suite (VRFITS). It handles the interaction between physical objects and the virtual model and triggers perceptual feedback when gestures interact with the virtual model, enhancing the user’s sense of real operation;

3. We design the Augmented Reality Chemistry Lab (ARCL), which is operated through the interaction of real hands, intelligent equipment and virtual models.

2 Multimodal interaction method based on AR

VRFITS comprises a piece of intelligent equipment and a multimodal interaction method that together cover the visual and tactile modalities. The intelligent equipment enhances the realism of user operations, while the gesture interaction method identifies the user’s gestures and then triggers vibration feedback on the intelligent equipment.

2.1 The overall framework

The overall framework includes three stages: model establishment, gesture recognition and interaction, and system application (Figure 1). In the model establishment stage, we first preprocess the gesture depth maps and then train the gesture recognition model with a CNN. In the gesture recognition and interaction stage, we feed the gesture depth map into the trained model to obtain the recognition result, and keep the virtual and real scenes consistent through coordinate binding. The fusion of gesture and virtual model triggers vibration of the intelligent equipment. In the system application stage, we implement the ARCL.

Figure 1 The overall framework of multimodal interaction method. [Diagram blocks: gesture depth map, preprocessing, training and testing of the gesture model, recognition result; gesture recognition on the depth maps of frames n-1 and n; camera calibration and coordinate calibration of virtual and real fusion; virtual modeling; intelligent equipment; vibration feedback and visual effects; multimodal interaction; Augmented Reality Chemistry Lab.]

2.2 Design and interaction method for intelligent equipment

Intelligent sensing equipment is becoming increasingly abundant, for example somatosensory devices, Google Glass, Microsoft HoloLens and Kinect, and users can use such devices for learning or home entertainment. Researchers can also build new work on these devices, but they are expensive, which is unrealistic for the large number of classrooms in central and western China. In addition, many pieces of equipment used in real experiments cannot be reused, which is wasteful. Therefore, this work designs a piece of intelligent equipment built from inexpensive sensors.

Figure 2 Design structure of intelligent equipment. [Labelled parts: touch sensors 1 and 2, and the intelligent ring with its vibrator.]

In the intelligent equipment (Figure 2), sensors detect changes in external signals. Each sensor’s signal output is connected to an IO port of the STM32103 main control chip, which forwards the signal over a serial link to the computer through the EIA-RS232 module. The serial data are then read in the Unity 3D platform. To drive the vibrator, Unity 3D sends data through the serial port to the STM32103 main control chip, which controls the vibration of the vibrator.

The intelligent equipment includes an intelligent ring and two touch sensors, and costs approximately 30 yuan. A vibrator is mounted on the intelligent ring. During the experiment, the ring is worn on the little thumb finger of the right hand. When the user grabs a virtual object by hand, the ring vibrates. The two perceptual processes (in no particular order) are as follows:

1. If the five-finger grab gesture is recognized, the system sends “00” to the serial port, and the information is transmitted to the STM32103 main control chip through the serial port. The vibrator is set to vibrate for 5 seconds. If the gesture is not the five-finger grab, there is no vibration;

2. When the hand touches a touch sensor, the perceived touch intensity is measured; the average touch intensity is obtained beforehand through repeated experiments. If the touch intensity is greater than the average touch intensity, the system receives the signal Noe, whose values “1” and “2” indicate the start-experiment and end-experiment buttons, respectively. If no signal is received, no semantics are expressed.

The intelligent equipment is suitable for virtual experimental scenes (AR and VR) that support gesture recognition, where it combines gesture behavior with the vibrator to trigger vibration feedback.
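The two perceptual rules of the intelligent equipment can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' firmware or Unity code: the function names, the gesture label strings and the intensity values are hypothetical, and the actual serial transport is left out. Only the “00” payload, the 5-second duration and the meaning of Noe values 1 and 2 come from the text.

```python
from typing import Optional

GRAB_COMMAND = "00"      # payload the paper sends to the microcontroller
VIBRATION_SECONDS = 5    # vibration duration stated in the paper


def vibration_command(gesture: str) -> Optional[str]:
    """Return the serial payload for a recognized gesture, or None for no vibration.

    The gesture label "five_fingers_grab" is a hypothetical name for gesture 1.
    """
    return GRAB_COMMAND if gesture == "five_fingers_grab" else None


def touch_signal(sensor_id: int, intensity: float, mean_intensity: float) -> Optional[int]:
    """Map a touch on sensor 1 or 2 to the Noe signal (1 = start, 2 = end).

    A touch counts only if it is stronger than the average intensity measured
    beforehand; otherwise no signal (None) is produced.
    """
    if sensor_id not in (1, 2):
        raise ValueError("sensor_id must be 1 or 2")
    return sensor_id if intensity > mean_intensity else None
```

For instance, `touch_signal(1, 0.9, 0.5)` yields the start signal, while a weak touch such as `touch_signal(2, 0.3, 0.5)` yields no signal.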

2.3 Gesture recognition method

2.3.1 Gesture data preprocessing

For the virtual experiment system, taking chemistry experiments as an example, we surveyed the natural and common gestures that students and teachers use when operating experiments, and designed six gestures accordingly. First, we use Kinect to collect depth maps of the human body, 10,000 for each gesture type. Then, we obtain the depth information and the centroid coordinates of the hand from Kinect and use them to segment the gesture depth map from the collected body image. A distance of 3 cm from the centroid is used as the threshold: points farther than the threshold are treated as outside the hand area, and the remaining hand area is cropped to obtain a gesture depth map of 200 × 200 pixels. To keep operation simple, similar gestures are assigned the same semantics, for example the two-fingers-spread and three-fingers-stretch gestures. The six preprocessed gesture depth maps and their definitions are shown in Table 1.
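The thresholding and cropping step can be illustrated with a minimal pure-Python sketch. It assumes that the 3 cm threshold is applied to the depth difference from the hand centroid and that the depth image is a row-major list of values in millimetres; both are assumptions, as are the function and parameter names. The paper itself obtains depth and centroid data from Kinect.

```python
from typing import List

Depth = List[List[float]]  # depth image in millimetres, row-major (assumed layout)


def segment_hand(depth: Depth, cx: int, cy: int,
                 threshold_mm: float = 30.0, size: int = 200) -> Depth:
    """Crop a size x size window centred on (cx, cy) and zero out far pixels."""
    centre_depth = depth[cy][cx]
    half = size // 2
    height, width = len(depth), len(depth[0])
    patch = [[0.0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            y, x = cy - half + row, cx - half + col
            if 0 <= y < height and 0 <= x < width:
                value = depth[y][x]
                # Keep only pixels whose depth is within the threshold of the centroid.
                if abs(value - centre_depth) <= threshold_mm:
                    patch[row][col] = value
    return patch
```

Pixels outside the image or beyond the threshold are left at zero, so the output always has the fixed size expected by the network input.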

Table 1 Definition of six static gestures

| Number | Name | Depth map | Gesture state | Presentation semantics |
| --- | --- | --- | --- | --- |
| 1 | Five fingers grab | (image) | Five fingers closed into a fist | Grab |
| 2 | One finger stretch | (image) | Forefinger extended | Click |
| 3 | Two fingers spread | (image) | Thumb and index finger open | Amplify |
| 4 | Two fingers stretch | (image) | Forefinger and middle finger opened like scissors | Whirl |
| 5 | Three fingers stretch | (image) | Thumb, index and middle fingers open | Amplify |
| 6 | Five fingers spread | (image) | Five fingers open | Put down |

2.3.2 Gesture recognition method based on CNN

First, we build the AlexNet network structure model. The AlexNet convolutional neural network has the advantage of learning richer and higher-dimensional image features. AlexNet uses dropout (random deactivation) and data augmentation to suppress overfitting effectively, and uses the ReLU function instead of the earlier Sigmoid as the activation function. Therefore, for the six kinds of static gesture depth maps, this paper trains a gesture recognition model based on the AlexNet structure. The network includes five convolutional layers, three pooling layers, three fully connected layers and a softmax classification function. We use 10,000 depth maps for each gesture, each initially set to 200 × 200 pixels, and obtain the gesture recognition model by training the AlexNet network. The network structure is shown in Figure 3.

Figure 3 AlexNet structure. [Diagram labels: input 200×200×3; feature maps 200×200×64, 200×200×64, 100×100×64, 100×100×128, 50×50×128, 50×50×256, 50×50×256, 50×50×128, 25×25×128; 3×3 kernels with stride S=1 or S=2; three MAX-POOL layers; FC outputs 1×1×1024 and 1×1×6; softmax.]

For the AlexNet structure, the number of iterations (epochs) is set to 20000, the batch size for both training and testing is 20, the padding type is SAME, the dropout value is 0.8, and the generalization performance is evaluated every 20 depth maps. The stride and convolution kernel size of each layer are shown in Figure 3. The gesture recognition model is trained as follows:

1. Use Kinect to obtain the depth information and depth map of the human skeleton joints, and determine the hand area according to the threshold value to generate a gesture depth map;

2. Divide the data set into a training set and a test set with a ratio of 7:3;

3. Input the training set into AlexNet, which extracts the gesture depth features while the weights are continuously updated. The output of each layer is computed as:

$$x_i^m = f\left(\sum_{j=1}^{n} x_j^{m-1} \cdot w_{ij}^m + b_i^m\right) \qquad (1)$$

where $f$ is the activation function, $m$ is the index of the current layer, $n$ is the number of neurons in layer $m-1$, $w_{ij}^m$ is the connection weight between neuron $j$ in layer $m-1$ and neuron $i$ in layer $m$, and $b_i^m$ is the bias of the $i$-th feature after $m$ convolution layers;

4. The Softmax layer takes a vector $v$ with one element per gesture type and converts it into prediction probabilities. For each type, the probability is calculated as follows:

$$P_l = \frac{e^{v_l}}{\sum_{k} e^{v_k}} \qquad (2)$$

where $v_l$ is the $l$-th element of the vector $v$, $P_l$ is the predicted probability of the $l$-th gesture type, and $l \in (0, 6]$;

Finally, we obtain the trained gesture recognition model (AlexNet_gesture), which is then encapsulated and imported into ARCL.
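As a small numerical illustration of Eq. (1) and Eq. (2), the sketch below computes one fully connected layer and a softmax over six scores in plain Python. It assumes ReLU as the activation $f$ (the text names ReLU as AlexNet's activation); the function names and the sample numbers are hypothetical and are not taken from the trained AlexNet_gesture model.

```python
import math
from typing import List, Sequence


def relu(value: float) -> float:
    """ReLU activation, used here as f in Eq. (1)."""
    return max(0.0, value)


def dense_layer(inputs: Sequence[float],
                weights: Sequence[Sequence[float]],
                biases: Sequence[float]) -> List[float]:
    """Eq. (1): x_i = f(sum_j x_j * w_ij + b_i), one output per weight row."""
    return [relu(sum(x * w for x, w in zip(inputs, row)) + b)
            for row, b in zip(weights, biases)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Eq. (2): P_l = exp(v_l) / sum_k exp(v_k), shifted by max(v) for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum score before exponentiation does not change the result of Eq. (2) but avoids overflow for large scores. The predicted gesture is the index of the largest probability.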

2.4 Multimodal interaction method

First, we use the Kinect RGB sensor to capture the real environment, while the Kinect depth camera captures the depth map of the hand. The AR effect is realized by establishing a coordinate calibration between real space and virtual space in Unity 3D, and the ARGEV method achieves fused tactile and visual interaction. During the experiment, the user observes the experimental phenomena and the AR scene directly on the computer screen, without wearing a head-mounted display (HMD). The user sits in front of the computer with the intelligent equipment placed beside it. When the user makes the five-finger grab gesture, vibration feedback is triggered. The process is shown in Figure 4.

Figure 4 Multimodal interaction process in AR. [Diagram labels: RGB sensor and depth sensor with X, Y, Z axes; RGB information; depth information; depth feature data; recognition; gesture operation; animation; vibration feedback; multimodal interaction; ARCL.]

First, we encapsulate the AlexNet_gesture model and define the right-hand gesture set $Ges\_R$ and the left-hand gesture set $Ges\_L$:

$$Ges\_R \in \{R_1, R_2, R_3, R_4, R_5, R_6\} \qquad (3)$$

$$Ges\_L \in \{L_1, L_2, L_3, L_4, L_5, L_6\} \qquad (4)$$

where $R_1$ to $R_6$ and $L_1$ to $L_6$ correspond to the gesture numbers in Table 1. Then, we call AlexNet_gesture in ARCL to establish the consistency of hand coordinates and virtual coordinates. Let

$$\theta = (k_x, k_y, k_z) \qquad (5)$$

be the 3D depth coordinate under Kinect. According to the mapping between the hand joint point coordinates in real space and the 3D depth coordinates, the mapping relationship between the joint point coordinates and the virtual scene is determined as:

$$\begin{bmatrix} k_x \\ k_y \\ k_z \end{bmatrix} = t \begin{bmatrix} U_x \\ U_y \\ U_z \end{bmatrix} + \begin{bmatrix} d_x \\ d_y \\ d_z \end{bmatrix} \qquad (6)$$

where $(U_x, U_y, U_z)$ is the virtual scene coordinate in Unity, $t$ is the scale ratio between the 3D coordinates of the real-world scene and the virtual scene, and $(d_x, d_y, d_z)$ is the intercept (offset) in the virtual scene coordinates.
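The linear mapping between Kinect depth coordinates and Unity scene coordinates, together with its inverse, can be sketched as follows. The function names and the sample values of the ratio $t$ and the intercept $(d_x, d_y, d_z)$ are hypothetical; in practice they would come from the calibration step, which the paper does not detail.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def unity_to_kinect(u: Vec3, t: float, d: Vec3) -> Vec3:
    """Forward mapping k = t * U + d, applied per axis."""
    return tuple(t * ui + di for ui, di in zip(u, d))


def kinect_to_unity(k: Vec3, t: float, d: Vec3) -> Vec3:
    """Inverse mapping U = (k - d) / t, used to place the virtual model at the hand."""
    if t == 0:
        raise ValueError("scale t must be non-zero")
    return tuple((ki - di) / t for ki, di in zip(k, d))
```

With an assumed ratio `t = 2.0` and intercept `d = (0.5, 0.0, -1.0)`, the Unity point `(1.0, 2.0, 3.0)` maps to the Kinect point `(2.5, 4.0, 5.0)`, and the inverse mapping recovers the original point.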

In the ARGEV algorithm, the inputs are gesture depth maps and sensor signals; the algorithm performs multimodal interaction and outputs vibration feedback and visual effects. Visual effects include designed animations, particle effects, and the pouring effect of virtual beakers in Unity 3D. The gesture interaction algorithm is as follows:

Algorithm 1: Multimodal Interaction Algorithm based on AR (ARGEV algorithm)

Input: gesture depth map, sensor signal Noe;

Output: vibration feedback, visual effects;

1. Use the Kinect depth camera to obtain the gesture depth map of frame (n-1) and input it into the AlexNet_gesture model for gesture recognition;

2. Obtain the gesture depth map of frame n, and record the joint point coordinates $S_{n-1}(\theta_{n-1})$ and $S_n(\theta_n)$ of frames (n-1) and n;

3. if (Noe = 1) then the experiment begins, and virtual equipment appears in the scene end if

4. if (Noe = 2) then the experiment ends end if

5. if ($S_{n-1}(\theta_{n-1}) = S_n(\theta_n)$) then

    while ($Ges\_R \cup Ges\_L \neq \emptyset$) do

        if (R1) then set $Pis\_v$ as the 3D coordinate of the virtual model, $S_n(\theta_n) = Pis\_v$, and send the “00” data to the microcontroller to trigger the intelligent ring vibration end if

        if (L1) then the prompt box of the selected experimental equipment appears on the system interface end if

        if (R4) then the current virtual equipment is poured end if

        if (R6) then drop the current virtual equipment end if

    end while

end if

6. if ($S_{n-1}(\theta_{n-1}) \neq S_n(\theta_n)$) then return to step 1 end if
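The branching of Algorithm 1 can be made concrete with a short Python sketch that evaluates one pass and returns the actions it would trigger. This is a simplified reading, not the authors' Unity implementation: the inner while loop is collapsed into a single pass, gestures are passed as the integer numbers of Table 1, and all function and action names are hypothetical.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def argev_step(noe: Optional[int], prev_joint: Vec3, joint: Vec3,
               right: Optional[int], left: Optional[int]) -> List[str]:
    """Return the actions triggered by one pass over frames n-1 and n."""
    actions: List[str] = []
    # Steps 3-4: the touch sensors start or end the experiment.
    if noe == 1:
        actions.append("start_experiment")
    elif noe == 2:
        actions.append("end_experiment")
    # Step 6: joint coordinates changed between frames, so read the next frame.
    if prev_joint != joint:
        actions.append("read_next_frame")
        return actions
    # Step 5: joint coordinates are stable, so dispatch on the recognized gestures.
    if right == 1:
        actions += ["bind_model_to_hand", "send_00_vibrate"]
    if left == 1:
        actions.append("show_prompt_box")
    if right == 4:
        actions.append("pour_equipment")
    if right == 6:
        actions.append("drop_equipment")
    return actions
```

Returning a list of action names keeps the decision logic separate from rendering and serial output, which makes the branching easy to check in isolation.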

3 Experimental results and analysis

3.1 The AlexNet_gesture model training results

During training, the accuracy and loss values are recorded every 20 iterations and visualized with TensorBoard. The accuracy and loss curves of the AlexNet_gesture model are shown in Figure 5.

Figure 5 Changes in accuracy and Loss values during training: (a) the accuracy rate change; (b) the Loss change.

In Figure 5, the accuracy during training gradually approaches 1, and the loss decreases from a large value, stabilizes and approaches 0, which indicates that the model is being optimized effectively during training.

3.2 Comparative experiment

We design two sets of comparative experiments. The first compares the accuracy of each gesture before and after model optimization, and the second compares the training results of the AlexNet, GoogleNet and VGG16Net models. We use the preprocessed gesture depth maps and 3000 test pictures. The experimental results are shown in Figure 6.

Figure 6 Comparison of experimental results before and after model optimization: (a) the first set of experimental results; (b) the second set of experimental results.

The average gesture accuracy after optimization is 99.04%, about 2% higher than before optimization. The recognition of the similar gestures 2 and 3 is also better than before optimization. The gesture recognition model trained with AlexNet is more accurate than the other two network models both before and after optimization, and the optimized models improve by about 1%-3%.

3.3 Effectiveness of the ARGEV algorithm

We verify the effectiveness of the ARGEV algorithm by checking whether the coordinates of the gesture and of the virtual model are consistent. Over the same time period, while the user makes the five-finger grab gesture, we record the 3D coordinates of the hand and of the virtual model.

Figure 7 Comparison of 3D coordinates of real hand and virtual model: (a) coordinates of the real hand; (b) coordinates of the virtual model.

In Figure 7, the 3D coordinates of the gesture trajectories are labelled with different colors. The 3D coordinates in Figures 7(a) and 7(b) over the same time period are equal, which shows that the gesture interaction algorithm is effective.

3.4 Application of VRFITS in augmented reality chemistry lab

We build an interactive virtual and real fusion environment from intelligent equipment, real hands and virtual models. For hardware, we use an Intel® Core™ i7-8550 CPU, Kinect 3.0 and the intelligent equipment. For software, we set up the experimental scenarios in the Unity 3D platform with C# as the programming language. Finally, we apply the ARGEV algorithm and the intelligent equipment in ARCL.

VRFITS can be used repeatedly, and it helps students avoid the dangers of operating experiments alone. We also want to help students focus on operating, observing and learning from chemical experiments, and to address problems such as difficult experiments, fear of experiments and a lack of experimental reagents. We therefore design ARCL and apply VRFITS in it. We validate ARGEV with an example chemistry experiment in ARCL, the reaction of sodium with water. In this experiment the user performs only grabbing, putting down and pouring actions. Therefore, to simplify operation for students in ARCL, we choose three gesture types for experimental verification. The main flowchart of ARCL is shown in Figure 8.

Figure 8 The flowchart of ARCL. [Flowchart labels: Start; Kinect initialization; tactile initialization; Touch 1 (Y/N); get joint point type (Left, Right); L1: virtual prompt box; R1: grab, select, move virtual models, with vibration feedback; R4: rotate the virtual model; R6: drop virtual model; Touch 2 (Y/N); End.]

In middle school chemistry classes, the reaction of sodium with water is one of the main experiments. However, a moderate amount of sodium reacts with water to generate gas, and a large amount reacts explosively, which makes the experiment difficult to operate and observe. To let students experience the experimental process better, this paper takes the sodium and water reaction as an example to present the experimental mechanism in ARCL. In the virtual scene, we add prompt windows, virtual experimental equipment, particle effects and animation effects to make the experiment more immersive. The main effects of the experimental operation are shown below.

Figure 9 The effect of the ARCL operation one (panels a-c).

In Figure 9, the red boxes indicate operation prompts and scene effects, and the yellow boxes indicate user operations. After the user touches the start key, the system presents the AR scene. In the prompt box, the user selects the experimental equipment with the five-finger grab gesture, and the intelligent ring vibrates (a). The user pours the virtual beaker (b) and then takes out the virtual equipment needed for the experiment.

Figure 10 The effect of the ARCL operation two (panels a-f).

In Figure 10, the user cuts the sodium block with a virtual knife (a), then picks up a small piece of sodium with tweezers and puts it into a beaker filled with water (b). The user can observe five phenomena of the sodium and water reaction, and a real video is added to verify their authenticity (c). The user selects the beaker again (d), puts it on the table and uses tweezers to put a large piece of sodium into the beaker, which produces an explosion scene (e). Finally, the user ends the experiment by touching the end key (f).

3.5 User evaluation comparison

We compare the sodium and water experiment on the NOBOOK virtual experiment platform [21] with the same experiment in ARCL on the basis of user evaluation (Figure 11). The NOBOOK experiment is a virtual reality simulation that uses a mouse as the input device.

Figure 11 The NOBOOK virtual experiment platform (a) and the ARCL (b).

We invited 10 teachers and 30 students to complete the evaluation, using the NOBOOK experiment first and then ARCL. To verify that the experimental system suits teaching applications, we define seven comparison aspects, “teaching evaluation”, “experimental interest”, “experimental interaction”, “learning effect”, “system stability”, “experimental hints” and “ease of operation”, collectively denoted VET_P and labelled VET_P1 to VET_P7 in that order. After the operation, the 10 teachers compared the two systems on each VET_P aspect using a five-level score, where a higher level means a better rating. We denote the NOBOOK virtual experiment platform as A and ARCL as B, and use ANOVA to assess the significance of each factor (Table 2). The significance level 𝛼 is 0.05.

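Table 2 ANOVA statistical result

| VET_P | Experiment platform | Average value | Variance | SS | MS | F | P-value | F crit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VET_P1 | A | 2.2 | 0.844 | 18.05 | 18.05 | 25.992 | 7.51 | 4.419 |
| VET_P1 | B | 4.1 | 0.544 | | | | | |
| VET_P2 | A | 2.8 | 0.622 | 11.25 | 11.25 | 26.299 | 7.04 | 4.419 |
| VET_P2 | B | 4.3 | 0.233 | | | | | |
| VET_P3 | A | 2.7 | 0.456 | 8.45 | 8.45 | 10.787 | 0.004 | 4.419 |
| VET_P3 | B | 4 | 1.111 | | | | | |
| VET_P4 | A | 1.9 | 0.767 | 20 | 20 | 26.087 | 7.36 | 4.419 |
| VET_P4 | B | 3.9 | 0.767 | | | | | |
| VET_P5 | A | 4.2 | 0.711 | 0.2 | 0.2 | 0.409 | 0.530 | 4.419 |
| VET_P5 | B | 4.6 | 0.267 | | | | | |
| VET_P6 | A | 1.2 | 0.177 | 28.8 | 28.8 | 51.84 | 1.06 | 4.419 |
| VET_P6 | B | 3.6 | 0.933 | | | | | |
| VET_P7 | A | 2.7 | 0.455 | 5 | 5 | 14.516 | 0.012 | 4.419 |
| VET_P7 | B | 3.7 | 0.233 | | | | | |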

Figure 12 The VET_P comparison score.
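The F values in Table 2 can be checked from the reported group means and variances. The sketch below computes a one-way ANOVA F statistic under the assumptions that each platform was scored by the same 10 teachers' worth of ratings (equal group size n = 10) and that the reported variances are sample variances; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def one_way_anova_f(means: Sequence[float], variances: Sequence[float],
                    n: int) -> Tuple[float, float]:
    """Return (SS_between, F) for groups of equal size n.

    SS_between = n * sum((mean_g - grand_mean)^2); MS_within is the average
    of the group sample variances when all groups have the same size.
    """
    k = len(means)
    grand_mean = sum(means) / k
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ms_between = ss_between / (k - 1)
    ms_within = sum(variances) / k
    return ss_between, ms_between / ms_within
```

For VET_P1, `one_way_anova_f([2.2, 4.1], [0.844, 0.544], 10)` gives a between-group sum of squares of 18.05 and an F value of about 26.0, in line with the table entry of 25.992 (the small gap comes from the rounded variances).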

Teachers generally found the results of both experiments satisfactory: they could operate the experiments correctly and observe the experimental phenomena. However, their evaluations of the two systems differ considerably; the average score of ARCL is 29.42% higher than that of the NOBOOK experiment. For VET_P5 the evaluations differ only slightly, and F is less than F crit, which shows that both systems are relatively stable. On VET_P1, VET_P3 and VET_P7, the overall evaluation of ARCL is 28% higher than that of NOBOOK, which indicates that ARCL is simpler to operate and more convenient for teaching. On VET_P4 and VET_P6, ARCL scores 40% and 50% higher than NOBOOK, respectively, which further indicates that users of ARCL with multimodal interaction are more immersed in the experiments and learn more effectively. Regarding VET_P6, the NOBOOK experiment requires exploration, whereas ARCL can be operated by following the prompt box. Even an operator unfamiliar with the virtual experimental environment does not waste much time during the experiment, which improves the efficiency of experimental interaction.

In addition, we evaluate the operation load of the system with the National Aeronautics and Space Administration Task Load Index (NASA-TLX) [22] cognitive load assessment. We use the NASA-TLX criteria “mental demand (MD)”, “physical demand (PD)”, “temporal demand (TD)”, “effort (E)”, “performance (P)” and “frustration level (FD)” to score the two systems. The 40 operators used NOBOOK and ARCL separately, and the average scores were computed. As in the VET_P assessment, scores use a five-point scale. The NASA-TLX evaluation of the two systems is shown in Figure 13.

Figure 13 NASA-TLX model evaluation results.

The two systems differ only slightly on MD, which means that users expend little mental effort in either experiment. On the other five indicators, however, the scores of ARCL are significantly lower than those of NOBOOK. The overall cognitive operation load of ARCL is reduced by 27.42%, which indicates that VRFITS improves interaction efficiency. The virtual and real fusion interaction improves the user’s interaction with the virtual model and the immersion of the experiment.

4 Conclusion and discussion

This paper proposes and designs VRFITS, which comprises a piece of intelligent equipment and a gesture interaction method; the suite is suitable for any AR experiment that involves gesture operations. We combine gestures, sensors and virtual models in AR. The intelligent equipment and the gesture interaction method complement each other, with gestures triggering vibration feedback. Finally, we design and implement a prototype system, ARCL. According to user evaluation, compared with NOBOOK, ARCL increases the interactivity and real sense of operation in virtual experiments, reduces the user’s operation load and improves the user’s interaction efficiency. In addition, compared with AR card recognition based on the Vuforia SDK, ARCL dispenses with multiple card operations and uses only gesture commands to trigger different virtual models, which makes operation more convenient and effective.

However, this work has limitations. On the one hand, only a few gesture types are recognized, so the gesture vocabulary available to users during virtual experiments is limited. On the other hand, the particle effects, animation effects and virtual model rendering in the experimental scene of the virtual chemistry experiment system are not prominent, and the interface of the system should be improved in future work.

Acknowledgments

This research was supported in part by the National Key R&D Program of China under Grant

No. 2018YFB1004901 and by the Independent Innovation Team Project of Jinan City (No. 2019GXRC013).

We thank everyone who helped with this work and acknowledge the valuable suggestions from the

reviewers.

