• Article •
Multimodal interaction design and application in
augmented reality for chemical experiment
Mengting XIAO1,2, Zhiquan FENG1,2*, Xiaohui YANG1,2, Tao XU1,2, Qingbei GUO1,2
1. Department of Information Science and Engineering, University of Jinan, Jinan 250022, China
2. Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan 250022, China
* Corresponding author, [email protected]
Supported by the National Key R&D Program of China (No. 2018YFB1004901) and the Independent Innovation Team
Project of Jinan City (No. 2019GXRC013).
Abstract Background The augmented reality classroom has become a research hotspot in education, but current systems have two limitations. First, most researchers use marker cards to operate experiments, and a large number of cards is difficult and inconvenient for users. Second, most users operate experiments only in the visual modality, and such single-modal interaction greatly reduces the user’s real sense of interaction. To solve these problems, this paper proposes a multimodal interaction method (ARGEV) based on vision and touch in augmented reality, and designs a Virtual and Real Fusion Interactive Tool Suite (VRFITS) with gesture recognition and intelligent equipment. Methods The ARGEV method fuses gestures, intelligent equipment and virtual models. We use a gesture recognition model trained with a convolutional neural network to recognize gestures in AR, and trigger vibration feedback after a five-finger grasp gesture is recognized. We establish a coordinate mapping between the real hand and the virtual model to fuse the gesture with the virtual model. Results The average gesture recognition accuracy is 99.04%. We verify and apply the VRFITS in the Augmented Reality Chemistry Lab (ARCL), and the overall operation load of ARCL is 27.42% lower than that of a traditional simulated virtual experiment. Conclusions We achieve real-time fusion and interaction of gestures, virtual models and intelligent equipment in ARCL. Compared with the simulated virtual experiment, ARCL improves the user’s real sense of operation and interaction efficiency.
Keywords Augmented reality; Gesture recognition; Intelligent equipment; Multimodal Interaction;
Augmented Reality Chemistry Lab
1 Introduction
Virtual experiment teaching is an important field of information intelligence [1] and an important research area of Human-Computer Interaction (HCI). Virtual teaching methods are gradually adopting Augmented Reality (AR) technology on top of simulation technology, which moves the experience from two-dimensional to three-dimensional (3D) space and helps to raise students’ interest and enrich their sense of experience [2]. AR technologies include Simultaneous Localization and Mapping (SLAM) [3], card marker recognition [4] and gesture interaction [5]. In education, most AR research is based on mobile phones, iPads or computers and does not require a Head Mounted Display (HMD), which keeps operation user-friendly. Sun Chenxi [6] used the Vuforia SDK to identify marked cards and constructed an AR learning environment with a sound and animation system, gesture interaction, particle effects, real-time color mapping and game interaction functions; the work showed that learning in AR form helps to increase users’ imagination. Fidan et al. [7] developed the FenAR software based on marker-based AR technology to support learning activities in the classroom. Studies have shown that incorporating AR into learning activities can improve students’ academic performance. However, users must operate multiple cards in one experiment. On the one hand, the operation is complicated; on the other hand, users can interact with the virtual model only by means of external objects, so interactivity is neglected. This matters especially for middle school students studying chemical experiments, which cannot be understood without actual hands-on operation.
To strengthen users’ hands-on operation, AR technologies based on gesture recognition have emerged. Dave et al. [8] combined AR and gestures, using adaptive hand segmentation and pose estimation to recognize gestures in virtual experiments so that users could simulate experiments with their real hands. İbili et al. [9] developed an AR geometry teaching system with gesture operations and verified that individual differences in the 3D thinking skills of middle school students, and a personalized dynamic intelligent learning environment, are particularly important. Cao et al. [10] proposed an interactive AR system that enabled students to manipulate 3D objects naturally and directly through gesture interaction. Studies have shown that systems based on gesture and AR interaction are easy to use and attractive. However, these studies use only single-modal gesture interaction, which makes the user’s operation load too heavy and the interaction efficiency too low. For example, if the user relies only on gestures in a chemical experiment, the gesture types become complex and similar gestures are easily confused, leading to inaccurate recognition, a heavy operation load and reduced interaction efficiency.
Gesture recognition is an essential part of achieving augmented reality effects. Gesture recognition methods include deep learning [11,12], Hidden Markov Models (HMM) [13,14] and geometric features [15,16], which researchers have used to transfer information between gesture and vision. Wu D et al. [17] fused a Deep Belief Network (DBN) with a 3D Convolutional Neural Network (3DCNN) and fed the gesture classification probabilities into an HMM to complete gesture recognition. Elmezain et al. [18] used an HMM to recognize the dynamic trajectory of a gesture. Priyal et al. [19] used matrix feature normalization to identify the geometry of a gesture. Liang H et al. [20] used a random forest to classify gestures and applied the gestures to 3D virtual objects to achieve a seamless fusion of real scenes and virtual objects. This research shows that deep learning methods perform best on gesture recognition problems, owing to their strong fitting ability and other advantages over the alternatives, but their recognition efficiency is relatively low in practical applications.
In a virtual experiment, the virtual effects presented by AR technology give a strong sense of immersion, and the fusion of physical and virtual objects together with the fusion of multiple modalities can improve user interactivity. However, existing research has three problems: the experiment is not easy to operate, gestures are easily confused during recognition, and the operation load is too heavy. To address these challenges together, our work makes the following contributions:
1. This paper uses gesture recognition instead of card marker recognition in AR. We propose a multimodal interaction method (ARGEV) based on gestures and sensors, which uses Kinect to turn complex augmented reality technology into a simple coordinate transformation problem and achieves the fusion of real hands, virtual scenes and sensors. It enables real-time interaction between gestures and virtual objects and improves the user’s interaction efficiency;
2. We combine gestures and intelligent equipment to design a virtual and real fusion interaction tool suite (VRFITS), which handles the interaction between physical objects and virtual models and triggers perceptual feedback when gestures interact with a virtual model, enhancing the user’s sense of real operation;
3. We design the Augmented Reality Chemistry Lab (ARCL), which is operated through the interaction of real hands, intelligent equipment and virtual models.
2 Multimodal interaction method based on AR
The VRFITS comprises an item of intelligent equipment and a multimodal interaction method. Across the visual and tactile modalities, the intelligent equipment enhances the realism of user operations, while the gesture interaction method identifies the user’s gesture behavior and then triggers vibration feedback on the intelligent equipment.
2.1 The overall framework
The overall framework comprises a model establishment stage, a gesture recognition and interaction stage, and a system application stage (Figure 1). In the model establishment stage, we first preprocess the gesture depth maps and then train the gesture recognition model with a CNN. In the gesture recognition and interaction stage, we feed the gesture depth map into the gesture recognition model to obtain the recognition result, and keep the virtual and real scenes consistent through coordinate binding. The fusion of the gesture and the virtual model triggers vibration of the intelligent equipment. In the system application stage, we implement the ARCL.
[Figure 1 diagram. Labels: training and testing (gesture depth map, preprocessing, gesture model, recognition result); gesture recognition (gesture depth maps of frame n-1 and frame n, gesture model, recognition result); multimodal interaction (camera calibration, coordinate calibration of virtual and real fusion, virtual modeling, intelligent equipment, vibration feedback, visual effects); Augmented Reality Chemistry Lab.]
Figure 1 The overall framework of multimodal interaction method.
2.2 Design and interaction method for intelligent equipment
Intelligent sensing equipment is becoming increasingly abundant, for example somatosensory devices, Google Glass, Microsoft HoloLens and Kinect, and users can use such devices for learning or home entertainment. Researchers can also build new work on these devices, but their price is high, which is unrealistic for the large number of teaching classrooms in the central and western regions of China. In addition, in real experiments much experimental equipment cannot be reused, which wastes a great deal of it. Therefore, this work designs intelligent equipment built from inexpensive sensors.
[Figure 2 diagram. Labels: intelligent equipment, touch sensors 1 and 2, intelligent ring, vibrator.]
Figure 2 Design structure of intelligent equipment.
In the intelligent equipment (Figure 2), sensors detect changes in external signals. Each sensor’s signal output port connects to an IO port of the STM32103 main control chip, which transmits the signal serially to the computer through the EIA-RS232 module. The serial data is then read on the Unity 3D platform. To drive the vibrator, the Unity 3D platform sends data through the serial port to the STM32103 main control chip, which controls the vibration of the vibrator.
The intelligent equipment includes an intelligent ring and two touch sensors, and costs approximately 30 yuan. A vibrator is mounted on the intelligent ring. During the experiment, the ring is worn on a finger of the right hand. When the user grabs a virtual object by hand, the ring vibrates. The specific perceptual processes (in no particular order) are as follows:
1. If the five fingers grab gesture is recognized, the system sends “00” to the serial port, and the information is transmitted to the STM32103 main control chip through the serial port. The vibrator is set to vibrate for 5 seconds. If the gesture is not the five fingers grab gesture, there is no vibration;
2. When the hand touches a touch sensor, the perceived touch intensity is calculated, and the average touch intensity is obtained through repeated experiments. If the touch intensity is greater than the average touch intensity, the system receives the signal Noe, whose values “1” and “2” indicate the start experiment and end experiment buttons. If no signal is received, no semantics are expressed.
The intelligent equipment suits virtual experimental scenes (AR and VR) with a gesture recognition function, combining gesture behavior with the vibrator to trigger vibration feedback.
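The signalling rules of the intelligent equipment can be summarized in a short sketch. The payload “00”, the 5-second duration and the Noe values 1 and 2 come from the text; the function names, the gesture label string and the way intensities are passed in are hypothetical choices made only for illustration, and no real serial port is opened here.

```python
from typing import Optional

VIBRATE_COMMAND = "00"   # payload the text says is sent to trigger the ring
VIBRATION_SECONDS = 5    # vibration duration stated in the text


def vibration_command(gesture: str) -> Optional[str]:
    """Return the serial payload for a recognized gesture, or None.

    Only the five fingers grab gesture triggers vibration; the label
    "five_fingers_grab" is a hypothetical name for that class.
    """
    return VIBRATE_COMMAND if gesture == "five_fingers_grab" else None


def touch_signal(sensor_id: int, intensity: float, average: float) -> Optional[int]:
    """Return Noe (1 = start experiment, 2 = end experiment) or None.

    A touch counts only when its intensity exceeds the average touch
    intensity measured beforehand; otherwise no semantics are expressed.
    """
    if sensor_id not in (1, 2):
        raise ValueError("this sketch models only touch sensors 1 and 2")
    return sensor_id if intensity > average else None
```

In a full system the returned payload would be written to the serial port that connects Unity 3D to the main control chip; that transport layer is deliberately left out of the sketch.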
2.3 Gesture recognition method
2.3.1 Gesture data preprocessing
For the virtual experiment system, taking chemistry experiments as an example, we surveyed the natural and common gestures that students and teachers use when operating experiments, and designed six gestures commonly used in experiments. First, we use Kinect to collect depth maps of the human body, 10,000 for each gesture type. Then, we obtain the depth information and the coordinates of the centroid point of the human hand from Kinect to segment the gesture depth map from the collected body image. A distance of 3 cm from the centroid point is used as the threshold: points farther from the centroid than the threshold are treated as outside the hand area, and the remaining hand area is cropped to obtain a gesture depth map of 200 × 200 pixels. To simplify operation, similar gestures are assigned the same semantics, such as the two fingers spread and three fingers stretch gestures. The six preprocessed gesture depth maps and their definitions are shown in Table 1.
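The thresholding step can be illustrated with a minimal sketch. It assumes that the 3 cm threshold is applied to the depth difference from the hand centroid and that depth values are in millimetres; the text does not state either point, and the function name and argument layout are hypothetical.

```python
from typing import List

DEPTH_THRESHOLD_MM = 30  # 3 cm threshold from the text, assumed to be in mm


def segment_hand(depth: List[List[int]], cx: int, cy: int, size: int) -> List[List[int]]:
    """Crop a size x size window centred on the hand centroid (cx, cy).

    Pixels whose depth differs from the centroid depth by more than the
    threshold, or that fall outside the frame, are set to 0 (background).
    """
    centre_depth = depth[cy][cx]
    half = size // 2
    patch = []
    for y in range(cy - half, cy - half + size):
        row = []
        for x in range(cx - half, cx - half + size):
            inside = 0 <= y < len(depth) and 0 <= x < len(depth[0])
            value = depth[y][x] if inside else 0
            keep = inside and abs(value - centre_depth) <= DEPTH_THRESHOLD_MM
            row.append(value if keep else 0)
        patch.append(row)
    return patch
```

For the paper’s setting the window size would be 200, giving the 200 × 200 pixel gesture depth map; the small sizes used in testing only keep the example readable.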
Table 1 Definition of six static gestures
Number | Name | Depth map | Gesture state | Presentation semantics
1 | Five fingers grab | [image] | The five fingers are clenched into a fist | Grab
2 | One finger stretch | [image] | The forefinger is extended | Click
3 | Two fingers spread | [image] | Thumb and index finger are open | Amplify
4 | Two fingers stretch | [image] | Forefinger and middle finger are opened like scissors | Whirl
5 | Three fingers stretch | [image] | Thumb, index and middle fingers are open | Amplify
6 | Five fingers spread | [image] | Five fingers are open | Put down
2.3.2 Gesture recognition method based on CNN
First, we build the AlexNet network structure model. The AlexNet convolutional neural network has the advantage of learning richer and higher-dimensional image features. AlexNet uses dropout (random inactivation) and data augmentation to suppress overfitting effectively, and uses the ReLU function instead of the earlier Sigmoid as the activation function. Therefore, for the six kinds of static gesture depth maps, this paper trains a gesture recognition model based on the AlexNet structure. The network structure includes five convolutional layers, three pooling layers, three fully connected layers and a softmax classification function. We use 10,000 depth maps for each gesture, each initially set to 200 × 200 pixels, and train the gesture recognition model with the AlexNet network. The network structure is shown in Figure 3.
[Figure 3 diagram. Feature-map sizes in order: 200×200×3 (input), 200×200×64, 200×200×64, 100×100×64, 100×100×128, 50×50×128, 50×50×256, 50×50×256, 50×50×128, 25×25×128, FC 1×1×1024, FC 1×1×6, softmax. Layers are labeled 3×3 with stride S=1 or S=2, and there are three MAX-POOL stages.]
Figure 3 AlexNet structure.
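The spatial sizes in Figure 3 follow from SAME padding, for which the output width is the input width divided by the stride, rounded up. The sketch below traces that rule; the stride sequence is read off the figure under the assumption that the S=1 labels belong to the five convolutions and the S=2 labels to the three pooling stages.

```python
import math
from typing import List


def same_padding_size(size: int, stride: int) -> int:
    """Output width (or height) of a SAME-padded convolution or pooling layer."""
    return math.ceil(size / stride)


def trace_sizes(input_size: int, strides: List[int]) -> List[int]:
    """Return the spatial size after each layer, starting with the input size."""
    sizes = [input_size]
    for stride in strides:
        sizes.append(same_padding_size(sizes[-1], stride))
    return sizes


# Assumed layer order: conv, pool, conv, pool, conv, conv, conv, pool.
FIGURE_3_STRIDES = [1, 2, 1, 2, 1, 1, 1, 2]
```

Calling `trace_sizes(200, FIGURE_3_STRIDES)` reproduces the 200, 100, 50 and 25 pixel stages listed in the figure.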
For the AlexNet structure, the number of iterations (epoch) is set to 20000, the batch size for each training or test step is 20, the padding type is SAME, and the dropout value is 0.8; generalization performance is evaluated every 20 depth maps. The stride and convolution kernel size of each layer are shown in Figure 3. The process of training the gesture recognition model is as follows:
1. Use Kinect to obtain the depth information and depth map of the human skeleton joints, and determine the hand area according to the threshold value to generate a gesture depth map;
2. Divide the data set into a training set and a test set in a ratio of 7:3;
3. Input the training set into AlexNet, where the gesture depth features are extracted while the weights are continuously updated. The features of each layer are computed as:
$$x_i^m = f\left(\sum_{j=1}^{n} x_j^{m-1} \cdot w_{ij}^m + b_i^m\right) \tag{1}$$
Here $m$ is the index of the current layer, $n$ is the number of neurons in layer $m-1$, $w_{ij}^m$ is the connection weight between neuron $j$ in layer $m-1$ and neuron $i$ in layer $m$, $b_i^m$ is the bias of the $i$-th feature after $m$ convolution layers, and $f$ is the activation function;
4. After the Softmax layer is computed, the vector $v$ is obtained. Each dimension of $v$ corresponds to one prediction type, and the predicted probability of each type is calculated as follows:
$$P_l = \frac{e^{v_l}}{\sum_k e^{v_k}} \tag{2}$$
where $v_l$ is the $l$-th element of the vector $v$, $P_l$ is the predicted probability of the $l$-th type, and $l \in (0, 6]$;
Finally, we obtain the trained gesture recognition model (AlexNet_gesture), which is then encapsulated and imported
into ARCL.
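The training steps above rest on three small computations: the 7:3 split, the per-layer feature computation of equation (1) and the softmax of equation (2). The following is a minimal pure-Python sketch of each, written for a fully connected layer rather than a convolution; the function names, the fixed random seed and the use of ReLU as the default activation are illustrative assumptions, not the authors’ implementation.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def relu(x: float) -> float:
    return max(0.0, x)


def layer_forward(prev: Sequence[float], weights: Sequence[Sequence[float]],
                  biases: Sequence[float],
                  f: Callable[[float], float] = relu) -> List[float]:
    """Equation (1): x_i^m = f(sum_j x_j^(m-1) * w_ij^m + b_i^m)."""
    return [f(sum(x * w for x, w in zip(prev, w_i)) + b_i)
            for w_i, b_i in zip(weights, biases)]


def softmax(v: Sequence[float]) -> List[float]:
    """Equation (2); the maximum is subtracted first for numerical stability."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def split_dataset(samples: Sequence, train_ratio: float = 0.7,
                  seed: int = 0) -> Tuple[list, list]:
    """Shuffle and split samples into training and test sets (7:3 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

The index of the largest softmax output is the predicted gesture class; with six outputs it corresponds to one of the gesture numbers in Table 1.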
2.4 Multimodal interaction method
First, we use the Kinect RGB sensor to capture the real environment, while the Kinect depth camera captures the depth map of the hand. This paper realizes the AR effect by establishing a coordinate calibration between real space and virtual space in Unity 3D. The ARGEV method achieves fused tactile and visual interaction. During the experiment, the user observes the experimental phenomena and scenes in AR directly on the computer screen, without wearing a head-mounted display (HMD). The user sits in front of the computer with the intelligent equipment placed beside it. If the user makes a five-finger grab gesture, vibration feedback is triggered. The process is shown in Figure 4.
[Figure 4 diagram. Labels: RGB sensor, depth sensor, X/Y/Z axes, RGB information, depth information, depth feature data, recognition, gesture operation, multimodal interaction, animation, vibration feedback, ARCL.]
Figure 4 Multimodal interaction process in AR.
First, we encapsulate the AlexNet_gesture model and define the right-hand gesture set $Ges\_R$ and the left-hand gesture set $Ges\_L$:
$$Ges\_R \in \{R_1, R_2, R_3, R_4, R_5, R_6\} \tag{3}$$
$$Ges\_L \in \{L_1, L_2, L_3, L_4, L_5, L_6\} \tag{4}$$
$R_1$ to $R_6$ and $L_1$ to $L_6$ correspond to the gesture numbers in Table 1. Then, we call AlexNet_gesture in ARCL to establish the consistency of hand coordinates and virtual coordinates. Let
$$\theta = (k_x, k_y, k_z) \tag{5}$$
where $\theta$ is the 3D depth coordinate under Kinect. According to the mapping between hand joint point coordinates in real space and the 3D depth coordinates, the mapping between the joint point coordinates and the virtual scene is determined as:
$$\begin{bmatrix} k_x \\ k_y \\ k_z \end{bmatrix} = t \begin{bmatrix} U_x \\ U_y \\ U_z \end{bmatrix} + \begin{bmatrix} d_x \\ d_y \\ d_z \end{bmatrix} \tag{6}$$
where $(U_x, U_y, U_z)$ is the virtual scene coordinate in Unity, $t$ is the scale ratio between the 3D coordinates of the real-world scene and the virtual scene, and $(d_x, d_y, d_z)$ is the intercept (offset) at the virtual scene coordinates.
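The linear mapping between Kinect coordinates and Unity coordinates can be applied in both directions. The sketch below implements the stated relation k = t·U + d and its inverse, which is what placing a virtual model at the hand position requires; the function names and the example calibration values are hypothetical, since the paper does not report its values of t and d.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def unity_to_kinect(u: Vec3, t: float, d: Vec3) -> Vec3:
    """Forward mapping from the text: k = t * U + d, applied per axis."""
    return tuple(t * ui + di for ui, di in zip(u, d))


def kinect_to_unity(k: Vec3, t: float, d: Vec3) -> Vec3:
    """Inverse mapping: U = (k - d) / t, used to place a model at the hand."""
    if t == 0:
        raise ValueError("scale ratio t must be non-zero")
    return tuple((ki - di) / t for ki, di in zip(k, d))


# Hypothetical calibration values, for illustration only.
EXAMPLE_T = 0.5
EXAMPLE_D = (0.1, 0.0, 1.2)
```

Because a single scale ratio is shared by all three axes, calibration reduces to estimating one scalar and one offset vector, which is the sense in which the method treats AR registration as a simple coordinate transformation.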
The ARGEV algorithm takes gesture depth maps and sensor signals as input, performs multimodal interaction, and outputs vibration feedback and visual effects. Visual effects include designed animations, particle effects and pouring (dumping) effects of virtual beakers in Unity 3D. The specific gesture interaction algorithm is as follows:
Algorithm 1: Multimodal Interaction Algorithm based on AR (ARGEV algorithm)
Input: gesture depth map, sensor signal Noe;
Output: vibration feedback, visual effects;
1. Use the Kinect depth camera to obtain the gesture depth map of frame (n-1) and input it into the AlexNet_gesture model for gesture recognition;
2. Obtain the gesture depth map of frame n, and record the joint point coordinates 𝑆𝑛−1(𝜃𝑛−1) and 𝑆𝑛(𝜃𝑛) of frames (n-1) and n;
3. if (Noe = 1) then the experiment begins, and virtual equipment appears in the scene end if
4. if (Noe = 2) then the experiment ends end if
5. if (𝑆𝑛−1(𝜃𝑛−1) = 𝑆𝑛(𝜃𝑛)) then
       while (𝐺𝑒𝑠_𝑅 ∪ 𝐺𝑒𝑠_𝐿 ≠ ∅) do
           if (R1) then set 𝑃𝑖𝑠_𝑣 as the three-dimensional coordinate of the virtual model, 𝑆𝑛(𝜃𝑛) = 𝑃𝑖𝑠_𝑣, and send the “00” data to the microcontroller to trigger the intelligent ring vibration end if
           if (L1) then the prompt box of the selected experimental equipment appears on the system interface end if
           if (R4) then the current virtual equipment is dumped (poured) end if
           if (R6) then drop the current virtual equipment end if
       end while
   end if
6. if (𝑆𝑛−1(𝜃𝑛−1) ≠ 𝑆𝑛(𝜃𝑛)) then return 1 end if
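The control flow of Algorithm 1 can be sketched as a single per-frame decision function. This is an illustrative reading, not the authors’ Unity code: the action names are hypothetical labels, the coordinate comparison uses a small tolerance instead of exact equality, and “return 1” in step 6 is interpreted as waiting for the next frame without acting.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def same_position(a: Vec3, b: Vec3, tol: float = 1e-3) -> bool:
    """Treat two joint positions as equal when every axis is within tol."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def argev_step(noe: Optional[int], prev_joint: Vec3, curr_joint: Vec3,
               right: Optional[str], left: Optional[str]) -> List[str]:
    """Return the actions triggered for one pair of consecutive frames."""
    actions: List[str] = []
    if noe == 1:
        actions.append("start_experiment")      # step 3
    if noe == 2:
        actions.append("end_experiment")        # step 4
    if not same_position(prev_joint, curr_joint):
        return actions                          # step 6: no gesture action
    if right == "R1":                           # five fingers grab
        actions.append("bind_model_to_hand")
        actions.append("send_00_vibrate")
    if left == "L1":
        actions.append("show_prompt_box")
    if right == "R4":
        actions.append("pour_equipment")
    if right == "R6":
        actions.append("drop_equipment")
    return actions
```

Returning a list of action labels keeps the decision logic separate from rendering and serial output, so the same function could be exercised without Kinect or the ring attached.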
3 Experimental results and analysis
3.1 The AlexNet_gesture model training results
During training, the accuracy and loss values are recorded every 20 iterations and visualized with TensorBoard. The accuracy and loss curves of the AlexNet_gesture model are shown in Figure 5.
(a) the accuracy rate change (b) the Loss change
Figure 5 Changes in accuracy and Loss values during training.
In Figure 5, the accuracy during training gradually approaches 1, and the loss falls from a large value, stabilizes and approaches 0, which indicates that the model is being optimized effectively during training.
3.2 Comparative experiment
We design two sets of comparative experiments. The first compares the accuracy of each gesture before and after model optimization, and the second compares the training results of the AlexNet, GoogleNet and VGG16Net models. We use the preprocessed gesture depth maps, with 3000 test images. The experimental results are shown in Figure 6.
(a) The first set of experimental results (b) The second set of experimental results
Figure 6 Comparison of experimental results before and after model optimization.
It can be seen that the average gesture accuracy after optimization is 99.04%, about 2% higher than before optimization. Recognition of the similar gestures 2 and 3 is also better than before optimization. The gesture recognition model trained with AlexNet is more accurate than the other two network models both before and after optimization, and the optimized model improves by about 1% to 3%.
3.3 Effectiveness of the ARGEV algorithm
We verify the effectiveness of the ARGEV algorithm by checking whether the coordinates of the gesture and of the virtual model are consistent. Over the same time period, while the user makes a five-finger grab gesture, we record the three-dimensional coordinates of the hand and of the virtual model.
(a) coordinates of the real hand (b) coordinates of the virtual model
Figure 7 Comparison of 3D coordinates of real hand and virtual model.
In Figure 7, the 3D coordinates of the gesture trajectories are labeled with different colors. The 3D coordinates in Figures 7(a) and 7(b) over the same time period are equal, which shows that the gesture interaction algorithm is effective.
3.4 Application of VRFITS in augmented reality chemistry lab
We build an interactive virtual and real fusion environment that combines intelligent equipment, real hands and virtual models. For hardware, we use an Intel® Core™ i7-8550 CPU, Kinect 3.0 and the intelligent equipment. For software, we set up the experimental scenarios on the Unity 3D platform and use C# as the programming language. Finally, we apply the ARGEV algorithm and the intelligent equipment in ARCL.
The VRFITS can be used repeatedly, and it helps students avoid danger when operating experiments alone. We also want to help students focus on operating chemical experiments, observing and learning, and to address problems such as difficult experiments, fear of experiments and a lack of experimental reagents. We therefore design ARCL and apply the VRFITS in it. We validate ARGEV with an example chemistry experiment in ARCL, the reaction of sodium (natrium) with water. In this experiment, the user only needs the actions of grabbing, putting down and pouring (dumping). Therefore, to simplify operation for students in ARCL, we choose three gesture types for experimental verification. The main flowchart of ARCL is shown in Figure 8.
[Figure 8 flowchart. Labels: Start; Kinect initialization; tactile initialization; Touch 1 (Y/N); get joint point type (Left/Right); L1: virtual prompt box; R1: grab, select, move virtual models, with vibration feedback; R4: rotate the virtual model; R6: drop virtual model; Touch 2 (Y/N); End.]
Figure 8 The flowchart of ARCL.
In middle school chemistry classes, the sodium and water reaction is one of the main chemistry experiments. However, a moderate amount of sodium reacts with water to generate gas, while a large amount reacts explosively, which makes the experiment difficult to observe and operate. To let students experience the experimental process better, this paper takes the sodium and water reaction as an example to present the experimental mechanism in ARCL. In the virtual scene, we add prompt windows, virtual experimental equipment, particle effects and animation effects to make the experiment more immersive. The main effects of the experimental operation are shown below.
Figure 9 The effect of the ARCL operation one
In Figure 9, the red boxes indicate operation prompts and scene effects, and the yellow boxes indicate user operation behavior. After the user touches the start key, the system presents the AR scene. In the prompt box, the user selects the experimental equipment with the five fingers grab gesture, and the intelligent ring vibrates (a). The user pours out the virtual beaker (b), then takes out the virtual equipment needed for the experiment.
Figure 10 The effect of the ARCL operation two
In Figure 10, the user cuts the sodium block with a virtual knife (a), then uses tweezers to take a small piece of sodium and put it into a beaker filled with water (b). The user can observe five phenomena of the sodium and water reaction, and a real video is added to verify authenticity (c). The user selects the beaker again (d), puts it on the table and uses tweezers to put a large piece of sodium into the beaker, producing the explosion scene (e). Finally, the user ends the experiment by touching the end key (f).
3.5 User evaluation comparison
We compare the sodium and water experiment of the NOBOOK virtual experiment platform [21] with that of the ARCL on the basis of user evaluation (Figure 11). The NOBOOK experiment is a virtual reality simulation that uses a mouse as its input source.
Figure 11 The NOBOOK virtual experiment platform (a) and the ARCL (b).
We invited 10 teachers and 30 students to complete the evaluation, using the NOBOOK experiment first and then the ARCL. To verify that the experimental system suits teaching applications, we set seven comparative aspects, “teaching evaluation”, “experimental interest”, “experimental interaction”, “learning effect”, “system stability”, “experimental hints” and “ease of operation”, collectively VET_P, and labeled them VET_P1 to VET_P7 in that order. After the operation, the 10 teachers each compared the two systems on every VET_P aspect, scoring on a 5-level scale in which a higher level means a better rating. The NOBOOK virtual experiment platform is denoted A and the ARCL is denoted B. We use ANOVA to assess the significance of each factor (Table 2), with significance level 𝛼 = 0.05.
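Table 2 ANOVA statistical result
VET_P | Experiment platform | Average value | Variance | SS | MS | F | P-value | F crit
VET_P1 | A | 2.2 | 0.844 | 18.05 | 18.05 | 25.992 | 7.51 | 4.419
VET_P1 | B | 4.1 | 0.544 | | | | |
VET_P2 | A | 2.8 | 0.622 | 11.25 | 11.25 | 26.299 | 7.04 | 4.419
VET_P2 | B | 4.3 | 0.233 | | | | |
VET_P3 | A | 2.7 | 0.456 | 8.45 | 8.45 | 10.787 | 0.004 | 4.419
VET_P3 | B | 4 | 1.111 | | | | |
VET_P4 | A | 1.9 | 0.767 | 20 | 20 | 26.087 | 7.36 | 4.419
VET_P4 | B | 3.9 | 0.767 | | | | |
VET_P5 | A | 4.2 | 0.711 | 0.2 | 0.2 | 0.409 | 0.530 | 4.419
VET_P5 | B | 4.6 | 0.267 | | | | |
VET_P6 | A | 1.2 | 0.177 | 28.8 | 28.8 | 51.84 | 1.06 | 4.419
VET_P6 | B | 3.6 | 0.933 | | | | |
VET_P7 | A | 2.7 | 0.455 | 5 | 5 | 14.516 | 0.012 | 4.419
VET_P7 | B | 3.7 | 0.233 | | | | |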
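The SS and F entries of Table 2 can be cross-checked from the reported averages and variances. The sketch below computes a one-way ANOVA for two equally sized groups; it assumes 10 ratings per platform (one per teacher) and that the reported variances are sample variances, neither of which is stated explicitly in the text. The function name is hypothetical.

```python
from typing import Tuple


def anova_two_groups(mean_a: float, var_a: float, mean_b: float, var_b: float,
                     n: int) -> Tuple[float, float, float]:
    """Return (SS_between, MS_between, F) for two groups of n ratings each."""
    grand_mean = (mean_a + mean_b) / 2
    ss_between = n * ((mean_a - grand_mean) ** 2 + (mean_b - grand_mean) ** 2)
    ms_between = ss_between / 1              # df_between = groups - 1 = 1
    df_within = 2 * (n - 1)                  # 18 when n = 10
    ms_within = ((n - 1) * var_a + (n - 1) * var_b) / df_within
    return ss_between, ms_between, ms_between / ms_within


# VET_P1 row of Table 2: platform A (NOBOOK) versus platform B (ARCL).
ss, ms, f_value = anova_two_groups(2.2, 0.844, 4.1, 0.544, n=10)
```

For VET_P1 this gives SS = 18.05 and F of about 26.0, in line with the reported 25.992; the small gap comes from the rounded variances. A factor is significant at 𝛼 = 0.05 when F exceeds F crit.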
Figure 12 The VET_P comparison score.
Teachers generally found the results of both experiments satisfactory, and they could operate the experiments correctly and observe the experimental phenomena. However, the teachers’ evaluations of the two experimental modes differ considerably: the average score of the ARCL is 29.42% higher than that of the NOBOOK experiment. The teachers’ evaluations on VET_P5 differ only slightly; F is less than F crit, which shows that both systems are relatively stable. In the three aspects VET_P1, VET_P3 and VET_P7, the overall evaluation of the ARCL is 28% higher than that of the NOBOOK experiment, indicating that the ARCL is simpler to operate and more convenient for teaching than NOBOOK. In addition, on VET_P4 and VET_P6 the ARCL scores 40% and 50% higher than NOBOOK, respectively. This further illustrates that in ARCL, with multimodal interaction, users are more immersed in the experiments and learn more effectively. Regarding VET_P6, the NOBOOK experiment requires exploration, whereas the ARCL can be operated by following the prompt box; even an operator unfamiliar with the virtual experimental environment does not waste much time during the experiment, which improves the efficiency of experimental interaction.
In addition, we evaluate the operating load of the system with the National Aeronautics and Space Administration Task Load Index (NASA-TLX) [22] cognitive load assessment. We use the NASA-TLX criteria “mental demand (MD)”, “physical demand (PD)”, “temporal demand (TD)”, “effort (E)”, “performance (P)” and “frustration level (FD)” to score the two systems. The 40 operators experienced NOBOOK and ARCL separately, and the average scores were computed. As in the VET_P assessment, scores use a five-point scale. The NASA-TLX evaluation of the two systems is shown in Figure 13.
Figure 13 NASA-TLX model evaluation results.
The two experiments differ only slightly on MD, which means that users expend little mental effort when operating either experiment. On the other five indicators, however, the ARCL scores significantly lower than NOBOOK. The overall cognitive operation load of the ARCL is reduced by 27.42%, which indicates that the VRFITS improves interaction efficiency. The virtual and real fusion interaction improves the user’s interaction with the virtual model and the immersion of the experiment.
4 Conclusion and discussion
This paper designs and proposes the VRFITS, which contains an item of intelligent equipment and a gesture interaction method; the suite suits any AR experiment that involves gesture operations. We achieve the combination of gestures, sensors and virtual models in AR. The intelligent equipment and the gesture interaction method complement each other: a gesture can trigger vibration feedback. Finally, we design and implement a prototype system, ARCL. According to user evaluation, compared with NOBOOK the ARCL increases interactivity and the real sense of operation in virtual experiments, reduces the user’s operation load and improves the user’s interaction efficiency. In addition, compared with AR card recognition based on the Vuforia SDK, ARCL does away with multiple card operations and uses only different gesture commands to trigger different virtual models, which makes operation more convenient and effective.
However, this paper has certain limitations. On the one hand, relatively few gesture types are recognized, so users lack gesture types during interaction in virtual experiments. On the other hand, in the virtual chemistry experiment system, the particle effects, animation effects and virtual model rendering in the experimental scene are not prominent, and the interface of the system should be improved in the future.
Acknowledgments
This research was supported in part by the National Key R&D Program of China (No. 2018YFB1004901) and the Independent Innovation Team Project of Jinan City (No. 2019GXRC013). We express our thanks to the people who helped with this work and acknowledge the valuable suggestions from the reviewers.
References
1 Collazos C A, Merchan L. Human-computer interaction in Colombia: bridging the gap between education and industry.
IT Professional, 2015, 17(1): 5-9.
DOI: 10.1109/MITP.2015.8.
2 Desai K, Belmonte U H H, Jin R, et al. Experiences with multi-modal collaborative virtual laboratory (mmcvl)//2017 IEEE
Third International Conference on Multimedia Big Data (BigMM). IEEE, 2017: 376-383.
DOI: 10.1109/BigMM.2017.62
3 Chen L, Tang W, John N W, et al. SLAM-based dense surface reconstruction in monocular Minimally Invasive Surgery
and its application to Augmented Reality. Computer methods and programs in biomedicine, 2018, 158: 135-146.
DOI: 10.1016/j.cmpb.2018.02.006
4 Huynh B, Orlosky J, Höllerer T. In-Situ Labeling for Augmented Reality Language Learning//2019 IEEE Conference on
Virtual Reality and 3D User Interfaces (VR). IEEE, 2019: 1606-1611.
DOI: 10.1109/vr.2019.8798358
5 Karambakhsh A, Kamel A, Sheng B, et al. Deep gesture interaction for augmented anatomy learning. International Journal
of Information Management, 2019, 45: 328-336.
DOI: 10.1016/j.ijinfomgt.2018.03.004
6 Sun C. The design and implementation of children’s education system based on augmented reality. Shandong
University, 2017.
7 Fidan M, Tuncel M. Integrating augmented reality into problem based learning: The effects on learning achievement and
attitude in physics education. Computers & Education, 2019, 142: 103635.
8 Dave I R, Chaudhary V, Upla K P. Simulation of Analytical Chemistry Experiments on Augmented Reality
Platform//Progress in Advanced Computing and Intelligent Engineering. Springer, Singapore, 2019: 393-403.
DOI: 10.1007/978-981-13-0224-4_35
9 İbili E, Çat M, Resnyansky D, et al. An assessment of geometry teaching supported with augmented reality teaching
materials to enhance students’ 3D geometry thinking skills. International Journal of Mathematical Education in Science
and Technology, 2020, 51(2): 224-246.
10 Rani S S, Dhrisya K J, Ahalyadas M. Hand gesture control of virtual object in augmented reality//2017 International
Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2017: 1500-1505.
11 Skaria S, Al-Hourani A, Lech M, et al. Hand-gesture recognition using two-antenna Doppler radar with deep convolutional
neural networks. IEEE Sensors Journal, 2019, 19(8): 3041-3048.
DOI: 10.1109/jsen.2019.2892073
12 Côté-Allard U, Fall C L, Drouin A, et al. Deep learning for electromyographic hand gesture signal classification using
transfer learning. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2019, 27(4): 760-771.
DOI: 10.1109/tnsre.2019.2896269
13 Sinha K, Kumari R, Priya A, et al. A computer vision-based gesture recognition using hidden markov model //Innovations
in Soft Computing and Information Technology. Springer, Singapore, 2019: 55-67.
DOI: 10.1007/978-981-13-3185-5_6
14 Zhang L, Zhang Y, Niu L, et al. HMM Static Hand Gesture Recognition Based on Combination of Shape Features and
Wavelet Texture Features//International Conference on Wireless and Satellite Systems. Springer, Cham, 2019: 187-197.
15 Ahmad S U, Akhter S. Real Time Rotation Invariant Static Hand Gesture Recognition using An Orientation Based Hash
Code. Informatics, Electronics & Vision (ICIEV). Dhaka, Bangladesh, 2013: 1-6.
DOI: 10.1109/iciev.2013.6572620
16 Rehman A, Harouni M, Saba T. Cursive Multilingual Characters Recognition Based on Hard Geometric Features. arXiv
preprint arXiv:1904.08760, 2019.
17 Wu D, Pigou L, Kindermans P J, et al. Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and
Recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2016, 38(8):1-1.
DOI: 10.1109/TPAMI.2016.2537340
18 Elmezain M, Al-Hamadi A, Michaelis B. Hand Trajectory-based Gesture Spotting and Recognition using HMM. Image
Processing (ICIP). Cairo, Egypt, 2009: 3577-3580.
DOI: 10.1109/ICIP.2009.5414322
19 Priyal S P, Bora P K. A Robust Static Hand Gesture Recognition System using Geometry Based Normalizations and
Krawtchouk Moments. Pattern Recognition, 2013, 46(8): 2202-2219.
DOI: 10.1016/j.patcog.2013.01.033
20 Liang H, Yuan J, Thalmann D, et al. Ar in hand: Egocentric palm pose tracking and gesture recognition for augmented
reality applications. Proceedings of the 23rd ACM international conference on Multimedia. Brisbane, Australia, 2015:
743-744.
DOI: 10.1145/2733373.2807972
21 Wang J. Research on the application of virtual simulation experiment in physics experiment teaching of senior high school.
2018.
22 Law K E, Lowndes B R, Kelley S R, et al. NASA-task load index differentiates surgical approach: opportunities for
improvement in colon and rectal surgery. Annals of surgery, 2020, 271(5): 906-912.