Using the Leap Motion Controller to Translate Sign Language to Speech
Name: Tam Chi Yan, Leung Ka Chun, Cheung Yat Laam, To Wun Yin
ID:
School: Engineering
Department: Computer Science
Year of study: 4
Email: -
Phone number: -
Abstract
This project aims to develop a sign language translator to improve the quality of
communication between deaf people and the general public. It returns speech and
text when the user performs sign language in front of the Leap Motion Controller
(LMC). The LMC is used to capture hand gesture images and convert them into
positional and directional information. These data are compared with the data in
the database to determine the most similar sign using the Dynamic Time Warping
(DTW) algorithm. DTW is an algorithm that measures the similarity between
sequences or time series which may vary in time [1]; it finds an optimal alignment
between two time series. Once a gesture has been recognized, its corresponding
speech is played and its meaning is displayed as text.
Our product recognizes a gesture with more than 90% accuracy over a set of 50
gestures, each with 10 recorded samples, in less than 2 seconds after the action is
performed. This shows the possibility and effectiveness of recognizing sign
language using the LMC, and it may help eliminate the language barrier between
deaf people and the hearing in the future.
Table of Contents
Abstract
Table of Contents
1. Detail Description
1.1 Data Structure
1.1.1 Data Structure of Leap Motion
1.1.2 Data Structure of Our System
1.2 Gesture Matching Algorithm
1.3 GUI Design
1.3.1 Overall Design
1.3.2 Graphic Visualizer Design
1.3.3 Multilingual GUI Output Design
2. Discussion
3. References
1. Detail Description
The goal of this project is to develop a Human-Computer Interaction (HCI)
application to improve the quality of communication between deaf people and the
general public in Hong Kong. The application uses the cameras on the LMC to
capture hand gestures and then looks up the corresponding sign language, so that
the recognized gestures can be translated into text and speech in real time.
The system receives data from the LMC, which captures the hand movement. The
data are then compared with the gestures stored in the database. If a sufficiently
similar gesture is found, its meaning is displayed as text on the screen and the
corresponding speech is played.
1.1 Data Structure
1.1.1 Data Structure of Leap Motion
The data sent from the LMC is a series of instances of class Frame. Each Frame
object provides the information of the recognized hands in one frame, including
their directions and coordinates. Only part of the data in each Frame instance is
stored, in order to reduce the size of the database.
1.1.2 Data Structure of Our System
To handle the data from the controller, we introduce class Coordinate and
enumeration HandType for managing three-dimensional coordinates and
representing the side of a recognized hand respectively. They are the major
components of classes FingerData and PalmData, which organize the information
related to the captured fingers and palms respectively.
FingerData, PalmData and HandType form a customized frame class, OneFrame,
which replaces the bulky class Frame from Leap Motion. An array of OneFrame
objects serves as a simplified record of an input gesture from the controller. It is the
fundamental part of class Sample.
A set of objects of class Sample is the basis on which our program identifies a
particular gesture. Each gesture has a unique name and also stores the number of
fingers, the number of palms and the hand types involved, to enable faster
comparison among signs. These elements define class Sign, the representation of
a sign in our system.
Figure 1. UML class diagram of database
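The classes in the diagram above can be sketched as follows. The class and enumeration names follow the report; the concrete fields are illustrative assumptions, not the project's exact implementation.

```java
// Sketch of the database classes described above. Names follow the report
// (Coordinate, HandType, FingerData, PalmData, OneFrame, Sample, Sign);
// the fields shown are illustrative assumptions.
import java.util.List;

enum HandType { LEFT, RIGHT }              // side of a recognized hand

class Coordinate {                         // three-dimensional coordinate
    final double x, y, z;
    Coordinate(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
}

class FingerData {                         // one captured finger
    Coordinate tip;                        // normalized fingertip coordinate
    HandType hand;                         // hand the finger belongs to
}

class PalmData {                           // one captured palm
    Coordinate position;                   // normalized palm coordinate
    HandType hand;
}

class OneFrame {                           // simplified replacement for Leap's Frame
    List<FingerData> fingers;
    List<PalmData> palms;
}

class Sample {                             // one recorded performance of a gesture
    OneFrame[] frames;                     // the gesture as a series of frames
}

class Sign {                               // representation of a sign in the system
    String name;                           // unique gesture name
    int fingerCount, palmCount;            // stored for faster pre-filtering
    List<HandType> hands;
    List<Sample> samples;
}
```

Keeping fingerCount, palmCount and the hand types directly on Sign lets the matcher skip whole signs without touching their frame data, which is the faster comparison the report mentions.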
1.2 Gesture Matching Algorithm
Whenever a gesture is captured by the LMC, it is sent to our system to be
compared with the gestures stored in the database using the Dynamic Time
Warping (DTW) algorithm. DTW measures the similarity between sequences or
time series which may vary in time; it finds an optimal alignment between two time
series. One of the series may be “warped” non-linearly by stretching or shrinking its
time axis, and the optimal alignment can be used to determine the similarity
between the two series. The recognition algorithm mainly considers the similarity
between the given data (the normalized coordinates of fingertips and palms) and
the data in the database, gesture by gesture. There are numerous studies of DTW;
a thesis by Ralph Niels presents its basic principles [1].
The distance calculation for the alignment between two sequences in DTW is the
major concern in this project. An algorithm has been introduced to calculate the
differences between the gesture captured by the LMC and those stored in the
database. As errors may be caused by gestures beginning at different coordinates,
normalized coordinates of the fingers and palms are calculated in each frame. This
reduces inconsistency before we calculate the distance between two frames. The
following approach has been adopted.
Consider a frame, called “frame n”, as shown in Figure 2. For each finger, the
normalized finger coordinate is the fingertip coordinate relative to the palm
coordinate in the same frame. It preserves the movement of the fingers while
eliminating the error mentioned above.
Figure 2. Calculation of normalized finger coordinate
The normalized palm coordinate is the palm coordinate in “frame n” relative to that
in the first frame of the Sample (i.e. frame 0). It preserves the movement of the
palm.
Figure 3. Calculation of normalized palm coordinate
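Under this reading, both normalizations reduce to vector subtraction. The formulas in Figures 2 and 3 are not reproduced here, so the following is an assumed sketch of the two steps:

```java
// Hedged sketch of the two normalization steps, assuming each is a
// plain vector subtraction (an interpretation of Figures 2 and 3).
class Vec3 {
    final double x, y, z;
    Vec3(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
    Vec3 minus(Vec3 o) { return new Vec3(x - o.x, y - o.y, z - o.z); }
}

class Normalizer {
    // Normalized finger coordinate: fingertip relative to the palm in the
    // same frame, so gestures starting at different positions yield the
    // same finger data.
    static Vec3 normalizeFinger(Vec3 fingertip, Vec3 palm) {
        return fingertip.minus(palm);
    }

    // Normalized palm coordinate: palm in frame n relative to the palm in
    // the first frame (frame 0), preserving the palm's movement.
    static Vec3 normalizePalm(Vec3 palmFrameN, Vec3 palmFrame0) {
        return palmFrameN.minus(palmFrame0);
    }
}
```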
The normalized coordinates of fingertips and palms are used to calculate the
distance between two frames. The following equation has been suggested for this
purpose. For example, given a frame from Sample A (i.e. Frame A) and a frame
from Sample B (i.e. Frame B), the distance between them is shown in Figure 4.
Figure 4. Equation of calculating distance between two frames
The normalized coordinates of fingers and palms preserve the movement of
fingers and palms, and the distance computed with this equation compares the
properties of two frames. Once the distances between all pairs of frames are
calculated, DTW can find the optimal alignment and the average distance between
two gestures.
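A minimal sketch of this step, assuming the frame distance is a sum of Euclidean distances between corresponding normalized points (one plausible reading of the equation in Figure 4) and using the classic DTW recurrence:

```java
// Minimal DTW sketch over per-frame distances. The frame distance is an
// assumed reading of the report's equation: a sum of Euclidean distances
// between corresponding normalized points.
class Dtw {
    static double pointDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // distance between two frames, each given as a list of normalized points
    static double frameDist(double[][] fa, double[][] fb) {
        double sum = 0;
        for (int i = 0; i < fa.length; i++) sum += pointDist(fa[i], fb[i]);
        return sum;
    }

    // classic DTW over two sequences of frames
    static double dtw(double[][][] a, double[][][] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double cost = frameDist(a[i - 1], b[j - 1]);
                d[i][j] = cost + Math.min(d[i - 1][j - 1],
                                          Math.min(d[i - 1][j], d[i][j - 1]));
            }
        return d[n][m] / (n + m);   // crude average over the path length bound
    }
}
```

Dividing the accumulated cost by n + m is one simple way to obtain an average distance; the report does not specify its exact normalization.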
Due to the limitations of the LMC, our product can capture only those hand signs
which involve finger movements. Sign language elements that involve the limbs
and other joints are not considered.
The DTW algorithm has been used for matching gestures. It is easy to implement,
as numerous source-code implementations of DTW exist. Nevertheless, some
modification is needed to measure the difference between two gestures. A gesture
sample can be described as a series of frames, so matching two gestures amounts
to comparing two series of frames. Therefore, the distances between the frames of
the two gestures are the major concern during implementation. The equation
mentioned above is used within DTW to calculate the distance between two
frames; to aid understanding, it is shown again below.
Figure 5. Equation of calculating distance between two frames
When the system tries to recognize a gesture sample (i.e. Sample A), it compares
it with the gestures in the database using DTW to find the most similar one. The
gesture with the minimum distance from Sample A (i.e. Sample B) is considered
“matched”. However, if the user performs a gesture which does not exist in the
database, the system would still return the most similar one. Such inappropriate
recognition might lead to incorrect translation, confusing both user and listener. A
boundary is therefore added to determine whether the gesture exists in the
database. If the distances between the gesture (i.e. Sample A) and every gesture
stored in the database are all greater than the boundary, Sample A is considered
an “unknown gesture”.
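Given per-gesture DTW distances, the boundary check can be sketched as below. The precomputed distance map and the boundary value are illustrative assumptions:

```java
// Sketch of recognition with an "unknown gesture" boundary. The distances
// map stands in for DTW distances already computed against the database;
// its contents and the boundary value are illustrative assumptions.
import java.util.Map;

class Recognizer {
    // returns the best-matching gesture name, or "unknown gesture" if even
    // the closest stored gesture is farther than the boundary
    static String recognize(Map<String, Double> distances, double boundary) {
        String best = "unknown gesture";
        double bestDist = boundary;
        for (Map.Entry<String, Double> e : distances.entrySet())
            if (e.getValue() < bestDist) {
                bestDist = e.getValue();
                best = e.getKey();
            }
        return best;
    }
}
```

Seeding bestDist with the boundary makes the threshold test and the minimum search a single pass: any gesture at or beyond the boundary can never become the best match.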
Theoretically, the above equation evaluates the distance between two frames with
the same number of hands. If the user performs a gesture with two hands, a series
of two-handed frames is generated. Nevertheless, the LMC occasionally fails to
capture some data, so a few frames may record only one hand. The equation then
cannot be applied directly because of the difference in hand number. The following
approach has been adopted to tolerate this condition.
Figure 6. Frames with different hand number
Given the example above (Figure 6), if the two frames contain different numbers of
hands (i.e. two hands in frame A and one hand in frame B), the frame with fewer
hands (frame B) is considered first. In this case, we first check whether the hand in
frame B is the left or the right. Assuming it is the left hand, we ignore the right hand
in frame A and compare only the left hand in frames A and B. As half of the data in
frame A has been given up, the distance calculated by this approach should be
adjusted; adding half of the boundary value to the distance has been suggested.
Since this condition occurs only occasionally, few frames are affected, and the
distance calculated by this approach does not dominate the average distance
calculated by DTW.
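This fallback can be sketched as follows. Each frame is simplified to one representative normalized point per hand, which is an assumption made for illustration; the half-boundary penalty follows the adjustment described above.

```java
// Sketch of the fallback for frames with different hand counts: compare
// only the sides present in both frames, then add half the boundary as a
// penalty for the discarded hand. Representing a frame as one point per
// hand is a simplification for illustration.
import java.util.Map;

enum HandType { LEFT, RIGHT }

class MismatchDist {
    static double pointDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // each frame maps a present hand side to one normalized point
    static double frameDist(Map<HandType, double[]> fa,
                            Map<HandType, double[]> fb,
                            double boundary) {
        double d = 0;
        // compare only the sides present in both frames (the frame with
        // fewer hands decides which sides survive)
        for (HandType s : fa.keySet())
            if (fb.containsKey(s)) d += pointDist(fa.get(s), fb.get(s));
        // if one frame is missing a hand, half the data was given up,
        // so adjust the distance by half the boundary
        if (fa.size() != fb.size()) d += boundary / 2.0;
        return d;
    }
}
```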
1.3 GUI Design
1.3.1 Overall Design
Figure 7. Graphic User Interface (GUI) prototype
The GUI is developed with JavaFX and JavaFX Scene Builder. It contains multiple
tabs providing the different functions of the product.
The “Record” tab allows the user to set up a new gesture and store it in the
database.
The “Recognize” tab lets the user perform a gesture and outputs the preset
meaning stored in the database.
The “Logging” tab is for developers to check the performance of the program and
the algorithm.
1.3.2 Graphic Visualizer Design
In GUI, a method is required for user to know what they are doing to represent the
progress of the hand gesture to the user. A visualizer is built to solve this problem.
12
In this program, we used JavaFX as the graphic library of the visualizer. The
visualizer is a class with subscene which can be added into other group.
Figure 8. LMC official visualizer
Figure 9. Visualizer in this project
Inspired by the Leap Motion official visualizer (Figure 8), we decided to render the
hand as a skeleton only in our built-in visualizer (Figure 9). Compared with the
official visualizer, our skeleton hand omits unnecessary detail while keeping a
recognizable appearance.
In addition, a replay function is needed so that the visualizer can show a stored
gesture to the user in an understandable way. Hence, the visualizer must update
the screen both with the live input from the LMC and with gestures retrieved from
the database.
1.3.3 Multilingual GUI Output Design
The user can turn on recognition mode through the GUI. Our system supports
Cantonese and English translation. In recognition mode, the system continuously
compares the performed gesture with the gestures stored in the database. If the
gesture is recognized, the corresponding text is displayed in the text area of the
GUI. The word stored in the database is also sent to the Google text-to-speech
service, and the speech for the text is played.
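The dispatch step above can be sketched as follows. The TextToSpeech interface is a hypothetical stand-in for the Google text-to-speech call, and the language codes and database layout are assumptions for illustration:

```java
// Hedged sketch of the multilingual output step. TextToSpeech is a
// hypothetical stand-in for the Google text-to-speech call; the database
// layout and language codes are illustrative assumptions.
import java.util.Map;

class OutputDispatcher {
    interface TextToSpeech { void speak(String text, String languageCode); }

    // per-gesture translations keyed by language code ("en" = English)
    static String lookup(Map<String, Map<String, String>> db,
                         String gesture, String language) {
        return db.getOrDefault(gesture, Map.of())
                 .getOrDefault(language, "unknown gesture");
    }

    static void output(Map<String, Map<String, String>> db, String gesture,
                       String language, TextToSpeech tts) {
        String text = lookup(db, gesture, language);
        // the text would go to the GUI text area; the speech to the TTS service
        tts.speak(text, language);
    }
}
```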
2. Discussion
Using the LMC to translate sign language to speech is the main purpose of this
project. A database with 50 signs and more than 500 samples in total has been
recorded. The test results indicate that our product recognizes a gesture with more
than 90 percent accuracy. This shows that sign language can be translated
accurately using the LMC. Nevertheless, our product is still restricted by the
limitations of the LMC.
These limitations restrict the choice of gestures users can perform. As the LMC
captures hand movement with its infrared cameras from one direction only, the
view is blocked if the user’s fingers overlap. Although the LMC tries to predict the
positions of the fingers whenever its view is blocked, the prediction is not accurate
enough, and the inaccurate raw data may cause recognition errors.
Not only is the effectiveness of capturing gestures a concern; the detection range
is also a problem. The field of view in which the LMC can capture data is bounded
by the distance restriction of its infrared cameras: about 150 degrees, and
approximately 3 to 60 centimeters above the device. Some signs cannot be
performed properly within this narrow detection range. This distorts the
representation of gestures involving movements around the chest and head,
inducing unpredictable recognition flaws.
To work around these problems, we redefined new gestures for the signs that are
unsuitable for detection by the LMC. First, gestures in which fingers block the
cameras’ view must be avoided. Second, gestures must be performed close to the
LMC. Third, the replacement gestures must be similar to the original standard. As
a result, some signs provided by our product differ from the official sign language.
Sign language in Hong Kong includes a vocabulary of over 1000 signs [2].
Although our team provides a standard database with 50 signs and more than 500
samples, it might be inadequate. Therefore, a sign recording function has been
implemented to allow users to record standard signs performed by themselves as
well as self-defined gestures. This customization feature turns our product into a
personal product: it works best for the users themselves but not for others,
because of differences in hand size and in the way self-defined gestures are
performed. The recorded data fit the hands and style of their author best, so users
can train our product to further improve accuracy.
Our first intention was to implement the sign language translator on a mobile
platform. A portable translator could be used conveniently rather than requiring the
user to sit in front of a computer, which would definitely lower the language barrier
between deaf people and the hearing. Nonetheless, the computation power of
smartphones is far from sufficient for the LMC. It requires a powerful CPU such as
an Intel Core i3/i5/i7 or AMD Phenom II [3], which is currently only available in
laptop or desktop computers. It is difficult to implement our sign language
translator on a smartphone right now, but it will probably be possible in the future.
Take the iPhone 6s Plus [4] as an example: it has a 1.85 GHz dual-core 64-bit
ARM CPU. Its hardware specification is close to the requirement of the LMC but
still not sufficient. The computation power of smartphones is anticipated to keep
increasing, and it will probably meet the requirement of the LMC within several
years or even several months.
This product has great potential for further development in sign language
recognition. Our group has demonstrated the capability and efficiency of the DTW
algorithm applied to gesture recognition. DTW is a relatively simple and easily
understood algorithm for undergraduate students compared with many
sophisticated algorithms and models. If further development is carried out by
developers with advanced skills and knowledge, more advanced features could be
added, such as natural language processing and artificial intelligence. Translating
sign language into a full sentence with correct grammar could be a possible result
of their work. Deaf people might speak with us through the LMC without any
language barrier in the future.
3. References
[1] R. Niels, “Dynamic Time Warping: An intuitive way of handwriting
recognition?”, M.S. thesis, Dept. A.I. Eng., Radboud University Nijmegen,
Netherlands, 2004.
[2] “CUHK Launches the First Hong Kong Sign Language Browser in Hong Kong
Creating a Communication Platform between Deaf and Hearing People”,
Communications and Public Relations Office, CUHK, para. 3, January 23,
2013 [Online].
Available: https://www.cpr.cuhk.edu.hk/en/press_detail.php?id=1471
[3] “Leap Motion Developers”, Developer.leapmotion.com, 2016 [Online].
Available: https://developer.leapmotion.com/
[4] “Apple”, Apple, 2016 [Online].
Available: http://www.apple.com/