Using the Leap Motion Controller to Translate Sign Language to Speech
Name: Tam Chi Yan, Leung Ka Chun, Cheung Yat Laam, To Wun Yin
ID:
School: Engineering
Department: Computer Science
Year of study: 4
Email: -
Phone number: -
Abstract
This project aims to develop a sign language translator to improve the quality of
communication between deaf people and the general public. It returns speech and
text when the user performs sign language in front of the Leap Motion Controller
(LMC). The LMC is used to capture hand gesture images and convert them into
positional and directional information. These data are compared with the data in
the database to determine the most similar sign using the Dynamic Time Warping
(DTW) algorithm. DTW is an algorithm that measures the similarity between
sequences or time series which may vary in time [1]; it finds an optimal alignment
between two time series. Once a gesture has been recognized, its corresponding
speech is played and its meaning is displayed as text.
Our product recognizes a gesture with more than 90% accuracy over a set of 50
gestures, each with 10 recorded samples, in less than 2 seconds after the action is
performed. This shows the possibility and effectiveness of recognizing sign
language using the LMC, and it may help eliminate the language barrier between
deaf people and the hearing in the future.
Table of Contents
Abstract
Table of Contents
1. Detail Description
1.1 Data Structure
1.1.1 Data Structure of Leap Motion
1.1.2 Data Structure of Our System
1.2 Gesture Matching Algorithm
1.3 GUI Design
1.3.1 Overall Design
1.3.2 Graphic Visualizer Design
1.3.3 Multilingual GUI Output Design
2. Discussion
3. References
1. Detail Description
The goal of this project is to develop a Human-Computer Interaction (HCI)
application to improve the quality of communication between deaf people and the
general public in Hong Kong. The application uses the cameras on the LMC to
capture hand gestures and then looks up the corresponding sign language, so that
the recognized gestures can be translated into text and speech in real time.
The system receives data from the LMC, which captures the hand movement. The
data are then compared with the gestures stored in the database. If a sufficiently
similar gesture is found, its meaning is displayed as text on the screen and the
corresponding speech is played.
1.1 Data Structure
1.1.1 Data Structure of Leap Motion
The data sent from the LMC is a series of instances of class Frame. Each Frame
object provides the information of the recognized hands in one frame, including
their directions and coordinates. Only part of the data in each Frame instance is
stored, in order to reduce the size of the database.
1.1.2 Data Structure of Our System
To handle the data from the controller, we introduce class Coordinate and
enumeration HandType for managing three-dimensional coordinates and
representing the side of a recognized hand respectively. They are the major
components of classes FingerData and PalmData, which organize the information
related to the captured fingers and palms respectively.
FingerData, PalmData and HandType form a customized frame class, OneFrame,
which replaces the bulky class Frame from Leap Motion. An array of OneFrame
objects serves as a simplified record of an input gesture from the controller. It is the
fundamental part of class Sample.
A set of objects of class Sample is the basis on which our program identifies a
particular gesture. Each gesture has a unique name and also stores the number of
fingers, the number of palms and the hand types involved, to enable faster
comparison among signs. These elements define class Sign, the representation of
a sign in our system.
Figure 1. UML class diagram of database
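The classes in the diagram above can be sketched as follows. The class and enumeration names follow the report; the concrete fields are illustrative assumptions, not the project's exact implementation.

```java
// Sketch of the database classes described above. Names follow the report
// (Coordinate, HandType, FingerData, PalmData, OneFrame, Sample, Sign);
// the fields shown are illustrative assumptions.
import java.util.List;

enum HandType { LEFT, RIGHT }              // side of a recognized hand

class Coordinate {                         // three-dimensional coordinate
    final double x, y, z;
    Coordinate(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
}

class FingerData {                         // one captured finger
    Coordinate tip;                        // normalized fingertip coordinate
    HandType hand;                         // hand the finger belongs to
}

class PalmData {                           // one captured palm
    Coordinate position;                   // normalized palm coordinate
    HandType hand;
}

class OneFrame {                           // simplified replacement for Leap's Frame
    List<FingerData> fingers;
    List<PalmData> palms;
}

class Sample {                             // one recorded performance of a gesture
    OneFrame[] frames;                     // the gesture as a series of frames
}

class Sign {                               // representation of a sign in the system
    String name;                           // unique gesture name
    int fingerCount, palmCount;            // stored for faster pre-filtering
    List<HandType> hands;
    List<Sample> samples;
}
```

Keeping fingerCount, palmCount and the hand types directly on Sign lets the matcher skip whole signs without touching their frame data, which is the faster comparison the report mentions.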
1.2 Gesture Matching Algorithm
Whenever a gesture is captured by the LMC, it is sent to our system to be
compared with the gestures stored in the database using the Dynamic Time
Warping (DTW) algorithm. DTW measures the similarity between sequences or
time series which may vary in time; it finds an optimal alignment between two time
series. One of the series may be “warped” non-linearly by stretching or shrinking its
time axis, and the optimal alignment can be used to determine the similarity
between the two series. The recognition algorithm mainly considers the similarity
between the given data (the normalized coordinates of fingertips and palms) and
the data in the database, gesture by gesture. There are numerous studies of DTW;
a thesis by Ralph Niels presents its basic principles [1].
The distance calculation for the alignment between two sequences in DTW is the
major concern in this project. An algorithm has been introduced to calculate the
differences between the gesture captured by the LMC and those stored in the
database. As errors may be caused by gestures beginning at different coordinates,
normalized coordinates of the fingers and palms are calculated in each frame. This
reduces inconsistency before we calculate the distance between two frames. The
following approach has been adopted.
Consider a frame, called “frame n”, as shown in Figure 2. For each finger, the
normalized finger coordinate is the fingertip coordinate relative to the palm
coordinate in the same frame. It preserves the movement of the fingers while
eliminating the error mentioned above.
Figure 2. Calculation of normalized finger coordinate
The normalized palm coordinate is the palm coordinate in “frame n” relative to that
in the first frame of the Sample (i.e. frame 0). It preserves the movement of the
palm.
Figure 3. Calculation of normalized palm coordinate
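Under this reading, both normalizations reduce to vector subtraction. The formulas in Figures 2 and 3 are not reproduced here, so the following is an assumed sketch of the two steps:

```java
// Hedged sketch of the two normalization steps, assuming each is a
// plain vector subtraction (an interpretation of Figures 2 and 3).
class Vec3 {
    final double x, y, z;
    Vec3(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
    Vec3 minus(Vec3 o) { return new Vec3(x - o.x, y - o.y, z - o.z); }
}

class Normalizer {
    // Normalized finger coordinate: fingertip relative to the palm in the
    // same frame, so gestures starting at different positions yield the
    // same finger data.
    static Vec3 normalizeFinger(Vec3 fingertip, Vec3 palm) {
        return fingertip.minus(palm);
    }

    // Normalized palm coordinate: palm in frame n relative to the palm in
    // the first frame (frame 0), preserving the palm's movement.
    static Vec3 normalizePalm(Vec3 palmFrameN, Vec3 palmFrame0) {
        return palmFrameN.minus(palmFrame0);
    }
}
```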
The normalized coordinates of fingertips and palms are used to calculate the
distance between two frames. The following equation has been suggested for this
purpose. For example, given a frame from Sample A (i.e. Frame A) and a frame
from Sample B (i.e. Frame B), the distance between them is shown in Figure 4.
Figure 4. Equation of calculating distance between two frames
The normalized coordinates of fingers and palms preserve the movement of
fingers and palms, and the distance computed with this equation compares the
properties of two frames. Once the distances between all pairs of frames are
calculated, DTW can find the optimal alignment and the average distance between
two gestures.
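A minimal sketch of this step, assuming the frame distance is a sum of Euclidean distances between corresponding normalized points (one plausible reading of the equation in Figure 4) and using the classic DTW recurrence:

```java
// Minimal DTW sketch over per-frame distances. The frame distance is an
// assumed reading of the report's equation: a sum of Euclidean distances
// between corresponding normalized points.
class Dtw {
    static double pointDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // distance between two frames, each given as a list of normalized points
    static double frameDist(double[][] fa, double[][] fb) {
        double sum = 0;
        for (int i = 0; i < fa.length; i++) sum += pointDist(fa[i], fb[i]);
        return sum;
    }

    // classic DTW over two sequences of frames
    static double dtw(double[][][] a, double[][][] b) {
        int n = a.length, m = b.length;
        double[][] d = new double[n + 1][m + 1];
        for (double[] row : d) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        d[0][0] = 0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                double cost = frameDist(a[i - 1], b[j - 1]);
                d[i][j] = cost + Math.min(d[i - 1][j - 1],
                                          Math.min(d[i - 1][j], d[i][j - 1]));
            }
        return d[n][m] / (n + m);   // crude average over the path length bound
    }
}
```

Dividing the accumulated cost by n + m is one simple way to obtain an average distance; the report does not specify its exact normalization.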
Due to the limitations of the LMC, our product can capture only those hand signs
which involve finger movements. Sign language elements that involve the limbs
and other joints are not considered.
The DTW algorithm has been used for matching gestures. It is easy to implement,
as numerous source-code implementations of DTW exist. Nevertheless, some
modification is needed to measure the difference between two gestures. A gesture
sample can be described as a series of frames, so matching two gestures amounts
to comparing two series of frames. Therefore, the distances between the frames of
the two gestures are the major concern during implementation. The equation
mentioned above is used within DTW to calculate the distance between two
frames; to aid understanding, it is shown again below.
Figure 5. Equation of calculating distance between two frames
When the system tries to recognize a gesture sample (i.e. Sample A), it compares
it with the gestures in the database using DTW to find the most similar one. The
gesture with the minimum distance from Sample A (i.e. Sample B) is considered
“matched”. However, if the user performs a gesture which does not exist in the
database, the system would still return the most similar one. Such inappropriate
recognition might lead to incorrect translation, confusing both user and listener. A
boundary is therefore added to determine whether the gesture exists in the
database. If the distances between the gesture (i.e. Sample A) and every gesture
stored in the database are all greater than the boundary, Sample A is considered
an “unknown gesture”.
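Given per-gesture DTW distances, the boundary check can be sketched as below. The precomputed distance map and the boundary value are illustrative assumptions:

```java
// Sketch of recognition with an "unknown gesture" boundary. The distances
// map stands in for DTW distances already computed against the database;
// its contents and the boundary value are illustrative assumptions.
import java.util.Map;

class Recognizer {
    // returns the best-matching gesture name, or "unknown gesture" if even
    // the closest stored gesture is farther than the boundary
    static String recognize(Map<String, Double> distances, double boundary) {
        String best = "unknown gesture";
        double bestDist = boundary;
        for (Map.Entry<String, Double> e : distances.entrySet())
            if (e.getValue() < bestDist) {
                bestDist = e.getValue();
                best = e.getKey();
            }
        return best;
    }
}
```

Seeding bestDist with the boundary makes the threshold test and the minimum search a single pass: any gesture at or beyond the boundary can never become the best match.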
Theoretically, the above equation evaluates the distance between two frames with
the same number of hands. If the user performs a gesture with two hands, a series
of two-handed frames is generated. Nevertheless, the LMC occasionally fails to
capture some data, so a few frames may record only one hand. The equation then
cannot be applied directly because of the difference in hand number. The following
approach has been adopted to tolerate this condition.
Figure 6. Frames with different hand number
Given the example above (Figure 6), if the two frames contain different numbers of
hands (i.e. two hands in frame A and one hand in frame B), the frame with fewer
hands (frame B) is considered first. In this case, we first check whether the hand in
frame B is the left or the right. Assuming it is the left hand, we ignore the right hand
in frame A and compare only the left hand in frames A and B. As half of the data in
frame A has been given up, the distance calculated by this approach should be
adjusted; adding half of the boundary value to the distance has been suggested.
Since this condition occurs only occasionally, few frames are affected, and the
distance calculated by this approach does not dominate the average distance
calculated by DTW.
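This fallback can be sketched as follows. Each frame is simplified to one representative normalized point per hand, which is an assumption made for illustration; the half-boundary penalty follows the adjustment described above.

```java
// Sketch of the fallback for frames with different hand counts: compare
// only the sides present in both frames, then add half the boundary as a
// penalty for the discarded hand. Representing a frame as one point per
// hand is a simplification for illustration.
import java.util.Map;

enum HandType { LEFT, RIGHT }

class MismatchDist {
    static double pointDist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    // each frame maps a present hand side to one normalized point
    static double frameDist(Map<HandType, double[]> fa,
                            Map<HandType, double[]> fb,
                            double boundary) {
        double d = 0;
        // compare only the sides present in both frames (the frame with
        // fewer hands decides which sides survive)
        for (HandType s : fa.keySet())
            if (fb.containsKey(s)) d += pointDist(fa.get(s), fb.get(s));
        // if one frame is missing a hand, half the data was given up,
        // so adjust the distance by half the boundary
        if (fa.size() != fb.size()) d += boundary / 2.0;
        return d;
    }
}
```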
1.3 GUI Design
1.3.1 Overall Design
Figure 7. Graphic User Interface (GUI) prototype
The GUI is developed with JavaFX and JavaFX Scene Builder. It contains multiple
tabs providing the different functions of the product.
The “Record” tab allows the user to set up a new gesture and store it in the
database.
The “Recognize” tab lets the user perform a gesture and outputs the preset
meaning stored in the database.
The “Logging” tab is for developers to check the performance of the program and
the algorithm.
1.3.2 Graphic Visualizer Design
In GUI, a method is required for user to know what they are doing to represent the
progress of the hand gesture to the user. A visualizer is built to solve this problem.
12
In this program, we used JavaFX as the graphic library of the visualizer. The
visualizer is a class with subscene which can be added into other group.
Figure 8. LMC official visualizer
Figure 9. Visualizer in this project
Inspired by the Leap Motion official visualizer (Figure 8), we decided to render the
hand as a skeleton only in our built-in visualizer (Figure 9). Compared with the
official visualizer, our skeleton hand omits unnecessary detail while keeping a
recognizable appearance.
In addition, a replay function is needed so that the visualizer can show a stored
gesture to the user in an understandable way. Hence, the visualizer must update
the screen both with the live input from the LMC and with gestures retrieved from
the database.
1.3.3 Multilingual GUI Output Design
The user can turn on recognition mode through the GUI. Our system supports
Cantonese and English translation. In recognition mode, the system continuously
compares the performed gesture with the gestures stored in the database. If the
gesture is recognized, the corresponding text is displayed in the text area of the
GUI. The word stored in the database is also sent to the Google text-to-speech
service, and the speech for the text is played.
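The dispatch step above can be sketched as follows. The TextToSpeech interface is a hypothetical stand-in for the Google text-to-speech call, and the language codes and database layout are assumptions for illustration:

```java
// Hedged sketch of the multilingual output step. TextToSpeech is a
// hypothetical stand-in for the Google text-to-speech call; the database
// layout and language codes are illustrative assumptions.
import java.util.Map;

class OutputDispatcher {
    interface TextToSpeech { void speak(String text, String languageCode); }

    // per-gesture translations keyed by language code ("en" = English)
    static String lookup(Map<String, Map<String, String>> db,
                         String gesture, String language) {
        return db.getOrDefault(gesture, Map.of())
                 .getOrDefault(language, "unknown gesture");
    }

    static void output(Map<String, Map<String, String>> db, String gesture,
                       String language, TextToSpeech tts) {
        String text = lookup(db, gesture, language);
        // the text would go to the GUI text area; the speech to the TTS service
        tts.speak(text, language);
    }
}
```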
2. Discussion
Using the LMC to translate sign language to speech is the main purpose of this
project. A database with 50 signs and more than 500 samples in total has been
recorded. The test results indicate that our product recognizes a gesture with more
than 90 percent accuracy. This shows that sign language can be translated
accurately using the LMC. Nevertheless, our product is still restricted by the
limitations of the LMC.
These limitations restrict the choice of gestures users can perform. As the LMC
captures hand movement with its infrared cameras from one direction only, the
view is blocked if the user’s fingers overlap. Although the LMC tries to predict the
positions of the fingers whenever its view is blocked, the prediction is not accurate
enough, and the inaccurate raw data may cause recognition errors.
Not only is the effectiveness of capturing gestures a concern; the detection range
is also a problem. The field of view in which the LMC can capture data is bounded
by the distance restriction of its infrared cameras: about 150 degrees, and
approximately 3 to 60 centimeters above the device. Some signs cannot be
performed properly within this narrow detection range. This distorts the
representation of gestures involving movements around the chest and head,
inducing unpredictable recognition flaws.
To work around these problems, we redefined new gestures for the signs that are
unsuitable for detection by the LMC. First, gestures in which fingers block the
cameras’ view must be avoided. Second, gestures must be performed close to the
LMC. Third, the replacement gestures must be similar to the original standard. As
a result, some signs provided by our product differ from the official sign language.
Sign language in Hong Kong includes a vocabulary of over 1000 signs [2].
Although our team provides a standard database with 50 signs and more than 500
samples, it might be inadequate. Therefore, a sign recording function has been
implemented to allow users to record standard signs performed by themselves as
well as self-defined gestures. This customization feature turns our product into a
personal product: it works best for the users themselves but not for others,
because of differences in hand size and in the way self-defined gestures are
performed. The recorded data fit the hands and style of their author best, so users
can train our product to further improve accuracy.
Our first intention was to implement the sign language translator on a mobile
platform. A portable translator could be used conveniently rather than requiring the
user to sit in front of a computer, which would definitely lower the language barrier
between deaf people and the hearing. Nonetheless, the computation power of
smartphones is far from sufficient for the LMC. It requires a powerful CPU such as
an Intel Core i3/i5/i7 or AMD Phenom II [3], which is currently only available in
laptop or desktop computers. It is difficult to implement our sign language
translator on a smartphone right now, but it will probably be possible in the future.
Take the iPhone 6s Plus [4] as an example: it has a 1.85 GHz dual-core 64-bit
ARM CPU. Its hardware specification is close to the requirement of the LMC but
still not sufficient. The computation power of smartphones is anticipated to keep
increasing, and it will probably meet the requirement of the LMC within several
years or even several months.
This product has great potential for further development in sign language
recognition. Our group has demonstrated the capability and efficiency of the DTW
algorithm applied to gesture recognition. DTW is a relatively simple and easily
understood algorithm for undergraduate students compared with many
sophisticated algorithms and models. If further development is carried out by
developers with advanced skills and knowledge, more advanced features could be
added, such as natural language processing and artificial intelligence. Translating
sign language into a full sentence with correct grammar could be a possible result
of their work. Deaf people might speak with us through the LMC without any
language barrier in the future.
3. References
[1] R. Niels, “Dynamic Time Warping: An intuitive way of handwriting
recognition?”, M.S. thesis, Dept. A.I. Eng., Radboud University Nijmegen,
Netherlands, 2004.
[2] “CUHK Launches the First Hong Kong Sign Language Browser in Hong Kong
Creating a Communication Platform between Deaf and Hearing People”,
Communications and Public Relations Office, CUHK, para. 3, January 23,
2013 [Online].
Available: https://www.cpr.cuhk.edu.hk/en/press_detail.php?id=1471
[3] “Leap Motion Developers”, Developer.leapmotion.com, 2016 [Online].
Available: https://developer.leapmotion.com/
[4] “Apple”, Apple, 2016 [Online].
Available: http://www.apple.com/