Turkish Journal of Electrical Engineering & Computer Sciences
http://journals.tubitak.gov.tr/elektrik/
Research Article
Turk J Elec Eng & Comp Sci (2015) 23: 2107–2123
© TÜBİTAK
doi:10.3906/elk-1303-167
A real-time American Sign Language word recognition system based on neural
networks and a probabilistic model
Neelesh SARAWATE1, Ming Chan LEU2, Cemil OZ3,*
1 Department of Mechanical Engineering, Ohio State University, Columbus, OH, USA
2 Department of Mechanical and Aerospace Engineering, Missouri University of Science and Technology, Rolla, MO, USA
3 Department of Computer Engineering, Faculty of Computer and Information Sciences, Sakarya University, Sakarya, Turkey
* Correspondence: [email protected]
Received: 22.03.2013 • Accepted/Published Online: 31.07.2013 • Printed: 31.12.2015
Abstract: The development of an American Sign Language (ASL) word recognition system based on neural networks
and a probabilistic model is presented. We use a CyberGlove and a Flock of Birds motion tracker to extract the gesture
data. The finger joint angle data obtained from the sensory glove defines the handshape while the data from the motion
tracker describes the trajectory of the hand movement. The four gesture features, namely handshape, hand position,
hand orientation, and hand movement, are recognized using different functions that include backpropagation neural
networks. The sequence of these features is used to generate a specific sign or word in ASL based on a probabilistic
model. The system can recognize ASL signs in real time and update its database interactively. The system
has an accuracy of 95.4% over a vocabulary of 40 ASL words.
Key words: American Sign Language, American Sign Language recognition, virtual reality, artificial neural network,
probabilistic model
1. Introduction
Deafness is a serious disability. It is defined as a degree of hearing impairment such that a person is
unable to understand speech. Every country has a deaf community; while most of these individuals are born
with hearing loss, others lose their hearing as children or even as adults [1]. Almost all deaf communities
have their own sign language, which differs greatly from the surrounding spoken language. The United
Kingdom and the United States both use English, but American Sign Language (ASL) was derived from French
Sign Language by French teachers and is therefore different from British Sign Language (BSL). Turkish Sign
Language (TSL) and BSL are more similar to each other than to ASL because TSL was created by British
teachers. As a result, it is often very difficult for deaf people to communicate with people outside the deaf
community.
Deaf people in the United States and parts of Canada communicate with ASL [2], but few hearing people
other than the family members of deaf people understand it. An automatic machine translation system between
deaf and hearing people would therefore aid communication. Such a system needs two parts: the first converts
ASL signs to text and speech, and the second converts speech to text and sign language animation. Most deaf
people who know ASL can read English text. There are some successful speech recognition programs, like the
Microsoft Speech SDK, but converting ASL signs to English text is a very challenging problem due to the
complex body movements involved.
Generally, research on hearing devices receives more support than research on sign language recognition,
which affects ASL recognition studies; nevertheless, there are many ASL and gesture recognition studies.
Some ASL studies are based on computer vision with a specially set up high-frame-rate camera, where the
user needs to wear a colored glove, a special suit, or markers to increase the accuracy of the system. Most of
these systems are not portable, and some, like the VICON or OptiTrack systems, are not affordable. They
require complicated data processing and have slower recognition rates. The development of some successful
computer vision-based ASL recognition systems can be found in [3–10]. There are also successful ASL studies
using data gloves and motion trackers. These ASL recognition systems use a special sensory glove and a motion
tracker to detect handshapes and hand movements. Some ASL recognition researchers combine sensor devices
and computer vision systems to get better results.
Between 1980 and 1990, high-accuracy virtual reality devices were developed to interact with the virtual
world, like data gloves, haptics, or Flock of Birds. There have been several ASL studies based on data gloves
and Flock of Birds using artificial neural networks (ANNs), hidden Markov models (HMMs), and statistical
methods. The research based on neural networks includes work done by Fels and Hinton [11,12] and by Waldron
and Kim to recognize ASL signs [13]. Recent attempts in sign language recognition involve the use of HMMs.
Gao et al. did considerable work in Chinese Sign Language (CSL) recognition [14–18] using sensory gloves and
motion trackers. Their work included the use of ANNs and HMMs for continuous recognition of CSL. Liang and
Ouhyoung worked on continuous Taiwanese Sign Language recognition using sensory gloves [19,20].
Oz et al. developed an ASL word recognition system using ANNs [21,22]. Other attempts include the
study by Kim et al. [23] to recognize Korean Sign Language words using neural networks. Some recent neural
network-based ASL studies were done by Qutaishat et al. and by Stergiopolou and Papamarkos [24,25]. Lee
and Xu [26] recognized 14 letters of the ASL alphabet for a human/robot interface using HMMs. Wang et al.
developed an ASL alphabet recognition system using HMMs [27]. The applications of gesture recognition
research are not limited to sign language recognition: gesture recognition systems are also useful for
human/robot interfaces and virtual reality [28–30].
This paper discusses the development of a real-time ASL sign recognition system using neural networks
and a probabilistic model. A sign is composed of four features that are recognized by four neural network
classifiers, and then the sequence of network outputs is analyzed using the probabilistic model. The sign output
is then converted to English text and speech. We use a CyberGlove sensory glove and Flock of Birds motion
tracker for collecting the data. Other components include a velocity network, developed to detect the start
and end of signing based on the velocity of the hand. The system is also capable of online training of ASL
signs, which allows the user to add new words to the vocabulary without retraining the whole system.
2. Experimental setup and procedure
Standard computer input and output devices such as the mouse, keyboard, and monitor are designed to be used
while sitting in a chair, and they constrain the user. Virtual reality applications need specially designed
devices, like gloves, motion trackers, and head-mounted displays, that give the user physical flexibility.
These interaction-based input devices were designed to overcome the constraints of conventional peripherals [31].
A CyberGlove and Flock of Birds are used to track bending fingers, hand position, and hand orientation.
The CyberGlove has 18 sensors: three strain gauges for the palm, three for the thumb, two for each of the
four fingers, and four between the fingers. Each strain gauge converts its degree of bending to a number
between 1 and 255. The CyberGlove data, except the three palm values, are used for classifying handshapes
and ASL alphabet letters. The glove is shown in Figure 1.
The Flock of Birds has three parts: the system unit, the transmitter, and the receiver. The receiver is
attached to the right wrist and captures the hand position and orientation. The transmitter produces a pulsed
DC magnetic field with a range of about 180 cm, within which the position and orientation of the right wrist
are tracked. Figure 2 shows the Flock of Birds motion tracker system.

Figure 1. CyberGlove with 18 sensors.

Figure 2. Flock of Birds 3-D motion tracker.
Figure 3 shows the block diagram of our ASL word recognition system. The CyberGlove and the Flock
of Birds tracker have their own system units to get fast and reliable data, and each system unit is connected
to the computer system through an RS-232 serial port. The velocity network listens to the Flock of Birds data
and calculates hand velocity. It classifies the hand velocity at any time instant as “signing” or
“not signing”. Feature vectors are extracted and sensor data are recorded while the output of the velocity
network is “signing”. When the output changes from “signing” to “not signing”, the feature and data
recording for the given sign is complete, and the sign recognition phase starts. A probabilistic model based
on Markov chains and HMMs is used to process the outputs of the four networks. This model gives the sign
output based on the trained database. There is a provision to train the system online and to update the present
database in an interactive fashion. A VRML model of the hand is continuously updated, which shows the
position of the fingers and hand on the computer screen along with the sign output. The Microsoft Speech SDK
produces the voice of the sign output text. All these different units are discussed in detail in the subsequent
sections.
3. Handshape recognition using neural networks
Each sign in ASL can be considered to be composed of a combination of 36 basic handshapes. Handshapes
form the most important and interesting features of an ASL sign. During the ASL signing, the dominant hand,
usually the right hand, does the majority of the signing. There are one-handed signs that rely simply on the
dominant hand, symmetric two-handed signs, where the dominant and nondominant hands perform the same
action, and nonsymmetrical two-handed signs, where both hands are often performing different actions. Figure
4 shows the ASL handshapes.
Figure 3. The overall structure of our system. (The CyberGlove and Flock of Birds supply 15 joint angles, the position (x, y, z), and the orientation (α, β, γ) at 33.33 Hz. These feed the velocity network and, at T selected instants, the handshape network (15 angles), the position network (x, y, z), the orientation network (α, β, γ), and the movement function (Δx, Δy, Δz). Their outputs enter the probability model, which is updated/trained online from interactive user feedback. The recognized sign output is spoken by the Microsoft Speech SDK and displayed on screen with an Open Inventor VRML hand model.)
We have used ANNs to solve the handshape recognition problem. The 26 letters of the ASL alphabet, which
are used for spelling nouns and words not in the ASL dictionary, are very similar in nature to the handshapes,
so a similar solution can also be used for alphabet recognition. ANNs [12] are composed of several computing
elements operating in parallel and arranged in a pattern reminiscent of the biological nervous system. The input
and output nodes are interconnected by links called weights. In the training phase, the network learns to
perform a particular function by adjusting the weights and developing associations between the input space and
the output space. In the recognition phase, a fully trained network can predict an output for an unknown input
using the learned associations. The backpropagation algorithm is a supervised learning algorithm in which
sample inputs and outputs are presented to the network in the training phase. This is the algorithm we use for
the recognition of ASL handshapes.
A two-stage network, as shown in Figure 5, is developed for handshape recognition. The input layer
consists of 15 nodes (p1 to p15), one for each finger joint angle value. The output layer consists of 36
nodes (a1 to a36), each representing a different handshape. The initial weights and biases are set to random
values. The handshape recognition ANN model has 15 input neurons, 20 hidden layer neurons, and 36 output
neurons. Hence, the weight matrix between the input and hidden layers has a size of (20 × 15) and the weight
matrix between the hidden and output layers has a size of (36 × 20). A log-sigmoid transfer function is used
for calculating the hidden layer and output layer values, so each network output lies between 0 and 1. The
target vector consists of 36 elements, all set to 0 except one, which is set to 1; the index of the element
set to 1 defines the handshape.
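As a rough illustration, the forward pass of this 15-20-36 log-sigmoid network can be sketched as follows (a minimal sketch; the randomly initialized weights stand in for trained ones, and the function and variable names are our own, not from the paper):

```python
import numpy as np

def logsig(x):
    # Log-sigmoid transfer function, used in both layers
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Weight matrices sized as in the text: (20 x 15) and (36 x 20), plus biases
W1, b1 = rng.standard_normal((20, 15)), rng.standard_normal(20)
W2, b2 = rng.standard_normal((36, 20)), rng.standard_normal(36)

def handshape_forward(p):
    """p: 15 scaled finger-joint angles -> 36 handshape scores in (0, 1)."""
    a1 = logsig(W1 @ p + b1)   # hidden layer (20 neurons)
    a2 = logsig(W2 @ a1 + b2)  # output layer (36 neurons)
    return a2

p = rng.uniform(-1, 1, 15)               # one scaled sensor reading
scores = handshape_forward(p)
handshape = int(np.argmax(scores)) + 1   # winning handshape index, 1..36
```

In recognition, the index of the largest output plays the role of the target element set to 1 during training.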
Network training and testing are done in MATLAB using the Levenberg–Marquardt algorithm (TRAINLM) and the
resilient backpropagation algorithm (TRAINRP). We trained and tested the network with between 6 and 80 hidden
layer neurons, using different numbers of training samples from multiple users. Twenty hidden layer neurons
give good performance in terms of accuracy and training time for both TRAINLM and TRAINRP.
Figure 4. The 36 basic handshapes in ASL.
Figure 5. Neural network for handshape recognition.
The Levenberg–Marquardt algorithm has large memory requirements and a long training time. Comparisons
between the Levenberg–Marquardt and resilient backpropagation algorithms are given in Figures 6 and 7 for
16 training samples.
Both algorithms were also checked for a multiuser system. The system, trained by 3 and by 7 users, was
tested by users involved in the training as well as by completely new users. The test results are given in
Table 1. Considering its comparable accuracy and low training time and memory requirements, TRAINRP is
considered the better solution.
Figure 6. Accuracy comparison of TRAINLM and TRAINRP (accuracy (%) versus number of hidden layer nodes).

Figure 7. Training time comparison of TRAINLM and TRAINRP (training time (s) versus number of hidden layer nodes).
Table 1. Handshape recognition results for multiuser system.

            | Trained by 3 users       | Trained by 7 users
Algorithm   | Same users | New users   | Same users | New users
TRAINLM     | 93.51      | 80.09       | 92.09      | 82.17
TRAINRP     | 87.96      | 72.22       | 95.72      | 85.41
4. Velocity network and data collection
The system is developed for recognition of ‘isolated’ signs. Hence, it is assumed that there is a distinct pause
between the two successive signs. The signing period is defined as the period over which the signing action is
performed. Typically, during this period, the hand is moving. When the hand stops moving, it is considered as
the end of the sign. From the viewpoint of real-time sign recognition, it is necessary to identify the beginning
and the end of the sign. The ‘velocity’ of the hand at a given time instant gives a good indication of whether
the hand is moving or steady. However, due to the high frequency of data collection, the instantaneous velocity
value can rise or drop suddenly due to momentary shaking or stopping of the hand. The ‘threshold velocity’
approach considers the hand velocity at only one time instant and hence fails to achieve the desired purpose.
Figure 8 shows the velocity graph for sign BELL and the output of the threshold approach. Hence, the history
of velocity values for a few previous time instants should be considered to determine whether the hand is steady
or moving.
Neural networks provide a good solution to this problem. ANNs [32] and the backpropagation training
algorithm were described in Section 3; the same supervised learning approach is used here for the velocity
network and for the recognition of three of the features.
A two-stage neural network, the ‘velocity network’, is used to classify the hand at a given time instant as
steady (0) or moving (1). After numerous trials, the input to the network is decided as ‘summation of velocity
values during past Tv instants including the current one’. In our case, after a few trials, Tv is decided as 5.
The velocity is given by Eq. (1). In our case, Δt has a constant value of 0.03 s.

v(t) = √(Δx(t)² + Δy(t)² + Δz(t)²) / Δt,    (1)

where
Δx(t) = x(t) − x(t − 1),
Δy(t) = y(t) − y(t − 1),
Δz(t) = z(t) − z(t − 1).
The input to the velocity network at a given time instant is given by Eq. (2).

u(t) = Σ (i = 1 to Tv) v(t − i + 1)    (2)
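Eqs. (1) and (2) can be sketched in Python (an illustrative sketch; the constant names DT and TV and the deque-based rolling window are our own choices, not from the paper):

```python
import math
from collections import deque

DT = 0.03  # sampling interval in seconds (33.33 Hz)
TV = 5     # number of past instants summed, as chosen in the text

def velocity(p_now, p_prev):
    """Eq. (1): hand speed from successive (x, y, z) tracker readings."""
    dx, dy, dz = (a - b for a, b in zip(p_now, p_prev))
    return math.sqrt(dx * dx + dy * dy + dz * dz) / DT

history = deque(maxlen=TV)  # rolling window of the last TV velocity values

def velocity_input(p_now, p_prev):
    """Eq. (2): u(t), the summed velocity fed to the velocity network."""
    history.append(velocity(p_now, p_prev))
    return sum(history)
```

A stationary hand yields u(t) near 0, while signing keeps u(t) large over the whole window, which is what lets the network ignore momentary shakes or stops.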
The velocity network has 1 input layer neuron, 5 hidden layer neurons, and 2 output layer neurons. A hyperbolic
tangent sigmoid function is used in both layers. The network is trained using the resilient backpropagation
algorithm [33]. The resilient backpropagation algorithm updates the weights using a factor that is independent
of the slope of the activation function. Hence, it results in fast training as compared to the steepest descent
algorithm, in which the weight update factor depends on the slope of the activation function. The weight
increment and decrement factors are selected as 1.2 and 0.7, respectively. The initial weight update factor is
selected as 0.07. These values are also used in the three feature recognition networks discussed in Section 5.
The maximum number of epochs is set to 500. The error goal is set to 10^−5. The output of the network lies
in the range of [−1, 1], which is converted to 0 or 1.
The input to the network is the velocity summation specified in Eq. (2). The target vectors were manually
created by studying the velocity graphs for different randomly selected signs. The target vector corresponding
to the steady position is [1, –1] and that corresponding to the moving position is [–1, 1]. Twenty training
samples of different signs resulted in recognition accuracy of 91%.
The network continuously calculates the output using the stored data for the past five time instants,
which gets updated at 33.33 Hz. Figure 9 shows the performance of the network for the sign BELL. It can be
seen that the network produces a continuous output of 1 during the actual signing period.
Figure 8. Output from the threshold approach (velocity and threshold output versus time, for the sign BELL).

Figure 9. Performance of the velocity network (velocity and network output versus time, for the sign BELL).
If the output of the velocity network is 1, then the 15 data elements from the glove and the 6 data elements
from the motion tracker are stored at 33.33 Hz. When the velocity network output changes from 1 to 0, the
data recording for the current sign is complete, and the sign recognition phase starts. From the data recorded
during the signing period, the readings at T equally spaced time instants are selected; this yields a simple
model that still retains enough information. These data are sent to the four feature recognition functions,
which are activated at this point. No data are recorded if the output of the velocity network is 0.
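The selection of T equally spaced instants can be sketched as follows (an illustrative sketch; the paper does not fix T at this point, so T = 8, the rounding scheme, and the function name are our assumptions):

```python
def select_instants(frames, T=8):
    """Pick the frames at T (approximately) equally spaced time instants
    from everything recorded during the signing period."""
    n = len(frames)
    if n <= T:
        # Short recording: keep every frame we have
        return list(frames)
    # Indices 0 .. n-1 split into T evenly spaced points
    idx = [round(i * (n - 1) / (T - 1)) for i in range(T)]
    return [frames[i] for i in idx]
```

Each selected frame (15 joint angles plus position and orientation) is then dispatched to the four feature recognition functions.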
5. Feature recognition
ASL is communicated with finger spelling and hand signing. Finger spelling uses the 26-letter ASL alphabet
to spell names and words that are not defined as ASL words, while hand signing produces ASL words, each
defined by a sequence of handshapes and hand movements relative to the signer's body. There are about 6000
ASL words. Hand signing is easier than finger spelling. The ASL alphabet is shown in Figure 10. Some ASL
words are given in Table 2 to give an idea of what ASL words look like.
Figure 10. ASL alphabet gestures.
This study is focused on ASL word recognition. We used four previously defined feature vectors [16,17],
from studies aimed at developing an ASL writing style using handshape, hand location, hand movement, and
hand orientation. These features are also known as the ‘big four’ of ASL. We redesigned these features for
our model and used a neural network classifier for handshape, hand location, and hand orientation feature
extraction, while if-else logic was used for hand movement feature extraction. Feature extractions are discussed
in the following subsections.
5.1. Handshape recognition
As mentioned above, there are 36 handshapes. Handshapes are static gestures, and the 15 data values from
the glove are enough for handshape recognition. Handshapes are often confused with alphabet characters, but
alphabet characters do not depend on finger bending alone; they also depend on hand orientation. Of course,
there are some handshapes in common with, or similar to, the alphabet characters. Handshape is an important
feature for classifying words.
Table 2. Example words from American Sign Language [2].

ASL word | Definition
Hello    | Place hand on forehead as if to salute (but not as rigid). Move hand outward, ending up with palm facing forward in the air just a few inches from the forehead.
I (me)   | Point to self, touching the center of the chest.
meet     | Bring “d” hands together, palm to palm.
Handshape recognition using neural networks was discussed in detail in Section 3. The raw data from
the glove vary from 0 to 256 and are scaled from −1 to 1 before being used as input to the network. The
resilient backpropagation algorithm is used to train the network. The maximum number of epochs is set to
1000. The error goal is set to 10^−5. The network, trained using 12 samples of each handshape, achieved
98% recognition accuracy.
A part of the output of the data collection algorithm is fed to the handshape neural network. It is the
data from 15 finger bending angles at T selected time instants. The output of the network is a sequence of T
numbers at T time instants. Each number is between 1 and 36, representing the observed handshape at a given
time instant.
5.2. Hand orientation recognition
Handshape is a very important feature for ASL signs, but some ASL words share the same handshape, so
handshape alone is not enough to distinguish them. Orientation is used to distinguish between ASL words
that have similar handshapes. The term “orientation” refers to the position of the palm and the direction
the palm is facing. For example, a palm placed on the signer's own chest would mean “mine”, while a palm
facing toward the viewer would mean “your”. The total possible orientations of the hand are divided into
12 classes: four orientations about each of three axes.
Similar to a handshape classifier, a multilayer feedforward neural network is used for classifying the hand
orientation. The Flock of Birds receiver is mounted on the right-hand wrist and it tracks the wrist position (x,
y, z) and wrist orientation (alpha, beta, gamma). Right-hand wrist alpha, beta, and gamma angle data are used
for classifying the orientation. The angle values are between −180 and 180; they are normalized between −1
and 1. During the training and testing of the hand orientation network, 20 normalized angle values are used.
MATLAB was used for training and testing the ANN model. A logarithmic sigmoid transfer function is used in
the hidden layer and the output layer. Network training is done with the resilient backpropagation algorithm.
When the network success reaches 100% during offline testing, the trained network parameters, such as the
weights and biases, are stored in a text file. The network is then tested with real-time data using the
stored parameters. The accuracy of the system is about 96%.
5.3. Hand position recognition
The ASL signing area is the hemispherical space in front of the signer. ASL words change meaning with hand
position, even when the sign has the same handshape and hand orientation. The classification of possible
hand positions is quite subjective. A number of positions are specified in [34], but division into a large
number of classes would reduce the recognition accuracy in our case. Based on
our own judgment, we have divided the total space around the signer into 7 regions as shown in Figure 11. The
black dots represent approximate centers of the different regions. We use a right-handed glove, and the hand
occupies the right side of the body more often as compared to the left side. Hence, more divisions are made
on the right side of the body than the left side. The (x, y, z) values obtained from Flock of Birds are used for
position recognition. These values are measured with respect to the magnetic sensor. Thus, (x, y, z) depend
on the position in which the user is sitting, which can vary. To overcome this problem, we initialize the origin
with respect to the user’s nose. The user just needs to touch his nose and hit a particular key before using the
system.
Figure 11. Classification of hand position.
Similar to the handshape network, a two-stage feedforward neural network is used for position recognition.
The three relative coordinates (x, y, z) are the input to the position recognition network. The (x, y, z)
values can vary from −92 to 92 cm and are scaled from −1 to 1. The network has 3 input layer neurons, 10
hidden layer neurons, and 7 output layer neurons. A logarithmic sigmoid function is used in both layers. The
resilient backpropagation algorithm is used to train the network. The network, trained using 12 samples of
each position, achieved a recognition accuracy of 93%. The output of the network is a sequence of T numbers,
each an integer in the interval 1 to 7.
5.4. Hand movement recognition
The ‘hand movement’ at a given instant is divided into 6 classes: +X, −X, +Y, −Y, +Z, and −Z. The values
of (Δx, Δy, Δz) at a given time instant are the input to the movement recognition function; they are
obtained by taking the difference between the (x, y, z) positions at the given and previous time instants.
If-else logic identifies the axis with the maximum magnitude of deviation and the direction of movement
along it, which together decide the ‘movement class’. As with the previous features, the output of this
function is a sequence of T numbers, each an integer in the interval 1 to 6.
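The if-else movement classifier can be sketched as follows (the mapping of classes to the numbers 1–6 is our assumption; the paper only lists the six classes):

```python
def movement_class(dx, dy, dz):
    """Classify instantaneous movement into one of 6 classes:
    1:+X  2:-X  3:+Y  4:-Y  5:+Z  6:-Z  (numbering is our assumption).
    Picks the axis with the largest absolute deviation, then its sign."""
    deltas = {"x": dx, "y": dy, "z": dz}
    axis = max(deltas, key=lambda a: abs(deltas[a]))
    base = {"x": 1, "y": 3, "z": 5}[axis]
    return base if deltas[axis] >= 0 else base + 1
```

Because only the dominant axis is kept, diagonal motion is quantized to its strongest component, which keeps the observation alphabet small for the probabilistic model.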
6. Probabilistic model for sign recognition
Further processing of the data is done using a probabilistic model based on the concepts of Markov chains
and HMMs [33]. The processing of the raw data by the four feature recognition functions is quite rigorous
and gives a good deal of information about the given sign. For each sign, we get four observation sequences
of length T. Each observation in the handshape sequence takes a value in {1, 2, ..., 36}. Similarly, the
observations in the orientation, position, and movement sequences take values in {1, 2, ..., 12},
{1, 2, ..., 7}, and {1, 2, ..., 6}, respectively. The sequence of four features for a given sign can be
unique, so the processing from here need not be complicated. Hence, we have developed a simple model based
on a probabilistic approach that can analyze this information and give a result with little computation
compared to a standard HMM with multiple states [35].
6.1. Training phase
Creation of the training database based on the ASL signs in the current vocabulary of the system is described
here. Matrix H is associated with the handshape feature and gives the probability of occurrence of all possible
handshapes in all the words in this vocabulary. If Hj is the column vector of matrix H associated with word
j , then the element Hij represents the probability of occurrence of handshape i in word j .
To train a given word, say the jth word in the vocabulary, the raw data for the given samples of that
word are stored. The raw data consist of the 21 values from the glove and motion tracker during the signing
period. The data are processed through the data selection algorithm and then through the four feature
recognition functions, and the four observation sequences corresponding to the four features are recorded.
If nj samples of word j are used to train the system, then there will be nj sequences of handshape
observations, each of length T. Element Hij is then calculated as the ratio of the number of times handshape
i appears in word j to the total number of readings in the training samples of word j, i.e. nj T.
The matrix H is created by going through the above process for all the words to be trained. Thus, if
there is a vocabulary of N words, then matrix H will be of size (36 × N). Similarly, matrices R, L, and M
are created, which are associated with the hand orientation, hand location, and hand movement.
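Building the matrix H from the recorded handshape sequences might look like the following sketch (the function name and data layout are our assumptions, not from the paper):

```python
import numpy as np

N_HANDSHAPES = 36

def train_handshape_matrix(sequences_per_word):
    """Build matrix H (36 x N) from training data.
    sequences_per_word[j] is a list of n_j observation sequences for
    word j, each of length T, with handshape labels in 1..36.
    H[i-1, j] = (occurrences of handshape i in word j) / (n_j * T)."""
    N = len(sequences_per_word)
    H = np.zeros((N_HANDSHAPES, N))
    for j, sequences in enumerate(sequences_per_word):
        total = sum(len(seq) for seq in sequences)  # n_j * T readings
        for seq in sequences:
            for obs in seq:
                H[obs - 1, j] += 1
        H[:, j] /= total  # counts -> probabilities of occurrence
    return H
```

Matrices R, L, and M would be built the same way over their own observation alphabets (12, 7, and 6 symbols).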
6.2. Recognition phase
For an unknown word, there are four observation sequences after processing the raw data through the
feature recognition functions. If O is the observation matrix for an unknown word, it will have a size of
(4 × T). Each of its rows is an observation vector of length T corresponding to one of the four features.
Let h be the handshape observation vector and h[i] be the handshape observed at time instant i. The goal is
to find the closeness of this observation vector to each of the N words in the database. A ‘handshape
probability index’ of observation O for word j in the trained database is defined as:

P(O|Hj) = Π (i = 1 to T) Hj[h[i]]    (3)

where Hj is the vector representing the jth column of matrix H. It is assumed that each observation in a
sequence is independent of the other observations in that sequence. Hence, the likelihood of a sequence of
observations appearing together can be expressed as the product of the probabilities of the individual
observations in that sequence.
Similarly, the ‘orientation probability index’, ‘location probability index’, and ‘movement probability
index’ can be calculated for the unknown observation O . The ‘total probability index’ of the observation O for
the word j is defined as:
P(O|Wj) = fh · P(O|Hj) + fr · P(O|Rj) + fl · P(O|Lj) + fm · P(O|Mj).    (4)

fh, fr, fl, and fm in the above equation are the weight factors corresponding to the four features, which will
be discussed later. This process of calculating the total probability index is repeated for all the N words in the
available vocabulary. Finally, the output of the recognition system, w , is the sign or word corresponding to the
maximum of the total probability index values of observation O for all the words in the vocabulary, W .
w = argmax(P (O|Wj)|j = 1, 2, ..., N) (5)
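Eqs. (3)–(5) can be sketched as follows (an illustrative sketch; the dict-based bookkeeping and function names are our own choices):

```python
import numpy as np

def probability_index(col, obs):
    """Eq. (3): product of per-instant probabilities for one feature.
    col is one column of a trained matrix; obs holds 1-based labels."""
    return float(np.prod([col[o - 1] for o in obs]))

def recognize(obs_by_feature, matrices, weights):
    """Eqs. (4)-(5): weighted total probability index, then argmax.
    obs_by_feature / matrices / weights are dicts keyed by feature name."""
    N = next(iter(matrices.values())).shape[1]  # vocabulary size
    totals = []
    for j in range(N):
        total = sum(weights[f] * probability_index(matrices[f][:, j],
                                                   obs_by_feature[f])
                    for f in matrices)
        totals.append(total)
    return int(np.argmax(totals))  # index j of the recognized word
```

Note that a single unseen observation zeroes a word's product in Eq. (3), so the weighted sum over the four features in Eq. (4) keeps recognition robust when one feature disagrees.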
This word output string is displayed on the computer screen in the Open Inventor interface. The voice of the
output string is generated using Microsoft Speech SDK. After all these processes, the system continues to run
and collect the data at 33.33 Hz until the sign recognition function is triggered again by the velocity network.
The four weight factors indicate the importance of each feature. These factors are determined by
considering the available vocabulary of signs. For example, consider the handshape weight factor fh. One
half of the available samples is used as the training database and the remaining half as the recognition
database. The recognition database samples are tested for word recognition considering the handshape
feature only. Since the
targets or the correct answers corresponding to the tested signs are already known, the target results can be
compared with the recognition results obtained by considering the handshape feature only. The factor fh is
calculated by finding the ratio of the number of accurate results using only the handshape feature to the total
number of tested samples.
The factor fh is essentially the accuracy ratio for the handshape feature. Similarly, the factors fr , fl ,
and fm can be calculated. They give an appropriate amount of importance to each of the four features.
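Since each weight factor is simply a single-feature accuracy ratio, its computation can be sketched as follows (hypothetical names; the single-feature recognition outputs are assumed to be precomputed on the held-out recognition database):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weight factor for one feature: the ratio of samples recognized correctly
// using that feature alone to the total number of tested samples.
double weightFactor(const std::vector<int>& singleFeatureOutput,
                    const std::vector<int>& target) {
    int correct = 0;
    for (std::size_t i = 0; i < target.size(); ++i)
        if (singleFeatureOutput[i] == target[i]) ++correct;
    return static_cast<double>(correct) / target.size();
}
```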
7. Interactive recognition and training
The system is capable of online training and updating of the database. Every time a new sample of any sign is
added, the elements of the database matrices, H , R , L and M , are updated corresponding to that sign. The
possible cases are given below.
2118
SARAWATE et al./Turk J Elec Eng & Comp Sci
Case 1: The recognition output is accepted and no change is made in the database.
Case 2: If the output is correct (the j th sign), then the database can be updated by recalculating the various probabilities for the j th sign. This case is useful when adding more samples of an existing sign to the database.
Hij = (Ci + Hij · T · nj) / (T · (nj + 1)) (6)
Case 3: If the user enters that the output is incorrect, he can then enter the name of the correct sign. After
entering the correct name, there are two options:
1. If the entered sign is present in the database (say the k th word), then the probabilities for the k th word
in the database are updated. The matrix update expression is same as in Case 2, except that the data
corresponding to the k th word are updated.
2. If the entered sign is not present in the database, it is a new sign. Hence, the database should be updated to include this (N+1)th sign: an extra column is added to matrix H corresponding to the new sign, and N is incremented to N+1.
Hi(N+1) = Ci / T (7)
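The update rules of Eqs. (6) and (7) can be sketched as below. The names are illustrative: counts[i] stands for Ci, the number of occurrences of class i among the T observations of the new sample, and n for nj, the number of samples already stored for the sign:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Eq. (6): fold a new sample into an existing sign's column of matrix H.
void updateColumn(std::vector<double>& column,
                  const std::vector<int>& counts, int T, int n) {
    for (std::size_t i = 0; i < column.size(); ++i)
        column[i] = (counts[i] + column[i] * T * n)
                    / static_cast<double>(T * (n + 1));
}

// Eq. (7): build the column for a brand-new (N+1)th sign.
std::vector<double> newColumn(const std::vector<int>& counts, int T) {
    std::vector<double> column(counts.size());
    for (std::size_t i = 0; i < counts.size(); ++i)
        column[i] = counts[i] / static_cast<double>(T);
    return column;
}
```

The same two routines apply unchanged to the R, L, and M matrices.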
The four weight factors are also updated with the addition of a new sample of any sign. The new sample is first recognized using only the handshape feature. If it is recognized correctly, then the factor fh is updated as given in Eq. (8).
fh = [fh(∑_{i=1}^{N} ni − 1) + 1] / ∑_{i=1}^{N} ni (8)
If the recognition output using only the handshape feature is incorrect, then the factor is updated as given in
Eq. (9). A similar procedure is used to update fr , fl , and fm .
fh = fh(∑_{i=1}^{N} ni − 1) / ∑_{i=1}^{N} ni (9)
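Eqs. (8) and (9) amount to a single running-average step; a sketch with illustrative names, where totalSamples stands for ∑ ni after the new sample has been counted:

```cpp
#include <cassert>
#include <cmath>

// Eqs. (8)-(9): online update of a feature weight factor. The weight moves
// toward 1 when the single-feature recognition of the new sample is correct,
// and toward 0 when it is not.
double updateWeight(double f, long totalSamples, bool correct) {
    double numerator = f * (totalSamples - 1) + (correct ? 1.0 : 0.0);
    return numerator / totalSamples;
}
```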
In order to develop a large vocabulary for the ASL recognition system, it should be trained using thousands of
words. It is difficult to gather the data for all words and train the system at once. According to the method in
Section 7.1, there will be a need to retrain the system every time a new sample of any sign is added. The module
discussed in this section overcomes this problem of retraining and allows easy extension of the vocabulary. This
makes the system capable of online training. The online training can be especially useful for human/robot
applications, as mentioned in [26].
8. Test results and discussion
The system is tested for a vocabulary of 40 commonly used words in English chosen from the ASL dictionary.
The words are selected so that some meaningful sentences can be constructed with them. This will be useful
in future work of sentence recognition. Hence, the selected words contain a few verbs as well as nouns and
adjectives. The words are chosen such that the signs only need one hand to be performed. This is because the
system has only one glove at present. The list of selected words is given in Table 3.
Table 3. List of ASL words used in the system.
No.  Word        No.  Word
1    About       21   Easy
2    Airplane    22   Eat
3    Alone       23   False
4    Always      24   Get away
5    Am          25   Give
6    And         26   Go
7    Another     27   Goodbye
8    Apologize   28   Hanger
9    Apple       29   Have to
10   Appoint     30   Hey
11   Bell        31   High
12   Bore        32   Hot
13   Careless    33   How many
14   Concept     34   I
15   Consume     35   Imagination
16   Curious     36   Interest
17   Delete      37   Interrogate
18   Deny        38   Invent
19   Disregard   39   Lion
20   Drink       40   Right
The tests are carried out using 15, 20, and 25 training samples of each word. The effect of the number of time instants (T) on the recognition accuracy is also studied. Figure 12 shows that the accuracy increases slightly
with more training samples. The accuracy increases with the number of time instants in the earlier stages, but
then it remains fairly constant. More investigation is needed to find the effect of the number of time instants
on the recognition accuracy. Time instants from 8 to 16 are found to give almost the same accuracy.
[Figure 12 plots recognition accuracy (%) against the number of time instants (0 to 20) for the probabilistic model with a vocabulary of 40 words, with separate curves for 15, 20, and 25 training samples; the accuracy axis runs from 80% to 100%.]
Figure 12. Test results of word recognition system.
The C++ code has been made flexible to accommodate a varying number of time instants. At present,
8 time instants are used in the real-time system. The corresponding accuracy is 92.5%. The code works well,
and words are displayed in the Open Inventor environment and spoken using the Microsoft speech synthesis
software without any noticeable lag. Figure 13 shows examples of words recognized by the system.
Figure 13. Sample recognition results.
9. Conclusion
A real-time ASL word recognition system has been successfully developed. The accuracy of the system can be
improved with higher numbers of training samples per sign. There is scope for improvement in position and
movement recognition. The number of classes for both the features can be increased to achieve more diverse
observations, which will result in better distinction between different signs. In the case of the position network, an additional motion tracker mounted on the user's neck or back can be used. This will remove the errors caused by the movement of the user during actual signing. In the case of movement recognition, data at more time instants can be used along with neural networks to identify movements such as clockwise, anticlockwise, and zigzag. This will create a more diverse classification of movements.
The probability model is based on simple calculations, yet it works well. This is because the four feature
recognition functions extract sufficient information from the raw data and make further processing easier.
However, it will be very interesting to test the system using standard HMMs with multiple states and a left-right
state model. With the left-right model, the states of the HMM can be assigned with time in a straightforward
manner. This can overcome the drawback of the current model, which assumes equal probability of occurrence
of the observations at different time instants in the sequence.
The system has some advantages over a complete neural network model for recognizing signs in ASL
[36], where a backpropagation neural network is used to recognize 50 signs in ASL. The input to the network is
the 21-element data at six time instants from the glove and motion tracker. Fifteen of the 21 inputs are from
the glove and hence the data from the glove are dominant in the total input and have more influence on the
output. In our system, we convert the raw data into features and each feature is assigned a weight factor, which
indicates its importance. The backpropagation network needs to be retrained for addition of new signs in the
system, but in our system, the online training module makes the addition of new words easier.
The system needs much improvement to be realistically useful for continuous ASL-to-English translation applications. However, the present system can be applied in the area of human/robot interfaces, as in [26–29].
The interactive updating allows easy addition of new commands to operate a machine or robot. The system
can also be useful in the preliminary stages of teaching ASL to a person.
Future work will involve expanding the current vocabulary of signs and testing the system on a larger vocabulary. Two gloves and three motion trackers can be used, which will allow the inclusion of signs that
use both hands. Further work will be directed towards the problem of sentence recognition by developing a
methodology for ASL grammar recognition and distinguishing continuously performed signs without pause.
Acknowledgment
This research was partially supported by a National Science Foundation award (DMI-0079404) and a Ford
Foundation grant, as well as by the Intelligent Systems Center at the Missouri University of Science and
Technology in the United States.
References
[1] National Center of Birth Defects and Developmental Disabilities. Data. Atlanta, GA, USA: CDC, 2012. Available
at http://www.cdc.gov/ncbddd/birthdefects/data.html.
[2] Sternberg MLA. American Sign Language Dictionary. 3rd ed. New York, NY, USA: Harper Perennial, 1994.
[3] Wysoski SG, Lamar MV, Kuroyanagi S, Iwata A. A rotation invariant approach on static gesture recognition
using boundary histograms and neural networks. In: Proceedings of the 9th International Conference on Neural
Information Processing; November 2002; Singapore. pp. 2137–2141.
[4] Vogler C, Metaxas D. ASL recognition based on a coupling between HMMS and 3D motion analysis. In: Proceedings
of the Sixth IEEE International Computer Vision Conference; January 1998; Bombay, India. pp. 363–369.
[5] Lalit G, Suei M. Gesture-based interaction and communication: automated classification of hand gesture contours.
IEEE T Syst Man Cy C 2001; 31: 114–120.
[6] Yang MH, Ahuja N. Recognizing hand gestures using motion trajectories. In: Ahuja N, editor. Face Detection and
Gesture Recognition for Human-Computer Interaction. Berlin, Germany: Springer, 1999. pp. 466–472.
[7] Vogler C, Metaxas D. Adapting hidden Markov models for ASL recognition by using three-dimensional computer
vision methods. In: Proceeding of the IEEE International Conference on Systems, Man and Cybernetics; October
1997; Orlando, FL, USA. pp. 12–15.
[8] Cui Y, Weng J. A learning-based prediction and verification segmentation scheme for hand sign image sequence.
IEEE T Pattern Anal 1999; 21: 798–804.
[9] Vogler C, Metaxas D. Parallel hidden Markov models for American Sign Language recognition. In: Proceedings of
the IEEE International Computer Vision Conference; September 1999; Kerkyra, Greece. pp. 116–122.
[10] Zaki MM, Shaheen SI. Sign language recognition using a combination of new vision based features. Pattern Recogn
Lett 2011; 32: 572–577.
[11] Fels SS, Hinton GE. Glove-Talk: A neural network interface between a data-glove and a speech synthesizer. IEEE
T Neural Networ 1993; 4: 2–8.
[12] Fels SS, Hinton GE. Glove-Talk II - A neural-network interface which maps gestures to parallel formant speech
synthesizer controls. IEEE T Neural Networ 1997; 8: 977–984.
[13] Waldron MB, Kim S. Isolated ASL sign recognition system for deaf persons. IEEE T Rehabil Eng 1995; 3: 261–271.
[14] Wu J, Gao W, Song Y, Liu W, Pang B. A simple sign language recognition system based on data glove. In:
Proceedings of Signal Processing ICSP; October 1998; Beijing, China. pp. 12–16.
[15] Gao W, Ma J, Wu J, Wang C. Sign language recognition based on HMM/ANN/DP. Int J Pattern Recogn 2000;
14: 587–602.
[16] Gaolin F, Gao W, Ma J. Signer-independent sign language recognition based on SOFM/HMM. In: IEEE ICCV
Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems; July 2001; Van-
couver, Canada. pp. 90–95.
[17] Wang C, Gao W, Shan S. An approach based on phonemes to large vocabulary Chinese Sign Language recognition.
In: Proceedings of 5th International IEEE Conference on Automatic Face and Gesture Recognition; May 2002;
Washington, DC, USA. pp. 393–398.
[18] Fang G, Gao W, Zhao D. Large vocabulary sign language recognition based on fuzzy decision trees. IEEE T Syst
Man Cyb 2004; 34: 122–131.
[19] Liang R, Ouhyoung M. A real-time continuous alphabetic sign language to speech conversion VR system. Comput
Graph Forum 1995; 14: 67–74.
[20] Liang R, Ouhyoung M. A sign language recognition system using hidden markov model and context sensitive search.
In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology; June 1996; Hong Kong. pp.
59–66.
[21] Oz C, Leu MC. Linguistic properties based on American Sign Language recognition with artificial neural network
using a sensory glove and motion tracker. Neurocomputing 2007; 70: 2891–2901.
[22] Oz C, Leu MC. American Sign Language word recognition with a sensory glove using artificial neural networks.
Eng Appl Artif Intel 2011; 24: 1204–1213.
[23] Kim J, Jang W, Bien Z. A dynamic gesture recognition system for the Korean Sign Language (KSL). IEEE T Syst
Man Cyb 1996; 26: 1144–1148.
[24] Qutaishat M, Moussa H, Bayan T, Hiba AA. American sign language (ASL) recognition based on Hough transform
and neural networks. Expert Syst Appl 2007; 32: 24–37.
[25] Stergiopolou E, Papamarkos N. Hand gesture recognition using a neural network shape fitting technique. Eng Appl
Artif Intel 2009; 22: 1141–1158.
[26] Lee C, Xu Y. Online, interactive learning of gestures for human/robot interfaces. In: Proceedings of IEEE Interna-
tional Conference on Robotics and Automation; April 1996; Minneapolis, MN, USA. pp. 124–128.
[27] Wang H, Leu MC, Oz C. American Sign Language recognition using multi-dimensional hidden Markov models. J
Inf Sci Eng 2006; 22: 1109–1123.
[28] Damasio FW, Musse SR. Animating virtual humans using hand postures. In: Proceedings of Symposium on
Computer Graphics and Image Processing; October 2002; Rio de Janeiro, Brazil. pp. 437–442.
[29] Weissmann J, Salomon R. Gesture recognition for virtual reality applications using data gloves and neural networks.
In: International Joint Conference on Neural Networks; July 1999; Washington, DC, USA. pp. 2043–2046.
[30] Wang HG, Sarawate NN, Leu MC. Recognition of American Sign Language gestures with a sensory glove. In: Japan
USA Symposium on Flexible Automation; July 2004; Denver, CO, USA. pp. 102–109.
[31] Sturman DJ, Zelter D. A survey of glove-based input. IEEE Comput Graph 1994; 14: 30–39.
[32] Lippman RP. An introduction to computing with neural nets. IEEE ASSP Mag 1987; 4: 4–22.
[33] Reidmiller M, Braun H. A direct adaptive method for faster backpropagation learning: the RPROP algorithm.
In: Proceedings of the IEEE International Conference on Neural Networks; March–April 1993; San Francisco, CA,
USA. pp. 586–591.
[34] Wilbur RB. American Sign Language: Linguistic and Applied Dimensions. 2nd ed. Boston, MA, USA: College Hill
Publication, 1987.
[35] Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. P IEEE 1989; 77:
257–286.
[36] Oz C, Sarawate NN, Leu MC. American Sign Language word recognition with a sensory glove using artificial neural
networks. In: ANNIE Conference; November 2004; St. Louis, MO, USA. pp. 563–569.