Turkish Journal of Electrical Engineering & Computer Sciences
http://journals.tubitak.gov.tr/elektrik/
Research Article
Turk J Elec Eng & Comp Sci (2015) 23: 2107–2123
© TÜBİTAK
doi:10.3906/elk-1303-167
A real-time American Sign Language word recognition system based on neural
networks and a probabilistic model
Neelesh SARAWATE1, Ming Chan LEU2, Cemil OZ3,*
1 Department of Mechanical Engineering, Ohio State University, Columbus, OH, USA
2 Department of Mechanical and Aerospace Engineering, Missouri University of Science and Technology, Rolla, MO, USA
3 Department of Computer Engineering, Faculty of Computer and Information Sciences, Sakarya University, Sakarya, Turkey
* Correspondence: [email protected]
Received: 22.03.2013 • Accepted/Published Online: 31.07.2013 • Printed: 31.12.2015
Abstract: The development of an American Sign Language (ASL) word recognition system based on neural networks
and a probabilistic model is presented. We use a CyberGlove and a Flock of Birds motion tracker to extract the gesture
data. The finger joint angle data obtained from the sensory glove defines the handshape while the data from the motion
tracker describes the trajectory of the hand movement. The four gesture features, namely handshape, hand position,
hand orientation, and hand movement, are recognized using different functions that include backpropagation neural
networks. The sequence of these features is used to generate a specific sign or word in ASL based on a probabilistic
model. The system can recognize ASL signs in real time and update its database interactively. The system
has an accuracy of 95.4% over a vocabulary of 40 ASL words.
Key words: American Sign Language, American Sign Language recognition, virtual reality, artificial neural network,
probabilistic model
1. Introduction
Deafness is a serious disability. It is defined as a degree of hearing impairment such that a person is
unable to understand speech. Every country has a deaf community; while most of these individuals are born
with hearing loss, others lose their hearing as children or even as adults [1]. Almost all deaf communities
have their own sign language, which differs greatly from the surrounding spoken language. The United
Kingdom and the United States both use English, but American Sign Language (ASL) was derived from French
Sign Language by French teachers and is therefore different from British Sign Language (BSL). Turkish Sign
Language (TSL) and BSL are more similar to each other than to ASL because TSL was created by British
teachers. As a result, it is often very difficult for deaf people to communicate with people outside the deaf
community.
Deaf people in the United States and parts of Canada communicate with ASL [2], but few hearing people
other than the family members of deaf people understand it. An automatic machine translation system between
deaf and hearing people would therefore aid communication. Such a system needs two parts: the first converts
ASL signs to text and speech, and the second converts speech to text and sign language animation. Most deaf
people who know ASL can read English text. There are some successful speech recognition programs, like the
Microsoft Speech SDK, but converting ASL signs to English text is a very challenging problem due to the
complex body movements involved.
Generally, research on hearing devices receives more support than research on sign language recognition,
which affects ASL recognition studies; nevertheless, there are many ASL and gesture recognition studies.
Some ASL studies are based on computer vision with a specially set up high-frame-rate camera, where the
user needs to wear a colored glove, a special suit, or markers to increase the accuracy of the system. Most of
these systems are not portable, and some, like the VICON or OptiTrack systems, are not affordable. They
require complicated data processing and have slower recognition rates. The development of some successful
computer vision-based ASL recognition systems can be found in [3–10]. There are also successful ASL studies
using data gloves and motion trackers. These ASL recognition systems use a special sensory glove and a motion
tracker to detect handshapes and hand movements. Some ASL recognition researchers combine sensor devices
and computer vision systems to get better results.
Between 1980 and 1990, high-accuracy virtual reality devices were developed to interact with the virtual
world, like data gloves, haptics, or Flock of Birds. There have been several ASL studies based on data gloves
and Flock of Birds using artificial neural networks (ANNs), hidden Markov models (HMMs), and statistical
methods. The research based on neural networks includes work done by Fels and Hinton [11,12] and by Waldron
and Kim to recognize ASL signs [13]. Recent attempts in sign language recognition involve the use of HMMs.
Gao et al. did considerable work in Chinese Sign Language (CSL) recognition [14–18] using sensory gloves and
motion trackers. Their work included the use of ANNs and HMMs for continuous recognition of CSL. Liang and
Ouhyoung worked on continuous Taiwanese Sign Language recognition using sensory gloves [19,20].
Oz et al. developed an ASL word recognition system using ANNs [21,22]. Other attempts include the
study by Kim et al. [23] to recognize Korean Sign Language words using neural networks. Some recent neural
network-based ASL studies were done by Qutaishat et al. and by Stergiopolou and Papamarkos [24,25]. Lee
and Xu [26] recognized 14 letters of the ASL alphabet for a human/robot interface using HMMs. Wang et al.
developed an ASL alphabet recognition system using HMMs [27]. The applications of gesture recognition
research are not limited to sign language recognition: gesture recognition systems are also useful for
human/robot interfaces and virtual reality [28–30].
This paper discusses the development of a real-time ASL sign recognition system using neural networks
and a probabilistic model. A sign is composed of four features that are recognized by four neural network
classifiers, and then the sequence of network outputs is analyzed using the probabilistic model. The sign output
is then converted to English text and speech. We use a CyberGlove sensory glove and Flock of Birds motion
tracker for collecting the data. Other components include a velocity network, developed to detect the start
and end of signing based on the velocity of the hand. The system is also capable of online training of ASL
signs, which allows the user to add new words to the vocabulary without retraining the whole system.
2. Experimental setup and procedure
Standard computer input and output devices such as the mouse, keyboard, and monitor are designed to be used
while sitting in a chair, and they constrain the user. Virtual reality applications need specially designed
devices, like gloves, motion trackers, and head-mounted displays, that give the user physical flexibility.
These interaction-based input devices were designed to overcome the constraints of conventional peripherals [31].
A CyberGlove and Flock of Birds are used to track bending fingers, hand position, and hand orientation.
The CyberGlove has 18 sensors: three strain gauges for the palm, three for the thumb, two for each of the
four fingers, and four between the fingers. Each strain gauge converts its degree of bending to a number
between 1 and 255. The CyberGlove data, except the three palm values, are used for classifying handshapes
and ASL alphabet letters. The glove is shown in Figure 1.
The Flock of Birds has three parts: the system unit, the transmitter, and the receiver. The receiver is
attached to the right wrist and captures the hand position and orientation. The transmitter produces a pulsed
DC magnetic field with a range of about 180 cm, within which the position and orientation of the right wrist
are tracked. Figure 2 shows the Flock of Birds motion tracker system.

Figure 1. CyberGlove with 18 sensors.

Figure 2. Flock of Birds 3-D motion tracker.
Figure 3 shows the block diagram of our ASL word recognition system. The CyberGlove and the Flock
of Birds tracker have their own system units to get fast and reliable data, and each system unit is connected
to the computer system through an RS-232 serial port. The velocity network listens to the Flock of Birds data
and calculates hand velocity. It classifies the hand velocity at any time instant as “signing” or
“not signing”. Feature vectors are extracted and sensor data are recorded while the output of the velocity
network is “signing”. When the output changes from “signing” to “not signing”, the feature and data
recording for the given sign is complete, and the sign recognition phase starts. A probabilistic model based
on Markov chains and HMMs is used to process the outputs of the four networks. This model gives the sign
output based on the trained database. There is a provision to train the system online and to update the present
database in an interactive fashion. A VRML model of the hand is continuously updated, which shows the
position of the fingers and hand on the computer screen along with the sign output. The Microsoft Speech SDK
produces the voice of the sign output text. All these different units are discussed in detail in the subsequent
sections.
3. Handshape recognition using neural networks
Each sign in ASL can be considered to be composed of a combination of 36 basic handshapes. Handshapes
form the most important and interesting features of an ASL sign. During the ASL signing, the dominant hand,
usually the right hand, does the majority of the signing. There are one-handed signs that rely simply on the
dominant hand, symmetric two-handed signs, where the dominant and nondominant hands perform the same
action, and nonsymmetrical two-handed signs, where both hands are often performing different actions. Figure
4 shows the ASL handshapes.
Figure 3. The overall structure of our system. (The CyberGlove and Flock of Birds supply 15 joint angles, the position (x, y, z), and the orientation (α, β, γ) at 33.33 Hz. These feed the velocity network and, at T selected instants, the handshape network (15 angles), the position network (x, y, z), the orientation network (α, β, γ), and the movement function (Δx, Δy, Δz). Their outputs enter the probability model, which is updated/trained online from interactive user feedback. The recognized sign output is spoken by the Microsoft Speech SDK and displayed on screen with an Open Inventor VRML hand model.)
We have used ANNs to solve the handshape recognition problem. The 26 letters of the ASL alphabet, which
are used for spelling nouns and words not in the ASL dictionary, are very similar in nature to the handshapes,
so a similar solution can also be used for alphabet recognition. ANNs [12] are composed of several computing
elements operating in parallel and arranged in a pattern reminiscent of the biological nervous system. The input
and output nodes are interconnected by links called weights. In the training phase, the network learns to
perform a particular function by adjusting the weights and developing associations between the input space and
the output space. In the recognition phase, a fully trained network can predict an output for an unknown input
using the learned associations. The backpropagation algorithm is a supervised learning algorithm in which
sample inputs and outputs are presented to the network in the training phase. This is the algorithm we use for
the recognition of ASL handshapes.
A two-stage network, as shown in Figure 5, is developed for handshape recognition. The input layer
consists of 15 nodes (p1 to p15), one for each finger joint angle value. The output layer consists of 36
nodes (a1 to a36), each representing a different handshape. The initial weights and biases are set to random
values. The handshape recognition ANN model has 15 input neurons, 20 hidden layer neurons, and 36 output
neurons. Hence, the weight matrix between the input and hidden layers has a size of (20 × 15) and the weight
matrix between the hidden and output layers has a size of (36 × 20). A log-sigmoid transfer function is used
for calculating the hidden layer and output layer values, so each network output lies between 0 and 1. The
target vector consists of 36 elements, all set to 0 except one, which is set to 1; the index of the element
set to 1 defines the handshape.
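As a rough illustration, the forward pass of this 15-20-36 log-sigmoid network can be sketched as follows (a minimal sketch; the randomly initialized weights stand in for trained ones, and the function and variable names are our own, not from the paper):

```python
import numpy as np

def logsig(x):
    # Log-sigmoid transfer function, used in both layers
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Weight matrices sized as in the text: (20 x 15) and (36 x 20), plus biases
W1, b1 = rng.standard_normal((20, 15)), rng.standard_normal(20)
W2, b2 = rng.standard_normal((36, 20)), rng.standard_normal(36)

def handshape_forward(p):
    """p: 15 scaled finger-joint angles -> 36 handshape scores in (0, 1)."""
    a1 = logsig(W1 @ p + b1)   # hidden layer (20 neurons)
    a2 = logsig(W2 @ a1 + b2)  # output layer (36 neurons)
    return a2

p = rng.uniform(-1, 1, 15)               # one scaled sensor reading
scores = handshape_forward(p)
handshape = int(np.argmax(scores)) + 1   # winning handshape index, 1..36
```

In recognition, the index of the largest output plays the role of the target element set to 1 during training.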
Network training and testing are done in MATLAB using the Levenberg–Marquardt algorithm (TRAINLM) and the
resilient backpropagation algorithm (TRAINRP). We trained and tested the network with between 6 and 80 hidden
layer neurons, using different numbers of training samples from multiple users. Twenty hidden layer neurons
give good performance in terms of accuracy and training time for both TRAINLM and TRAINRP.
Figure 4. The 36 basic handshapes in ASL.
Figure 5. Neural network for handshape recognition.
The Levenberg–Marquardt algorithm has large memory requirements and a long training time. Comparisons
between the Levenberg–Marquardt and resilient backpropagation algorithms are given in Figures 6 and 7 for
16 training samples.
Both algorithms were also checked for a multiuser system. The system, trained by 3 and by 7 users, was
tested by users involved in the training as well as by completely new users. The test results are given in
Table 1. Considering its comparable accuracy and low training time and memory requirements, TRAINRP is
considered the better solution.
Figure 6. Accuracy comparison of TRAINLM and TRAINRP (accuracy (%) versus number of hidden layer nodes).

Figure 7. Training time comparison of TRAINLM and TRAINRP (training time (s) versus number of hidden layer nodes).
Table 1. Handshape recognition results for multiuser system.

            | Trained by 3 users       | Trained by 7 users
Algorithm   | Same users | New users   | Same users | New users
TRAINLM     | 93.51      | 80.09       | 92.09      | 82.17
TRAINRP     | 87.96      | 72.22       | 95.72      | 85.41
4. Velocity network and data collection
The system is developed for recognition of ‘isolated’ signs. Hence, it is assumed that there is a distinct pause
between the two successive signs. The signing period is defined as the period over which the signing action is
performed. Typically, during this period, the hand is moving. When the hand stops moving, it is considered as
the end of the sign. From the viewpoint of real-time sign recognition, it is necessary to identify the beginning
and the end of the sign. The ‘velocity’ of the hand at a given time instant gives a good indication of whether
the hand is moving or steady. However, due to the high frequency of data collection, the instantaneous velocity
value can rise or drop suddenly due to momentary shaking or stopping of the hand. The ‘threshold velocity’
approach considers the hand velocity at only one time instant and hence fails to achieve the desired purpose.
Figure 8 shows the velocity graph for sign BELL and the output of the threshold approach. Hence, the history
of velocity values for a few previous time instants should be considered to determine whether the hand is steady
or moving.
Neural networks provide a good solution to this problem. ANNs [32] and the backpropagation training
algorithm were described in Section 3; the same supervised learning approach is used here for the velocity
network and for the recognition of three of the features.
A two-stage neural network, the ‘velocity network’, is used to classify the hand at a given time instant as
steady (0) or moving (1). After numerous trials, the input to the network is decided as ‘summation of velocity
values during past Tv instants including the current one’. In our case, after a few trials, Tv is decided as 5.
The velocity is given by Eq. (1). In our case, Δt has a constant value of 0.03 s.

v(t) = √(Δx(t)² + Δy(t)² + Δz(t)²) / Δt,    (1)

where
Δx(t) = x(t) − x(t − 1),
Δy(t) = y(t) − y(t − 1),
Δz(t) = z(t) − z(t − 1).
The input to the velocity network at a given time instant is given by Eq. (2).

u(t) = Σ (i = 1 to Tv) v(t − i + 1)    (2)
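Eqs. (1) and (2) can be sketched in Python (an illustrative sketch; the constant names DT and TV and the deque-based rolling window are our own choices, not from the paper):

```python
import math
from collections import deque

DT = 0.03  # sampling interval in seconds (33.33 Hz)
TV = 5     # number of past instants summed, as chosen in the text

def velocity(p_now, p_prev):
    """Eq. (1): hand speed from successive (x, y, z) tracker readings."""
    dx, dy, dz = (a - b for a, b in zip(p_now, p_prev))
    return math.sqrt(dx * dx + dy * dy + dz * dz) / DT

history = deque(maxlen=TV)  # rolling window of the last TV velocity values

def velocity_input(p_now, p_prev):
    """Eq. (2): u(t), the summed velocity fed to the velocity network."""
    history.append(velocity(p_now, p_prev))
    return sum(history)
```

A stationary hand yields u(t) near 0, while signing keeps u(t) large over the whole window, which is what lets the network ignore momentary shakes or stops.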
The velocity network has 1 input layer neuron, 5 hidden layer neurons, and 2 output layer neurons. A hyperbolic
tangent sigmoid function is used in both layers. The network is trained using the resilient backpropagation
algorithm [33]. The resilient backpropagation algorithm updates the weights using a factor that is independent
of the slope of the activation function. Hence, it results in fast training as compared to the steepest descent
algorithm, in which the weight update factor depends on the slope of the activation function. The weight
increment and decrement factors are selected as 1.2 and 0.7, respectively. The initial weight update factor is
selected as 0.07. These values are also used in the three feature recognition networks discussed in Section 5.
The maximum number of epochs is set to 500. The error goal is set to 10^−5. The output of the network lies
in the range of [−1, 1], which is converted to 0 or 1.
The input to the network is the velocity summation specified in Eq. (2). The target vectors were manually
created by studying the velocity graphs for different randomly selected signs. The target vector corresponding
to the steady position is [1, –1] and that corresponding to the moving position is [–1, 1]. Twenty training
samples of different signs resulted in recognition accuracy of 91%.
The network continuously calculates the output using the stored data for the past five time instants,
which gets updated at 33.33 Hz. Figure 9 shows the performance of the network for the sign BELL. It can be
seen that the network produces a continuous output of 1 during the actual signing period.
Figure 8. Output from the threshold approach (velocity and threshold output versus time, for the sign BELL).

Figure 9. Performance of the velocity network (velocity and network output versus time, for the sign BELL).
If the output of the velocity network is 1, then the 15 data elements from the glove and the 6 data elements
from the motion tracker are stored at 33.33 Hz. When the velocity network output changes from 1 to 0, the
data recording for the current sign is complete, and the sign recognition phase starts. From the data recorded
during the signing period, the readings at T equally spaced time instants are selected; this yields a simple
model that still retains enough information. These data are sent to the four feature recognition functions,
which are activated at this point. No data are recorded if the output of the velocity network is 0.
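The selection of T equally spaced instants can be sketched as follows (an illustrative sketch; the paper does not fix T at this point, so T = 8, the rounding scheme, and the function name are our assumptions):

```python
def select_instants(frames, T=8):
    """Pick the frames at T (approximately) equally spaced time instants
    from everything recorded during the signing period."""
    n = len(frames)
    if n <= T:
        # Short recording: keep every frame we have
        return list(frames)
    # Indices 0 .. n-1 split into T evenly spaced points
    idx = [round(i * (n - 1) / (T - 1)) for i in range(T)]
    return [frames[i] for i in idx]
```

Each selected frame (15 joint angles plus position and orientation) is then dispatched to the four feature recognition functions.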
5. Feature recognition
ASL is communicated with finger spelling and hand signing. Finger spelling uses the 26-letter ASL alphabet
to spell names and words that are not defined as ASL words, while hand signing produces ASL words, each
defined by a sequence of handshapes and hand movements relative to the signer's body. There are about 6000
ASL words. Hand signing is easier than finger spelling. The ASL alphabet is shown in Figure 10. Some ASL
words are given in Table 2 to give an idea of what ASL words look like.
Figure 10. ASL alphabet gestures.
This study is focused on ASL word recognition. We used four previously defined feature vectors [16,17],
from studies aimed at developing an ASL writing style using handshape, hand location, hand movement, and
hand orientation. These features are also known as the ‘big four’ of ASL. We redesigned these features for
our model and used a neural network classifier for handshape, hand location, and hand orientation feature
extraction, while if-else logic was used for hand movement feature extraction. Feature extractions are discussed
in the following subsections.
5.1. Handshape recognition
As mentioned above, there are 36 handshapes. Handshapes are static gestures, and the 15 data values from
the glove are enough for handshape recognition. Handshapes are often confused with alphabet characters, but
alphabet characters do not depend on finger bending alone; they also depend on hand orientation. Of course,
there are some handshapes in common with, or similar to, the alphabet characters. Handshape is an important
feature for classifying words.
Table 2. Example words from American Sign Language [2].

ASL word | Definition
Hello    | Place hand on forehead as if to salute (but not as rigid). Move hand outward, ending up with palm facing forward in the air just a few inches from the forehead.
I (me)   | Point to self, touching the center of the chest.
meet     | Bring “d” hands together, palm to palm.
Handshape recognition using neural networks was discussed in detail in Section 3. The raw data from
the glove vary from 0 to 256 and are scaled from −1 to 1 before being used as input to the network. The
resilient backpropagation algorithm is used to train the network. The maximum number of epochs is set to
1000. The error goal is set to 10^−5. The network, trained using 12 samples of each handshape, achieved
98% recognition accuracy.
A part of the output of the data collection algorithm is fed to the handshape neural network. It is the
data from 15 finger bending angles at T selected time instants. The output of the network is a sequence of T
numbers at T time instants. Each number is between 1 and 36, representing the observed handshape at a given
time instant.
5.2. Hand orientation recognition
Handshape is a very important feature for ASL signs, but some ASL words share the same handshape, so
handshape alone is not enough to distinguish them. Orientation is used to distinguish between ASL words
that have similar handshapes. The term “orientation” refers to the position of the palm and the direction
the palm is facing. For example, a palm placed on the signer's own chest would mean “mine”, while a palm
facing toward the viewer would mean “your”. The total possible orientations of the hand are divided into
12 classes: four orientations about each of three axes.
Similar to a handshape classifier, a multilayer feedforward neural network is used for classifying the hand
orientation. The Flock of Birds receiver is mounted on the right-hand wrist and it tracks the wrist position (x,
y, z) and wrist orientation (alpha, beta, gamma). Right-hand wrist alpha, beta, and gamma angle data are used
for classifying the orientation. The angle values are between −180 and 180; they are normalized between −1
and 1. During the training and testing of the hand orientation network, 20 normalized angle values are used.
MATLAB was used for training and testing the ANN model. A logarithmic sigmoid transfer function is used in
the hidden layer and the output layer. Network training is done with the resilient backpropagation algorithm.
When the network success reaches 100% during offline testing, the trained network parameters, such as the
weights and biases, are stored in a text file. The network is then tested with real-time data using the
stored parameters. The accuracy of the system is about 96%.
5.3. Hand position recognition
The ASL signing area is the hemispherical space in front of the signer. ASL words change meaning with hand
position, even when the sign has the same handshape and hand orientation. The classification of possible
hand positions is quite subjective. A number of positions are specified in [34], but division into a large
number of classes would reduce the recognition accuracy in our case. Based on
our own judgment, we have divided the total space around the signer into 7 regions as shown in Figure 11. The
black dots represent approximate centers of the different regions. We use a right-handed glove, and the hand
occupies the right side of the body more often as compared to the left side. Hence, more divisions are made
on the right side of the body than the left side. The (x, y, z) values obtained from Flock of Birds are used for
position recognition. These values are measured with respect to the magnetic sensor. Thus, (x, y, z) depend
on the position in which the user is sitting, which can vary. To overcome this problem, we initialize the origin
with respect to the user’s nose. The user just needs to touch his nose and hit a particular key before using the
system.
Figure 11. Classification of hand position.
Similar to the handshape network, a two-stage feedforward neural network is used for position recognition.
The three relative coordinates (x, y, z) are the input to the position recognition network. The (x, y, z)
values can vary from −92 to 92 cm and are scaled from −1 to 1. The network has 3 input layer neurons, 10
hidden layer neurons, and 7 output layer neurons. A logarithmic sigmoid function is used in both layers. The
resilient backpropagation algorithm is used to train the network. The network, trained using 12 samples of
each position, achieved a recognition accuracy of 93%. The output of the network is a sequence of T numbers,
each an integer in the interval 1 to 7.
5.4. Hand movement recognition
The ‘hand movement’ at a given instant is divided into 6 classes: +X, −X, +Y, −Y, +Z, and −Z. The values
of (Δx, Δy, Δz) at a given time instant are the input to the movement recognition function; they are
obtained by taking the difference between the (x, y, z) positions at the given and previous time instants.
If-else logic identifies the axis with the maximum magnitude of deviation and the direction of movement
along it, which together decide the ‘movement class’. As with the previous features, the output of this
function is a sequence of T numbers, each an integer in the interval 1 to 6.
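The if-else movement classifier can be sketched as follows (the mapping of classes to the numbers 1–6 is our assumption; the paper only lists the six classes):

```python
def movement_class(dx, dy, dz):
    """Classify instantaneous movement into one of 6 classes:
    1:+X  2:-X  3:+Y  4:-Y  5:+Z  6:-Z  (numbering is our assumption).
    Picks the axis with the largest absolute deviation, then its sign."""
    deltas = {"x": dx, "y": dy, "z": dz}
    axis = max(deltas, key=lambda a: abs(deltas[a]))
    base = {"x": 1, "y": 3, "z": 5}[axis]
    return base if deltas[axis] >= 0 else base + 1
```

Because only the dominant axis is kept, diagonal motion is quantized to its strongest component, which keeps the observation alphabet small for the probabilistic model.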
6. Probabilistic model for sign recognition
Further processing of the data is done using a probabilistic model based on the concepts of Markov chains
and HMMs [33]. The processing of the raw data by the four feature recognition functions is quite rigorous
and gives a good deal of information about the given sign. For each sign, we get four observation sequences
of length T. Each observation in the handshape sequence takes a value in {1, 2, ..., 36}. Similarly, the
observations in the orientation, position, and movement sequences take values in {1, 2, ..., 12},
{1, 2, ..., 7}, and {1, 2, ..., 6}, respectively. The sequence of four features for a given sign can be
unique, so the processing from here need not be complicated. Hence, we have developed a simple model based
on a probabilistic approach that can analyze this information and give a result with little computation
compared to a standard HMM with multiple states [35].
6.1. Training phase
Creation of the training database based on the ASL signs in the current vocabulary of the system is described
here. Matrix H is associated with the handshape feature and gives the probability of occurrence of all possible
handshapes in all the words in this vocabulary. If Hj is the column vector of matrix H associated with word
j , then the element Hij represents the probability of occurrence of handshape i in word j .
To train a given word, say the jth word in the vocabulary, the raw data for the given samples of that
word are stored. The raw data consist of the 21 values from the glove and motion tracker during the signing
period. The data are processed through the data selection algorithm and then through the four feature
recognition functions, and the four observation sequences corresponding to the four features are recorded.
If nj samples of word j are used to train the system, then there will be nj sequences of handshape
observations, each of length T. Element Hij is then calculated as the ratio of the number of times handshape
i appears in word j to the total number of readings in the training samples of word j, i.e. nj T.
The matrix H is created by going through the above process for all the words to be trained. Thus, if
there is a vocabulary of N words, then matrix H will be of size (36 × N). Similarly, matrices R, L, and M
are created, which are associated with the hand orientation, hand location, and hand movement.
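Building the matrix H from the recorded handshape sequences might look like the following sketch (the function name and data layout are our assumptions, not from the paper):

```python
import numpy as np

N_HANDSHAPES = 36

def train_handshape_matrix(sequences_per_word):
    """Build matrix H (36 x N) from training data.
    sequences_per_word[j] is a list of n_j observation sequences for
    word j, each of length T, with handshape labels in 1..36.
    H[i-1, j] = (occurrences of handshape i in word j) / (n_j * T)."""
    N = len(sequences_per_word)
    H = np.zeros((N_HANDSHAPES, N))
    for j, sequences in enumerate(sequences_per_word):
        total = sum(len(seq) for seq in sequences)  # n_j * T readings
        for seq in sequences:
            for obs in seq:
                H[obs - 1, j] += 1
        H[:, j] /= total  # counts -> probabilities of occurrence
    return H
```

Matrices R, L, and M would be built the same way over their own observation alphabets (12, 7, and 6 symbols).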
6.2. Recognition phase
For an unknown word, there are four observation sequences after processing the raw data through the
feature recognition functions. If O is the observation matrix for an unknown word, it will have a size of
(4 × T). Each of its rows is an observation vector of length T corresponding to one of the four features.
Let h be the handshape observation vector and h[i] be the handshape observed at time instant i. The goal is
to find the closeness of this observation vector to each of the N words in the database. A ‘handshape
probability index’ of observation O for word j in the trained database is defined as:

P(O|Hj) = Π (i = 1 to T) Hj[h[i]]    (3)

where Hj is the vector representing the jth column of matrix H. It is assumed that each observation in a
sequence is independent of the other observations in that sequence. Hence, the likelihood of a sequence of
observations appearing together can be expressed as the product of the probabilities of the individual
observations in that sequence.
Similarly, the ‘orientation probability index’, ‘location probability index’, and ‘movement probability
index’ can be calculated for the unknown observation O . The ‘total probability index’ of the observation O for
the word j is defined as:
P(O|Wj) = fh · P(O|Hj) + fr · P(O|Rj) + fl · P(O|Lj) + fm · P(O|Mj).    (4)

fh, fr, fl, and fm in the above equation are the weight factors corresponding to the four features, which will
be discussed later. This process of calculating the total probability index is repeated for all the N words in the
available vocabulary. Finally, the output of the recognition system, w , is the sign or word corresponding to the
maximum of the total probability index values of observation O for all the words in the vocabulary, W .
w = argmax(P (O|Wj)|j = 1, 2, ..., N) (5)
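Eqs. (3)–(5) can be sketched as follows (an illustrative sketch; the dict-based bookkeeping and function names are our own choices):

```python
import numpy as np

def probability_index(col, obs):
    """Eq. (3): product of per-instant probabilities for one feature.
    col is one column of a trained matrix; obs holds 1-based labels."""
    return float(np.prod([col[o - 1] for o in obs]))

def recognize(obs_by_feature, matrices, weights):
    """Eqs. (4)-(5): weighted total probability index, then argmax.
    obs_by_feature / matrices / weights are dicts keyed by feature name."""
    N = next(iter(matrices.values())).shape[1]  # vocabulary size
    totals = []
    for j in range(N):
        total = sum(weights[f] * probability_index(matrices[f][:, j],
                                                   obs_by_feature[f])
                    for f in matrices)
        totals.append(total)
    return int(np.argmax(totals))  # index j of the recognized word
```

Note that a single unseen observation zeroes a word's product in Eq. (3), so the weighted sum over the four features in Eq. (4) keeps recognition robust when one feature disagrees.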
This word output string is displayed on the computer screen in the Open Inventor interface. The voice of the
output string is generated using Microsoft Speech SDK. After all these processes, the system continues to run
and collect the data at 33.33 Hz until the sign recognition function is triggered again by the velocity network.
The four weight factors indicate the importance of each feature. These factors are determined by
considering the available vocabulary of signs. For example, consider the handshape weight factor fh. One
half of the available samples is used as the training database and the remaining half as the recognition
database. The recognition database samples are tested for word recognition considering the handshape
feature only. Since the
targets or the correct answers corresponding to the tested signs are already known, the target results can be
compared with the recognition results obtained by considering the handshape feature only. The factor fh is
calculated by finding the ratio of the number of accurate results using only the handshape feature to the total
number of tested samples.
The factor fh is essentially the accuracy ratio for the handshape feature. Similarly, the factors fr , fl ,
and fm can be calculated. They give an appropriate amount of importance to each of the four features.
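Since each weight factor is simply a single-feature accuracy ratio, its computation can be sketched as follows (hypothetical names; the single-feature recognition outputs are assumed to be precomputed on the held-out recognition database):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weight factor for one feature: the ratio of samples recognized correctly
// using that feature alone to the total number of tested samples.
double weightFactor(const std::vector<int>& singleFeatureOutput,
                    const std::vector<int>& target) {
    int correct = 0;
    for (std::size_t i = 0; i < target.size(); ++i)
        if (singleFeatureOutput[i] == target[i]) ++correct;
    return static_cast<double>(correct) / target.size();
}
```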
7. Interactive recognition and training
The system is capable of online training and updating of the database. Every time a new sample of any sign is
added, the elements of the database matrices, H , R , L and M , are updated corresponding to that sign. The
possible cases are given below.
2118
SARAWATE et al./Turk J Elec Eng & Comp Sci
Case 1: The recognition output is accepted and no change is made in the database.
Case 2: If the output is correct (the j th sign), then the database can be updated by recalculating the various probabilities for the j th sign. This case is useful when adding more samples of an existing sign to the database.
Hij = (Ci + Hij · T · nj) / (T · (nj + 1)) (6)
Case 3: If the user enters that the output is incorrect, he can then enter the name of the correct sign. After
entering the correct name, there are two options:
1. If the entered sign is present in the database (say the k th word), then the probabilities for the k th word
in the database are updated. The matrix update expression is same as in Case 2, except that the data
corresponding to the k th word are updated.
2. If the entered sign is not present in the database, it is a new sign. Hence, the database should be updated to include this (N+1)th sign: an extra column is added to matrix H corresponding to the new sign, and N is incremented to N+1.
Hi(N+1) = Ci / T (7)
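The update rules of Eqs. (6) and (7) can be sketched as below. The names are illustrative: counts[i] stands for Ci, the number of occurrences of class i among the T observations of the new sample, and n for nj, the number of samples already stored for the sign:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Eq. (6): fold a new sample into an existing sign's column of matrix H.
void updateColumn(std::vector<double>& column,
                  const std::vector<int>& counts, int T, int n) {
    for (std::size_t i = 0; i < column.size(); ++i)
        column[i] = (counts[i] + column[i] * T * n)
                    / static_cast<double>(T * (n + 1));
}

// Eq. (7): build the column for a brand-new (N+1)th sign.
std::vector<double> newColumn(const std::vector<int>& counts, int T) {
    std::vector<double> column(counts.size());
    for (std::size_t i = 0; i < counts.size(); ++i)
        column[i] = counts[i] / static_cast<double>(T);
    return column;
}
```

The same two routines apply unchanged to the R, L, and M matrices.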
The four weight factors are also updated with the addition of a new sample of any sign. The new sample is first recognized using only the handshape feature. If it is recognized correctly, then the factor fh is updated as given in Eq. (8).
fh = [fh(∑_{i=1}^{N} ni − 1) + 1] / ∑_{i=1}^{N} ni (8)
If the recognition output using only the handshape feature is incorrect, then the factor is updated as given in
Eq. (9). A similar procedure is used to update fr , fl , and fm .
fh = fh(∑_{i=1}^{N} ni − 1) / ∑_{i=1}^{N} ni (9)
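Eqs. (8) and (9) amount to a single running-average step; a sketch with illustrative names, where totalSamples stands for ∑ ni after the new sample has been counted:

```cpp
#include <cassert>
#include <cmath>

// Eqs. (8)-(9): online update of a feature weight factor. The weight moves
// toward 1 when the single-feature recognition of the new sample is correct,
// and toward 0 when it is not.
double updateWeight(double f, long totalSamples, bool correct) {
    double numerator = f * (totalSamples - 1) + (correct ? 1.0 : 0.0);
    return numerator / totalSamples;
}
```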
In order to develop a large vocabulary for the ASL recognition system, it should be trained using thousands of
words. It is difficult to gather the data for all words and train the system at once. According to the method in
Section 7.1, there will be a need to retrain the system every time a new sample of any sign is added. The module
discussed in this section overcomes this problem of retraining and allows easy extension of the vocabulary. This
makes the system capable of online training. The online training can be especially useful for human/robot
applications, as mentioned in [26].
8. Test results and discussion
The system is tested for a vocabulary of 40 commonly used words in English chosen from the ASL dictionary.
The words are selected so that some meaningful sentences can be constructed with them. This will be useful
in future work of sentence recognition. Hence, the selected words contain a few verbs as well as nouns and
adjectives. The words are chosen such that the signs only need one hand to be performed. This is because the
system has only one glove at present. The list of selected words is given in Table 3.
Table 3. List of ASL words used in the system.
No.  Word        No.  Word
1    About       21   Easy
2    Airplane    22   Eat
3    Alone       23   False
4    Always      24   Get away
5    Am          25   Give
6    And         26   Go
7    Another     27   Goodbye
8    Apologize   28   Hanger
9    Apple       29   Have to
10   Appoint     30   Hey
11   Bell        31   High
12   Bore        32   Hot
13   Careless    33   How many
14   Concept     34   I
15   Consume     35   Imagination
16   Curious     36   Interest
17   Delete      37   Interrogate
18   Deny        38   Invent
19   Disregard   39   Lion
20   Drink       40   Right
The tests are carried out using 15, 20, and 25 training samples of each word. The effect of the number of time instants (T) on the recognition accuracy is also studied. Figure 12 shows that the accuracy increases slightly
with more training samples. The accuracy increases with the number of time instants in the earlier stages, but
then it remains fairly constant. More investigation is needed to find the effect of the number of time instants
on the recognition accuracy. Time instants from 8 to 16 are found to give almost the same accuracy.
[Figure 12 plots recognition accuracy (%) against the number of time instants (0 to 20) for the probabilistic model with a vocabulary of 40 words, with separate curves for 15, 20, and 25 training samples; the accuracy axis runs from 80% to 100%.]
Figure 12. Test results of word recognition system.
The C++ code has been made flexible to accommodate a varying number of time instants. At present,
8 time instants are used in the real-time system. The corresponding accuracy is 92.5%. The code works well,
and words are displayed in the Open Inventor environment and spoken using the Microsoft speech synthesis
software without any noticeable lag. Figure 13 shows examples of words recognized by the system.
Figure 13. Sample recognition results.
9. Conclusion
A real-time ASL word recognition system has been successfully developed. The accuracy of the system can be
improved with higher numbers of training samples per sign. There is scope for improvement in position and
movement recognition. The number of classes for both the features can be increased to achieve more diverse
observations, which will result in better distinction between different signs. In the case of the position network, an additional motion tracker mounted on the user's neck or back can be used. This will remove the errors caused by the movement of the user during actual signing. In the case of movement recognition, data at more time instants can be used along with neural networks to identify movements such as clockwise, anticlockwise, and zigzag. This will create a more diverse classification of movements.
The probability model is based on simple calculations, yet it works well. This is because the four feature
recognition functions extract sufficient information from the raw data and make further processing easier.
However, it will be very interesting to test the system using standard HMMs with multiple states and a left-right
state model. With the left-right model, the states of the HMM can be assigned with time in a straightforward
manner. This can overcome the drawback of the current model, which assumes equal probability of occurrence
of the observations at different time instants in the sequence.
The system has some advantages over a complete neural network model for recognizing signs in ASL
[36], where a backpropagation neural network is used to recognize 50 signs in ASL. The input to the network is
the 21-element data at six time instants from the glove and motion tracker. Fifteen of the 21 inputs are from
the glove and hence the data from the glove are dominant in the total input and have more influence on the
output. In our system, we convert the raw data into features and each feature is assigned a weight factor, which
indicates its importance. The backpropagation network needs to be retrained for addition of new signs in the
system, but in our system, the online training module makes the addition of new words easier.
The system needs much improvement to be realistically useful for continuous ASL-to-English translation applications. However, the present system can be applied in the area of human/robot interfaces, as in [26–29].
The interactive updating allows easy addition of new commands to operate a machine or robot. The system
can also be useful in the preliminary stages of teaching ASL to a person.
Future work will involve expanding the current vocabulary of signs and testing the system on a larger vocabulary. Two gloves and three motion trackers can be used, which will allow the inclusion of signs that
use both hands. Further work will be directed towards the problem of sentence recognition by developing a
methodology for ASL grammar recognition and distinguishing continuously performed signs without pause.
Acknowledgment
This research was partially supported by a National Science Foundation award (DMI-0079404) and a Ford
Foundation grant, as well as by the Intelligent Systems Center at the Missouri University of Science and
Technology in the United States.
References
[1] National Center of Birth Defects and Developmental Disabilities. Data. Atlanta, GA, USA: CDC, 2012. Available
at http://www.cdc.gov/ncbddd/birthdefects/data.html.
[2] Sternberg MLA. American Sign Language Dictionary. 3rd ed. New York, NY, USA: Harper Perennial, 1994.
[3] Wysoski SG, Lamar MV, Kuroyanagi S, Iwata A. A rotation invariant approach on static gesture recognition
using boundary histograms and neural networks. In: Proceedings of the 9th International Conference on Neural
Information Processing; November 2002; Singapore. pp. 2137–2141.
[4] Vogler C, Metaxas D. ASL recognition based on a coupling between HMMS and 3D motion analysis. In: Proceedings
of the Sixth IEEE International Computer Vision Conference; January 1998; Bombay, India. pp. 363–369.
[5] Lalit G, Suei M. Gesture-based interaction and communication: automated classification of hand gesture contours.
IEEE T Syst Man Cy C 2001; 31: 114–120.
[6] Yang MH, Ahuja N. Recognizing hand gestures using motion trajectories. In: Ahuja N, editor. Face Detection and
Gesture Recognition for Human-Computer Interaction. Berlin, Germany: Springer, 1999. pp. 466–472.
[7] Vogler C, Metaxas D. Adapting hidden Markov models for ASL recognition by using three-dimensional computer
vision methods. In: Proceeding of the IEEE International Conference on Systems, Man and Cybernetics; October
1997; Orlando, FL, USA. pp. 12–15.
[8] Cui Y, Weng J. A learning-based prediction and verification segmentation scheme for hand sign image sequence.
IEEE T Pattern Anal 1999; 21: 798–804.
[9] Vogler C, Metaxas D. Parallel hidden Markov models for American Sign Language recognition. In: Proceedings of
the IEEE International Computer Vision Conference; September 1999; Kerkyra, Greece. pp. 116–122.
[10] Zaki MM, Shaheen SI. Sign language recognition using a combination of new vision based features. Pattern Recogn
Lett 2011; 32: 572–577.
[11] Fels SS, Hinton GE. Glove-Talk: A neural network interface between a data-glove and a speech synthesizer. IEEE
T Neural Networ 1993; 4: 2–8.
[12] Fels SS, Hinton GE. Glove-Talk II - A neural-network interface which maps gestures to parallel formant speech
synthesizer controls. IEEE T Neural Networ 1997; 8: 977–984.
[13] Waldron MB, Kim S. Isolated ASL sign recognition system for deaf persons. IEEE T Rehabil Eng 1995; 3: 261–271.
[14] Wu J, Gao W, Song Y, Liu W, Pang B. A simple sign language recognition system based on data glove. In:
Proceedings of Signal Processing ICSP; October 1998; Beijing, China. pp. 12–16.
[15] Gao W, Ma J, Wu J, Wang C. Sign language recognition based on HMM/ANN/DP. Int J Pattern Recogn 2000;
14: 587–602.
[16] Gaolin F, Gao W, Ma J. Signer-independent sign language recognition based on SOFM/HMM. In: IEEE ICCV
Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems; July 2001; Van-
couver, Canada. pp. 90–95.
[17] Wang C, Gao W, Shan S. An approach based on phonemes to large vocabulary Chinese Sign Language recognition.
In: Proceedings of 5th International IEEE Conference on Automatic Face and Gesture Recognition; May 2002;
Washington, DC, USA. pp. 393–398.
[18] Fang G, Gao W, Zhao D. Large vocabulary sign language recognition based on fuzzy decision trees. IEEE T Syst
Man Cyb 2004; 34: 122–131.
[19] Liang R, Ouhyoung M. A real-time continuous alphabetic sign language to speech conversion VR system. Comput
Graph Forum 1995; 14: 67–74.
[20] Liang R, Ouhyoung M. A sign language recognition system using hidden markov model and context sensitive search.
In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology; June 1996; Hong Kong. pp.
59–66.
[21] Oz C, Leu MC. Linguistic properties based on American Sign Language recognition with artificial neural network
using a sensory glove and motion tracker. Neurocomputing 2007; 70: 2891–2901.
[22] Oz C, Leu MC. American Sign Language word recognition with a sensory glove using artificial neural networks.
Eng Appl Artif Intel 2011; 24: 1204–1213.
[23] Kim J, Jang W, Bien Z. A dynamic gesture recognition system for the Korean Sign Language (KSL). IEEE T Syst
Man Cyb 1996; 26: 1144–1148.
[24] Qutaishat M, Moussa H, Bayan T, Hiba AA. American sign language (ASL) recognition based on Hough transform
and neural networks. Expert Syst Appl 2007; 32: 24–37.
[25] Stergiopolou E, Papamarkos N. Hand gesture recognition using a neural network shape fitting technique. Eng Appl
Artif Intel 2009; 22: 1141–1158.
[26] Lee C, Xu Y. Online, interactive learning of gestures for human/robot interfaces. In: Proceedings of IEEE Interna-
tional Conference on Robotics and Automation; April 1996; Minneapolis, MN, USA. pp. 124–128.
[27] Wang H, Leu MC, Oz C. American Sign Language recognition using multi-dimensional hidden Markov models. J
Inf Sci Eng 2006; 22: 1109–1123.
[28] Damasio FW, Musse SR. Animating virtual humans using hand postures. In: Proceedings of Symposium on
Computer Graphics and Image Processing; October 2002; Rio de Janeiro, Brazil. pp. 437–442.
[29] Weissmann J, Salomon R. Gesture recognition for virtual reality applications using data gloves and neural networks.
In: International Joint Conference on Neural Networks; July 1999; Washington, DC, USA. pp. 2043–2046.
[30] Wang HG, Sarawate NN, Leu MC. Recognition of American Sign Language gestures with a sensory glove. In: Japan
USA Symposium on Flexible Automation; July 2004; Denver, CO, USA. pp. 102–109.
[31] Sturman DJ, Zelter D. A survey of glove-based input. IEEE Comput Graph 1994; 14: 30–39.
[32] Lippman RP. An introduction to computing with neural nets. IEEE ASSP Mag 1987; 4: 4–22.
[33] Reidmiller M, Braun H. A direct adaptive method for faster backpropagation learning: the RPROP algorithm.
In: Proceedings of the IEEE International Conference on Neural Networks; March–April 1993; San Francisco, CA,
USA. pp. 586–591.
[34] Wilbur RB. American Sign Language: Linguistic and Applied Dimensions. 2nd ed. Boston, MA, USA: College Hill
Publication, 1987.
[35] Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. P IEEE 1989; 77:
257–286.
[36] Oz C, Sarawate NN, Leu MC. American Sign Language word recognition with a sensory glove using artificial neural
networks. In: ANNIE Conference; November 2004; St. Louis, MO, USA. pp. 563–569.