
Automatic Recognition of

Auslan Finger-spelling

using Hidden Markov

Models

Paul Goh

This report is submitted as partial fulfilment of the requirements for the Honours Programme of the School of Computer Science and Software Engineering,

The University of Western Australia, 2005


Abstract

In recent years, gesture recognition has received much attention from research communities. Computer vision-based gesture recognition has many potential applications in the area of human-computer interaction as well as sign language recognition. Sign languages use a combination of hand shapes, motion and locations as well as facial expressions. Finger-spelling is a manual representation of alphabet letters, which is often used where there is no sign word to correspond to a spoken word. In Australia, a sign language called Auslan is used by the deaf community, and the finger-spelling letters use two-handed motion, unlike the well-known finger-spelling of American Sign Language (ASL) that uses static shapes.

This thesis presents the Auslan Finger-spelling Recognizer (AFR), a real-time system capable of recognizing signs that consist of Auslan manual alphabet letters from video sequences. The AFR system has two components: the first is the feature extraction process, which extracts a combination of spatial and motion features from the images; the second classifies a sequence of features using Hidden Markov Models (HMMs). Tests using a vocabulary of twenty signed words showed the system could achieve 97% accuracy at the letter level and 88% at the word level using a finite state grammar network and embedded training.

Keywords: Sign Language Recognition, Gesture Recognition, Computer Vision, Optical Flow, Hidden Markov Models
CR Categories: I.4.6 [Image Processing and Computer Vision]: Segmentation, I.4.7 [Image Processing and Computer Vision]: Feature Measurement, I.4.8 [Image Processing and Computer Vision]: Motion, I.5.4 [Pattern Recognition]: Face and Gesture Recognition


Acknowledgements

First and foremost, I would like to thank my supervisor, Dr. EJ Holden, for her support, guidance and patience throughout the Honours year. She has helped me achieve and learn a great deal and has made this year a very rewarding experience.

Additionally, I extend my thanks to Dr. Garreth Lee for providing the C source code for his HTK Recognizer, on which the Auslan Fingerspelling Recognizer's Recognition Module is based.

Last, but not least, I thank my parents and grandparents for providing the backup support which has helped me survive many a late night in the Computer Science labs.


Contents

Abstract ii

Acknowledgements iii

1 Introduction 1

2 Literature Review 5

2.1 Static Gesture Recognition . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Lockton and Fitzgibbon . . . . . . . . . . . . . . . . . . . 5

2.1.2 Birk and Moeslund . . . . . . . . . . . . . . . . . . . . . . 6

2.1.3 Lamar et al. . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.4 Hasanuzzaman et al. . . . . . . . . . . . . . . . . . . . . . 7

2.2 Dynamic Gesture Recognition . . . . . . . . . . . . . . . . . . . . 7

2.2.1 Starner and Pentland . . . . . . . . . . . . . . . . . . . . . 8

2.2.2 Starner, Weaver and Pentland . . . . . . . . . . . . . . . . 9

2.2.3 Grobel and Assan - Netherlands Sign Language . . . . . . 9

2.2.4 Holden - Auslan . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.5 Holden, Lee and Owens . . . . . . . . . . . . . . . . . . . 11

2.2.6 Vogler and Metaxas . . . . . . . . . . . . . . . . . . . . . . 12

2.2.7 Bowden et al. . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.8 Efros et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.9 Cutler and Turk . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3 The Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . 15

3 The Auslan Finger-spelling Recognizer (AFR) 16

3.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1.1 Hand Detection . . . . . . . . . . . . . . . . . . . . . . . . 16


3.1.2 Feature Set . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2.1 Training - Estimating Model Parameters . . . . . . . . . . 25

3.2.2 Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.3 AFR Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3.1 Capture Module . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3.2 Feature Extractor . . . . . . . . . . . . . . . . . . . . . . . 29

3.3.3 Training Module . . . . . . . . . . . . . . . . . . . . . . . 29

3.3.4 Recognition Module . . . . . . . . . . . . . . . . . . . . . 29

4 Experimental Results 32

4.1 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.1 Camera Orientation . . . . . . . . . . . . . . . . . . . . . . 32

4.1.2 Lighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.3 Background and other Restrictions . . . . . . . . . . . . . 32

4.2 Hand Tracking Results . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3 Feature Classification . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.3.1 Experiment 1 - Isolated Training . . . . . . . . . . . . . . 36

4.3.2 Experiment 2 - Embedded Training . . . . . . . . . . . . . 36

4.3.3 Recognition Results . . . . . . . . . . . . . . . . . . . . . . 37

4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5 Conclusion and Future Development 41

A Listing of HMM Algorithms 42

A.1 Notation used in this section . . . . . . . . . . . . . . . . . . . . . 42

A.2 Baum-Welch Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 42

A.3 Viterbi Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

B Original Honours Proposal 45


List of Tables

2.1 The feature set used by Grobel and Assan [8] . . . . . . . . . . . 10

2.2 The linguistic feature vector used by Bowden [4] . . . . . . . . . . 14

4.1 The AFR Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2 Experiment 1 Results - Isolated Training . . . . . . . . . . . . . . 37

4.3 Experiment 2 Results - Embedded Training . . . . . . . . . . . . 37


List of Figures

1.1 Auslan fingerspelling signs. Note that J and H contain explicit motion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Illustration of features used by Holden et al. [11] . . . . . . . . . . 12

3.1 Results of the skin detection algorithm . . . . . . . . . . . . . . . 17

3.2 Plot of optical flow vectors. This image has been enlarged for this illustration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3 Separated Optical Flow Channels. These images have been thresholded and inverted for display purposes. . . . . . . . . . . . . . . 22

3.4 The optical flow histograms corresponding to velocities shown in Figure 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.5 The seven state HMM used in this project . . . . . . . . . . . . . 25

3.6 Four main modules of the AFR System. Numbers in the image are not part of the system and are for explanation purposes only. 28

3.7 The AFR System . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.1 Sequence of signs for the letter "B" . . . . . . . . . . . . . . . . . 33

4.2 Sequence of signs for the letter "C" . . . . . . . . . . . . . . . . . 33

4.3 Sequence of signs for the letter "X" . . . . . . . . . . . . . . . . . 34

4.4 Motion descriptor for Frame 1 of B . . . . . . . . . . . . . . . . . 34

4.5 Motion descriptor for Frame 1 of C . . . . . . . . . . . . . . . . . 34

4.6 The Recursive (no grammar) Network . . . . . . . . . . . . . . . . 35

4.7 Example of a grammar network. The blue and red lines show the paths for the words "BAD" and "AUS" respectively . . . . . . . . 36

4.8 Sequence of signs for the name "PAUL" . . . . . . . . . . . . . . 38

4.9 Sequence of signs for the name "HOLDEN" . . . . . . . . . . . . 38

4.10 Sequence of signs for the name "JANE" . . . . . . . . . . . . . . 39

4.11 Silhouette of hand regions for the signs "L", "M" and "R" . . . . 39


CHAPTER 1

Introduction

Sign language is a manual form of communication used by deaf communities. Unlike the incidental gestures which often accompany verbal conversation, sign language gestures are highly structured and have well defined meanings. Although sign languages vary across deaf communities in different countries, they all share in common the use of hand shapes, motions, orientations, locations and facial expressions. These are the phonetics, or basic units, of sign language.

Finger-spelling is a subset of signs used to represent the alphabet of a particular sign language. It is used to express words that are not found in the vocabulary of the sign language. One would use finger-spelling for names of people, places or technical words. For example, for the name 'Tom', the signer would perform the sign for the letter 'T', then 'O' and finally 'M'.

In the past decade, sign language recognition has attracted extensive research within computer vision communities. Many works on the vision-based recognition of various sign languages have been reported, including American Sign Language (ASL), British Sign Language (BSL), Greek Sign Language (GSL) and Australian Sign Language (Auslan).

Vision based sign language recognition is composed of the following two sub-problems:

• Feature Extraction

The goal of feature extraction is to determine a mathematical model for the sign gestures. Gestures can be modeled using a variety of features, including local features (e.g. hand shape and orientation) and global features (e.g. location, motion and trajectory).

• Feature Classification

The next phase classifies the extracted features as belonging to a specific gesture class. Some of the common methods of recognition are Template Matching, Artificial Neural Networks, Fuzzy Experts and Hidden Markov Models.

Figure 1.1: Auslan fingerspelling signs. Note that J and H contain explicit motion.

Within the School of Computer Science & Software Engineering at the University of Western Australia (UWA), the sign language translation group has been actively working on two-way translation between English and Auslan. To translate English to Auslan, a sign language display system that has a tutorial interface, namely the Auslan Tuition System (ATS) [27], was developed. Recently, they have developed a system which recognizes colloquial Auslan phrases [11]. The system, however, does not yet support the recognition of Auslan finger-spelling. Other works have also been reported on Auslan recognition; however, these use virtual reality gloves instead of computer vision techniques [26] [24].


This thesis presents a computer vision based, real-time fingerspelling recognition system called the Auslan Fingerspelling Recogniser (AFR). Unlike ASL fingerspelling that uses static postures of one hand, Auslan uses two-handed dynamic gestures for each letter, as shown in Figure 1.1.

The existing Auslan Recogniser [11] uses global features such as trajectory and geometric features such as the positions of the left and right hand relative to the face. These features are then recognized using Hidden Markov Models. Occlusion of the hands over the face was handled using a combination of active contour models (snakes) and motion cues. The system was developed using Matlab and the Hidden Markov Toolkit (HTK 3.1) [1].

There are two main limitations of the Auslan Recognizer. One is that it only handles the occlusion of a foreground moving object over a rather static background, which makes it difficult to recognize fingerspelling, which uses both moving hands. The other limitation is that the snake algorithm is computationally expensive, which makes it difficult to implement a real-time system.

Thus, the objectives of this project were to explore the following modifications to the Auslan Recogniser to support fingerspelling recognition:

• A new feature set which implicitly deals with occlusion by using the motion occurring within the occluding regions, without incurring heavy computational costs.

• Alternatives to Matlab to support the development of a real-time system.

• Investigating improvements in accuracy by using embedded training to refine HMM parameters.

These objectives have been fulfilled through the development of the AFR system, which consists of the following two modules, both implemented in C++ for the Windows platform:

1. Feature Extraction Module

This module extracts a combination of geometric features (obtained by image moments and Eigenspace analysis) and an optical flow motion descriptor adapted from Efros et al. [7], which has not yet been used in sign language recognition. Intel's OpenCV [13], which has a large library of computer vision algorithms, was chosen to support development.

2. Recognition Module


Like the Auslan Recognizer, sign gesture recognition is achieved using HMMs. However, an additional performance enhancement is obtained by using embedded training. The Hidden Markov Toolkit (HTK 3.1) [1] provides implementations of the main algorithms required for training and recognition. The module is capable of recognizing fingerspelling letters and words from video sequences in real-time.

This AFR system could be utilized in the Auslan Recogniser to support the recognition of fingerspelt letters and words, as well as a feedback mechanism in the Auslan Tuition System (i.e. to inform the user if he or she has correctly performed a sign).

This thesis has the following chapters:

• Chapter 2 - This chapter reviews existing research in the areas of sign language recognition and gesture recognition in general.

• Chapter 3 - This chapter describes the chosen methods and algorithms used in the proposed approach as well as the implementation details.

• Chapter 4 - This chapter presents experiments and results.

• Chapter 5 - The thesis concludes with a discussion of possible improvements in any future undertaking of this research.


CHAPTER 2

Literature Review

Computer vision based sign recognition systems, as well as human motion recognition systems in general, use various feature sets including coarse geometrical shape descriptors, fine finger configurations, as well as appearance based descriptors to represent signs. These features are then recognized as signs. Some systems recognize static hand shapes using pattern recognition techniques, such as template matching and neural networks. Others recognize dynamic gestures using HMMs or temporal neural networks.

In this chapter, the literature is divided into two categories. The first category consists of systems which recognize static hand gestures, while the second covers systems which recognize dynamic gestures.

2.1 Static Gesture Recognition

2.1.1 Lockton and Fitzgibbon

A recent work towards ASL finger-spelling recognition has been conducted by Lockton and Fitzgibbon [18]. Their system uses a single camera to capture a top-down view of the signer's hands. The signer's hands are tracked using a simple skin-detection algorithm. The signer's skin colour is sampled during a calibration phase. The pixel values from the sampled region form a cluster in the RGB colour-space. Pixels with colour values which lie inside the cluster are then classified as skin pixels. In order to deal with the problems of scale and rotation, the signer is required to wear a coloured wrist band. This is then used to transform the input image into a canonical frame.
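
The calibration step above can be sketched as follows. This is a minimal illustration rather than Lockton and Fitzgibbon's actual implementation: their exact cluster membership test is not described here, so an axis-aligned bounding box around the sampled pixel values stands in for the RGB cluster, and the function names are hypothetical.

```python
import numpy as np

def calibrate_skin(samples):
    """Fit an axis-aligned bounding box in RGB space around sampled skin pixels."""
    samples = np.asarray(samples, dtype=float)      # shape (n, 3)
    return samples.min(axis=0), samples.max(axis=0)

def skin_mask(image, lo, hi):
    """Mark pixels whose RGB values fall inside the calibrated cluster."""
    image = np.asarray(image, dtype=float)          # shape (h, w, 3)
    return np.all((image >= lo) & (image <= hi), axis=-1)
```

A real system would typically also apply morphological clean-up to the resulting mask before tracking.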

Their approach to recognition is based on nearest neighbour template matching. However, they acknowledge that the naive approach to template matching, which compares every single pixel in an input image to every single pixel in the exemplar set, is computationally expensive. They obtain a speed-up by performing clustering in the exemplar space. A further performance boost is achieved by employing a novel boosted classifier. Their tests report a 99.7% success rate for a set of 46 different hand shapes (26 letters, 10 numbers and 10 additional control gestures for the operation of a text editor).

2.1.2 Birk and Moeslund

Although Principal Component Analysis (PCA) has traditionally been used for face recognition, Birk and Moeslund [3] have adopted this technique to recognize hand postures of 25 signs from the International Hand Alphabet. PCA works by reducing the dimensionality of a data set in which there are many correlated variables, while retaining as much of the variation in the data set as possible. The data set is decomposed into a new set of variables, namely the principal components. They report an accuracy of 99.7% for a test set of 1500 images.
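
The PCA step can be sketched in a few lines: centre the data, eigen-decompose its covariance matrix and keep the eigenvectors with the largest eigenvalues. This is a generic sketch of the technique, not Birk and Moeslund's code; `pca_fit` and `pca_project` are hypothetical names.

```python
import numpy as np

def pca_fit(X, k):
    """Return the mean and top-k principal components of row-vector data X."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    components = vecs[:, ::-1][:, :k]      # keep the k highest-variance directions
    return mean, components

def pca_project(X, mean, components):
    """Project data into the reduced principal-component space."""
    return (X - mean) @ components
```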

2.1.3 Lamar et al.

Artificial Neural Networks can also be used to recognize static hand shapes. Lamar et al. [15] have researched the use of PCA and Neural Networks for recognizing Japanese finger-spelling. In their approach, the signer wears a glove with different colours for the four fingers, thumb and palm. Feature extraction is facilitated by thresholding for each of the desired colours in the RGB colour-space. Background noise is eliminated by performing dilation and erosion on the detected regions.

Once this is done, PCA is used to extract the following features for each region: the normalized centroids, eigenvalue ratios and angle of orientation. These features are used as input to a 3-layer feed-forward perceptron that is trained by back propagation [15]. A particular observation uncovered by their research is the importance of the fingers in the recognition of static hand gestures. A second neural network was trained to disregard the palm region in the classification process. For a training set of 1260 images (42 hand postures performed 30 times each) the system achieved an accuracy of 89.06% for the case where the palm was disregarded, as opposed to 85.24% with the palm included. This implies that the fingers are a more descriptive feature for static hand gestures.


2.1.4 Hasanuzzaman et al.

Hasanuzzaman et al. [9] have used a template matching technique in the development of a gesture interface for controlling a Sony AIBO robot. Template matching is one of the simplest methods for object recognition. There are two main phases to template matching: creating templates from training data and comparing input to the created templates for recognition. They employ a combined approach to template matching: a maximum correlation coefficient and a minimum distance classifier (Manhattan distance) between two images of the same size, in order to improve the recognition accuracy. In their approach, 480 images sub-sampled to a 60 x 60 resolution were used to generate templates for eight gesture classes. The first stage of the classification is performed by considering the maximum correlation coefficient, calculated by:

αt = Mt / Pt,  (0 < αt < 1)    (2.1)

where Mt is the total number of matching pixels in the input image and the t-th template, and Pt is the total number of pixels in the t-th template. The next stage of classification considers the minimum distance classifier:

δt = Σ (i = 1 to N) |Ii − Gt,i|    (2.2)

where I is the input image, Gt is the t-th template image and N is the total number of pixels in I and Gt. Using these two features, a given input is considered a match to the template t if and only if:

(αt > th1) AND (δt < th2)    (2.3)

where th1 and th2 are the correlation coefficient threshold and the minimum distance threshold respectively. The values used for these thresholds were not specified. It is also difficult to gauge the success of this approach as no hard results were presented in the paper.
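
Equations 2.1-2.3 translate directly into code. The sketch below assumes binary (silhouette) images so that "matching pixels" can be taken as the overlap of foreground pixels; as noted above, the paper does not give its exact match definition or threshold values, so those are assumptions here.

```python
import numpy as np

def match_ratio(binary_input, binary_template):
    """alpha_t of Eq. 2.1: matching pixels over template pixels."""
    M = np.logical_and(binary_input, binary_template).sum()
    P = binary_template.sum()
    return M / P

def manhattan_distance(image, template):
    """delta_t of Eq. 2.2: sum of pixel-wise absolute differences."""
    return np.abs(image.astype(int) - template.astype(int)).sum()

def is_match(binary_input, binary_template, th1, th2):
    """Eq. 2.3: accept only if both criteria hold."""
    return (match_ratio(binary_input, binary_template) > th1 and
            manhattan_distance(binary_input, binary_template) < th2)
```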

2.2 Dynamic Gesture Recognition

The static shape recognition technique has an inherent limitation for sign language recognition, as sign languages are mostly dynamic gestures which are continuously performed one after another. For example, sign phrases consist of a sequence of sign words and fingerspelling words consist of a sequence of sign letters. The following literature reports modeling and recognition techniques used for the temporal motion inherent in sign language gestures.

2.2.1 Starner and Pentland

The work of Starner and Pentland [22] pioneered vision-based sign language recognition. A desk-mounted camera was used to capture an angled top-down view of the signer. Their system facilitated real-time hand detection and tracking by requiring the signer to wear different coloured gloves on each hand (yellow on the right and orange on the left). A region growing algorithm is used to segment the hand regions in the image. Their algorithm scans the image until a pixel matching the colour of the gloves is found. The region is grown by checking the eight connected pixels for the appropriate colour. This process is iterated until both hand regions have been segmented.
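
The region growing step can be sketched as a breadth-first flood fill over an 8-connected neighbourhood. This is a generic reconstruction of the algorithm described above, not Starner and Pentland's code, and it starts from a seed pixel already known to match the glove colour.

```python
from collections import deque
import numpy as np

def grow_region(mask, seed):
    """8-connected region growing from a seed inside a boolean colour mask."""
    h, w = mask.shape
    region = np.zeros_like(mask, dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):                     # visit all 8 neighbours
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not region[ny, nx]:
                    region[ny, nx] = True
                    queue.append((ny, nx))
    return region
```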

The feature vector used for gesture modeling consisted of the center of gravity (centroid), area, angle of axis of least inertia and the eccentricity of the bounding ellipse of each hand. These features are extracted from the bitmap resulting from the region growing process, by performing image moments and Eigenspace analysis. This feature set represents locations and a coarse description of hand shapes.
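
These four features can be computed from central moments of the segmented region. The sketch below is a standard moments-based derivation rather than Starner and Pentland's implementation: the eigenvalues of the second-moment matrix give the axes of the bounding ellipse, from which the orientation and eccentricity follow.

```python
import numpy as np

def moment_features(region):
    """Centroid, area, axis angle and eccentricity from a boolean hand region."""
    ys, xs = np.nonzero(region)
    area = len(xs)
    cx, cy = xs.mean(), ys.mean()
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    angle = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)   # axis of least inertia
    # eigenvalues of the second-moment matrix give the bounding-ellipse axes
    common = np.sqrt(((mu20 - mu02) / 2) ** 2 + mu11 ** 2)
    lam1 = (mu20 + mu02) / 2 + common
    lam2 = (mu20 + mu02) / 2 - common
    eccentricity = np.sqrt(1 - lam2 / lam1) if lam1 > 0 else 0.0
    return cx, cy, area, angle, eccentricity
```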

Hidden Markov Models (HMMs) are used to provide temporal modeling and to achieve sign classification. Using 395 sentences (constructed from a vocabulary of 40 sign words) to train the system, and a further 99 for independent tests, they observed 91.3% accuracy without grammar. However, with a known grammar structure (a predefined order in which words can appear), an accuracy of 99.2% was achieved.
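
At recognition time, an HMM-based system recovers the most likely hidden state sequence with the Viterbi algorithm. The following is a textbook log-domain sketch for a discrete-observation HMM; the model parameters used in the test are illustrative, not Starner and Pentland's trained models.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most likely hidden-state path for a discrete-observation HMM (log domain)."""
    delta = np.log(start) + np.log(emit[:, obs[0]])   # best score ending in each state
    back = []                                         # backpointers per time step
    for o in obs[1:]:
        scores = delta[:, None] + np.log(trans)       # scores[from, to]
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(emit[:, o])
    path = [int(delta.argmax())]
    for bp in reversed(back):                         # trace the best path backwards
        path.append(int(bp[path[-1]]))
    return path[::-1]
```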

While these results were promising, Starner and Pentland uncovered several problems with their approach. Due to the coarse-grained descriptors of the hand, the recognizer could be confused by signs which only differ based on specific finger positions (minimal pairs). Another problem was the use of absolute positions. The system was trained to expect certain gestures in certain locations. Consequently, any slight variance in the signer's position within an image could confuse the system.


2.2.2 Starner, Weaver and Pentland

Starner et al. [23] addressed some of the problems described above in their later work with wearable computing. To capture video of the signer's hands, they used a cap-mounted camera. With this system, the hands were tracked based on skin-tone, thus removing the requirement for coloured gloves. Using unadorned hands, they built an a priori model of skin colour based on the assumption that different skin-tones have approximately the same range of hue and saturation values, and differ mainly in intensity. This model was used with the region growing approach in [22] to track a signer's hands. This hand tracking approach did not deal with occlusions.

A few minor changes were made with respect to the representation of hands. In addition to the features used in their earlier work, the delta change in the x and y co-ordinates of the centroids (the change of co-ordinates between each frame) is also included. Their experiments revealed that the cap-mounted system was 5.9% more accurate than the original system. They attribute this to the cap-mounted system being less susceptible to body rotation and occlusion problems [23]. Despite this advancement, the system still does not address the problem of differentiating between minimal pairs of signs.

2.2.3 Grobel and Assan - Netherlands Sign Language

Grobel and Assan [8] have also used HMMs for the isolated recognition of the native sign language of the Netherlands. Unlike Starner and Pentland [22] [23], their system uses a frontal view of the signer, and a highly detailed feature set is used to model hand gestures. They require the signer to wear a glove with seven colours on the dominant hand (one colour for each finger, the palm and the back of the hand) and a glove of an eighth colour on the non-dominant hand. The features used in their system are described in Table 2.1.

This highly detailed model allows for a more accurate classification of hand gestures which have only small differences in hand shape. Moreover, the use of relative positions instead of absolute positions addresses the problems found in [22]. It is also important to note this feature vector's inclusion of features which describe finger orientation and shape.

Their system is capable of recognizing as many as 262 signs. Three training sets were collected for their experiments. The first set consisted of 10 samples per sign (2620 samples). For the second set, 5 samples per sign were collected from a second person (1310 samples). The third training set used the total samples of both persons. Experiments were conducted to test the reliance on collating sets


Hand               Parameter              Features
Dominant hand      Location               x co-ordinate relative to the central vertical body axis (1)
                                          y co-ordinate relative to the height of the right shoulder (1)
                   Shape and Orientation  Distances of centers of gravity of all colour areas to each other (20)
                                          Size of all colour areas (7)
                                          Angles of all fingers (2)
Non-dominant hand  Location               x co-ordinate relative to the central vertical body axis (1)
                                          y co-ordinate relative to the height of the right shoulder (1)
                   Shape and Orientation  Distances of centers of gravity of all colour areas to each other (20)
                                          Size of all colour areas (7)
                                          Angles of all fingers (2)
Both hands         Location               Distance between the COGs (centers of gravity) of both hands (1)

Table 2.1: The feature set used by Grobel and Assan [8]


in order to achieve accurate classification.

The best results were reported when recognition is limited to 43 words and the system is trained with 1310 samples: 99.2%. Even when the vocabulary is increased to 262 words, their system is still able to achieve a relatively high accuracy of 91.3%. However, it is important to note the extensive amount of training required to achieve these results.

2.2.4 Holden - Auslan

One of the earliest works on the visual recognition of Auslan is Holden's Hand Motion Understanding System [10]. This work is unique for its use of three-dimensional hand configuration features obtained from two-dimensional images. Hand positions were modeled according to the anatomical properties of the human hand. The model has 21 degrees of freedom, with 3D features extracted from 2D images using a model-based tracking technique.

Recognition was performed using an adaptive fuzzy expert system. The fuzzy rules use a high level description of hand posture and motion, defined by the anatomy of the hand skeleton, to describe a sign. The system was capable of recognizing 22 different signs with an accuracy of 99%.

Using computer vision techniques to extract these 3D features from 2D images is computationally expensive and less robust than using virtual reality gloves, which were used by Vamplew and Adams [24] for recognizing Auslan using temporal neural networks.

2.2.5 Holden, Lee and Owens

Holden et al. [11] have recently done work towards the recognition of colloquial Auslan. Their system used a single camera to capture video of the signer from a frontal view.

A skin detection algorithm based on Principal Component Analysis (PCA) of the RGB colour-space was used to detect the signer's head and hands. Using a set of training images, a skin-colour model was formed from the average colour component and a colour covariance matrix. This model forms a cluster in the PCA colour space. Hence, the skin area in images can be obtained by thresholding the colour components of each pixel against a Mahalanobis distance from the skin colour population model. The skin regions detected are further refined by using morphological operations to remove noise.
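
The skin model described above can be sketched as follows: fit a mean colour and covariance to training skin pixels, then threshold each pixel's Mahalanobis distance from the model. This is a simplified reconstruction (working directly in RGB rather than the PCA colour space of [11]), and the threshold value and function names are assumptions.

```python
import numpy as np

def fit_skin_model(skin_pixels):
    """Mean colour and inverse covariance of training skin pixels (n x 3 array)."""
    X = np.asarray(skin_pixels, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return mean, np.linalg.inv(cov)

def mahalanobis_skin_mask(image, mean, inv_cov, threshold):
    """Pixels within a Mahalanobis distance of the skin-colour model."""
    diff = image.reshape(-1, 3).astype(float) - mean
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)  # squared distances
    return (np.sqrt(d2) <= threshold).reshape(image.shape[:2])
```

Morphological opening and closing would then be applied to the mask to remove noise, as described above.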


Figure 2.1: Illustration of features used by Holden et al. [11]

Unlike previous approaches which use colour-coded gloves to handle occlusion, Holden et al. have used a combination of active contour models (snakes) and temporal motion cues to explicitly handle the occlusion of unadorned hands.

Their feature set uses geometric properties of the current positions of the head and hands with respect to their positions in the previous frame: the centroids of the face, right hand and left hand at time t (Ft, Rt, Lt) and at time t-1 (Ft−1, Rt−1, Lt−1), as illustrated in Figure 2.1.

Their feature vector thus consisted of the angle between the two arm vectors, the moving directions of the right and left hands, the bounding ellipse of each hand and the ratio between the areas of each hand. This feature set was found to be invariant to scale and rotation.

Recognition is performed using Hidden Markov Models. In order to recognize sentences, a grammar graph was constructed to describe permissible combinations of words. To train and test the system, 379 utterances of 15 different sentences utilizing 21 sign words were used. 216 of these were used for training and the remaining 163 for independent tests. Their system exhibited a 97% accuracy at the sentence level and as high as 99% at the word level.

2.2.6 Vogler and Metaxas

The previously described works have all used 2D appearance based features. Alternatively, motion can be recognized using a 3D model of the hand and arms.


Such an approach has been used by Vogler and Metaxas [25] for the recognition of ASL. They used a set of three orthogonally positioned cameras to capture videos of the signer. This was done in order to alleviate problems arising from occlusion. Using a method based on deformable models, the three-dimensional motion parameters of a subject's arms are obtained from the multiple images. In their approach, the hand and lower arm are modeled as one part. They argue that the overall arm movement carries sufficient information for the recognition of ASL.

Their feature vector consists of the polar position co-ordinates, velocities and wrist orientation angles in three-dimensional space. Unlike Starner et al. [22] [23], they use a strong grammar network. For a training set of 389 sequences and a test set of 97 examples, their system exhibited 89.91% accuracy, compared to an accuracy of 83.63% for two-dimensional data (obtained by projecting the 3D data onto a plane [25]). The clear disadvantage of this approach is the computational cost of obtaining these features. Moreover, the need for three cameras is not practical in general.

2.2.7 Bowden et al.

One of the major problems with the previously described approaches to sign language recognition is the dependence of accuracy on the collation of a comprehensive set of training data. Moreover, as shown by Grobel [8], these systems are only really accurate if the same person trains and tests the system. The training is non-transferable. To overcome these limitations, Bowden et al. [4] have proposed a novel two-stage classification based on sign linguistics.

In their system, hands are tracked using colour-coded gloves. The first stage of classification extracts raw information representing the shapes and trajectories of the hands, which is subsequently converted to a high-level sign language viseme (the sign language equivalent of a phoneme) representation. The features are as described in Table 2.2.

HA represents the positions of the hands relative to each other. TAB repre-sents the position of the hands relative to key body locations. SIG represents therelative movements of the hands and DEZ represents basic hand shapes. Thisbroad level of description generalizes the features representing a sign and thusreduces the requirements for further stages of classification.

For the second stage, each sign is modeled as a 1st order Markov chain for classification. However, this introduces the problem of mapping feature vectors to states in the Markov chain. Minor variances in instances of signs classified in the first stage can lead to noise in the feature vector, which confounds accurate recognition. To handle this, Independent Component Analysis (ICA) is used to separate correlated features from noise [4].

HA                      TAB                   SIG                       DEZ
Right hand high         The neutral space     Hand makes no movement    5
Left hand high          Face                  Hand moves up             A
Hands side by side      Left side of face     Hands move down           B
Hands are in contact    Right side of face    Hands move left           G
Hands are crossed       Chin                  Hand moves right          H
                        Right shoulder        Hands move apart          V
                        Left shoulder         Hands move together
                        Chest                 Hands move in unison
                        Stomach
                        Right hip
                        Left hip
                        Right elbow
                        Left elbow

Table 2.2: The linguistic feature vector used by Bowden [4]

For their intended lexicon of 49 signs, they recorded a single person performing the signs at an average of 5 repetitions per sign, resulting in a total of 249 signs. A single instance of each sign was selected for training while the remainder was used as an unseen test set. The training resulted in a classifier bank of 49 Markov chains. Their tests showed that the ICA-transformed data yielded an 84% classification rate as opposed to 73% on the untransformed data. A further classification boost was observed when ambiguous signs were removed from the vocabulary.

2.2.8 Efros et al.

Efros et al. [7] developed a human motion recognition system, specifically designed to recognize sports actions. They introduce a novel motion descriptor based on pixel-wise optical flow measurements. They argue that this is the most natural technique for capturing motion independent of appearance. To obtain their motion descriptor, the pixel-wise optical flow of a given sequence is first computed from subsequent frames. The optical flow vector field is then split into the horizontal and vertical velocity components, Fx and Fy, which are then decomposed into four non-negative channels, Fx+, Fx−, Fy+ and Fy−. This motion descriptor is then used to classify various sports actions (tennis, ballet and soccer) in a nearest neighbour framework. Even though their work does not deal directly with gesture recognition, their approach of using optical flow presents itself as a potential feature vector which is invariant to the performer's appearance.

2.2.9 Cutler and Turk

Another application of optical flow was reported by Cutler and Turk [6]. Their approach is based on the segmentation of the optical flow vector field into motion blobs. They use the number of detected motion blobs, the absolute motion of the blobs, the relative size of the blobs and the relative distance between the blobs as features for their rule-based recognition approach. The system is used to recognize simple gestures such as clapping hands or flapping wings.

2.3 The Proposed Approach

The works on static gesture recognition focus mainly on gestures which use only one hand [18] [3] [9]. This indicates that such an approach would not be appropriate for the recognition of Auslan finger-spelling, which uses both hands and is inherently dynamic.

Hence, a dynamic gesture recognition approach using Hidden Markov Models has been adopted for the Auslan Finger-spelling Recognizer. Previous works have shown that HMMs can successfully model and recognize the temporal aspects of sign language [11] [22] [25] [8]. These works mainly differ in their methods of hand detection and the features used to model sign gestures. To facilitate real-time tracking and recognition, a simple skin detection method is used to extract the hand regions, from which coarse-grained geometric features are extracted. This is similar to the feature set used in [22].

Occlusion of the hands is a common problem in these approaches. Occlusion can be dealt with explicitly as in [11], but at great computational cost. On the other hand, occlusion can be ignored at the expense of discriminative information. An alternative is to deal with occlusion implicitly by including information on the motion of the two hands. Hence, Efros et al.'s [7] approach, which uses optical flow, was used to capture motion information from the two hands when occlusion occurs. The motion information is then combined with the geometric features to form the full feature set.

Each of these methods is discussed in greater detail in the following chapter.


CHAPTER 3

The Auslan Finger-spelling Recognizer (AFR)

The AFR system consists of two main phases: Feature Extraction and Feature Classification. In the feature extraction phase, skin regions are detected and a set of features, which includes geometric features and an optical flow motion descriptor, is extracted from the video frames. The second phase classifies the sequence of features by utilising HMMs, as used by Holden et al. [11].

3.1 Feature Extraction

3.1.1 Hand Detection

Firstly, the hand regions must be extracted from the image frame. This is achieved by using a simple skin colour detection approach. The technique adopted for the proposed system relies solely on thresholding in the YCrCb colour space. YCrCb is a colour-space that is commonly used in video systems. The Y value represents the brightness level, while the Cr and Cb components represent the true colour information. The transformation from the RGB colour-space to YCrCb can be obtained by the following equations:

Y = 0.299 ∗ R + 0.587 ∗ G + 0.114 ∗ B (3.1)

Cr = (R − Y ) ∗ 0.713 + 128 (3.2)

Cb = (B − Y ) ∗ 0.564 + 128 (3.3)

The YCrCb colour-space has been found to be superior to other colour-spaces such as RGB and HSV [21] for skin detection. Chai and Bouzerdom [5] have claimed that the YCrCb colour-space provides good coverage of all human races and that pixels belonging to the skin region have similar Cr and Cb values. The perceived difference is governed by the intensity value (Y component). Other research has also indicated that Caucasian, Asian and African skin-pixels occupy the same region in the CrCb plane [17].

In my implementation, a pixel is classified as skin if both the Cr and Cb values of that pixel fall inside the respective ranges, RCr = [133, 173] and RCb = [77, 127] (as empirically determined by Chai and Ngan [20]); otherwise the pixel is classified as non-skin. This algorithm results in a binary image where the on pixels represent skin regions. The results of this process are shown in Figure 3.1(a) and Figure 3.1(b).
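As an illustration of Equations 3.1-3.3 and the threshold test above, here is a minimal Python sketch. The AFR itself is implemented in C++ with OpenCV, so this is only an illustrative re-statement; the function names are my own:

```python
def rgb_to_ycrcb(r, g, b):
    """Convert an RGB pixel to YCrCb (Equations 3.1-3.3)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128
    cb = (b - y) * 0.564 + 128
    return y, cr, cb

def is_skin(r, g, b, cr_range=(133, 173), cb_range=(77, 127)):
    """Classify a pixel as skin if its Cr and Cb values fall in the
    empirically determined ranges RCr and RCb."""
    _, cr, cb = rgb_to_ycrcb(r, g, b)
    return cr_range[0] <= cr <= cr_range[1] and cb_range[0] <= cb <= cb_range[1]
```

Applied to every pixel, this yields the binary skin mask of Figure 3.1(b); note the intensity Y is computed but deliberately ignored by the classifier.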

(a) Colour image of signer (b) Detected skin regions with noise

(c) Scale filtered skin regions

Figure 3.1: Results of the skin detection algorithm

To remove noise, morphological operations such as opening and closing can be used, but such operations are computationally expensive and would hinder real-time operation. Thus, scale filtering is used instead. This method relies on the assumption that the hands will be the largest skin regions in the image. Therefore any region with an area below a particular threshold is ignored. A threshold of 500 was empirically determined to be suitable for this implementation. A result of the scale filtering can be found in Figure 3.1(c). This process is executed in each frame to segment the hand regions.
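The scale-filtering step amounts to a connected-component pass that discards small regions. The following is a hedged pure-Python sketch rather than the AFR's OpenCV-based implementation; the `scale_filter` name and the choice of 4-connectivity are my assumptions, while the 500-pixel threshold is from the text:

```python
from collections import deque

def scale_filter(mask, min_area=500):
    """Zero out 4-connected skin regions smaller than min_area pixels.

    mask: 2D list of 0/1 values.  Returns a new, filtered mask.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                # Flood-fill one component, collecting its pixels.
                comp, queue = [], deque([(i, j)])
                seen[i][j] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                # Keep the component only if it is large enough.
                if len(comp) >= min_area:
                    for y, x in comp:
                        out[y][x] = 1
    return out
```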

3.1.2 Feature Set

The feature set devised for this project consists of two components: a set ofgeometric features and a motion descriptor based on optical flow.

Geometric Features

The geometric features are extracted using image moment analysis and Eigen analysis of the binary hand regions. Each hand region is represented by its area, angle of orientation, major axis length and minor axis length (the parameters of the region's bounding ellipse).

• Area:

The area, A, of a binary object is given by its zeroth moment:

A = ∫∫ B(x, y) dx dy    (3.4)

where

B(x, y) = 1 for points on the object, 0 for points on the background

• Angle of Orientation:

In order to determine the object's angle of orientation we first need to determine its 1st moments (centroid):

x̄ = ∫∫ x B(x, y) dx dy / A    (3.5)

ȳ = ∫∫ y B(x, y) dx dy / A    (3.6)

Next we calculate the 2nd moments as:

a = ∫∫ (x′)² B(x′, y′) dx′ dy′    (3.7)

b = ∫∫ x′ y′ B(x′, y′) dx′ dy′    (3.8)

c = ∫∫ (y′)² B(x′, y′) dx′ dy′    (3.9)


where (x′, y′) are the coordinates with respect to the center of mass:

x′ = x − x̄    (3.10)

y′ = y − ȳ    (3.11)

Finally, the angle of orientation, θ, can be obtained by:

θ = arctan( b / (a − c) )    (3.12)

• Major and Minor Axis Length of Bounding Ellipse:

The 2nd moments as determined in Equations 3.7 - 3.9 are used to form the covariance matrix shown below:

[ xx     xy/2 ]
[ xy/2   yy   ]    (3.13)

where

xx = a/A    (3.15)

xy = b/A    (3.16)

yy = c/A    (3.17)

The Eigenvalues of this covariance matrix can then be used to determine the major and minor axis lengths of the bounding ellipse of the detected hand regions. The major and minor axis lengths are defined as:

Minor axis length = 2.5 ∗ √min(xx, yy)    (3.18)

Major axis length = 2.5 ∗ √max(xx, yy)    (3.19)

These values are calculated for both the left and right hand regions in the image. My approach assumes the left hand to be the left-most object and the right hand to be the right-most object. If an occlusion occurs, which may be caused by the hands touching or overlapping from the viewing point, both the left and right hands are assigned the combined region. This occlusion problem is handled by the motion feature described in the following subsection.


The Motion Descriptor

Each Auslan finger-spelling sign is formed using varying hand motions. In order to represent motion, optical flow is used to extract motion information from the video sequence. Efros et al. [7] have used optical flow as a motion descriptor in their work on recognizing human sports actions.

Optical flow is defined as the apparent velocities associated with the displacement of brightness patterns in an image [12]. A major problem with optical flow estimation is the aperture problem: for a region with a strongly oriented intensity gradient (an edge), only the velocity component normal to the edge is available. Only image regions with strong higher-order intensity variations, such as corners or textured regions, do not suffer from this problem, and hence both velocity components can be obtained there. The human hand does not have very significant corners or texture features. Hence, in my approach, the optical flow is computed for the hand contours. This approach has previously been used in [14] for the tracking of lab animals.

The optical flow of the contour points in two subsequent frames is calculated using the Lucas-Kanade algorithm [19]. The first step of this process is to determine the image gradients in the x and y directions. These spatial derivatives can be calculated by convolving the image with 3 × 3 Sobel filters:

Horizontal Sobel Filter =
[ −1  −2  −1 ]
[  0   0   0 ]
[  1   2   1 ]

Vertical Sobel Filter =
[ −1  0  1 ]
[ −2  0  2 ]
[ −1  0  1 ]

Temporal gradients, on the other hand, are determined by taking the pixel difference between the current frame and the following frame. The difference image is smoothed by convolution with a Gaussian filter. These elements are used to construct the two-by-two coefficient matrix for a contour point.

Having constructed the coefficient matrix, the optical flow can be determined by solving the following linear equation for each of the desired contour points:

[ ∑Ix²    ∑IxIy ] [ vx ]   [ ∑IxIt ]   [ 0 ]
[ ∑IxIy   ∑Iy²  ] [ vy ] + [ ∑IyIt ] = [ 0 ]    (3.20)

where


Figure 3.2: Plot of optical flow vectors. This image has been enlarged for thisillustration.

Ix is the derivative of the image in the horizontal direction,

Iy is the derivative of the image in the vertical direction,

It is the temporal derivative between subsequent images,

vx is the optical flow velocity in the x direction and

vy is the optical flow velocity in the y direction.
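A minimal per-point solve of Equation 3.20 might look as follows in NumPy. This is a hedged sketch, not the AFR's implementation: for brevity it uses central differences (`np.gradient`) in place of the 3 × 3 Sobel filters and omits the Gaussian smoothing of the difference image:

```python
import numpy as np

def lucas_kanade_point(frame1, frame2, y, x, win=2):
    """Estimate (vx, vy) at pixel (y, x) by solving Equation 3.20
    over a (2*win+1) x (2*win+1) window around the point.

    Gradients use central differences as a stand-in for the Sobel
    filters of the text; It is the raw inter-frame difference.
    """
    Iy, Ix = np.gradient(frame1.astype(float))
    It = frame2.astype(float) - frame1.astype(float)
    sl = (slice(y - win, y + win + 1), slice(x - win, x + win + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    # Coefficient matrix and right-hand side of Equation 3.20.
    M = np.array([[ix @ ix, ix @ iy],
                  [ix @ iy, iy @ iy]])
    rhs = -np.array([ix @ it, iy @ it])
    vx, vy = np.linalg.solve(M, rhs)
    return vx, vy
```

On a synthetic textured image translated by one pixel, the solver recovers the unit flow vector.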

It is difficult to derive much meaning from a simple plot of the velocity vectors, vx and vy, as in Figure 3.2. However, if the velocity vectors are decomposed into non-negative channels for the x+, x−, y+ and y− directions in each frame, the motion properties for each hand can be observed more clearly. The formulae for the decomposition are shown in Equations 3.21 - 3.24. This approach of separating the optical flow field into channels was adapted from Efros' technique [7].

vx+ = vx where vx > 0 (3.21)

vx− = |vx| where vx < 0 (3.22)

vy+ = vy where vy > 0 (3.23)

vy− = |vy| where vy < 0 (3.24)
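The decomposition of Equations 3.21-3.24 is a one-liner per channel in NumPy (a sketch; the function name is mine):

```python
import numpy as np

def decompose_flow(vx, vy):
    """Split flow fields into four non-negative channels (Eqs. 3.21-3.24)."""
    return (np.maximum(vx, 0),    # v_x+ : rightward motion
            np.maximum(-vx, 0),   # v_x- : |vx| where vx < 0 (leftward)
            np.maximum(vy, 0),    # v_y+ : downward motion (image coords)
            np.maximum(-vy, 0))   # v_y- : |vy| where vy < 0 (upward)
```

By construction the original field is recoverable as vx = v_x+ − v_x−, so no information is lost by the split.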

Figure 3.3 shows the separated optical flow channels obtained when the signer is performing the sign for the letter 'A'. The black dots represent velocity values scaled to lie between 0 and 1. The right hand is moving towards the relatively static left hand. Hence, the x+ and y+ channels exhibit strong optical flow in the right hand and only latent flow in the left hand.

(a) Velocity x+ (b) Velocity y+ (c) Velocity x− (d) Velocity y−

Figure 3.3: Separated optical flow channels. These images have been thresholded and inverted for display purposes.

The x and y velocity values of the contour points were found to consistently lie between -4 and 4 (inclusive). Hence, an 11-bin histogram was constructed for each velocity field (Figure 3.4). This histogram represents the frequency of velocity vector values for each of the bin ranges. For example, the value in bin -4 represents the number of occurrences of velocity vectors which lie between -4 and -3.

The frequency values as represented by the histograms are used as the final component of the feature vector.
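The binning step can be sketched with `np.histogram`. The text quotes an 11-bin histogram, but the bin -4 example and the 26-parameter total below (8 geometric values plus 2 × 9 histogram values) suggest nine unit-width bins; the sketch assumes the latter, and that assumption is mine:

```python
import numpy as np

def velocity_histogram(v, lo=-4, hi=4):
    """Count contour-point velocities into unit-width bins.

    The bin labelled k counts velocities in [k, k+1), as in the text's
    example for bin -4.  Assumes nine bins spanning lo..hi+1.
    """
    counts, _ = np.histogram(v, bins=np.arange(lo, hi + 2))
    return counts
```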

Therefore the final feature vector has 26 parameters, consisting of the following:


(a) X Velocity Histogram

(b) Y Velocity Histogram

Figure 3.4: The optical flow histograms corresponding to the velocities shown in Figure 3.3


• Geometric features

– Left hand angle of orientation

– Left hand area

– Left hand major axis length

– Left hand minor axis length

– Right hand angle of orientation

– Right hand area

– Right hand major axis length

– Right hand minor axis length

• Motion-based features

– X-velocity optical flow histogram (bins for ranges -4 to 4)

– Y-velocity optical flow histogram (bins for ranges -4 to 4)

3.2 Hidden Markov Models

HMMs are stochastic models which can be used to model any time series that is assumed to be a first order Markov process. In recent years, HMMs have been used in various works on speech recognition and gesture recognition. They are useful for the recognition of sign languages, which can be viewed as series of hand postures and motions varying with time. It is, however, necessary to assume that the current state of the series being modeled depends only on the immediately preceding state (a first order Markov process).

Each hidden state of the model has a likelihood of producing an output observation, O. For the AFR, the output observations are represented by the feature vectors for each frame in a video sequence. Because the observations are not discrete symbols, the output probabilities for each state are represented by a mixture of Gaussians probability density function. Hence the HMM is parameterized as follows:

λ = (A, c, µ, Σ, π)

where:

A = state transition probabilities

c = weighting coefficients

µ = mean vectors

Σ = covariance matrices

π = initial state occupancy probabilities

Figure 3.5: The seven-state HMM used in this project

In this case, each manual alphabet letter is modeled by a single HMM. The topology of the HMMs is determined by estimating the number of states involved in performing a sign. Through an empirical process, a seven-state model with transitions was chosen for the system. The first and last states are non-emitting states. It has been suggested that better results could be achieved by specifically tailoring models for each sign [22]. However, this improvement is not investigated in this project.

The three fundamental problems to solve with HMMs are:

• Problem 1 - Evaluation

Given an observation sequence O = O1, O2, ... OT and a model λ = (A, c, µ, Σ, π), find the probability P(O|λ) of the observation sequence. Solving this problem gives the HMM its recognition capabilities.

• Problem 2 - Decoding

Given an observation sequence O = O1, O2, ...OT , and a model λ, find theoptimal state sequence which best describes the observation.

• Problem 3 - Estimation

Given an observation sequence and a model λ, adjust the model parameters in order to maximize P(O|λ). This problem is addressed during the training phase.
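Problem 1 is classically solved with the forward algorithm. The sketch below is a simplified illustration, not the AFR's implementation: it assumes scalar observations and a single Gaussian per state, whereas the AFR uses vector observations with Gaussian mixtures:

```python
import numpy as np

def gaussian(o, mu, var):
    """Scalar Gaussian density N(o; mu, var)."""
    return np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def forward_likelihood(O, pi, A, mu, var):
    """P(O | lambda) via the forward algorithm.

    O: observation sequence; pi: initial state probabilities;
    A: state transition matrix; mu, var: per-state Gaussian parameters.
    """
    N = len(pi)
    # Initialisation: alpha_1(j) = pi_j * b_j(O_1)
    alpha = pi * np.array([gaussian(O[0], mu[j], var[j]) for j in range(N)])
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    for t in range(1, len(O)):
        b = np.array([gaussian(O[t], mu[j], var[j]) for j in range(N)])
        alpha = (alpha @ A) * b
    # Termination: P(O | lambda) = sum_j alpha_T(j)
    return alpha.sum()
```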


3.2.1 Training - Estimating Model Parameters

The goal of training is to solve the third HMM problem: to maximize the probability of a model producing a given observation sequence. This can be achieved by isolated training and further refined using embedded training.

Isolated Training

The model parameters of a HMM are determined using the following process. Each of the training examples (sequences of feature vectors) is initially linearly segmented against the models, and subsequently Viterbi alignment is performed to provide initial parameter estimates for the HMMs [28]. These estimates are then refined by applying the Baum-Welch re-estimation formulae for isolated model training. Thus, the process consists of the following steps:

1. Perform Viterbi training to obtain initial estimates of the parameters and the observation-to-state alignment.

2. For every parameter vector requiring re-estimation, allocate storage for the numerator and denominator.

3. For each training observation, calculate the forward and backward probabilities, α and β.

4. Use the final accumulator values to calculate new parameter values.

5. If the value P = P(O|λ) for this iteration is not higher than the value from the previous iteration then stop; otherwise repeat the process using the re-estimated parameter values.

Refer to the Appendix for a complete listing of the formulae of this algorithm.

Embedded Training

Further refinement of the gesture models can be achieved by embedded training. This method uses the same Baum-Welch procedure as the isolated case, but rather than training each model individually, all models are trained in parallel. It works in the following way:

1. Allocate and initialize accumulators for all parameters of all HMMs.


2. For each training sequence:

(a) Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol transcriptions of the training utterance.

(b) Calculate the forward and backward probabilities (α and β) for the composite HMM. The formulae used here are slightly different from those used in the isolated case.

(c) Use the forward and backward probabilities to compute the probabilities of state occupation at each time frame and update the accumulators.

3. Use the accumulators to calculate new parameter estimates for all of the HMMs. For a complete listing of the embedded re-estimation formulae, refer to the Appendix.

3.2.2 Recognition

Recognition can be achieved by solving the evaluation problem. That is, given an observation sequence and a model, the probability that the observed sequence was generated by the model, P(O|λ), must be calculated. By evaluating the observation sequence against each model, the model with the highest probability is selected as the recognized sign.

The Viterbi algorithm provides an efficient means of evaluating a set of HMMs by taking only the maximum-likelihood path at each time step instead of all paths. It is a form of dynamic programming which finds the alignment of feature vectors against HMM states that maximizes the probability of the observations given the model. The HMMs corresponding to the individual signs can be chained together to form a word level HMM. Bayes' rule can be used to reverse the conditions, resulting in the probability of each word model given the observation sequence:

P(λi|O) = P(O|λi) P(λi) / P(O)    (3.26)

A value of 1 is used for P(O), since all signs are equally likely to occur. Consequently, the word for which P(λi|O) is maximal is chosen as the recognized word.
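With uniform priors, recognition thus reduces to picking the model with the highest score. A log-domain Viterbi sketch for scalar single-Gaussian models follows; the per-model parameters are illustrative placeholders, not the AFR's trained letter models:

```python
import numpy as np

def viterbi_score(O, pi, A, mu, var):
    """Log-probability of the best state path for a scalar-observation
    HMM with one Gaussian per state (max over paths, not the sum)."""
    def logb(o, j):
        return (-0.5 * (o - mu[j]) ** 2 / var[j]
                - 0.5 * np.log(2 * np.pi * var[j]))
    with np.errstate(divide='ignore'):          # log(0) -> -inf is fine
        logA, logpi = np.log(A), np.log(pi)
    N = len(pi)
    delta = logpi + np.array([logb(O[0], j) for j in range(N)])
    for t in range(1, len(O)):
        delta = np.array([np.max(delta + logA[:, j]) + logb(O[t], j)
                          for j in range(N)])
    return delta.max()

def recognize(O, models):
    """Choose the sign whose model best explains O (uniform priors)."""
    return max(models, key=lambda name: viterbi_score(O, *models[name]))
```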


Improving Continuous Recognition using a Grammar Network

In order to achieve recognition of a sign word, continuous recognition of letters is required. It is necessary to specify how the models for each individual signed letter are interconnected. There are two ways of achieving this. One is to use a recursive network (no grammar) in which every signed letter can be followed by any of the other letters.

Alternatively, a finite state grammar network can be specified. This defines a specific order in which the finger-spelling letters can occur. Previous works have shown that this approach can greatly improve continuous recognition of sign gestures. In theory, it would be possible to devise a generalized network based on linguistics. For example, it is a known fact that the letter 'E' is the most commonly occurring letter in English words, and consonants are more likely to be followed by vowels. However, this enhancement has not been implemented for this project.
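As a toy illustration of the idea, a letter-transition network can be built directly from a vocabulary: each word contributes its first letter as an entry point and its adjacent letter pairs as arcs. This flat successor-set simplification (mine, not the AFR's network format) also admits cross-word paths such as B-A-U-S, which a full word-level network would rule out:

```python
def build_grammar(words):
    """Build a finite-state letter network from a vocabulary: entry
    letters plus the within-word letter-to-letter transitions."""
    starts, arcs = set(), set()
    for w in words:
        starts.add(w[0])
        arcs.update(zip(w, w[1:]))      # adjacent letter pairs
    return starts, arcs

def accepts(seq, starts, arcs):
    """True if a letter sequence follows some path through the network."""
    return (seq[0] in starts
            and all(pair in arcs for pair in zip(seq, seq[1:])))

# Hypothetical two-word vocabulary, as in the text's BAD/AUS example.
starts, arcs = build_grammar(["BAD", "AUS"])
```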

A full description of the vocabulary and grammar networks used in the AFR system can be found in Section 4.3.

3.3 AFR Implementation

The AFR system was developed for the Windows environment and makes use of Intel's OpenCV library [13] and the Hidden Markov Model Toolkit (HTK 3.1) [1]. To provide real-time feedback, the system was developed in C++. The AFR has four main modules and a finger-spelling database, as shown in Figure 3.6. The functionalities of each module are as follows:

3.3.1 Capture Module

The capture module is responsible for grabbing image frames from video files (AVI format). It is a wrapper for the video capture functions available in the OpenCV library [13]. For future extensions of the application, it is also capable of grabbing frames directly from a compatible USB webcam.

3.3.2 Feature Extractor

The feature extractor makes use of the pre-implemented computer vision algorithms available in OpenCV [13]. It analyzes the images from the capture file and extracts the relevant features as described in Section 3.1.2. It writes these features to binary data files, which is the required format for the training module.

Figure 3.6: Four main modules of the AFR System. Numbers in the image are not part of the system and are for explanation purposes only.

3.3.3 Training Module

The training module is written as a Python script that was previously used by Holden et al. [11]. As the script was originally written for the UNIX platform, some minor modifications were necessary. The training module uses the HTK tools HINIT, HREST and HEREST to train the HMMs for each finger-spelling sign [1].

3.3.4 Recognition Module

The recognition module uses the Feature Extractor to parse input videos into HTK binary data files, which are then recognized using the HVITE tool [1]. The module's front end (Figure 3.7) is a dialog-based MFC (Microsoft Foundation Classes) application.

AFR User Interface Description

This section describes the numbered labels in Figure 3.7.

Figure 3.7: The AFR System

1. List Button

This is mainly for debugging purposes. It brings up a file open dialog from which the user can select a custom list of HMMs from the HMM database. It is also useful for future extensions of the system. For example, if the lexicon were expanded to include finger-spelt numbers, the user could specify the use of only number models, only alphabet models, or both.

2. Dictionary Button

This is also for debugging purposes. It allows the user to select a custom dictionary file. The dictionary file specifies the letter associated with each HMM.

3. Network Button

This button allows the user to select the grammar network to be used for recognition. This is an extremely useful functionality for future development of the system, as different networks can be chosen depending on the class of words the user wishes to recognize. For example, there could be a dictionary specifically for names of people and another for names of places.

4. Input File Select Button

This button brings up a file dialog box from which the user can select a video file (AVI format) containing a sign gesture to be recognized.

5. Play Button

The Play button starts the feature extraction process and plays the video to be recognized in the Video Display box (8).

6. Result Display

Once the video has been analyzed, the recognized signed word or letter is shown in this display field.

7. Frame Rate Display

Displays the rate at which the video frames are processed by the system.

8. Video Display

Displays the video being recognized.


CHAPTER 4

Experimental Results

This chapter is divided into four main sections. The first section describes the hardware setup used for data acquisition. The following section presents some results of the hand tracking process. The third section describes the experiments used to evaluate the accuracy of the system. Finally, a discussion of the experimental results is presented.

4.1 Data Acquisition

4.1.1 Camera Orientation

Two options commonly considered in the literature are either a top-down view [22] or a frontal view [11] [8]. Initially, a top-down view was decided on. However, upon careful consideration it was decided that a frontal view captures a more natural view of a person performing the sign gestures. The camera is positioned half a meter away from the signer.

4.1.2 Lighting

The task of differentiating skin pixels from those of the background is made considerably easier if lighting is carefully controlled. However, in the general case it is not reasonable to impose the requirement of special lighting equipment. It was decided that the system would be tested under normal room conditions; in this case, ceiling-mounted fluorescent lighting.

4.1.3 Background and other Restrictions

In order to simplify the hand tracking process it is required that the background colour differs as much as possible from that of the skin. For this work a whiteboard was found to be suitable. Additionally, the signer is required to wear a black long-sleeved shirt. In order to avoid any confusion between the hands and face, the camera is focused on the torso of the signer, as in Figure 3.7.

4.2 Hand Tracking Results

(a) Frame 1 (b) Frame 20 (c) Frame 30

(d) Frame 40

Figure 4.1: Sequence of signs for the letter ”B”

(a) Frame 1 (b) Frame 10 (c) Frame 20

Figure 4.2: Sequence of signs for the letter ”C”

Figures 4.1 to 4.3 show some results of the hand tracking process. From the hand tracking process, geometric features such as the angles of orientation, the major and minor axis lengths and the areas of the hand regions are obtained. These features are combined with the optical flow motion descriptor for each frame. An explanation and an example of the motion descriptor are presented in Section 3.1.2.

(a) Frame 1 (b) Frame 10 (c) Frame 20

Figure 4.3: Sequence of signs for the letter "X"

Figure 4.4 and Figure 4.5 respectively show the motion descriptor histograms for the first frame of the letters B and C. It can be observed that the descriptors are visually distinct.

(a) X Velocity Histogram (b) Y Velocity Histogram

Figure 4.4: Motion descriptor for Frame 1 of B

(a) X Velocity Histogram (b) Y Velocity Histogram

Figure 4.5: Motion descriptor for Frame 1 of C


Words: GOOD, BAD, YOU, ME, OK, NO, WORK, HOW, LAZY, HI, FAT

Names: PAUL, HOLDEN, JANE, JACK, AUS, SARAH, VELMA, TINA, MAX

Table 4.1: The AFR Vocabulary

Figure 4.6: The Recursive (no grammar) Network

4.3 Feature Classification

The system is intended to recognize a vocabulary of twenty words and names formed from the twenty-six Auslan finger-spelling signs (Table 4.1), as well as samples of isolated sign letters. The recognition capability of the system is investigated using the following two tests:

• Isolated sign recognition (Letters)

Each of the signed letters is tested with ten previously unseen video sequences, resulting in a total of 260 cases. These test videos were manually segmented from video sequences of the signer repeatedly performing each of the finger-spelling signs.

• Continuous Sign Recognition (Words)

The networks used for recognition are shown in Figures 4.6 and 4.7. The recursive network (Figure 4.6) allows any signed letter to occur after another.

The grammar network is constructed based on the set of words in Table 4.1. Figure 4.7 shows a grammar network which allows for the recognition of the sign words "BAD" and "AUS". The grammar network can easily be extended to incorporate the transitions for all 20 words. However, the complete network used in the final system is not shown here due to its complexity (it consists of 26 nodes and 106 arcs).

Figure 4.7: Example of a grammar network. The blue and red lines show the paths for the words "BAD" and "AUS" respectively

Each word in the vocabulary is tested with 4 previously unseen test video sequences, resulting in a total of 80 test cases. Tests are carried out using both the recursive and the grammar networks.

4.3.1 Experiment 1 - Isolated Training

The previously described tests are executed for HMMs that have been trained by isolated training. The HMMs are trained from a set of 15 samples for each finger-spelling sign. Video sequences of the signer performing each of the twenty vocabulary words were recorded. The training samples were then obtained by manually segmenting whole word sequences into individual letters using VirtualDub [16]. For example, from the video sequence for the word "BAD", samples for the signs "B", "A" and "D" are obtained.

4.3.2 Experiment 2 - Embedded Training

The purpose of this experiment is to investigate any change in performance as a result of employing embedded training. The models generated in Experiment 1 are refined by performing embedded training using 4 independent whole-word samples for each of the 20 words in the vocabulary. The same tests are repeated for this experiment.
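Conceptually, embedded training re-estimates the letter HMMs inside whole-word models formed by concatenating them, so that inter-letter transition frames are absorbed during re-estimation. The sketch below (illustrative Python/NumPy with invented transition values, not the HTK machinery actually used for this step) shows only the concatenation of two left-to-right transition matrices with HTK-style non-emitting entry and exit states.

```python
import numpy as np

def left_to_right(n_emit, stay=0.6):
    """Left-to-right HMM: non-emitting entry state 0 and exit state n+1."""
    A = np.zeros((n_emit + 2, n_emit + 2))
    A[0, 1] = 1.0                      # entry -> first emitting state
    for i in range(1, n_emit + 1):
        A[i, i] = stay                 # self-loop
        A[i, i + 1] = 1.0 - stay       # advance (last state -> exit)
    return A

def concatenate(A1, A2):
    """Chain two models: the exit of A1 feeds the entry of A2."""
    n1, n2 = A1.shape[0] - 2, A2.shape[0] - 2
    n = n1 + n2 + 2
    A = np.zeros((n, n))
    A[0, 1] = 1.0
    A[1:n1 + 1, 1:n1 + 1] = A1[1:n1 + 1, 1:n1 + 1]
    A[n1, n1 + 1] = A1[n1, n1 + 1]     # reroute exit prob into next letter
    A[n1 + 1:n - 1, n1 + 1:n - 1] = A2[1:n2 + 1, 1:n2 + 1]
    A[n - 2, n - 1] = A2[n2, n2 + 1]   # last letter's exit -> word exit
    return A

word = concatenate(left_to_right(3), left_to_right(3))
assert np.allclose(word[1:-1].sum(axis=1), 1.0)   # emitting rows still stochastic
```

Re-estimating against whole-word observation sequences through such a composite model is what lets the letter models absorb the transition frames between letters.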


4.3.3 Recognition Results

These are the results of the previously described experiments. Additionally, the recognition results of some specific word instances are presented.

Test type                    Letter Level    Word Level
Isolated Recognition         75.40%          -
Continuous (no grammar)      80.78%          24.06%
Continuous (with grammar)    90.75%          75.95%

Table 4.2: Experiment 1 Results - Isolated Training

Test type                    Letter Level    Word Level
Isolated Recognition         57.86%          -
Continuous (no grammar)      93.95%          59.49%
Continuous (with grammar)    97.15%          88.61%

Table 4.3: Experiment 2 Results - Embedded Training

Result of Applying the Grammar Network

Figure 4.8 shows several frames from the video sequence and the recognition results for the name "PAUL". The table shows the words recognized by the system when no grammar is used and when the grammar network is applied; "PAUL" is only successfully recognized when grammar is applied. When no grammar is applied, the system is prone to insertion errors (the letter J is inserted between P and A, and the letter O between U and L).
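Errors of this kind are typically counted by aligning the recognized letter string against the reference with a minimum-cost edit-distance alignment, from which an HTK-style accuracy (N − D − S − I)/N can be computed. The sketch below is a generic Python illustration of that bookkeeping, not this work's actual evaluation tooling.

```python
# Generic edit-distance alignment counting substitutions, deletions and
# insertions of a recognized letter string against the reference.

def align_errors(ref, hyp):
    """Return (S, D, I) for a minimum-cost alignment of hyp against ref."""
    # d[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    d = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        c = d[i - 1][0]
        d[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])          # all deletions
    for j in range(1, len(hyp) + 1):
        c = d[0][j - 1]
        d[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = int(ref[i - 1] != hyp[j - 1])
            cands = [
                (d[i - 1][j - 1][0] + sub, d[i - 1][j - 1], (sub, 0, 0)),
                (d[i - 1][j][0] + 1,       d[i - 1][j],     (0, 1, 0)),
                (d[i][j - 1][0] + 1,       d[i][j - 1],     (0, 0, 1)),
            ]
            cost, prev, (s, dl, ins) = min(cands, key=lambda x: x[0])
            d[i][j] = (cost, prev[1] + s, prev[2] + dl, prev[3] + ins)
    return d[len(ref)][len(hyp)][1:]

S, D, I = align_errors("PAUL", "PJAUOL")
N = len("PAUL")
print(S, D, I, (N - D - S - I) / N)   # 0 0 2 0.5
```

Applied to the example above, "PJAUOL" against the reference "PAUL" yields two insertions and no substitutions or deletions.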

Result of Using Embedded Training

There are still instances where the AFR fails to recognize a sign accurately, even when a grammar network is used. Figure 4.9 shows the video sequence and the recognition results for the name "HOLDEN". The table shows the difference in recognition produced by models trained by isolated training and models trained by embedded training. In the case of isolated training, confusion arises as the transition from letter O to O is a valid hypothesis from the full grammar network.


(a) Frame 12 (b) Frame 63 (c) Frame 94 (d) Frame 142

Recognized as:
No Grammar    PJAUOL
Grammar       PAUL

Figure 4.8: Sequence of signs for the name "PAUL"

(a) Frame 40 (b) Frame 82 (c) Frame 129 (d) Frame 179

(e) Frame 233 (f) Frame 287

Recognized as:
No Grammar    HOOLADEN
Grammar       HOLDEN

Figure 4.9: Sequence of signs for the name "HOLDEN"

Failure Case

In some cases, the AFR system could not recognize a sign word accurately, even with the use of grammar and embedded training. This was observed for the name "JANE", as shown in Figure 4.10. For all configurations, the name was wrongly recognized as "JANO". This can be attributed to the fact that the letters "O" and "E" are formed using very similar gestures.


(a) Frame 40 (b) Frame 82 (c) Frame 129 (d) Frame 179

Figure 4.10: Sequence of signs for the name "JANE"

4.4 Discussion

The results of the experiments, as shown in Table 4.2 and Table 4.3, identify some problems with the proposed approach. The system does not have problems differentiating between the finger-spelling signs which are visually distinct. However, the problem with Auslan finger-spelling is that many of the signs have a similar appearance and are formed using similar motions. Consequently, most of the failures arose from misclassification of the signs for E, I, O, U, L, M, V and N. This problem arises due to the coarse-grained features used to model the gestures.

(a) L (b) M (c) R

Figure 4.11: Silhouettes of the hand regions for the signs "L", "M" and "R"

As illustrated in Figure 4.11, the silhouettes of the hand regions for the signs L, M and R are all very similar. The system cannot differentiate between these signs because the hand model ignores the problem of occlusion. This problem could be addressed by explicitly segmenting the occluded hands to extract the shape of the right hand, by adopting the technique of Holden et al. [11]. A more detailed feature vector which incorporates finger information would also be better able to model these minimal pairs of signs.

Using embedded training to train the HMMs gives a significant boost in accuracy for continuous recognition, at the cost of accuracy in isolated recognition. The reason for this is that embedded training updates the model parameters to best describe whole word sequences by incorporating transition observations. As a consequence, isolated signs, which do not inherently possess transitional motions, are less accurately recognized.

Accuracy is consistently low when no grammar is used. This is expected, as the system is allowed to match the observation vectors with any of the 26 model HMMs. Hence, it is more prone to deletion, insertion and substitution errors. The use of the strong grammar network helps to address this problem. However, some problems are still noticeable. The name "JANE" was consistently misclassified as "JANO". This is a valid hypothesis considering the signs for "O" and "E" are similar, and confusion arises due to the path corresponding to the word "NO".


CHAPTER 5

Conclusion and Future Development

In this paper, an HMM-based, signer-dependent, continuous Auslan finger-spelling recognition system has been described. The system uses a single USB camera for image recording. Real-time signer localization and feature extraction are facilitated using a simple skin colour detection algorithm. In my approach, each signed letter is modeled using a single HMM.

The experiments undertaken show that the system is capable of recognizing isolated finger-spelling words. It achieved a 97% recognition rate at the letter level and 88% at the word level. These results are promising considering the relatively limited number of examples used to train the HMMs. It is expected that the system would show much higher accuracy if it were trained with more examples. Through the development of a prototype application, this thesis has also demonstrated the potential of using Intel's OpenCV computer vision library to support the development of real-time computer vision systems.

The system still has some significant limitations. I have not addressed the issue of signer dependence in this paper, as the system has been trained and tested by a single person only. The system's performance when attempting to recognize signs from a previously unseen person has not been investigated.

To work towards a practical application, it would be necessary to address the unrealistic restrictions on the background, lighting and the signer's clothing. The system has been tested under a very controlled environment; in reality, the system should be robust to varying conditions. Addressing these issues is an important matter and could involve whole research projects in themselves.

In conclusion, this research provides the foundation for future work, which may involve the development of an application capable of on-the-fly recognition from live camera input. HTK 3.1 [1] supports features for live audio recognition. It is expected that, given sufficient time, the libraries could be modified to support the same feature for live video input.


APPENDIX A

Listing of HMM Algorithms

A.1 Notation used in this section

N — number of states
T — number of observations
N_q — number of states in the q-th model in a training sequence
O — a sequence of observations
o_t — the observation at time t, 1 ≤ t ≤ T
a_{ij} — the probability of a transition from state i to state j
\mu_{jm} — mean vector for mixture component m of state j
\Sigma_{jm} — covariance matrix for mixture component m of state j
\lambda — the set of all parameters defining a HMM

A.2 Baum-Welch Algorithm

1. Perform Viterbi training to obtain initial estimates of the parameters and the observation-to-state alignment.

2. For every parameter vector requiring re-estimation, allocate storage for numerator and denominator accumulators.

3. For each training observation O^r, 1 ≤ r ≤ R, calculate the forward and backward probabilities, α and β.

Formula for α:

\alpha_j(t) = \Big[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \Big] b_j(o_t)    (A.1)

with base cases:

\alpha_1(1) = 1    (A.2)

\alpha_j(1) = a_{1j}\, b_j(o_1), \quad 1 < j < N    (A.3)

and termination:

\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}    (A.4)

Formula for β:

\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)    (A.5)

with base cases:

\beta_i(T) = a_{iN}, \quad 1 < i < N    (A.6)

\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1)    (A.7)

Now the forward or backward probabilities can be used to calculate the total probability, P(O|λ):

P = P(O \mid \lambda) = \alpha_N(T) = \beta_1(1)    (A.8)

4. Use the final accumulator values to calculate new parameter values.

State transition probabilities, a

a_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r - 1} \alpha_i^r(t)\, a_{ij}\, b_j(o_{t+1}^r)\, \beta_j^r(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha_i^r(t)\, \beta_i^r(t)}    (A.9)

where 1 < i < N and 1 < j < N.

a_{1j} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{P_r}\, \alpha_j^r(1)\, \beta_j^r(1)    (A.10)

where 1 < j < N.

a_{iN} = \frac{\sum_{r=1}^{R} \frac{1}{P_r}\, \alpha_i^r(T_r)\, \beta_i^r(T_r)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha_i^r(t)\, \beta_i^r(t)}    (A.11)

where 1 < i < N.

Mean vector

\mu_j = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L_j^r(t)\, o_t^r}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L_j^r(t)}    (A.12)

Covariance matrix

\Sigma_j = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L_j^r(t)\, (o_t^r - \mu_j)(o_t^r - \mu_j)^{\mathsf{T}}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L_j^r(t)}    (A.13)

where

L_j^r(t) = \frac{1}{P_r}\, \alpha_j^r(t)\, \beta_j^r(t)    (A.14)

5. If the value P = P(O|λ) for this iteration is not higher than the value from the previous iteration, then stop; otherwise repeat the process using the re-estimated parameter values.
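For concreteness, the forward recursion of step 3 can be sketched in code. The example below is an illustrative Python/NumPy transcription of the α recursion under the HTK convention used above (state 1 is a non-emitting entry state, state N a non-emitting exit state; arrays here are 0-based, so those are indices 0 and N−1). The toy model values are invented, and this is not the HTK implementation actually used in this work.

```python
import numpy as np

def forward(A, B):
    """Return the total probability P(O|lambda) = alpha_N(T).

    A: (N, N) transition matrix. B: (T, N) with B[t, j] = b_j(o_t)
    for emitting states j; the entry/exit columns of B are unused.
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0, 1:N-1] = A[0, 1:N-1] * B[0, 1:N-1]        # base case (A.3)
    for t in range(1, T):
        for j in range(1, N - 1):
            # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_j(o_t)   (A.1)
            alpha[t, j] = (alpha[t-1, 1:N-1] @ A[1:N-1, j]) * B[t, j]
    # termination into the exit state (A.4)
    return float(alpha[T-1, 1:N-1] @ A[1:N-1, N-1])

# Invented toy model: two emitting states, three observation frames.
A = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.5, 0.4, 0.1],
              [0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.0, 0.0]])
B = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.0, 0.2, 0.8, 0.0],
              [0.0, 0.3, 0.6, 0.0]])
p = forward(A, B)
```

In practice the recursion is computed in log space (or with scaling) to avoid underflow on long observation sequences.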

A.3 Viterbi Algorithm

Initialization

\delta_1(i) = \pi_i\, b_i(O_1)    (A.15)

\phi_1(i) = 0    (A.16)

Recursion

\delta_t(j) = \max_i \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(O_t)    (A.17)

\phi_t(j) = \arg\max_i \big[ \delta_{t-1}(i)\, a_{ij} \big]    (A.18)

Termination

P = \max_{s \in S_f} \big[ \delta_T(s) \big]    (A.19)

s_T = \arg\max_{s \in S_f} \big[ \delta_T(s) \big]    (A.20)

Recovering the state sequence

From t = T − 1 down to 1:

s_t = \phi_{t+1}(s_{t+1})    (A.21)
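The δ/φ recursion and backtrace map almost line-for-line onto code. The sketch below is illustrative Python/NumPy, using an explicit initial distribution π over all states and taking S_f to be the full state set; the two-state toy model at the end is invented for demonstration.

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely state path, following equations A.15-A.21.

    pi: initial state probabilities; A: transition matrix;
    B[t, i] = b_i(O_t).
    """
    T, N = B.shape
    delta = np.zeros((T, N))
    phi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[0]                          # A.15 (phi_1 = 0, A.16)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) a_ij
        phi[t] = scores.argmax(axis=0)            # A.18
        delta[t] = scores.max(axis=0) * B[t]      # A.17
    best = int(delta[T - 1].argmax())             # A.20, with S_f = all states
    path = [best]
    for t in range(T - 1, 0, -1):
        path.append(int(phi[t][path[-1]]))        # A.21 backtrace
    return float(delta[T - 1].max()), path[::-1]  # A.19

# Invented two-state toy model for demonstration:
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.2],
              [0.1, 0.8],
              [0.1, 0.8]])
p, path = viterbi(pi, A, B)
print(path)   # [0, 1, 1]
```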


APPENDIX B

Original Honours Proposal

Title: Automatic Recognition of Auslan Finger-spelling using Hidden Markov Models

Author: Paul Goh

Supervisor: Dr. Eun-Jung Holden

Background

The automatic recognition of hand gestures has been an active area of research in recent years, as it has wide applications in human-computer interaction as well as the recognition of sign language. The native sign language of the Australian deaf community is Auslan (Australian Sign Language). The proposed research will be specifically concerned with recognizing the subset of Auslan gestures used for finger-spelling. Finger-spelling is a manual representation of English where each letter of a word is signed. In Auslan, finger-spelling is used to express words that cannot be found in the vocabulary. For example, one would use finger-spelling for names of people and places.

The main motivation for research into sign language recognition is in bridging the communication barrier between deaf individuals and the hearing community who do not understand sign language. The proposed finger-spelling recognition system is another step towards the automatic translation between Auslan and English.

There has been much prior research into the area of sign language recognition. One of the most cited works is Starner and Pentland's [22] research into the visual recognition of American Sign Language (ASL). Initially, coloured gloves were used to identify and track the left and right hands. However, the gloves were phased out in later development [23], and the hands were tracked based on skin colour and location. Global features representing positions, angle of the axis of least inertia and eccentricity of the bounding ellipse were extracted from the gesture images. Their Hidden Markov Model (hereafter referred to as HMM) based recognition process achieved an accuracy rating of 99.2% for 99 test sequences.
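Moment-based shape features of this kind are straightforward to compute. The sketch below is a generic Python/NumPy illustration (not the exact feature set of [22], nor of this thesis) of deriving the centroid, the angle of the axis of least inertia and an eccentricity measure from the second-order central moments of a binary silhouette.

```python
import numpy as np

def shape_features(mask):
    """Centroid, axis-of-least-inertia angle and eccentricity of a mask."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    mu20 = ((xs - cx) ** 2).mean()          # second-order central moments
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)   # axis of least inertia
    common = np.hypot(2 * mu11, mu20 - mu02)
    lam1 = (mu20 + mu02 + common) / 2       # variance along the major axis
    lam2 = (mu20 + mu02 - common) / 2       # variance along the minor axis
    ecc = np.sqrt(1 - lam2 / lam1)          # 0 for a circle, near 1 if elongated
    return (cx, cy), theta, ecc

# A thin horizontal bar: highly eccentric, axis angle near 0.
mask = np.zeros((10, 30), dtype=bool)
mask[4:6, :] = True
(cx, cy), theta, ecc = shape_features(mask)
```

Features of this sort are attractive for real-time use because they are computed in a single pass over the silhouette pixels.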

Similarly, Bauer and Hienz [2] used an HMM approach for automatic recognition of German Sign Language (GSL). The following features were extracted by the system: the position of both hands relative to the body axis and the height of the right shoulder; the distances between fingers; the distance between the palm and back of the dominant hand; the distance between both hands; the size of the fingers; the size of the palm and back of the dominant hand; the size of the non-dominant hand; and the angle of all fingers relative to the palm. Their tests reported an accuracy of 94% for a lexicon of 52 signs. A higher accuracy rating was achieved when language modeling (the use of a priori knowledge regarding the ordering and frequency of occurrence of signs) was employed. Both the Bauer and Starner systems recognized gestures in real-time.

Recent work on vision-based recognition of colloquial Auslan has been carried out by Holden, Lee and Owens [11]. By using a feature set which is invariant to scaling, 2D rotations and signing speed as input to an HMM-based recognition system, they were able to achieve an accuracy of 97% at the sentence level and as high as 99% at the word level. However, the major drawback of this system is that recognition is not executed in real-time.

Also, fairly recently, Bowden et al. [4] proposed a novel two-stage classification procedure for sign language recognition. By using high-level linguistic descriptors as well as HMMs to model temporal transitions of individual signs, a classification rate as high as 97.6% was recorded.

As an alternative to HMMs, Ray Lockton [18] of Oxford University proposed and implemented a hand-shape based American Sign Language (ASL) finger-spelling recognition system for his final year thesis. His approach, which used fast template matching, was able to achieve an accuracy rating of 99.1% for a lexicon of 46 single-hand gestures.

Aim

The main goal of my research is to implement a real-time, vision-based Auslan finger-spelling recognition system capable of recognizing the twenty-six Auslan finger-spelling gestures.

As Auslan finger-spelling can be seen as a dynamic stochastic process, it is hypothesized that an HMM approach would be appropriate for the recognition process. Previous works which have employed HMMs in a real-time system demonstrated high recognition accuracy [22]. My research intends to examine the effectiveness of an HMM approach within the context of Auslan finger-spelling recognition.


Method

The proposed system will use a single USB web camera to capture streaming video of a person performing the finger-spelling gestures. The camera should be mounted in a top-down position, over a coloured work-space on a desk.

Development of the proposed system will be divided into three main phases:

1. Hand Tracking and Segmentation: The tracking process will consist of skin colour detection to locate the signer's hands over the background. The problem of occlusion of either hand by the other will not be addressed in this project. When occlusion occurs, I will handle the two hands as a single object as opposed to two different hands. It has been shown that appropriate features, combined with the time context provided by Hidden Markov Models, are sufficient to distinguish between many different signs where hand occlusion occurs [23].

2. Feature Extraction: A set of effective features will be devised to represent hand shapes. Geometrical properties of the hand shapes as well as their topological properties will be examined to find a set of features that can be used for real-time processing and recognition. A particular avenue of interest is investigating the polar-space representation of hand shapes.

3. Training of HMMs for the Recognition Process: The extracted features are then recognized using a set of continuous density HMMs. Each alphabet letter will be modeled as an HMM, and sequences of these letters will be recognized. Entropic's Hidden Markov Toolkit (HTK 3.0) [1] will be used to implement the HMM recognizer.
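As an illustration of the kind of pixel-wise classifier the skin colour detection step could use, the sketch below applies a simple RGB threshold rule of the sort surveyed in the skin-detection literature (cf. [21]). The rule and its thresholds are assumptions for illustration only, not the method this project commits to.

```python
# Purely illustrative RGB rule of thumb for skin pixels under roughly
# uniform daylight illumination; thresholds are assumed, not tuned here.

def is_skin_rgb(r, g, b):
    return (r > 95 and g > 40 and b > 20 and
            max(r, g, b) - min(r, g, b) > 15 and
            abs(r - g) > 15 and r > g and r > b)

# A hand-region mask would then be the set of pixels satisfying the rule.
print(is_skin_rgb(220, 150, 120))   # True  (plausible skin tone)
print(is_skin_rgb(40, 180, 40))     # False (e.g. green background)
```

In the actual system such a test would run per pixel on each camera frame (e.g. via OpenCV), followed by connected-component analysis to isolate the hand regions.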

Plan

The following is an initial outline of tasks to be carried out throughout my research. Notes for the dissertation are to be collected through the course of each phase.

Task                                             Approximate Dates
Research and preparation of Proposal             February - March
Implement and test video capture module          March - April
Implement and test feature extraction module     April - May
Implement and test recognition module            June - July
Application front end development and testing    July - August
Compile results for draft dissertation           August - September

Software and Hardware Requirements

The main hardware requirement for this project is a USB webcam. In order to achieve real-time recognition, the system will be developed using C/C++. Additionally, Intel's Open Source Computer Vision (OpenCV) Library [13] will be utilized for image processing. Entropic's Hidden Markov Toolkit (HTK) [1] will be required for the implementation of the recognition module. An x86 PC capable of running the necessary software is also required.


Bibliography

[1] Hidden Markov Toolkit (HTK 3.1). [online] Available: http://htk.eng.cam.ac.uk/.

[2] Bauer, B., Hienz, H., and Kraiss, K. Video-based continuous sign language recognition using statistical methods. In IEEE International Conference on Pattern Recognition (2000).

[3] Birk, H., Moeslund, T. B., and Madsen, C. B. Real-time recognition of hand alphabet gestures using principal component analysis. In Proceedings of Scandinavian Conference on Image Analysis (1997), pp. 261–268.

[4] Bowden, R., Windridge, D., Kadir, T., Zisserman, A., and Brady, M. A linguistic feature vector for the visual interpretation of sign language. In The 8th European Conference on Computer Vision (2004), pp. 391–401.

[5] Chai, D., and Bouzerdom, A. A Bayesian approach to skin colour classification. In IEEE Region Ten Conference (2000).

[6] Cutler, R., and Turk, M. View-based interpretation of real-time optical flow for gesture recognition. In Proceedings of IEEE Conference on Face and Gesture Recognition (1998).

[7] Efros, A. A. Recognizing action at a distance. In Proceedings of Ninth IEEE International Conference on Computer Vision (2003).

[8] Grobel, K., and Assan, M. Video-based sign language recognition using hidden Markov models. In Proceedings of Gesture Workshop (1997), pp. 97–109.

[9] Hasanuzzaman, M., Zhang, T., Amporanamveth, V., Bhuiyan, M., Shirai, Y., and Ueno, H. Gesture recognition for human-robot interaction through a knowledge-based software platform. In International Conference on Image Analysis and Recognition (2004), vol. 3211, pp. 530–537.

[10] Holden, E.-J. Visual Recognition of Hand Motion. Ph.D. thesis, The University of Western Australia, 1997.

[11] Holden, E.-J., Lee, G., and Owens, R. Automatic recognition of colloquial Auslan. In Proceedings of IEEE Workshop on Motion and Video Computing (1999).

[12] Horn, B. K. B., and Schunck, B. G. Determining optical flow. Artificial Intelligence 17 (1981), 185–203.


[13] Intel. Intel's Open Source Computer Vision Library (OpenCV). http://www.intel.com/research/mrl/research/opencv, March 2005.

[14] Kalafatic, Z., Ribaric, S., and Stanisavljevic. A system for tracking laboratory animals based on optical flow and active contours. In 11th International Conference on Image Analysis and Processing (ICIAP'01) (2001).

[15] Lamar, M. V., Bhuiyan, M. S., and Iwata, A. Hand alphabet recognition using morphological PCA and neural networks. In Proceedings of IEEE SMC 99 (1999).

[16] Lee, A. VirtualDub 1.5.1. [online] Available: http://virtualdub.sourceforge.net, April 2005.

[17] Lim, C., and Habili, N. Hand and face segmentation using motion and colour cues in digital image sequences. In IEEE International Conference on Multimedia (2001).

[18] Lockton, R., and Fitzgibbon, A. W. Real-time gesture recognition using deterministic boosting. In British Machine Vision Conference '02 (2002).

[19] Lucas, B. D., and Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of Imaging Understanding Workshop (1981).

[20] Ngan, K. N., and Chai, D. Face segmentation using skin-color map in videophone applications. IEEE Transactions on Circuits and Systems for Video Technology 9 (1999), 551–564.

[21] Sazonov, V., Vezhnevets, V., and Andreeva, A. A survey on pixel-based skin colour detection techniques. In Proceedings of Graphicon (2003), pp. 85–92.

[22] Starner, T., and Pentland, A. Real-time American Sign Language recognition from video using hidden Markov models. In SCV95 (1995), p. 5B Systems and Applications.

[23] Starner, T., Weaver, J., and Pentland, A. A wearable computer-based American Sign Language recognizer. Lecture Notes in Computer Science 1458 (1998).

[24] Vamplew, P., and Adams, A. Recognition and anticipation of hand motions using a recurrent neural network. In IEEE International Conference on Neural Networks (1995).

[25] Vogler, C., and Metaxas, D. Adapting hidden Markov models for ASL recognition by using three-dimensional computer vision methods. In SMC'97 (1997).

[26] Waleed, K. GRASP: Recognition of Australian Sign Language using Instrumented Gloves. Ph.D. thesis, The University of New South Wales, 1995.


[27] Yeates, S., Holden, E.-J., and Owens, R. An animated Auslan tuition system. In International Journal of Machine Graphics and Vision (2003), vol. 12, pp. 203–214.

[28] Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., and Woodland, P. The HTK Book (for HTK Version 3.1). Cambridge University Engineering Department, 2000.
