hand gesture recognition using haar-wavelet

HAND GESTURE RECOGNITION

SYSTEM USING HAAR WAVELET

A PROJECT REPORT

Submitted by

J.JENKIN WINSTON (Reg. No: 96207106036)

M.MARIA GNANAM (Reg.No:96207106056)

R.RAMASAMY (Reg.No:96207106306)

in partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING

in

ELECTRONICS AND COMMUNICATION ENGINEERING

NATIONAL ENGINEERING COLLEGE, KOVILPATTI

ANNA UNIVERSITY OF TECHNOLOGY,

TIRUNELVELI - 627 007.

March 2011

ANNA UNIVERSITY OF TECHNOLOGY, TIRUNELVELI

BONAFIDE CERTIFICATE

Certified that this project report titled “HAND GESTURE RECOGNITION

SYSTEM USING HAAR WAVELET” is the bonafide work of J.JENKIN

WINSTON (96207106036), M.MARIA GNANAM (96207106056),

R.RAMASAMY (96207106306) who carried out the project work under my

supervision.

Signature                                      Signature
Head of the Department                         Supervisor
Dr.V.Vijayarangan, B.E., M.Sc.,(Engg),Ph.D.,   Mr.M.Sundaram, M.E.,
Department of ECE                              Sr.Lecturer/ECE,
National Engineering College,                  National Engineering College,
Kovilpatti - 628 503                           Kovilpatti - 628 503

Submitted for Viva-Voce Examination held at NATIONAL

ENGINEERING COLLEGE, Kovilpatti on ____________

Internal Examiner                              External Examiner

ACKNOWLEDGEMENT

First and foremost we express our wholehearted gratitude to the Almighty

for having given us the wisdom and courage to take up this project.

We wish to express our sincere thanks to the Director

Dr.Kn.K.S.K.Chockalingam, B.E., M.Sc(Engg)., Ph.D., who helped us in

carrying out our project successfully.

We would like to express our sincere thanks to our former Principal

Dr.N.S.Marimuthu, B.E., M.Sc(Engg)., Ph.D., for providing us this

opportunity to do this project.

Our heartfelt acknowledgement goes to the Professor and Head of the

Department of Electronics and Communication Engineering,

Dr.V.Vijayarangan, B.E., M.Sc(Engg)., Ph.D., for his valuable and consistent

encouragement for carrying out the project.

Our gratitude is no less for our project coordinators Mr.N.Arumugam,

M.E., Assistant Professor and Mrs.S.D.Jayavathi, M.E., Senior Lecturer, in the

Department of Electronics and Communication for their encouragement.

We express our deepest gratitude to our guide Mr.M.Sundaram, M.E.,

Senior lecturer, in the Department of Electronics and Communication

Engineering, for rendering excellent guidance and for being extremely kind and

approachable in nature, being a great source of support and encouragement

throughout the course of the project work.

We hereby acknowledge the efforts of all staff members, technicians of

Electronics and Communication Engineering Department, whose help was

instrumental in the completion of our project.

Also we would like to express our hearty thanks to our beloved parents and

dear friends for their valuable suggestions and cooperation for the project.

ABSTRACT

With the growth of technology and the arrival of the digital age, the
difficulties faced by people with disabilities can be addressed in new ways.
The sign language that speech-impaired people use for communication is not
understood by everyone, which isolates them from the speaking community. We
therefore aim to provide an effective means of communication for
speech-impaired people by building a gesture recognition system based on image
processing. Our algorithm is developed in Matlab to recognize static hand
gestures, namely a subset of American Sign Language (ASL). It is fairly robust
to background clutter and uses skin color for hand gesture tracking and
recognition. In this project we have reduced the database size by normalizing
the orientation of hands using the idea of the principal axis. We use a
correlation factor to improve the degree of recognition. Every human hand has a
different geometry. To compensate for this, we use a transform that converts an
image into a feature vector, which is then compared with the feature vectors of
a training set of gestures. Improving on all these features decreases the
computation time. This method is very compact and handy compared to other hand
gesture recognition systems.

TABLE OF CONTENTS

CHAPTER NO    TITLE

ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS

1   INTRODUCTION
    1.1 Prelude
    1.2 Need for sign language
    1.3 American sign language
    1.4 Gesture recognition
        1.4.1 Gesture recognition and pen computing
        1.4.2 Gesture types
        1.4.3 Uses

2   LITERATURE SURVEY

3   BACKGROUND
    3.1 Existing system
    3.2 Problem statement

4   METHODOLOGY
    4.1 Image capturing devices
        4.1.1 Challenges
    4.2 Significance of grayscale images
        4.2.1 Grayscale as single channel of multichannel color images
    4.3 Hand segmentation
        4.3.1 Threshold selection
        4.3.2 Adaptive thresholding
        4.3.3 Multiband thresholding
    4.4 Morphological operation
        4.4.1 Structuring element
        4.4.2 Image closing
        4.4.3 Effect of image closing
    4.5 Image registration
        4.5.1 Algorithm classifications
            4.5.1.1 Intensity based vs feature based
            4.5.1.2 Spatial vs Frequency domain methods
            4.5.1.3 Single vs Multi-modality methods
            4.5.1.4 Automatic vs Interactive methods
        4.5.2 Uncertainty
        4.5.3 Transformation methods
        4.5.4 Radon transform
    4.6 Feature Extraction
        4.6.1 Wavelets
        4.6.2 Wavelet transform
        4.6.3 The discrete wavelet transform
        4.6.4 2D-Discrete wavelet transform

5   OVERVIEW OF THE PROJECT
    5.1 An overlay of our algorithm
    5.2 Proposed work

6   SOFTWARE DESCRIPTION
    6.1 Introduction
    6.2 Features of Matlab
        6.2.1 Command window
        6.2.2 Graphics window
        6.2.3 Edit window
        6.2.4 Input output
        6.2.5 Data type
        6.2.6 Dimensioning
        6.2.7 Case sensitivity
    6.3 Images in Matlab
    6.4 File types
        6.4.1 M-files
        6.4.2 Script files
        6.4.3 Function files
        6.4.4 MAT-files

7   SIMULATION RESULT

8   CONCLUSION

9   REFERENCES

LIST OF FIGURES

FIGURE NO.    TITLE

1.1           ASL examples
4.1           A model gray image
4.2           Three channels of a RGB image
4.3           Original image
4.4           Example of a threshold effect used on an image
4.5           Structuring element
4.6           Effect of closing using 3x3 square structuring element
4.7           Multi-resolution expansion using Haar wavelet
5.1           Overlay of our algorithm
7.1           Resized image
7.2           Gray scale image
7.3           Segmented hand
7.4           Morphologically operated image
7.5           Normalized image
7.6           Horizontal vector of DWT

LIST OF ABBREVIATIONS

ASL       American Sign Language
CAD       Computer Aided Design
PUI       Perceptual User Interface
GUI       Graphical User Interface
HMI       Human Machine Interface
MRI       Magnetic Resonance Imaging
CT        Computed Tomography
PET       Positron Emission Tomography
DWT       Discrete Wavelet Transform
STFT      Short Time Fourier Transform
MATLAB    Matrix Laboratory

CHAPTER 1

INTRODUCTION

1.1. PRELUDE

Existing common computer input devices are adequate for many tasks. Now that
computers are so tightly integrated with everyday life, however, new
applications and hardware are constantly being introduced. The means of
communicating with computers are at the moment limited to keyboards, mice,
light pens, trackballs, keypads and so on. These devices have grown familiar,
but they inherently limit the speed and naturalness with which we interact with
the computer.

Recently, there has been a surge in interest in recognizing human hand

gestures. Hand gesture recognition has various applications such as computer games,

machinery control (e.g. cranes), and complete mouse replacement. One of the

most structured sets of gestures belongs to sign language. In sign language, each

gesture has an assigned meaning (or meanings).

Computer recognition of hand gestures may provide a more natural
human-computer interface, allowing people to point, or to rotate a CAD model by
rotating their hands. Hand gestures can be classified into two categories:
static and dynamic. A static gesture is a particular hand configuration and
pose, represented by a single image. A dynamic gesture is a moving gesture,
represented by a sequence of images. We focus on the recognition of static
gestures.

The reliance on sign language in speech-impaired communities results in
linguistic isolation from the general community, because the overwhelming
majority of hearing people do not understand sign language. Many approaches to
effective man-machine communication have been proposed, such as voice, face,
iris and retinal scans and gesture recognition systems. Gesture recognition,
along with facial recognition, voice recognition, eye tracking and lip movement
recognition, is a component of what developers refer to as a perceptual user
interface (PUI). The goal of a PUI is to enhance the efficiency and ease of use
of the underlying logical design of a stored program, a design discipline known
as usability. In personal computing, gestures are most often used for input
commands.

Compared with face and voice features, hands require less complexity in terms
of imaging conditions. Consequently, hand-based recognition is friendlier, less
prone to disturbances and more robust to environmental conditions.

Our goal is to offer a sign recognition system as another means of augmenting
communication between speech-impaired people and the speaking community. This
wearable system would capture and recognize the user’s signing. The user could
then cue the system to generate text or speech. Recognizing gestures as input
makes computers more accessible to the physically impaired and makes
interaction more natural in a 3D virtual world environment. Hand and body
gestures can be amplified by a controller that contains accelerometers and
gyroscopes to sense tilting, rotation and acceleration of movement, or the
computing device can be outfitted with a camera so that software in the device
can recognize and interpret specific gestures. The conventional methods used in
hand gesture recognition systems are glove-based techniques, with embedded
accelerometers and multiple sensors, and computer vision based techniques.

The use of accelerometers demands additional hardware components and a power
supply. One hand gesture recognition system, using a sensing glove with 6
embedded accelerometers, recognizes 28 static hand gestures at a rate of
1 character/second. However, this algorithm is not efficient enough to be
applied in real time. Another recognition system, using colored gloves and a
neural network algorithm, has also been introduced, but its success rate ranges
from 70% to 93%. Although these systems can recognize hand gestures, wearing a
sensory glove is not convenient for daily use.

In computer vision based techniques, one or more cameras are used to capture
hand images for recognition, without restricting backgrounds or using any
markers. One such method first separates the hand gesture region from complex
background images by measuring the entropy of adjacent image frames. A hand
gesture is then recognized by an improved centroidal profile approach. However,
misrecognitions can be caused by hand gestures with similar spatial features,
so the number of hand gestures that this algorithm can recognize is limited. To
be effective, a vision system should be glove-free, fast and accurate, and
should need only a small database. Moreover, the use of computer vision based
techniques increases the complexity of image recognition.

In addition to the technical challenges of implementing gesture

recognition, there are also social challenges. Gestures must be simple, intuitive

and universally acceptable. The study of gestures and other nonverbal types of

communication is known as kinesics.

The key problem in gesture interaction is how to make hand gestures understood
by computers. The existing approaches can be mainly divided into “Data-Glove
based” and “Vision Based” approaches. The Data-Glove based methods use sensor
devices to digitize hand and finger motions into multi-parametric data. The
extra sensors make it easy to collect hand configuration and movement. However,
the devices are quite expensive and cumbersome for users. In contrast, the
Vision Based methods require only a camera, thus realizing a natural
interaction between humans and computers without the use of any extra devices.
These systems tend to complement biological vision by describing artificial
vision systems that are implemented in software and hardware. This poses a
challenging problem, as these systems need to be background invariant, lighting
insensitive, and person and camera independent to achieve real-time
performance. Moreover, such systems must be optimized to meet the requirements,
including accuracy and robustness.

In this project, a new approach to real-time hand gesture recognition is
developed. It is a recognition algorithm based on the Haar wavelet
representation. Hands are extracted by a skin color approach rather than by
user input. The problem of hand orientation in the image is solved by using the
idea of the axis of elongation, which helps keep the database small by
standardizing the hand gestures in fixed orientations. We then introduce a new
approach to disperse hand features in the image, which improves the success
rate.

1.2. NEED FOR SIGN LANGUAGE

Creating a proper sign language dictionary (for ASL, American Sign Language,
in this case) is not the desired result at this point. That would require the
system to understand advanced grammar and syntax structure, which is outside
the scope of this project. American Sign Language is used as the database since
it is a tightly structured set. Further applications can be built from that
point. In the distant (or near) future, computer interfaces could keep the
usual input devices and, in conjunction with gesture recognition, perceive some
of the user’s feelings as well.

Taking ASL recognition further, a full real-time dictionary could be created
with the use of video. As mentioned before, this would require some Artificial
Intelligence for grammar and syntax purposes.

Another application is huge database annotation. It is far more efficient

when properly executed by a computer, than by a human.

1.3. AMERICAN SIGN LANGUAGE

American Sign Language is the language of choice for most deaf people in the
United States. It is part of the “deaf culture” and includes its own system of
puns, inside jokes, etc. However, ASL is only one of the many sign languages of
the world. Just as an English speaker would have trouble understanding someone
speaking Japanese, a speaker of ASL would have trouble understanding the Sign
Language of Sweden. ASL also has its own grammar, which is different from that
of English. ASL consists of approximately 6000 gestures for common words, with
finger spelling used to communicate obscure words or proper nouns. Finger
spelling uses one hand and 26 gestures to communicate the 26 letters of the
alphabet. Some of the signs can be seen in Fig 1.1.

Fig 1.1 ASL examples

Another interesting characteristic that will be ignored by this project is the

ability that ASL offers to describe a person, place or thing and then point to a

place in space to temporarily store for later reference.

ASL uses facial expressions to distinguish between statements, questions

and directives. The eyebrows are raised for a question, held normal for a

statement, and furrowed for a directive. Although there has been considerable

research in facial feature recognition, facial features will not be used to aid

recognition in the task addressed here. Using them would be feasible in a full

real-time ASL dictionary.

1.4. GESTURE RECOGNITION

Gesture recognition is a language technology with the goal of interpreting

human gestures via mathematical algorithms. Gestures can originate from any

bodily motion or state but commonly originate from the face or hand. Current

focuses in the field include emotion recognition from the face and hand gesture

recognition. Many approaches have been made using cameras and computer

vision algorithms to interpret sign language. However, the identification and

recognition of posture, gait, proxemics, and human behaviors are also the subject

of gesture recognition techniques.

Gesture recognition can be seen as a way for computers to begin to

understand human body language, thus building a richer bridge between

machines and humans than primitive text user interfaces or even GUIs (graphical

user interfaces), which still limit the majority of input to keyboard and mouse.

Gesture recognition enables humans to interface with the machine (HMI)

and interact naturally without any mechanical devices. Using the concept of

gesture recognition, it is possible to point a finger at the computer screen so that

the cursor will move accordingly. This could potentially make conventional input

devices such as mice, keyboards and even touch-screens redundant.

Gesture recognition can be conducted with techniques from computer

vision and image processing.

The literature includes ongoing work in the computer vision field on

capturing gestures or more general human pose and movements by cameras

connected to a computer.

1.4.1. GESTURE RECOGNITION AND PEN COMPUTING

In some literature, the term gesture recognition has been used to refer more

narrowly to non-text-input handwriting symbols, such as inking on a graphics

tablet, multi-touch gestures, and mouse gesture recognition. This is computer

interaction through the drawing of symbols with a pointing device cursor.

1.4.2. GESTURE TYPES

In computer interfaces, two types of gestures are distinguished:

• Offline gestures:

Those gestures that are processed after the user interaction with the

object. An example is the gesture to activate a menu.

• Online gestures:

Direct manipulation gestures. They are used to scale or rotate a

tangible object.

1.4.3. USES

Gesture recognition is useful for processing information from humans

which is not conveyed through speech or type. As well, there are various types of

gestures which can be identified by computers.

• Sign language recognition:

Just as speech recognition can transcribe speech to text, certain

types of gesture recognition software can transcribe the symbols

represented through sign language into text.

• For socially assistive robotics:

By using proper sensors (accelerometers and gyros) worn on the

body of a patient and by reading the values from those sensors, robots can

assist in patient rehabilitation. The best example can be stroke

rehabilitation.

• Directional indication through pointing:

Pointing has a very specific purpose in our society, to reference an

object or location based on its position relative to ourselves. The use of

gesture recognition to determine where a person is pointing is useful for

identifying the context of statements or instructions. This application is of

particular interest in the field of robotics.

• Control through facial gestures:

Controlling a computer through facial gestures is a useful

application of gesture recognition for users who may not physically be able

to use a mouse or keyboard. Eye tracking in particular may be of use for

controlling cursor motion or focusing on elements of a display.

• Alternative computer interfaces:

Foregoing the traditional keyboard and mouse setup to interact with

a computer, strong gesture recognition could allow users to accomplish

frequent or common tasks using hand or face gestures to a camera.

• Immersive game technology:

Gestures can be used to control interactions within video games to

try and make the game player's experience more interactive or immersive.

• Virtual controllers:

For systems where the act of finding or acquiring a physical

controller could require too much time, gestures can be used as an

alternative control mechanism. Controlling secondary devices in a car, or

controlling a television set are examples of such usage.

• Affective computing:

In affective computing, gesture recognition is used in the process of

identifying emotional expression through computer systems.

• Remote control:

Through the use of gesture recognition, "remote control with the

wave of a hand" of various devices is possible. The signal must not only

indicate the desired response, but also which device to be controlled.

CHAPTER 2

LITERATURE SURVEY

A hand gesture analysis system based on a three-dimensional hand

skeleton model with 27 degrees of freedom was developed by Lee and Kunii.

They incorporated five major constraints based on the human hand kinematics to

reduce the model parameter space search. To simplify the model matching,

specially marked gloves were used.

Full ASL recognition systems (words, phrases) incorporate data gloves.

Takashi and Kishino discuss a Data glove-based system that could recognize 34

of the 46 Japanese gestures (user dependent) using a joint angle and hand

orientation coding technique. From their paper, it seems the test user made each

of the 46 gestures 10 times to provide data for principal component and cluster

analysis. A separate test was created from five iterations of the alphabet by the

user, with each gesture well separated in time. While these systems are

technically interesting, they suffer from a lack of training.

Excellent work has been done in support of machine sign language

recognition by Sperling and Parish, who have done careful studies on the

bandwidth necessary for a sign conversation using spatially and temporally sub-

sampled images. Point light experiments (where “lights” are attached to

significant locations on the body and just these points are used for recognition),

have been carried out by Poizner.

CHAPTER 3

BACKGROUND

3.1. EXISTING SYSTEM

The key problem in gesture interaction is how to make hand gestures

understood by computers. The existing approaches can be mainly divided into

“Data-Glove based”, “Vision Based” and “Analysis of Drawing Gestures”

approaches.

Research on hand gestures can be classified into three categories. The first

category, glove based analysis, employs sensors (mechanical or optical) attached

to a glove that transduces finger flexions into electrical signals for determining

the hand posture. The relative position of the hand is determined by an additional

sensor. This sensor is normally a magnetic or an acoustic sensor attached to the

glove. The methods use sensor devices for digitizing hand and finger motions

into multi-parametric data. The extra sensors make it easy to collect hand

configuration and movement. However, the devices are quite expensive and

cumbersome for users. For some data glove applications,

look-up table software toolkits are provided with the glove to be used for hand

posture recognition.

The second category, vision based analysis, is based on the way human

beings perceive information about their surroundings, yet it is probably the most

difficult to implement in a satisfactory way. Several different approaches have

been tested so far. One is to build a three-dimensional model of the human hand.

The model is matched to images of the hand by one or more cameras, and

parameters corresponding to palm orientation and joint angles are estimated.

These parameters are then used to perform gesture classification. Another Vision

Based method requires only a camera, thus realizing a natural interaction between

humans and computers without the use of any extra devices. These systems tend

to complement biological vision by describing artificial vision systems that are

implemented in software and hardware. This poses a challenging problem as

these systems need to be background invariant, lighting insensitive, person and

camera independent to achieve real time performance. Moreover, such systems

must be optimized to meet the requirements, including accuracy and robustness.

The third category, analysis of drawing gestures, usually involves the use

of a stylus as an input device. Analysis of drawing gestures can also lead to

recognition of written text. The vast majority of hand gesture recognition work

has used mechanical sensing, most often for direct manipulation of a virtual

environment and occasionally for symbolic communication. Sensing the hand

posture mechanically has a range of problems, however, including reliability,

accuracy and electromagnetic noise. Visual sensing has the potential to make

gestural interaction more practical, but potentially embodies some of the most

difficult problems in machine vision. The hand is a non-rigid object and, even

worse, self-occlusion is very common.

3.2. PROBLEM STATEMENT

The existing systems make use of sensors, accelerometers and sensing gloves.
All of these methods require a large number of hardware components. Sweat
produced by the hand reduces the efficiency of the tactile sensors, and their
efficiency also diminishes with wear caused by aging. Moreover, we cannot
expect people to move around wearing a sensing glove.

CHAPTER 4

METHODOLOGY

4.1. IMAGE CAPTURING DEVICES

The ability to track a person's movements and determine what gestures

they may be performing can be achieved through various tools. Although there is

a large amount of research done in image/video based gesture recognition, there

is some variation within the tools and environments used between

implementations.

• Depth-aware cameras:

Using specialized cameras such as time-of-flight cameras, one can

generate a depth map of what is being seen through the camera at a short

range, and use this data to approximate a 3-D representation of what is

being seen. These can be effective for detection of hand gestures due to

their short range capabilities.

• Stereo cameras:

Using two cameras whose relations to one another are known, a 3-D

representation can be approximated by the output of the cameras. To get

the cameras' relations, one can use a positioning reference such as a lexian-

stripe or infrared emitters. In combination with direct motion measurement

(6D-Vision) gestures can directly be detected.

• Controller-based gestures:

These controllers act as an extension of the body so that when

gestures are performed, some of their motion can be conveniently captured

by software. Mouse gestures are one such example, where the motion of

the mouse is correlated to a symbol being drawn by a person's hand, as is

the Wii Remote, which can study changes in acceleration over time to

represent gestures.

• Single camera:

A normal camera can be used for gesture recognition where the

resources/environment would not be convenient for other forms of image-

based recognition. Although not necessarily as effective as stereo or depth

aware cameras, using a single camera allows a greater possibility of

accessibility to a wider audience.

4.1.1. CHALLENGES

There are many challenges associated with the accuracy and usefulness of

gesture recognition software. For image-based gesture recognition there are

limitations on the equipment used and image noise. Images or video may not be

under consistent lighting, or in the same location. Items in the background or

distinct features of the users may make recognition more difficult.

The variety of implementations of image-based gesture recognition may also
cause issues for the viability of the technology in general use. For example,
an algorithm calibrated for one camera may not work for a different camera. The
amount of background noise also causes tracking and recognition difficulties,
especially when occlusions (partial and full) occur. Furthermore, the distance
from the camera, and the camera's resolution and quality, also cause variations
in recognition accuracy.

In order to capture human gestures by visual sensors, robust computer

vision methods are also required, for example for hand tracking and hand posture

recognition or for capturing movements of the head, facial expressions or gaze

direction.

The recognition problem is approached through a matching process in

which the segmented hand is compared with all the postures in the system’s

memory using the Hausdorff distance. The system’s visual memory stores all the

recognizable postures, their distance transform, their edge map and morphologic

information. A faster and more robust comparison is performed thanks to this

data, properly classifying postures, even those which are similar, saving valuable

time needed for real time processing. The postures included in the visual memory

may be initialized by the human user, learned or trained from previous tracking

hand motion or they can be generated during the recognition process.
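
The matching step above relies on the Hausdorff distance between two point sets, for example the edge pixels of the segmented hand and those of a stored posture. A minimal sketch follows, written in Python for illustration rather than in the Matlab used for the project. The function names and the sample point sets are assumptions introduced here and do not come from the system described.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical edge points (row, column) of a segmented hand and a stored posture.
hand = [(0, 0), (0, 3), (4, 0)]
stored = [(0, 0), (0, 3), (4, 3)]
print(hausdorff(hand, stored))
```

A small distance means every point of one shape lies close to some point of the other, so the stored posture with the smallest distance would be taken as the best match.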

4.2. SIGNIFICANCE OF GRAYSCALE IMAGES

The image captured by the camera is in RGB form. In order to reduce

the complexity of hand segmentation, we convert the RGB image to a grayscale image.

A grayscale (or gray level) image is simply one in which the only colors

are shades of gray. The reason for differentiating such images from any other sort

of color image is that less information needs to be provided for each pixel. In fact

a `gray' color is one in which the red, green and blue components all have equal

intensity in RGB space, and so it is only necessary to specify a single intensity

value for each pixel, as opposed to the three intensities needed to specify each

pixel in a full color image.

Fig 4.1 A model gray image

Often, the grayscale intensity is stored as an 8-bit integer giving 256

possible different shades of gray from black to white. If the levels are evenly

spaced then the difference between successive gray levels is significantly better

than the gray level resolving power of the human eye.
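
The conversion from three intensities per pixel to a single 8-bit gray level can be sketched as follows, in Python for illustration. The weights are the ITU-R BT.601 luma coefficients, which is an assumption made here; the report does not state which weighting it uses. The function names and the sample image are likewise hypothetical.

```python
def rgb_to_gray(pixel):
    """Map an (R, G, B) pixel with 8-bit channels to one 8-bit gray level."""
    r, g, b = pixel
    # Assumed weights (ITU-R BT.601 luma); the report does not give its own.
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def image_to_gray(image):
    """Convert a 2-D list of (R, G, B) pixels to a 2-D list of gray levels."""
    return [[rgb_to_gray(pixel) for pixel in row] for row in image]


# Hypothetical 2x2 image: white, black, mid gray and pure red.
rgb_image = [[(255, 255, 255), (0, 0, 0)],
             [(128, 128, 128), (255, 0, 0)]]
gray_image = image_to_gray(rgb_image)
print(gray_image)
```

Pixels whose red, green and blue components are equal keep that common value, which matches the description of a gray color above.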

4.2.1. GRAYSCALE AS SINGLE CHANNEL OF MULTICHANNEL

COLOUR IMAGES

Color images are often built of several stacked color channels, each of

them representing value levels of the given channel. For example, RGB images

are composed of three independent channels for red, green and blue primary

color components.

Here is an example of color channel splitting of a full RGB color image.

The column at left shows the isolated color channels in natural colors, while at

right there are their grayscale equivalences:

Fig 4.2 Three channels of a RGB image

The reverse is also possible: to build a full color image from their separate

grayscale channels. By mangling channels, using offsets, rotating and other

manipulations, artistic effects can be achieved instead of accurately reproducing

the original image.
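
Channel splitting and the reverse operation can be sketched in a few lines of Python. This is an illustration only; the function names and the sample image are assumptions, not part of the project code.

```python
def split_channels(image):
    """Return three grayscale images, one per channel of an RGB image."""
    return tuple([[pixel[c] for pixel in row] for row in image] for c in range(3))


def merge_channels(red, green, blue):
    """Rebuild an RGB image from its three separate grayscale channels."""
    return [list(zip(r_row, g_row, b_row))
            for r_row, g_row, b_row in zip(red, green, blue)]


# Hypothetical 1x2 RGB image.
rgb_image = [[(10, 20, 30), (40, 50, 60)]]
red, green, blue = split_channels(rgb_image)
rebuilt = merge_channels(red, green, blue)
print(red, green, blue)
```

Merging the unmodified channels reproduces the original image; passing the channels in a different order, or altering one before merging, gives the kind of artistic effect mentioned above.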

4.3. HAND SEGMENTATION

Thresholding is the simplest method of image segmentation. From a

grayscale image, thresholding can be used to create binary images. The key

parameter in the thresholding process is the choice of the threshold value. During

the thresholding process, individual pixels in an image are marked as “object”

pixels if their value is greater than some threshold value (assuming an object to

be brighter than the background) and as “background” pixels otherwise. This

convention is known as threshold above. Variants include threshold below, which

is opposite of threshold above; threshold inside, where a pixel is labeled "object"

if its value is between two thresholds; and threshold outside, which is the

opposite of threshold inside. Typically, an object pixel is given a value of “1”

while a background pixel is given a value of “0.” Finally, a binary image is

created by coloring each pixel white or black, depending on the pixel's label.

Fig 4.3 Original Image

Fig 4.4 Example of a threshold effect used on an image
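
The four thresholding conventions described in this section can be sketched as follows, in Python for illustration. Treating "opposite" as the exact complement, and treating the two bounds of threshold inside as inclusive, are assumptions made here; the function names and sample values are hypothetical.

```python
def threshold_above(image, t):
    """Label a pixel 1 (object) if its value is greater than t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in image]


def threshold_below(image, t):
    """Complement of threshold above (assumed meaning of 'opposite')."""
    return [[0 if v > t else 1 for v in row] for row in image]


def threshold_inside(image, low, high):
    """Label a pixel 1 if low <= value <= high (inclusive bounds assumed)."""
    return [[1 if low <= v <= high else 0 for v in row] for row in image]


def threshold_outside(image, low, high):
    """Complement of threshold inside."""
    return [[0 if low <= v <= high else 1 for v in row] for row in image]


# Hypothetical 2x2 grayscale image with 8-bit values.
gray = [[12, 200], [90, 255]]
print(threshold_above(gray, 100))
print(threshold_inside(gray, 50, 210))
```

Each function returns a binary image of the same size, with 1 for object pixels and 0 for background pixels.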

4.3.1. THRESHOLD SELECTION

The key parameter in the thresholding process is the choice of the

threshold value (or values, as mentioned earlier). Several different methods for

choosing a threshold exist; users can manually choose a threshold value, or a

thresholding algorithm can compute a value automatically, which is known as

automatic thresholding. A simple method would be to choose the mean or median

value, the rationale being that if the object pixels are brighter than the background,

they should also be brighter than the average. In a noiseless image with uniform

background and object values, the mean or median will work well as the threshold;

however, this will generally not be the case. A more sophisticated approach

might be to create a histogram of the image pixel intensities and use the valley

point as the threshold. The histogram approach assumes that there is some

average value for the background and object pixels, but that the actual pixel

values have some variation around these average values. However, this may be

computationally expensive, and image histograms may not have clearly defined

valley points, often making the selection of an accurate threshold difficult.
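
As a minimal sketch of the simplest automatic choice discussed above, the mean or median of all pixels can serve as the threshold. The example is in plain Python; the names `automatic_threshold` and `binarize` and the noiseless test image are hypothetical, and the histogram-valley method is not shown.

```python
from statistics import mean, median


def automatic_threshold(image, method="mean"):
    """Pick a global threshold from the pixel values themselves."""
    pixels = [value for row in image for value in row]
    if method == "mean":
        return mean(pixels)
    if method == "median":
        return median(pixels)
    raise ValueError("method must be 'mean' or 'median'")


def binarize(image, threshold):
    """Threshold above: pixels brighter than the threshold become 1."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


# Noiseless image: uniform dark background (20) with a bright 2 x 2 object (220).
image = [[20, 20, 20, 20],
         [20, 220, 220, 20],
         [20, 220, 220, 20],
         [20, 20, 20, 20]]

t_mean = automatic_threshold(image, "mean")
t_median = automatic_threshold(image, "median")
mask = binarize(image, t_mean)
```

In this idealized case both values separate the object cleanly; with noise or uneven lighting the same rule can fail, which is the limitation the text points out.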

4.3.2. ADAPTIVE THRESHOLDING

Thresholding is called adaptive thresholding when a different

threshold is used for different regions in the image. This may also be known as

local or dynamic thresholding.
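
One common way to realize a region-dependent threshold is to compare each pixel with the mean of its own neighborhood. The sketch below assumes that approach; the function name `adaptive_threshold`, the `radius` and `offset` parameters and the test image are illustrative choices, not the method used in this project.

```python
def adaptive_threshold(image, radius=1, offset=0):
    """Mark a pixel as object if it exceeds its local mean by more than offset."""
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Neighborhood clipped at the image border.
            window = [image[i][j]
                      for i in range(max(0, r - radius), min(rows, r + radius + 1))
                      for j in range(max(0, c - radius), min(cols, c + radius + 1))]
            local_mean = sum(window) / len(window)
            if image[r][c] > local_mean + offset:
                result[r][c] = 1
    return result


# Background brightness rises from left to right; two small objects sit on it.
image = [[0, 10, 20, 30, 40, 50],
         [0, 40, 20, 30, 70, 50],
         [0, 10, 20, 30, 40, 50]]

mask = adaptive_threshold(image, radius=1, offset=5)
```

No single global threshold isolates both objects here, because the dimmer object (40) is darker than the background on the right-hand side (50), whereas the local comparison finds both.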

4.3.3. MULTIBAND THRESHOLDING

Color images can also be thresholded. One approach is to

designate a separate threshold for each of the RGB components of the image and

then combine them with an AND operation. This reflects the way the camera

works and how the data is stored in the computer, but it does not correspond to

the way that people recognize color.

Therefore, it is easier to choose a threshold value for a grayscale

image than for a color image.
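
The per-channel approach with an AND combination can be sketched as follows. The code is plain Python for illustration; `multiband_threshold`, the threshold triple and the pixel values are assumed names and data.

```python
def multiband_threshold(rgb_image, thresholds):
    """A pixel is an object only if every channel exceeds its own threshold."""
    t_red, t_green, t_blue = thresholds
    return [[1 if (r > t_red and g > t_green and b > t_blue) else 0
             for (r, g, b) in row]
            for row in rgb_image]


rgb = [[(200, 150, 120), (200, 50, 120)],
       [(90, 150, 120), (220, 160, 130)]]

mask = multiband_threshold(rgb, (100, 100, 100))
```

Three thresholds have to be chosen instead of one, which is the extra complexity that conversion to grayscale avoids.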

4.4. MORPHOLOGICAL OPERATION

While point and neighborhood operations are generally designed to alter

the look or appearance of an image for visual considerations, morphological

operations are used to understand the structure or form of an image. This usually

means identifying objects or boundaries within an image. Morphological

operations play a key role in applications such as machine vision and automatic

object detection.

4.4.1. STRUCTURING ELEMENT

In mathematical morphology, a structuring element is a shape, used to

probe or interact with a given image, with the purpose of drawing conclusions on

how this shape fits or misses the shapes in the image. It is typically used in

morphological operations, such as dilation, erosion, opening, and closing, as well

as the hit-or-miss transform.

According to Georges Matheron, knowledge about an object depends on

the manner in which we probe (observe) it. In particular, the choice of a certain

structuring element for a particular morphological operation influences the

information one can obtain. There are two main characteristics that are directly

related to structuring elements.

• Shape

For example, the structuring element can be a “ball” or a line;

convex or a ring, etc. By choosing a particular structuring element, one sets

a way of differentiating some objects from others, according to their shape

or spatial orientation.

• Size

For example, one structuring element can be a small square or a much larger square.

Setting the size of the structuring element is similar to setting the

observation scale, and setting the criterion to differentiate image objects or

features according to size.

SE = strel ('disk', R, N) creates a flat, disk-shaped structuring element,

where R specifies the radius. R must be a non-negative integer. N must be 0, 4, 6,

or 8.

When N is greater than 0, the disk-shaped structuring element is

approximated by a sequence of N periodic-line structuring elements. When N

equals 0, no approximation is used, and the structuring element members consist

of all pixels whose centers are no greater than R away from the origin. If N is not

specified, the default value is 4.
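
The N = 0 case described above (all pixels whose centers lie within R of the origin) is easy to reproduce by hand. The sketch below is plain Python, not a call to MATLAB's strel, and the function name `disk_structuring_element` is hypothetical.

```python
def disk_structuring_element(radius):
    """Flat disk: 1 where the pixel center is no farther than radius from the origin."""
    size = 2 * radius + 1
    return [[1 if (x - radius) ** 2 + (y - radius) ** 2 <= radius ** 2 else 0
             for x in range(size)]
            for y in range(size)]


small = disk_structuring_element(1)
large = disk_structuring_element(2)

for row in large:
    print(row)
```

A radius of 1 gives a 3 x 3 cross and a radius of 2 gives a 5 x 5 rounded shape, so the radius sets the observation scale mentioned under "Size".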

Fig 4.5 Structuring Element

4.4.2. IMAGE CLOSING

Closing is an important operator from the field of mathematical

morphology. Like its dual operator opening, it can be derived from the

fundamental operations of erosion and dilation. Like those operators it is

normally applied to binary images, although there are gray level versions.

Closing is similar in some ways to dilation in that it tends to enlarge the

boundaries of foreground (bright) regions in an image (and shrink background

color holes in such regions), but it is less destructive of the original boundary

shape. As with other morphological operators, the exact operation is determined

by a structuring element. The effect of the operator is to preserve background

regions that have a similar shape to this structuring element, or that can

completely contain the structuring element, while eliminating all other regions of

background pixels.

Closing is opening performed in reverse. It is defined simply as dilation

followed by erosion using the same structuring element for both operations. See

the sections on erosion and dilation for details of the individual steps. The closing

operator therefore requires two inputs: an image to be closed and a structuring

element.

Gray level closing consists straightforwardly of a gray level dilation

followed by gray level erosion.

Closing is the dual of opening, i.e. closing the foreground pixels with a

particular structuring element is equivalent to opening the background with the

same element.

4.4.3. EFFECT OF IMAGE CLOSING

One of the uses of dilation is to fill in small background color holes in

images, e.g. ‘pepper noise’. One of the problems with doing this, however, is that

the dilation will also distort all regions of pixels indiscriminately. By performing

an erosion on the image after the dilation, i.e. a closing, we reduce some of this

effect. The effect of closing can be quite easily visualized. Imagine taking the

structuring element and sliding it around outside each foreground region, without

changing its orientation. For any background boundary point, if the structuring

element can be made to touch that point, without any part of the element being

inside a foreground region, then that point remains background. If this is not

possible, then the pixel is set to foreground. After the closing has been carried out

the background region will be such that the structuring element can be made to

cover any point in the background without any part of it also covering a

foreground point, and so further closings will have no effect. This property is

known as idempotence. The effect of a closing on a binary image using a 3×3

square structuring element is illustrated in Fig 4.6.

Fig 4.6 Effect of closing using a 3×3 square structuring element

As with erosion and dilation, this particular 3×3 structuring element is

the most commonly used, and in fact many implementations will have it

hardwired into their code, in which case it is obviously not necessary to specify a

separate structuring element. To achieve the effect of a closing with a larger

structuring element, it is possible to perform multiple dilations followed by the

same number of erosions.
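
Closing with the 3×3 square element can be sketched directly from its definition as dilation followed by erosion. The code is a plain Python illustration; the helper names (`dilate`, `erode`, `closing`, `pad`) and the zero padding used to keep the image border from influencing the result are assumptions of this sketch, not the project's MATLAB code.

```python
# Offsets of a 3 x 3 square structuring element centered on the origin.
SQUARE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def _get(image, r, c):
    """Pixel value, treating everything outside the image as background."""
    if 0 <= r < len(image) and 0 <= c < len(image[0]):
        return image[r][c]
    return 0


def dilate(image, offsets):
    return [[1 if any(_get(image, r - dr, c - dc) for dr, dc in offsets) else 0
             for c in range(len(image[0]))] for r in range(len(image))]


def erode(image, offsets):
    return [[1 if all(_get(image, r + dr, c + dc) for dr, dc in offsets) else 0
             for c in range(len(image[0]))] for r in range(len(image))]


def pad(image, margin):
    """Surround the image with a background border of the given width."""
    width = len(image[0]) + 2 * margin
    border = [[0] * width for _ in range(margin)]
    body = [[0] * margin + list(row) + [0] * margin for row in image]
    return border + body + [row[:] for row in border]


def closing(image, offsets, margin=1):
    """Dilation followed by erosion; margin must be at least the element radius."""
    padded = pad(image, margin)
    closed = erode(dilate(padded, offsets), offsets)
    return [row[margin:-margin] for row in closed[margin:-margin]]


image = [[0, 0, 0, 0, 0, 0, 0],
         [0, 1, 1, 1, 1, 1, 0],
         [0, 1, 1, 0, 1, 1, 0],   # single-pixel hole inside the object
         [0, 1, 1, 1, 1, 1, 0],
         [0, 0, 0, 0, 0, 0, 0]]

closed = closing(image, SQUARE_3X3)
closed_again = closing(closed, SQUARE_3X3)
```

The hole is filled while the outer boundary is unchanged, and applying the operator a second time changes nothing, which illustrates the idempotence property.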

Closing can sometimes be used to selectively fill in particular background

regions of an image. Whether or not this can be done depends upon whether a

suitable structuring element can be found that fits well inside regions that are to

be preserved, but doesn't fit inside regions that are to be removed.

4.5. IMAGE REGISTRATION

Image registration is the process of overlaying two or more images of the

same scene taken at different times, from different viewpoints, and/or by

different sensors. It geometrically aligns two images—the reference and sensed

images. The differences between the images are introduced by different

imaging conditions. Image registration is a crucial step in all image analysis tasks

in which the final information is gained from the combination of various data

sources like in image fusion, change detection, and multichannel image

restoration. Typically, registration is required in remote sensing (multispectral

classification, environmental monitoring, change detection, image mosaicing,

weather forecasting, creating super-resolution images, integrating information

into geographic information systems (GIS)), in medicine (combining computer

tomography (CT) and NMR data to obtain more complete information about the

patient, monitoring tumor growth, treatment verification, comparison of the

patient’s data with anatomical atlases), in cartography (map updating), and in

computer vision (target localization, automatic quality control), to name a few.

4.5.1. ALGORITHM CLASSIFICATIONS

4.5.1.1. INTENSITY BASED VS FEATURE BASED

Image registration or image alignment algorithms can be classified into

intensity-based and feature-based. One of the images is referred to as the

reference or source and the second image is referred to as the target or sensed.

Image registration involves spatially transforming the target image to align with

the reference image. Intensity-based methods compare intensity patterns in

images via correlation metrics, while feature-based methods find correspondence

between image features such as points, lines, and contours. Intensity-based

methods register entire images or sub images. If sub images are registered,

centers of corresponding sub images are treated as corresponding feature points.

Feature-based methods establish correspondence between a number of points in

images. Knowing the correspondence between a number of points in images, a

transformation is then determined to map the target image to the reference

image, thereby establishing point-by-point correspondence between the

reference and target images.

4.5.1.2. SPATIAL VS FREQUENCY DOMAIN METHODS

Spatial methods operate in the image domain, matching intensity patterns

or features in images. Some of the feature matching algorithms are outgrowths of

traditional techniques for performing manual image registration, in which an

operator chooses corresponding control points (CPs) in images. When the

number of control points exceeds the minimum required to define the appropriate

transformation model, iterative algorithms like RANSAC can be used to robustly

estimate the parameters of a particular transformation type (e.g. affine) for

registration of the images.

Frequency-domain methods find the transformation parameters for

registration of the images while working in the transform domain. Such methods

work for simple transformations, such as translation, rotation, and scaling.

Applying the Phase correlation method to a pair of images produces a third image

which contains a single peak. The location of this peak corresponds to the relative

translation between the images. Unlike many spatial-domain algorithms, the

phase correlation method is resilient to noise, occlusions, and other defects

typical of medical or satellite images. Additionally, the phase correlation uses the

fast Fourier transform to compute the cross-correlation between the two images,

generally resulting in large performance gains. The method can be extended to

determine rotation and scaling differences between two images by first

converting the images to log-polar coordinates. Due to properties of the Fourier

transform, the rotation and scaling parameters can be determined in a manner

invariant to translation.

4.5.1.3. SINGLE VS MULTI-MODALITY METHODS

Another classification can be made between single-modality and

multi-modality methods. Single-modality methods tend to register images in the

same modality acquired by the same scanner/sensor type, while multi-modality

registration methods tend to register images acquired by different

scanner/sensor types.

Multi-modality registration methods are often used in medical

imaging as images of a subject are frequently obtained from different scanners.

Examples include registration of brain CT/MRI images or whole body PET/CT

images for tumor localization, registration of contrast-enhanced CT images

against non-contrast-enhanced CT images for segmentation of specific parts of

the anatomy, and registration of ultrasound and CT images for prostate

localization in radiotherapy.

4.5.1.4. AUTOMATIC VS INTERACTIVE METHODS

Registration methods may be classified based on the level of

automation they provide. Manual, interactive, semi-automatic, and automatic

methods have been developed. Manual methods provide tools to align the images

manually. Interactive methods reduce user bias by performing certain key

operations automatically while still relying on the user to guide the registration.

Semi-automatic methods perform more of the registration steps automatically but

depend on the user to verify the correctness of a registration. Automatic methods

do not allow any user interaction and perform all registration steps automatically.

4.5.2. UNCERTAINTY

There is a level of uncertainty associated with registering images that

have any spatio-temporal differences. A confident registration with a measure of

uncertainty is critical for many change detection applications such as medical

diagnostics.

In remote sensing applications where a digital image pixel may

represent several kilometers of spatial distance (such as NASA's LANDSAT

imagery), an uncertain image registration can mean that a solution could be

several kilometers from ground truth. Several notable papers have attempted to

quantify uncertainty in image registration in order to compare results. However,

many approaches to quantifying uncertainty or estimating deformations are

computationally intensive or are only applicable to limited sets of spatial

transformations.

4.5.3. TRANSFORMATION METHODS

Image registration algorithms can also be classified according to the

transformation models they use to relate the target image space to the reference

image space.

The first broad category of transformation models includes linear

transformations, which include translation, rotation, scaling, and other affine

transforms. Linear transformations are global in nature; thus, they cannot model

local geometric differences between images.

The second category of transformations allows 'elastic' or 'nonrigid'

transformations. These transformations are capable of locally warping the target

image to align with the reference image. Nonrigid transformations include radial

basis functions (thin-plate or surface splines, multiquadrics, and compactly-

supported transformations), physical continuum models (viscous fluids), and

large deformation models (diffeomorphisms).

4.5.4. RADON TRANSFORM

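The Radon transform of a 2-D function f (x, y) is defined as:

\[ R(r, \theta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, \delta(r - x\cos\theta - y\sin\theta)\, dx\, dy \]

where δ is the Dirac delta function, r is the perpendicular distance of a line from the origin and θ is the angle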

between the line and the y-axis. According to the Fourier slice theorem, this

transformation is invertible and the 1-D Fourier transforms of the Radon

transform along r are the 1-D radial samples of the 2-D Fourier transform of

f (x, y) at the corresponding angles. The transform we have used is the Radon

transform. The RADON function computes the Radon transform, which is the

projection of the image intensity along a radial line oriented at a specific angle.

R = RADON(I,THETA) returns the Radon transform of the intensity

image I for the angle THETA degrees. If THETA is a scalar, the result R is a

column vector containing the Radon transform for THETA degrees. If THETA is

a vector, then R is a matrix in which each column is the Radon transform for one

of the angles in THETA. If you omit THETA, it defaults to 0:179.

[R,Xp] = RADON(...) returns a vector Xp containing the radial coordinates

corresponding to each row of R.
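
The idea of projecting image intensity at a given angle can be illustrated with a coarse nearest-bin approximation. This is a plain Python sketch, not MATLAB's RADON implementation: the function name `radon_projection`, the centering convention and the rounding to integer bins are all assumptions, and picking the angle with the sharpest peak is only one possible orientation cue.

```python
import math


def radon_projection(image, theta_deg):
    """Accumulate pixel intensities into bins of r = x*cos(theta) + y*sin(theta)."""
    rows, cols = len(image), len(image[0])
    center_y, center_x = (rows - 1) / 2, (cols - 1) / 2
    theta = math.radians(theta_deg)
    half = math.ceil(math.hypot(rows, cols) / 2)   # largest possible |r|
    bins = [0.0] * (2 * half + 1)
    for y in range(rows):
        for x in range(cols):
            r = (x - center_x) * math.cos(theta) + (y - center_y) * math.sin(theta)
            bins[int(round(r)) + half] += image[y][x]
    return bins


# 5 x 5 image containing a vertical bar in the middle column.
bar = [[1 if x == 2 else 0 for x in range(5)] for y in range(5)]

projection_0 = radon_projection(bar, 0)     # bar collapses into a single bin
projection_90 = radon_projection(bar, 90)   # bar spreads over five bins

angles = [0, 45, 90, 135]
dominant = max(angles, key=lambda t: max(radon_projection(bar, t)))
```

The projection is most concentrated at the angle aligned with the bar, which is why projections at several angles can indicate how far an object is rotated from a reference axis.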

4.6. FEATURE EXTRACTION

In pattern recognition and in image processing, feature extraction is a

special form of dimensionality reduction.

When the input data to an algorithm is too large to be processed and it is

suspected to be notoriously redundant (much data, but not much information)

then the input data will be transformed into a reduced representation set of

features (also called a feature vector). Transforming the input data into the set of

features is called feature extraction. If the features extracted are carefully chosen,

it is expected that the feature set will capture the relevant information from the

input data, so that the desired task can be performed using this reduced representation

instead of the full-size input.

Feature extraction involves simplifying the amount of resources required to

describe a large set of data accurately. When performing analysis of complex

data one of the major problems stems from the number of variables involved.

Analysis with a large number of variables generally requires a large amount of

memory and computation power or a classification algorithm which overfits the

training sample and generalizes poorly to new samples. Feature extraction is a

general term for methods of constructing combinations of the variables to get

around these problems while still describing the data with sufficient accuracy.

The discrete wavelet transform (DWT) decomposes an input signal into

low and high frequency components using a filter bank. The Haar wavelet, which

characterizes the filter bank, has the important properties of orthogonality, linearity,

and completeness. We can repeat the DWT multiple times to obtain a multi-level

resolution across different octaves. For each level, wavelets can be separated into

different basis functions for image compression and recognition.

Fig 4.7 Multi-resolution expansion using Haar wavelet

The wavelet transform can be used to represent a two-dimensional (2D)

signal by the 2D resolution decomposition procedure, where an image is

repeatedly decomposed into an approximation and several detail components at

each level.

In order to construct the wavelet pyramid, we decide the number of Haar

coefficients and approximation levels. We would like to extract salient points

from any part of the image where “something” happens in the image at any

resolution. A high wavelet coefficient (in absolute value) at a coarse resolution

corresponds to a region with high global variations. The properly chosen length

of the Haar wavelet and the number of the approximation levels provide the

optimum local key points or features.

4.6.1. WAVELETS

Wavelet analysis is performed using a prototype function called a

wavelet, which has the effect of a band pass filter. Wavelets are functions defined

over a finite interval and having an average value of zero. The basic idea of the

wavelet transform is to represent any arbitrary function f (t) as a superposition of

a set of such wavelets or basis functions. These basis functions are derived from a

single prototype mother wavelet.

The term wavelet means a small wave. The smallness refers to the

condition that this window function is of finite length (compactly supported). The

‘wave’ refers to the condition that this function is oscillatory. The term ‘mother’

implies that the functions with different regions of support that are used in the

transformation process are derived from the mother wavelet by dilations (scaling) and translations (shifts).

4.6.2. WAVELET TRANSFORM

The wavelet transform is a mathematical tool that decomposes a signal

into a representation that shows signal details and trends as a function of time. It is

used to characterize transients, reduce noise, compress data, and perform many

other operations.

Wavelet analysis is a windowing technique, similar to the short-time Fourier transform (STFT), but with

variable-sized windows. Wavelet analysis is capable of revealing aspects of data

that other signal analysis techniques miss, including aspects such as trends,

breakdown points, discontinuities, and self-similarity. It is also often used to

compress or denoise a signal without any appreciable degradation.

4.6.3. THE DISCRETE WAVELET TRANSFORM

The discrete wavelet transform maps a discrete signal from the time

domain into the time-frequency domain. The result of the transformation is a set of

coefficients organized in a way that enables not only spectrum analysis of the

signal, but also analysis of the spectral behavior of the signal in time. This is achieved by

decomposing the signal into two components, each carrying information

about the source signal. Filters from the filter bank used for decomposition come in

pairs: low pass and high pass. The filtering is followed by down sampling

(the filtering result is “re-sampled” so that only every second coefficient is kept).

The low pass filtered signal contains information about the slowly changing component of

the signal, looking very similar to the original signal, only two times shorter in

terms of samples. The high pass filtered signal contains information about the fast

changing component of the signal. In most cases the high pass component is not so

rich in data, which is a good property for compression. In some cases, such as

audio or video signals, it is possible to discard some of the samples of the high

pass component without noticing any significant changes in the signal. The filters of the

filter bank are called “wavelets”.
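
One level of this decomposition can be sketched with the Haar filter pair, where filtering and down sampling reduce to a scaled sum and difference of neighboring samples. The code is plain Python for illustration; the function names `haar_dwt` and `haar_idwt` and the sample signal are assumptions.

```python
import math


def haar_dwt(signal):
    """One level: low pass (approximation) and high pass (detail), each half length."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Rebuild the signal from its two half-length components."""
    scale = math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / scale, (a - d) / scale])
    return signal


signal = [4, 6, 10, 12, 8, 8, 2, 0]
approx, detail = haar_dwt(signal)
restored = haar_idwt(approx, detail)
```

The approximation follows the slow trend of the signal, the detail values are small wherever neighboring samples are similar, and the two together reconstruct the original samples.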

4.6.4. 2D-DISCRETE WAVELET TRANSFORM

The two-dimensional DWT can be implemented using digital filters and

down samplers. With separable two-dimensional scaling and wavelet functions,

we get one set of approximation coefficients and three sets of detail coefficients:

horizontal, vertical and diagonal.

The concepts of one-dimensional DWT and its implementation through

sub band coding can be easily extended to two-dimensional signals for digital

images. In case of sub band analysis of images, we require extraction of its

approximate forms in both horizontal and vertical directions, details in horizontal

direction alone (detection of horizontal edges), details in vertical direction alone

(detection of vertical edges) and details in both horizontal and vertical directions

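(detection of diagonal edges). This analysis of 2-D signals requires the use of the

following two-dimensional filter functions, formed through the multiplication of

separable scaling and wavelet functions in the x (horizontal) and y (vertical) directions,

as defined below:

\[ \varphi(x, y) = \varphi(x)\,\varphi(y), \qquad \psi^{H}(x, y) = \psi(x)\,\varphi(y), \]

\[ \psi^{V}(x, y) = \varphi(x)\,\psi(y), \qquad \psi^{D}(x, y) = \psi(x)\,\psi(y) \]

Here φ(x, y), ψ^H(x, y), ψ^V(x, y) and ψ^D(x, y) represent the approximated

signal, signal with horizontal details, signal with vertical details and signal with

diagonal details respectively.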
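
For the Haar wavelet, one level of the separable 2-D transform reduces to sums and differences over each 2 x 2 block. The sketch below is plain Python with hypothetical names; because the naming of the horizontal and vertical sub-bands differs between texts, the outputs are labeled by the kind of edge each one responds to.

```python
def haar_dwt2(image):
    """One-level 2-D Haar transform computed block by block."""
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image dimensions must be even")
    approx, horizontal_edges, vertical_edges, diagonal = [], [], [], []
    for r in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            tl, tr = image[r][c], image[r][c + 1]          # top-left, top-right
            bl, br = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((tl + tr + bl + br) / 2)  # low pass in both directions
            h_row.append((tl + tr - bl - br) / 2)  # top row minus bottom row
            v_row.append((tl - tr + bl - br) / 2)  # left column minus right column
            d_row.append((tl - tr - bl + br) / 2)  # difference of the two diagonals
        approx.append(a_row)
        horizontal_edges.append(h_row)
        vertical_edges.append(v_row)
        diagonal.append(d_row)
    return approx, horizontal_edges, vertical_edges, diagonal


# Alternating dark and bright rows: only horizontal edges are present.
striped = [[1, 1, 1, 1],
           [5, 5, 5, 5],
           [1, 1, 1, 1],
           [5, 5, 5, 5]]

approx, horizontal_edges, vertical_edges, diagonal = haar_dwt2(striped)
```

Each output is a quarter of the input size, and for this striped image only the approximation and the horizontal-edge sub-band are non-zero.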

CHAPTER 5

OVERVIEW OF THE PROJECT

5.1. AN OVERLAY OF OUR ALGORITHM

Fig 5.1 Overlay of our algorithm

[Flowchart: captured hand gesture → resizing image → gray scaled image → hand segmentation → morphological operation → image normalization → Haar wavelet transform → feature extraction → recognised → speech signal]

5.2. PROPOSED WORK

Here we have approached gesture recognition through image processing.

With a constraint of constant background and constant zoom level we have

tracked the hand gesture. The captured image is resized to a common size so

that the machine has a constant frame size to process. An RGB image would require a

threshold value for each color channel, so to reduce this complexity we have

converted it into a gray scale image. The hand is then extracted using the skin

color approach, and we assume that the forearm of the user is covered by clothes. A

pixel is defined as a skin pixel if it satisfies the following condition:

Gray scale < Threshold

where gray scale denotes the intensity value of the input hand image.

By the skin color approach, a hand map image can then be defined. In the

hand map image, a white pixel (pixel value =1) and a black pixel (pixel value =

0) indicate the skin and non skin pixels respectively.

The segmented hand image is now binary, and it undergoes several

preprocessing steps. The binary image obtained is noisy and does not capture the

exact hand geometry. Noise in the hand image produces holes, which

are then minimized by using morphological operations. A structuring element

of suitable size is designed and image closing is performed to recover the geometry

of the hand.

In morphology, dilation expands an image and erosion shrinks it. Closing

tends to smooth contours: it generally fuses narrow breaks and long thin gulfs,

eliminates small holes, and fills gaps in the contour.

The closing of set A by structuring element B, denoted by A • B, is defined as

\[ A \bullet B = (A \oplus B) \ominus B \]

which, in words, says that the closing of A by B is simply the dilation of A

by B, followed by the erosion of the result by B.

The gesture captured at different time intervals may be at different angular

position. Our next step is to normalize the segmented hand to a common axis.

Image registration is performed to rotate the image to a constant image axis

thereby reducing the number of training images in the database. This

normalization is done with the help of radon transform.

There are several choices for the selection of features in order to

discriminate between hands in a hand gesture recognition system.

In numerical analysis and functional analysis, a discrete wavelet

transform (DWT) is any wavelet transform for which the wavelets are discretely

sampled. As with other wavelet transforms, a key advantage it has over Fourier

transforms is temporal resolution: it captures both frequency and location

information.

For an input represented by a list of 2^n numbers, the Haar wavelet

transform may be considered to simply pair up input values, storing the

difference and passing the sum. This process is repeated recursively, pairing up

the sums to provide the next scale, finally resulting in 2^n − 1 differences and one

final sum.
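
The pairing procedure can be traced on a short list. The sketch below is plain Python and uses unscaled sums and differences exactly as described in the paragraph above; the function name `haar_sum_difference` and the input values are illustrative.

```python
def haar_sum_difference(values):
    """Repeatedly pair up values, storing differences and passing sums on."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    differences = []          # one list of differences per scale, finest first
    sums = list(values)
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        differences.append([a - b for a, b in pairs])
        sums = [a + b for a, b in pairs]
    return sums[0], differences


final_sum, differences = haar_sum_difference([4, 6, 10, 12, 8, 8, 2, 0])
count = sum(len(level) for level in differences)
```

For this input of 2^3 = 8 numbers the three scales hold 4, 2 and 1 differences, giving 2^3 − 1 = 7 differences and one final sum.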

The feature used here is the horizontal component of the discrete wavelet

transformed image. This feature vector, which differs from the features used by other

statistical methods, gives a better recognition rate.

This vector component is correlated with the training image. Correlation is

a measure of how well the predicted values from a forecast model "fit" with the

real-life data.

The magnitude of the correlation coefficient is a number between 0 and 1. If there is no

relationship between the predicted values and the actual values the correlation

coefficient is 0 or very low (the predicted values are no better than random

numbers). As the strength of the relationship between the predicted values and

actual values increases so does the correlation coefficient. A perfect fit gives a

coefficient of 1.0. Thus, the higher the correlation coefficient, the better the recognition.
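
Matching by correlation can be sketched as follows. The example assumes the Pearson correlation coefficient, which can also be negative for inversely related vectors, and picks the stored template with the highest value; the function names, the template labels and the short feature vectors are hypothetical and stand in for the real wavelet features.

```python
import math


def correlation_coefficient(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return covariance / math.sqrt(var_x * var_y)


def recognise(feature, templates):
    """Return the label of the template that correlates best with the feature."""
    return max(templates,
               key=lambda label: correlation_coefficient(feature, templates[label]))


templates = {"A": [1, 2, 3, 4],
             "B": [4, 3, 2, 1],
             "C": [1, 3, 1, 3]}

best = recognise([2, 4, 6, 8.5], templates)
```

The test vector rises steadily like template "A", so that template gives the coefficient closest to 1 and is selected.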

Then comes our speech signal unit, which plays the corresponding wave

file. The recognized image has to be authenticated by a sound file. This would

help the speaking community to easily understand what speech-impaired

people mean to say. In this part we have saved a .wav file for each letter of the alphabet. The

.wav file for that particular gesture is read and played. This is done by using the

wavread and wavplay commands.

Because this project computes a wavelet transform and a correlation for each

gesture, which is computationally simple compared to other gesture recognition

systems, it has an added advantage. Moreover, the hardware requirement is

reduced.

CHAPTER 6

SOFTWARE DESCRIPTION

6.1. INTRODUCTION

The simulation tool used for the development of the software is MATLAB.

MATLAB stands for Matrix Laboratory. It is a technical computing environment

for high performance numeric computation and visualization. It integrates

numerical analysis, matrix computation, signal processing and graphics in an

easy to use environment, where problems and solutions are expressed just as they

are written mathematically, without traditional programming. MATLAB allows

us to express the entire algorithm in a few dozen lines, to compute the solution

with great accuracy in a few minutes on a computer, and to readily manipulate a

three-dimensional display of the result in color.

The basic building block of MATLAB is the matrix. The fundamental

data type is the array; vectors, scalars, real matrices and complex matrices are all

automatically handled as special cases of this fundamental data type. MATLAB also features

a family of application-specific solutions called ‘Toolboxes’. Areas in which

toolboxes are available include signal processing, image processing, control

system design, dynamic system simulation, system identification, neural

networks, wavelets, communications and others. These toolboxes are

collections of functions written for special applications such as symbolic

computing, image processing, neural networks etc.

6.2. FEATURES OF MATLAB

Some of the special features of MATLAB are

6.2.1. COMMAND WINDOW

This is the main window. It is characterized by the MATLAB command

prompt ‘>>’. When the application is launched, the user is taken to this prompt. All

commands, including those for running user-written programs, are typed at the

MATLAB prompt.

6.2.2. GRAPHICS WINDOW

The outputs of all the graphics are flushed to the graphics (or) figure

window, a separate gray window with (default) white background. The user can

create as many figure windows as the memory will allow.

6.2.3. EDIT WINDOW

This is where we write, edit, create and save our own programs in M-files.

Any text editor can be used to carry out these tasks. On most systems such as PCs

and Macs, MATLAB provides its own built-in editor. On other systems, the standard

file editing program can be invoked by typing a command at the command prompt.

6.2.4. INPUT OUTPUT

MATLAB supports interactive computation, taking the input from the

screen and flushing the output to the screen. In addition, it reads input files and

writes output files. The following features hold for all forms of input-output.

6.2.5. DATA TYPE

The fundamental data type in MATLAB is the array. It encompasses

several distinct data objects: integers, matrices, doubles, character strings,

structures and cells. In most cases, however data type (or) data object declaration

is not needed.

6.2.6. DIMENSIONING

Dimensioning is automatic in MATLAB. No dimension statement is

required for vectors (or) arrays. The commands ‘size’ and ‘length’ yield the

dimension of an existing matrix (or) vector.

6.2.7. CASE SENSITIVITY

MATLAB is case sensitive, i.e. it differentiates lowercase and uppercase

letters. Thus ‘a’ and ‘A’ are different variables. Most MATLAB commands and

built-in function calls are typed in lowercase letters.

6.3. IMAGES IN MATLAB

The basic data structure in MATLAB is the array, an ordered set of real or

complex elements. This object is naturally suited to the representation of images,

real-valued, ordered sets of color or intensity data. MATLAB stores most images

as two-dimensional arrays, in which each element of the matrix corresponds to a

single pixel in the displayed image. For example, an image composed of 200 rows

and 300 columns of different colored dots is stored in MATLAB as a 200 by 300

matrix.

By default, MATLAB stores most data in arrays of class double. The data

in these arrays is stored as double precision (64-bit) floating-point numbers. All

of MATLAB’s functions and capabilities work with these arrays.

The number of pixels in an image may be large; for example a 1000 by

1000 image has a million pixels. Since each pixel is represented by at least one

array element, this image would require about 8 megabytes of memory.

In order to reduce memory requirements, MATLAB supports storing

image data in arrays of class uint8. The data in these arrays requires one eighth

as much memory as data in double arrays. Because the types of values that can be

stored in uint8 arrays and double arrays differ, the Image Processing Toolbox uses

different conventions for interpreting the values in these arrays.

6.4. FILE TYPES

MATLAB has four types of files for storing information. They are

• M-files.

• Script files

• Function files.

• MAT files

6.4.1. M-FILES

M-files are standard ASCII text files with a ‘.m’ extension to the file name.
There are two types of these files, namely script files and function files. Most
programs written in MATLAB are saved as M-files. Many of the functions supplied
with MATLAB are themselves M-files with readable source code, so they can be
copied and modified.

6.4.2. SCRIPT FILES

A script file is an M-file containing a valid set of MATLAB commands. A script
file is executed by typing the name of the file (without the ‘.m’ extension) on
the command line. This is equivalent to typing all the commands stored in the
script file, one by one, at the MATLAB prompt. Script files therefore work on the
workspace variables, i.e. the variables currently present in the workspace.

A script file may contain any number of commands, including those that call
built-in functions or functions written by the user. Script files are useful when
a certain set of commands has to be repeated several times.
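As an illustration, the lines below could be saved as a script file. The file name
brighten_demo.m and the variable img are hypothetical and serve only as an example.

```matlab
% Contents of a hypothetical script file named brighten_demo.m
img = uint8([10 20; 30 40]);   % variable created in the workspace
img = img + 50;                % the script changes the workspace variable
disp(img)                      % display the modified matrix
```

Typing brighten_demo at the MATLAB prompt would run these commands in order, and
img would remain in the workspace afterwards.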


6.4.3. FUNCTION FILES

A function file is also an M-file, like a script file, except that the variables
in a function file are all local. Function files are like programs or subroutines
in FORTRAN, procedures in PASCAL and functions in C.

A function file begins with a function definition line, which has a well-defined
list of inputs and outputs. Without this line the file becomes a script file.
The syntax of the function definition line is as follows:

function [output variables] = function_name(input variables)

where the function name should be the same as the name of the file in which the
function is written.
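A minimal function file following this syntax might look as follows. The function
name image_bytes and its arguments are hypothetical and are not part of the
project code.

```matlab
% Contents of a hypothetical function file named image_bytes.m
function nbytes = image_bytes(nRows, nCols, bytesPerElement)
% IMAGE_BYTES  Memory needed for an nRows-by-nCols single-channel image.
    nbytes = nRows * nCols * bytesPerElement;   % all variables here are local
end
```

Calling image_bytes(1000, 1000, 8) at the prompt returns 8000000, the storage
needed for a 1000 by 1000 image of class double.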

6.4.4. MAT-FILES

MAT-files are binary data files with a ‘.mat’ extension. MAT-files are created by
MATLAB when data is saved with the ‘save’ command. The data is written in a
special binary format that MATLAB reads back directly. MAT-files can be loaded
into MATLAB using the ‘load’ command.
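A short sketch of the save and load round trip is given below. The data values and
the file name gesture_demo.mat are made up for the example.

```matlab
features = [0.5 -1.25 3];      % example data, not taken from the report
label = 'A';

matFile = fullfile(tempdir, 'gesture_demo.mat');   % hypothetical file name
save(matFile, 'features', 'label');   % write both variables to the MAT-file

S = load(matFile);       % read them back as fields of a structure
restored = S.features;
delete(matFile);         % remove the temporary file
```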


CHAPTER 7

SIMULATION RESULT

STEP: 1

Input images from different cameras may have varying dimensions (M x N). To
standardize the size of the image that is to be processed, we resize it.

Fig 7.1 Resized image
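The resizing step can be sketched as below, assuming the Image Processing Toolbox
function imresize is available. The synthetic input frame and the 256 by 256
target size are assumptions made for the example; the report does not state the
size it used in this section.

```matlab
% Synthetic stand-in for a camera frame: 480-by-640 RGB, class uint8.
frame = uint8(randi([0 255], 480, 640, 3));

targetSize = [256 256];                  % assumed size, not from the report
resized = imresize(frame, targetSize);   % rescale rows and columns

outSize = size(resized);                 % the color dimension is kept
```

Giving both a row and a column count forces every input to the same dimensions,
at the cost of changing the aspect ratio of non-square frames.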


STEP: 2

The RGB image obtained is converted to gray scale to simplify the thresholding
process. Each pixel of this gray scale image is a combination of the three color
components red, green and blue.

Fig 7.2 Gray scale image
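One way to form such a combination is a weighted sum of the three color planes.
The sketch below uses the weights documented for the Image Processing Toolbox
function rgb2gray; whether the project used that function or these weights is an
assumption.

```matlab
% 2-by-2 RGB test image: red, green, blue and white pixels.
rgb = zeros(2, 2, 3, 'uint8');
rgb(1, 1, :) = [255 0 0];
rgb(1, 2, :) = [0 255 0];
rgb(2, 1, :) = [0 0 255];
rgb(2, 2, :) = [255 255 255];

% Work in double so that the weighted sum is not clipped or rounded early.
R = double(rgb(:, :, 1));
G = double(rgb(:, :, 2));
B = double(rgb(:, :, 3));

% Weighted combination of the three planes, rounded back to uint8.
gray = uint8(0.2989 * R + 0.5870 * G + 0.1140 * B);
```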


STEP: 3

Here we segment the hand region from the background by choosing an appropriate
threshold value. This process gives an outline of the hand region that needs to
be processed.

Fig 7.3 Segmented hand
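Thresholding of this kind can be sketched in a few lines. The tiny test image, the
threshold of 128 and the assumption that the hand is brighter than the background
are all illustrative choices, not values from the report.

```matlab
% Tiny gray scale image: bright "hand" pixels on a dark background.
gray = uint8([10  20 200;
              15 220 210;
              12  18  25]);

level = 128;                % assumed threshold value
handMask = gray > level;    % logical mask: true where the hand is

handPixels = nnz(handMask); % number of pixels classified as hand
```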


STEP: 4

In this step we create a structuring element and perform the morphological
closing operation on the image. This process removes the noise components and
gives an exact geometry of the hand gesture.

Fig 7.4 Morphologically operated image
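A minimal sketch of the closing operation follows, assuming the Image Processing
Toolbox functions strel and imclose. The disk shape and the radius of 2 are
assumed values; the report does not give the structuring element in this section.

```matlab
% Synthetic hand mask: a solid square with a one-pixel hole as "noise".
handMask = false(15, 15);
handMask(4:12, 4:12) = true;
handMask(8, 8) = false;

se = strel('disk', 2);                % structuring element, assumed radius
closedMask = imclose(handMask, se);   % dilation followed by erosion

holeFilled = closedMask(8, 8);        % true once the hole has been closed
```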


STEP: 5

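Here we rotate the image so that the hand lies along a common axis. This greatly
reduces the number of images needed in the database, since each gesture does not
have to be stored at several orientations.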

Fig 7.5 Normalized image
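The following sketch shows one possible way to perform such a rotation. It
estimates the principal axis of the hand mask from second-order central moments
and then calls the Image Processing Toolbox function imrotate. The moment-based
angle estimate, the choice of the horizontal as the common axis and the synthetic
mask are assumptions; this section of the report does not state how the angle is
obtained.

```matlab
% Synthetic mask: a thick bar tilted down and to the right.
mask = false(101, 101);
for k = -30:30
    mask(51 + k + (-3:3), 51 + k) = true;
end

% Principal-axis angle from second-order central moments (assumed method).
[r, c] = find(mask);
x = c - mean(c);        % column offsets from the centroid
y = r - mean(r);        % row offsets (rows increase downwards)
thetaDeg = 0.5 * atan2(2 * mean(x .* y), mean(x .^ 2) - mean(y .^ 2)) * 180 / pi;

% imrotate turns the image counterclockwise by the given angle in degrees,
% which brings the long axis of the bar onto the horizontal.
aligned = imrotate(mask, thetaDeg, 'nearest', 'crop');
```

The 'crop' option keeps the output the same size as the input, so the normalized
images remain directly comparable.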


STEP: 6

This step performs the feature extraction. The horizontal component of the
wavelet-transformed test image is used for recognition. This step also greatly
reduces the database size.

Fig 7.6 Horizontal vector of DWT
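A one-level Haar decomposition can be written out by hand, as in the sketch below,
which assumes an image with even dimensions. The Wavelet Toolbox function dwt2
with the 'haar' wavelet provides a decomposition of this kind, although its sign
convention may differ from the one used here. Flattening the horizontal band into
a row vector is an assumption about how the feature vector is formed.

```matlab
% Small test image whose rows are constant, so it varies only vertically.
X = repmat((1:8)', 1, 8);

% The four pixels of every non-overlapping 2-by-2 block.
a = X(1:2:end, 1:2:end);   % top-left
b = X(1:2:end, 2:2:end);   % top-right
c = X(2:2:end, 1:2:end);   % bottom-left
d = X(2:2:end, 2:2:end);   % bottom-right

cA = (a + b + c + d) / 2;  % approximation band
cH = (a + b - c - d) / 2;  % horizontal detail: sum along rows, difference between rows
cV = (a - b + c - d) / 2;  % vertical detail: difference between columns

featureVector = cH(:)';    % horizontal band flattened to a row vector
```

Each band has a quarter of the elements of the input, which is why keeping only
the horizontal band shrinks the stored data.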


CHAPTER 8

CONCLUSION

The inspiration behind this project came from the thought of helping to alleviate
the language barrier that stands between the deaf and hearing communities.
Translating finger spelling to the spoken English alphabet is only a small step
towards this ultimate goal, and the resulting gesture recognition approach
achieves that step. The discrete wavelet transform and the normalization of the
image axis help to reduce the database size. The wavelet transform is
time-efficient because it is computationally simple. The normalization also
provides a uniform pattern to correlate with the training images, so that the
speech signal corresponding to the matching image can be output. Our method
appears promising, as there is a substantial reduction in error rate and
processing time.


CHAPTER 9

REFERENCES

[1] Wing Kwong Chung, Xinyu Wu, and Yangsheng Xu, “A Realtime Hand Gesture
Recognition based on Haar Wavelet Representation”, Proceedings of the 2008 IEEE
International Conference on Robotics and Biomimetics, Bangkok, Thailand,
February 2009.

[2] J. Allen, P. Asselin, and R. Foulds, “American Sign Language Finger Spelling
Recognition System”, Proceedings of Bioengineering Conference, March 2003.

[3] H. Brashear, T. Starner, P. Lukowicz, and H. Junker, “Using multiple sensors
for mobile sign language recognition”, Proceedings of IEEE International
Symposium on Wearable Computers, pp. 45-52, October 2003.

[4] J. H. Shin, J. S. Lee, S. K. Kil, D. F. Shen, J. G. Ryu, E. H. H. K. Min, and
S. H. Hong, “Hand Region Extraction and Gesture Recognition using entropy
analysis”, International Journal of Computer Science and Network Security,
Vol. 6, No. 2, pp. 216-222, February 2006.

[5] C.L. Huang and W.Y. Huang, “Sign language recognition using model based
tracking and a 3D Hopfield neural network”, Machine Vision and Applications,
Vol. 10, pp. 292-301, 1998.

[6] G. Gomez, M. Sanchez, and L. E. Sucar, “On selecting an appropriate colour
space for skin detection”, Proceedings of Mexican International Conference on
Artificial Intelligence, Yucatan, pp. 69-78, 2002.
