
Accepted Manuscript

An Automatic Arabic Sign Language Recognition System (ArSLRS)

Nada B. Ibrahim, Mazen M. Selim, Hala H. Zayed

PII: S1319-1578(17)30177-5
DOI: https://doi.org/10.1016/j.jksuci.2017.09.007
Reference: JKSUCI 357

To appear in: Journal of King Saud University - Computer and Information Sciences

Received Date: 5 June 2017
Revised Date: 12 September 2017
Accepted Date: 26 September 2017

Please cite this article as: Ibrahim, N.B., Selim, M.M., Zayed, H.H., An Automatic Arabic Sign Language Recognition System (ArSLRS), Journal of King Saud University - Computer and Information Sciences (2017), doi: https://doi.org/10.1016/j.jksuci.2017.09.007

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


An Automatic Arabic Sign Language Recognition System (ArSLRS)

Nada B. Ibrahim 1*, Mazen M. Selim 2 and Hala H. Zayed 3

1 Department of Computer Science, Faculty of Computers and Informatics, Benha University, Benha, Egypt

Email: [email protected]

2 Department of Computer Science, Faculty of Computers and Informatics, Benha University, Benha, Egypt

Email: [email protected]

3 Department of Computer Science, Faculty of Computers and Informatics, Benha University, Benha, Egypt

Email: [email protected]

Corresponding Author:

Nada B. Ibrahim

Department of Computer Science, Faculty of Computers and Informatics, Benha University

Benha Mansoura Road, next to Holding Company for Water Supply and Sanitation

Benha, Qalyubia Governorate, Egypt

Tel.: 013- 3188266

Fax: 013- 3188265

Email: [email protected]


An Automatic Arabic Sign Language Recognition System (ArSLRS)

Abstract

Sign language recognition systems (SLRS) are one of the application areas of human-computer interaction (HCI), in which the signs of hearing-impaired people are converted to text or to the voice of the oral language. This paper presents an automatic visual SLRS that translates isolated Arabic word signs into text. The proposed system has four main stages: hand segmentation, tracking, feature extraction and classification. A dynamic skin detector based on the face color tone is used for hand segmentation. Then, a proposed skin-blob tracking technique is used to identify and track the hands. A dataset of 30 isolated words used in the daily school life of hearing-impaired children was developed for evaluating the proposed system, taking into consideration that 83% of the words have different occlusion states. Experimental results indicate that the proposed system has a recognition rate of 97% in signer-independent mode. In addition, the proposed occlusion resolving technique outperforms other methods by accurately specifying the position of the hands and the head, with an improvement of 2.57% at τ=5, which aids in differentiating between similar gestures.

Keywords: Gesture Recognition, Arabic Sign Language Recognition, Isolated Word Recognition, Image-Based Recognition

1. Introduction

Hearing impairment is a broad term referring to partial or complete loss of hearing in one or both ears. The level of impairment varies between mild, moderate, severe and profound.

According to the World Health Organization (WHO) in 2017, over 5% of the world's population, 360 million people, has disabling hearing loss (328 million adults and 32 million children). Roughly one-third of people over 65 years of age are affected by disabling hearing loss. The majority of people with disabling hearing loss live in low- and middle-income countries [1].

SLRS is one of the application areas of HCI. The main goal of an SLRS is to recognize the signs of hearing-impaired people and convert them to text or voice of the oral language, and vice versa. These systems use either isolated or continuous signs. The performer of an isolated system signs only one letter or word at a time, while in continuous systems the performer signs one or more complete sentences. Further, an SLRS can be categorized as signer-dependent or signer-independent. Systems that rely on the same signers in both the training and testing phases are signer-dependent, which affects the recognition rate positively. On the other hand, in signer-independent systems the signers who performed the training stage are not admitted to the testing stage, which adds the challenge of adapting the system to accept any signer. The goal of an SLRS can be achieved by either a sensor-based or an image-based system.

The sensor-based system employs a variety of electromechanical devices incorporating many sensors to recognize signs, e.g., data gloves [2], power gloves [3], cyber gloves [4] and Dexterous master gloves [5]. Mina et al. [6] designed a smart glove using a few sensors, based on a statistical analysis of the anatomical shape of the hands when performing the 1300 words of the Arabic sign language (ArSL). The glove costs about $65, which according to the authors is 5% less than the cost of commercial smart gloves. The high cost and low naturalness of this method contributed to the appearance of the image-based approach, where one or more cameras are employed to capture the signs. Classification can be done by either marker-based or visual-based techniques.

In marker-based techniques, markers with predefined colors or colored gloves are placed on the fingertips and wrist. These predefined colors are then detected and segmented from an image captured by a 2D camera using image processing methods, but these techniques also lack naturalness [7, 8]. On the other hand, visual-based techniques use bare hands without any markers. These techniques have higher naturalness and higher mobility than any other type of SLRS. Visual-based SLRSs have low cost, as a single camera can be used, but they suffer from changes in illumination. Hand occlusion, with either the other hand or the face, is another drawback, as 2D images lack the depth information that aids in resolving occlusion. This paves the way for depth sensors based on RGB-D imaging, which give the depth of each pixel in the image and help in constructing a 3D model of the objects in the scene. Until now this is still an open field of research. In most of the literature, vision-based refers to a visual-based vision system. A further discussion and detailed overview of related work in the field of SLR is given in [9, 10, 11, 12]. This paper focuses on ArSL.

Recent isolated vision-based ArSLR systems are outlined below. Al-Rousan et al. [13] developed a system that automatically recognizes 30 isolated ArSL words using the discrete cosine transform (DCT) to extract features and a Hidden Markov Model (HMM) as the recognition method. The system obtained a word recognition rate of 94.2% in signer-independent off-line mode. Due to the nature of the DCT, the observation features produced by the DCT algorithm misclassified similar gestures. Also, the system did not address the occlusion problem.

To overcome the misclassification of similar gestures, Al-Rousan et al. [14] developed a system that used a two-level HMM classification scheme. The system handles the occlusion state by treating the occluded objects as one object or by taking the features of the objects preceding the occlusion, which does not reflect the real situation.

Another technique for resolving occlusion states was developed by El-Jaber et al. [15], where stereo vision is applied to estimate and segment the signer's body using its depth information to recognize 23 isolated gestures in signer-dependent mode. It suffers from its high cost, as more than one camera is needed to construct the stereo vision. Disparity maps are computationally expensive, and any change in the distance between the two cameras and the object affects the performance of solving the correspondence problem.

In [16], a 3D model of the hand posture is generated from two 2D images from two perspectives that are weighted and linearly combined to produce a single 3D feature, aiming to classify 50 isolated ArSL words using a hybrid pulse-coupled neural network (PCNN) as a feature generator followed by a non-deterministic finite automaton (NFA). A "best-match" algorithm is then used to find the most probable meaning of a gesture. The recognition accuracy reaches 96%. The misclassification comes from the fact that the NFA of some gestures may be wholly included in the NFA of another gesture.

Ahmed et al. [17] use a combination of local binary patterns (LBP) and principal component analysis (PCA) to extract features that are fed into an HMM to recognize a lexicon of 23 isolated ArSL words. Occlusion is not resolved, as any occlusion state is handled as one object and recognition proceeds. The system achieves a recognition rate of 99.97% in signer-dependent mode, but LBP may not work properly on areas of constant gray level because of the thresholding scheme of the operator [17].

Obviously, most vision systems suffer from two main problems: confusing similar gestures in motion and resolving the occlusion problem. The aim of the research documented in this paper is to decrease the misclassification rate for similar gestures and to resolve all occlusion states using only one camera, without any complicated environment to compute a disparity map.

This paper presents an automatic visual SLRS that translates isolated Arabic word signs into text. The proposed system has four primary stages: hand segmentation, tracking, feature extraction and classification. Hand segmentation is performed using a dynamic skin detector based on the color of the face [18]. Then, the segmented skin blobs are used to identify and track the hands with the help of the head. Geometric features of the hands are employed to formulate the feature vector. Finally, a Euclidean distance classifier is applied in the classification stage. A dataset of 30 isolated words used in the daily school life of hearing-impaired children was developed. The experimental results indicate that the proposed system has a recognition rate of 97%, taking into consideration that 83% of the words cover essentially all the occlusion states, which proves the robustness of the system.

The upcoming sections are arranged as follows: the dataset description is given in Sect. 2. The proposed approach, including a novel identification and tracking method, is described in Sect. 3. Results and evaluation are outlined in Sect. 4. Finally, a conclusion is given in Sect. 5.

2. ArSLRS dataset

A unified Arabic sign language dictionary was published in two editions in 2008. Despite that, there are no common databases available to researchers in the area of Arabic sign language recognition. Thus, each researcher has to establish his own database of reasonable size. The dataset used here is an ArSL video database collected at Benha University. The database consists of 450 colored ArSL videos captured at a rate of 30 fps. These videos represent 30 Arabic words, selected as commonly used daily words in school. 300 videos are used for training while 150 are used for testing. The signers performing the testing clips are different from those who performed the training clips, to guarantee the signer-independence of the designed system. The videos are gathered under different illumination, backgrounds and clothing. The signer is asked to face the camera with no orientation, then starts signing from a silence state, where both hands are placed beside the body, and ends again with a silence state.

The database was designed so that it contains words signed with one hand or both hands, with occlusion between the hands or with the face, to test the validity of the system in solving different occlusion states. The list of the words used and their description is given in Table 1. The Occlusion column indicates that the performed sign has an occlusion state between one or both hands and the face. The RH and LH columns show whether the sign is executed with the right hand or the left hand, respectively. The R-L H column indicates that the sign is performed with both hands. The last row gives the estimated percentage of occlusion states in the built database.

3. The proposed ArSLRS

As shown in Fig. 1, the vision-based SLRS has two modes.

Table 1: List of dataset words and their description.

Words Occlusion R-L H LH RH

Peace be upon you√ √

Thank you√ √

Telephone√

I√

Eat√ √

Sleep√ √

Drink√ √

Prayer√ √

To go√ √

Bathroom√ √

Ablution√ √

Tomorrow√

Today√ √

Food√ √

Water√ √

To love√ √

To hate√ √

Money√ √

Where’re you going?√ √

Where√ √

Why√ √

How much√

Yes√ √

No√ √

Want√

School√ √

Teacher√ √

Help√ √

Sick√ √ √

Friend√ √

Percentage 83% 33% 27% 40%


Figure 1: A diagram of vision-based SLRS.

The first mode is from hearing-impaired people to vocal people, where a video of the sign language (SL) is translated into the oral language, either in the form of text or voice. This mode is called vision-based SLRS. On the other hand, the second mode is from vocal people to hearing-impaired people, where the oral-language voice record is converted into an SL video. The vision-based SLRS mode is the focus of this paper. Each stage is illustrated in detail in the upcoming subsections.

3.1. Hand segmentation

This stage refers to the extraction of the hands from the frames throughout the entire video sequence. The video sequence may contain just the hands or the whole body of the signer. In the first case, either a background removal technique or a skin detection technique is employed to segment the hands. In the second case, a background removal technique followed by skin detection may be used, or a skin detection technique is applied directly to the frames. An accumulated difference image (AD) can be applied to extract the hands if they are the only moving objects in the video [19].
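As a rough illustration of the motion-based option, the sketch below accumulates thresholded frame differences with OpenCV. The function name, the threshold value and the "moved at least once" rule are illustrative assumptions, not the implementation used in [19].

```python
# Sketch: accumulated difference (AD) image over a short clip.
# Assumes a static camera and that the hands are the only moving objects;
# the threshold value and the foreground rule are illustrative.
import cv2
import numpy as np

def accumulated_difference(frames, thresh=25):
    """frames: list of grayscale uint8 frames. Returns a binary motion mask."""
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = cv2.absdiff(curr, prev)                    # per-pixel frame difference
        _, mask = cv2.threshold(diff, thresh, 1, cv2.THRESH_BINARY)
        acc += mask.astype(np.float32)                    # count how often a pixel moved
    return np.where(acc > 0, 255, 0).astype(np.uint8)     # pixels that moved at least once
```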

Many skin detectors that rely on the face to detect skin regions have appeared. In [20], a face region that includes the eyes and the mouth, which are non-skin, non-smooth regions, is used to calculate the probability of a pixel being a skin or non-skin pixel. This affects the results of the approach, since the mouth, the eyes and the brows are detected as skin regions, which is not true. In [21], a 10×10 window around the center pixel of the face, which in most cases is the nose tip, is used to characterize the skin tone pixels. However, this region suffers from the effect of illumination and may give wrong indications.

In this paper, a dynamic skin detector based on the face skin tone color is used to segment the hands [18]. The YCbCr color space is used after discarding the luminance channel. A face detector is applied to the first frame. The probability distribution function (PDF) histogram bins are calculated and trimmed at 0.005. To avoid the eye and mouth regions being recognized as skin, a threshold is applied to the PDF values remaining after trimming. The pixels along the major and minor axes of the bounding rectangle of the detected face are used to calculate a dynamic threshold. This threshold is applied to the face image to identify skin pixels. Then, the threshold is updated by including more pixels around the axes until 95% of the face pixels are recognized as skin. Finally, the threshold is applied to the entire image. This method is employed due to its adaptive nature, which makes it applicable to different races. In addition, using the YCbCr color space dramatically reduces the effect of illumination on the segmentation. The outcome of this phase is a binary image that holds the hands and the face as white pixels and all other objects as dark pixels.
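The following sketch outlines how such a face-seeded skin detector could be written with OpenCV, assuming a Haar cascade for the face and a trimmed Cr/Cb histogram as the skin model. It follows the trimming value quoted above but omits the axis-based threshold growth to 95%, so it is a simplified reading of [18] rather than the authors' code.

```python
# Sketch: face-seeded skin segmentation in YCbCr (simplified reading of [18]).
# The Haar cascade, the bin count and the hard use of the trimmed histogram
# as the final test are assumptions made for illustration.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def skin_mask(frame_bgr, trim=0.005, bins=32):
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]                 # luminance channel is discarded

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    if len(faces) == 0:
        return np.zeros(frame_bgr.shape[:2], np.uint8)
    x, y, w, h = faces[0]

    # Normalized Cr/Cb histogram (PDF) of the face region, trimmed at 0.005
    # so that rare bins such as eyes, mouth and brows are dropped.
    hist, _, _ = np.histogram2d(cr[y:y + h, x:x + w].ravel(),
                                cb[y:y + h, x:x + w].ravel(),
                                bins=bins, range=[[0, 256], [0, 256]])
    pdf = hist / hist.sum()
    pdf[pdf < trim] = 0

    # A pixel anywhere in the frame is skin if its Cr/Cb bin survived trimming.
    cr_idx = np.clip(cr.astype(int) * bins // 256, 0, bins - 1)
    cb_idx = np.clip(cb.astype(int) * bins // 256, 0, bins - 1)
    return np.where(pdf[cr_idx, cb_idx] > 0, 255, 0).astype(np.uint8)
```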

3.2. Tracking

Tracking is defined as the problem of estimating the trajectory of an object in the image plane as it moves around a scene [22]. Numerous approaches to tracking have been proposed, including detecting motion with an active camera [23], skin-blob tracking [24], active contours [25], camshift [26], particle filters [27] and Kalman filters [28]. A review of recent advances and trends in tracking is given in [29, 30].

Dreuw et al. [31] developed a dynamic programming tracking (DPT) technique that relies on two passes to decide the correct tracking path for the hands. A forward pass is used to calculate an overall score function for all the frames of a sequence. A backward pass, starting from the final frame, is applied to compute the best route for the tracked hand. The overall scoring function is used to calculate the best path with respect to a specific score function. This technique is model-dependent and signer-independent. Taking the tracking decision at the end of the sequence improves the ability of the DPT algorithm to prevent wrong local decisions. By combining this approach with the Viola and Jones method for tracking [32], the results were improved to reach a 0% tracking error rate (TER) at tolerance τ = 20 [33]. However, the two-pass method requires a substantial amount of computation, and the score functions need some modification to achieve the required results.

To avoid the two passes and the scoring functions, a technique is proposed here that tracks skin blobs using the Euclidean distance between the skin blobs in two consecutive frames. The Viola and Jones method first identifies the head. Then, the hands are identified by the distance between their centers and the centroid of the head. The Euclidean distance is used to keep track of the head and the hands. An occlusion is detected when two or more tracked objects point to the same skin blob. Resolving occlusion states depends on computing the deviation in elevation between the former and the current positions of the head and the hands. In this technique, it is assumed that the hand shape changes are small. A translation of the former positions of the head and the hands is performed to occupy the bounding rectangle of the occluded objects. The TER at τ = 20 is 0.08%. This technique is signer-independent and model-free. It uses a forward tracking path together with the preceding information about the tracked object to decide its next location.

3.2.1. Head tracking

The head can be easily localized using a cascade boosting algorithm [32], but it is computationally very expensive to apply this algorithm to all frames to detect the head, especially in a real-time application. Consequently, the cascade boosting algorithm is applied to the first frame only to obtain the bounding rectangle of the head. Since the position of the head during the signing process mainly does not change and stays in nearly the same location, the Euclidean distance to the head position in the preceding frame is used to recognize the head skin blob. When more than one skin blob appears, the head is distinguished as the skin blob with the smallest Euclidean distance from the former position of the head. Let H_p = (x_p, y_p) be the center of the previous head bounding rectangle and B_i = (x_i, y_i) be the centers of the current skin blobs, where i ∈ {1, 2, 3}. The Euclidean distance ξ_HBi between H_p and B_i is given by:

ξ_HBi = √((x_p − x_i)² + (y_p − y_i)²)    (1)

Figure 2: Head tracking

The skin blob with the minimum ξ_HBi is the current head (H_c). As shown in Fig. 2a, the head is marked with a solid rectangle. In Fig. 2b, the previous head bounding rectangle is marked with a dotted rectangle, while the new skin blobs are marked with solid rectangles. The Euclidean distances between the center of the previous head location and the centers of the current skin blobs are calculated. The blob with the minimum Euclidean distance is recognized and marked as the new head with a solid rectangle, as shown in Fig. 2c.
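A minimal sketch of this rule follows, assuming the skin blobs are extracted with connected components; the minimum blob area is an illustrative choice.

```python
# Sketch: head tracking by Eq. 1 - the current head is the skin blob whose
# centroid is closest to the previous head centre. Connected components and
# the minimum blob area are assumed implementation details.
import cv2
import numpy as np

def track_head(skin_mask, prev_head_center, min_area=200):
    prev = np.asarray(prev_head_center, dtype=float)
    n, _, stats, centroids = cv2.connectedComponentsWithStats(skin_mask)
    best_label, best_dist = None, np.inf
    for i in range(1, n):                                  # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] < min_area:          # skip small noise blobs
            continue
        d = np.linalg.norm(centroids[i] - prev)            # Euclidean distance (Eq. 1)
        if d < best_dist:
            best_label, best_dist = i, d
    new_center = centroids[best_label] if best_label is not None else prev
    return best_label, new_center
```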

3.2.2. Hands tracking

The center of the bounding rectangle of the head can be used as a reference point to identify the hands. Let B be a skin blob. To identify the skin blob as the right hand (RH) or the left hand (LH), the difference Δx between the x-coordinates of the centers of the current head (H_c) and the skin blob (B) is calculated as follows:

Δx = x_Hc − x_B    (2)


Figure 3: Identifying the first appearance of the right and the left hands.

Then, the skin blob is identified according to the following condition:

B = RH if Δx > 0, and B = LH otherwise.    (3)

The identification of the right-hand and left-hand skin blobs is shown in Fig. 3. After localizing the first appearance of the head and the hands, the Euclidean distance is used to keep track of them, as shown in Fig. 4. This works well until an occlusion takes place.
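The labeling rule of Eqs. 2-3 and the nearest-blob association can be sketched as follows; the function names are illustrative.

```python
# Sketch: first-appearance labeling of a skin blob (Eqs. 2-3) and nearest-blob
# re-association for the following frames.
import numpy as np

def label_hand(head_center, blob_center):
    dx = head_center[0] - blob_center[0]                   # Eq. 2: Δx = x_Hc - x_B
    return "RH" if dx > 0 else "LH"                        # Eq. 3 (signer faces the camera)

def nearest_blob(prev_center, blob_centers):
    prev = np.asarray(prev_center, dtype=float)
    dists = [np.linalg.norm(np.asarray(c, dtype=float) - prev) for c in blob_centers]
    return int(np.argmin(dists))                           # index of the closest skin blob
```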

Occlusion resolving. Occlusion is the overlapping of one or more of the tracked objects, where one object may cover part or all of another object. Hand segmentation in the occlusion situation is a challenging task. The hand shape change is assumed to be small, since the signs were captured at a high frame rate. In this paper, an occlusion resolving technique is proposed. This technique splits the problem into two sub-problems. The first is the occlusion of the head with one or both hands, which is the general case, while the second is the occlusion between the two hands only.

Occlusion with head. The head plays an important role: it serves as a reference point and also as an indicator of occlusion. If the area of the head increases by nearly one third, this is an occlusion state. Similarly, if the Euclidean distances calculated using Eq. 1 label the same skin blob as the head and one or both of the hands, then this is an occlusion situation.

Figure 4: Hands tracking.

Figure 5: Corners of the bounding rectangle of an object.

Any bounding rectangle of an object has four corners: right upper (RU), right lower (RL), left upper (LU) and left lower (LL), as illustrated in Fig. 5. The proposed algorithm uses these corners to identify the locations of the occluded objects. Occlusion between the head and the right hand can be resolved

by calculating the difference Δy between the y-coordinates of the RU corners of the bounding rectangle of the current skin blob (B) and of the previous right hand (PRH) bounding rectangle. Δy is calculated as follows:

Δy = y(RU_B) − y(RU_PRH)    (4)

Then,

if Δy ≥ 0: move RU_PRH to RU_B and move LL_PRH to LL_B;
if Δy < 0: move RL_PRH to RL_B and move LU_PRH to LU_B.    (5)


Figure 6: Hands tracking.

On the other hand, for the occlusion of the head and the left hand, the difference Δy between the y-coordinates of the RU corners of the bounding rectangle of the current skin blob (B) and of the previous left hand (PLH) bounding rectangle is calculated as follows:

Δy = y(RU_B) − y(RU_PLH)    (6)

Then,

if Δy ≥ 0: move LU_PLH to LU_B and move RL_PLH to RL_B;
if Δy < 0: move LL_PLH to LL_B and move RU_PLH to RU_B.    (7)

The case of occlusion between the head and the left hand, and how it is resolved, is shown in detail in Fig. 6.


Figure 7: Resolving partial occlusion between two hands.

In Fig. 6a, the head and the left hand are marked with solid rectangles. Then, the Euclidean distances between the current skin blobs and the previous head and left hand are calculated, as shown in Fig. 6b. These calculations indicate that the head and the left hand share the same skin blob, as illustrated in Fig. 6c. Δy is calculated, and the arrow shown in Fig. 6d indicates the translation of the head and the left hand from their previous locations to the new locations. Finally, the occlusion is resolved and the locations of the new head and hand are shown in Fig. 6e.

Occlusion between the head and both hands is solved by calculating the Euclidean distance between both hands and the head. If this distance is less than a predefined threshold, the occlusion is resolved as described above for both hands, while keeping the head position at its previous location. On the other hand, if the distance is greater than the predefined threshold for one hand, that hand is excluded and the occlusion is resolved with the previous method for the remaining hand and the head.
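The corner rule of Eqs. 4-5 can be sketched as below for the right-hand case, assuming bounding boxes are stored as (x, y, w, h); the left-hand case (Eqs. 6-7) is analogous with the LU/LL corners. This is one plausible reading of the translation step, not the authors' code.

```python
# Sketch: corner rule of Eqs. 4-5 for a head/right-hand occlusion. Boxes are
# (x, y, w, h); the previous right-hand box is translated so that the corner
# chosen by the sign of Δy coincides with the same corner of the merged blob.
def right_upper(box):
    x, y, w, h = box
    return (x + w, y)

def right_lower(box):
    x, y, w, h = box
    return (x + w, y + h)

def resolve_right_hand(blob_box, prev_rh_box):
    dy = right_upper(blob_box)[1] - right_upper(prev_rh_box)[1]   # Eq. 4
    _, _, w, h = prev_rh_box                                      # hand shape assumed unchanged
    if dy >= 0:                                # hand lies at the top of the merged blob
        bx, by = right_upper(blob_box)
        return (bx - w, by, w, h)              # snap the hand's RU corner to RU of the blob
    bx, by = right_lower(blob_box)
    return (bx - w, by - h, w, h)              # otherwise snap its RL corner to RL of the blob
```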

Occlusion between hands. Partial occlusion between the hands is indicated when the area of a hand blob increases by one half. If such an occlusion takes place, it is solved as if it were an occlusion between the hands and the head. Fig. 7a shows the frame that contains the head and both hands before occlusion. In Fig. 7b the hands have moved and their region has increased by more than one half, which indicates an occlusion situation. The head area does not increase by more than one third; therefore, the occlusion happened between the hands only. As shown in Fig. 7c, the head is identified and the other skin blob is recognized as the hands. Δy is calculated for the hands as shown in Fig. 7d using Eq. 4 and Eq. 6. The arrow in Fig. 7d shows the movement of the previous bounding rectangles of both hands: the RU corner of the right hand is moved to the RU corner of the current skin blob, while the LL corner of the left hand is moved to the LL corner of the current skin blob. The result of the tracking is shown in Fig. 7e.

tance between each of the previous hands locations andthe new skin blob is greater than a threshold value, thenthere is a full occlusion between the two hands. This oc-clusion is solved by labeling the skin blob as the new po-sition of both hands. If one of the hands has its Euclidean350

distance less than the threshold, then this hand is out ofthe scene and the skin blob is labeled as the other handonly.

3.3. Hand feature extraction

The next step is extracting the hand features. Extracting good features leads to a significant increase in the performance of the SLRS. The features fall into two domains: the temporal domain and the spatial domain [13]. The temporal domain is sometimes referred to as the frequency domain, where the frame is converted to one of its frequency-domain transforms. The spatial domain is based on direct manipulation of the pixels in the image and is divided into two categories: geometric features and statistical features [15]. Geometric features describe the two-dimensional projection of the hand in the image, while statistical features describe the statistical properties of the hand shape. In this paper, geometric features of the spatial domain are used. The chosen features are: the coordinates of the center of gravity of the hands, the velocity of the hand movement, and the orientation of the main axis of the hand. The feature vector of any sign is represented as follows:

Feature vector = {x_RH, y_RH, v_RH, φ_RH, x_LH, y_LH, v_LH, φ_LH}    (8)

where x, y, v and φ are the coordinates of the center of gravity of the hands, the velocity of the hand movement and the orientation of the main axis of the hands, respectively.
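One way to obtain these features from a binary hand blob is via image moments, as sketched below; the frame interval dt and the use of the centroid speed as the velocity feature are assumptions made for illustration.

```python
# Sketch: geometric features of one hand (Eq. 8) from image moments of the
# binary hand blob: centre of gravity (x, y), centroid speed v between two
# consecutive frames, and orientation φ of the main axis.
import cv2
import numpy as np

def hand_features(hand_mask, prev_center, dt=1.0 / 30.0):
    m = cv2.moments(hand_mask, binaryImage=True)           # assumes a non-empty blob
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]      # centre of gravity
    v = np.hypot(cx - prev_center[0], cy - prev_center[1]) / dt   # centroid speed
    phi = 0.5 * np.arctan2(2 * m["mu11"], m["mu20"] - m["mu02"])  # main-axis orientation
    return cx, cy, v, phi
```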

3.4. Recognition

The most commonly used recognition techniques are: Hidden Markov Models (HMM), support vector machines (SVM), artificial neural networks (ANN), adaptive neuro-fuzzy inference systems (ANFIS) and the Euclidean distance. In this study, the dataset is not very large, so HMM, ANN, SVM and ANFIS are not used, as there are not enough data for training. The Euclidean distance is used for classification, as it compares the feature vectors directly. Let the feature vector of the reference sign be v_o = {x_1, x_2, x_3, ...} and the feature vector of the test sign be v_t = {y_1, y_2, y_3, ...}; then the Euclidean distance (ED) between the feature vectors is computed using the following equation:

ED = √((x_1 − y_1)² + (x_2 − y_2)² + (x_3 − y_3)² + ...)    (9)
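A minimal nearest-template classifier built on Eq. 9 could look as follows; it assumes each sign is summarized by a single fixed-length feature vector, which is a simplification of the per-frame features described above.

```python
# Sketch: nearest-template classification with the Euclidean distance (Eq. 9).
# Each reference sign is assumed to be stored as one fixed-length vector.
import numpy as np

def classify(test_vec, reference_vecs, labels):
    test = np.asarray(test_vec, dtype=float)
    dists = [np.linalg.norm(test - np.asarray(ref, dtype=float))
             for ref in reference_vecs]                    # ED to every stored sign
    return labels[int(np.argmin(dists))]                   # word of the closest reference
```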

4. Results and Evaluation

Three scenarios have been followed to evaluate the proposed system. The first is to understand the effect of changing skin color and illumination on correctly segmenting the hands. The proposed dynamic mid-way threshold histogram skin detector was evaluated in our previous work [18]. This evaluation showed that the detector reduces the false positive rate (FPR) by nearly half while keeping the false negative rate (FNR) approximately the same. It also reduces the number of pixels to be processed to about 52% of the entire face, which dramatically decreases the detection time. Hence, it is recommended for real-time applications and is applicable to different races due to its adaptive dynamic nature. On the other hand, using the YCbCr color space decreases the effect of illumination. Nevertheless, the light must be uniform and the skin tone of the face and the hands must be the same.

The second scenario is to investigate the performance of the head and hand tracking algorithms. To achieve this, data with ground-truth annotations are required, as well as an evaluation measure.

For an image sequence X_1^T = X_1, ..., X_T and corresponding annotated object positions u_1^T = u_1, ..., u_T, the tracking error rate (TER) of the tracked positions û_1^T is defined as the relative number of frames in which the Euclidean distance between the tracked and the annotated position is larger than or equal to a tolerance τ [31]:

TER = (1/T) Σ_t δ_τ(u_t, û_t),   with   δ_τ(u_t, v_t) = 0 if ‖u_t − v_t‖ < τ, and 1 otherwise.    (10)
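Eq. 10 translates directly into a few lines; in this sketch, tracked and annotated hold the per-frame (x, y) positions.

```python
# Sketch: tracking error rate of Eq. 10. tracked and annotated are (T, 2)
# arrays of per-frame (x, y) positions; tau is the tolerance in pixels.
import numpy as np

def tracking_error_rate(tracked, annotated, tau):
    err = np.linalg.norm(np.asarray(tracked, float) - np.asarray(annotated, float), axis=1)
    return float(np.mean(err >= tau))                      # fraction of frames with error >= τ
```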

The RWTH-BOSTON-104 tracking benchmark database [34] for video-based sign language recognition is used to assess the proposed tracking technique. The ground truth of the head and hand positions is annotated to evaluate the tracking technique. The sequences have a lot of dynamic variation in movement and include most of the occlusion cases to be tested.

Different methods have been applied to the RWTH-BOSTON-104 database [33], such as dynamic programming tracking (DPT), principal component analysis (PCA), Viola and Jones (VJ), the active appearance model (AAM), the Project-Out Inverse Compositional AAM (POICAAM) and the Fixed Jacobian active appearance model (FJAAM). Most of these methods are model-based, signer-dependent tracking approaches and have been evaluated on all 15732 ground-truth annotated frames of the RWTH-BOSTON-104 dataset. The results of these algorithms are compared to the results of the proposed technique for evaluation. The proposed technique has the lowest TER at τ=5, as shown in Table 2. As τ increases, the TER decreases, but the proposed technique shows only small changes, unlike the other algorithms. This demonstrates the robustness of the proposed occlusion resolving technique, as it can accurately specify the position of the hands and the head, with an improvement of 2.57% at τ=5 over the result from POICAAM.

Table 2: TER for the proposed tracking technique and other state-of-the-art tracking methods from [33].

Tracking method    TER% (τ=5)   TER% (τ=10)   TER% (τ=15)   TER% (τ=20)
DPT+PCA            26.77        17.32         12.7          10.86
DPT+VJ             10.06        0.4           0.02          0
VJD                9.75         1.23          1.09          1.07
VJT                10.04        0.81          0.73          0.68
FJAAM              10.17        6.85          6.82          6.81
FJAAM              10.92        7.92          7.88          7.76
POICAAM            3.54         0.12          0.08          0.08
Proposed           0.97         0.70          0.22          0.08

The proposed technique is not a model-based technique, in contrast to the other methods, which guarantees that less computation is needed and that the method is signer-independent. Integrating the proposed technique with other methods may improve the accuracy, especially for tolerances 10 ≤ τ ≤ 20.

The third scenario evaluates the whole system by considering the percentage of the total number of Arabic sign words that were correctly recognized. The Euclidean distance is used to classify the video of each gesture. The signer of the tested gesture is asked to face the camera without orientation and freely perform the sign (remembering that the signer must begin and end with a silence state). The environment is controlled, as one signer performs at a time with a stationary environment around him. The system attains a recognition rate of 97% in signer-independent mode. From the confusion matrix illustrated in Fig. 8, it can be seen that gestures (2, 8, 12, 19 and 26) are confused with gestures (24, 4, 28, 20 and 3), respectively. These gestures have great similarity in either partial or full hand movement. Despite that, only one gesture is recognized wrongly. This indicates the ability of the proposed ArSLRS to differentiate between gestures with high similarity.

Finally, it is clear that the proposed system requires neither more than one camera nor complicated calculations to adjust two cameras in order to achieve high recognition rates, as in [15]. The proposed system did not have to group similar gestures to increase its recognition rate as in [35], where gestures that have common parts dramatically decrease its rate.


Figure 8: Confusion matrix of the proposed system.

The developed system does not construct a 3D model of the hand posture, which would need two cameras and a sensitive calculation to weight the two views from both cameras. The system proves the robustness of its occlusion resolving technique, which outperforms the other methods.

5. Conclusions

This paper presents an automatic visual SLRS that translates isolated Arabic word signs into text. The proposed system is a signer-independent system that utilizes a single camera, and the signer does not wear any type of gloves or markers. The system has four primary stages: hand segmentation, hand tracking, hand feature extraction and classification. Hand segmentation is performed using a dynamic skin detector based on the color of the face. Then, the segmented skin blobs are used to identify and track the hands with the aid of the head. The system proved its robust performance against all states of occlusion, as 83% of the words in the dataset have different occlusion states. Geometric features are employed to construct the feature vector. Finally, a Euclidean distance classifier is applied in the classification stage. A dataset of 30 isolated words that are used in the daily school life of hearing-impaired children was developed. The experimental results indicate that the proposed system has a recognition rate of 97% with a low misclassification rate for similar gestures. In addition, the system proved its robustness against different cases of occlusion with a minimum TER of 0.08%.

References

[1] M. Center, Deafness and hearing loss, Tech. rep., World Health Organization (2017).

[2] A. Z. Shukor, M. F. Miskon, M. H. Jamaluddin, F. bin Ali, M. F. Asyraf, M. B. bin Bahar, et al., A new data glove approach for Malaysian sign language detection, Procedia Computer Science 76 (2015) 60-67.

[3] M. Mohandes, S. A-Buraiky, T. Halawani, S. Al-Baiyat, Automation of the Arabic sign language recognition, in: Information and Communication Technologies: From Theory to Applications, 2004. Proceedings. 2004 International Conference on, IEEE, 2004, pp. 479-480.

[4] M. A. Mohandes, Recognition of two-handed Arabic signs using the CyberGlove, Arabian Journal for Science and Engineering (2013) 1-9.

[5] K. Hoshino, Dexterous robot hand control with data glove by human imitation, IEICE Transactions on Information and Systems 89 (6) (2006) 1820-1825.

[6] M. I. Sadek, M. N. Mikhael, H. A. Mansour, A new approach for designing a smart glove for Arabic sign language recognition system based on the statistical analysis of the sign language, in: Radio Science Conference (NRSC), 2017 34th National, IEEE, 2017, pp. 380-388.

[7] R. Y. Wang, J. Popovic, Real-time hand-tracking with a color glove, in: ACM Transactions on Graphics (TOG), Vol. 28, ACM, 2009, p. 63.

[8] N. El-Bendary, H. M. Zawbaa, M. S. Daoud, A. E. Hassanien, K. Nakamatsu, ArSLAT: Arabic sign language alphabets translator, in: Computer Information Systems and Industrial Management Applications (CISIM), 2010 International Conference on, IEEE, 2010, pp. 590-595.

[9] H. Cooper, B. Holt, R. Bowden, Sign language recognition, in: Visual Analysis of Humans, Springer, 2011, pp. 539-562.

[10] M. Mohandes, M. Deriche, J. Liu, Image-based and sensor-based approaches to Arabic sign language recognition, IEEE Transactions on Human-Machine Systems 44 (4) (2014) 551-557.

[11] S. S. Rautaray, A. Agrawal, Vision based hand gesture recognition for human computer interaction: a survey, Artificial Intelligence Review 43 (1) (2015) 1-54.

[12] S. C. Agrawal, A. S. Jalal, R. K. Tripathi, A survey on manual and non-manual sign language recognition for isolated and continuous sign, International Journal of Applied Pattern Recognition 3 (2) (2016) 99-134.

[13] M. Al-Rousan, K. Assaleh, A. Talaa, Video-based signer-independent Arabic sign language recognition using hidden Markov models, Applied Soft Computing 9 (3) (2009) 990-999.

[14] M. Al-Rousan, O. Al-Jarrah, M. Al-Hammouri, Recognition of dynamic gestures in Arabic sign language using two stages hierarchical scheme, International Journal of Knowledge-Based and Intelligent Engineering Systems 14 (3) (2010) 139-152.

[15] M. El-Jaber, K. Assaleh, T. Shanableh, Enhanced user-dependent recognition of Arabic sign language via disparity images, in: Mechatronics and its Applications (ISMA), 2010 7th International Symposium on, IEEE, 2010, pp. 1-4.

[16] A. S. Elons, M. Abull-Ela, M. F. Tolba, A proposed PCNN features quality optimization technique for pose-invariant 3D Arabic sign language recognition, Applied Soft Computing 13 (4) (2013) 1646-1660.

[17] A. A. Ahmed, S. Aly, Appearance-based Arabic sign language recognition using hidden Markov models, in: Engineering and Technology (ICET), 2014 International Conference on, IEEE, 2014, pp. 1-6.

[18] N. B. Ibrahim, M. M. Selim, H. H. Zayed, A dynamic skin detector based on face skin tone color, in: Informatics and Systems (INFOS), 2012 8th International Conference on, IEEE, 2012, pp. MM-1.

[19] K. Assaleh, T. Shanableh, M. Fanaswala, F. Amin, H. Bajaj, et al., Continuous Arabic sign language recognition in user dependent mode, JILSA 2 (1) (2010) 19-27.

[20] M. Kawulok, Dynamic skin detection in color images for sign language recognition, Image and Signal Processing (2008) 112-119.

[21] S. Bilal, R. Akmeliawati, M. J. E. Salami, A. A. Shafie, Dynamic approach for real-time skin detection, Journal of Real-Time Image Processing 10 (2) (2015) 371-385.

[22] A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, ACM Computing Surveys (CSUR) 38 (4) (2006) 13.


[23] C.-Y. Lee, S.-J. Lin, C.-W. Lee, C.-S. Yang, An efficient continuous tracking system in real-time surveillance application, Journal of Network and Computer Applications 35 (3) (2012) 1067-1073.

[24] M. M. Zaki, S. I. Shaheen, Sign language recognition using a combination of new vision based features, Pattern Recognition Letters 32 (4) (2011) 572-577.

[25] E.-J. Holden, G. Lee, R. Owens, Australian sign language recognition, Machine Vision and Applications 16 (5) (2005) 312.

[26] Y.-b. Li, X.-l. Shen, S.-s. Bei, Real-time tracking method for moving target based on an improved camshift algorithm, in: Mechatronic Science, Electric Engineering and Computer (MEC), 2011 International Conference on, IEEE, 2011, pp. 978-981.

[27] F. Gianni, C. Collet, P. Dalle, Robust tracking for processing of videos of communications gestures, in: International Gesture Workshop, Springer, 2007, pp. 93-101.

[28] M. S. M. Asaari, S. A. Suandi, Hand gesture tracking system using adaptive Kalman filter, in: Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on, IEEE, 2010, pp. 166-171.

[29] H. Yang, L. Shao, F. Zheng, L. Wang, Z. Song, Recent advances and trends in visual tracking: A review, Neurocomputing 74 (18) (2011) 3823-3831.

[30] J. Baskaran, R. Subban, Compressive object tracking: a review and analysis, in: Computational Intelligence and Computing Research (ICCIC), 2014 IEEE International Conference on, IEEE, 2014, pp. 1-7.

[31] P. Dreuw, T. Deselaers, D. Rybach, D. Keysers, H. Ney, Tracking using dynamic programming for appearance-based sign language recognition, in: Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, IEEE, 2006, pp. 293-298.

[32] P. Viola, M. J. Jones, Robust real-time face detection, International Journal of Computer Vision 57 (2) (2004) 137-154.

[33] SIGNSPEAK, Scientific understanding and vision-based technological development for continuous sign language recognition and translation, Tech. rep., European Commission (2012).

[34] P. Dreuw, J. Forster, H. Ney, Tracking benchmark databases for video-based sign language recognition, in: ECCV Workshops (1), 2010, pp. 286-297.

[35] M. AL-Rousan, O. Al-Jarrah, N. Nayef, Neural networks based recognition system for isolated Arabic sign language, in: Information Technology, 2007. ICIT 2007. The 3rd International Conference on, AL-Zaytoonah University, 2007.
