


Scene description for visually impaired in outdoor environment

Quoc-Hung NGUYEN
International Research Institute MICA, HUST CNRS/UMI 2954 - Grenoble INP
Hanoi University of Science and Technology
1 Dai Co Viet Road, Hanoi, Viet Nam
[email protected]

Thanh-Hai TRAN
International Research Institute MICA, HUST CNRS/UMI 2954 - Grenoble INP
Hanoi University of Science and Technology
1 Dai Co Viet Road, Hanoi, Viet Nam
[email protected]

Abstract — Helping visually impaired people navigate and perceive the surrounding environment is an important and useful task. In this paper, we propose a visual analysis of the environment surrounding a blind person. The analysis is based on GIST feature extraction and k-NN classification. The analysis results help the blind person understand where he or she is and what is happening. We have tested the method in the field with a pseudo-blind person walking on the campus of Gent University through 6 changing scenes. The scene recognition rate is about 85%, which gives us confidence that description systems for blind people can be built in practice.

Keywords — Scene; navigational aid; visually impaired; GIST; k-NN; Outdoor environment

I. INTRODUCTION

Due to the large number of blind people in the world who need assistance in their daily life, aid for blind people has become an active research topic, with the participation of both research institutes [1], [2], [3], [4], [5], [6] and industrial companies [7].

Aids for blind people range from navigation, through finding and grasping objects of interest, to perceiving the communication context. Much research has addressed navigational aid (low-level aid) using RFID [6, 8] and GPS technologies [6], because this is the most important aid for blind people. Some high-level aids, such as help with shopping in a supermarket (using a barcode reader to choose goods), buying train tickets at a station, or sensing the emotion of a conversation partner, remain at the idea stage and are not yet applied in practice (http://www.blinput.com, http://www.erikhals.com/design.html). In all cases, blind people want to know where they are and what is around them.

In this paper, we propose a visual analysis of the environment surrounding the blind person. A portable camera replaces the human eye to observe the scene; a computer analyzes and describes the scene for the blind person. In this way, the blind person is informed about the environment, as if a friend were walking alongside and talking about it. To the best of our knowledge, most state-of-the-art works on blind assistance aim at navigational assistance systems, but no system informs the visually impaired about the environment.

The main contributions of this paper are:

• A method for scene analysis in the context of navigational aids for blind people. This method is based on GIST features and k-NN (k-Nearest Neighbour).

• A dynamic library for environment description.

• A real database for testing the proposed method. This database has been built in an outdoor environment.

The paper is organized as follows. In section 2, we summarize some related works on scene recognition. In section 3, we present our framework for scene analysis and recognition. We discuss the experimental results on real data in section 4. Finally, we conclude and give some ideas for future works.

II. RELATED WORKS ON SCENE RECOGNITION

A general framework of scene recognition consists of two main phases: learning and recognition. In each phase, features representing a scene need to be extracted, either to learn a scene model (learning phase) or to categorize a scene image (recognition phase). In this paper, we do not attempt to survey all works on scene recognition; instead, we focus on the methods most related to our proposed method.

In [9], the authors proposed to represent a scene image by its "GIST". The idea is that the spectrum of an image gives the distribution of the signal's energy among the different spatial frequencies; the energy spectrum (obtained from the Fourier transform) therefore provides a scene representation that is invariant to object arrangements and object identities, encoding only the dominant structural patterns present in the image. This approach was evaluated on a large database and obtained a classification rate ranging from 75% to 82%.
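As a rough illustration of this spectral idea (not the exact formulation of [9]), the energy spectrum of a grayscale image can be computed from its 2-D Fourier transform; the helper name below is hypothetical:

```python
# Illustrative sketch: energy (power) spectrum of a grayscale image via the 2-D FFT.
# Discarding the phase makes the representation insensitive to where objects sit
# in the frame, keeping only the dominant structural pattern.
import numpy as np

def energy_spectrum(gray):
    """gray: 2-D float array. Returns the centered energy spectrum."""
    f = np.fft.fftshift(np.fft.fft2(gray))
    return np.abs(f) ** 2
```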

Continuing this line of research, the authors of [9] studied the spectral statistics of scenes in [10], and in 2010 a very large database was constructed (899 scene categories) as a benchmark for the research community [11].

Recently, in [12], the authors indicated that the GIST descriptor works well for natural (outdoor) scenes, but because it ignores object details in the scene, it cannot handle indoor scenes. These authors therefore proposed the CENTRIST descriptor (Census Transform Histogram), which combines both local and global information in the image. The k-NN (k-Nearest Neighbour) technique was then used to classify scene images. The method obtained a classification rate of 84.96% for 15 scene classes. However, the CENTRIST descriptor is not invariant to rotation, and it ignores the colour information of the scene.

GIST and CENTRIST are global features. In [13], the authors proposed to use dense SIFT (Scale Invariant Feature Transform) to represent scene images and obtained quite good results on scene classification; however, SIFT is computationally expensive. HOG (Histogram of Oriented Gradients) and SSIM (Self-Similarity Descriptor), which capture the spatial arrangement of the scene, as well as the grayscale histogram, have been studied in [11]. For all types of descriptor, an SVM (Support Vector Machine) is used for classification. On the 15-class scene database, using the SIFT descriptor alone gives a classification rate of 81.4%, while combining features improves it to 88%.

III. PROPOSED APPROACH

A. General description

In this paper, we propose a system that informs the blind person using the results of scene analysis. The proposed system, designed for the visually impaired in an outdoor environment, is presented in Figure 1.

Figure 1. The overall aid system for visually impaired

The system consists of the following components:

• A visually impaired person moves along the road. He carries an optical system that observes the surrounding environment. The person receives environment information from the environment characterization module.

• A scene analysis module receives images from the camera, then analyzes and classifies each image into one of several predefined types of scenes.

• An environment characterization module receives the output of the scene analysis module, builds a description of the environment, and updates and informs the person.

In this paper, we propose to deal with the most important problem, which is scene recognition. Based on our recent work presented in [14], where the combination of GIST and k-NN performed best for scene classification, especially in outdoor environments, we propose to use GIST features with k-NN classification. The scene classification module is presented in Figure 2.

Figure 2. Different steps of the proposed method for scene recognition

B. Feature extraction: GIST of a scene

Results of a variety of state-of-the-art scene recognition algorithms [15] show that GIST features1 [16] obtain an acceptable result for outdoor scene classification (approximately 73–80%). Therefore, in this study, we propose to use GIST features to characterize the outdoor scene in our context. In this section, we briefly describe the GIST feature extraction procedure proposed in [16].

Figure 3. GIST feature extraction from input image

To capture the salient characteristics of a scene, Oliva et al. [16] evaluated seven characteristics of an outdoor scene, such as naturalness, openness, roughness, expansion, ruggedness, and so on. They suggested that these characteristics may be reliably estimated using spectral and coarsely localized information. The steps to extract GIST features are illustrated in Figure 3.

First, the original image is converted and normalized to a grayscale image I(x,y) (Figure 3 (a)–(b)). We then apply a pre-filtering step to reduce illumination effects and to prevent local image regions from dominating the energy spectrum. The image I(x,y) is decomposed by a set of Gabor filters. The 2-D Gabor filter is defined as follows:

h(x, y) = \frac{1}{2\pi\delta_x\delta_y}\, e^{-\left(\frac{x^2}{2\delta_x^2} + \frac{y^2}{2\delta_y^2}\right)}\, e^{\,j(u_0 x + v_0 y)} \qquad (1)

1 The GIST feature presents a brief observation, or a report at first glance, of an outdoor scene that summarizes the quintessential characteristics of an image.


The parameters (δ_x, δ_y) are the standard deviations of the Gaussian envelope along the vertical and horizontal directions; (u_0, v_0) is the spatial central frequency of the Gabor filter. As shown in Figure 3 (c), the Gabor filter bank contains 4 spatial scales and 8 orientations. At each scale (δ_x, δ_y), by passing the image I(x,y) through a Gabor filter h(x,y), we obtain the components of the image whose energy is concentrated near the spatial frequency point (u_0, v_0). The GIST vector is then calculated from the energy spectra of the 32 filter responses: we average each response over grid cells of 16 x 16 pixels, as shown in Figure 3 (d). In total, the GIST feature vector is reduced to 512 dimensions.
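To make the extraction procedure concrete, the sketch below computes a GIST-like vector from a 4-scale x 8-orientation Gabor bank with a 4 x 4 averaging grid (32 x 16 = 512 dimensions). The frequency schedule, grid size, and the name gist_descriptor are illustrative assumptions, not the exact implementation of [16]:

```python
# A minimal sketch of a GIST-like descriptor (assumed parameters, not the authors' exact values).
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gist_descriptor(gray, n_scales=4, n_orientations=8, grid=4):
    """Compute a 512-D GIST-like vector from a grayscale image in [0, 1]."""
    feats = []
    for s in range(n_scales):
        frequency = 0.25 / (2 ** s)          # one spatial frequency per scale (assumed schedule)
        for o in range(n_orientations):
            theta = o * np.pi / n_orientations
            kernel = gabor_kernel(frequency, theta=theta)
            # Magnitude of the complex Gabor response = local energy
            resp = np.abs(fftconvolve(gray, kernel, mode='same'))
            # Average the energy over a grid x grid block partition of the response
            h, w = resp.shape
            for i in range(grid):
                for j in range(grid):
                    block = resp[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                    feats.append(block.mean())
    return np.asarray(feats)                  # 4 * 8 * 16 = 512 values
```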

C. Classification using the k-Nearest Neighbour (k-NN) method

The k-Nearest Neighbour (k-NN) classifier is selected for classifying GIST features because they are high-dimensional descriptors. We could use any classification method to learn from GIST features, but k-NN is simple while keeping high performance, and many works in the literature classify GIST features with k-NN.

Given a test image, we find the K training samples whose GIST feature vectors are closest to that of the input image. The label of the test image is then decided by a majority vote over the K labels found, as illustrated in Figure 4.

Figure 4. k-NN classification

Since there is no general rule for selecting an appropriate dissimilarity measure (Minkowski, Kullback-Leibler, intersection, ...), in this work we select the Euclidean distance, which is commonly used in the context of image retrieval.
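A minimal sketch of this classification step, assuming GIST vectors have already been extracted (e.g. with the gist_descriptor sketch above); the value k = 5 is an illustrative choice, not a parameter reported in the paper:

```python
# k-NN classification of GIST vectors with the Euclidean distance (illustrative sketch).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_knn(train_gists, train_labels, k=5):
    """Fit a k-NN classifier (majority vote over the k nearest training GIST vectors)."""
    clf = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    clf.fit(np.asarray(train_gists), np.asarray(train_labels))
    return clf

# Usage: predicted scene label for one test image
# label = train_knn(train_gists, train_labels).predict([gist_descriptor(test_img)])[0]
```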

IV. EVALUATION

A. Scenario definition

To test our proposed aid system for the visually impaired, we define the following scenario. A pseudo-blind person moves around the campus of Hogeschool Gent (Belgium). The total length of the route is about 200 m.

Figure 5. Defined location map: arrows show the moving direction of the blind person. The route starts at point A and comes back to point A.

We define 6 critical points at which the visually impaired person wants to receive a description of the environment:

• Point A - DoorInfrontofVIS

• Point B - LobbyofVISBuilding

• Point C - HallwayBicycleParking

• Point D - DoorInfrontofBuildingP

• Point E - LobbyofBuildingP

• Point G - HallwayCarParking

The visually impaired person moves following the arrows in Figure 5, starting from point A, passing through C, E, D, E, G, E, C, and B, and finishing at point A.

B. Database collection

To collect the database for training and testing our system, we equip the pseudo-blind person with a Full HD camera (1080p at 30 fps) that provides video files (Scene.avi) for further processing.

Figure 6. Camera settings for database collection


As illustrated in Figure 6, the user holds the camera vertically at a distance of 30 cm from the body and at a height of 130 cm above the ground. The person moves at a speed of 1.25 feet/second. The handheld camera captures images of the scene in front of the person.

We recorded two rounds (trials) of data, as shown in Table 1.

Table 1. The two recorded rounds of Scene.avi

Round/trial   Starting frame   Stopping frame   Duration (mm:ss)
Round 1       1                6296             03:32
Round 2       6297             11924            03:09

We then build a dataset of 6 scenes. The total number of images is 11924, exported to PNG format. Table 2 shows some example images of these scenes.

Table 2. Examples of scene images

A. DoorInfrontofVIS

B. LobbyofVISBuilding

C. HallwayBicycleParking

D. DoorInfrontofBuildingP

E. LobbyofBuildingP

G. HallwayCarParking
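As noted above, the recorded video Scene.avi is exported frame by frame to PNG images. A minimal sketch of such an export step with OpenCV is given below; the file names and output directory are illustrative, not those used in the paper:

```python
# Export Scene.avi to individual PNG frames (illustrative sketch).
import os
import cv2

def export_frames(video_path="Scene.avi", out_dir="frames"):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()          # read the next frame
        if not ok:                      # end of the video file
            break
        idx += 1
        cv2.imwrite(os.path.join(out_dir, f"{idx:06d}.png"), frame)
    cap.release()
    return idx                          # number of frames written
```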

We divide the database into 2 parts, one for training and one for testing.

• Training data: For each scene, we randomly take 100 images from the first 6296 frames of round 1, as shown in Table 3.

Table 3. Training dataset: 6 scenes from Round 1

Scene     Total frames   Training frames   Percentage
A-scene   228            100               ~ 44%
B-scene   1086           100               ~ 9%
C-scene   1672           100               ~ 6%
D-scene   491            100               ~ 20%
E-scene   1827           100               ~ 5%
G-scene   992            100               ~ 10%
Total     6296           600               ~ 10%

• Testing data: We build two test sets. The first set contains all frames of the first round (6296 frames, including the 600 training frames). The second set consists of all frames of the second round and therefore does not contain any training images. Note that the two rounds were captured at different times, so the lighting conditions differ.

o Testing Dataset 1: Round 1, frames 1 to 6296; 6296 test images, including the 600 training images.

o Testing Dataset 2: Round 2, frames 6297 to 11924; 5628 test images in total.
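A minimal sketch of the per-scene random split described above (100 random training frames per scene from round 1, the remainder kept for testing); the data structure and seed are illustrative assumptions:

```python
# Per-scene random train/test split (illustrative sketch).
import random

def split_per_scene(scene_frames, n_train=100, seed=0):
    """scene_frames: dict mapping scene label -> list of round-1 frame indices."""
    rng = random.Random(seed)
    train, test = {}, {}
    for scene, frames in scene_frames.items():
        picked = rng.sample(frames, n_train)          # 100 random training frames
        train[scene] = picked
        test[scene] = [f for f in frames if f not in set(picked)]
    return train, test
```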

C. Pre-processing

Before applying the scene classification module, we need to pre-process the captured data.

• Data stabilization: Images are captured by a moving camera that can vibrate during the walk. Therefore, we follow the work in [17] to reduce instability and blur caused by shaking while capturing images.

• Image centralization: In order to eliminate "waste" areas in the images (e.g. the sky or the ground plane in Scene.avi), we keep only the region at the image center (a minimal cropping sketch follows this list). Figure 7 shows the result obtained after stabilizing, cropping, and resizing the original image.
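A minimal sketch of the centralization step, assuming a simple central crop followed by resizing; the crop ratio and output size are illustrative choices, and the stabilization of [17] is not included:

```python
# Central crop + resize as a stand-in for the image centralization step (illustrative sketch).
import cv2

def centralize(image, keep_ratio=0.6, out_size=(256, 256)):
    """Keep only the central region of the frame and resize it."""
    h, w = image.shape[:2]
    ch, cw = int(h * keep_ratio), int(w * keep_ratio)
    y0, x0 = (h - ch) // 2, (w - cw) // 2
    crop = image[y0:y0 + ch, x0:x0 + cw]
    return cv2.resize(crop, out_size)
```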


Figure 7. Result obtained after preprocessing

D. Evaluation measure

There are many measures for evaluating recognition performance, such as Recall, Precision, and Accuracy [18]. In our context, we know the distribution of positive and negative examples (the ratio between positive and negative is 1/6), so Precision and Recall are suitable measures. For classification tasks, the terms true positive, true negative, false positive, and false negative compare the results of the classifier under test with trusted external judgments. The terms positive and negative refer to the classifier's prediction (the expectation), and the terms true and false refer to whether that prediction corresponds to the external judgment (the observation). This is illustrated by the table below:

                                   actual class (observation)
predicted class (expectation)   tp (true positive): correct result      fp (false positive): unexpected result
                                fn (false negative): missing result     tn (true negative): correct absence of result

Precision and recall are then defined as:

\text{Precision} = \frac{tp}{tp + fp} \qquad (2)

\text{Recall} = \frac{tp}{tp + fn} \qquad (3)

We therefore evaluate the system using the Precision and Recall criteria.
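For concreteness, the sketch below computes per-class precision and recall from predicted and ground-truth scene labels, following equations (2) and (3); the function name is hypothetical:

```python
# Per-class precision and recall from label lists (illustrative sketch of eqs. (2) and (3)).
def precision_recall(y_true, y_pred, label):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == label and t != label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != label and t == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```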

E. Experimental results

• Testing with dataset 1: We run experiments with the 6296 images of the first test set of our database.

Table 4. Scene classification recall when testing with dataset 1

          SceneA   SceneB   SceneC   SceneD   SceneE   SceneG
SceneA    100%     0%       0%       0%       0%       0%
SceneB    0%       99%      0%       0%       0%       0%
SceneC    0%       0%       100%     0%       0%       0%
SceneD    0%       0%       0%       99%      0%       0%
SceneE    0%       1%       0%       0%       91%      17%
SceneG    0%       0%       0%       1%       9%       83%
Average recall: 95%

Table 5. Scene classification precision when testing with dataset 1

          SceneA   SceneB   SceneC   SceneD   SceneE   SceneG
SceneA    100%     0%       0%       0%       0%       0%
SceneB    0%       100%     0%       0%       0%       0%
SceneC    0%       0%       100%     0%       0%       0%
SceneD    0%       0%       0%       100%     0%       0%
SceneE    0%       1%       0%       0%       90%      9%
SceneG    0%       0%       0%       0%       16%      83%
Average precision: 96%

• Testing with dataset 2: We run experiments with the 5628 images of the second test set of our database.

Table 6. Scene classification recall when testing with dataset 2

          SceneA   SceneB   SceneC   SceneD   SceneE   SceneG
SceneA    100%     1%       0%       0%       0%       0%
SceneB    0%       65%      1%       0%       0%       0%
SceneC    0%       12%      97%      0%       19%      2%
SceneD    0%       0%       0%       85%      2%       2%
SceneE    0%       18%      0%       15%      67%      15%
SceneG    0%       4%       2%       0%       13%      81%
Average recall: 83%

Table 7. Scene classification precision when testing with dataset 2

          SceneA   SceneB   SceneC   SceneD   SceneE   SceneG
SceneA    95%      5%       0%       0%       0%       0%
SceneB    0%       98%      1%       0%       1%       0%
SceneC    0%       7%       66%      0%       25%      2%
SceneD    0%       0%       0%       93%      5%       3%
SceneE    0%       9%       0%       7%       73%      11%
SceneG    0%       3%       2%       0%       19%      77%
Average precision: 84%

It is notable that, since dataset 1 contains the training images, it is understandable that its recall and precision are a little higher than those obtained with dataset 2. In both cases, the recall and precision (Figure 8) are good enough for a real application.

Figure 8. Testing results on the two datasets.


Concerning the computational time, we ran our system on a computer with the following configuration: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.4 GHz, 8 GB RAM. The average image resolution is 1024x1024. The processing speed is about 3 fps. The time could be reduced by downsampling the original image.

F. Some scene recognition results

We show some examples of scene recognition. We can see that even with a scale or viewpoint change, the recognition is still correct (Figure 9). When the change is really significant, the recognition is wrong (Figure 10).

Figure 9. Correct recognition examples

Figure 10. Wrong recognition examples

V. CONCLUSIONS

This paper proposed a method to identify contextual information about the environment for navigational aids that help blind people move along a route. The proposed method, using GIST features and a k-NN classifier, gives a high recognition rate (over 80%) in both cases: the training-dependent and the independent dataset. With this method, we will build an environmental description along moving routes and test with more blind people in a larger outdoor environment in the future.

ACKNOWLEDGMENT

This work is supported by the project "Visually impaired people assistance using multimodal technologies", funded by the VLIR Own Initiatives Programme, under the grant reference VLIR-UOS ZEIN2012RIP19.

REFERENCES

[1] Borenstein, J. and Ulrich, I., "The GuideCane - A Computerized Travel Aid for the Active Guidance of Blind Pedestrians," in IEEE Int. Conf. on Robotics and Automation, 1997, pp. 1283-1288.
[2] Bradley, N.A. and Dunlop, M.D., "An Experimental Investigation into Wayfinding Directions for Visually Impaired People," Ubiquitous Computing, 2005(9), pp. 395-403.
[3] Gharpure, C. and Kulyukin, V., "Robot-Assisted Shopping for the Blind: Issues in Spatial Cognition and Product Selection," Journal of Intelligent Service Robotics, 2008, 1(3), pp. 237-251.
[4] Golledge, R.G., et al., "Stated Preferences for Components of a Personal Guidance System for Nonvisual Navigation," Journal of Visual Impairment and Blindness, 2004, 98(3), pp. 135-147.
[5] Graf, B., "Reactive Navigation of an Intelligent Robotic Walking Aid," in IRS-2000, 2001, pp. 252-259.
[6] Helal, A.S., Moore, S.E., and Ramachandran, B., "Drishti: An Integrated Navigation System for Visually Impaired and Disabled," in Proc. of the Fifth International Symposium on Wearable Computers, Zurich, 2001, pp. 149-156.
[7] http://vision.wicab.com/technology/
[8] Willis, S. and Helal, S., "RFID Information Grid for Blind Navigation and Wayfinding," in Ninth IEEE International Symposium on Wearable Computers (ISWC'05), Osaka, Japan, 2005, pp. 34-37.
[9] Oliva, A. and Torralba, A., "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," International Journal of Computer Vision, 2001, 42(3), pp. 145-175.
[10] Torralba, A. and Oliva, A., "Statistics of Natural Image Categories," Network: Computation in Neural Systems, 2003, 14, pp. 391-412.
[11] Xiao, J., et al., "SUN Database: Large-scale Scene Recognition from Abbey to Zoo," in CVPR, 2010.
[12] Wu, J. and Rehg, J.M., "CENTRIST: A Visual Descriptor for Scene Categorization," IEEE Trans. PAMI, 2009.
[13] Lazebnik, S., Schmid, C., and Ponce, J., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2006.
[14] Le, Thi-Lan, Vu, Hai, and Tran, T.-H., "Scene classification for advertising service based on image content," Journal of Science and Technology Technical Universities, 2013(95), pp. 140-144.
[15] Quattoni, A. and Torralba, A., "Recognizing Indoor Scenes," in Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2009, pp. 1-8.
[16] Oliva, A. and Torralba, A., "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," Int. J. Comput. Vision, 2001, 42(3), pp. 145-175.
[17] Shen, Y., et al., "Video Stabilization Using Principal Component Analysis and Scale Invariant Feature Transform in Particle Filter Framework," IEEE Trans. on Consumer Electronics, 2009, 55(3), pp. 1714-1721.
[18] Everingham, M., et al., "The Pascal Visual Object Classes (VOC) Challenge," Int. J. Comput. Vision, 2006, 88(2), pp. 303-338.
