

Applying Fast Planar Object Detection in Multimedia Augmentation for Products with Mobile Devices

Quoc-Minh BUI, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam ([email protected])
Trung-Nghia LE, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam ([email protected])
Vinh-Tiep NGUYEN, University of Science & John von Neumann Institute, VNU-HCM, Ho Chi Minh City, Vietnam ([email protected])
Minh-Triet TRAN, University of Science, VNU-HCM, Ho Chi Minh City, Vietnam ([email protected])
Anh-Duc DUONG, University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam ([email protected])

Abstract— Texts, images, and audio and video clips about products are important information for shoppers. However, customers cannot access such information the moment they see products in physical stores. The authors propose a system applying fast planar object detection in multimedia augmentation for products, using mobile devices to provide useful information for customers while they shop. The experimental results show the strength of the proposed system in processing and displaying multimedia information on users' mobile devices in real time. The system can serve as a smart assistant that gives customers extra useful information about products and helps them make the best choice for their needs.

Keywords: Augmented Reality, Visual Search, Planar Object, Object Detection.

I. INTRODUCTION

The term Augmented Reality (AR) was coined by Tom Caudell, a Boeing researcher, in 1992 [1]. However, the idea of AR first appeared in the 1930s [2]. Its applications have been developed to assist human life in many fields, e.g. health care [3], education [4], art [5], and entertainment [6], by providing extra information in various formats, such as texts, images, audio/video clips, or 3D models.

Shopping is an essential daily need of human beings, and it takes many forms in real life. Customers who shop online can access extra multimedia information published by vendors or retailers: they can preview the table of contents and sample chapters of a book, listen to sample audio clips of a CD or DVD, watch the trailer of a movie or a game, and so on. Based on this extra multimedia information, a customer can decide whether or not to buy a product.

However, the rich multimedia information available in online shops is not available in traditional shopping. When shopping in physical bookstores, customers cannot easily access such multimedia information right when they see the real products. They only receive information from the external appearance of a product or from nearby banners and advertisements, which is often not enough to decide whether to buy the product. This motivates us to develop a system that uses mobile devices to detect a product from its external appearance and then display extra multimedia information related to that product. Given the great number of mobile devices in use, our proposed system can be a practical application that provides extra useful information for customers while shopping.

AR technology is used to provide extra information for products such as books, games, or audio CDs/DVDs. To recognize a product, we use its visual appearance (such as a book cover or the front matter of a CD/DVD). The process of visual query analysis includes two phases. In the first phase, the system applies a lightweight filter, a histogram of the dominant colors of the image, to evaluate the color dissimilarity between the query image and each template image in the database. The purpose of this module is to quickly filter out products in the database that cannot match the visual query image. In the second phase, the system verifies which of the candidate products found in the first phase actually match the visual query image. After identifying the product whose details the customer wants, the system sends the product information to the customer.

This paper is structured as follows. In Section II, we briefly present AR applications and visual search. The proposed system and the process of recognizing a product from its visual query are presented in Section III. The experimental results are presented in Section IV. Finally, Section V presents conclusions and ideas for future work.



II. BACKGROUND AND RELATED WORKS

A. Augmented Reality and Its Applications

The idea of creating AR systems first appeared in the 1930s [2]. However, it was not until the 1960s and 1970s that major companies began to develop the first versions of augmented reality for training and visualization. In 1968, the very first augmented reality device was created by Ivan Sutherland [15]. This system used an optical see-through head-mounted display tracked by one of two different 6DOF trackers: a mechanical tracker and an ultrasonic tracker. Tom Caudell coined the term Augmented Reality in 1992 while developing applications to help train workers at Boeing [1]. In the following years, many researchers worked to create programs that enable augmented reality [4]; notable projects and companies include ARToolKit, ARQuake, the Layar browser, Webitude, Total Immersion, and Google with its Android phones. With today's focus on small mobile devices, augmented reality development is booming, especially in the areas of marketing and entertainment.

AR has many applications in major fields of society such as health care [3], education [4], art [5], and entertainment [6]. In these applications, markers [10] or bokode markers [11] are used to detect objects and estimate camera poses in real time. However, these approaches may not feel natural to users. An alternative is to use a regular native image, such as the cover or front page of a book, CD, or DVD, which provides users with a natural means of interaction.

B. Object Detection Based on Template Matching

In order to recognize the covers of products, we use the template matching technique, i.e. finding a sub-template within a larger image. There are two main approaches to template matching: feature-based and area-based.

Feature-based approaches use features such as corners [12], edges [13], and blobs [8][9], together with a similarity measure, to find the best match between features in a template image and a source image. Determining a local feature involves two main steps: first detect interest points, then describe those key points. One of the weaknesses of this approach, however, is its high computational cost.

Area-based approaches use the color information of the template as the main factor to determine the similarity between the template and a pattern extracted from the source image. More accurate measures also exist, but they have high computational cost and are difficult to optimize in code.

In this paper, we use the local feature approach because of its robustness to scale and rotation transformations and to changes of viewpoint. Moreover, our cover templates are large enough and richly textured, so this approach can recognize them easily. To speed up matching, we apply a pruning strategy that skips templates that certainly cannot match; we call this process lightweight filtering.

III. PROPOSED METHOD

A. Overview of the System

A typical usage scenario of our proposed system is a customer who goes shopping and wants to know information about some product such as a book, CD, or DVD. The user points a mobile device at the cover or front matter of the product. After the query image is sent, the server finds the information about the product shown in the query image and displays it on the screen of the user's mobile device.

Figure 1. A typical scenario of usage of our proposed system.

The specific steps of the system are as follows (Figure 1); a minimal sketch of the client-side flow is given after the list.

- Step 1: A user uses a mobile device to capture the visual appearance of a product for which he or she wants extra multimedia information. In our system, the visual appearance of a product is a planar object, e.g. the front page of a book or the box cover of a CD or DVD. The user then sends this image as a visual query to the Recognition Server.
- Step 2: On receiving a visual query, the Recognition Server analyzes it to find the best matches and sends a list of appropriate product IDs to the user.
- Step 3: The user chooses the product of interest from the collection of product IDs received from the Recognition Server and sends it to the Media Server.
- Step 4: The Media Server sends the user a list of augmented multimedia objects corresponding to the chosen product.
- Step 5: The user selects one of the augmented multimedia objects and sends a request for that object to the Media Server.
- Step 6: The server sends the multimedia object back to the user, and the object is displayed on the user's mobile device.
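To make the six-step flow concrete, here is a minimal client-side sketch in Python. The server hosts, endpoint paths, and JSON field names are hypothetical stand-ins; the paper does not specify its wire protocol.

```python
import requests

# Hypothetical hosts and endpoints; the paper does not specify them.
RECOGNITION_SERVER = "http://recognition.example.com"
MEDIA_SERVER = "http://media.example.com"

def fetch_augmented_media(image_path):
    # Steps 1-2: send the captured cover image to the Recognition Server
    # and receive a ranked list of candidate product IDs.
    with open(image_path, "rb") as f:
        resp = requests.post(f"{RECOGNITION_SERVER}/query", files={"image": f})
    product_ids = resp.json()["product_ids"]

    # Steps 3-4: tell the Media Server which product the user picked
    # (here simply the top match) and get its list of multimedia objects.
    chosen = product_ids[0]
    media = requests.get(f"{MEDIA_SERVER}/products/{chosen}/media").json()["media"]

    # Steps 5-6: request one multimedia object for display on the device.
    return requests.get(media[0]["url"]).content
```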

The process of product detection consists of two main phases as shown in Figure 2:

Phase 1: We use a color histogram of the visual query to filter out the products in the database that cannot match the queried product, yielding a set of candidate products (c.f. Section III.B).

Phase 2: The system verifies the best match for the query and retrieves all information about that product (c.f. Section III.C).

Figure 2. Product detection.
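Assuming the two phases are supplied as interchangeable callables (the paper stresses below that each step can be swapped independently), the overall detection pipeline can be sketched as follows. The threshold default theta_h is illustrative; only n_k = 20 comes from the experiments in Section IV.

```python
from typing import Callable, List, Tuple

def detect_product(
    query_img,
    database: List[Tuple[str, object]],  # (product_id, cover_image) pairs
    filter_fn: Callable,                 # phase-1 dissimilarity, e.g. QC distance
    match_fn: Callable,                  # phase-2 verifier, e.g. SURF + RANSAC
    n_k: int = 20,                       # candidate budget used in Section IV
    theta_h: float = 0.5,                # assumed threshold; not given in the paper
):
    # Phase 1: keep at most n_k covers whose histogram distance to the
    # query is below theta_h (lightweight filtering, Section III.B).
    scored = [(pid, cover, filter_fn(query_img, cover)) for pid, cover in database]
    candidates = sorted((s for s in scored if s[2] < theta_h),
                        key=lambda s: s[2])[:n_k]

    # Phase 2: verify each candidate with feature-based template matching
    # and return the best verified product (Section III.C).
    best = None
    for pid, cover, _ in candidates:
        score = match_fn(query_img, cover)  # e.g. RANSAC inlier count, or None
        if score is not None and (best is None or score > best[1]):
            best = (pid, score)
    return best
```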

B. Lightweight Filtering

When shopping, customers tend to distinguish one product from another by the dominant colors of its outside appearance. Based on this observation, we use the relative positions of dominant colors to recognize products quickly. Thus, we propose lightweight filtering to filter out the products in the database that cannot match the visual query of the product the user is interested in.

Approach: This module uses only the color distributions of two images to evaluate the dissimilarity between them. We use the HSV color space because of its similarity to human color perception. There are many dissimilarity measures, but they can be classified into two main categories: bin-to-bin distances and cross-bin distances.
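As a sketch of what this module consumes, the following OpenCV snippet builds a normalized hue histogram in HSV space; histogramming only the hue channel as a proxy for dominant colors is our assumption, since the paper does not detail its binning.

```python
import cv2

def dominant_color_histogram(bgr_img, bins=16):
    # Convert to HSV, which is closer to human color perception, and
    # histogram the hue channel as a proxy for the dominant colors.
    hsv = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).ravel()
    return hist / max(hist.sum(), 1.0)   # normalize across image sizes
```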

In this paper, we use a cross-bin distance, the Quadratic-Chi histogram distance [7], because it can be computed quickly. The distance between a query image $I^*$ and a cover $I_k$ is determined as follows:

$$QC^{A}_{m}(I^*, I_k) = \sqrt{\sum_{i,j}\left(\frac{H_i(I^*) - H_i(I_k)}{\left(\sum_c \left(H_c(I^*) + H_c(I_k)\right)A_{ci}\right)^{m}}\right)\left(\frac{H_j(I^*) - H_j(I_k)}{\left(\sum_c \left(H_c(I^*) + H_c(I_k)\right)A_{cj}\right)^{m}}\right)A_{ij}}$$

where $H_i(I)$ is the value of bin $i$ in the histogram of image $I$, $A_{ij}$ is the similarity between bins $i$ and $j$, and $0 \le m < 1$ is a normalization factor. After calculating the distance between the query image and each cover image, we choose no more than $n_k$ candidate products whose distance is less than a threshold $\theta_H$.
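A minimal NumPy sketch of this distance follows, for 1-D histograms such as the hue histograms above. The bin-similarity matrix A and the exponent m are parameters the paper defers to [7]; the default m = 0.9 here is an assumption taken from that work.

```python
import numpy as np

def quadratic_chi(hist_p, hist_q, bin_similarity, m=0.9):
    """Quadratic-Chi histogram distance of Pele and Werman [7]."""
    p = np.asarray(hist_p, dtype=float)
    q = np.asarray(hist_q, dtype=float)
    a = np.asarray(bin_similarity, dtype=float)  # A[i, j]: similarity of bins i, j

    diff = p - q
    # Denominator for each bin i: (sum_c (P_c + Q_c) * A_ci) ** m.
    denom = ((p + q) @ a) ** m
    denom[denom == 0] = 1.0              # empty bins: avoid division by zero
    z = diff / denom
    # QC distance: sqrt(sum_ij z_i * A_ij * z_j), clamped against round-off.
    return float(np.sqrt(max(z @ a @ z, 0.0)))
```

For cyclic hue bins, a plausible (again assumed) choice is A[i][j] = 1 - min(d(i, j), T) / T, where d is the cyclic bin distance and T a cut-off, so that nearby hues count as similar.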

Figure 3. Lightweight filtering with a query image and three covers: the histograms of the query image and of covers 1-3.


This module is illustrated in Figure 3 with a query image and three cover images. The images of products 1 and 2 have histograms similar to the query image's histogram, with peaks at bin 6. Product 3, on the other hand, has a histogram peaked at bin 4; it cannot be a candidate because it is dissimilar to the query image.

Because each step of the pipeline is independent, this filter can be replaced by any new filtering algorithm that is more accurate and faster.

C. Product Matching

After filtering out the products that cannot match the query image, the next step is to verify each of the remaining candidates in the matching step.

There are many approaches to identifying the product in a query image from its visual appearance. We decided to use template matching, whose purpose is to find a sub-template in an image. Template matching can be classified into two main approaches: area-based and feature-based. Because feature-based template matching is robust to scale and rotation transformations and to changes of viewpoint, our proposed system implements that approach.

Because the visual appearance of the product is a planar object, the first step of matching is to extract key points from the query image. Each key point is a blob-like structure described by its center and the properties of its neighboring region.

After extracting key points, the next step is to match key points between the given query image I* and the template T. Let Φ_T and Φ_I* be the key point collections of T and I*, respectively. For each key point p in Φ_T, we find its corresponding key point q in Φ_I* by nearest neighbor search. The pair (p, q) is called a match and is only valid if the distance between p and q is not greater than a threshold θ_M. Let Φ_T,I* be the collection of matches between Φ_T and Φ_I*. If we can find a transform matrix M that maps most of the key points in Φ_T into Φ_I*, we consider the template T to be the result of the matching step. The RANSAC method [14] is used to find the homography M that maps the key points. The key idea of RANSAC is to estimate the homography M from a randomly selected subset of Φ_T,I* (usually no fewer than 5 matches; 4 correspondences are the minimum needed for a homography), count the outliers, i.e. the matches that do not support the estimated transform, and keep the transform with the fewest outliers. This selection process repeats until the number of iterations exceeds a threshold.

The Scale Invariant Feature Transform (SIFT) by D. Lowe [8] is the most popular choice for feature-based template matching. However, in our proposed system we use Speeded-Up Robust Features (SURF) [9] to recognize products, because it is faster than SIFT while remaining robust to changes in scale, rotation, illumination, and viewpoint. To further speed up the matching process, we use a GPU implementation of SURF.
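A minimal OpenCV sketch of this matching step is shown below. SURF ships in the opencv-contrib xfeatures2d module, and the ratio-test constant (standing in for the threshold θ_M) and the RANSAC reprojection error are assumptions, not values from the paper.

```python
import cv2
import numpy as np

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)  # needs opencv-contrib
matcher = cv2.BFMatcher(cv2.NORM_L2)

def match_template(template_gray, query_gray, min_matches=10):
    # Extract blob-like SURF key points and descriptors from both images.
    kp_t, des_t = surf.detectAndCompute(template_gray, None)
    kp_q, des_q = surf.detectAndCompute(query_gray, None)
    if des_t is None or des_q is None:
        return None

    # Nearest-neighbor search; Lowe's ratio test stands in for the
    # distance threshold theta_M used in the paper.
    knn = matcher.knnMatch(des_t, des_q, k=2)
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
    if len(good) < min_matches:
        return None

    # RANSAC estimates the homography M mapping template key points into
    # the query image; the mask marks inliers that support M.
    src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_q[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    M, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if M is None:
        return None
    return M, int(mask.sum())            # homography and inlier count
```

OpenCV builds with CUDA also expose a GPU SURF variant, in the spirit of the GPU implementation the paper uses to accelerate matching.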

Figure 4 shows an example of template matching using SURF features, with the product cover T (left) and the query image I* (right). Each line from left to right connects a pair of corresponding SURF features.

As with the previous step, this step can also be upgraded with another method that is more accurate and faster.

Figure 4. Detecting a product cover in a query image using SURF features.

IV. EXPERIMENTS

We present experiments that test different properties of our proposed system through three main tasks: evaluating the performance of query processing with and without Lightweight Filtering (c.f. Section IV.A), evaluating the accuracy of Template Matching (c.f. Section IV.B), and presenting a scenario of usage for applying our proposed system in real life (c.f. Section IV.C).

The experiments are run on a system with a quad-core 2.4 GHz CPU, 2 GB of RAM, and a GeForce GTX 460 graphics card (1 GB of memory).

The dataset is collected from many sources, such as vbook.vn, vinabook.com, and amazon.com. Note that the number of products used for testing is not the total number of products carried by these shops.

This system and this dataset are used for all experiments in this paper.


A. Performance of Query Processing using Lightweight Filter

This experiment compares the performance of our proposed system in two contexts: with and without the Lightweight Filter.

We divide the dataset into 5 subsets of different sizes: 50, 100, 200, 500, and 1000 products. For each subset, we perform 100 visual queries with different input images, and each query is processed both without and with Lightweight Filtering. The experimental results are illustrated in Figure 5.

In the first situation, the query image I* is matched against every product cover in the subset, so the total time to process a query increases linearly with the number of product covers.

In the second situation, we apply the Lightweight Filter to choose candidate product covers, and only the top n_k candidates are considered for feature matching. In our experiment, we choose n_k = 20.

Figure 5. Comparison of the performance (in milliseconds) of processing a visual query with and without Lightweight Filtering.

As Figure 5 shows, the time to process a visual query in the second case increases only slightly with the total number of product covers, because image matching is executed for at most n_k candidate covers per query. On average, the whole visual query process takes about 130-135 ms (for n_k = 20). The average elapsed time is slightly higher than the time needed to match a query image against n_k = 20 candidates because of the extra time spent on Lightweight Filtering.

B. Accuracy of Template Matching

This experiment evaluates the accuracy of Template Matching.

Figure 6. Sample images in the 4 scenarios: (a) being obscured, (b) glare lighting, (c) shadow, (d) motion blur.

Covers captured by mobile devices are matched against the dataset. We conduct the experiments in four scenarios: a cover partially obscured by fingers, glare caused by a plastic cover, a cover in shadow, and a cover with motion blur due to fast movement. Figure 6 illustrates sample images from the 4 scenarios.

In each scenario, we detect 50 product covers and process each cover over 300 frames. The accuracy percentages are shown in Table 1. In the motion blur scenario, the cover cannot be detected in every consecutive frame; however, we can stabilize the result by applying a Kalman Filter [16] to correct the detections and make the processing smoother and more accurate.

Table 1. Accuracy of Template Matching

Scenario         Without Kalman Filter   With Kalman Filter
Being obscured   92.4%                   96.2%
Glare lighting   83.8%                   87.4%
Shadow           94.3%                   98.7%
Motion blur      85.8%                   94.7%
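A minimal sketch of such a smoothing step is given below, assuming a constant-velocity Kalman filter over the detected (x, y) center of the cover; the state layout and noise magnitudes are illustrative assumptions, not values from the paper.

```python
import numpy as np

class CoverKalman:
    """Constant-velocity Kalman filter over a cover's (x, y) center."""

    def __init__(self, q=1e-2, r=1.0):
        self.x = np.zeros(4)                 # state: [x, y, vx, vy]
        self.P = np.eye(4) * 1e3             # large initial uncertainty
        self.F = np.eye(4)                   # transition: position += velocity
        self.F[0, 2] = self.F[1, 3] = 1.0    # (dt = 1 frame)
        self.H = np.zeros((2, 4))            # we only measure (x, y)
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * q               # process noise (assumed)
        self.R = np.eye(2) * r               # measurement noise (assumed)

    def step(self, measurement=None):
        # Predict: carry the cover forward even on frames where the
        # detector misses it (e.g. under motion blur).
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        if measurement is not None:          # correct with this frame's detection
            z = np.asarray(measurement, dtype=float)
            y = z - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                    # smoothed (x, y) for this frame
```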


C. Scenario of Usage

We present a usage scenario to illustrate the functions of our proposed system. As shown in Figure 7, the mobile device shows the product of interest (as an image captured by the device's camera) augmented with multimedia contents. Based on that information, the user decides whether or not to buy the product.

Figure 7. Augmented information for a product: the product cover, the augmented information, and the result.

V. CONCLUSION

In this paper, we introduce a system that provides extra multimedia information for customers shopping in physical stores. Based on the information received from the server, a user can decide whether or not to buy a product.

To recognize the product in a visual query image quickly, we propose a lightweight filter module based on the histograms of dominant colors of two images. The experimental results show that the system can provide information for users in real time.

In the future, the system can be upgraded by replacing the algorithm in each step to improve performance. Beyond multimedia information, the system could also collect and classify social media content (comments, likes, ratings) from social networks. Users would then have many options when they want full information about a product, including specifications, multimedia, and social content.

ACKNOWLEDGEMENTS

This research is supported by the research funding of the Honors Program in Computer Science, University of Science – Vietnam National University of Ho Chi Minh City.

REFERENCES

[1] T. P. Caudell and D. W. Mizell, "Augmented Reality: An Application of Heads-Up Display Technology to Manual Manufacturing Processes", Proc. IEEE Hawaii International Conference on System Sciences (HICSS'92), 1992, pp. 659-669.

[2] G. Schweighofer and A. Pinz, "Robust Pose Estimation from a Planar Target", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, pp. 2024-2030.

[3] C. Bichlmeier, F. Wimmer, S. M. Heining and N. Navab, “Contextual Anatomic Mimesis Hybrid In-Situ Visualization Method for Improving Multi-Sensory Depth Perception in Medical Augmented Reality”, Proc. ISMAR’07, 2007.

[4] P. Smith and A. Sanchez, “Farming Education: A Case for Social Games in Learning”, Proc. HCI’11, 2011.

[5] P. Debenham, “Evolutionary augmented reality at the Natural History Museum”, Proc. ISMAR’11, 2011.

[6] A. Hiyama, Y. Doyama, M. Miyashita, E. Ebuchi, M. Seki, and M. Hirose, “Wearable Display System for Handing Down Intangible Cultural Heritage”, Proc. HCI’11, 2011.

[7] O. Pele and M. Werman, "The Quadratic-Chi Histogram Distance Family", Proc. European Conference on Computer Vision (ECCV), 2010, pp. 749-762.

[8] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision (IJCV), pp. 91-110, 2004.

[9] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, “SURF: Speeded Up Robust Features”, Computer Vision and Image Understanding (CVIU), pp. 346-359, 2008.

[10] M. Knecht, C. Traxler, O. Mattausch, W. Purgathofer, and M. Wimmer, "Differential Instant Radiosity for Mixed Reality", Proc. ISMAR 2010, 2010, pp. 99-107.

[11] A. Mohan, G. Woo, S. Hiura, Q. Smithwick, and R. Raskar, "Bokode: Imperceptible Visual Tags for Camera-based Interaction from a Distance", Proc. SIGGRAPH 2009, 2009.

[12] J. Shi and C. Tomasi, "Good Features to Track", Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.

[13] J. Canny, "A Computational Approach to Edge Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 679-698, 1986.

[14] M. A. Fischler, R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography”, Comm. of the ACM, Vol 24, pp 381-395, 1981.

[15] I. Sutherland, “A Head-Mounted Three Dimensional Display”, Proceedings of Fall Joint Computer Conference, pp. 757-764, 1968.

[16] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems", Transactions of the ASME – Journal of Basic Engineering, 1960.
