
Information-Theoretic Database Building and Querying for Mobile Augmented Reality Applications

Pawan K. Baheti Ashwin Swaminathan Murali Chari Serafin Diaz Slawek Grzechnik

Qualcomm Corporate Research and Development, San Diego, CA

ABSTRACT

Recently, there has been tremendous interest in the area of mobile Augmented Reality (AR), with applications including navigation, social networking, gaming, and education. Current-generation mobile phones are equipped with a camera, GPS, and other sensors (e.g., magnetic compass, accelerometer, gyro), in addition to having ever-increasing computing/graphics capabilities and memory storage. Mobile AR applications process the output of one or more sensors to augment the real-world view with useful information. This paper focuses on the camera sensor output and describes the building blocks for a vision-based AR system. We present information-theoretic techniques to build and maintain an image (feature) database based on reference images, and to query captured input images against this database. Performance results using standard image sets demonstrate superior recognition performance even with dramatic reductions in feature database size.

KEYWORDS: Mobile Augmented Reality, object detection, database pruning.

1 INTRODUCTION

Recently, there has been tremendous interest [1][2] in developing applications on the mobile phone. Modern cell phones, especially smart phones, have all of the key enabling components for AR, including camera, GPS, sensors, high-performance multi-core processors with large memory, and high-end graphics/multimedia capabilities. See [3], for example, for a listing of major milestones in the development of mobile AR. Broadly, the emerging applications of AR can be classified into the following areas: (a) navigation, search, and discovery; (b) educational/how-to aids; (c) gaming; (d) commerce; and (e) social media sharing, i.e., enabling users to share annotated views of real-world scenes with others via their social networks.

Determining what the user 'sees' in the camera view requires using (or fusing) the output of multiple sensors. The vast majority of mobile AR applications today, e.g., Wikitude and Layar, combine GPS position estimates with magnetic compass orientation. GPS, compass, and other sensors (e.g., accelerometer, gyro) have their limitations, which can be addressed to a certain extent using fusion techniques such as Kalman filters and particle filters. Some other limitations of sensor-based AR can be overcome by incorporating vision.

Vision-based AR has been shown to provide better AR experiences by being able to robustly detect and track points of interest (POI). A key enabler for vision-based AR is the use of techniques from the field of Computer Vision. There are many approaches to object detection on a per-frame basis, followed by tracking object movement across frames. Detection for AR involves not only the recognition (or not) of a reference object¹ in the (query) image captured by the camera but also computing the underlying spatial transformation of the object between reference and query. The computed transform is used by the graphics engine to render information, e.g., text, video, or 3-D object models, directly on the detected object.

¹ The reference could also be a target image instead of an object.

e-mail: {pbaheti, sashwin, mchari, sdiaz, sgrzechn}@qualcomm.com

This paper focuses on vision-based AR and more specifically on object detection algorithms. Object detection for most AR applications typically consists of three phases [4]-[10], as shown in Figure 1: (1) Database building: keypoints and descriptors are extracted from a set of objects and stored in a database, either in the form of raw features or in structures designed to facilitate subsequent matching (e.g., k-means trees, k-d trees, vocabulary trees); (2) Matching: keypoints and descriptors are extracted from the query image and matched against those of the database (DB) images to identify candidate matches; and (3) Pose estimation: a transformation model is fit between the spatial locations of the matching query and reference database image features. Such approaches have been shown to provide good recognition performance with large reference sets in the presence of occlusions, clutter, viewpoint/illumination changes, and noise.

Figure 1. System overview for Mobile Augmented Reality
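To make the three phases concrete, the following is a minimal sketch of such a detection pipeline using OpenCV's SIFT, brute-force matching, and RANSAC homography fitting. It is illustrative only, not the authors' implementation; the ratio-test value and the RANSAC reprojection threshold are assumptions.

```python
# A minimal sketch of the three-phase pipeline (illustrative, not the authors'
# implementation). Assumes opencv-python >= 4.4, where SIFT is available.
import cv2
import numpy as np

sift = cv2.SIFT_create()

def build_database(reference_images):
    """Phase 1: extract keypoints and descriptors for each reference object."""
    db = []
    for obj_id, img in enumerate(reference_images):
        kps, descs = sift.detectAndCompute(img, None)
        db.append((obj_id, kps, descs))
    return db

def detect(query_img, db, ratio=0.75):
    """Phases 2 and 3: match the query against the DB, then fit a homography."""
    q_kps, q_descs = sift.detectAndCompute(query_img, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    best = None
    for obj_id, r_kps, r_descs in db:
        pairs = matcher.knnMatch(q_descs, r_descs, k=2)
        # Lowe-style ratio test keeps only distinctive correspondences.
        good = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        if len(good) < 4:
            continue  # a homography needs at least 4 correspondences
        src = np.float32([r_kps[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([q_kps[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        score = int(mask.sum()) if mask is not None else 0
        if best is None or score > best[2]:
            best = (obj_id, H, score)
    return best  # (object id, homography for rendering, inlier count) or None
```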

The problem of object detection has been relatively well studied in the computer vision literature, and several algorithms have been proposed for feature extraction and matching [4]-[10]. Compared to these works, which aim to develop better and more robust features, the work presented in this paper relies on a good feature extractor and focuses on identifying the optimal minimal set of descriptors that maximizes detection accuracy. The main contributions of this paper are two-fold: (1) an information-theoretic database pruning algorithm that compresses the feature set by identifying noisy keypoints/descriptors and removing them from the database, and (2) an information-optimal query algorithm that employs the pruning weights to improve recognition rates. The proposed information-optimal pruning and querying algorithms can be applied on top of any feature descriptor to improve its performance for object detection.

The two most important considerations in the design of object detection systems for mobile AR are the recognition rate and the size and composition of the database. The pruning method presented in this work addresses both. Our proposed approach is based on the observation that several features in the database repeat, and such repetitions can be removed without any loss in recognition accuracy. For instance, a single object is most often captured in multiple views, and several of its features may occur more than once in the database. Such repetitions can be removed after assigning them appropriate priors/weights, thereby reducing database size. Additionally, there also exist features which appear in multiple objects (e.g., corners of a window in a building database). The presence (or absence) of such features in the query image does not provide any additional information to identify the query, since they would match multiple objects in the database. Such features therefore constitute noise and can be removed without affecting recognition accuracy.

Along with the minimal set of descriptors in the database, it is also important to develop efficient query algorithms on the mobile device. In this work, we present a query algorithm that makes use of the priors/weights generated during the database building process to determine the potential candidate matches. The candidate matches may then be subjected to outlier removal steps, which exploit the geometric transformation between the reference images and the query image to robustly estimate the pose of the object relative to the reference image of the object in the dataset.

The rest of the paper is organized as follows. Related work is presented in Section 2. Section 3 covers the database pruning algorithms developed in this work, and Section 4 provides the details of the query algorithms, including outlier removal methods on the mobile device. We present the evaluation and performance comparisons of the proposed algorithms in Section 5, and final conclusions are drawn in Section 6.

2 RELATED WORK

The problem of object detection has been relatively well studied in the computer vision literature [4]-[10]. In order to recognize an image, it must be compared against all the objects in the database. Comparing images pixel-by-pixel is not only inefficient but also not robust to variations in imaging conditions and object transformations. Therefore, recognition-specific representations of images have been used for comparison or matching. In the literature, several sets of features have been proposed to facilitate object detection; examples include the Scale-Invariant Feature Transform (SIFT) [4], Speeded-Up Robust Features (SURF) [5], multi-scale Harris features [7], the Gradient Location-Orientation Histogram (GLOH) [8], and the Compressed Histogram of Gradients (CHoG) [10]. Mikolajczyk and Schmid [8] provide a detailed comparison of different types of extraction methods and compare the approaches for specific applications. Compared to these works, which aim to develop better and more robust features, the work presented in this paper relies on a good feature extractor and focuses on identifying the optimal minimal set of descriptors that maximizes detection accuracy.

In the literature, methods have been proposed to improve the robustness of feature extraction algorithms. Turcot and Lowe [11] present a method to identify a minimal set of features via an unsupervised pre-processing step that retains descriptors which are geometrically consistent across multiple views, ranks them based on the number of times each descriptor appears, and represents the relationship between images using an adjacency graph. In [12], Naikal et al. improved upon this approach by optimizing for speed and accuracy on low-resolution images using sparse Principal Component Analysis (PCA). Techniques such as histogram-coding-based approaches [13], feature selection strategies based on vocabulary trees [14], and classifier-based approaches [15] could alternatively be used to select the statistically informative descriptors per object. We note that the proposed pruning approach can be used alongside such techniques to further compress the database while maintaining good recognition rates.

Fritz et al. present an information-theoretic SIFT feature selection algorithm in [16], where feature selection is based on the criterion $H(X \mid f_{i,j}) \le \theta$, where $H(\cdot \mid \cdot)$ is the conditional entropy of the object $X$ given the feature $f_{i,j}$, and $\theta$ was set to 1 bit. However, the authors do not consider keypoint properties such as scale and location, as in the case of SIFT, while selecting the pruned descriptor set. Furthermore, the paper does not provide details on the computation of the conditional probabilities. The work proposed in this paper addresses these two aspects. In addition, we associate a weighting factor with each descriptor; this weighting factor represents the relative importance of the descriptor among others in the database. As we will show, the weighting factor can be used to facilitate several aspects of database building, help enhance the querying algorithm, and be used towards building robust techniques for incremental learning and adding/removing objects from the database. Additionally, as will be seen in the simulation results (see Section 5), the proposed approach provides 100% accuracy on the ZuBud building database [17] with an 8x reduction in database size, which is better than the 91% recognition accuracy reported in [16].

3 DATABASE PRUNING

In this section, we describe our proposed database pruning algorithm. We illustrate the algorithm using SIFT features [4]. We choose SIFT since it has been shown to be robust to various transformations; however, the techniques presented in this paper can be applied to other types of underlying feature localization or description algorithms, as will be shown in Section 5. As described earlier, the main goal of database pruning is twofold: first, to reduce the amount of memory required to store the SIFT features in the database, and second, to extract the most informative features in the database so as to improve recognition accuracy by reducing the amount of noise.

As an illustrative example, consider the ZuBud database [17] containing 201 unique building objects, with 5 views per POI, amounting to a total of 1005 images. For an image in VGA resolution (640 × 480 pixels), SIFT processing would result in around 2500 d-dimensional SIFT features, with d = 128 for traditional SIFT [4]. Assuming 2 bytes per feature element, we would require around 2500 × 128 × 2 bytes, or 625 KB of memory, to store the SIFT features for one image in VGA resolution. For the ZuBud database with 1005 VGA images, the amount of memory required is of the order of 660 MB. Although it would be possible to download a smaller, relevant portion of the database to the mobile phone (possibly using other kinds of side information such as GPS location), the data download model is very much dependent on the AR use case and is limited by the cache on the mobile phone and the communication network.
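The storage estimate above is simple arithmetic; a small script (numbers taken from the text, with the MB figure being approximate) makes it explicit:

```python
# Back-of-the-envelope sizing for a raw SIFT database (numbers from the text).
FEATURES_PER_IMAGE = 2500    # typical SIFT yield for a VGA image
DIM = 128                    # descriptor dimensionality
BYTES_PER_ELEMENT = 2
NUM_IMAGES = 1005            # ZuBud: 201 objects x 5 views

per_image = FEATURES_PER_IMAGE * DIM * BYTES_PER_ELEMENT   # 640,000 bytes
total = per_image * NUM_IMAGES                             # ~643 million bytes
print(f"per image: {per_image / 1024:.0f} KB")             # -> 625 KB
print(f"total: {total / 1e6:.0f} MB (order of the ~660 MB quoted above)")
```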

The main idea behind the proposed pruning algorithm is motivated by information theory and pattern recognition. In the pattern recognition literature [15], it has been shown that more features do not necessarily imply better recognition accuracy, especially when some features are redundant across multiple object classes. Based on this motivation, we devise a three-step approach to database pruning, as shown in Figure 2. The steps are: intra-object pruning, inter-object pruning, and keypoint clustering.


The goal of intra-object pruning is to remove similar and redundant keypoints that occur across different views of the same object while retaining just one among them. This step helps choose the keypoints that best represent a given object and improves object recognition accuracy. In contrast, the inter-object pruning step helps retain the most informative set of descriptors across different objects. This step improves classification performance and confidence by discarding those keypoints in the database that appear in multiple objects. Finally, the keypoint clustering step ensures that the final set of pruned descriptors has good information content and provides good matches across a range of scales.

Figure 2. Overview of Database Pruning

3.1 Useful Notations

Before describing the database pruning approaches developed in this work, let us briefly review the mathematical notation used in the remainder of the paper. Let $M$ denote the number of unique objects in the database, and let $N_i$ denote the number of image views for the $i$-th object. Let the total number of descriptors across the $N_i$ views of the $i$-th object be denoted by $K_i$. We use $f_{i,j}$ to represent the $j$-th descriptor of the $i$-th object, where $j = 1 \ldots K_i$ and $i = 1 \ldots M$. Let the set $S_i$ contain the $K_i$ descriptors of the $i$-th object, i.e., $S_i = \{ f_{i,j} ;\; j = 1 \ldots K_i \}$. Define a source variable $X$ that takes integer values from 1 to $M$, where $X = i$ indicates that the $i$-th object from the database was selected, and let $pr(X = i)$ denote the probability of $X$ selecting the $i$-th object. Further, let $\tilde{S}_i$ denote the pruned descriptor set for the $i$-th object, and let $\tilde{K}_i = |\tilde{S}_i|$ be its cardinality. We use $I(X; \tilde{S})$ to denote the mutual information [18] between $X$ and the pruned database $\tilde{S}$.

3.2 Problem Formulation

The goal of database pruning is to reduce the cardinality of the descriptor sets $\tilde{S}_i$ while maintaining high recognition accuracy. From an information-theoretic perspective, the database pruning problem can be formulated as:

$$\max_{\tilde{S}} \; I(X; \tilde{S}) \quad \text{such that} \quad \sum_{i=1}^{M} \tilde{K}_i \le K, \quad \text{where } \tilde{S} = \{\tilde{S}_1 \ldots \tilde{S}_M\},\; \tilde{S}_i \subseteq S_i,\; i = 1 \ldots M, \tag{1}$$

for a given budget $K$ on the total number of retained descriptors. In other words, to form the pruned database we would like to retain the descriptors from the original database that maximize the mutual information between $X$ and the pruned database $\tilde{S}$. With such a criterion, we expect to get rid of features that are less informative about the occurrence of a database object in the input image. Note that this maximization is prohibitive because it involves the joint and conditional distributions of descriptors given the entire database, and is computationally expensive even for small $M$ and $K_i$. In this work, we make the assumption that each descriptor is a statistically independent event, which implies that we can express the mutual information in Eqn. (1) as a summation of the mutual information provided by the individual descriptors in the pruned set:

$$I(X; \tilde{S}) = \sum_{f_{i,j} \in \tilde{S}} I(X; f_{i,j}). \tag{2}$$

We note from [18] that maximizing the individual mutual information component $I(X; f_{i,j})$ in Eqn. (2) is equivalent to minimizing the conditional entropy $H(X \mid f_{i,j})$, which is a measure of the randomness about the source $X$ given the descriptor $f_{i,j}$. In other words, descriptors with lower conditional entropy are statistically more informative. The conditional entropy $H(X \mid f_{i,j})$ is given as

$$H(X \mid f_{i,j}) = -\sum_{k=1}^{M} p(X = k \mid f_{i,j}) \log p(X = k \mid f_{i,j}), \tag{3}$$

where $p(X = k \mid f_{i,j})$ is the conditional probability of the source variable $X$ being equal to the $k$-th object given the occurrence of descriptor $f_{i,j}$ ($i = 1 \ldots M$ and $j = 1 \ldots K_i$). In a perfectly deterministic case, where the occurrence of a particular descriptor $f_{i,j}$ is associated with only one object in the database, the conditional entropy goes to 0, suggesting that the chosen descriptor is very useful for the object detection task. On the other hand, if a specific descriptor is equally likely to appear in all $M$ database objects, then its conditional entropy is highest and equals $\log_2 M$ bits (assuming all objects are equally likely, i.e., $pr(X = k) = 1/M$). The presence of such a descriptor does not provide any information to improve recognition accuracy, and it can therefore be removed from the database.
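As a quick illustration of Eqn. (3), the following sketch computes $H(X \mid f)$ from a vector of posterior probabilities and reproduces the two boundary cases discussed above; the function name and inputs are ours, for illustration:

```python
# Conditional entropy H(X | f) of Eqn. (3), computed from the posterior
# probabilities p(X = k | f). Function name and inputs are illustrative.
import numpy as np

def conditional_entropy(posteriors):
    p = np.asarray(posteriors, dtype=float)
    p = p[p > 0]                          # 0 * log 0 = 0 by convention
    return float(-(p * np.log2(p)).sum())

M = 4
print(conditional_entropy([1.0, 0.0, 0.0, 0.0]))  # 0.0 bits: pins down one object
print(conditional_entropy([1.0 / M] * M))         # log2(4) = 2.0 bits: uninformative
```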

3.3 Intra-object pruning

The goal of intra-object pruning is to remove descriptor redundancies within the views of the same object. Using all the views of the $i$-th object (with a total of $K_i$ descriptors), the following steps are repeated for each descriptor $f_{i,j}$:

a) The set of matching keypoints, i.e., descriptors that lie within an L2 distance of $\epsilon$ from $f_{i,j}$, is identified. Let us suppose the cardinality of this set is $L_j$.

b) These $L_j$ keypoints are compounded into one descriptor vector. In this compounding step, only one of the $L_j$ descriptor instances is retained, and the remaining $(L_j - 1)$ descriptors are removed from the database. The keypoint location $(x, y)$, scale information, object ID, and view ID of all $L_j$ keypoints are, however, stored in the database and subsequently used for geometric consistency checks during querying (see Section 4).

In our work, we introduce and associate a weighting factor with each (compounded) descriptor. We denote these weights as $w_{f_{i,j}}$ and initialize them to $w_{f_{i,j}} = L_j / K_i$, where $i = 1 \ldots M$ ($M$ = number of objects) and $j = 1 \ldots K_i$ ($K_i$ = number of descriptors in the $i$-th object).
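A possible implementation of the intra-object pruning step is sketched below. The paper specifies the compounding of matching descriptors and the weight $w_{f_{i,j}} = L_j / K_i$; the greedy grouping order and the data layout here are our choices.

```python
# Sketch of intra-object pruning. Descriptors of one object within an L2
# distance eps of each other are compounded into a single representative that
# keeps the keypoint metadata of all instances and the weight w = L_j / K_i.
import numpy as np

def intra_object_prune(descs, meta, eps):
    """descs: (K_i, d) array; meta: list of (x, y, scale, view_id) per row."""
    K_i = len(descs)
    remaining = list(range(K_i))
    pruned = []
    while remaining:
        j = remaining.pop(0)
        if remaining:
            dists = np.linalg.norm(descs[remaining] - descs[j], axis=1)
            dup = [remaining[t] for t in np.flatnonzero(dists < eps)]
        else:
            dup = []
        group = [j] + dup
        remaining = [r for r in remaining if r not in dup]
        pruned.append({
            "descriptor": descs[j],                 # retained representative
            "instances": [meta[g] for g in group],  # (x, y, scale, view) kept
            "weight": len(group) / K_i,             # w_{f_{i,j}} = L_j / K_i
        })
    return pruned
```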

3.4 Inter-object pruning

The goal of this step is to quantify the conditional information measure (i.e., the "goodness") of each descriptor with respect to the objects in the database, and to use it to prune the database further. As described in Section 3.2, we quantify the goodness of a descriptor $f$ based on the conditional entropy $H(X \mid f)$, which is calculated using Eqn. (3). The steps for computing the conditional probabilities $p(X = k \mid f_{i,j})$ and the conditional entropy are as follows. For each descriptor $f = f_{i,j}$ ($i = 1 \ldots M$; $j = 1 \ldots K_i$) in the database, the following is performed:

a) The nearest neighbors of the feature $f$ in the descriptor database with L2 (norm) distance less than $\epsilon$ are retrieved. Denote the neighbors as $f_{k,n}$, where $k$ is the object ID corresponding to the feature and $n$ is the nearest neighbor index.


b) The conditional probabilities $p(f = f_{i,j} \mid X = k)$, where $k = 1 \ldots M$, are then computed using the nearest neighbors. We use a mixture of Gaussians [15] to model the conditional probability:

$$p(f = f_{i,j} \mid X = k) = \sum_{n:\, \text{nearest neighbor index}} w_{f_{k,n}} \, G[f_{i,j} - f_{k,n}], \tag{4}$$

$$\text{where } G[y] = \exp\!\left( -\frac{\|y\|_{L2}^2}{2\sigma^2} \right) \text{ and } \sigma = \epsilon / 2.$$

c) The posterior probabilities $p(X = k \mid f = f_{i,j})$ are then computed from the conditional probabilities using Bayes' rule:

$$p(X = k \mid f = f_{i,j}) = \frac{p(f = f_{i,j} \mid X = k) \, pr(X = k)}{\sum_{l=1}^{M} p(f = f_{i,j} \mid X = l) \, pr(X = l)}. \tag{5}$$

Intuitively, the posterior probability conveys the information a descriptor provides about a specific object.

d) The entropy of each feature is then calculated using Eqn. (3), and for each object, the descriptors whose conditional entropy $H(X \mid f)$ falls below a chosen threshold (in bits) are retained; the remaining features are removed from the database. The corresponding object IDs and view IDs are stored along with the features.
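The following sketch puts steps (a)-(d) together: an $\epsilon$-neighbor list per descriptor, the Gaussian-mixture likelihood of Eqn. (4), Bayes' rule of Eqn. (5), and the entropy test of Eqn. (3). The data structures, the uniform prior, and the handling of descriptors with no neighbors are our assumptions.

```python
# Sketch of the inter-object pruning score: Gaussian-mixture likelihood
# (Eqn. 4), Bayes posterior (Eqn. 5), conditional entropy (Eqn. 3).
# `neighbors` holds (object_id, weight, descriptor) tuples for DB features
# within L2 distance eps of f.
import numpy as np

def descriptor_entropy(f, neighbors, M, eps):
    sigma = eps / 2.0
    likelihood = np.zeros(M)                      # p(f | X = k), Eqn. (4)
    for obj_id, w, desc in neighbors:
        d2 = np.linalg.norm(f - desc) ** 2
        likelihood[obj_id] += w * np.exp(-d2 / (2.0 * sigma ** 2))
    joint = likelihood * (1.0 / M)                # uniform prior pr(X = k)
    if joint.sum() == 0.0:
        return np.log2(M)                         # no evidence: maximal entropy
    posterior = joint / joint.sum()               # Bayes' rule, Eqn. (5)
    p = posterior[posterior > 0]
    return float(-(p * np.log2(p)).sum())         # H(X | f), Eqn. (3)

def inter_object_prune(candidates, M, eps, threshold_bits):
    """Keep descriptors whose conditional entropy is below the threshold."""
    return [c for c in candidates
            if descriptor_entropy(c["f"], c["neighbors"], M, eps) <= threshold_bits]
```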

Figure 3. Example illustration of information-optimal pruning: the left column shows two views of an object from ZuBud with keypoints marked in blue; the middle column shows the two views after the intra-object pruning step; and the right column shows the two views after inter-object pruning and after retaining features that have higher scale and are spread out in geometric space. Different colors in the right column indicate different clusters of keypoints based on location.

3.5 Keypoint clustering and Feature selection

The keypoint clustering step further prunes the keypoints retained after inter-object pruning based on their location and scale information. In this step, the keypoints in each database image are clustered into $k_c$ clusters based on their $(x, y)$ locations. Within each cluster, the top $k_l$ keypoints with dominant scales are retained in the database and the remaining ones are removed. Note that this step limits the number of descriptors per POI view to at most $k_c \cdot k_l$. We choose the features with the highest scales because our experiments indicate that they are more reliable and less susceptible to noise.

Figure 3 shows an example illustrating intra-object pruning, inter-object pruning, and keypoint clustering. After the three steps, the size of the resulting database is at most $M \cdot k_c \cdot k_l$ descriptors. Note that the pruned database size affects recognition accuracy, as we will see in Section 5. Besides database reduction, the information-optimal pruning approach gives us a formal framework to incrementally add or remove descriptors from the pruned set given feedback about the recognition confidence level, or given system constraints such as memory usage on the mobile phone.
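A sketch of the clustering-and-selection step follows, assuming k-means over $(x, y)$ locations; the paper does not name a specific clustering algorithm, and scikit-learn's KMeans is our choice.

```python
# Sketch of keypoint clustering and feature selection: k-means over (x, y)
# locations, then the k_l largest-scale keypoints per cluster.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_select(keypoints, k_c=20, k_l=5):
    """keypoints: list of dicts with 'x', 'y', 'scale' (plus descriptor, etc.)."""
    xy = np.array([[kp["x"], kp["y"]] for kp in keypoints])
    n_clusters = min(k_c, len(keypoints))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(xy)
    kept = []
    for c in range(n_clusters):
        members = [kp for kp, lab in zip(keypoints, labels) if lab == c]
        members.sort(key=lambda kp: kp["scale"], reverse=True)
        kept.extend(members[:k_l])      # at most k_c * k_l keypoints per view
    return kept
```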

4 QUERYING AND POSE ESTIMATION

Figure 4 presents the details of our querying and outlier removal algorithm. The main novelty of the querying algorithm is the efficient use of the pruning weights to reduce the amount of computation required for querying. For a query image $Q$, the main steps are as follows:

Q1) Searching for nearest neighbors using descriptors: For each query image descriptor $Q_j$ ($j = 1 \ldots K_Q$), we retrieve $N$ nearest neighbors in the pruned database with L2 distance less than $\epsilon$. The nearest neighbors are then binned with respect to object ID to obtain $f_{i,n}$, where $i$ is the object ID and $n$ is the nearest neighbor index. We use randomized k-d trees [19] to perform the descriptor search and nearest neighbor retrieval.

Q2) Computing posterior probabilities: The identified neighbors are used to compute the probabilities $p(Q = i)$ for all $i = 1 \ldots M$:

$$p(Q = i) = \frac{1}{K_Q} \sum_{j=1}^{K_Q} p(Q = i \mid Q_j), \;\; \text{where} \;\; p(Q = i \mid Q_j) = \sum_{n:\, \text{nearest neighbor index}} p(Q = i \mid f_{i,n}) \, G[Q_j - f_{i,n}]. \tag{6}$$

Note that the terms $p(Q = i \mid f_{i,n})$ are pre-computed using Eqn. (5) and stored as part of the database pruning algorithm; they are not calculated at query time.

Q3) Computing confidence scores: The object candidates with the highest probability $p(Q = i)$ are chosen as potential matches. The confidence level of the recognition is computed using the entropy measure on the posterior probabilities computed in step Q2, and is given by

$$\text{Confidence} = 1 + \frac{1}{\log_2 M} \sum_{i = 1 \ldots M} p(Q = i) \log_2 p(Q = i). \tag{7}$$

Figure 4. Querying and outlier removal algorithm
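A sketch of steps Q2 and Q3 follows, assuming the stored per-neighbor posteriors and descriptor distances are available from step Q1; normalizing $p(Q = i)$ before the confidence computation is our addition for numerical safety.

```python
# Sketch of steps Q2 and Q3. Each query descriptor's retrieved neighbors are
# (object_id, stored_posterior, distance) tuples, where stored_posterior is
# the pre-computed p(Q = i | f_{i,n}) from Eqn. (5).
import numpy as np

def query_posteriors(neighbor_lists, M, sigma):
    """Eqn. (6): aggregate neighbor posteriors into p(Q = i)."""
    p = np.zeros(M)
    for neighbors in neighbor_lists:          # one list per query descriptor
        for obj_id, stored_posterior, dist in neighbors:
            # G[Q_j - f_{i,n}] depends only on the descriptor distance.
            p[obj_id] += stored_posterior * np.exp(-dist ** 2 / (2 * sigma ** 2))
    return p / max(len(neighbor_lists), 1)

def confidence(p):
    """Eqn. (7): 1 - H(p) / log2(M), in [0, 1]; 1 means a certain match."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    return 1.0 + float((nz * np.log2(nz)).sum()) / np.log2(len(p))
```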

At the end of the above three steps, the top matching candidates (or more generally, say $T_0$ candidates) with posteriors greater than a suitably chosen threshold, say $\tau$, are identified. If $T_0 = 1$, i.e., only one candidate has a posterior probability greater than $\tau$ while those of the remaining candidates are lower than the threshold, then the top candidate is declared as the query result and no outlier removal is performed (pose estimation steps may additionally be performed on the estimated query result for AR applications requiring relative pose). This ability to make a decision at the end of this step with sufficient confidence results in a significant reduction in computational complexity compared to previous work that relies on expensive outlier removal steps to improve recognition rates.

The choice of the threshold $\tau$ is therefore critical in reducing the need for outlier removal and the number of candidates passed on to outlier removal. For instance, a larger value of $\tau$ implies a lower $T_0$, and therefore the outlier removal algorithm needs to be run over fewer candidates. However, if the value of $\tau$ is chosen too high, then very few candidates pass the threshold, reducing recall. In our experiments, we choose $\tau = 0.4$ as a suitable trade-off between computational cost and recall. With this choice of $\tau$, we find that 67% of the query decisions for the ZuBud database result in only one top candidate ($T_0 = 1$), i.e., the decision can be made without outlier removal since only one candidate passes the threshold.

In scenarios where $T_0 > 1$ candidates have posterior probabilities greater than $\tau$, outlier removal steps are performed to identify the best among the top $T_0$. The proposed outlier removal algorithm consists of the following three steps:

O1) SIFT-feature-based distance filtering: This step is based on majority voting. For each of the $T_0$ object candidates in the set and all of its POI views, we compute the number of feature matches between the query and database image. The object-view combinations with the maximum number of descriptor matches are then chosen as the top candidates and used for further processing. In our work, we choose the top $T_1 = 5$ object-view candidates at the end of this step.

O2) Orientation filtering: This step computes the histogram of the orientation differences between the query image and each of the top $T_1$ candidate object-view combinations in the database, and finds the object-view combinations with a large number of inliers that fall within $\theta_0$ degrees, where $\theta_0$ is a suitably chosen threshold ($\theta_0 = 10°$ in our experiments). The top $T_2$ ($T_2 = 2$ in our experiments) candidates with the maximum percentage of inliers are chosen as input to the geometric filtering step.

O3) Geometric filtering: The final step in outlier removal is pose estimation. We use a homography as the transformation model [20] and estimate its parameters using RANdom SAmple Consensus (RANSAC) [21] on the corresponding pairs. The object-view combination that gives the maximum number of inliers is chosen as the closest match, and the corresponding homography estimate is also obtained for AR applications that require relative pose.
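The orientation filtering step (O2) can be sketched as a vote over orientation differences; the bin width and the dominant-bin heuristic below are our assumptions. The geometric filtering step (O3) can reuse a standard RANSAC homography fit such as OpenCV's cv2.findHomography, as in the earlier pipeline sketch (Section 1).

```python
# Sketch of orientation filtering (O2): vote on the query-to-reference
# keypoint orientation differences and count inliers within theta0 degrees of
# the dominant bin.
import numpy as np

def orientation_inlier_fraction(query_angles, ref_angles, theta0=10.0):
    """Angles in degrees for matched keypoint pairs (equal-length arrays)."""
    diffs = (np.asarray(query_angles) - np.asarray(ref_angles)) % 360.0
    hist, edges = np.histogram(diffs, bins=36, range=(0.0, 360.0))
    k = int(np.argmax(hist))
    dominant = 0.5 * (edges[k] + edges[k + 1])       # center of dominant bin
    dev = np.abs(diffs - dominant)
    dev = np.minimum(dev, 360.0 - dev)               # circular distance
    return float(np.mean(dev < theta0))              # inlier fraction
```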

5 PERFORMANCE EVALUATION

In this section, we present an evaluation of the information-optimal database pruning and querying approach proposed in this paper. We demonstrate that the information-theoretic approach can provide very good database compression without compromising object detection accuracy. We evaluate detection performance in terms of the object recognition rate, which is the ratio of the number of true positives to the total number of queries.

5.1 Results with 128-D SIFT over ZuBud database

The ZuBud database [17] has a total of 201 distinct objects, each captured from 5 different views. All the images are VGA size (640 × 480 pixels). We use these 1005 VGA images for training. The ZuBud database also has a separate set of 115 query images, each of half-VGA size, which we use for testing.

We extract standard 128-D SIFT features from all 1005 training images. The average number of descriptors for each object (combining all views) in the database is roughly 12,500, and the total database size is around 660 MB. We obtained recognition rates of the order of 95% with 128-D SIFT without any pruning.

Next, we apply the proposed information-optimal pruning algorithm to remove noise and redundancies in the database. In this study, we fix the distance threshold $\epsilon$ (for intra-object and inter-object pruning) to 0.4 and the number of clusters ($k_c$) per database image view to 20. We vary the number of keypoints ($k_l$) to be selected per cluster from 3 to 15. From each cluster, we first identify the most informative descriptors (by ordering them with respect to their conditional entropy, as described in Section 3) and then select the $k_l$ keypoints with the top scales. In this way, we generate pruned databases whose size per object varies from 300 to 1500 descriptors. Note that this corresponds to a 40x (for 300 features per object) to 8x (for 1500 features per object) reduction in database size.

Figure 5. Comparison of the recognition rate (for the ZuBud database) as a function of pruned database size per POI with 128-D SIFT descriptors; the different curves represent the different distance thresholds (0.1 to 0.7) used for retrieving neighbors in step Q1 of the querying algorithm. A maximum of 2 keypoints were retrieved per query descriptor; the x-axis spans a 40x (300 features per object) to 8x (1500 features per object) reduction in database size.

Figure 6. Comparison of the recognition rate with respect to the distance threshold used during retrieval (step Q1), for pruned databases of 300 to 1500 keypoints per POI. A maximum of 2 keypoints were retrieved per query descriptor. Results are with 128-D SIFT over the ZuBud database.

Figure 5 shows the recognition rate for the ZuBud query images with respect to the pruned database size. The different curves in the figure correspond to different values of the distance retrieval threshold ($\epsilon$) used in the querying algorithm (see steps Q1 and Q2 of the querying algorithm detailed in Section 4). We observe from Figure 5 that the recognition rate improves with the pruned database size. Further, we note that the performance improves with increasing distance retrieval threshold; however, as the threshold increases beyond 0.4, we start to observe a slight degradation in performance because noisy matches retrieved at higher thresholds corrupt our probability estimate in Eqn. (6). Note that for a retrieval threshold of 0.4, the recognition rate achieved is 95% with a 40x reduction in database size and 100% with an 8x reduction. These results are better than standard 128-D SIFT, which provides 95% accuracy. This suggests that the proposed information-optimal pruning algorithm not only reduces database size but also improves the recognition rate by removing noise from the database. Further, our proposed approach performs better than the existing work of Fritz et al. [16], who report a 91% recognition rate based on their pruning approach with 128-D SIFT descriptors.

Next, we slice the data from Figure 5 differently and present the recognition rate with respect to the distance threshold used for retrieval in Figure 6. The different curves represent different database sizes after pruning. Note that for a database size of 300 keypoints per object (i.e., 40x reduction), the recognition rate starts rolling off as the threshold is increased beyond 0.4, as explained above.

5.2 Results with 64-D SURF over ZuBud database

We also tested the ZuBud database using 64-D SURF [5]. The SURF algorithm uses the Fast-Hessian detector for keypoint selection, and the number of keypoints can be controlled by setting appropriate Hessian thresholds. In our work, we vary the Hessian threshold from 500 to 2500 to obtain between roughly 6600 keypoints per object (at a threshold of 500) and 2500 keypoints per object (at a threshold of 2500). In each case, we build a database of SURF features at the extracted keypoint locations and query the database with the SURF features of the query image. The performance of the SURF feature extractor without database pruning is shown in Figure 7. As shown in the figure, the maximum recognition rate achieved using SURF features on the ZuBud database is around 90%, achieved at a Hessian threshold of 500 with a database size of ~180 MB. Further, we notice from the figure that the recognition rate drops from around 90% for a database size of 180 MB (Hessian threshold of 500) to 70% for a smaller database of around 70 MB (Hessian threshold of 2500).

Next, we apply the proposed information-optimal pruning algorithm to the SURF features. We start with a 64-D SURF database of total size 150 MB generated with a Hessian threshold of 700. As in Section 5.1, we set the distance threshold $\epsilon$ (for intra-object and inter-object pruning) to 0.05 and the number of clusters ($k_c$) to 30, and we vary the number of keypoints ($k_l$) to be selected per cluster from 3 to 15 based on their conditional entropy. The curve in Figure 8 shows the performance of 64-D SURF with information-optimal pruning. As can be seen in the figure, the proposed pruning algorithm reduces the database size significantly without compromising the recognition rate. For instance, we obtain recognition rates close to 90-91% at a database size of close to 4-5 MB (around a 30x to 38x reduction). This result suggests that the proposed algorithms can be applied to a wide range of feature extraction algorithms to reduce the size of the database while maintaining good recognition rates.

5.3 Results with UKY benchmark dataset

We also evaluated the proposed techniques on a randomly selected subset of 4800 images from the UKY benchmark database [14][22]. This subset contains 1200 objects with 4 views per object. We included three views of each object in our training database and tested on the fourth view (resulting in a total of 1200 query images). We use the 128-D SIFT descriptor for this study. It is to be noted that this database has a much larger number of objects than ZuBud, and the SIFT algorithm without pruning provides only around 84% recognition accuracy. With the proposed pruning algorithm, the recognition rates improve slightly: we achieve 87.9% and 89.5% recognition rates for 7x and 4x database size reductions, respectively.

Figure 7. Comparison of the recognition rate (for the ZuBud database) as a function of total database size (in MB) for 64-D SURF features without the proposed information-optimal pruning. Different points in the figure correspond to different Hessian thresholds.

Figure 8. Comparison of the recognition rate (for the ZuBud database) as a function of total database size (in MB) for 64-D SURF features with the proposed information-optimal pruning.

5.4 Discussions

Remark on database compression gains: The proposed pruning algorithm reduces database size by removing keypoints and features that are noisy or redundant. Therefore, the exact compression gains obtainable with the proposed algorithm depend to a large extent on the original database and the amount of redundancy present in it. For instance, the ZuBud building database has 201 objects with 5 views per object, and the overlap among these views is quite significant; as a consequence, we obtained close to a 40x database size reduction on ZuBud. On the other hand, the UKY dataset contains a diverse set of objects with very limited redundancy, and therefore the compression gains with the proposed method were comparatively lower, around 4x to 7x. However, it is to be noted that in both cases the proposed algorithms were able to reduce database size without compromising recognition rates. Moreover, in several cases we were able to improve recognition rates by removing noisy keypoints, as can be seen in the results.

Identifying out-of-class images: Information-theoretic pruning and querying can also help identify out-of-class (i.e., out-of-database) images. Figure 9 shows an example where we present out-of-class images to the querying algorithm, which was trained over a pruned ZuBud database of 128-D SIFT descriptors with 8x reduction; the distance threshold used during retrieval was set to 0.3. The plot shows the posterior probability with respect to each of the 201 reference objects in the ZuBud database. Note that the highest posterior probability retrieved for the out-of-class example in Figure 9 is 0.05, compared to 0.42 obtained for an in-class ZuBud query image, suggesting that we could use these posteriors to compute the confidence level and identify out-of-class images.

Figure 9. Example demonstrating the usage of the querying algorithm to identify the out-of-class images. The plots show the posterior probability with respect to each of the 201 reference objects in ZuBud database.

6 CONCLUSIONS

In this paper, we described the building blocks for a vision-based AR system. Using notions from information theory and detailed modeling of the keypoint distribution probabilities, we quantified the conditional entropy of keypoints in the database. We then used this measure to formulate our descriptor selection approach for preparing the pruned database. Based on the priors generated during the pruning steps, we developed an information-theoretic query algorithm and incorporated a geometric verification step to perform robust pose estimation. We demonstrated that our approach provides very good database compression without compromising object detection accuracy. Our results indicate that we can achieve recognition rates close to 95% with around a 40x reduction in database size on ZuBud, and 87.9% with around a 7x reduction on the UKY dataset.

We are currently considering extensions for incremental learning of the databases based on user-generated content, incorporating recognition feedback, and analyzing the scalability of the system for various application scenarios.

REFERENCES

[1] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, "Pose Tracking from Natural Features on Mobile Phones," in Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 125-134, Cambridge, UK, September 2008.

[2] G. Klein and D. Murray, "Parallel Tracking and Mapping on a Camera Phone," in Proceedings of the 8th IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 83-86, Orlando, FL, October 2009.

[3] https://www.icg.tugraz.at/~daniel/HistoryOfMobileAR/

[4] D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[5] H. Bay, T. Tuytelaars, and L. V. Gool, "SURF: Speeded Up Robust Features," in European Conference on Computer Vision, vol. 1, pp. 404-417, 2006.

[6] M. Özuysal, P. Fua, and V. Lepetit, "Fast Keypoint Recognition in Ten Lines of Code," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, Minneapolis, MN, June 2007.

[7] K. Mikolajczyk and C. Schmid, "Scale and Affine Invariant Interest Point Detectors," International Journal of Computer Vision, vol. 60, no. 1, pp. 63-86, October 2004.

[8] K. Mikolajczyk and C. Schmid, "A Performance Evaluation of Local Descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.

[9] G. Takacs, V. Chandrasekhar, N. Gelfand, Y. Xiong, W.-C. Chen, T. Bismpigiannis, R. Grzeszczuk, K. Pulli, and B. Girod, "Outdoor Augmented Reality on Mobile Phone using Loxel-Based Visual Feature Organization," in Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, Canada, October 2008.

[10] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, "CHoG: Compressed Histogram of Gradients: A Low Bit-Rate Feature Descriptor," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2504-2511, Miami Beach, FL, June 2009.

[11] P. Turcot and D. Lowe, "Better matching with fewer features: The selection of useful features in large database recognition problems," in IEEE International Conference on Computer Vision Workshop on Emergent Issues in Large Amounts of Visual Data, pp. 2109-2116, September 2009.

[12] N. Naikal, A. Yang, and S. Sastry, "Informative Feature Selection for Object Recognition via Sparse PCA," Technical report, University of California, Berkeley, 2011.

[13] D. M. Chen, S. S. Tsai, V. Chandrasekhar, G. Takacs, J. Singh, and B. Girod, "Tree Histogram Coding for Mobile Image Matching," in Proceedings of the Data Compression Conference (DCC), Snowbird, UT, March 2009.

[14] D. Nistér and H. Stewénius, "Scalable Recognition with a Vocabulary Tree," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2161-2168, New York, NY, June 2006.

[15] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley-Interscience, 2nd edition, 2000.

[16] G. Fritz, C. Seifert, and L. Paletta, "A Mobile Vision System for Urban Detection with Informative Local Descriptors," in Proceedings of the Fourth IEEE International Conference on Computer Vision Systems, p. 30, January 2006.

[17] H. Shao, T. Svoboda, and L. V. Gool, "ZuBuD: Zürich Buildings Database for Image Based Recognition," ETH Zürich, Tech. Report no. 260, 2003.

[18] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, 2nd edition, 2006.

[19] M. Muja and D. G. Lowe, "Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration," in International Conference on Computer Vision Theory and Applications (VISAPP'09), Lisboa, Portugal, February 2009.

[20] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, March 2004.

[21] M. Fischler and R. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, June 1981.

[22] UKY dataset: http://www.vis.uky.edu/~stewe/ukbench/
