LDB: An Ultra-Fast Feature for
Scalable Augmented Reality on Mobile Devices
Xin Yang1 and Kwang-Ting Cheng2
University of California, Santa Barbara, CA 93106, USA
ABSTRACT
The efficiency, robustness and distinctiveness of a feature
descriptor are critical to the user experience and scalability of a
mobile Augmented Reality (AR) system. However, existing
descriptors are either too compute-expensive to achieve real-time
performance on a mobile device such as a smartphone or tablet, or
not sufficiently robust and distinctive to identify correct matches
from a large database. As a result, current mobile AR systems still
only have limited capabilities, which greatly restrict their
deployment in practice.
In this paper, we propose a highly efficient, robust and
distinctive binary descriptor, called Local Difference Binary
(LDB). LDB directly computes a binary string for an image patch
using simple intensity and gradient difference tests on pairwise
grid cells within the patch. A multiple gridding strategy is applied
to capture the distinct patterns of the patch at different spatial
granularities. Experimental results demonstrate that LDB is
extremely fast to compute and to match against a large database
due to its high robustness and distinctiveness. Compared to BRIEF, the state-of-the-art binary descriptor primarily designed for speed, LDB has similar computational efficiency while achieving greater accuracy and a 5x faster matching speed when matching over a large database with 1.7M+ descriptors.
Keywords: Augmented reality, binary feature descriptor, mobile
devices, object recognition, tracking
Index Terms: I.4.8 [Image Processing and Computer Vision]:
Scene Analysis; H.5.1 [Information Interfaces and Presentation]:
Multimedia Information Systems – Artificial, augmented, and
virtual realities.
1 INTRODUCTION
A scalable Mobile Augmented Reality (MAR) system that can track objects recognized from a large database can facilitate a wide range of applications, such as augmenting an entire book with digital interactions or linking city-wide movie posters and leaflets with multimedia trailers and advertisements.
Several MAR systems [10, 11, 12, 13, 14, 15, 19] have been proposed recently, such as Wagner et al.'s pose tracking [10, 11, 12], Klein and Murray's parallel tracking and mapping [13] and Ta et al.'s SURFTrac [14]. Despite their success in real-time tracking, the scalability of these MAR systems remains limited. In practice, they can only manage a limited number of objects, which
greatly restricts their large-scale deployment. One of the major bottlenecks of today’s MAR systems is the
high complexity of computing, matching and storing feature point descriptors used for recognition and tracking. Most existing MAR systems rely on high-dimensional real-valued descriptors, e.g. SIFT [2] and SURF [1], which are too costly for mobile devices with limited computational and memory resources. As a result, most systems can only afford a small dataset for recognition and tracking.
Compact binary descriptors are very efficient to store and to
match by computing the Hamming distance between descriptors
(i.e. via XOR and bit-count operations), making them attractive options for
MAR. To obtain compact binary descriptors, several approaches
[5, 9, 16] have been proposed to quantize the original descriptors
into bit strings, using very few bits per value. However,
computing these descriptors remains compute-intensive. To
improve the computational efficiency, Calonder et al. propose a
very fast binary descriptor, called BRIEF [3], by directly
generating binary strings from image patches. While efficient,
BRIEF descriptors utilize only intensities of a subset of pixels
within an image patch. Thus they might not be distinctive enough. For example, they might fail to distinguish patches with different gradient changes (see Figures 1 (a)-(b)) or with sparse textures
(see Figures 1 (c)-(d)). Lack of distinctiveness leads to many false
matches when matching against a large database and in turn
degrades the performance of an MAR system. In this paper we present a new binary descriptor, called Local
Difference Binary (LDB). Compared to BRIEF, LDB is more distinctive and robust; it can thus effectively identify correct matches from a large database with only a few verifications, yielding high matching efficiency. The distinctiveness and robustness of LDB are achieved through three steps. First, LDB captures the internal patterns of each image patch through a set of binary tests, each of which compares the average intensity Iavg and the first-order gradients, dx and dy, of a pair of grid cells within the patch (Figure 2 (a) and (b)).
Figure 1: Image patches and BRIEF descriptors. (a) is a patch of
gradually increasing intensity and (b) consists of two homogeneous
sub-regions. ①, ② and ③ illustrate three exemplar binary tests for
them, which lead to identical BRIEF descriptor values for (a) and
(b). (c) and (d) are patches with sparse intensity textures; it is very likely that most binary tests sample pixels in blank regions, yielding non-distinguishing descriptors.
1e-mail: [email protected]
2e-mail: [email protected]
The average intensity and gradients capture both the DC and AC components of a patch, thus providing a more complete description than BRIEF. Second, LDB employs a multiple gridding strategy to capture the structure at different spatial granularities (Figure 2 (c)). Coarse-level grids cancel out high-frequency noise while fine-level grids capture detailed local patterns, thus enhancing distinctiveness. Third, LDB selects a subset of highly variant and distinctive bits and concatenates them to form a compact and unique LDB descriptor. Computing LDB descriptors is extremely fast: relying on integral images, the average intensity and first-order gradients of each grid cell can be obtained with only 4~8 add/subtract operations (the divide operation for averaging can be omitted when comparing two equal-sized grid cells).
Our experimental results show that the discriminative ability and robustness of LDB outperform those of the state-of-the-art binary descriptor BRIEF, while their computational efficiency is about the same. The performance of LDB for MAR applications is demonstrated via a recognition task with 228 objects and patch tracking on a Motorola Xoom1 tablet. Results show that LDB is much faster to match and track than BRIEF, while achieving greater accuracy.
The rest of the paper is organized as follows: Sec. 2 reviews the related work. Sec. 3 presents details of the proposed descriptor. In Sec. 4, we apply LDB to scalable matching and analyze its performance. Sec. 5 provides experimental results on mobile devices for speed, robustness and discriminative power evaluation. Sec. 6 discusses the scale and rotation invariance of LDB and Sec. 7 concludes the paper.
2 RELATED WORK
In this section, we review state-of-the-art visual feature based MAR systems and lightweight local feature descriptors.
2.1 Visual-Feature-Based Mobile Augmented Reality
With the proliferation of mobile devices equipped with low-cost high-quality cameras, there have been several emerging MAR systems based on visual recognition and tracking [10, 11, 12, 13, 14, 15, 19]. In such a system, an image frame, captured by a mobile camera, is first described using a set of local feature descriptors, based on which objects in the frame are recognized. Periodic recognition results are bridged by tracking the recognized contents.
Most recent research efforts focused on improving the tracking speed for MAR. Wagner et al. made significant advancement in pose tracking [10, 11, 12] to meet tight real-time constraints. Klein and Murray [13] implemented parallel tracking and mapping on a mobile phone. Ta et al. [14] improved the speed by tracking SURF features in constrained locations. However, none of these efforts addressed the scalability of MAR. High-dimensional and real-valued descriptors are used, which are expensive to match and store. As a result, these systems can only handle a limited number of objects.
There are also several works facilitating real-time recognition over a larger database for AR. For instance, Taylor et al. [17] proposed a high-speed recognition system supporting typical video frame rates. Lepetit et al. [18] recast matching as a classification problem using a decision tree, trading increased memory usage for less expensive descriptor matching. Takacs et al. [15] exploited the temporal coherency between frames and developed a unified real-time tracking and recognition system. Pilet and Saito [19] presented a learning-based approach for retrieval and tracking which can handle hundreds of pictures. However, the scalability of these systems was only evaluated on desktop computers and not yet optimized for the real-time requirements of handheld devices. Moreover, these systems may require high memory usage and database redundancy. Therefore, for such systems, supporting a
larger database with hundreds of images on a handheld device is very challenging.
Due to the limited computing and memory resources in a mobile device, MAR’s scalability must be addressed – for better user experience and for enabling a wider spectrum of applications. Different from previous works, we focus on improving the scalability of MAR under the efficiency constraint. More specifically, we design a new binary descriptor which is very fast to compute, compact to store and efficient to match against a large dataset due to its distinctiveness and robustness.
2.2 State-of-the-Art Lightweight Descriptors
The increasing demand of handling a larger database on mobile
devices stimulates the development of light-weight descriptors.
Several approaches have been proposed to shorten descriptors to
speed up matching and reduce memory consumption. PCA-SIFT [7] reduced the descriptor from 128 to 36 dimensions to lower the matching cost, but the increased time for descriptor formation largely offsets the gain in matching speed. [9] quantized the floating-point descriptor vectors into a bag of integers using a pre-trained vocabulary. Similarly, [5] and [8] performed quantization and represented each floating-point value using a few bits without compromising recognition performance. While effective, these approaches still require computing the full descriptor before minimizing its size.
Furthermore, though some of these approaches converted the
floating point descriptors into bit strings, the Hamming distance
cannot be used as a metric for matching because the bits are
arranged in blocks of four and thereby cannot be processed
independently.
A binary descriptor, called BRIEF [3], directly generates bit
strings by simple binary tests comparing pixel intensities in a
smoothed image patch. Rublee et al. [4] further enhanced this
descriptor by making it rotation invariant. Such binary descriptors
are extremely fast to compute, compact to store, and efficient to
compare based on the Hamming distance. However, as described
in Sec. 1, BRIEF descriptors utilize only intensities of a subset of pixels within an image patch and thus are not distinctive enough to effectively localize matched patches in large databases. Post-processing to remove false matches is usually required to ensure sufficient recognition accuracy, increasing the total time cost for MAR.
Figure 2: Illustration of LDB extraction. (a) An image patch is divided into 3x3 equal-sized grid cells. (b) The intensity sum (I) and the gradients in the x and y directions (dx and dy) are computed for each cell; I, dx and dy are then compared between every unique pair of cells. (c) Three gridding strategies (with 2x2, 3x3 and 4x4 grids) are applied to capture information at different granularities.
3 LDB: LOCAL DIFFERENCE BINARY
Our feature design is inspired by Shechtman and Irani's self-similarity descriptor [6], in which an image patch is divided into grid cells and cross-correlations between the center grid cell and the others
are computed. Since most of the photometric changes (e.g.
lighting/contrast changes, blurring, image noises etc.) can be
removed by computing the difference between two sub-regions,
the self similarity descriptor is resilient to photometric changes,
yet captures intrinsic structures within the image.
However, computing cross-correlations between grid cells
demands Sum-of-Square-Differences operations for each pixel,
which is computationally expensive and cannot be sped up by
integral images. Moreover, the self similarity descriptor is a real-
valued feature vector thus it is not efficient to match and store.
Finally, many of today's efficient point detectors, e.g. FAST [20], detect points at local intensity extrema, i.e. points that are brighter or darker than all of their surrounding pixels.
Therefore, only considering the differences between the center
grid cells and the surrounding ones may yield indistinctive
descriptors. Here, we avoid cross-correlation operations and create a bit vector based on binary tests between every pair of grid cells.
More specifically, we divide each image patch P into n x n
equal-sized grids, extract representative information from each
grid cell and perform a binary test τ on each pair of grid cells (i and j) as:
$$\tau(\mathrm{Func}(i), \mathrm{Func}(j)) := \begin{cases} 1 & \text{if } \mathrm{Func}(i) - \mathrm{Func}(j) > 0,\ i \neq j, \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$
where $\mathrm{Func}(\cdot)$ is the function for extracting information from a grid cell.
The information used for binary tests determines both the
distinctiveness and efficiency of a descriptor. The most basic yet
fast-to-compute information is the average intensity, which
represents the DC component of a grid and can be computed
extremely fast via the integral image technique. However, the
average intensity is too simple to describe the intensity changes
inside a grid cell. For example, Figure 3 (a) displays three patches with 2x2 grid cells. Consider the upper-left grid cells of the three patches (denoted by red rectangles): although they have very different pixel intensity values and distributions, their average intensities are exactly the same (Figure 3 (b)). A binary test between the diagonal grid cells thus yields identical results for these three distinct patches.
To improve its distinctiveness, we introduce the first-order
gradients, in addition to using the average intensity. It’s known
that gradients are more resilient to photometric changes than
average intensities and can also capture intensity changes inside a
grid such as the magnitude and direction of an edge. Considering
the patches in Figure 3, the first-order gradients in the x and y
directions are able to distinguish them (Figure 3(b)). In addition,
gradients can be computed using box filters which can be easily
sped up by integral images. We define $\mathrm{Func}(\cdot)$ for our LDB as a tuple $\mathrm{Func}(\cdot) = \{\mathrm{Func}_{\text{intensity}}(\cdot), \mathrm{Func}_{dx}(\cdot), \mathrm{Func}_{dy}(\cdot)\}$, where
$$\begin{aligned} \mathrm{Func}_{\text{intensity}}(i) &:= \frac{1}{m}\sum_{k=1}^{m} \mathrm{Intensity}(k), \\ \mathrm{Func}_{dx}(i) &:= \mathrm{Gradient}_x(i), \\ \mathrm{Func}_{dy}(i) &:= \mathrm{Gradient}_y(i), \end{aligned} \qquad (2)$$
and m is the total number of pixels in grid cell i. Since we use
equal-sized grids, m is the same in all cases and can be omitted in
the computation. Gradientx(i) and Gradienty(i) are the regional
gradients of grid cell i in the x and y directions, respectively.
Based on Eq. (2), the average intensity and the x and y gradients
are computed sequentially for a grid, thus each pair of grids
generates 3 bits using binary tests defined in Eq. (1) (Figure 3
(c)).
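To make the construction concrete, below is a minimal sketch of one gridding level in Python with NumPy and OpenCV. The function names, the box-filter split at the cell midpoint and the float conversion are our illustrative choices, not a reference implementation:

```python
import numpy as np
import cv2

def ldb_bits(patch, n=3):
    """LDB bits for one n x n gridding of a grayscale patch (a sketch)."""
    ii = cv2.integral(patch.astype(np.float32))  # (h+1, w+1) integral image
    h, w = patch.shape
    cell_h, cell_w = h // n, w // n

    def box_sum(r0, c0, r1, c1):
        # Sum over patch[r0:r1, c0:c1] via 4 lookups on the integral image.
        return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

    feats = []  # (I, dx, dy) per grid cell
    for r in range(n):
        for c in range(n):
            r0, c0 = r * cell_h, c * cell_w
            r1, c1 = r0 + cell_h, c0 + cell_w
            I = box_sum(r0, c0, r1, c1)          # intensity sum (DC component)
            rm, cm = r0 + cell_h // 2, c0 + cell_w // 2
            dx = box_sum(r0, cm, r1, c1) - box_sum(r0, c0, r1, cm)  # x box gradient
            dy = box_sum(rm, c0, r1, c1) - box_sum(r0, c0, rm, c1)  # y box gradient
            feats.append((I, dx, dy))            # 1/m factor omitted: equal-sized cells

    bits = []
    m = len(feats)
    for i in range(m):            # Eq. (1) on every unique pair of cells
        for j in range(i + 1, m):
            for k in range(3):    # compare I, dx, dy -> 3 bits per pair
                bits.append(1 if feats[i][k] - feats[j][k] > 0 else 0)
    return np.array(bits, dtype=np.uint8)
```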
Performing binary tests on every unique pair of the m = n x n grid cells results in a bit string of length 3m(m-1)/2. As n increases, the distinctiveness of the descriptor increases, but so does the complexity of matching and storage. To form a compact LDB descriptor, we choose a subset of nd bits out of the 3m(m-1)/2 bits to
form a unique LDB descriptor. In the following subsections 3.1
and 3.2, we discuss the gridding and bit selection strategies
respectively.
3.1 Gridding Choices
The choice of the grid size influences both robustness and
distinctiveness of the LDB descriptor. If the gridding is fine, i.e.
having a large number of small grids per patch, corresponding
grids of two similar patches could be misaligned even for a small
amount of patch shifting, lowering the descriptor's stability. However, a descriptor extracted from smaller grid cells captures more detailed local patterns of a patch and is thus more distinctive than one extracted from larger cells. Dividing a patch into a small number of large cells, on the other hand, results in a more stable but less distinctive descriptor.
With this observation, we employ a strategy incorporating
multiple gridding choices to achieve both high robustness and
high distinctiveness. Each image patch is partitioned in multiple
ways, e.g. 2x2, 3x3 and 4x4, from each of which an LDB binary
string is derived. We then concatenate the strings from all the
partitions to form an initial LDB descriptor.
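Continuing the sketch above, forming the initial multi-grid descriptor is a simple concatenation; the grid sizes and the 486-bit count follow Sec. 3.2, and `ldb_bits` is the hypothetical helper from the earlier sketch:

```python
import numpy as np  # ldb_bits: sketch from Sec. 3

def ldb_initial(patch, grid_sizes=(2, 3, 4)):
    # Concatenate the bit strings from all gridding levels; for (2, 3, 4)
    # this yields 18 + 108 + 360 = 486 bits.
    return np.concatenate([ldb_bits(patch, n) for n in grid_sizes])
```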
3.2 Bit Selection
The above-mentioned gridding strategy may produce a long bit
string. For example, concatenating binary strings resulting from
2x2, 3x3 and 4x4 partitions to a patch leads to a binary string of
486 bits. Further increasing the number of ways for partitioning
would result in even more bits. Despite being distinctive, long bit strings reduce the efficiency of matching and storage. Moreover, bits in a long string may be strongly correlated, yielding redundancy. To address this, we employ a bit selection strategy which chooses a subset of the least correlated bits to form the final LDB descriptor.
Figure 3: Illustration of performing binary tests on the pair of diagonal grid cells for three image patches. (a) Three image patches with different pixel intensity values and distributions. (b) The average intensity I and the gradients in the x and y directions, dx and dy, for each pair of diagonal grid cells (red and green). In each of the three cases, the red and green cells have the same I but different dx and dy, so the individual cells can still be distinguished from each other. (c) 3-bit binary test results for the three patches.
There are many options for bit selection. Depending on whether training data is needed, they fall into two categories: unsupervised and supervised. The unsupervised approach does not
involve any training phase; it selects features according to a set of
specific rules. The supervised approach relies on training to
identify a subset of features that have high variance and are
uncorrelated over the training set. Methods in this category
include PCA, LDA, AdaBoost, etc. Both unsupervised and
supervised selection methods are conducted offline.
Here, we use two simple methods: random bit selection as an
unsupervised approach and an entropy-based method as a
supervised approach. While simple, experimental results in Secs.
4 and 5 show that both methods can produce highly effective LDB
descriptors.
3.2.1 Random Bit Selection
Random bit selection is quite simple. We randomly select nd
unique bits out of N bits and apply the same selection to compute
the nd-dimensional LDB descriptors for all images.
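A minimal sketch, reusing the hypothetical `ldb_initial` helper from Sec. 3.1; the seed and sizes below are illustrative:

```python
import numpy as np

# One fixed random subset of bit positions, drawn offline and then applied
# identically to every image.
rng = np.random.default_rng(seed=42)            # fixed seed: reproducible selection
sel = rng.choice(486, size=256, replace=False)  # nd = 256 unique bits out of N = 486

def ldb_random(patch):
    return ldb_initial(patch)[sel]
```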
3.2.2 Entropy-based Bit Selection
The entropy represents the information uncertainty, thus can be
used to identify dimensions of high variances. The entropy-based
method is as follows.
1. Set up a training set of, say, 300k patches drawn from images in a large image database. In our experiment, we use the Document-Natural image set, whose details are described in Sec. 5.
2. Partition each image patch in multiple ways (e.g. 2x2, 3x3, 4x4, …, n x n), yielding an N-bit string with
$$N = \sum_{i=2}^{n} \frac{3\, i^2 (i^2 - 1)}{2}. \qquad (3)$$
The results for the 300k training patches form a matrix of 300k rows by N columns (each column is a binary bit).
3. Compute the entropy of each matrix column as:
$$\mathrm{entropy}_i = -P_i(1)\log(P_i(1)) - P_i(0)\log(P_i(0)), \quad 1 \le i \le N, \qquad (4)$$
where $P_i(1)$ and $P_i(0)$ are the probabilities of "1" and "0" in column i, respectively.
4. Select the nd columns with the largest entropy values; these form the final LDB descriptor.
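The selection itself reduces to a few array operations. The sketch below assumes `train` is the 300k x N binary matrix from step 2; the epsilon guard is our addition to handle constant columns:

```python
import numpy as np

def select_by_entropy(train, nd=256):
    # train: 300k x N binary matrix (rows: patches, columns: bits).
    p1 = train.mean(axis=0)                  # P_i(1) for each column
    p0 = 1.0 - p1
    eps = 1e-12                              # avoid log(0) for constant columns
    ent = -(p1 * np.log(p1 + eps) + p0 * np.log(p0 + eps))  # Eq. (4)
    return np.argsort(ent)[::-1][:nd]        # indices of the nd highest-entropy bits
```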
3.3 Invariance to Transformations
In this section, we test the robustness of our LDB descriptor to
several geometric and photometric transformations that commonly
exist in real-world scenarios. The discriminative ability, efficiency
in descriptor construction and scalable matching are evaluated in Secs. 4 and 5.
3.3.1 Datasets
We perform the evaluation using six publicly available test image sequences [23], depicted in Figure 4. They are designed to test robustness to typical image disturbances occurring in real-world scenarios, covering four categories:
viewpoint changes: the data sets are Graffiti and Wall;
image blur: the data sets are Trees and Bikes;
illumination changes: the data set is Light;
compression artifacts: the data set is Jpg.
For all sequences we match the first image to the remaining
five, yielding five image pairs per sequence. The five pairs are
sorted in order of ascending difficulty. More specifically, Graffiti
and Wall are sorted in order of increasing viewpoint changes to
the first image, Trees and Bikes are sorted in increasing image
blur, and Light and Jpg are in order of increasing illumination
changes and compression artifacts, respectively. Therefore, for all
sequences, pair 1-and-6, i.e. pair 1/6, is much harder to match than
pair 1-and-2, i.e. pair 1/2.
3.3.2 Evaluation
We apply the same evaluation procedure and metric used in the
BRIEF paper [3]. For each test image, we detect N = 300 keypoints using the ORB [4] keypoint detector and compute binary descriptors, i.e. LDB and BRIEF. Then for each keypoint on the
first image, we perform brute-force matching based on Hamming
distance to find the Nearest Neighbor (NN) in the second one and
call it a match for each image pair. To obtain the ground truth of
correct matches, for each keypoint from the first image we infer its corresponding point on the second image using the known homography between them.
Figure 4: Test image sequences, each of which contains 6 images. The first row shows the first image of each sequence and the bottom row shows the last image. Matching the first image against all subsequent images yields 5 image pairs per sequence. All pairs are sorted in order of ascending difficulty.
After that, we count the number of
correct matches nc, i.e. matches whose NN points lie within a predefined distance (10 pixels in this test) of the inferred points. Finally, we compute the Percentage of Correct Matches, nc/N, and use it to measure the robustness of a
descriptor. We use the OpenCV2.3.1 [21] implementation for the
ORB detector and BRIEF descriptor.
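For reference, the Hamming distance used in this brute-force matching reduces to an XOR followed by a bit count; a one-function sketch for packed descriptors:

```python
import numpy as np

def hamming(a, b):
    # a, b: packed binary descriptors as np.uint8 arrays (32 bytes = 256 bits);
    # XOR then popcount, the distance used for brute-force NN matching.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```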
The size of smoothing kernels is a key factor affecting the
robustness of BRIEF descriptors. Rather than performing binary
tests on single pixel pairs, BRIEF relies on a pre-smoothed image
patch around the pixels to reduce the sensitivities to noises and
increase the stability of the descriptors. Generally speaking,
increasing the size of smoothing kernels can improve the
robustness of BRIEF descriptors. However, over-smoothing the
patch may blur its distinctive structure, hence decreasing the
distinctiveness. In this experiment, we tested two kernel sizes, 5x5 and 9x9 (the default setting in the BRIEF paper), and generated 256-bit BRIEF descriptors.
Similarly, the patch size for computing LDB descriptors largely impacts the robustness. In Section 4, we provide a comprehensive discussion of its impact on robustness, distinctiveness and matching efficiency; there, the optimal patch size is selected empirically based on matching experiments over a large database. In this experiment, we tested two patch sizes, 36x36 and 45x45, and employed only random bit selection to form our 256-bit LDB descriptors. Entropy-based bit selection is tested in Section 4 for matching against a large database.
Figure 5 displays the percentage of correct matches for the
BRIEF and LDB descriptors. For all the image sequences except
Wall 1/4 and Wall 1/5, LDB, with patch sizes of both 36x36 and
45x45, achieves a higher percentage of correct matches than
BRIEF with 5x5 or 9x9 smoothing kernels. Specifically, when
comparing LDB (patch size = 45x45) with BRIEF (kernel size =
9x9) on the most challenging pairs 1/6, the percentage of correct
matches achieved by LDB is 17.2% ~ 45.3% higher than that of
BRIEF.
In addition, for LDB, patch size 45x45 outperforms patch size
36x36 for almost all sequences. This result can be explained as follows: a larger patch contains bigger grid cells at each gridding level, and averaging intensity and gradients over larger cells reduces the impact of noise from individual pixels, making the results more resilient to noise.
3.3.3 Distance Distribution
We take a closer look at the distribution of Hamming distances
between the LDB descriptors of various image pairs. Specifically,
we analyze the matches of one image pair 1/5 for each category,
i.e. Graffiti 1/5, Trees 1/5, Jpg 1/5 and Light 1/5. Figure 6 shows
the normalized histograms, or distributions, of Hamming
distances between corresponding points (in red) and non-
corresponding points (in blue). The maximum possible Hamming
distance is 32 x 8 = 256 bits. Ideally, the two curves should be
completely separable. Establishing a match of two images can be
formulated as classifying pairs of points as being a match or not.
If their distributions are separable, a fixed distance threshold can
achieve perfect image classification. According to the results in
Figure 6, all the red curves and blue curves are distinctly
separated. As expected, the distribution of distances for non-
matching points is roughly Gaussian and centered around 128, and
for matching points the centers of the curves have a smaller value,
around 25 ~ 75. For Graffiti 1/5, however, the center of the red curve is around 75, larger than for the other three pairs. This implies greater difficulty in distinguishing matching points from non-matching ones, which is consistent with the result shown in Figure 5, where the percentage of correct matches for Graffiti 1/5 is much lower than for the other three image pairs. We believe one of the reasons is that the Graffiti set includes image rotations between image pairs, whereas LDB does not incorporate a dominant orientation estimator, leading to inferior performance.
Figure 5: Percentage of correct matches on Graffiti, Wall, Jpg, Trees, Bikes and Light sequences.
Figure 6: Distribution of Hamming distance for matching points (red lines) and non-matching points (blue lines) for four image pairs: Graffiti 1/5, Trees 1/5, Light 1/5 and Jpg 1/5.
4 SCALABLE MATCHING OF LDB
In this section we show that LDB outperforms BRIEF in Nearest-
Neighbor (NN) matching over a large database. The better
distinctiveness of LDB accounts for the superior performance. We
start with a brief description of Locality Sensitive Hashing (LSH),
the indexing structure used for large-scale NN matching.
4.1 Locality Sensitive Hashing for LDB
LSH [25] is a widely used technique for approximate NN search.
The core of LSH is a hash function that hashes similar descriptors into the same bucket of a hash table while storing different ones in different buckets. To find the
NN of a query descriptor, we first retrieve its matching bucket and
then check all the descriptors within this bucket using a brute-
force matching.
For binary features, the hash function is usually a subset of bits
from the original bit string: buckets of a hash table contain
descriptors with a common sub-bit-string. The size of the subset,
i.e. hash key size, determines a maximum Hamming distance
among descriptors within the same bucket.
To improve the chance that a correct NN is retrieved by LSH, i.e. the detection rate of NN, two techniques, multi-table and multi-probe [26], are usually used. Multi-table stores the database descriptors in several hash tables, each of which uses a different hash function. In the query phase, each query descriptor is hashed into a bucket of every hash table and the matches within those buckets are checked. Multi-table improves the detection rate of NN at the cost of linearly increased memory usage and matching time. Multi-probe examines both the bucket into which a query descriptor falls and the neighboring buckets. While multi-probe results in more matches to check, it allows for fewer hash tables and thus lower memory usage. In addition, it enables a larger key size and therefore smaller buckets and fewer matches to check per bucket.
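The following toy sketch illustrates the multi-table, bit-sampling indexing scheme described above. The class and parameter names are ours, multi-probe is omitted for brevity, and this is not OpenCV's tuned implementation:

```python
import numpy as np
from collections import defaultdict

class BinaryLSH:
    """Toy multi-table LSH: each table hashes on a random subset of bits."""
    def __init__(self, n_bits=256, key_size=20, n_tables=5, seed=0):
        rng = np.random.default_rng(seed)
        self.keys = [rng.choice(n_bits, size=key_size, replace=False)
                     for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _hash(self, desc, key):
        return tuple(desc[key])              # the sub-bit-string is the bucket id

    def add(self, idx, desc):
        for key, table in zip(self.keys, self.tables):
            table[self._hash(desc, key)].append(idx)

    def candidates(self, desc):
        # Union of buckets across tables; a brute-force Hamming check follows.
        out = set()
        for key, table in zip(self.keys, self.tables):
            out.update(table.get(self._hash(desc, key), []))
        return out
```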
4.2 Evaluation
We use the query time per feature and detection rate of NN to evaluate the matching performance of LDB and BRIEF. The evaluation is performed based on 228 images from Document-Natural images (refer to Sec. 5 for details) for the database and 228 synthetically generated images for the query. The query images are obtained by applying motion blur transformations to the original Document-Natural images, implemented by ImageMagick [22].
For each image, we detect 724 interest points on average using
an ORB detector and compute a 256-bit LDB or BRIEF descriptor
on a patch around each interest point. We experiment with several
sets of LSH parameters with different hash key sizes and numbers of probes. In OpenCV 2.3.1, the number of probes is
specified through the parameter multi-probe level. Level = 0 indicates that only the bucket where a query descriptor falls is checked, level = 1 denotes that neighboring buckets differing by 1 bit are also examined, and so on. Here, we use five hash tables; for each table we set the key size to 20, 24 and 28 bits and the multi-probe level to 0, 1 and 2, yielding 9 sets of LSH parameters in total.
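For illustration, these parameters map onto FLANN's LSH index as exposed by OpenCV's Python bindings (the paper used OpenCV 2.3.1's implementation; the `checks` search parameter below is an illustrative default):

```python
import cv2

FLANN_INDEX_LSH = 6  # FLANN's LSH index for binary descriptors
index_params = dict(algorithm=FLANN_INDEX_LSH,
                    table_number=5,       # five hash tables
                    key_size=20,          # hash key size in bits
                    multi_probe_level=1)  # also probe buckets at 1-bit distance
matcher = cv2.FlannBasedMatcher(index_params, dict(checks=32))
```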
We perform the experiments on Motorola Xoom1 tablet, which
runs Android 3.0 Honeycomb, features 1GB RAM and a dual-
core ARM Cortex-A9 running at a 1GHz clock rate. Although the processor has multiple cores, we use only one core in our experiments.
Figure 7 establishes a correlation between the query time per
feature and the detection rate of NN; a good configuration is one in which many nearest neighbors of the query descriptors can be found in a short time. We notice that, given a detection rate, both
LDB-entropy (using entropy-based bit selection) and LDB-
random (using random bit selection) are much faster for NN
search than BRIEF. For example, to find over 40% of the nearest neighbors, BRIEF spends 1.3ms per feature, achieved using 2-level probing and a 24-bit key. To obtain the same detection rate, LDB-entropy costs only 0.16ms with multi-probe level = 1 and key size = 20, and LDB-random takes only 0.08ms with multi-probe level = 1 and key size = 24. In other words, LDB
achieves an 8x ~ 16x speedup in NN matching. The speedup is
most likely due to the better discriminative ability of LDB descriptors: each bucket contains a smaller number of distinctive yet similar descriptors, leading to fewer but more effective checks. To further validate this, we take a closer look at the distribution of buckets in the hash tables, shown in Figure 8. We observe that LDB (-random and -entropy) makes the buckets of the hash tables more even, and the buckets are much smaller on average compared to BRIEF: because the descriptors are more distinctive, their bits are highly variant and the hash function does a better job of partitioning the data.
Surprisingly, the random bit selection approach provides comparable or even superior performance to the entropy-based approach. One possible explanation is that the entropy-based approach selects bits by maximizing variance rather than the difference (i.e. distance) between matching and non-matching points. Highly variant bits do not always lead to uncorrelated and discriminative bits; therefore, this selection might not greatly improve the performance of LDB descriptors. More advanced methods, such as PCA or AdaBoost, could be incorporated in the future to enhance the performance. From another angle, neither LDB-random nor BRIEF needs any training process, yet LDB-random outperforms BRIEF. Therefore, applying any advanced bit selection approach to LDB should achieve consistently better performance than applying it to BRIEF.
Figure 7: Time cost vs. detection rate. Five hash tables are used to store the database; each table contains 170k entries. Therefore, nearest neighbors are searched over 850k entries for LDB and BRIEF. We experiment with 9 sets of LSH parameters: the hash key size is set to 20, 24 and 28 bits, and for each key size the multi-probe level is set to 0, 1 and 2.
Figure 8: LSH bucket size distribution on the 228-image Document-Natural dataset. Both LDB-entropy and LDB-random give much more homogeneous buckets than BRIEF, thus improving the query speed and accuracy.
5 APPLICATIONS TO MOBILE AUGMENTED REALITY
We apply LDB to scalable MAR applications by implementing a conventional AR pipeline: we first detect ORB points and compute LDB descriptors for a captured image frame. Then we match it against our database and return the database image with the most matches (above a minimum threshold) as a potential recognition result. Finally, we perform RANSAC [24] to validate the result and obtain a pose estimate. Once an object is recognized, it is tracked from the next frame onward by matching features of consecutive frames. The recognizer and tracker are executed complementarily during the entire process: the recognizer is activated whenever the tracker fails or new objects appear, and the tracker bridges the recognition results and speeds up the process.
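A hedged sketch of the matching-and-verification step of such a pipeline, assuming `kp_q`/`desc_q` and `kp_db`/`desc_db` hold ORB keypoints and packed binary descriptors for the query frame and the candidate database image (the match count and inlier threshold are illustrative):

```python
import cv2
import numpy as np

# Brute-force Hamming matching, keep the 40 shortest-distance matches,
# then validate the candidate with a RANSAC homography fit.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(desc_q, desc_db), key=lambda m: m.distance)[:40]
src = np.float32([kp_q[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_db[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # pose + inlier mask
accepted = H is not None and int(mask.sum()) >= 15       # illustrative threshold
```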
In the following subsections, we evaluate the performance of
scalable object recognition and tracking individually.
5.1 Object Recognition on Mobile Devices
We first describe the database, the query set, experimental setup
and measurement metrics used for our evaluation, and then
provide results.
5.1.1 Document-Natural Database and Query Images
Document-Natural Database: our database consists of two
types of images – 109 document images generated from ICME06
proceedings and 119 natural scene images containing buildings
and people selected from Oxford Building 5K, yielding 228
images in total. Such a database could well evaluate the
performance of our method for both textured objects (e.g. natural
scenes) and sparsely textured objects (e.g. texts).
Query images: we manually captured a picture for each
database image. In mobile applications, the captured images are
often blurred due to the fast movement of the device during the
capturing process. In order to examine the performance of descriptors with respect to such distortions, we applied motion blur distortions, implemented with ImageMagick [22], to all query images.
Exemplar database images and query images are displayed in
Figure 11 (a) and (b).
5.1.2 Experimental Setup and Measurement
We extract 1500 features on average from each database image.
All the database features are stored in an LSH indexing structure,
consisting of 5 hash tables with 1.7+M entries in total. For all the
hash tables, we set the key size as 20 and multi-probe level as 1.
For query images, we extract 724 features per image on average.
All code was executed in a single thread running on the 1GHz ARM Cortex-A9 processor of a Motorola Xoom1 tablet.
Four measurement metrics are used:
Detection Rate: number of objects that are successfully
recognized over the total number of objects. In our case,
the detection rate is 1 if all the 228 images are successfully
identified.
Precision: number of correctly recognized objects over the
total number of recognized objects.
Descriptor Computation Time: average time cost in
computing descriptors per image.
Matching Time: average time cost in searching the nearest
neighbor of a query descriptor from the database.
5.1.3 Patch Size Selection
We first select the optimal patch size for our LDB descriptors. We
experiment with 6 different patch sizes: 24x24, 32x32, 36x36,
45x45, 50x50, and 64x64. For every patch size, we tune the
Hamming distance threshold to reject unreliable matches until it
provides an over 99% precision. Figure 9 displays the detection
rate and match time as a function of the patch size. As patch size
increases, the detection rate of LDB-random (blue curve) and
LDB-entropy (red curve) increases monotonically at first and
gradually saturates after 50x50. This is because larger patches include more pixels; averaging more pixels when computing LDB descriptors reduces the negative impact of noise, enhancing the robustness and improving the detection rate. However, averaging too many pixels may also reduce the distinctiveness of LDB descriptors, leading to more distractors due to false matches. Therefore, the detection rate curves become flat after the patch size reaches 50x50. In terms of match time, the time cost decreases when enlarging the patch from 24x24 to 45x45; after that, further enlarging the patch increases the match time. This also reflects how the descriptor's robustness and distinctiveness change with the patch size: undersized patches lack robustness while oversized patches lack distinctiveness. Both cases may increase the number of false matches, leading to more matches to check and longer processing time.
According to Figure 9, the size 45x45 provides the best tradeoff
between the detection rate and the match time for both LDB-
entropy and LDB-random. Thus in the following experiments, we
use 45x45 as the optimal patch size.
5.1.4 Results
Table 1 compares the performance of LDB and BRIEF in terms of
detection rate, precision, descriptor computation time and match
time. Firstly, we compare the performance of BRIEF using
different size of smoothing kernels in the first two rows of Table
1. We observe that BRIEF with smaller kernel (5x5) achieves
Figure 9: Match time and detection rate of LDB as a function of patch size. The detection rate of both LDB-random and LDB-entropy increases when increasing the patch size from 24x24 to 50x50 and saturates after that. The match time decreases when enlarging the patch size from 24x24 to 45x45, and increases when further enlarging the patch size. Based on an overall consideration of detection rate and match time, we select 45x45 as the optimal size for this database.
lower detection rate and precision yet shorter time in matching
than using larger kernel (9x9). This is because smaller kernel
makes BRIEF more sensitive to noises and more distinctive as
well due to less blurring. Secondly, we compare the performance
of BRIEF with 9x9-sized kernel with LDB. We notice that, LDB-
random can achieve slightly greater detection rate (95% vs. 94%),
similar precision (99%) and efficiency for extracting descriptors
(37ms) with BRIEF-9x9; however, a 5x speedup in matching. As
explained in subsection 4.2, the reasons accounts for this are the
better distinctiveness and robustness of LDB descriptors.
5.2 Real-Time Patch Tracking on Mobile Devices
Tracking on mobile devices involves matching live frames to a previously captured frame. Compared to the image pairs matched in recognition, consecutive frames usually share large content overlaps, so achieving satisfactory matching accuracy is less challenging: the photometric changes (e.g. lighting differences) and geometric changes (e.g. scaling) are smaller and more predictable. Therefore, we extract far fewer ORB points for tracking (200) than for recognition (724). We then compute
binary descriptors for each point and search its nearest neighbor
on the previous frame using a brute-force descriptor matching.
The top-ranked putative matches (e.g. 40 matches with shortest
distance in our experiment) are then used in homography
estimation based on RANSAC.
We examine the total time cost in tracking, including cost for
matching and for homography estimation, and the inlier ratio.
Here, the inlier ratio is defined as the number of matches that are
consistent with the homography model over the total number of
top-ranked putative matches. Table 2 displays the results. The
time cost for matching is about the same for both BRIEF-9x9 and
LDB-entropy/-random. This is because the amount of descriptors
and matches to check are the same for both of them. But the time
cost for homography estimation is very different: BRIEF-9x9 spends 142ms, almost twice the time cost of LDB (77ms for the entropy-based approach and 79ms for the random approach). Also, the inlier ratio of BRIEF is 29%~31% lower than that of LDB-entropy/-random. These results further demonstrate that LDB is more distinctive, so far fewer false matches are generated (as shown in Figure 10), leading to a shorter time for homography estimation and a higher inlier ratio.
6 DISCUSSION
In this paper we focus on validating the idea of LDB; invariance to scaling and rotation will be addressed and evaluated in the future. In principle, LDB can be combined with any scale-invariant point detector to cope with scale changes. To make LDB invariant to in-plane rotation, one popular solution is to estimate a dominant orientation for a patch and align the patch to it before computing the descriptor. One problem with
computing LDB descriptors for rotated patches is that the integral
image of an entire image cannot be directly used for speeding up
the computations because the patch axis is not parallel to the
integral image axis. To address this problem, we can compute a
rotated integral image for each rotated patch. The rotated integral
image of a patch is calculated by summing up pixels along the
dominant orientation, instead of horizontal axis. Based on the
rotated integral image of the patch, we can efficiently compute the
intensity summation and gradients of any grid cell within the
patch.
In general, computing steered LDB will not noticeably increase the feature extraction time, because the added cost of computing rotated integral images for the patches is largely offset by the saved cost of computing an integral image of the entire image. For example, computing an integral image of a 1280x800 image involves 1280x800 ≈ 1M operations. In comparison, computing 500 steered LDB descriptors also involves roughly 1M operations (500 patches, each of size 45x45). Another
overhead of steered LDB is the computation of the coordinates of
rotated patches. In principle, such cost can be saved by quantizing
orientations and pre-storing the rotated coordinates in lookup
tables.
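As a simple stand-in for the rotated-integral-image scheme, one can steer LDB by explicitly rotating the image and reusing the upright pipeline; this sketch trades the proposed optimization for clarity (border handling is omitted, and `ldb_initial` is the hypothetical helper from Sec. 3.1):

```python
import cv2

def steered_ldb(image, center, angle_deg, patch_size=45):
    # Rotate the whole image so the patch axis aligns with the dominant
    # orientation, then crop an upright patch and compute upright LDB.
    M = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    x, y = int(round(center[0])), int(round(center[1]))
    r = patch_size // 2
    patch = rotated[y - r:y + r + 1, x - r:x + r + 1]
    return ldb_initial(patch)
```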
7 CONCLUSION
In this paper, we have introduced a new binary descriptor, called
LDB, to facilitate scalable AR on mobile devices. LDB is
extremely fast to compute by using the integral image technique.
Computing 700+ LDB descriptors takes less than 40ms on a
Motorola Xoom1 tablet. LDB is also very efficient for NN
matching over a large database. This is mainly due to its high
distinctiveness and robustness, which greatly reduce the number
of matches to check even in a large database.
Though we apply LDB to MAR applications using a
conventional pipeline, it can be directly incorporated into a more
advanced flow, e.g. a unified recognition-and-tracking flow, to
further reduce the time cost of feature detection and description by exploiting the temporal coherency between consecutive
Table 1. Performance of the binary descriptors BRIEF and LDB for scalable object recognition. LDB-random achieves a slightly greater detection rate and a 5x speedup in matching compared with BRIEF-9x9, while maintaining the same precision and computational efficiency.

Descriptor  | Detection Rate | Precision | Desc. Time (ms) | Match Time (ms)
BRIEF-5x5   | 0.88           | 0.99      | 37              | 411
BRIEF-9x9   | 0.94           | 0.99      | 37              | 909
LDB-entropy | 0.93           | 0.99      | 36              | 180
LDB-random  | 0.95           | 0.99      | 37              | 178
(a) matches of BRIEF (b) matches of LDB-entropy
Figure 10: 40 top-ranked putative matches of (a) BRIEF descriptors and (b) LDB-entropy descriptors between two consecutive frames. The green lines indicate the correct matches, i.e. inliers, and the red lines denote the false matches, i.e. outliers. LDB-entropy generates fewer false matches than BRIEF does.
Table 2: Time cost for matching and homography estimation using RANSAC, and inlier ratio.

Descriptor  | Match Time (ms) | RANSAC Time (ms) | Inlier Ratio
BRIEF-9x9   | 25              | 142              | 46%
LDB-entropy | 25              | 77               | 67%
LDB-random  | 25              | 79               | 65%
frames. Rotation invariance of LDB will be implemented and evaluated in the future. Future work also includes exploring more advanced bit selection methods based on machine learning, such as AdaBoost.
REFERENCES
[1] Bay, H., Ess, A., Tuytelaars, T. and Gool, L.V. SURF: Speeded-Up
Robust Features. In Proc. of ECCV’06.
[2] Lowe, D. G., Distinctive Image Features from Scale-Invariant
Keypoints. In International Journal of Computer Vision, 60, 2, pp.
91-110, 2004.
[3] Calonder, M., Lepetit, V., Strecha, C., and Fua, P. Brief: Binary
Robust Independent Elementary Features. In Proc. of ECCV’10.
Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. ORB: An
Efficient Alternative to SIFT or SURF. In Proc. of ICCV’11,
Barcelona, Spain.
Tuytelaars, T., and Schmid, C. Vector Quantizing Feature Space with a Regular Lattice. In Proc. of ICCV'07.
Shechtman, E., and Irani, M. Matching Local Self-Similarities
across Images and Videos. In Proc. of CVPR’07.
[7] Ke, Y., and Sukthankar, R. PCA-SIFT: A More Distinctive
Representation for Local Image Descriptors. In Proc. of CVPR’04.
[8] Calonder, M., Lepetit, V., Konolige K., Bowman, J., Mihelich, P.,
and Fua, P. Compact Signatures for High-Speed Interest Point
Description and Matching, In Proc. of ICCV’09.
[9] Nister, D., and Stewenius, H. Scalable Recognition with a
Vocabulary Tree. In Proc. of CVPR’06.
[10] Wagner, D., Reitmayr, G., Mulloni, A., Drummond, T. and
Schmalstieg, D. Pose tracking from natural features on mobile
phones. In Proc. of ISMAR’08.
[11] Wagner, D., Schmalstieg, D., and Bischof, H. Multiple Target
Detection and Tracking with Guaranteed Framerates on Mobile
Phones. In Proc. of ISMAR’09.
[12] Wagner, D., Mulloni, A., Langlotz, T., and Schmalstieg, D. Real-
time Panoramic Mapping and tracking on Mobile Phones. In Proc of
IEEE VR’10.
[13] Klein, G., and Murray, D. Parallel tracking and mapping on a camera
phone. In Proc. of ISMAR’09, Orlando, October.
Ta, D.N., Chen, W.C., Gelfand, N., and Pulli, K. SURFTrac:
Efficient Tracking and Continuous Object Recognition using Local
Feature Descriptors. In Proc. of CVPR’09.
[15] Takacs, G., Chandrasekhar, V., and Tsai, S. Unified Real-Time
Tracking and Recognition with Rotation-Invariant Fast Features. In
Proc. of CVPR’10.
[16] Torralba, A., Fergus, R., Weiss, Y. Small Codes and Large
Databases for Recognition. In Proc. of CVPR’08.
[17] Taylor, S., Rosten, E., Drummond, T. Robust feature matching in
2.3us. In Proc. of CVPR’09.
[18] Lepetit, V., Lagger, P., and Fua, P. Randomized Trees for Real-time
Keypoint Recognition. In Proc. of CVPR’05, pp. 775-781.
[19] Pilet, J., and Saito, H. Virtually Augmenting Hundreds of Real
Pictures: An Approach based on Learning, Retrieval, and Tracking.
In Proc. of VR’10.
[20] Rosten, E., Porter, R., and Drummond, T. Faster and Better: A
machine learning approach to corner detection. IEEE Trans. PAMI,
32:105–119, 2010.
[21] OpenCV2.3.1, http://sourceforge.net/projects/opencvlibrary/
ImageMagick, http://www.imagemagick.org/script/index.php
[23] ImageSequences, http://www.robots.ox.ac.uk/~vgg/research/affine.
[24] Chum, O., and Matas, J. Matching with PROSAC – progressive
sample consensus. In Proc. of CVPR’05, volume 1, pages 220–226.
Gionis, A., Indyk, P., and Motwani, R. Similarity search in high
dimensions via hashing. In Proc. of VLDB’99.
Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. Multi-
probe LSH: efficient indexing for high-dimensional similarity
search. In Proc. of VLDB’07.
Figure 11: (a) Exemplar images from the Document-Natural database, (b) recognized images from the query set, and (c) pictures of our object recognition system running on a Motorola Xoom1 tablet. The green rectangles illustrate the pose between the camera and the recognized object, and the red dots denote matched points. The semi-transparent marker on the original image is used to guide the user's capture; it is not used for recognition.