
LDB: An Ultra-Fast Feature for

Scalable Augmented Reality on Mobile Devices

Xin Yang1 and Kwang-Ting Cheng2

University of California, Santa Barbara, CA 93106, USA

ABSTRACT

The efficiency, robustness and distinctiveness of a feature descriptor are critical to the user experience and scalability of a mobile Augmented Reality (AR) system. However, existing descriptors are either too compute-expensive to achieve real-time performance on a mobile device such as a smartphone or tablet, or not sufficiently robust and distinctive to identify correct matches from a large database. As a result, current mobile AR systems still have only limited capabilities, which greatly restricts their deployment in practice.

In this paper, we propose a highly efficient, robust and distinctive binary descriptor, called Local Difference Binary (LDB). LDB directly computes a binary string for an image patch using simple intensity and gradient difference tests on pairwise grid cells within the patch. A multiple-gridding strategy is applied to capture the distinct patterns of the patch at different spatial granularities. Experimental results demonstrate that LDB is extremely fast to compute and to match against a large database due to its high robustness and distinctiveness. Compared to the state-of-the-art binary descriptor BRIEF, which is primarily designed for speed, LDB has similar computational efficiency while achieving greater accuracy and 5x faster matching when matching over a large database with 1.7M+ descriptors.

Keywords: Augmented reality, binary feature descriptor, mobile devices, object recognition, tracking

Index Terms: I.4.8 [Image Processing and Computer Vision]: Scene Analysis; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems – Artificial, augmented, and virtual realities.

1 INTRODUCTION

A scalable Mobile Augmented Reality (MAR) system that can track objects recognized from a large database can facilitate a wide range of applications, such as augmenting an entire book with digital interactions or linking city-wide movie posters or leaflets with multimedia trailers and advertisements.

Several MAR systems [10, 11, 12, 13, 14, 15, 19] have been proposed recently, such as Wagner et al.'s pose tracking [10, 11, 12], Klein and Murray's parallel tracking and mapping [13] and Ta et al.'s SURFTrac [14]. Despite their success in real-time tracking, the scalability of these MAR systems remains limited. In practice, they can only manage a limited number of objects, which greatly restricts their large-scale deployment.

One of the major bottlenecks of today's MAR systems is the high complexity of computing, matching and storing the feature point descriptors used for recognition and tracking. Most existing MAR systems rely on high-dimensional real-valued descriptors, e.g. SIFT [2] and SURF [1], which are too costly for mobile devices with limited computational and memory resources. As a result, most systems can only afford a small dataset for recognition and tracking.

Compact binary descriptors are very efficient to store and to match by computing the Hamming distance between descriptors (i.e. via XOR and bit-count operations), making them attractive options for MAR. To obtain compact binary descriptors, several approaches [5, 9, 16] have been proposed to quantize the original descriptors into bit strings, using very few bits per value. However, computing these descriptors remains compute-intensive. To improve the computational efficiency, Calonder et al. propose a very fast binary descriptor, called BRIEF [3], which directly generates binary strings from image patches. While efficient, BRIEF descriptors utilize only the intensities of a subset of pixels within an image patch and thus might not be distinctive enough. For example, they might fail to distinguish patches with different gradient changes (see Figures 1 (a)-(b)) or with sparse textures (see Figures 1 (c)-(d)). This lack of distinctiveness leads to many false matches when matching against a large database, which in turn degrades the performance of an MAR system.

In this paper we present a new binary descriptor, called Local Difference Binary (LDB). Compared to BRIEF, LDB is more distinctive and robust; thus it can effectively identify correct matches from a large database with few verifications, yielding high efficiency in matching. The distinctiveness and robustness of LDB are achieved through 3 steps. First, LDB captures the internal patterns of each image patch through a set of binary tests, each of which compares the average intensity Iavg and first-order gradients, dx and dy, of a pair of grid cells within the patch (Figure 2 (a) and (b)). The average intensity and gradients capture

Figure 1: Image patches and BRIEF descriptors. (a) is a patch of gradually increasing intensity and (b) consists of two homogeneous sub-regions. ①, ② and ③ illustrate three exemplar binary tests, which lead to identical BRIEF descriptor values for (a) and (b). (c) and (d) are patches of sparse intensity textures; it is very likely that most binary tests sample pixels in blank regions, yielding non-distinguishing descriptors.

1e-mail: [email protected]

2e-mail: [email protected]

both the DC and AC components of a patch, thus providing a more complete description than BRIEF. Second, LDB employs a multiple-gridding strategy to capture the structure at different spatial granularities (Figure 2 (c)). Coarse-level grids can cancel out high-frequency noise while fine-level grids can capture detailed local patterns, thus enhancing distinctiveness. Third, LDB selects a subset of highly-variant and distinctive bits and concatenates them to form a compact and unique LDB descriptor. Computing LDB descriptors is extremely fast: relying on integral images, the average intensity and first-order gradients of each grid can be obtained with only 4~8 add/subtract operations (the divide operation for averaging can be omitted when comparing two equal-sized grid cells).
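As a rough illustration of how integral images enable this, the sketch below (in Python/NumPy; the helper names and the half-cell box-filter layout are our own assumptions, not the paper's implementation) obtains a cell's intensity sum and x/y gradients from a handful of summed-area-table lookups:

```python
import numpy as np

def integral_image(img):
    # Summed-area table with a zero row/column so region sums are pure lookups.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def region_sum(ii, y0, x0, y1, x1):
    # Sum of img[y0:y1, x0:x1] from 4 lookups (3 add/subtract operations).
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def cell_features(ii, y0, x0, h, w):
    # Intensity sum of the cell plus box-filter gradients in x and y,
    # taken here as the difference between the right/left (resp.
    # bottom/top) halves of the cell.
    s  = region_sum(ii, y0, x0, y0 + h, x0 + w)
    dx = (region_sum(ii, y0, x0 + w // 2, y0 + h, x0 + w)
          - region_sum(ii, y0, x0, y0 + h, x0 + w // 2))
    dy = (region_sum(ii, y0 + h // 2, x0, y0 + h, x0 + w)
          - region_sum(ii, y0, x0, y0 + h // 2, x0 + w))
    return s, dx, dy
```

Since the integral image is built once per frame, each cell's features cost only a constant number of additions and subtractions, independent of the cell size.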

Our experimental results show that the discriminative ability and robustness of LDB outperform the state-of-the-art binary descriptor BRIEF, while their computational efficiency is about the same. The performance of LDB for MAR applications is demonstrated via a recognition task with 228 objects and patch tracking on a Motorola Xoom1 tablet. Results show that LDB is much faster to match and track than BRIEF, while achieving a greater accuracy.

The rest of the paper is organized as follows: Sec. 2 reviews the related work. Sec. 3 presents details of the proposed descriptor. In Sec. 4, we apply LDB to scalable matching and analyze its performance. Sec. 5 provides experimental results on mobile devices for speed, robustness and discriminative power evaluation. Sec. 6 discusses the scale and rotation invariance of LDB and Sec. 7 concludes the paper.

2 RELATED WORK

In this section, we review state-of-the-art visual feature based MAR systems and lightweight local feature descriptors.

2.1 Visual-Feature-Based Mobile Augmented Reality

With the proliferation of mobile devices equipped with low-cost high-quality cameras, there have been several emerging MAR systems based on visual recognition and tracking [10, 11, 12, 13, 14, 15, 19]. In such a system, an image frame, captured by a mobile camera, is first described using a set of local feature descriptors, based on which objects in the frame are recognized. Periodic recognition results are bridged by tracking the recognized contents.

Most recent research efforts focused on improving the tracking speed for MAR. Wagner et al. made significant advancement in pose tracking [10, 11, 12] to meet tight real-time constraints. Klein and Murray [13] implemented parallel tracking and mapping on a mobile phone. Ta et al. [14] improved the speed by tracking SURF features in constrained locations. However, none of these efforts addressed the scalability of MAR. High-dimensional and real-valued descriptors are used, which are expensive to match and store. As a result, these systems can only handle a limited number of objects.

There are also several works that facilitate real-time recognition over a larger database for AR. For instance, Taylor [17] proposed a high-speed recognition system supporting typical video frame rates. Lepetit et al. [18] recast matching as a classification problem using a decision tree, trading increased memory usage for cheaper descriptor matching. Takacs et al. [15] exploited the temporal coherency between frames and developed a unified real-time tracking and recognition system. Julien et al. [19] presented a learning-based approach for retrieval and tracking which can handle hundreds of pictures. However, the scalability of these systems was only evaluated on desktop computers and not yet optimized for real-time requirements on handheld devices. Moreover, these systems may require high memory usage and database redundancy. Therefore, for such systems, supporting a larger database with hundreds of images on a handheld device is very challenging.

Due to the limited computing and memory resources in a mobile device, MAR’s scalability must be addressed – for better user experience and for enabling a wider spectrum of applications. Different from previous works, we focus on improving the scalability of MAR under the efficiency constraint. More specifically, we design a new binary descriptor which is very fast to compute, compact to store and efficient to match against a large dataset due to its distinctiveness and robustness.

2.2 State-of-the-Art Lightweight Descriptors

The increasing demand for handling larger databases on mobile devices stimulates the development of lightweight descriptors. Several approaches have been proposed to shorten descriptors to speed up matching and reduce memory consumption. PCA-SIFT [2] reduced the descriptor from 128 to 36 dimensions to cut the matching cost, but the increased time for descriptor formation almost cancels out the faster matching. [9] quantized the floating-point descriptor vectors into a bag of integers using a pre-trained vocabulary. Similarly, [5] and [8] performed quantization and represented each floating-point value using a few bits without compromising recognition performance. While effective, these approaches still require computing the full descriptor before minimizing its size. Furthermore, though some of these approaches convert the floating-point descriptors into bit strings, the Hamming distance cannot be used as a matching metric because the bits are arranged in blocks of four and thus cannot be processed independently.

A binary descriptor, called BRIEF [3], directly generates bit strings through simple binary tests comparing pixel intensities in a smoothed image patch. Rublee et al. [4] further enhanced this descriptor by making it rotation invariant. Such binary descriptors are extremely fast to compute, compact to store, and efficient to compare based on the Hamming distance. However, as described in Sec. 1, BRIEF descriptors utilize only the intensities of a subset of pixels within an image patch and thus are not distinctive enough to effectively localize matched patches in large databases. Post-processing for removing false matches is usually required to ensure sufficient recognition accuracy, increasing the total time cost for MAR.

Figure 2: Illustration of LDB extraction. (a) An image patch is divided into 3x3 equal-sized grids. (b) Compute the intensity summation (I) and the gradients in the x and y directions (dx and dy) of each grid cell, and compare I, dx and dy between every unique pair of cells. (c) Three gridding strategies (with 2x2, 3x3 and 4x4 grids) are applied to capture information at different granularities.

3 LDB: LOCAL DIFFERENCE BINARY

Our feature design is inspired by Shechtman and Irani's self-similarity descriptor [6], in which an image patch is divided into grid cells and cross-correlations between the center grid cell and the others are computed. Since most photometric changes (e.g. lighting/contrast changes, blurring, image noise, etc.) can be removed by computing the difference between two sub-regions, the self-similarity descriptor is resilient to photometric changes, yet captures intrinsic structures within the image.

However, computing cross-correlations between grid cells demands Sum-of-Squared-Differences operations for each pixel, which is computationally expensive and cannot be sped up by integral images. Moreover, the self-similarity descriptor is a real-valued feature vector and thus is not efficient to match and store. Finally, many of today's efficient point detectors, e.g. FAST [20], detect points at local intensity extremes, i.e. points whose intensities are brighter or darker than all surrounding pixels. Therefore, considering only the differences between the center grid cell and the surrounding ones may yield indistinctive descriptors. Here, we avoid cross-correlation operations and create a bit vector based on binary tests between every pair of grid cells.

More specifically, we divide each image patch P into n x n equal-sized grid cells, extract representative information from each cell and perform a binary test τ on each pair of grid cells (i and j) as:

τ(Func(i), Func(j)) := 1 if (Func(i) − Func(j)) > 0 and i ≠ j, 0 otherwise,  (1)

where Func(·) is the function for extracting information from a grid cell.

The information used for the binary tests determines both the distinctiveness and the efficiency of a descriptor. The most basic yet fast-to-compute information is the average intensity, which represents the DC component of a grid cell and can be computed extremely fast via the integral image technique. However, the average intensity is too simple to describe the intensity changes inside a grid cell. For example, Figure 3 (a) displays three patches with 2x2 grid cells. Consider the up-left grid cells of the three patches (denoted by red rectangles); although they have very different pixel intensity values and distributions, their intensity means are exactly the same (Figure 3 (b)). Binary tests between diagonal grid cells therefore yield identical results for these three distinct patches.

To improve its distinctiveness, we introduce first-order gradients in addition to the average intensity. It is known that gradients are more resilient to photometric changes than average intensities and can also capture intensity changes inside a grid cell, such as the magnitude and direction of an edge. Considering the patches in Figure 3, the first-order gradients in the x and y directions are able to distinguish them (Figure 3 (b)). In addition, gradients can be computed using box filters, which can easily be sped up by integral images. We define Func(·) for our LDB as a tuple Func(·) = {Func_intensity(·), Func_dx(·), Func_dy(·)}, where

Func_intensity(i) := (1/m) Σ_{k=1~m} Intensity(k),
Func_dx(i) := Gradient_x(i),
Func_dy(i) := Gradient_y(i),  (2)

and m is the total number of pixels in grid cell i. Since we use equal-sized grid cells, m is the same in all cases and can be omitted in the computation. Gradient_x(i) and Gradient_y(i) are the regional gradients of grid cell i in the x and y directions, respectively. Based on Eq. (2), the average intensity and the x and y gradients are computed sequentially for a grid cell; thus each pair of grid cells generates 3 bits using the binary tests defined in Eq. (1) (Figure 3 (c)).
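The steps above can be sketched as follows (a simplified Python version under our own assumptions: plain array sums in place of integral images, and half-cell differences as the box-filter gradients):

```python
import numpy as np
from itertools import combinations

def ldb_bits(patch, n):
    # Split the patch into n x n equal-sized cells; for each cell compute the
    # intensity sum (the 1/m factor cancels for equal-sized cells) and
    # half-cell difference gradients, then run the 3 binary tests of Eq. (1)
    # on every unique pair of cells.
    h, w = patch.shape[0] // n, patch.shape[1] // n
    feats = []
    for r in range(n):
        for c in range(n):
            cell = patch[r*h:(r+1)*h, c*w:(c+1)*w].astype(np.int64)
            s = cell.sum()
            dx = cell[:, w//2:].sum() - cell[:, :w//2].sum()
            dy = cell[h//2:, :].sum() - cell[:h//2, :].sum()
            feats.append((s, dx, dy))
    bits = []
    for i, j in combinations(range(n * n), 2):
        for k in range(3):  # one test each for I, dx, dy
            bits.append(1 if feats[i][k] - feats[j][k] > 0 else 0)
    return bits
```

For a 3x3 gridding this produces 3 · 9 · 8 / 2 = 108 bits per patch.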

Performing binary tests on all pairs out of the n x n grid cells results in a bit string of length 3q(q−1)/2, where q = n² is the number of cells. As n increases, the distinctiveness of the descriptor increases, but so does the complexity of matching and storage. To form a compact LDB descriptor, we choose a subset of nd of these bits. In the following subsections 3.1 and 3.2, we discuss the gridding and bit selection strategies, respectively.

3.1 Gridding Choices

The choice of grid size influences both the robustness and the distinctiveness of the LDB descriptor. If the gridding is fine, i.e. a large number of small cells per patch, corresponding cells of two similar patches can be misaligned even by a small amount of patch shifting, so the descriptor's stability is lower. On the other hand, a descriptor extracted from smaller cells captures more detailed local patterns of a patch and is thus more distinctive than one extracted from larger cells. Dividing a patch into a small number of large cells, conversely, results in a more stable but less distinctive descriptor.

With this observation, we employ a strategy incorporating multiple gridding choices to achieve both high robustness and high distinctiveness. Each image patch is partitioned in multiple ways, e.g. 2x2, 3x3 and 4x4, and an LDB binary string is derived from each partition. We then concatenate the strings from all the partitions to form an initial LDB descriptor.
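Assuming each n x n partition contributes 3 bits per unique pair of its q = n² cells (our reading of the pairwise-test counts in Sec. 3), the length of this initial concatenated descriptor can be checked as:

```python
def ldb_length(partitions=(2, 3, 4)):
    # Each n x n partition has q = n*n cells; every unique cell pair
    # contributes 3 bits (one test each for intensity, dx and dy).
    return sum(3 * (n * n) * (n * n - 1) // 2 for n in partitions)
```

For the 2x2/3x3/4x4 partitioning this reproduces the 486 bits quoted in Sec. 3.2.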

3.2 Bit Selection

The above-mentioned gridding strategy may produce a long bit string. For example, concatenating the binary strings resulting from the 2x2, 3x3 and 4x4 partitions of a patch leads to a binary string of 486 bits. Further increasing the number of partitionings would result in even more bits. Though distinctive, long bit strings reduce the efficiency of matching and storage. Moreover, bits in a long string may be strongly correlated, yielding redundancy. To address this, we employ a bit selection strategy which chooses a subset of least-correlated bits to form the final LDB descriptor.

Figure 3: Illustration of performing binary tests on the pair of diagonal grid cells for three image patches. (a) Three image patches with different pixel intensity values and distributions. (b) The average intensity value I and the gradients in the x and y directions, dx and dy, for each pair of diagonal grid cells (red and green). In each of these three cases, the red and green cells have the same I but different dx and dy, so individual cells can still be distinguished from each other. (c) 3-bit binary test results for the three patches.

There are many options for bit selection. In general, depending on whether training data is needed, there are two categories: unsupervised and supervised. The unsupervised approach does not involve any training phase; it selects features according to a set of specific rules. The supervised approach relies on training to identify a subset of features that have high variance and are uncorrelated over the training set. Methods in this category include PCA, LDA, AdaBoost, etc. Both unsupervised and supervised selection are conducted offline.

Here, we use two simple methods: random bit selection as an unsupervised approach and an entropy-based method as a supervised approach. While simple, experimental results in Secs. 4 and 5 show that both methods produce highly effective LDB descriptors.

3.2.1 Random Bit Selection

Random bit selection is quite simple. We randomly select nd unique bits out of the N bits and apply the same selection when computing the nd-dimensional LDB descriptors for all images.
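A minimal sketch of this selection (function names are our own; the essential point is that one random index set is drawn offline and then reused for every descriptor so the selected bits stay comparable across images):

```python
import numpy as np

def make_random_selection(N, nd, seed=0):
    # Draw nd unique bit positions out of N once, offline. Fixing the seed
    # makes the selection reproducible across database and query sides.
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(N, size=nd, replace=False))

def select_bits(bits, idx):
    # Apply the same precomputed index set to every full-length bit string.
    return bits[idx]
```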

3.2.2 Entropy-based Bit Selection

The entropy represents information uncertainty and thus can be used to identify dimensions of high variance. The entropy-based method is as follows.

1. Set up a training set of, say, 300k patches, drawn from images in a large image database. In our experiment, we use the Document-Natural image set whose details are described in Sec. 5.

2. For each image patch, partition it in multiple ways (e.g. 2x2, 3x3, 4x4, ..., n x n), yielding an N-bit string, where

N = Σ_{i=2~n} 3 i²(i²−1)/2 .  (3)

The results for the 300k training patches form a matrix of 300k rows by N columns (each column is a binary bit).

3. Compute the entropy of each matrix column as:

entropy_i = −P_i(1) log(P_i(1)) − P_i(0) log(P_i(0)),  (1 ≤ i ≤ N),  (4)

where P_i(1) and P_i(0) are the “1” and “0” probabilities of column i, respectively.

4. Select the nd columns with the largest entropy values; these form the final LDB descriptor.
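The column-entropy ranking of steps 2-4 can be sketched as follows (our own illustrative code; the log base only rescales the entropy values and does not change the ranking):

```python
import numpy as np

def entropy_select(B, nd):
    # B: (num_patches, N) 0/1 matrix of training-set bit strings.
    # Column entropy per Eq. (4): -p1*log(p1) - p0*log(p0), maximal when
    # a bit is "1" for half of the training patches.
    p1 = B.mean(axis=0)
    p0 = 1.0 - p1
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = -(np.where(p1 > 0, p1 * np.log2(p1), 0.0)
                + np.where(p0 > 0, p0 * np.log2(p0), 0.0))
    # Indices of the nd highest-entropy columns, kept in ascending order
    # so the same selection can be applied to every descriptor.
    return np.sort(np.argsort(ent)[::-1][:nd])
```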

3.3 Invariance to Transformations

In this section, we test the robustness of our LDB descriptor against several geometric and photometric transformations that commonly occur in real-world scenarios. The discriminative ability, the efficiency of descriptor construction and scalable matching are evaluated in Secs. 4 and 5.

3.3.1 Datasets

We perform the evaluation using six publicly available test image sequences [23], depicted in Figure 4. They are designed to test robustness to typical image disturbances occurring in real-world scenarios, covering 4 categories:

viewpoint changes: the Graffiti and Wall data sets;
image blur: the Trees and Bikes data sets;
illumination changes: the Light data set;
compression artifacts: the Jpg data set.

For all sequences we match the first image to the remaining five, yielding five image pairs per sequence. The five pairs are sorted in order of ascending difficulty. More specifically, Graffiti and Wall are sorted in order of increasing viewpoint change relative to the first image, Trees and Bikes in order of increasing image blur, and Light and Jpg in order of increasing illumination change and compression artifacts, respectively. Therefore, for all sequences, pair 1-and-6, i.e. pair 1/6, is much harder to match than pair 1-and-2, i.e. pair 1/2.

3.3.2 Evaluation

We apply the same evaluation procedure and metric used in the BRIEF paper [3]. For each test image, we detect N = 300 keypoints using the ORB [4] keypoint detector and compute binary descriptors, i.e. LDB and BRIEF. Then for each keypoint in the first image of a pair, we perform brute-force matching based on the Hamming distance to find its Nearest Neighbor (NN) in the second image and call it a match. To obtain the ground truth of correct matches, for each keypoint in the first image we infer its corresponding point in the second image using the known homography between them. After that, we count the number of correct matches nc, i.e. matches whose matched NN points lie within a predefined distance (e.g. 10 pixels in this test) of the inferred points. Finally, we compute the Percentage of Correct Matches, nc/N, and use it to measure the robustness of a descriptor. We use the OpenCV 2.3.1 [21] implementation of the ORB detector and BRIEF descriptor.

Figure 4: Test image sequences, each of which contains 6 images. The first row shows the first image of each sequence; the bottom row shows the last. Matching the first image against all subsequent images yields 5 image pairs per sequence. All pairs are sorted according to ascending difficulty.
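The Hamming-distance matching step can be sketched as follows (illustrative only; descriptors are assumed to be packed into uint8 arrays so that matching reduces to XOR plus a bit count):

```python
import numpy as np

def hamming(a, b):
    # Descriptors packed as uint8 arrays: XOR, then count the set bits.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def nearest_neighbor(query, database):
    # Brute-force NN search: return the index of the closest database
    # descriptor and its Hamming distance to the query.
    dists = [hamming(query, db) for db in database]
    i = int(np.argmin(dists))
    return i, dists[i]
```

A match is then counted as correct when the matched keypoint falls within the predefined pixel distance of the point inferred via the known homography.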

The size of the smoothing kernel is a key factor affecting the robustness of BRIEF descriptors. Rather than performing binary tests on single pixel pairs, BRIEF relies on a pre-smoothed image patch around the pixels to reduce sensitivity to noise and increase the stability of the descriptors. Generally speaking, increasing the size of the smoothing kernel improves the robustness of BRIEF descriptors. However, over-smoothing the patch may blur its distinctive structure, decreasing distinctiveness. In this experiment, we tested two kernel sizes, 5x5 and 9x9 (the default setting in the BRIEF paper), and generated 256-bit BRIEF descriptors.

Similarly, the patch size for computing LDB descriptors largely impacts robustness. In Section 4, we provide a comprehensive discussion of its impact on robustness, distinctiveness and matching efficiency, and select the optimal patch size empirically based on matching experiments over a large database. In this experiment, we tested two patch sizes, 36x36 and 45x45, and employed only random bit selection to form our 256-bit LDB descriptors. Entropy-based bit selection is tested in Section 4 for matching against a large database.

Figure 5 displays the percentage of correct matches for the BRIEF and LDB descriptors. For all image pairs except Wall 1/4 and Wall 1/5, LDB, with patch sizes of both 36x36 and 45x45, achieves a higher percentage of correct matches than BRIEF with either 5x5 or 9x9 smoothing kernels. Specifically, when comparing LDB (patch size = 45x45) with BRIEF (kernel size = 9x9) on the most challenging pairs 1/6, the percentage of correct matches achieved by LDB is 17.2% ~ 45.3% higher than that of BRIEF.

In addition, for LDB, patch size 45x45 outperforms patch size 36x36 on almost all sequences. This result can be explained as follows: a larger patch contains bigger cells at each gridding level, and averaging intensity and gradients over larger cells reduces the impact of noise from individual pixels, making the results more resilient to noise.

3.3.3 Distance Distribution

We take a closer look at the distribution of Hamming distances between the LDB descriptors of various image pairs. Specifically, we analyze the matches of one image pair 1/5 from each category, i.e. Graffiti 1/5, Trees 1/5, Jpg 1/5 and Light 1/5. Figure 6 shows the normalized histograms, or distributions, of Hamming distances between corresponding points (in red) and non-corresponding points (in blue). The maximum possible Hamming distance is 32 x 8 = 256 bits. Ideally, the two curves should be completely separable: establishing a match of two images can be formulated as classifying pairs of points as being a match or not, and if the distributions are separable, a fixed distance threshold achieves perfect classification. According to the results in Figure 6, all the red and blue curves are distinctly separated. As expected, the distribution of distances for non-matching points is roughly Gaussian and centered around 128, while for matching points the centers of the curves are smaller, around 25 ~ 75. For Graffiti 1/5, however, the center of the red curve is around 75, which is larger than for the other three pairs. This implies more difficulty in distinguishing matching points from non-matching ones, consistent with the result shown in Figure 5, where the percentage of correct matches for Graffiti 1/5 is much lower than for the other three image pairs. We believe one of the reasons is that the Graffiti set includes image rotations between image pairs, while LDB does not incorporate a dominant orientation estimator, leading to inferior performance.

Figure 5: Percentage of correct matches on the Graffiti, Wall, Jpg, Trees, Bikes and Light sequences.

Figure 6: Distribution of Hamming distances for matching points (red lines) and non-matching points (blue lines) for four image pairs: Graffiti 1/5, Trees 1/5, Light 1/5 and Jpg 1/5.

4 SCALABLE MATCHING OF LDB

In this section we show that LDB outperforms BRIEF in Nearest-Neighbor (NN) matching over a large database; the better distinctiveness of LDB accounts for this superior performance. We start with a brief description of Locality Sensitive Hashing (LSH), the indexing structure used for large-scale NN matching.

4.1 Locality Sensitive Hashing for LDB

LSH [25] is a widely used technique for approximate NN search. The core of LSH is a hash function under which similar descriptors are hashed into the same bucket of a hash table while dissimilar ones fall into different buckets. To find the NN of a query descriptor, we first retrieve its matching bucket and then check all the descriptors within this bucket using brute-force matching.

For binary features, the hash function is usually a subset of bits from the original bit string: each bucket of a hash table contains descriptors sharing a common sub-bit-string. The size of the subset, i.e. the hash key size, determines the maximum Hamming distance among descriptors within the same bucket.

To improve the chance that the correct NN is retrieved by LSH, i.e. the detection rate of NN, two techniques, multi-table and multi-probe [26], are usually used. Multi-table stores the database descriptors in several hash tables, each using a different hash function. In the query phase, each query descriptor is hashed into a bucket of every hash table and the matches within those buckets are checked. Multi-table improves the detection rate of NN at the cost of linearly increased memory usage and matching time. Multi-probe examines both the bucket into which a query descriptor falls and the neighboring buckets. While multi-probe results in more matches to check, it allows for fewer hash tables and thus less memory usage. In addition, it enables a larger key size and therefore smaller buckets and fewer matches to check per bucket.
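The multi-table and multi-probe mechanics described above can be sketched in a few lines of Python. This is a toy index under our own naming and parameter choices, not the OpenCV implementation used in the experiments: each table hashes on a random bit subset, and probing at level r also visits buckets whose keys differ by up to r bits.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

class BinaryLSH:
    """Toy multi-table, multi-probe LSH for binary descriptors (illustrative only)."""
    def __init__(self, n_bits, key_size, n_tables, probe_level, seed=0):
        rng = np.random.default_rng(seed)
        # each table hashes on a different random subset of bits
        self.subsets = [rng.choice(n_bits, key_size, replace=False)
                        for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.probe_level = probe_level
        self.key_size = key_size

    def add(self, idx, desc):
        for subset, table in zip(self.subsets, self.tables):
            table[tuple(desc[subset])].append((idx, desc))

    def query(self, desc):
        best, best_d = None, None
        for subset, table in zip(self.subsets, self.tables):
            key = desc[subset].copy()
            # probe the exact bucket plus buckets within probe_level bit flips
            keys = [tuple(key)]
            for r in range(1, self.probe_level + 1):
                for pos in combinations(range(self.key_size), r):
                    k = key.copy()
                    k[list(pos)] ^= 1
                    keys.append(tuple(k))
            for k in keys:
                for idx, d in table.get(k, ()):
                    dist = int(np.count_nonzero(d != desc))  # brute force in bucket
                    if best_d is None or dist < best_d:
                        best, best_d = idx, dist
        return best, best_d

# tiny demo: index 20 random 32-bit descriptors, then query one of them
rng = np.random.default_rng(1)
db = [rng.integers(0, 2, 32).astype(np.uint8) for _ in range(20)]
lsh = BinaryLSH(n_bits=32, key_size=8, n_tables=3, probe_level=1)
for i, d in enumerate(db):
    lsh.add(i, d)
print(lsh.query(db[5]))  # finds index 5 at distance 0
```

Note how the tradeoffs in the text appear directly: more tables multiply memory and lookups, while a higher probe level multiplies the candidate keys per table but permits a larger key size and smaller buckets.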

4.2 Evaluation

We use the query time per feature and the detection rate of NN to evaluate the matching performance of LDB and BRIEF. The evaluation uses 228 images from the Document-Natural dataset (refer to Sec. 5 for details) as the database and 228 synthetically generated query images. The query images are obtained by applying motion blur to the original Document-Natural images using ImageMagick [22].

For each image, we detect 724 interest points on average using an ORB detector and compute a 256-bit LDB or BRIEF descriptor on a patch around each interest point. We experiment with several sets of LSH parameters with different hash key sizes and numbers of probes. In OpenCV 2.3.1, the number of probes is specified through the multi-probe level parameter: level = 0 indicates that only the bucket into which a query descriptor falls is checked, level = 1 denotes that neighboring buckets differing by 1 bit are also examined, and so on. Here, we use five hash tables; for each table we set the key size to 20, 24 or 28 bits and the multi-probe level to 0, 1 or 2, yielding 9 sets of LSH parameters in total.

We perform the experiments on a Motorola Xoom1 tablet, which runs Android 3.0 Honeycomb and features 1GB RAM and a dual-core ARM Cortex-A9 running at a 1GHz clock rate. Although the processor has multiple cores, we use only one core in our experiments.

Figure 7 plots the query time per feature against the detection rate of NN; a good configuration finds many of the query descriptors' nearest neighbors in a short time. We notice that, for a given detection rate, both LDB-entropy (using entropy-based bit selection) and LDB-random (using random bit selection) are much faster for NN search than BRIEF. For example, to find over 40% of nearest neighbors, BRIEF spends 1.3ms per feature, achieved using a probe level of 2 and a 24-bit key size. To obtain the same detection rate, LDB-entropy costs only 0.16ms with multi-probe level = 1 and key size = 20, and LDB-random takes only 0.08ms with probe level = 1 and key size = 24. In other words, LDB achieves an 8x~16x speedup in NN matching. The speedup is most likely due to the better discriminative ability of LDB descriptors: each bucket contains a smaller number of distinctive yet similar descriptors, leading to fewer but more effective checks. To further validate this, we take a closer look at the distribution of bucket sizes in the hash tables, shown in Figure 8. We observe that LDB (-random and -entropy) makes the buckets of the hash tables more even, and the buckets are much smaller on average compared to BRIEF: as the descriptors are more distinctive, the bits of the descriptors are highly variant, and the hash function does a better job at partitioning the data.

Surprisingly, the random bit selection approach provides comparable or even superior performance to the entropy-based approach. One possible explanation is that the entropy-based approach selects bits only by maximizing variance rather than the

Figure 7: Time cost vs. detection rate. Five hash tables are used to store the database; each table contains 170k entries. Therefore, nearest neighbors are searched over 850k entries for LDB and BRIEF. We experiment with 9 sets of LSH parameters: we set the hash key size as 20, 24, and 28 and for each key size we set the multi-probe level as 0, 1, and 2.

Figure 8: LSH bucket size distribution on 228 Document-Natural image dataset. Both LDB-entropy and LDB-random give much more homogeneous buckets than BRIEF, thus improving the query speed and accuracy.

difference (i.e. distance) between matching and non-matching points. Highly variant bits do not always lead to uncorrelated and discriminative bits; therefore, entropy-based selection might not greatly improve the performance of LDB descriptors. More advanced methods, such as PCA or AdaBoost, can be incorporated in the future to enhance the performance. From another angle, neither LDB-random nor BRIEF requires any training process, yet LDB-random exceeds BRIEF. Therefore, applying an advanced bit selection approach to LDB should achieve consistently better performance than applying it to BRIEF.
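Since entropy-based selection maximizes per-bit variance, it can be sketched as keeping the bits whose on-probability over a training set is closest to 0.5 (for a binary bit, maximum entropy and maximum variance coincide). This is our illustrative reading, not the paper's exact procedure; `select_bits_by_entropy` is a hypothetical name:

```python
import numpy as np

def select_bits_by_entropy(descriptors, n_select):
    """Keep the n_select bit positions whose fraction of 1s is closest to 0.5,
    i.e. the highest-entropy (highest-variance) bits. Illustrative sketch only."""
    p = descriptors.mean(axis=0)            # fraction of 1s at each bit position
    order = np.argsort(np.abs(p - 0.5))     # closest to 0.5 first
    return np.sort(order[:n_select])

rng = np.random.default_rng(0)
train = rng.integers(0, 2, size=(1000, 486)).astype(np.uint8)
train[:, 0] = 1                             # a constant bit carries no information
keep = select_bits_by_entropy(train, 256)
print(0 in keep)  # False: the constant bit is dropped
```

As the text notes, this criterion considers each bit in isolation; it ignores correlations between bits and the match/non-match distance gap, which is why random selection can perform comparably.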

5 APPLICATIONS TO MOBILE AUGMENTED REALITY

We apply LDB to scalable MAR applications by implementing a conventional AR pipeline: we first detect ORB points and compute LDB descriptors for a captured image frame. We then match it to our database and return the database image with the most matches, provided the number of matches is sufficient, as a potential recognition result. Finally, we perform RANSAC [24] to validate the result and obtain a pose estimate. Once an object is recognized, it is tracked from the next frame on by matching features of consecutive frames. The recognizer and tracker are executed complementarily during the entire process: the recognizer is activated whenever the tracker fails or new objects appear, while the tracker bridges the recognized results and speeds up the process.
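The recognition decision above (vote for the database image with the most descriptor matches, accept only if the count is sufficient) can be sketched in plain NumPy. The thresholds and the function name `recognize` are illustrative assumptions, not the paper's values:

```python
import numpy as np
from collections import Counter

def recognize(query_descs, db_descs, db_image_ids, dist_thresh=40, min_matches=10):
    """Vote for the database image with the most descriptor matches; report it
    only if the vote count is sufficient. Thresholds are illustrative."""
    votes = Counter()
    for q in query_descs:
        dists = np.count_nonzero(db_descs != q, axis=1)  # Hamming to every db descriptor
        nn = int(np.argmin(dists))
        if dists[nn] <= dist_thresh:                     # reject unreliable matches
            votes[db_image_ids[nn]] += 1
    if not votes:
        return None
    image, count = votes.most_common(1)[0]
    return image if count >= min_matches else None
```

In the real system, the brute-force scan over `db_descs` is replaced by the LSH index of Section 4, and the accepted candidate is further validated geometrically with RANSAC.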

In the following subsections, we evaluate the performance of

scalable object recognition and tracking individually.

5.1 Object Recognition on Mobile Devices

We first describe the database, the query set, experimental setup

and measurement metrics used for our evaluation, and then

provide results.

5.1.1 Document-Natural Database and Query Images

Document-Natural Database: our database consists of two types of images: 109 document images generated from the ICME06 proceedings and 119 natural scene images containing buildings and people selected from Oxford Buildings 5K, yielding 228 images in total. Such a database allows us to evaluate the performance of our method on both richly textured objects (e.g. natural scenes) and sparsely textured objects (e.g. text documents).

Query images: we manually captured a picture of each database image. In mobile applications, captured images are often blurred due to fast movement of the device during capture. To examine the performance of the descriptors under such distortions, we applied motion blur to all query images using ImageMagick [22].

Exemplar database images and query images are displayed in

Figure 11 (a) and (b).

5.1.2 Experimental Setup and Measurement

We extract 1500 features on average from each database image.

All the database features are stored in an LSH indexing structure,

consisting of 5 hash tables with 1.7+M entries in total. For all the

hash tables, we set the key size as 20 and multi-probe level as 1.

For query images, we extract 724 features per image on average.

All the codes were executed in a single thread running on an

ARM Cortex A9, 1GHz processor of a Motorola Xoom1 tablet.

Four measurement metrics are used:

Detection Rate: the number of objects that are successfully recognized over the total number of objects. In our case, the detection rate is 1 if all 228 images are successfully identified.

Precision: the number of correctly recognized objects over the total number of recognized objects.

Descriptor Computation Time: the average time cost of computing descriptors per image.

Matching Time: the average time cost of searching for the nearest neighbor of a query descriptor in the database.

5.1.3 Patch Size Selection

We first select the optimal patch size for our LDB descriptors. We experiment with 6 different patch sizes: 24x24, 32x32, 36x36, 45x45, 50x50, and 64x64. For every patch size, we tune the Hamming distance threshold for rejecting unreliable matches until it yields over 99% precision. Figure 9 displays the detection rate and match time as a function of the patch size. As the patch size increases, the detection rates of LDB-random (blue curve) and LDB-entropy (red curve) increase monotonically at first and gradually saturate after 50x50. This is because larger patches include more pixels, and averaging more pixels when computing LDB descriptors reduces the negative impact of noise, enhancing robustness and improving the detection rate. However, averaging too many pixels also reduces the distinctiveness of LDB descriptors, leading to more distractors due to false matches. Therefore, the detection rate curves become flat after the patch size reaches 50x50. In terms of match time, the cost decreases as the patch is enlarged from 24x24 to 45x45; beyond that, further enlarging the patch increases the match time again. This also reflects how the descriptor's robustness and distinctiveness change with patch size: undersized patches lack robustness while oversized patches lack distinctiveness. Both cases increase the number of false matches, leading to more matches to check and longer processing time.

According to Figure 9, the size 45x45 provides the best tradeoff between detection rate and match time for both LDB-entropy and LDB-random. Thus, in the following experiments, we use 45x45 as the optimal patch size.

5.1.4 Results

Table 1 compares the performance of LDB and BRIEF in terms of detection rate, precision, descriptor computation time and match time. First, we compare the performance of BRIEF using different sizes of smoothing kernels in the first two rows of Table 1. We observe that BRIEF with the smaller kernel (5x5) achieves

Figure 9: Match time and detection rate of LDB as a function of patch size. The detection rate of both LDB-random and LDB-entropy increases when increasing the patch size from 24x24 to 50x50 and saturates after that. The match time decreases when enlarging the patch size from 24x24 to 45x45, and increases when further enlarging the patch size. Based on an overall consideration of detection rate and match time, we select 45x45 as the optimal size for this database.

a lower detection rate and precision, yet a shorter matching time, than with the larger kernel (9x9). This is because the smaller kernel makes BRIEF more sensitive to noise, but also more distinctive due to less blurring. Second, we compare BRIEF with the 9x9 kernel against LDB. We notice that LDB-random achieves a slightly greater detection rate (95% vs. 94%), similar precision (99%) and similar descriptor extraction efficiency (37ms) compared to BRIEF-9x9, while providing a 5x speedup in matching. As explained in subsection 4.2, the reason for this is the better distinctiveness and robustness of LDB descriptors.

5.2 Real-Time Patch Tracking on Mobile Devices

Tracking on mobile devices involves matching the live frames to a previously captured frame. Compared to a pair of matching images in recognition, consecutive frames usually share large content overlaps, making it less challenging to achieve satisfactory matching accuracy: the photometric changes (e.g. lighting differences) and geometric changes (e.g. scaling) are smaller and more predictable. Therefore, we extract many fewer ORB points for tracking (200) than for recognition (724). We then compute binary descriptors for each point and search for its nearest neighbor on the previous frame using brute-force descriptor matching. The top-ranked putative matches (e.g. the 40 matches with the shortest distances in our experiment) are then used in homography estimation based on RANSAC.
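The homography step can be illustrated with a self-contained NumPy RANSAC. This is a toy sketch under our own parameter choices (iteration count, 3-pixel inlier threshold), not the paper's implementation:

```python
import numpy as np

def homography_dlt(src, dst):
    """Direct Linear Transform: homography from 4+ point pairs (src, dst are Nx2)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 3)              # null-space vector as 3x3 matrix

def ransac_homography(src, dst, iters=500, thresh=3.0, seed=0):
    """Fit H on random 4-point samples, keep the model with the most inliers."""
    rng = np.random.default_rng(seed)
    n = len(src)
    best_inliers = np.zeros(n, dtype=bool)
    ones = np.ones((n, 1))
    for _ in range(iters):
        idx = rng.choice(n, 4, replace=False)
        H = homography_dlt(src[idx], dst[idx])
        proj = np.hstack([src, ones]) @ H.T
        with np.errstate(divide="ignore", invalid="ignore"):
            pts = proj[:, :2] / proj[:, 2:3]  # degenerate models yield inf/nan, rejected below
        err = np.linalg.norm(pts - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# 40 putative matches: 30 follow a pure translation, 10 are false matches
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, (40, 2))
dst = src + np.array([5.0, -3.0])
dst[30:] = rng.uniform(0, 100, (10, 2))      # outliers
inl = ransac_homography(src, dst)
print(inl.sum() / len(inl))                  # inlier ratio, about 0.75 here
```

With 30 of the 40 simulated matches following the true motion, the recovered inlier ratio is about 75%, mirroring how a higher-quality putative match set shortens RANSAC's work and raises the inlier ratio.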

We examine the total time cost of tracking, including the costs of matching and homography estimation, as well as the inlier ratio. Here, the inlier ratio is defined as the number of matches consistent with the homography model over the total number of top-ranked putative matches. Table 2 displays the results. The time cost of matching is about the same for BRIEF-9x9 and LDB-entropy/-random, because the numbers of descriptors and of matches to check are the same for both. But the time cost of homography estimation differs greatly: BRIEF-9x9 spends 142ms, almost twice the cost of LDB (77ms for the entropy-based approach and 79ms for the random approach). Also, the inlier ratio of BRIEF is 29%~31% lower, in relative terms, than that of LDB-entropy/-random. These results further demonstrate that LDB is more distinctive, so that far fewer false matches are generated (as shown in Figure 10), leading to a shorter time for homography estimation and a higher inlier ratio.

6 DISCUSSION

In this paper we have focused on validating the idea of LDB; invariance to scaling and rotation will be addressed and evaluated in the future. In principle, LDB can be combined with any scale-invariant point detector to cope with scale changes. To make LDB invariant to in-plane rotation, one popular solution is to estimate a dominant orientation for a patch, to which the patch is aligned before computing the descriptor. One problem with computing LDB descriptors for rotated patches is that the integral image of the entire image cannot be directly used to speed up the computation, because the patch axes are no longer parallel to the integral image axes. To address this problem, we can compute a rotated integral image for each rotated patch, calculated by summing up pixels along the dominant orientation instead of the horizontal axis. Based on the rotated integral image of the patch, we can efficiently compute the intensity sum and gradients of any grid cell within the patch.
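For reference, here is the axis-aligned case the discussion builds on: a summed-area table makes any cell sum four lookups, which is what makes per-cell averaging in LDB cheap. A rotated variant would accumulate along the dominant orientation instead. This is a minimal NumPy sketch, not the paper's code:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def cell_sum(ii, top, left, h, w):
    """Sum of any axis-aligned cell in O(1) from four table lookups."""
    return ii[top + h, left + w] - ii[top, left + w] - ii[top + h, left] + ii[top, left]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(cell_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 9 + 10 = 30
```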

In general, computing steered LDB will not noticeably increase the feature extraction time, because the added cost of computing a set of rotated integral images of patches is largely offset by the saved cost of computing an integral image of the entire image. For example, computing an integral image of a 1280x800 image involves 1280x800 ≈ 1M operations. In comparison, computing 500 steered LDB descriptors of the image also involves ~1M operations (i.e. 500 patches, each of size 45x45). Another overhead of steered LDB is the computation of the coordinates of rotated patches. In principle, this cost can be avoided by quantizing orientations and pre-storing the rotated coordinates in lookup tables.

7 CONCLUSION

In this paper, we have introduced a new binary descriptor, called

LDB, to facilitate scalable AR on mobile devices. LDB is

extremely fast to compute by using the integral image technique.

Computing 700+ LDB descriptors takes less than 40ms on a

Motorola Xoom1 tablet. LDB is also very efficient for NN

matching over a large database. This is mainly due to its high

distinctiveness and robustness, which greatly reduce the number

of matches to check even in a large database.

Though we apply LDB to MAR applications using a conventional pipeline, it can be directly incorporated into a more advanced flow, e.g. a unified recognition-and-tracking flow, to further reduce the time cost of feature detection and description by exploiting the temporal coherence between consecutive frames. Rotation invariance of LDB will be implemented and evaluated in the future. Future work also includes exploring more advanced bit selection methods based on machine learning, such as AdaBoost.

Table 1. Performance of the binary descriptors BRIEF and LDB for scalable object recognition. LDB-random achieves a slightly greater detection rate and a 5x speedup in matching over BRIEF-9x9, while maintaining the same precision and computational efficiency.

Descriptor    Detection Rate  Precision  Desc. Time (ms)  Match Time (ms)
BRIEF-5x5     0.88            0.99       37               411
BRIEF-9x9     0.94            0.99       37               909
LDB-entropy   0.93            0.99       36               180
LDB-random    0.95            0.99       37               178

Figure 10: 40 top-ranked putative matches of (a) BRIEF descriptors and (b) LDB-entropy descriptors between two consecutive frames. The green lines indicate the correct matches, i.e. inliers, and the red lines denote the false matches, i.e. outliers. LDB-entropy generates fewer false matches than BRIEF does.

Table 2: Time cost of matching and of homography estimation using RANSAC, and inlier ratio.

Descriptor    Match Time (ms)  RANSAC Time (ms)  Inlier Ratio
BRIEF-9x9     25               142               46%
LDB-entropy   25               77                67%
LDB-random    25               79                65%

REFERENCES

[1] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. SURF: Speeded-Up Robust Features. In Proc. of ECCV'06.
[2] Lowe, D. G. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), pp. 91-110, 2004.
[3] Calonder, M., Lepetit, V., Strecha, C., and Fua, P. BRIEF: Binary Robust Independent Elementary Features. In Proc. of ECCV'10.
[4] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. ORB: An Efficient Alternative to SIFT or SURF. In Proc. of ICCV'11, Barcelona, Spain.
[5] Tuytelaars, T., and Schmid, C. Vector Quantizing Feature Space with a Regular Lattice. In Proc. of ICCV'07.
[6] Shechtman, E., and Irani, M. Matching Local Self-Similarities across Images and Videos. In Proc. of CVPR'07.
[7] Ke, Y., and Sukthankar, R. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In Proc. of CVPR'04.
[8] Calonder, M., Lepetit, V., Konolige, K., Bowman, J., Mihelich, P., and Fua, P. Compact Signatures for High-Speed Interest Point Description and Matching. In Proc. of ICCV'09.
[9] Nister, D., and Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proc. of CVPR'06.
[10] Wagner, D., Reitmayr, G., Mulloni, A., Drummond, T., and Schmalstieg, D. Pose Tracking from Natural Features on Mobile Phones. In Proc. of ISMAR'08.
[11] Wagner, D., Schmalstieg, D., and Bischof, H. Multiple Target Detection and Tracking with Guaranteed Framerates on Mobile Phones. In Proc. of ISMAR'09.
[12] Wagner, D., Mulloni, A., Langlotz, T., and Schmalstieg, D. Real-Time Panoramic Mapping and Tracking on Mobile Phones. In Proc. of IEEE VR'10.
[13] Klein, G., and Murray, D. Parallel Tracking and Mapping on a Camera Phone. In Proc. of ISMAR'09, Orlando, October.
[14] Ta, D.N., Chen, W.C., Gelfand, N., and Pulli, K. SURFTrac: Efficient Tracking and Continuous Object Recognition Using Local Feature Descriptors. In Proc. of CVPR'09.
[15] Takacs, G., Chandrasekhar, V., and Tsai, S. Unified Real-Time Tracking and Recognition with Rotation-Invariant Fast Features. In Proc. of CVPR'10.
[16] Torralba, A., Fergus, R., and Weiss, Y. Small Codes and Large Databases for Recognition. In Proc. of CVPR'08.
[17] Taylor, S., Rosten, E., and Drummond, T. Robust Feature Matching in 2.3µs. In Proc. of CVPR'09.
[18] Lepetit, V., Lagger, P., and Fua, P. Randomized Trees for Real-Time Keypoint Recognition. In Proc. of CVPR'05, pp. 775-781.
[19] Pilet, J., and Saito, H. Virtually Augmenting Hundreds of Real Pictures: An Approach Based on Learning, Retrieval, and Tracking. In Proc. of VR'10.
[20] Rosten, E., Porter, R., and Drummond, T. Faster and Better: A Machine Learning Approach to Corner Detection. IEEE Trans. PAMI, 32:105-119, 2010.
[21] OpenCV 2.3.1, http://sourceforge.net/projects/opencvlibrary/
[22] ImageMagick, http://www.imagemagick.org/script/index.php
[23] Image sequences, http://www.robots.ox.ac.uk/~vgg/research/affine
[24] Chum, O., and Matas, J. Matching with PROSAC - Progressive Sample Consensus. In Proc. of CVPR'05, volume 1, pp. 220-226.
[25] Gionis, A., Indyk, P., and Motwani, R. Similarity Search in High Dimensions via Hashing. In Proc. of VLDB'99.
[26] Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proc. of VLDB'07.

Figure 11: (a) Exemplar images from the Document-Natural database, (b) recognized images from the query set, and (c) pictures of our object recognition system running on a Motorola Xoom1 tablet. The green rectangles illustrate the pose between the camera and the recognized object, and the red dots denote matched points. The semi-transparent marker on the original image is used to guide the user's capture and is not used for recognition.