Indexing in Large Scale Image Collections: Scaling Properties,
Parameter Tuning, and Benchmark
Mohamed Aly
Computational Vision Lab
Electrical Engineering, Caltech
Pasadena, CA 91125, USA
vision.caltech.edu/malaa
Mario Munich
Evolution Robotics
Pasadena, CA 91106, USA
www.evolution.com
Pietro Perona
Computational Vision Lab
Electrical Engineering, Caltech
Pasadena, CA 91125, USA
vision.caltech.edu
October 2010
Abstract
Indexing quickly and accurately in a large collection of images has become an important problem with many applications.
Given a query image, the goal is to retrieve matching images in the collection. We compare the structure and properties of
seven different methods based on the two leading approaches: voting from matching of local descriptors vs. matching histograms of visual words, including some new methods. We derive theoretical estimates of how the memory and computational
cost scale with the number of images in the database. We evaluate these properties empirically on four real-world datasets
with different statistics. We discuss the pros and cons of the different methods and suggest promising directions for future
research.
1 Introduction
Indexing in a large scale collection of images has become an important problem with widespread applications, especially with the increasing popularity of smart phones. There are currently several applications that allow the user to take a photo with the smart phone camera and search a database of stored images, e.g. Google Goggles1, Snaptell2, and the Barnes and Noble3 app. These image collections typically include images of book covers, CD/DVD covers, retail products, and buildings and landmark images. The size of the databases involved varies from 10^5 to 10^7 images, but they can conceivably reach billions of
images. The ultimate goal is to identify the database image containing the object depicted in a probe image, e.g. an image of
a book cover from a different view point and scale. The correct image can then be presented to the user, together with some
revenue generating information, e.g. sponsor ads or referral links.
This application poses three main challenges: storage, computational cost, and recognition performance. For example, if we consider 1 billion images and store only 100KB per image, we need to store 100 TB of data just for the feature descriptors, and naively searching for the nearest feature in a database of 10^12 features takes 4 minutes on a supercomputer with 1 TFLOPS. Clearly we need clever ways of indexing these huge databases of images that reduce storage requirements, run time, or both, so that search becomes practical for user interactive applications.
In this work we explore how the memory usage, computational cost, and recognition performance of the two leading
approaches scale with the number of images in the database. These two approaches are based on extracting local features
from the images, e.g. SIFT [17] features, and then indexing these features or some information extracted therefrom. This
information is then used to find the best match of a probe image in the database images. The first approach, the full representation [17] approach, stores the exact features and uses efficient data structures to quickly search the huge number of
database features and identify candidate images that are further verified for geometric consistency. We consider different
variants of this approach, which use different data structures to perform the fast approximate search in the features database.
The second approach, the bag-of-words [22, 10, 9, 20, 11, 14, 15] approach, stores only occurrence counts of vector quantized
features, and uses efficient search techniques to get candidate image matches. We consider two variants: the exact inverted
file method [25, 20], and the approximate Min-Hash method [7, 6, 9]. We did not implement and test recent improvements,
e.g. [14, 21, 11, 15, 24, 8], since their scaling properties are essentially the same.
1 www.google.com/mobile/goggles
2 www.snaptell.com
3 www.barnesandnoble.com/iphone
Algorithm 1 Basic Algorithm
1. Extract local features {f_ij}_j from every image i
2. Store some function of these features g(f_ij) and build some data structure d(g) based on g()
3. Given the probe image q, extract its local features and compute g(f_qj)
4. Search through the data structure d(g) for "nearest neighbors"
5. Every nearest neighbor votes for the database image i it comes from, accumulating to its score s_i
6. Sort the database images based on their score s_i
7. Post-process the sorted list of images to enforce some geometric consistency and obtain a final list of sorted images s'_i. The geometric consistency check is done using a RANSAC algorithm to fit an affine transformation between the query image q and the database image i.
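The steps above can be sketched in code. The following is a minimal illustration (function names and array shapes are our own, not from the paper) that uses brute-force nearest-neighbor matching for step 4 and omits the RANSAC post-processing of step 7:

```python
import numpy as np

def build_database(image_features):
    """Steps 1-2: stack all local features and remember the owning image.
    image_features: list of (n_i, d) arrays, one per database image."""
    feats = np.vstack(image_features)
    owner = np.concatenate([np.full(len(f), i) for i, f in enumerate(image_features)])
    return feats, owner

def query(db_feats, owner, probe_feats, num_images):
    """Steps 3-6: each probe feature votes for the database image that owns
    its nearest database feature; images are returned sorted by votes."""
    scores = np.zeros(num_images)
    for f in probe_feats:
        d2 = np.sum((db_feats - f) ** 2, axis=1)   # squared Euclidean distances
        scores[owner[np.argmin(d2)]] += 1          # one vote per probe feature
    return np.argsort(-scores)                     # image ids sorted by score
```

In practice the brute-force loop in `query` is replaced by one of the approximate indexes (kd-trees, LSH, HKM) discussed below.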
We provide the following contributions:
1. We provide a benchmark of seven different methods based on the two leading approaches: local feature matching (full
representation) and visual words histogram matching (bag of words). We compare recognition rate and matching time
as the number of images increases. We also include in the benchmark three new methods: using spherical LSH hash
functions [23] and using Hierarchical K-Means [19] for matching local features. We implemented these methods, and
we plan to make this code publicly available for further comparisons.
2. We provide theoretical estimates of the storage requirement and computational cost of these methods as a function of
the database size. We also explore how these methods are amenable to parallelization.
3. We empirically evaluate these properties on four real world datasets with different statistics, and
4. We present a number of empirical new findings for some of these methods. In particular, we show that:
(a) using L1 distance with L1 normalized histograms in the inverted file method is superior to the standard way of
using dot product with tf-idf weighted histograms [20] and to using binarized histograms [15].
(b) using spherical LSH hash functions [23] provides recognition performance comparable to the standard L2 LSH [2] function but with faster search time.
5. Based on the benchmark, we suggest promising directions for future research on efficient image matching in large
databases.
2 Methods Overview
In this work we consider the two leading approaches for image matching: Full Representation (FR) and Bag-of-Words (BoW) representations. Both are based on extracting local features from images, e.g. SIFT features. They can all be seen as representing the same approach with varying degrees of approximation to improve speed and/or storage requirements. The basic idea is described in Algorithm 1. In the following, we give only a brief description of each algorithm; please refer to the relevant references for further details.
2.1 Full Representation (FR)
This follows the same general approach outlined in Algorithm 1. The specifics are:
1. The function g(f_ij) = f_ij is the identity function, i.e. the full features are stored
2. The data structure d({f_ij}_ij) tries to facilitate faster nearest neighbor lookup at the expense of some additional storage.
Four general algorithms are considered here:
(a) Exhaustive: just use brute force search to get the nearest neighbor for each query image feature fq j.
2
(b) Kd-Tree: a number of randomized Kd-trees [3, 17] are built for the database features {f_ij}_ij to allow for logarithmic retrieval time
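The search in such trees is typically best-bin-first with a bounded backtracking budget. The following is a simplified single-tree sketch (the randomized trees of [3] pick the split dimension at random among the top-variance dimensions; names and parameters here are our own illustration, not the benchmarked implementation):

```python
import heapq
import numpy as np

def build_kdtree(pts, idx=None, leaf_size=8):
    """Median-split kd-tree over the rows of pts, splitting on the
    maximum-variance dimension at each node."""
    if idx is None:
        idx = np.arange(len(pts))
    if len(idx) <= leaf_size:
        return ('leaf', idx)
    dim = int(np.argmax(pts[idx].var(axis=0)))
    order = idx[np.argsort(pts[idx, dim])]
    mid = len(order) // 2
    return ('node', dim, float(pts[order[mid], dim]),
            build_kdtree(pts, order[:mid], leaf_size),
            build_kdtree(pts, order[mid:], leaf_size))

def bbf_search(tree, pts, q, max_leaves=50):
    """Best-bin-first: visit leaves in order of distance to the splitting
    planes, stopping after a fixed backtracking budget (approximate NN)."""
    best, best_d2 = -1, np.inf
    heap, tie = [(0.0, 0, tree)], 1        # (lower bound, tiebreak, node)
    visited = 0
    while heap and visited < max_leaves:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d2:
            continue                       # cannot contain anything closer
        while node[0] == 'node':
            _, dim, split, lo, hi = node
            near, far = (lo, hi) if q[dim] < split else (hi, lo)
            heapq.heappush(heap, ((q[dim] - split) ** 2, tie, far))
            tie += 1
            node = near
        visited += 1
        for i in node[1]:                  # scan the leaf's points
            d2 = float(np.sum((pts[i] - q) ** 2))
            if d2 < best_d2:
                best, best_d2 = int(i), d2
    return best
```

The `max_leaves` budget plays the role of the backtracking parameter B_kdt: with a large budget the search becomes exact, with a small one it stays fast but approximate.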
(c) LSH: a number of locality sensitive hash functions [2, 16] are computed from the database features {f_ij}_ij, and are arranged in a set of tables. Each table has a set of hash functions whose values are concatenated to get the index of the bucket within the table where the feature should go. All features with the same hash value go to the same bucket. Three different hash functions are considered:
i. L2: this approximates the Euclidean distance [2], where the hash function is h(x) = ⌊(⟨x, r⟩ + b) / w⌋, where ⟨·, ·⟩ is the dot product, r is a random unit vector, b is a random offset, and w is the bin width. For normalized feature vectors, it basically projects the feature onto a random direction and then returns the bin number where the projection lies.
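A sketch of this hash family, including the concatenation into a table's bucket key (function names are our own):

```python
import numpy as np

def make_l2_hash(d, w, rng):
    """One L2-LSH function: project on a random unit vector r, add a random
    offset b in [0, w), and return the bin index floor((<x,r> + b) / w)."""
    r = rng.normal(size=d)
    r /= np.linalg.norm(r)                 # random unit direction
    b = rng.uniform(0, w)                  # random offset
    return lambda x: int(np.floor((np.dot(x, r) + b) / w))

def make_table_key(d, w, num_funcs, rng):
    """Concatenate several hash values to form one table's bucket key."""
    hs = [make_l2_hash(d, w, rng) for _ in range(num_funcs)]
    return lambda x: tuple(h(x) for h in hs)
```

One such key function is built per table; features (and later the query) are routed to the bucket indexed by the key, so only bucket members need exact distance computations.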
ii. Spherical-Simplex: this approximates distances on the hypersphere [23], where the hash function is h(x) = argmin_i ||x − y_i||, with y_i the vertices of a random simplex inscribed in the unit hypersphere. The hash value is the index of the nearest vertex of the simplex.
iii. Spherical-Orthoplex: this approximates distances on the hypersphere [23], where the hash function is h(x) = argmin_i ||x − y_i||, with y_i the vertices of a random orthoplex inscribed in the unit hypersphere. The hash value is the index of the nearest vertex of the orthoplex.
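The two spherical families can be illustrated as follows. This is a sketch, not the construction of [23] verbatim: the simplex vertices are built numerically by centering the standard basis of R^(d+1) and rotating it into d dimensions, and a random rotation would be applied to make different hash functions independent. For unit vectors, the nearest vertex is the one with the largest dot product:

```python
import numpy as np

def simplex_vertices(d):
    """d+1 unit vectors in R^d with pairwise dot product -1/d: center the
    standard basis of R^(d+1), then rotate into the hyperplane sum(x)=0."""
    pts = np.eye(d + 1) - 1.0 / (d + 1)
    basis = np.linalg.svd(pts)[2][:d]    # orthonormal basis of that hyperplane
    v = pts @ basis.T
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def orthoplex_vertices(d):
    """The 2d vertices +/- e_i of the cross-polytope."""
    return np.vstack([np.eye(d), -np.eye(d)])

def random_rotation(d, rng):
    """Random rotation used to decorrelate different hash functions."""
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def spherical_hash(vertices):
    """h(x) = index of the nearest vertex; for unit vectors this equals the
    vertex with the largest dot product with x."""
    return lambda x: int(np.argmax(vertices @ x))
```

Note the speed difference: the orthoplex hash needs only a comparison over 2d coordinates, while the simplex hash needs d+1 dot products, and L2-LSH needs H_l2 projections per table.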
(d) Hierarchical K-Means: a hierarchical decomposition [19] is built from the database features {f_ij}_ij. At each node of the tree, the K-means algorithm is run to cluster the data in that node, and the process is repeated recursively until the maximum allowed depth is reached.
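A sketch of such a hierarchical decomposition, with plain Lloyd iterations at each node (a simplified illustration with our own names; real systems use better seeding and far more data):

```python
import numpy as np

def kmeans(pts, k, rng, iters=10):
    """Plain Lloyd k-means; returns final centers and assignments."""
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((pts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = pts[assign == j].mean(axis=0)
    # recompute so assignments match the final centers
    assign = np.argmin(((pts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, assign

def build_hkm(pts, k, max_depth, rng, depth=0):
    """Recursively cluster: each node keeps k centers, one child per center."""
    if depth >= max_depth or len(pts) <= k:
        return ('leaf', pts)
    centers, assign = kmeans(pts, k, rng)
    children = [build_hkm(pts[assign == j], k, max_depth, rng, depth + 1)
                for j in range(k)]
    return ('node', centers, children)

def hkm_lookup(tree, x):
    """Descend the tree following the nearest center at each node."""
    while tree[0] == 'node':
        _, centers, children = tree
        tree = children[int(np.argmin(((centers - x) ** 2).sum(-1)))]
    return tree[1]       # candidate features in the reached leaf
```

Lookup costs only D·k center comparisons instead of a scan over all features, which is where the k^D term in the cost estimates of table 2 comes from.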
2.2 Bag-of-Words Representation (BoW)
The differences from Algorithm 1 are:
1. The function g(f_ij) represents a clustering of the input features. For pre-processing, a "dictionary" is built from the database images by clustering features into representative "visual words". Then, each image is represented by a histogram of occurrences of these visual words {h_i}_i, and the original local feature vectors are thrown away. This is inspired by text search applications [25].
2. We consider two data structures that try to perform faster search through the database histograms:
(a) Inverted File: where an inverted file [4, 25] structure is built from the database images.
(b) Min-Hash: a number of locality sensitive hash functions [6, 7, 5, 9] are extracted from the database histograms {h_i}_i, and are arranged in a set of tables. The histograms are binarized (counts are ignored), and each image is represented as a "set" of visual words {b_i}_i. The hash function is defined as h(b) = min π(b), where π is a random permutation of the numbers {1, ..., W} and W is the number of words in the dictionary.
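A minimal Min-Hash sketch along these lines (explicit permutations are stored for clarity; practical implementations use cheaper hash functions, and the names here are our own):

```python
import numpy as np

def make_minhash(num_words, num_tables, num_funcs, rng):
    """Build Min-Hash bucket keys: one random permutation of {0..W-1} per
    hash function; the hash of a word set is its smallest permuted index."""
    perms = [[rng.permutation(num_words) for _ in range(num_funcs)]
             for _ in range(num_tables)]
    def keys(word_set):
        ws = np.fromiter(word_set, dtype=int)
        # one bucket key per table: the concatenated min-hashes
        return [tuple(int(p[ws].min()) for p in table) for table in perms]
    return keys
```

Two sets collide in one hash function with probability equal to their Jaccard similarity, so images sharing many visual words land in the same buckets across several tables.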
3 Theoretical Scaling Properties
3.1 Description
We provide theoretical estimates of how the storage and computational cost of these methods scale with the number of images
in the database. We present the definitions of the parameters in table 1 and a summary of the results in table 2, with details in Appendix A. We note that these calculations are based on minimum theoretical storage and average case scenarios. We also note that we compute the distance between vectors using the dot product, which is equivalent to the Euclidean distance since we assume feature vectors are normalized. We do not consider any compression technique that might decrease storage (e.g. run-length encoding). Moreover, actual run times of these algorithms will differ from the average case presented here, for example for the inverted file method or the spherical LSH methods, see figure 4.
3.2 Discussion
3.2.1 Storage and Run-time
Figure 1a shows how storage requirements and run-time scale with the number of images in the database, assuming they are implemented on one machine with infinite storage and a 10 GFLOPS processor. We note the following:
[Figure 1 appears here; the panels are summarized below.]
(a) Theoretical storage vs. size (left) and run time vs. size (right), assuming a single machine with infinite memory. Curves: exhaustive, kd-tree, lsh-l2, lsh-sph-sim, lsh-sph-orth, hkm, bow-inv-file, bow-min-hash.
(b) Advanced parallelization scheme for Kd-trees. The root machine stores the top of the tree, while leaf machines store the leaves of the tree.
(c) Time per image vs. the minimum number of machines M required, for different dataset sizes. Each point represents a dataset size from 10^6 to 10^9 images, going left to right. kd-tree-adv is the advanced parallelization scheme. Curves: kd-tree, kd-tree-adv, lsh-l2, bow-inv-file.
Figure 1: Theoretical Scaling Properties. Refer to tables 1-2
Parameter | Description | Typical Value
I | no. of images | 10^9
s | bytes per feature dimension | 1
d | feature dimension | 128
F | no. of features per image | 1,000
M | no. of machines | varies
C | main memory per machine | 50GB
T_kdt | no. of kd-trees | 4
B_kdt | no. of backtracking steps | 250
T_lsh | no. of LSH tables | 4
H_l2 | no. of hash functions for LSH-L2 | 50
B_lsh | no. of LSH buckets | 10^6
H_sph | no. of hash functions for LSH-Spherical | 5
D | depth of HKM tree | 7
k | branching factor of HKM | 10
W | no. of words for BoW | 10^6
L | length of ranked lists | 100
T_mh | no. of tables for Min-Hash | 50
H_mh | no. of hash functions for Min-Hash | 1
B_mh | no. of buckets in Min-Hash | 10^6
Table 1: Parameter definitions and typical values
• BoW methods take one order of magnitude less storage than FR methods, due to the fact that we don’t need to store the
feature vectors
• Run-time for Kd-trees and Min-hash grows very slowly with the database size.
• Inverted file and LSH methods have asymptotically similar run-times. The theoretical run time increases linearly with the number of images. However, in practice, inverted files have a much smaller run time than the average case depicted above.
3.2.2 Parallelization
The parallelization considered here is the simplest: for every method, we determine the minimum number of machines, M,
that can fit the storage required in their main memory, assuming machines with 50GB of memory. Then we split the images
evenly across these machines and each will take a copy of the probe image and search its own share of images. Finally, all
the machines will communicate their ranked list of images (of length L) and produce a final list of candidate matches that is
further geometrically checked.
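The final step of this simple scheme amounts to merging M sorted score lists of length L. A sketch, assuming each machine returns a descending list of (score, image_id) pairs over its own disjoint share of images (names are our own):

```python
import heapq
import itertools

def merge_ranked_lists(per_machine, L):
    """Merge M descending (score, image_id) lists into one final shortlist.
    Each machine searched a disjoint share of images, so ids never clash."""
    merged = heapq.merge(*per_machine, reverse=True)  # inputs already sorted
    return [img for _, img in itertools.islice(merged, L)]
```

The merge touches at most L(M-1) comparisons, which is the L(M-1) term appearing in the parallel cost formulas of table 2b.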
More sophisticated parallelization techniques are possible, that can take advantage of the properties of the method at hand.
For example, in the case of Kd-trees, one such advanced approach is to store the top part of the Kd-tree on one machine and divide the bottom subtrees across other machines (see fig. 1b). For 10^12 features, we have 40 levels in the Kd-tree, so we can store up to 30 levels in one root machine and split the bottom 10 levels evenly across the leaf machines.
Given a probe image, we first query the root machine and get the list of branches (and hence subtrees) that we need to search
with the backtracking. Then the appropriate leaf machines will process these query features and update their ranked list of
images. This approach has the advantage of significant speed up and better utilization of the machines at hand, since not all
the machines will be working on the same query feature at the same time, rather they will be processing different features
from the probe image concurrently, see figure 1c. However, it also has some drawbacks: 1) the geometric verification step is
more complicated as the node machines will not, in general, have all the features of any particular image, and will have to
collect these features from other node machines; 2) the root machine will become a bottleneck for processing input images,
but that can be alleviated by adding more replicas of the root machine to process more features concurrently; 3) some features
will be stored more than once when using more than one kd-tree, since, in general, features will not fall into the subtrees of
one particular leaf machine.
Figure 1b shows a schematic of the advanced parallelization scheme for Kd-Trees, while figure 1c shows the time per image vs. the minimum number of machines M for different dataset sizes from 10^6 to 10^9 images. We note the following:
• The advanced parallelization method provides significant speed ups starting at 10^8 images. It might seem paradoxical that increasing the dataset size decreases the run time; however, it makes sense once we notice that adding more machines allows us to interleave the processing of more features concurrently, and thus allows faster processing. This, however, comes at the cost of more storage, see figure 1c.
• We also note that many of the FR methods have parameters that affect the trade off between run time and accuracy,
e.g. number of hash functions or bin size for LSH. These parameters have to be tuned for every database size under
consideration, and we should not use the same settings when enlarging the database. This poses a problem for these
methods, which have to be continuously updated, as opposed to Kd-trees which adapt to the current database size and
need very little tuning.
Method | Storage (bytes) | Ex. (TB) | Comp. (FLOP/im) | Ex. (GFLOP/im)
Exhaustive | (sd + 4)IF | 132 | F^2 I(2d + 1) | 256×10^6
Kd-Tree | IF(sd + 4 + 2T_kdt + T_kdt log2(IF)/8) | 160 | B_kdt F(2d + 1 + log2(FI)) | 0.074
LSH-L2 | IF(sd + 4 + T_lsh log2(IF)/8) | 152 | F T_lsh (H_l2(2d + 2) + (FI/B_lsh)(2d + 1)) | 1028
LSH-Sim | IF(sd + 4 + T_lsh log2(IF)/8) | 152 | F T_lsh (H_sph(2d^2 + 3d) + (FI/B_lsh)(2d + 1)) | 1028
LSH-Orth | IF(sd + 4 + T_lsh log2(IF)/8) | 152 | F T_lsh (H_sph(2d^2 + 3d) + (FI/B_lsh)(2d + 1)) | 1028
HKM | IF(sd + 4) + ((k^D − 1)/(k − 1)) ksd | 132 | FD(2d + k) + (F^2 I/k^D)(2d + 1) | 25.7
Inverted File | Wsd + FI(5 + log2(I)/8) | 9 | F B_kdt (2d + 1 + log2 W) + F(2 + IF/W) | 1
Min-Hash | Wsd + FI(4 + log2(W)/8) + T_mh I log2(I)/8 | 7 | F B_kdt (2d + 1 + log2 W) + 4F T_mh H_mh + T_mh I/B_mh | 0.07
(a) Theoretical storage requirement (Storage) and computational cost on a single machine with infinite memory (Comp.)
Method | Parallel (FLOP/im) | Ex. (GFLOP/im)
Exhaustive | (F^2 I/M)(2d + 1) + L(M − 1) | 128×10^3
Kd-Tree | F B_kdt (2d + 1 + log2(FI/M)) + L(M − 1) | 0.071
Kd-Tree-adv | F B_kdt log2(C/(4T_kdt)) + (F B_kdt/M)(2d B_kdt + B_kdt) + F log2(4FIT_kdt/C) + L(min(F B_kdt, M) − 1) | 0.012
LSH-L2 | F T_lsh (H_l2(2d + 2) + (FI/(M B_lsh))(2d + 1)) + L(M − 1) | 85
LSH-Sim | F T_lsh (H_sph(2d^2 + 3d) + (FI/(M B_lsh))(2d + 1)) + L(M − 1) | 85
LSH-Orth | F T_lsh (H_sph(2d^2 + 3d) + (FI/(M B_lsh))(2d + 1)) + L(M − 1) | 85
HKM | FD(2d + k) + (F^2 I/(M k^D))(2d + 1) + L(M − 1) | 0.021
Inverted File | F B_kdt (2d + 1 + log2 W) + F(2 + IF/(MW)) | 0.075
Min-Hash | F B_kdt (2d + 1 + log2 W) + 4F T_mh H_mh + T_mh I/B_mh | 0.07
(b) Theoretical parallel computational cost with the minimum required number of machines M. Kd-Tree-adv is the advanced parallelization scheme, see fig. 1b.
Table 2: Theoretical Scaling Properties. Refer to figure 1.
4 Evaluation Details
4.1 Datasets
We have two kinds of datasets:
1. Distractors: A set of images that constitute the bulk of the database to be searched. It represents the clutter, or distractor
images that the algorithm must go through in order to find the right image. In the actual setting, this would include all
the objects of interest e.g. book covers, CD covers, ... etc.
2. Probe: A set of labeled images used for benchmarking purposes. This has two types of images per object:
(a) Model Image: represents the ground truth image to be retrieved for that object
(b) Probe Images: used for querying the database, representing the object in the model image from different view
points, lighting conditions, scales, ... etc.
4.1.1 Distractor Datasets
• D1: Caltech-Covers A set of ~ 100K images of CD/DVD covers used in [1].
• D2: Flickr-Buildings A set of ~1M images of buildings collected from flickr.com
• D3: Image-net A set of ~400K images of “objects” collected from image-net.org, specifically images under synsets:
instrument, furniture, and tools.
• D4: Flickr-Geo A set of ~1M geo-tagged images collected from flickr.com
Figure 2 shows some examples of images from these distractor sets.
Figure 2: Example distractor images. Each row depicts a different set: D1, D2, D3, and D4, respectively.
Probe set | total | #model | #probe
P1 | 485 | 97 | 388
P2 | 750 | 125 | 525
P3 | 720 | 80 | 640
P4 | 957 | 233 | 724
Table 3: Probe Sets
4.1.2 Probe Sets
• P1: CD Covers: A set of 5×97=485 images of CD/DVD covers. The model images come from freecovers.net while
the probe images come from the dataset used in [19]. This was also used in [1].
• P2: Pasadena-Buildings A set of 6×125=750 images of buildings around Pasadena, CA from [1]. The model image
is image2 (frontal view in the afternoon), and the probe images are images taken at two different times of day from
different viewpoints.
• P3: ALOI A set of 9×80=720 images of 3D objects from the ALOI collection [13] with different illuminations and view points. We use the first 80 objects, with the frontal view of each object as the model image, and four orientations and four illuminations as the probe images.
• P4: INRIA-Holidays A set of 957 images, which forms a subset of the images from [14], with groups of at least 3 images. There are 233 model images and 724 probe images. The first image in each group is the model image, and the rest are the probe images.
Figure 3 shows some example images from these probe sets, and Table 3 summarizes their properties.
4.2 Setup
We used four different evaluation scenarios, each with a specific distractor/probe set pair. Table 4 lists the scenarios used. Evaluation was done by increasing the size of the dataset from 100, 1k, 10k, 50k, 100k, up to 400k images. For each such size, we add all the model images to the specified number of distractor images, e.g. for 1k images, we have 1000 images from the distractor set in addition to all the images in the probe set.
Performance is measured as the percentage of probe images correctly matched to their ground truth model image i.e.
whether the correct model image is the first ranked image returned. In addition, we measure performance before and after the
Figure 3: Example probe images. Each row depicts a different set: P1, P2, P3, and P4, respectively. Each row shows two
model images and 2 or 3 of its probe images. Details in Table 3.
Scenario | Distractor | Probe
1 | D1 | P1
2 | D2 | P2
3 | D3 | P3
4 | D4 | P4
Table 4: Evaluation Scenarios. See Sec. 4.2.
geometric consistency check, which is done using the RANSAC algorithm [12] to fit an affine transformation between the probe and the model images. Ranking before the geometric check is based on the number of nearest features in the image, while ranking after is based on the number of inliers of the affine transform.
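The RANSAC affine check can be sketched as follows, given matched keypoint coordinates src and dst (a simplified illustration with our own names; the threshold and iteration count are placeholders, not the values used in the experiments):

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform (2x3 matrix) mapping src -> dst."""
    X = np.hstack([src, np.ones((len(src), 1))])    # (n, 3) homogeneous coords
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)     # (3, 2)
    return A.T                                      # (2, 3): [M | t]

def ransac_affine(src, dst, iters=200, thresh=3.0, rng=None):
    """Fit an affine transform between matched keypoints and count inliers;
    the inlier count plays the role of the post-check score s'_i."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        pick = rng.choice(len(src), size=3, replace=False)  # minimal sample
        A = fit_affine(src[pick], dst[pick])
        pred = src @ A[:, :2].T + A[:, 2]
        inliers = np.linalg.norm(pred - dst, axis=1) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return int(best_inliers.sum()), best_inliers
```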
We want to emphasize the difference between the setup used here and the setup used in other "image retrieval"-like papers [14, 11, 20]. In our setup, there is only ONE correct ground truth image to be retrieved and several probe images, while in the other setting a number of images are considered correct retrievals. Our setting is like the dual of the image retrieval setting; in fact, in our probe sets we invert the roles of the model and probe images, e.g. P1 and P4 above.
We use SIFT [17] feature descriptors with hessian affine [18] feature detectors. We used the binary available from
tinyurl.com/vgg123. All experiments were performed on machines with Intel dual Quad-Core Xeon E5420 2.5GHz processor
and 32GB of RAM. We implemented all the algorithms using a mix of Matlab and Mex/C++ scripts.4
4.3 Parameter Tuning
All of the methods we compare have different parameters that affect their run time, storage, and recognition performance. We performed parameter tuning in two steps: first, quick tuning to get a set of promising parameters using a subset of probe images from scenario 1; then we ran experiments using this set of parameters for the four scenarios with sizes from 100 up to 10K images. Based on these results, we chose the settings that made the most sense in terms of recognition performance and run time. Please see the details in Appendix B. Table 5 summarizes the parameters chosen for the full benchmark.
5 Results and Discussion
Figure 4 shows the recognition performance for different dataset sizes, before and after the geometric consistency check.
Figure 5 shows recognition performance vs time for one dataset size (10K images), before and after geometric checks. We
note the following:
4 Available at vision.caltech.edu/malaa/software/research/image-search
Method | Parameters
Kd-Trees | T = 1 tree, B = 100 backtracking steps
LSH-L2 | T = 4 tables, H = 25 hash functions, bin size w = 0.25
LSH-Sim | T = 4 tables, H = 5 hash functions
LSH-Orth | T = 4 tables, H = 5 hash functions
HKM | tree depth D = 5, branching factor k = 10
Inverted File | tf-idf weighting, l2 normalization, cos distance; raw histograms, l1 normalization, l1 distance; binary histograms, l2 normalization, cos distance
Min-Hash | T = 100 tables, H = 1 hash function
Table 5: The chosen method parameters for the full experiments. See Sec. 4.3 and Appendix B.
• Full representation (FR) methods based on (approximately) matching the local features provide superior recognition
performance to bag-of-words (BoW) methods.
• BoW methods provide nearly constant search time, compared to a linear increase for most FR methods, with the exception of Kd-trees, which provide a logarithmic increase in search time (until main memory is exhausted; see the big jump in figure 4 at 50K images).
• FR methods take an order of magnitude more memory than BoW methods e.g. we can easily fit up to 400K images for
BoW while for some scenarios we can only fit up to 50K images for some FR methods.
• It is important to use diverse datasets for measuring and quantifying the performance of different methods. While the trend is similar across all four scenarios, we notice that scenarios 1 and 3 are the easiest, while scenarios 2 and 4 are much more difficult. We believe this is because images in scenarios 2 and 4 have much more clutter than those in the other two scenarios, which makes the recognition task much harder.
• We notice that BoW techniques can provide acceptable accuracy in some situations e.g. scenarios 1 & 3. This suggests
that for some applications we can use BoW methods, which have the advantage of near constant search time and less
storage requirements.
• FR methods pose a trade off between recognition rate and search time. In particular, for scenarios 2 & 4, we notice that
LSH methods are generally better than Kd-tree in terms of recognition rate but inferior in terms of search time which
rises sharply with database size. This suggests that we can trade off extra search time for better recognition rate.
• Spherical LSH methods provide comparable recognition rate to LSH-L2 method, but they provide better search time.
• Using the novel combination of l1 normalization with l1 distance in the inverted file method provides better performance
than the standard way of using tf-idf weighting with l2 normalization and dot product. Using binary histograms also
outperforms the standard tf-idf scheme, as reported in [15].
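The l1/l1 scheme fits the inverted file naturally because, for L1-normalized histograms q and h, ||q − h||_1 = Σ_w (q_w + h_w) − 2 Σ_w min(q_w, h_w) = 2 − 2 Σ_w min(q_w, h_w), so only the posting lists of words present in the query need to be traversed. A sketch (names are our own):

```python
from collections import defaultdict

def l1_normalize(hist):
    s = float(sum(hist.values()))
    return {w: c / s for w, c in hist.items()}

def build_inverted_file(histograms):
    """histograms: list of {word: count} dicts. Posting lists map each
    visual word to (image_id, normalized weight) pairs."""
    inv = defaultdict(list)
    for i, h in enumerate(histograms):
        for w, v in l1_normalize(h).items():
            inv[w].append((i, v))
    return inv

def l1_search(inv, num_images, query_hist):
    """Accumulate sum_w min(q_w, h_w) over the query's posting lists, then
    convert to L1 distances via ||q - h||_1 = 2 - 2 * overlap."""
    q = l1_normalize(query_hist)
    overlap = [0.0] * num_images
    for w, qv in q.items():
        for i, hv in inv.get(w, ()):
            overlap[i] += min(qv, hv)
    dists = [2 - 2 * o for o in overlap]
    return sorted(range(num_images), key=lambda i: dists[i])
```

Images sharing no visual word with the query keep the maximum distance of 2 and are never touched, which is what keeps the inverted file fast.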
• Kd-trees provide the best overall trade off between recognition performance and computational cost. We notice that their run time grows very slowly with the number of images while giving excellent performance. Moreover, with smart and more complicated parallelization schemes, like the one described in §3.2.2, we can achieve significant speedups when running on multiple machines. The only drawback, which is shared with all FR methods, is the larger storage requirement.
• The geometric verification step is in general not necessary, giving performance comparable to that before the check, except in the case of the Min-Hash method. This is especially true for hard scenarios, e.g. 2 & 4, where there are many noise features, i.e. features belonging to clutter around the object, such as trees around a house. These features get matched to noise features in the probe image and thus make the geometric step harder.
• We believe the two main promising directions for further research in large scale image indexing are:
1. Devising ways to reduce run time and storage requirements for FR methods. This includes developing better
features that take less storage, finding ways to store less features in the database, and compressing information of
features using projection methods (e.g. PCA, Random Projection, ... etc.).
2. Devising ways to improve the recognition performance of BoW methods. This includes finding better ways to
generate the visual words dictionaries and encoding geometric information in the BoW representation.
6 Summary
We explored the problem of indexing in large scale image databases both theoretically and experimentally on four real world benchmark datasets. We compared seven methods from the two leading approaches in terms of their storage, recognition performance, and computational cost. We reported new results for LSH and inverted file methods, and we suggested promising
directions for future research.
[Figure 4 appears here: one column of plots per scenario (1-4); the top two rows show recognition performance before and after the geometric step, and the bottom two rows show time per image (sec/image, log scale) before and after the geometric step, each vs. the number of images. Legend: kdt, lsh-l2, lsh-sim, lsh-orth, hkm, invf-none-l1-l1, invf-tfidf-l2-cos, invf-bin-l2-l2, minhash.]
Figure 4: Recognition Performance and Time Vs Dataset Size. First two rows show recognition performance before and
after the geometric step. Lower two rows show total processing time per image before and after the geometric step. Every
column represents a different experimental scenario, see tables 4 and 5.
[Figure 5 appears here: recognition performance vs. time per image (sec/image, log scale) before the geometric step (left column) and after it (right column), for a dataset size of 10K images; one row per scenario. Legend: kdt, lsh-l2, lsh-sim, lsh-orth, hkm, invf-none-l1-l1, invf-tfidf-l2-cos, invf-bin-l2-l2, minhash.]
Figure 5: Recognition Performance Vs Time. Left column shows recognition performance vs. total matching time per
image before the geometric step, while right column shows performance vs time after the geometric step for dataset size of
10K images. Each row corresponds to a different scenario, see table 4.
A Theoretical Scaling Properties
A.1 Full Representation
A.1.1 Exhaustive Search
Description
• Here we do the nearest neighbor search using exhaustive linear search through the set of database features {{f_ij}_j}_i, i.e. for every feature of the query image f_qj, we find the database feature that minimizes the Euclidean distance: n_qj = argmin_ij ||f_qj − f_ij||, where n_qj is the index of the nearest feature, and i is the database image containing this feature.
• For every database image, we count the number of such nearest neighbors s_i, and for all images with a preset number of nearest neighbors (10 in our experiments), we do the RANSAC affine step to obtain s'_i, where s'_i = # inliers of the affine transformation.
Storage The storage requirements for this method are just the full feature vectors and feature information:
• Feature vectors:
Sfv = sdIF
where s is the number of bytes per feature dimension, d is the dimension of the feature vector, I is the number of images
in the database, and F is the average number of features per image.
• Feature information:
Sfi = 4IF
where for every feature we need to store its location (x and y position in the image) in addition to its scale and orientation;
we assume each of these four values can be discretized to fit in 1 byte.
• Total:
Sex = (sd + 4)IF
• Example: for the prototypical case with
s = 1 byte/dimension
d = 128 dimensions
I = 10^9 images
F = 10^3 features/image
Sex = 132 × 10^9 × 10^3 bytes
= 132 TB
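These storage estimates are easy to recompute for other database sizes; a minimal sketch (the function name and default constants are illustrative, mirroring the example above):

```python
def exhaustive_storage_bytes(s=1, d=128, I=10**9, F=10**3):
    """Sex = (s*d + 4) * I * F: the feature vectors plus 4 bytes of
    feature information (x, y, scale, orientation) per feature."""
    return (s * d + 4) * I * F
```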
Speed
• The main bottleneck is the nearest neighbor search. For every feature, we have to search through the whole feature
database to find the nearest feature:
Tnn,ex = F(2FId + FI) = F²I(2d + 1)
where for every query feature we need to perform d multiplications and d additions per database feature, in addition to
finding the minimum value among FI values.
• Example:
Tnn,ex ≈ 256e3 TFLOP/image
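The voting scheme described above can be sketched in a few lines of NumPy (a toy sketch; the array names and the min_votes threshold are illustrative, not from any released implementation):

```python
import numpy as np

def exhaustive_match(db_feats, db_image_ids, query_feats, min_votes=10):
    """Brute-force nearest-neighbor voting: for every query feature,
    scan all database features, vote for the image containing the
    nearest one, and keep images with at least min_votes votes
    (the candidates for the RANSAC affine step)."""
    votes = {}
    for fq in query_feats:
        # squared Euclidean distance to every database feature: O(FId)
        d2 = np.sum((db_feats - fq) ** 2, axis=1)
        img = int(db_image_ids[np.argmin(d2)])
        votes[img] = votes.get(img, 0) + 1
    return {img: v for img, v in votes.items() if v >= min_votes}
```

Each query feature costs O(FId) flops, which is where the F²I(2d + 1) total above comes from.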
Parameters None
12
Parallelization
• Divide ~100 TB of data into ~2000 machines with ~50GB memory each. For every query feature, search the 2000
machines simultaneously.
• Every machine processes the images it is storing, and produces a final sorted list of geometrically checked images. This
list is then broadcast or sent to a central machine, which will merge them into one list and returns the result.
• Speedup: having more machines here corresponds to a linear speedup of the search process (in the number of images).
With M machines, each machine will perform
Tp_nn,ex = (F²I/M)(2d + 1) + L(M − 1)
where the second term is the time to merge M sorted lists of length L.
• Example:
L = 100
Tp_nn,ex = 10^3 × 128 × 10^9 + 100 × 1999
≈ 128 TFLOP/image
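The L(M − 1) merge term corresponds to combining the M per-machine ranked lists into one global list. A minimal sketch (function names are illustrative; heapq.merge performs the M-way merge lazily, so only the top-k entries are pulled):

```python
import heapq
import itertools

def merge_ranked_lists(per_machine_lists, top_k=100):
    """Merge M per-machine ranked lists of (score, image_id) pairs,
    each already sorted by descending score, into a single global
    top-k ranking -- the merge step of the parallel scheme above."""
    merged = heapq.merge(*per_machine_lists, key=lambda pair: -pair[0])
    return list(itertools.islice(merged, top_k))
```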
A.1.2 KD-Trees
Description
• A number of randomized Kd-trees are built for the database features {{fij}j}i
• The nearest neighbor search is performed by going down the set of Kd-trees using a single priority queue, and the result
is an “approximate” nearest neighbor
Storage
• We still need to store all the features and their information
Sex = (sd +4)IF
• In addition, there is storage for the trees. For a tree with n leaves, there are 2n − 1 total nodes, i.e. n − 1 internal nodes
and n leaves. Each tree here has IF leaves (features). Storage for each tree is:
Str = 2IF + IF⌈log2 IF⌉/8 = IF(2 + ⌈log2 IF⌉/8)
where the first term is for the internal nodes, where we need to store the dimension and threshold values (assumed to be
discretized to single bytes each), and the second term is for the leaves, where we need to store the index of the feature
each one contains.
• Total:
Skdt = Sex + T·Str = IF(sd + 4 + 2T + T⌈log2 IF⌉/8)
• Example:
T = 4
Skdt = 10^12 × (128 + 4 + 8 + 4 × 5)
= 160 × 10^12 bytes
= 160 TB
Speed
• Searching through the Kd-trees involves comparison operations at each level of the tree, in addition to exhaustive search
through the final list of features, whose length B (typically ~250) corresponds to the number of backtracking steps:
Tnn,kdt = FB(2d + 1 + log2 FI)
where, inside the parentheses, the first term is the distance computation, the second is for finding the minimum among
the B distance values, and the third is the number of comparisons performed going down the tree.
• Example:
B = 250
Tnn,kdt = 250 × 10^3 × (2 × 128 + 1 + log2 10^12)
≈ 250 × 10^3 × 296
≈ 74 MFLOP/image
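The priority-queue search can be sketched as follows (a single-tree, best-bin-first toy version; the split rule, leaf size, and the way the backtracking budget is counted are simplifications of the randomized multi-tree setup in the text):

```python
import heapq
import itertools
import numpy as np

_tiebreak = itertools.count()  # keeps heap entries comparable

def build_kdtree(points, indices=None, leaf_size=8):
    """Recursively split on the highest-variance dimension at the median."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) <= leaf_size:
        return ("leaf", indices)
    dim = int(np.argmax(points[indices].var(axis=0)))
    vals = points[indices, dim]
    thresh = float(np.median(vals))
    left, right = indices[vals <= thresh], indices[vals > thresh]
    if len(left) == 0 or len(right) == 0:  # degenerate split
        return ("leaf", indices)
    return ("split", dim, thresh,
            build_kdtree(points, left, leaf_size),
            build_kdtree(points, right, leaf_size))

def bbf_search(root, points, q, max_checks=250):
    """Best-bin-first search: descend to a leaf, push the 'far' child of
    every split onto a priority queue keyed by its boundary distance,
    and backtrack from the queue until the check budget runs out."""
    best_idx, best_d2 = -1, float("inf")
    heap = [(0.0, next(_tiebreak), root)]
    checks = 0
    while heap and checks < max_checks:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d2:  # this cell cannot contain a closer point
            continue
        while node[0] == "split":
            _, dim, thresh, left, right = node
            diff = q[dim] - thresh
            near, far = (left, right) if diff <= 0 else (right, left)
            heapq.heappush(heap, (diff * diff, next(_tiebreak), far))
            node = near
        for i in node[1]:  # exhaustive scan inside the leaf
            d2 = float(np.sum((points[i] - q) ** 2))
            checks += 1
            if d2 < best_d2:
                best_idx, best_d2 = int(i), d2
    return best_idx
```

With an unlimited budget the search becomes exact (the boundary distance is a valid lower bound for the pruning test); a small max_checks gives the approximate behavior analyzed above.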
Parameters There are two parameters that affect the speed, accuracy, and storage of the Kd-trees:
• Number of trees T: having more trees increases accuracy and storage, without much effect on speed thanks to the single
search queue.
• Number of backtracking steps B: this poses a trade-off between accuracy and speed. Having more steps will give more
accuracy but takes more time.
Parallelization There are two ways to parallelize Kd-tree operations. The first is a straightforward extension of the exhaus-
tive search case:
• Divide ~160 TB of data into ~3200 machines with ~50GB memory each. For every query feature, search the machines
simultaneously.
• Combine the search results as for the exhaustive case, by having a local sorted list of images, and then broadcasting
these lists to get a global sorted list, which is then geometrically verified.
• Speedup: having more machines here corresponds to a logarithmic speedup of the search process. With M machines,
each machine will perform
Tp_nn,kdt = F(2dB + B + B log2(FI/M)) + L(M − 1)
where L is the size of the sorted list on each machine, and the second term is the time required to merge M sorted lists
of size L.
• Example:
B = 250
Tp_nn,kdt = 10^3 × (64 × 10^3 + 7.5 × 10^3) + 100 × 3199
≈ 71 MFLOP/image
The second is:
• Instead of having independent Kd-trees in each machine, there will be a single (or multiple) root machines holding
copies of the top part of the Kd-trees. For example, with 50 GB of memory available, we can fit 25 billion internal nodes.
Therefore, for storing 4 trees, we can keep ~6 billion nodes per tree, i.e. ~3 billion leaves, which corresponds to storing
the top log2(3 × 10^9) ≈ 30 levels of each of the 4 trees.
• The rest of the trees are then divided among the remaining ~3200 machines (called leaf machines), which also store the
feature vectors and information.
• The root machine(s) get each query feature and compute an initial list of nodes to explore based on the top of the tree.
The feature is then dispatched to all leaf machines holding those parts of the trees. Once a leaf machine gets a feature,
it handles backtracking within its own subtrees and updates its ranked list of images.
14
• Speedup: we have two parts for the running time, one in the root machine and one in the leaf machines:
Tp_nn,kdt,r = FB log2(C/(2 × 2T))
Tp_nn,kdt,l = (FB/M)(2dB + B) + F log2(4FIT/C) + L(min(FB, M) − 1)
where C is the memory capacity of each machine. Tp_nn,kdt,r is the time to search the root machine, while Tp_nn,kdt,l
is the time for the leaf machines. In the latter, the first term is the time to search all F image features concurrently,
where each feature is sent to at most B machines (so in the worst case, where every backtracking step goes to a different
leaf machine, FB machines process them); within it, the 2dB part is the distance computations and the B part the
minimum-finding. The second term is the time to go down the subtrees, excluding the top part held by the root machine,
and the last term is the time to merge the min(FB, M) sorted lists (the number of machines processing the image).
• Example:
Tp_nn,kdt,2 = Tp_nn,kdt,r + Tp_nn,kdt,l
= 10^3 × 250 × 30 + (10^3 × 250 / 3200) × (256 × 250 + 250)
= 7.5 × 10^6 + 5 × 10^6
≈ 12.5 MFLOP/image
• The motivations for this distinction between root and leaf machines are:
– most of the storage for the tree is in the leaves. Therefore we can store most of the top of the tree in a single
machine, which will then dispatch jobs to the node machines.
– the time spent searching the top of the tree is smaller than that spent searching the lower parts, so we gain more
speedup by splitting the bottom part of the tree instead of splitting the whole tree.
– searching the tree will generally not activate all of the leaf machines, i.e. the list of explored nodes will not span
all of them. Therefore, the root machine can process more features, and leaf machines can interleave computations
for different features. This offers more speedup over the first parallelization approach above: even though the
worst-case search time per feature is similar, interleaving makes the effective time much smaller.
• However, a drawback of this approach is the extra storage required. The lower parts of different trees will generally not
hold the same features, so each machine has to store the features contained in the subtrees it holds; as an upper bound,
the extra storage is proportional to the number of Kd-trees, i.e. we may need to store each feature T times, where T is
the number of trees.
A.1.3 Locality Sensitive Hashing
Description
• A number of LSH tables are computed, where each consists of a number of hash functions
• The nearest neighbor search is performed by computing the hash value for the query feature in all tables, and then
processing the returned list exhaustively
Storage
• We still need to store all the features and their information
Sex = (sd +4)IF
• In addition, there is storage for the tables. In each table, we need to store the hash functions, in addition to the indices
of all the features available:
Sta = IF⌈log2 IF⌉/8 + Sh
where Sh is the storage for the hash functions, which depends on the specific hash function used and is negligible
compared to the first term.
• Total:
Slsh = Sex + T·Sta = IF(sd + 4 + T⌈log2 IF⌉/8)
• Example:
T = 4
Slsh = 10^12 × (128 + 4 + 4 × 5)
= 152 × 10^12 bytes
= 152 TB
Speed
• Searching through the LSH index involves computing the hash function values, and then exhaustively checking the
features that share the same hash value. The time can be represented as
Tnn,lsh = F(Th + (2d + 1)(FI/B)T)
where the first term is the time to compute the hash functions:
– L2: Th,l2 = T H(2d + 2) where we have H hash functions, and for each we need to compute dot product (one
addition and multiplication per dimension) plus another addition and division.
– Spherical-Simplex: Th,sim = TH(2d(d + 1) + d) = TH(2d² + 3d), where we need to compute the product of a
(d+1)×d matrix with a d-dimensional vector, and then find the closest vertex.
– Spherical-Orthoplex: Th,orth = TH(2d² + 2d) = 2TH(d² + d), where we need to multiply a d×d matrix with a
d-dimensional vector, and then choose the nearest of the 2d vertices, knowing that each vertex is just the opposite
in sign of another.
• The second term above is the time for the exhaustive nearest neighbor check within the buckets. It depends on the size
of the LSH bucket, which is not constant: it grows linearly with the total number of features, and varies considerably
even for a fixed number of features. We assume a preset number of unique bucket values B, so the average bucket size
is FI/B.
• Example:
H = 50
T = 4
B = 10^6
Tnn,lsh,l2 = F(TH(2d + 2) + (FI/B)(2d + 1)T)
= 10^3 × (200 × (2 × 128 + 2) + 10^6 × (2 × 128 + 1) × 4)
= 51.6 × 10^6 + 1028 × 10^9
≈ 1028 GFLOP/image
H = 5
Tnn,lsh,sim = F(TH(2d² + 3d) + (FI/B)(2d + 1)T)
= 10^3 × (20 × (2 × 128² + 3 × 128) + 1028 × 10^6)
= 663 × 10^6 + 1028 × 10^9
≈ 1028 GFLOP/image
Tnn,lsh,orth = F(2TH(d² + d) + (FI/B)(2d + 1)T)
= 10^3 × (40 × (128² + 128) + 1028 × 10^6)
= 660 × 10^6 + 1028 × 10^9
≈ 1028 GFLOP/image
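For concreteness, the L2 hash family above, h(x) = ⌊(a·x + b)/w⌋, can be sketched as follows (a toy single-table version; the class name, the bucket dictionary, and the defaults for H and w are illustrative):

```python
import numpy as np

class L2LSHTable:
    """One LSH table with H concatenated L2 hash functions
    h(x) = floor((a.x + b) / w). The H-tuple of hash values is the
    bucket key; query candidates are the features sharing the query's
    bucket, which are then checked exhaustively."""
    def __init__(self, d, H=50, w=0.25, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((H, d))   # random projection directions
        self.b = rng.uniform(0.0, w, size=H)   # random offsets in [0, w)
        self.w = w
        self.buckets = {}

    def _key(self, x):
        return tuple(np.floor((self.A @ x + self.b) / self.w).astype(int))

    def insert(self, idx, x):
        self.buckets.setdefault(self._key(x), []).append(idx)

    def query(self, x):
        return self.buckets.get(self._key(x), [])
```

Computing a key costs roughly H(2d + 2) flops per table, matching the Th,l2 estimate above.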
Parameters There are generally two parameters for any LSH method:
• T the number of tables. Generally, having more tables improves performance but increases run time as well (in case of
variable B checks).
• H the number of hash functions. Having more functions increases run time.
However, L2 LSH includes one more parameter, which is w the bin size for the projection.
Parallelization There are two ways to parallelize LSH operations. The first is a straightforward extension of the exhaustive
search case:
• Divide ~152 TB of data into ~3000 machines with ~50GB memory each. For every query feature, search the machines
simultaneously.
• Combine the search results as for the exhaustive case
• Speedup: having more machines here corresponds to a linear speedup. With M machines, each machine will perform
Tp_nn,lsh ≈ (F²I/(MB))T(2d + 1) + L(M − 1)
• Example:
Tp_nn,lsh,l2 = FTH(2d + 2) + (F²I/(MB))T(2d + 1) + L(M − 1)
= 51.6 × 10^6 + 85 × 10^9 + 3 × 10^5
≈ 85 GFLOP/image
The second is similar to Kd-trees:
• Instead of having independent LSH Indexes in different machines, there will be a central machine that computes all the
hash functions, and has a list of which machines store which buckets.
• The rest of the machines will store features that belong to a subset of buckets.
• The central machine(s) will get query features, compute the hash values (buckets), and then dispatches that feature to
machines containing that bucket.
• Speedup: we have two parts for the running time, one in the central machine and one in the node machines:
Sm2,c = F·Th,x
Sm2,n = (F/T)(2d + 1)(FI/B)
where Sm2,c is the time to compute the hash values on the central machine (Th,x being the hash-computation time of
the chosen scheme x), while Sm2,n is the time taken by the node machines to search their buckets. This method works
well if Sm2,c is much smaller than Sm2,n; the central machine becomes a bottleneck if the two times are comparable.
A.1.4 Hierarchical K-Means (HKM)
Description HKM builds a hierarchical clustering tree of the features. At each node, k-means is performed on the features
in that node, and the features are then split among the child nodes according to their distance to the cluster centers. This
continues until the maximum depth is reached.
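A toy version of this construction, and of the tree descent used to quantize a feature at query time (plain Lloyd iterations with a random initialization; the branching factor, depth, and iteration count are illustrative):

```python
import numpy as np

def hkm_build(feats, k=10, depth=3, rng=None):
    """Hierarchical k-means: cluster this node's features into k groups,
    then recurse into each group until the maximum depth is reached."""
    if rng is None:
        rng = np.random.default_rng(0)
    if depth == 0 or len(feats) <= k:
        return ("leaf", feats)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(10):  # a few Lloyd iterations
        assign = np.argmin(((feats[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = feats[assign == c].mean(axis=0)
    children = [hkm_build(feats[assign == c], k, depth - 1, rng)
                for c in range(k)]
    return ("node", centers, children)

def hkm_quantize(tree, x):
    """Descend the tree, picking the nearest center at each level:
    roughly D(2d + k) flops per feature, as in the speed estimate below."""
    path = []
    while tree[0] == "node":
        _, centers, children = tree
        c = int(np.argmin(((centers - x) ** 2).sum(-1)))
        path.append(c)
        tree = children[c]
    return path  # the branch sequence identifies the leaf (visual word)
```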
Storage
• We still need to store all the features and their information
Sex = (sd +4)IF
• In addition, there is storage for the clustering tree. For a tree with depth D and branching factor k, there are k^D leaf
nodes, which hold the actual features. The number of internal nodes is
n = ∑_{i=0}^{D−1} k^i = (k^D − 1)/(k − 1)
and for each internal node we need to store the k cluster centers, so the total storage is:
Shkm = ((k^D − 1)/(k − 1)) × ksd + (sd + 4)IF
• Example:
k = 10 branches
D = 7
Shkm = 1.1 × 10^6 × 10 × 128 + 132 × 10^12
= 1.42 × 10^9 + 132 × 10^12
≈ 132 TB
Speed
• At each level of the clustering tree, we need to find the nearest cluster, in addition to finding the nearest feature in the
leaf node:
Tnn,hkm = F(D(2d + k) + (FI/k^D)(2d + 1))
where the second term comes from searching the features in the leaf node, assuming each leaf node will have the same
share of features.
• Example:
k = 10
D = 7
Tnn,hkm = 10^3 × (7 × 266 + 10^5 × 257)
≈ 25.7 GFLOP/image
Parameters There are two parameters that affect the amount of storage and running time:
• D the maximum depth of the tree.
• k the branching factor of the tree.
Parallelization There are two ways to parallelize HKM operations. The first is a straightforward extension of the exhaustive
search case:
• Divide ~132 TB of data into ~2650 machines with ~50GB memory each. For every query feature, search the machines
simultaneously.
• Combine the search results as for the exhaustive case.
• Speedup: having more machines here reduces the search time. With M machines, each machine will perform
Tp_nn,hkm = FD(2d + k) + (F²I/(Mk^D))(2d + 1) + L(M − 1)
• Example:
Tp_nn,hkm = 10^3 × 7 × 266 + 3.77 × 10^4 + 100 × 2649
≈ 2.1 MFLOP/image
The second is similar to Kd-trees:
• Instead of having independent HKM trees in different machines, there will be a central machine that stores the upper
part of the HKM tree (internal nodes).
• The rest of the machines, node machines, will store subsets of the leaf nodes and their features.
• The central machine(s) get the query features, go through the top of the tree, and then dispatch each feature to the
appropriate node machines.
• Speedup: we have two parts for the running time, one in the central machine and one in the node machines:
Sm2,c = D(2d + k)
Sm2,n = 2d × FI/k^D
where Sm2,c is the time to go through the top of the HKM tree on the central machine, while Sm2,n is the time taken by
the node machines to search the leaf nodes.
A.2 Bag-of-Words (BoW) Representation
In this approach, we only store histogram counts of visual words for each image, in addition to the feature information
(location).
A.2.1 Inverted File
Description The inverted file is a data structure designed to allow fast search through a large index of documents represented
by histograms of word occurrences. Instead of a "forward" file, which stores for every document the list of words it contains,
the opposite is done: a list is stored for each word, listing the documents that contain it. This makes search through the index
very fast, because only documents sharing words with the query are considered during the search.
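A minimal sketch of the structure and of its query loop (raw-count scoring only; the weighting, normalization, and distance options discussed later are omitted, and the function names are illustrative):

```python
from collections import defaultdict

def build_inverted_file(image_words):
    """Map each visual word to a list of (image_id, count) pairs,
    instead of a per-image 'forward' list of words."""
    inv = defaultdict(list)
    for img, words in image_words.items():
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        for w, c in counts.items():
            inv[w].append((img, c))
    return inv

def query_inverted_file(inv, query_words):
    """Score only the images sharing at least one word with the query."""
    scores = defaultdict(int)
    for w in set(query_words):
        for img, c in inv.get(w, []):
            scores[img] += c  # raw-count similarity; real systems weight/normalize
    return sorted(scores.items(), key=lambda pair: -pair[1])
```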
Storage
• We need to store the feature information (e.g. location) to allow the spatial consistency step afterwards
Sfi = 4FI
• We also need storage for the inverted file:
Sif = Wsd + FI(log2(I)/8 + 1)
where the first term is the storage for the W words, and the second term is the storage for the lists of images: we have
a total of FI features, and for each we need to store the image id and the feature count.
• For large dictionaries of visual words (~1M words), image histograms tend to be binary, i.e. there is nearly a one-to-one
mapping between features and visual words, so we do not need to store counts:
Sif,bin = Wsd + FI·log2(I)/8
• Total:
St = Sif + Sfi
= 4FI + Wsd + FI(log2(I)/8 + 1)
= Wsd + FI(5 + log2(I)/8)
St,bin = Wsd + FI(4 + log2(I)/8)
• Example:
W = 10^6
F = 10^3
I = 10^9
St = 10^6 × 128 + 10^12 × (5 + 4) ≈ 9 TB
St,bin ≈ 8 TB
Speed There are two components for the run time
• Time for computing the visual words for every feature, which is usually done with either Kd-trees or with HKM:
Tvw,kdt = 2dB + B + B log2 W = B(2d + 1 + log2 W)
Tvw,hkm = D(2d + k)
• Time for searching the inverted file, which is hard to estimate precisely because it depends on the amount of overlap
between the query image and the database images. Assuming image features are distributed evenly among the visual
words, each query feature will encounter roughly IF/W images per word, so the time would be:
Tif = 2 + min(IF/W, I)
where the first term is for normalizing the histogram entry, and the second is for computing the distance function, e.g.
L2 distance, against each of the images encountered.
• Total time:
Tbow,if,kdt = F(B(2d + 1 + log2 W) + 2 + min(IF/W, I))
Tbow,if,hkm = F(D(2d + k) + 2 + min(IF/W, I))
• Example:
B = 250
W = 10^6
Tbow,if,kdt = 10^3 × (250 × (257 + 20) + 2 + 10^12/10^6)
≈ 1 GFLOP/image
D = 6
k = 10
Tbow,if,hkm = 10^3 × (6 × (256 + 10) + 2 + 10^12/10^6)
≈ 1 GFLOP/image
Parameters There are a number of parameters:
• W the number of visual words in the dictionary. Generally, the larger the dictionary, the better the performance and the
faster the actual search, because the number of overlapping images is smaller.
• The weighting of the histogram counts:
– none: use raw histogram counts
– binary: binarize the histogram counts
– tf-idf: use the term frequency-inverse document frequency weighting to emphasize more informative words
• The normalization of the histograms:
– none: use un-normalized histograms
– L1: use L1 normalization, i.e. ∑i hi = 1
– L2: use L2 normalization, i.e. ∑i h²i = 1
• The distance function:
– L1: dl1(hi, hj) = ∑k |hik − hjk|
– L2: dl2(hi, hj) = √(∑k (hik − hjk)²)
– Cosine: dcos(hi, hj) = 1 − ∑k hik·hjk
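The standard {tf-idf, l2, cos} combination above can be sketched as follows (a toy dense-matrix version; a real inverted file keeps everything sparse, and the exact idf variant differs across implementations):

```python
import numpy as np

def tfidf_histograms(H):
    """tf-idf weighting followed by L2 normalization of word histograms.
    H: (num_images, num_words) matrix of raw visual-word counts."""
    tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1)   # term frequency
    df = np.count_nonzero(H, axis=0)                       # document frequency
    idf = np.log(H.shape[0] / np.maximum(df, 1))           # inverse doc. frequency
    W = tf * idf
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)

def cosine_distance(h1, h2):
    # d_cos(h1, h2) = 1 - sum_k h1k*h2k, for L2-normalized histograms
    return 1.0 - float(h1 @ h2)
```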
Parallelization Parallelization is straightforward:
• Divide the storage of 9 TB among 180 machines, with about ~50GB each.
• Given a query image, all machines are searched simultaneously to get a ranked list of images in that machine
• Broadcast the top image per machine, and each machine will compile a list of 100 images
• The top 100 images are searched for spatial consistency by the machines that are storing them, and the results are again
broadcast to get a final list of 100 ranked images
• The results are returned to the user
• Speedup: the time to search in parallel in the different machines will be smaller, but not as much as the theoretical
number above suggests. This is because these calculations assume the worst case run time, which is not usually the
case.
Tbow,if,kdt = F(B(2d + 1 + log2 W) + 2 + min(IF/(MW), I/M))
Tbow,if,hkm = F(D(2d + k) + 2 + min(IF/(MW), I/M))
A.2.2 Min-Hash
Description The min-hash is a type of LSH. The histograms are binarized, i.e. each image is represented as a "set" of visual
words without word counts. The hash function is simply the minimum of a random permutation π of the numbers 0, ..., W − 1
over that set: h(bi) = min π(bi).
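A sketch of the hash and of the resulting similarity estimate (the prime P and the linear near-permutation are illustrative choices; any cheap approximate permutation of 0, ..., W − 1 works):

```python
import random

def minhash_signature(word_set, T=50, seed=0):
    """T min-hashes of a set of visual-word ids: each hash is the minimum
    of pi(x) = (a*x + b) mod P over the set, with pi approximating a
    random permutation of the dictionary (P is a prime larger than W)."""
    rng = random.Random(seed)
    P = 1_000_003
    sig = []
    for _ in range(T):
        a, b = rng.randrange(1, P), rng.randrange(P)
        sig.append(min((a * w + b) % P for w in word_set))
    return sig

def minhash_similarity(s1, s2):
    """Fraction of agreeing hashes: an estimate of the Jaccard overlap."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)
```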
Storage
• We need to store the feature information (e.g. location) to allow the spatial consistency step afterwards
Sfi = 4FI
• We also need storage for the LSH tables:
Slsh,mh = Wsd + FI·log2(W)/8 + T × (log2(I)/8 + Sh)
where the first term is the storage for the W words, the second is for storing the sparse histograms, and the third is the
storage for the T LSH tables, where for each we need to store indices to the images in the different buckets, plus the
negligible Sh for the hash functions themselves.
• Total:
St = Slsh,mh + Sfi
= Wsd + IF·log2(W)/8 + T × (log2(I)/8 + Sh) + 4FI
= Wsd + FI(4 + log2(W)/8) + T·log2(I)/8
• Example:
W = 10^6
F = 10^3
I = 10^9
T = 50
St = 10^6 × 128 + 10^12 × (4 + 3) + 50 × 4
≈ 7 TB
Speed There are two components for the run time
• Time for computing the visual words for every feature, which is usually done with either Kd-trees or with HKM:
Tvw,kdt = F(2dB+B+B log2 W ) = FB(2d +1+ log2 W )
Tvw,hkm = FD(2d + k)
• Time for computing the hash functions and for the exhaustive check of images within the bucket:
Tmh = FTH × (3 + 1) + TI/B
where the first term is for computing the hash functions (the permutation is computed as π(x) = ax + b mod W,
requiring three operations, plus one comparison for the minimum), and the second term is for computing the distance
to the images within the bucket, assuming the I images are evenly distributed over B unique buckets (I/B per bucket).
• Total time
Tbow,mh,kdt = FB(2d + 1 + log2 W) + 4FTH + TI/B
Tbow,mh,hkm = FD(2d + k) + 4FTH + TI/B
• Example:
B = 10^6
W = 10^6
T = 50
H = 1
Tbow,mh,kdt = 10^3 × 250 × (257 + 20) + 200 × 10^3 + 50 × 10^9/10^6
≈ 70 MFLOP/image
D = 6
k = 10
Tbow,mh,hkm = 10^3 × (6 × (256 + 10) + 200 + 50)
≈ 2 MFLOP/image
Parameters There are a number of parameters:
• W the number of visual words in the dictionary. Generally, the larger the dictionary, the better the performance and the
faster the actual search, because the number of overlapping images is smaller.
• T the number of hash tables
• H the number of hash functions
Parallelization Parallelization is straightforward:
• Divide the storage of 7 TB among 140 machines, with about ~50GB each.
• Given a query image, all machines are searched simultaneously to get a ranked list of images in that machine
• Broadcast the top image per machine, and each machine will compile a list of 100 images
• The top 100 images are searched for spatial consistency by the machines storing them, and the results are again
broadcast to get a final list of 100 ranked images
• The results are returned to the user
• Speedup: computations will be faster with more machines, as the number of images per bucket goes down:
Tmh = FTH × (3 + 1) + TI/(BM)
B Parameter Tuning
Some of the algorithms have a large number of parameter combinations that greatly affect their performance, in terms of
trading off speed and recognition performance. In order to get a quick feel for which combinations of parameters to try out, we
performed some quick experiments to estimate the performance of various combinations, in terms of speed and recognition
rate. The tuning is done with a random subset of 100 images from Scenario 1. The matching rate is the percentage of nearest
neighbor features that are actually in the ground truth model image. Based on these quick assessments, we chose certain
combinations and performed further experiments on all the datasets.
B.1 Kd-Tree
For Kd-trees we have two parameters: T the number of trees and B the number of backtracking steps. Figure 6(a) shows
results for trying 1, 4, and 8 trees with 50, 100, 250, and 500 backtracking steps. Using 8 trees takes much more memory with
minimal gain in matching rate, while using 50 or 500 backtracking steps takes too much time. Based on the plot, we decided
to go forward with using 1 and 4 trees with 100 and 250 backtracking steps. Figure 6(b) shows results for running these
parameters on the four scenarios. We chose the combination of {T,B} = {1,100} as the representative from this method in
later comparisons.
We notice the following: 1) we generally do not need the geometric consistency check, as the recognition performance
before it is very comparable to the performance after; 2) the geometric check time generally decreases with increasing
database size. This is intuitive: as the number of features in the database grows, the number of matches for the distractor
images goes down, and most of them have too few matches to undergo the geometric step.
B.2 Lsh
B.2.1 Lsh-L2
Here we have three parameters: T the number of tables, H the number of hash functions, and w the bin size. We tried different
combinations of 1, 4, or 8 tables; 5, 10, 25, 50, or 100 hash functions; and bin sizes of 0.1, 0.25, 0.5, or 1. Based on
results in figure 7(a), we decided to use (T,H,w) = {4,10,0.1}, {4,25,0.25}, {4,50,0.5}, and {4,100,1} for the rest of the
experiments, which are the points at the knee of the plot, and represent the most promising combinations. Those results are
shown in figure 7(b). We notice that they provide very similar recognition performance and searching time. We chose the
combination (T,H,w) = {4,25,0.25} as the representative from this method in later comparisons.
(a) Quick tuning for Kd-trees
(b) Results for selected Kd-trees parameters on 4 scenarios
Figure 6: Kd-trees Parameter Tuning
(a) Quick tuning for Lsh-L2
(b) Results for selected parameters for LSH L2
Figure 7: Lsh L2 Parameter Tuning
B.2.2 Lsh Spherical Simplex
Here we have two parameters: T the number of tables, and H the number of hash functions. We tried using 1 and 4 tables with
3, 5, or 7 hash functions. Results are in figure 8(a). We then decided to use 4 tables with 5 and 7 hash functions, whose results
are in figure 8(b). We notice that the recognition performance of using 5 functions is consistently better, while having similar
search time. We hence chose the combination {T,H} = {4,5} as the representative from this method in later comparisons.
B.2.3 Lsh Spherical Orthoplex
Here we have two parameters: T the number of tables, and H the number of hash functions. We tried using 1 and 4 tables
with 3, 5, or 7 hash functions. Results are in figure 9(a). We then decided to use 4 tables with 5 and 7 hash functions, whose
results are in figure 9(b). Again, we notice that the recognition performance of using 5 functions is consistently better, while
having similar search time. We hence chose the combination {T,H} = {4,5} as the representative from this method in later
comparisons.
B.3 Hierarchical K-Means
HKM has two parameters: D the depth of the tree and k the branching factor. We tried depths of 5, 6, and 7 with a branching
factor of 10. Results for quick tuning are in figure 10(a), and full results are in figure 10(b). From the quick tuning, all the
combinations are quite similar, so we ran them on the four scenarios. We notice that they have similar performance, with a
depth of 5 yielding slightly better results. We chose the combination {D,k} = {5,10} as the representative of the method in
later comparisons.
B.4 Inverted File
We have several parameters:
• The number of words in the dictionary. We try using 10^4, 10^5, and 10^6 words.
• The way to generate the dictionary. We compare two popular approaches: a vocabulary tree built with the Hierarchical
K-Means procedure (HKM) [19], and a flat dictionary built using Approximate K-Means, which employs a set of random
Kd-trees for the nearest neighbor step (AKM) [20]. The dictionaries are built using features from 10^5 images from each
distractor set.
• Weighting the histogram, normalization of the histogram, and distance function. We used the following combinations:
{w,n,d}={none,l1,l1}, {none,l2,l2}, {tf-idf,l2,cos}, {bin,l1,l1}, and {bin,l2,l2}.
Results of tuning are shown in figure 11. From these results, we decided to go ahead with three combinations: {tf-idf,l2,cos},
{bin,l2,l2}, and {none,l1,l1}. The first is the standard way in the literature to use inverted file search. The second
has also been shown to work competitively with larger dictionaries [15], while the third is a novel combination. We note that
increasing the number of words in the dictionary enhances performance, and that HKM is a bit worse than AKM, and so we go
ahead and use the AKM dictionary with 1M words. Figure 12 shows results of these chosen parameters on the four scenarios.
We notice several points:
• The novel combination {none,l1,l1} is comparable to or outperforms the other two combinations in most scenarios,
especially before the geometric step.
• Using binary histograms generally outperforms the standard tf-idf weighting of the histograms.
• Results before the geometric consistency checks are generally better than results afterwards. We believe this is due to
the way the geometric check is performed: feature matches are based on visual words, which provides only a crude
way of matching features between the two images and will not be one-to-one when more than one feature has the
same visual word.
B.5 Min-Hash
Here we have similar parameters pertaining to the dictionary: the number of words and how the dictionary is generated. In
addition, we have two more parameters: T, the number of hash tables, and H, the number of hash functions. We try 1, 5, 25, and
100 tables with 1, 2, and 3 hash functions. Based on the results in figure 13, which shows these combinations with different
dictionary sizes and generation methods, we chose to proceed with the akm-1M dictionary, using 25 and 100 tables with 1
hash function. Figure 14 shows the results of these combinations on the four scenarios. We notice the following:
[Plots: (a) Quick Tuning: percentage of correct matches vs. time per feature (ms), for 100, 1k, and 10k distractors; curves for exhaust-l2 and lsh-sph-sim-t{1,4}-f{3,5,7}. (b) Results for selected parameters (lsh-sph-sim-t4-f5, lsh-sph-sim-t4-f7): performance before/after geometric checks, search time, and geometric check time (sec/image) vs. number of images, for Scenarios 1-4.]
Figure 8: LSH Spherical Simplex Parameter Tuning
[Plots: (a) Quick Tuning: percentage of correct matches vs. time per feature (ms), for 100, 1k, and 10k distractors; curves for exhaust-l2 and lsh-sph-or-t{1,4}-f{3,5,7}. (b) Results for selected parameters (lsh-sph-or-t4-f5, lsh-sph-or-t4-f7): performance before/after geometric checks, search time, and geometric check time (sec/image) vs. number of images, for Scenarios 1-4.]
Figure 9: LSH Spherical Orthoplex Parameter Tuning
[Plots: (a) Quick Tuning: percentage of correct matches vs. time per feature (ms), for 100, 1k, and 10k distractors; curves for exhaust-l2 and hkm-l{5,6,7}-b10. (b) Results for selected parameters: performance before/after geometric checks, search time, and geometric check time (sec/image) vs. number of images, for Scenarios 1-4 (curves labeled kdt-t{1,4}-c{100,250} in the original plots).]
Figure 10: Hierarchical K-Means Parameter Tuning
[Plots: rows for dictionaries akm-10k, akm-100k, akm-1M, and hkm-1M; columns for performance before geometric checks, search time (sec/image), performance after geometric checks, and geometric check time (sec/image), all vs. number of images; curves for ivf-none-l1-l1, ivf-none-l2-l2, ivf-tfidf-l2-cos, ivf-bin-l1-l1, and ivf-bin-l2-l2.]
Figure 11: Inverted File Parameter Tuning. Rows correspond to dictionaries: akm-10K, akm-100K, akm-1M, and hkm-1M,
where the number corresponds to the number of words. The first column depicts the recognition performance before the geometric
checks, the second column shows the search time through the inverted file, the third column shows the recognition performance after
the geometric step, and the fourth column shows the geometric check time.
[Plots: rows for Scenarios 1-4; columns for performance before geometric checks, search time (sec/image), performance after geometric checks, and geometric check time (sec/image), all vs. number of images; curves for ivf-none-l1-l1, ivf-tfidf-l2-cos, and ivf-bin-l2-l2.]
Figure 12: Inverted File results on the four scenarios for the chosen parameters.
• Min-Hash performance is generally worse than the Inverted File. This is expected, as Min-Hash is only an approximation
that works best for near-identical images, not for images with large distortions [9, 11].
• The geometric consistency checks are crucial for Min-Hash, as the first step alone provides very poor performance.
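For concreteness, the Min-Hash scheme tuned above, with T tables each keyed by a sketch of H min-hash values over an image's set of visual words, can be sketched as follows. We substitute a universal hash (a·w + b) mod p for explicit random permutations of the vocabulary; the class name, the prime p, and the layout are illustrative assumptions, not the paper's implementation.

```python
import random

P = 2_147_483_647  # a prime (2^31 - 1) larger than any visual word id

class MinHashIndex:
    """Min-Hash over sets of visual words: T tables, each keyed by a
    sketch of H min-hash values. Two images collide in a given table
    with probability sim(A,B)^H, where sim is the Jaccard similarity
    of their word sets; using T tables boosts the chance of at least
    one collision."""
    def __init__(self, n_tables=25, n_funcs=1, seed=0):
        rng = random.Random(seed)
        # (a, b) parameters of one universal hash per (table, function).
        self.params = [[(rng.randrange(1, P), rng.randrange(P))
                        for _ in range(n_funcs)] for _ in range(n_tables)]
        self.tables = [dict() for _ in range(n_tables)]

    def sketch(self, words, table):
        """H-tuple of min-hash values of the word set for one table."""
        return tuple(min((a * w + b) % P for w in words)
                     for a, b in self.params[table])

    def insert(self, words, label):
        for t in range(len(self.tables)):
            self.tables[t].setdefault(self.sketch(words, t), []).append(label)

    def query(self, words):
        hits = set()
        for t in range(len(self.tables)):
            hits.update(self.tables[t].get(self.sketch(words, t), []))
        return hits
```

This also illustrates why the method favors near-identical images: a query whose word set overlaps the database image only moderately has a small Jaccard similarity, hence a small collision probability even with H = 1, which matches the poor first-step performance observed above.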
References
[1] Mohamed Aly, Peter Welinder, Mario Munich, and Pietro Perona. Scaling object recognition: Benchmark of current
state of the art techniques. In ICCV Workshop WS-LAVD, 2009.
[2] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimen-
sions. Commun. ACM, 51(1):117–122, 2008.
[3] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A.Y. Wu. An optimal algorithm for approximate nearest
neighbor searching. Journal of the ACM, 45:891–923, 1998.
[4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
[5] Andrei Broder, Moses Charikar, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer
and System Sciences, 60:630–659, 2000.
[6] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the web. Com-
puter Networks and ISDN Systems, 29:8–13, 1997.
[7] A.Z. Broder. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences
1997, pages 21–29, 1997.
[8] O. Chum, M. Perdoch, and J. Matas. Geometric min-hashing: Finding a (thick) needle in a haystack. In CVPR, 2009.
[9] O. Chum, J. Philbin, M. Isard, and A. Zisserman. Scalable near identical image and shot detection. In CIVR, pages
549–556, 2007.
[10] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative
feature model for object retrieval. In ICCV, 2007.
[11] O. Chum, J. Philbin, and A. Zisserman. Near duplicate image detection: min-hash and tf-idf weighting. In British
Machine Vision Conference, 2008.
[12] D. Forsyth and J. Ponce. Computer Vision: A modern approach. Prentice Hall, 2002.
[13] J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders. The Amsterdam Library of Object Images. IJCV, 61:103–112,
2005.
[14] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search.
In ECCV, 2008.
[15] H. Jégou, M. Douze, and C. Schmid. Packing bag-of-features. In ICCV, sep 2009.
[16] Y. Ke, R. Sukthankar, and L. Huston. An efficient parts-based near-duplicate and sub-image retrieval system. In MUL-
TIMEDIA, pages 869–876, 2004.
[17] David Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[18] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 2004.
[19] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. CVPR, 2006.
[20] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial
matching. CVPR, 2007.
[21] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in
large scale image databases. In CVPR, 2008.
[Plots: rows for dictionaries akm-10k, akm-100k, akm-1M, and hkm-1M; columns for performance before geometric checks, search time (sec/image), performance after geometric checks, and geometric check time (sec/image), all vs. number of images; curves for minhash-t{1,5,25,100}-f{1,2,3}.]
Figure 13: Min-Hash Parameter Tuning. Rows correspond to dictionaries: akm-10K, akm-100K, akm-1M, and hkm-1M,
where the number corresponds to the number of words. The first column depicts the recognition performance before the geometric
checks, the second column shows the search time, the third column shows the recognition performance after
the geometric step, and the fourth column shows the geometric check time.
[Plots: rows for Scenarios 1-4; columns for performance before geometric checks, search time (sec/image), performance after geometric checks, and geometric check time (sec/image), all vs. number of images; curves for minhash-t25-f1 and minhash-t100-f1.]
Figure 14: Min-Hash results on the four scenarios for the chosen parameters.
[22] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering object categories in image collections.
In ICCV, 2005.
[23] T. Terasawa and Y. Tanaka. Spherical LSH for approximate nearest neighbor search on unit hypersphere. In Proceedings
of the Workshop on Algorithms and Data Structures, 2007.
[24] Z. Wu, Q. Ke, M. Isard, and J. Sun. Bundling features for large scale partial-duplicate web image search. In CVPR,
2009.
[25] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., (2), 2006.