
A Revisit of Hashing Algorithms for Approximate Nearest Neighbor Search

Deng Cai

Abstract—Approximate Nearest Neighbor Search (ANNS) is a fundamental problem in many areas of machine learning and data mining. During the past decade, numerous hashing algorithms have been proposed to solve this problem. Every proposed algorithm claims to outperform Locality Sensitive Hashing (LSH), which is the most popular hashing method. However, the evaluation of these hashing papers was not thorough enough, and the claim should be re-examined. The ultimate goal of an ANNS method is returning the most accurate answers (nearest neighbors) in the shortest time. If implemented correctly, almost all the hashing methods will have their performance improved as the code length increases. However, many existing hashing papers only report the performance with a code length shorter than 128. In this paper, we carefully revisit the problem of search with a hash index and analyze the pros and cons of two popular hash index search procedures. We then propose a very simple but effective two-level index structure and make a thorough comparison of eleven popular hashing algorithms. Surprisingly, the random-projection-based Locality Sensitive Hashing (LSH) ranked the first, which contradicts the claims in all the other ten hashing papers. Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the codes used in the paper are released on GitHub, which can be used as a testing platform for a fair comparison between various hashing algorithms.

Index Terms—Approximate nearest neighbor search, hashing.


1 INTRODUCTION

Nearest neighbor search plays an important role in many applications of machine learning and data mining. Given a dataset with N entries, the cost of finding the exact nearest neighbor is O(N), which is very time consuming when the data set is large. So people turn to Approximate Nearest Neighbor (ANN) search in practice [1], [22]. Hierarchical structure (tree) based methods, such as KD-tree [4], Randomized KD-tree [36] and K-means tree [8], are popular solutions for the ANN search problem. These methods perform very well when the dimension of the data is relatively low. However, the performance decreases dramatically as the dimension of the data increases [39].

During the past decade, hashing based ANN search methods [10], [41], [25] received considerable attention. These methods generate binary codes for high dimensional data points (real vectors) and try to preserve the similarity among the original real vectors. A hashing algorithm generating l-bits codes can be regarded as containing l hash functions. Each function partitions the original feature space into two parts: the points in one part are coded as 1, and the points in the other part are coded as 0. When an l-bits code is used, the hashing algorithm (l hash functions) partitions the feature space into 2^l parts, which can be called hash buckets. Thus, all the data points fall into different hash buckets (associated with different binary codes). Ideally, if neighbor vectors fall into the same bucket or the nearby buckets (measured by the Hamming distance of binary codes), the hashing based methods can efficiently retrieve the nearest neighbors of a query point.

D. Cai is with the State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, Hangzhou, Zhejiang, China, 310058. Email: [email protected].

One of the most popular hashing algorithms is Locality Sensitive Hashing (LSH) [15], [10]. LSH is a name for a family of hashing algorithms, and one can specifically design different LSH algorithms for different types of data [10]. For real vectors, random-projection-based LSH [5] might be the simplest and most popular one. This algorithm uses random projections to partition the feature space.

LSH is naturally data independent. Many data-dependent hashing algorithms [41], [25], [24], [37], [44], [12], [11], [20], [43], [31], [35], [16], [30], [14], [40], [23], [27], [13], [9], [17], [42], [18], [21], [29] have been proposed during the past decades, and all these algorithms claimed to have superior performance over the baseline algorithm LSH. However, the evaluation in all these papers is not thorough enough, and their conclusions are questionable:

• How to measure the performance of the learned binary code might be the most critical problem in designing a new hashing algorithm. A straightforward answer could be using the binary code as the index to solve the ANNS problem. The ANNS performance can then be used to evaluate the performance of the binary code. However, there are two typical procedures for search with a hash index, the "hamming ranking" approach and the "hash bucket search" approach. Each approach has its pros and cons. For different datasets, different hashing algorithms favour different approaches. It is not easy to fairly compare different hashing algorithms.

• Most of the existing hashing papers [41], [25], [24], [37], [44], [12], [43], [31], [16], [30], [40], [27], [13], [9], [17], [42], [18], [21], [29] use the "hamming ranking" approach to examine the performance of various hashing algorithms.


Since there are no publicly available C++ codes, all these papers use Matlab functions and the accuracy - # of located samples curves to compare different hashing algorithms. Thus, different hashing algorithms should be compared with the same code length, and it is impossible to compare the hashing algorithms with various non-hashing-based ANNS methods.

• Almost all the hashing papers report that the performance increases as the code length increases. However, most of the hashing papers [41], [25], [24], [37], [44], [12], [43], [31], [16], [30], [40], [27], [13], [9], [17], [42], [18], [21], [29] only reported the performance with code lengths shorter than 128. What happens if we further increase the code length?

• Another possible approach to search with a hash index is the so-called "hash bucket search" approach. Different from the "hamming ranking" approach, "hash bucket search" requires sophisticated data structures. If the hash bucket search approach is used, the # of located samples is no longer a good indicator of the search time, and the accuracy - # of located samples curves become meaningless.

• Besides the "hamming ranking" and the "hash bucket search" approaches, can we design a better hash index search procedure so that different hashing algorithms can be fairly and easily compared?

These problems have already been raised in the literature. Joly and Buisson [20] pointed out that many data-dependent hashing algorithms have improvements over LSH, but "improvements occur only for relatively small hash code sizes up to 64 or 128 bits". Figure 4 in [20] shows that when 1,024-bits codes were used, LSH is the best-performing hashing algorithm. However, Joly and Buisson [20] failed to provide a detailed analysis.

There is also a publicly available C++ library named FALCONN on GitHub which implements the LSH algorithm. Since we can record the search time of the program, FALCONN (LSH) can be fairly compared with other non-hashing based ANNS methods. However, the results are not very encouraging1. There are even researchers pointing out that "LSH is so hopelessly slow and/or inaccurate"2. Is it true that "LSH is so hopelessly slow and/or inaccurate"?

All these problems motivate us to revisit the problem of applying hashing algorithms to ANNS. This study is carried out on two popular million-scale ANNS datasets (SIFT1M and GIST1M)3, and the findings are very surprising:

• Eleven popular hashing algorithms are thoroughly compared on these two datasets. Despite the fact that all the other ten data-dependent hashing algorithms claimed superiority over LSH, which is data independent, LSH ranked first among all eleven compared algorithms on SIFT1M and second on GIST1M.

• We also compare our implementation of random-projection-based LSH (RPLSH) with five other popular open-source ANNS algorithms (FALCONN, flann, annoy, faiss and KGraph). On GIST1M, RPLSH performs best among the six compared methods. On SIFT1M, RPLSH performs the second best, and if a high recall is required (higher than 99%), RPLSH performs the best.

1. http://www.itu.dk/people/pagh/SSS/ann-benchmarks/
2. http://www.kgraph.org/
3. http://corpus-texmex.irisa.fr/

Despite the extreme simplicity of random-projection-based LSH, our results show that the capability of this algorithm has been far underestimated. For the sake of reproducibility, all the codes are released on GitHub4, which can be used as a testing platform for a fair comparison between various hashing algorithms.

It is worthwhile to highlight the contributions of this paper:

• The goal of this paper is not introducing yet another hashing algorithm. We aim at providing analysis on how to correctly measure the performance of a hashing algorithm. Given the fact that almost all the existing hashing papers [41], [25], [24], [37], [44], [12], [43], [31], [16], [30], [40], [27], [13], [9], [17], [42], [18], [21], [29] failed to address this problem and gave misleading conclusions, we believe such analysis is important and useful.

• We introduce a simple yet novel two-level index scheme to search with a hash index. This approach significantly outperforms the traditional hamming ranking and hash bucket search approaches. With this novel search approach, it becomes easy to compare various hashing algorithms fairly.

• We release an open source ANNS library RPLSH which implements this two-level index scheme with random-projection-based LSH. The comparison with five other popular open-source ANNS algorithms (FALCONN, flann, annoy, faiss and KGraph) demonstrates the superiority of RPLSH. It performs the best on GIST1M and the second best on SIFT1M.

• Recently, graph-based algorithms [3], [33], [26], [7] have shown very promising performance on the ANNS problem. Besides KGraph, we do not compare with other more advanced graph-based algorithms. We agree that the performance of RPLSH is inferior to that of some graph-based algorithms (e.g., HNSW [33], NSG [7]) on CPU. However, RPLSH has its own advantages: a very small index size, very efficient indexing and a very simple search procedure. In some cases where graph-based algorithms cannot be used, RPLSH is a good choice. For example, RPLSH is very suitable for GPUs; our study suggests the possibility of a GPU-based ANNS algorithm which would outperform faiss (currently the fastest GPU-based ANNS algorithm).

2 SEARCH WITH A HASH INDEX

A hashing algorithm transforms the real vectors to binary codes. The binary codes can then be used as a hash index for online search. The general procedure of searching with a hash index is as follows (suppose the user submits a query q and asks for K neighbors of the query):

1) The search system encodes the query to a binary code b using the hash functions.

2) The search system finds L points which are closest to b in terms of the Hamming distance.

3) The search system performs a scan within the L points and returns the top K points which are closest to q.

4. https://github.com/ZJULearning/hashingSearch


Algorithm 1 Search with a hash index
Require: base set D, query vector q, the hash functions, hash index (binary codes for all the points in the base set), the pool size L and the number K of required nearest neighbors.
Ensure: K points from the base set which are close to q.
1: Encode the query q to a binary code b;
2: Locate the L points which have the shortest hamming distance to b;
3: return the closest K points to q among these L points.

TABLE 1
Data sets

data set    dimension    base number    query number
SIFT1M      128          1,000,000      10,000
GIST1M      960          1,000,000      1,000

This procedure is summarized in Algorithm 1, and there is one parameter L which can be used to control the accuracy-efficiency trade-off. Based on this search procedure, we can see that the time of searching with a hash index consists of three parts [19]:

1) Coding time Tc: line 1 in Algorithm 1.
2) Locating time Tl: line 2 in Algorithm 1.
3) Scanning time Ts: line 3 in Algorithm 1.

The total search time is T = Tc + Tl + Ts. For many popular hashing algorithms, Tc can be neglected compared to Tl and Ts; please see Table 5 for details. Ts is not related to the hashing algorithm and only depends on the parameter L. Thus, correctly measuring Tl becomes the key to evaluating different hashing algorithms. There are two popular procedures (the hamming ranking approach and the hash bucket search approach) to complete the task in line 2. And these two approaches need significantly different Tl for different hashing algorithms on different datasets.

In the remaining part of this Section, we will analyze the pros and cons of the two search procedures using the hash index. We will also experimentally verify our analysis. The experiments are performed on two popular ANNS datasets, SIFT1M and GIST1M. The basic statistics of these two datasets are shown in Table 1.

We use the random-projection-based LSH [5] for its simplicity. In the remaining part of this paper, the LSH algorithm we mention is this random-projection-based LSH.

Given a dataset with dimensionality m, one can generate an m × l random matrix from the Gaussian distribution N(0, 1), where l is the required code length. Then the data are projected onto an l dimensional space using this random matrix. The l dimensional real representation of each point can then be converted to an l-bits binary code by a simple binarize operation5. The formal algorithm is illustrated in Algorithm 2.

Given a query point, the search algorithm is expected to return K points. Then we need to examine how many points in this returned set are among the real K nearest neighbors of the query. Suppose the returned set of K points given a query is R and the real K nearest neighbors set of the query is R′; the precision and recall can be defined [32] as

precision(R) = |R′ ∩ R| / |R|,   recall(R) = |R′ ∩ R| / |R′|.   (1)

5. The i-th bit is 1 if the i-th feature is greater than or equal to 0; the i-th bit is 0 if the i-th feature is smaller than 0.

Algorithm 2 Random projection based LSH
Require: base set D with n m-dimensional points, code length l
Ensure: m × l projection matrix A and n l-bits binary codes.
1: Generate an m × l random matrix A from the Gaussian distribution N(0, 1);
2: Project the data onto the l dimensional space using this random matrix;
3: Convert the n × l dimensional data matrix to n l-bits binary codes by the binarize operation.
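For concreteness, a minimal C++ sketch of Algorithm 2 is given below. It is only an illustration under our own assumptions (row-major float data, codes packed into 64-bit words, illustrative function name); it is not the released implementation.

#include <cstdint>
#include <random>
#include <vector>

// Minimal sketch of Algorithm 2 (random-projection-based LSH).
// Assumes row-major data: n points, each with m features.
// Returns n codes, each stored as ceil(l/64) 64-bit words.
std::vector<std::vector<uint64_t>> rplsh_encode(
    const std::vector<std::vector<float>>& data, int l, unsigned seed = 0) {
  const int m = data.empty() ? 0 : static_cast<int>(data[0].size());
  std::mt19937 gen(seed);
  std::normal_distribution<float> gauss(0.0f, 1.0f);

  // Step 1: m x l random projection matrix A with entries drawn from N(0, 1).
  std::vector<std::vector<float>> A(m, std::vector<float>(l));
  for (auto& row : A)
    for (auto& a : row) a = gauss(gen);

  // Steps 2-3: project each point and binarize (bit = 1 iff projection >= 0).
  const int words = (l + 63) / 64;
  std::vector<std::vector<uint64_t>> codes(data.size(),
                                           std::vector<uint64_t>(words, 0));
  for (size_t i = 0; i < data.size(); ++i) {
    for (int j = 0; j < l; ++j) {
      float proj = 0.0f;
      for (int d = 0; d < m; ++d) proj += data[i][d] * A[d][j];
      if (proj >= 0.0f) codes[i][j / 64] |= (uint64_t(1) << (j % 64));
    }
  }
  return codes;
}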

Fig. 1. The recall - # of located samples curves of LSH on two datasets ((a) SIFT1M, (b) GIST1M). We can see that longer codes mean better performance.

Since the sizes of R′ and R are the same, the recall and the precision of R are actually the same. We fixed K = 100 throughout our experiments.
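As a small illustration of Equation (1), the following C++ sketch computes the recall of a returned id list against the ground-truth id list; the container choices and function name are our own illustrative assumptions, not taken from the released code.

#include <unordered_set>
#include <vector>

// Recall of a returned set R against the true K nearest neighbors R'.
// Since |R| = |R'| = K in our experiments, precision and recall coincide.
double recall(const std::vector<int>& returned,
              const std::vector<int>& ground_truth) {
  std::unordered_set<int> truth(ground_truth.begin(), ground_truth.end());
  size_t hits = 0;
  for (int id : returned) hits += truth.count(id);
  return static_cast<double>(hits) / ground_truth.size();
}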

2.1 The Case Where the Locating Time Is Ignored

Suppose we have an ideal approach which can locate the L binary vectors closest (measured by the hamming distance) to the query at no cost, i.e., the locating time Tl can be ignored. In this case, the # of located samples, L, becomes a good indicator of the search time. Thus, the accuracy - # of located samples curve can be used to compare the performance of different hashing algorithms.

Figure 1 plots the recall - # of located samples curves of LSH on two datasets. The code length grows from 16 to 32 on SIFT1M and from 24 to 40 on GIST1M. We can easily draw the conclusion that longer codes mean better performance, which is a common conclusion in almost all the previous hashing papers [41], [25], [24], [37], [44], [12], [43], [31], [16], [30], [40], [27], [13], [9], [17], [42], [18], [21], [29].

This result is very natural and reasonable. Each bit of the hash code is a partition of the feature space, and the same code (0 or 1) on this bit means that two points are on the same side of this partition. If two points share the same code on t bits, these two points are on the same sides of t partitions. Neighbors in the hamming space with longer codes are therefore more likely to be real neighbors in the original feature space. However, this is only valid when the locating time is ignored.

2.2 The Hamming Ranking Approach

The hamming ranking approach is very simple and widely used in many hashing papers [41], [25], [24], [37], [44], [12], [43], [31], [16], [30], [40], [27], [13], [9], [17], [42], [18], [21], [29]. The procedure of hamming ranking can be found in Algorithm 3.

It is easy to see that the computational complexity of this approach is O(N), where N is the number of data points. Thus, the reason that a hash index (with the hamming ranking approach) can speed up the search


Algorithm 3 Hamming ranking approach for locating the L closest binary codes
1: Compute the hamming distances between b and all the binary codes in the data set;
2: Sort the points based on the hamming distances in ascending order and keep the first L points.
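A minimal C++ sketch of Algorithm 3 is shown below. It assumes codes packed into 64-bit words and uses std::popcount from C++20 for the per-word Hamming distance; the function name and data layout are illustrative assumptions, not the released implementation.

#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Sketch of Algorithm 3: rank all database codes by Hamming distance to the
// query code and keep the L closest. Codes are stored as blocks of 64 bits.
std::vector<int> hamming_ranking(const std::vector<std::vector<uint64_t>>& codes,
                                 const std::vector<uint64_t>& query, size_t L) {
  std::vector<std::pair<int, int>> dist_id(codes.size());  // (distance, id)
  for (size_t i = 0; i < codes.size(); ++i) {
    int d = 0;
    for (size_t w = 0; w < query.size(); ++w)
      d += std::popcount(codes[i][w] ^ query[w]);  // fast per-word Hamming distance
    dist_id[i] = {d, static_cast<int>(i)};
  }
  L = std::min(L, dist_id.size());
  std::partial_sort(dist_id.begin(), dist_id.begin() + L, dist_id.end());
  std::vector<int> result(L);
  for (size_t i = 0; i < L; ++i) result[i] = dist_id[i].second;
  return result;
}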

Fig. 2. The locating time complexity of the hamming ranking approach. (a) The locating time vs. the code length (L = 50,000). (b) The locating time vs. the parameter L. The locating time grows linearly with respect to both the code length and the parameter L.

is that the computation of the hamming distance between two binary codes can be extremely fast [38].

The locating time of the hamming ranking approach (the time of Algorithm 3) is only determined by the code length and the parameter L. In other words, if two different hashing algorithms use the same code length and the same L, they will have the same locating time. Thus, we can compare different hashing algorithms with the same code length simply using the accuracy - # of located samples curves (ignoring the locating time). This strategy is used in most of the existing hashing papers [41], [25], [24], [37], [44], [12], [43], [31], [16], [30], [40], [27], [13], [9], [17], [42], [18], [21], [29]. Moreover, the locating time grows linearly with respect to the code length and the parameter L (as shown in Figure 2).

Figure 3 shows the recall - search time curves of LSH using the hamming ranking approach on two datasets. This result is very similar to the result in Figure 1. This is because the locating time grows slowly as the code length increases, and the code length differences between the different curves in Figure 3 are also small. As the code length difference becomes larger, we can see the impact of the locating time, as shown in Figure 4.

One advantage of the hamming ranking approach is that the memory usage of the index is extremely small. From Algorithm 3, we can see that no additional data structure is needed and we only need to store the binary codes of the data points in the database. Suppose the database contains n data points and we use l-bits codes; then we need n·l/8 bytes (e.g., about 128 MB for 1,000,000 points with 1,024-bits codes). This memory usage may be the smallest one among most of the popular ANNS methods even when l = 1024.

Fig. 3. The recall - search time curves of LSH using the hamming ranking approach on two datasets ((a) SIFT1M, (b) GIST1M).

Fig. 4. The ANN search results of LSH with significantly different code lengths (128, 256, 512, 1024) on two datasets (SIFT1M and GIST1M). The y-axis denotes the average recall while the x-axis denotes (a, b) the # of located samples and (c, d) the search time using the hamming ranking approach.

2.3 The Hash Bucket Search Approach

The hash bucket search is another popular method to locate the nearest binary codes in the hamming space. The corresponding procedure can be found in Algorithm 4.

Figure 5 shows the recall - search time curves of LSH using the hash bucket search approach on two datasets. These curves are significantly different from the curves in Figure 1 and Figure 3. There is an optimal code length for the hash bucket search approach (24 on SIFT1M and 28 on GIST1M). As the code length further increases, the search performance decreases significantly.

Given a binary code b, locating the hash bucket corresponding to b costs O(1) time (by using std::vector). If there are enough points (more than L) in this bucket, the total time cost of the hash bucket search is O(1), which seems much more efficient than the hamming ranking approach. However, this is only true when the binary code is short. As the code length increases, the number of hash buckets increases exponentially, and there are often not enough points (or even no points) in the bucket corresponding to the query. To locate L points, we need to increase the hamming radius r. Given an l-bits code b, consider those hash buckets whose Hamming distance to b is smaller than or equal to r. It is easy to show that the number of such buckets is ∑_{i=0}^{r} C(l, i), which increases almost exponentially with respect to r and l. As L increases, to locate L points successfully, r has to be increased.


TABLE 2
Hamming radius vs. # of queries that successfully located 50,000 points on SIFT1M (total # of queries: 10,000)

hamming radius                     0    1      2      3      4       5        6        7          8           9           10
# of buckets (16 bits)             1    16     120    560    1,820   4,368    8,008    11,440     12,870      11,440      8,008
# of queries (16 bits) (1.752s)    0    3637   4932   1230   169     30       1        1          0           0           0
# of buckets (32 bits)             1    32     496    4960   35,960  201,376  906,192  3,365,856  10,518,300  28,048,800  64,512,240
# of queries (32 bits) (237.8s)    0    0      0      4      1710    4282     2691     1021       241         44          7

TABLE 3
Hamming radius vs. # of queries that successfully located 50,000 points on GIST1M (total # of queries: 1,000)

hamming radius                     0    1     2      3      4       5        6          7           8           9            10
# of buckets (24 bits)             1    24    276    2,024  10,626  42,504   134,596    346,104     735,471     1,307,504    1,961,256
# of queries (24 bits) (0.234s)    70   467   277    115    44      18       7          2           0           0            0
# of buckets (40 bits)             1    40    780    9,880  91,390  658,008  3,838,380  18,643,560  76,904,685  273,438,880  847,660,528
# of queries (40 bits) (240.0s)    0    0     27     252    302     205      107        57          27          14           9

Algorithm 4 Hash bucket search approach for locating the L closest binary codes
1: Let the hamming radius r = 0;
2: Let the candidate pool set C = ∅;
3: while |C| < L do
4:   Get the list B of buckets with hamming distance r to b;
5:   for all buckets bb in B do
6:     Put all the points in bb into C;
7:     if |C| >= L then
8:       Keep the first L points in C;
9:       return;
10:    end if
11:  end for
12:  r = r + 1;
13: end while
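The following C++ sketch illustrates Algorithm 4 for short codes (at most 32 bits here). The bucket index is assumed to be a hash map from code to point ids; this is our own illustrative layout and naming, not the released implementation.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Assumed index layout: bucket code -> ids of the points stored in that bucket.
using Buckets = std::unordered_map<uint32_t, std::vector<int>>;

// Enumerate all codes that differ from `code` in exactly `r` of the first `l`
// bits (bits are flipped starting from position `start`), collecting candidates.
void probe(const Buckets& buckets, uint32_t code, int l, int r, int start,
           size_t L, std::vector<int>& pool) {
  if (pool.size() >= L) return;                // early stop, as in Algorithm 4
  if (r == 0) {
    auto it = buckets.find(code);
    if (it != buckets.end())
      pool.insert(pool.end(), it->second.begin(), it->second.end());
    return;
  }
  for (int i = start; i <= l - r; ++i)
    probe(buckets, code ^ (uint32_t(1) << i), l, r - 1, i + 1, L, pool);
}

// Grow the Hamming radius until at least L candidates have been collected.
std::vector<int> hash_bucket_search(const Buckets& buckets, uint32_t query_code,
                                    int l, size_t L) {
  std::vector<int> pool;
  for (int r = 0; r <= l && pool.size() < L; ++r)
    probe(buckets, query_code, l, r, 0, L, pool);
  if (pool.size() > L) pool.resize(L);
  return pool;
}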

Fig. 5. The recall - search time curves of LSH using the hash bucket search approach on two datasets ((a) SIFT1M, (b) GIST1M).

Overall, the locating time of the hash bucket search approach grows exponentially with respect to the code length l and super-linearly with respect to the parameter L (as shown in Figure 6).

Tables 2 and 3 record the number of queries (10,000 in total on SIFT1M and 1,000 on GIST1M) that successfully located 50,000 samples within different hamming radii for different code lengths with the hash bucket search approach. As the code length increases, the number of hash buckets corresponding to the same hamming radius increases almost exponentially. Moreover, as the code length increases, the base points become more diversified. Thus we need to increase the hamming radius to locate a sufficient number of samples. These two reasons cause the locating time to grow exponentially with the code length for the hash bucket search approach.

Fig. 6. The locating time complexity of the hash bucket search approach. (a) The locating time vs. the code length (L = 50,000). (b) The locating time vs. the parameter L. The locating time grows exponentially with respect to the code length and super-linearly with respect to the parameter L.

As a result, the hash bucket search approach has no practical value if the code length is longer than 64.

However, Section 2.1 shows that long codes generate better search candidates. A natural question is: can we design a method that lets the hash bucket search approach take advantage of long codes?

The answer is yes. A straightforward way is using multiple hash tables to represent long codes. Suppose we have 128-bits codes and we use a single table with 128 bits. To locate L points, if the points in all the buckets within Hamming distance r are not enough, we have to increase the hamming radius r. The locating time will then grow significantly, and the extremely long locating time will make the hash bucket search approach meaningless. If we instead use four tables, each table is 32 bits. Instead of growing r, we can scan the buckets within hamming radius r in all the tables, which gives us a much better chance of locating enough data points.
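The multiple-table idea can be sketched as follows in C++: a 128-bit code is split into four 32-bit keys, each key indexed by its own sub-table, and every sub-table is probed at a small radius (0 and 1 in this illustration) instead of growing the radius on one long code. The data layout and function names are our own illustrative assumptions, not the released implementation.

#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Each sub-table maps a 32-bit sub-code to the ids stored in that bucket.
using SubTable = std::unordered_map<uint32_t, std::vector<int>>;

// Probe one sub-table at Hamming radius 0 and 1 around `key`.
void probe_r1(const SubTable& table, uint32_t key, std::vector<int>& pool) {
  auto add = [&](uint32_t k) {
    auto it = table.find(k);
    if (it != table.end())
      pool.insert(pool.end(), it->second.begin(), it->second.end());
  };
  add(key);                                                      // radius 0
  for (int i = 0; i < 32; ++i) add(key ^ (uint32_t(1) << i));    // radius 1
}

// Collect candidates from all four sub-tables for a 128-bit query code.
// A real implementation would also deduplicate the collected ids.
std::vector<int> multi_table_search(const std::array<SubTable, 4>& tables,
                                    const std::array<uint32_t, 4>& query_keys,
                                    size_t L) {
  std::vector<int> pool;
  for (int t = 0; t < 4 && pool.size() < L; ++t)
    probe_r1(tables[t], query_keys[t], pool);
  if (pool.size() > L) pool.resize(L);
  return pool;
}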

Figure 7 shows the recall-time curves of the hash bucket search approach on LSH codes with different numbers of tables (each table is 32 bits).


Fig. 7. The recall-time curves of the hash bucket search approach with different numbers of tables on SIFT1M and GIST1M ((a) SIFT1M, (b) GIST1M).

The results are very consistent: the hash bucket search approach gains an advantage from using multiple tables on both datasets. This multiple-tables trick is more effective on SIFT1M than on GIST1M.

With 1,024-bits codes, we can use ⌈1024/w⌉ tables, where each table is w ≤ 1024 bits. If we use the recall - time curve, the performance of the hash bucket search approach will first increase and then dramatically decrease as w increases. There will be an optimal w for a given hashing algorithm and dataset.

Figure 8 shows the performance of LSH on two datasets with both the hamming ranking and hash bucket search approaches. We use 128, 256, 512 and 1,024 bits for the hamming ranking approach. For the hash bucket search approach, we use ⌈1024/w⌉ tables to represent 1,024-bits codes, each table being w bits, where the table width w is 16, 20, 30(32), 36 and 40.

Generally speaking, the hash bucket search approach has the advantage on SIFT1M while the hamming ranking approach is preferred on GIST1M. For the hash bucket search approach, the optimal w is 34 on SIFT1M and 28 on GIST1M. For the hamming ranking approach, longer codes are preferred when we aim at a high recall, while short codes have the advantage when we aim at a low recall. It is important to note that these optimal parameters (the code length for hamming ranking and the table number for hash bucket search) will change if we use another hashing algorithm (see Figures 10 and 11). A fair comparison between various hashing algorithms seems impossible.

3 GROUPED HAMMING RANKING APPROACH

The commonly used approaches (the hamming ranking and the hash bucket search) for search with a hash index have their own pros and cons. Different hashing algorithms prefer different approaches on different datasets (see Figures 10 and 11). It is not an easy task to fairly compare different hashing algorithms simply based on these two approaches. In this section, we aim at developing a better way to search with a hash index.

The proposed approach is based on the hamming ranking approach. Recall that the majority of the locating time of the hamming ranking approach is spent on computing the Hamming distances between the query code and all the database codes. If we can find an efficient way to filter out those points which are unlikely to be the answers, we can reduce this computation.

Following the idea in [16], we use kmeans to group the samples into k clusters at the indexing stage, and we represent each cluster with its centroid. At the on-line query stage, instead of computing the Hamming distances of the query code and the entire database codes, we only focus on the nearest c clusters (measured by the distance between the query vector and the cluster centroids).

Fig. 8. The recall - time curves of LSH with the hamming ranking and the hash bucket search approaches on SIFT1M and GIST1M ((a) SIFT1M, (b) GIST1M). The red curves are the hamming ranking approach and the blue curves are the hash bucket search approach. The hamming ranking approach uses 128, 256, 512 and 1,024 bits, and the hash bucket search approach uses ⌈1024/w⌉ tables to represent 1,024-bits codes, each table being w bits, where the table width w is 16, 20, 30(32), 36 and 40.

Algorithm 5 Grouped hamming ranking
Require: base set D, query vector q, the hash functions, hash index (binary codes for all the points in the base set), kmeans partition of the base set, the number of nearest clusters C, the pool size L and the number K of required nearest neighbors.
Ensure: K points from the base set which are close to q.
1: Encode the query q to a binary code b;
2: Find the C nearest clusters to q;
3: Compute the hamming distances between b and all the binary codes in these C clusters;
4: Find the first L points with the shortest hamming distance to b;
5: return the closest K points to q among these L points.

We name this approach Grouped Hamming Ranking.

To use grouped hamming ranking, besides the binary codes, we need a kmeans partition of the database. The search procedure is summarized in Algorithm 5. There are two parameters, C and L, which can be used to control the accuracy-efficiency trade-off.
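A minimal C++ sketch of Algorithm 5 is given below; the kmeans centroids, the cluster membership lists and the packed binary codes are assumed inputs, and std::popcount (C++20) is used for the Hamming distance. This is only an illustration of the procedure under our own naming and data layout, not the released implementation.

#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Sketch of Algorithm 5 (grouped hamming ranking). `centroids` holds the
// kmeans centers, `clusters[c]` lists the ids assigned to center c, and
// `codes[id]` is the packed binary code of point id.
std::vector<int> grouped_hamming_ranking(
    const std::vector<std::vector<float>>& centroids,
    const std::vector<std::vector<int>>& clusters,
    const std::vector<std::vector<uint64_t>>& codes,
    const std::vector<float>& query, const std::vector<uint64_t>& query_code,
    size_t C, size_t L) {
  // Step 2: rank clusters by Euclidean distance between query and centroid.
  std::vector<std::pair<float, int>> cdist(centroids.size());
  for (size_t c = 0; c < centroids.size(); ++c) {
    float d = 0.0f;
    for (size_t j = 0; j < query.size(); ++j) {
      float diff = centroids[c][j] - query[j];
      d += diff * diff;
    }
    cdist[c] = {d, static_cast<int>(c)};
  }
  C = std::min(C, cdist.size());
  std::partial_sort(cdist.begin(), cdist.begin() + C, cdist.end());

  // Steps 3-4: hamming-rank only the points in the C nearest clusters.
  std::vector<std::pair<int, int>> dist_id;  // (hamming distance, point id)
  for (size_t c = 0; c < C; ++c) {
    for (int id : clusters[cdist[c].second]) {
      int d = 0;
      for (size_t w = 0; w < query_code.size(); ++w)
        d += std::popcount(codes[id][w] ^ query_code[w]);
      dist_id.push_back({d, id});
    }
  }
  L = std::min(L, dist_id.size());
  std::partial_sort(dist_id.begin(), dist_id.begin() + L, dist_id.end());

  // Step 5 (omitted here): re-rank these L candidates by their true distances
  // to q and return the top K.
  std::vector<int> result(L);
  for (size_t i = 0; i < L; ++i) result[i] = dist_id[i].second;
  return result;
}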


Fig. 9. The recall-time curves of LSH with the grouped hamming ranking, the hamming ranking and the hash bucket search approaches on SIFT1M and GIST1M ((a) SIFT1M, (b) GIST1M). The red and blue curves are the same as the curves in Figure 8. The black curve is the grouped hamming ranking approach proposed in this paper. The grouped hamming ranking approach is significantly better than the other two approaches on both datasets.

Fig. 10. The comparison of the three search-the-hash-index approaches with 11 hashing algorithms (one panel per algorithm: LSH, SH, BRE, USPLH, ITQ, SpH, IsoH, DSH, KLSH, AGH, CH) on SIFT1M. The red curves are hamming ranking approaches, with the numbers in the legend indicating the code length. The blue curves are hash bucket search approaches, with the numbers in the legend indicating the table width; if w is the table width, the corresponding number of tables is ⌈1024/w⌉. The black curve is the grouped hamming ranking approach proposed in this paper. No matter which hashing algorithm is used, the grouped hamming ranking approach is significantly better than the other two approaches.

Figure 9 shows the performance of LSH on the two datasets with the grouped hamming ranking approach (we generated 1,000 clusters with kmeans and 1,024-bits codes were used). The curve of the grouped hamming ranking approach is added to Figure 8, and we can see the significant advantage of the grouped hamming ranking approach over the other two commonly used approaches. Figures 10 and 11 show the performance of the three search-the-hash-index approaches with 11 hashing algorithms on the two datasets. The grouped hamming ranking approach outperforms the other two approaches for all the hashing algorithms on both datasets.

4 A COMPREHENSIVE COMPARISON

With the proposed grouped hamming ranking approach, we are ready to conduct a fair comparison between different hashing algorithms. In this Section, we will conduct a comprehensive comparison between eleven popular hashing algorithms on SIFT1M and GIST1M.

4.1 Compared Algorithms

Eleven popular hashing algorithms compared in the experimentsare listed as follows:


Fig. 11. The comparison of the three search-the-hash-index approaches with 11 hashing algorithms (one panel per algorithm: LSH, SH, BRE, USPLH, ITQ, SpH, IsoH, DSH, KLSH, AGH, CH) on GIST1M. The red curves are hamming ranking approaches, with the numbers in the legend indicating the code length. The blue curves are hash bucket search approaches, with the numbers in the legend indicating the table width; if w is the table width, the corresponding number of tables is ⌈1024/w⌉. The black curve is the grouped hamming ranking approach proposed in this paper. No matter which hashing algorithm is used, the grouped hamming ranking approach is significantly better than the other two approaches.

• LSH is a short name for random-projection-based Locality Sensitive Hashing [5] as in Algorithm 2. It is frequently used as a baseline method in various hashing papers.

• SH is a short name for Spectral Hashing [41]. SH is based on quantizing the values of analytical eigenfunctions computed along PCA directions of the data.

• KLSH is a short name for Kernelized Locality Sensitive Hashing [25]. KLSH generalizes the LSH method to the kernel space.

• BRE is a short name for Binary Reconstructive Embeddings [24].

• USPLH is a short name for Unsupervised Sequential Projection Learning Hashing [37].

• ITQ is a short name for ITerative Quantization [11]. ITQ finds a rotation of zero-centered data so as to minimize the quantization error of mapping this data to the vertices of a zero-centered binary hypercube.

• AGH is a short name for Anchor Graph Hashing [31]. It aims at performing spectral analysis [2] of the data, which shares the same goal as Self-taught Hashing [44]. The advantage of AGH over Self-taught Hashing is its computational efficiency. AGH uses an anchor graph [28] to speed up the spectral analysis.

• SpH is a short name for Spherical Hashing [14]. SpH uses a hypersphere-based hash function to map data points into binary codes.

• IsoH is a short name for Isotropic Hashing [23]. IsoH learns the projection functions with isotropic variances for PCA projected data. The main motivation of IsoH is that PCA directions with different variances should not be equally treated (one bit for one direction).

TABLE 4
Indexing Time (s)

Method   SIFT1M   GIST1M    Method   SIFT1M   GIST1M
LSH      3.35     12.33     SH       83.44    3864.3
BRE      680.3    369.49    IsoH     1.68     91.99
USPLH    1099.2   10508.3   KLSH     94.09    194.89
ITQ      88.3     1372.4    AGH      296.84   389.91
SpH      2583.5   4041.1    CH       263.78   365.34
DSH      51.92    122.84

ITQ, SH and IsoH learn 128-bits codes on SIFT1M and 960-bits codes on GIST1M.

TABLE 5
Coding Time (s)

Method   SIFT1M   GIST1M    Method   SIFT1M   GIST1M
LSH      0.03     0.016     SH       1.21     6.82
BRE      0.25     0.043     IsoH     0.007    0.034
USPLH    0.05     0.02      KLSH     0.34     0.06
ITQ      0.01     0.025     AGH      0.86     0.11
SpH      0.11     0.027     CH       0.96     0.10
DSH      0.04     0.02

ITQ, SH and IsoH use 128-bits codes on SIFT1M and 960-bits codes on GIST1M.

• CH is a short name for Compressed Hashing [27]. CH first learns a landmark (anchor) based sparse representation [6] and then applies LSH.

• DSH is a short name for Density Sensitive Hashing [18]. DSH finds projections that are aware of the density distribution of the data.

For SH, ITQ, and IsoH, the learning process involves the computation of the eigenvectors of the covariance matrix.


Fig. 12. The recall - code length curves of various hashing algorithms when locating 10,000 samples on SIFT1M.

Fig. 13. The recall - code length curves of various hashing algorithms when locating 50,000 samples on GIST1M.

Thus, these three algorithms can only learn 128-bits codes on SIFT1M and 960-bits codes on GIST1M. All the other eight algorithms learn 1,024-bits codes on both datasets.

For KLSH, AGH and CH, we need to pick a certain number of anchor (landmark) points. We use the same 1,500 anchor points for all these algorithms, and the anchor points are generated by using kmeans with five iterations. Also, the number of nearest anchors is set to 50 for the AGH and CH algorithms. Please see [31] for details.

All the hashing algorithms are implemented in Matlab6, and we use these hashing functions to learn the binary codes for both the base vectors and the query vectors. The binary codes can then be fed into Algorithm 5 for search. We use the same 1,000 clusters generated by kmeans for all the hashing algorithms. The Matlab functions are run on an i7-5930K CPU with 128G memory, and Algorithm 5 is implemented in C++ and run on an i7-4790K CPU with 32G memory.

6. Almost all the core parts of the Matlab code of the hashing algorithms are written by the original authors of the papers.

For the sake of reproducibility, both the Matlab codes and the C++ codes are released on GitHub7.

Besides all the hashing algorithms, we also report the performance of the brute-force search as a baseline.

• brute-force. The performance of brute-force search is reported to show the advantages of using hashing methods. To get different recalls, we simply perform brute-force search on different percentages of the query number. For example, the brute-force search time on 90% of the queries of the original query set stands for the brute-force search time at 90% recall.

4.2 Results

Since we use Matlab functions to learn the binary codes and a C++ program to search with a hash index, the coding time and the later time (locating time and scanning time) cannot be added together, and we report them separately.

7. https://github.com/ZJULearning/hashingSearch


Fig. 14. The recall-time curves of various hashing algorithms using the grouped hamming ranking approach on SIFT1M (100NN).

Fig. 15. The recall-time curves of various hashing algorithms using the grouped hamming ranking approach on GIST1M (100NN).

Tables 4 and 5 report the indexing (hash learning) time and the coding (testing) time of various hashing algorithms. The indexing (hash learning) stage is performed off-line; thus the indexing time is not very crucial. The coding time of all the hashing algorithms is short. Therefore the coding time can be ignored compared with the locating time and scanning time (see Figures 14 and 15) when we consider the total search time.

Figures 12 and 13 show how the recall changes as the code length increases for various hashing algorithms when the number of located samples is fixed at 10,000 on SIFT1M and 50,000 on GIST1M. These figures illustrate several interesting points:

• When the code length is 32, all of the other algorithms outperform LSH on both datasets. When the code length is 64, eight algorithms outperform LSH on SIFT1M, and nine algorithms outperform LSH on GIST1M. However, when the code length exceeds 512, LSH performs the best on SIFT1M and the 3rd best on GIST1M. This result is consistent with the finding in [20] that "many data-dependent hashing algorithms have the improvements over LSH, but improvements occur only for relatively small hash code sizes up to 64 or 128 bits". This result is also consistent with the "outperform LSH" claim in all these hashing papers, since they never report the performance using longer codes.

• If the locating time is ignored, the performance of most of the algorithms increases as the code length increases.


Fig. 16. The recall-time curves of various hashing algorithms using the grouped hamming ranking approach with K = 1 ((a) SIFT1M, (b) GIST1M).

Fig. 17. The recall-time curves of various hashing algorithms using the grouped hamming ranking approach with K = 10 ((a) SIFT1M, (b) GIST1M).

However, on SIFT1M, the recall of USPLH decreases as the code length exceeds 256 and the recall of AGH decreases as the code length exceeds 512. On GIST1M, the recall of USPLH stays almost fixed as the code length increases, and the recalls of AGH, SH and IsoH decrease as the code length exceeds 512. This result suggests that these four hashing algorithms, USPLH, AGH, SH, and IsoH, may not be able to learn very long discriminative codes.

We cannot judge which algorithm is better based merely on Figures 12 and 13, since the locating time is ignored. To pick the best hashing algorithm, we have to use the hashing code as the index and use the time - recall curve to see the ANNS performance.

For all the hashing algorithms, we use Algorithm 5 (the grouped hamming ranking approach) for search. All the algorithms use the same 1,000 clusters generated by kmeans. Based on Figures 12 and 13, we can pick the optimal code length for each hashing algorithm. On SIFT1M, SH, ITQ and IsoH use 128-bits codes, AGH and USPLH use 256-bits codes, and all the other hashing algorithms use 1,024-bits codes. On GIST1M, AGH and IsoH use 256-bits codes, ITQ uses 960-bits codes, USPLH uses 64-bits codes, and all the other hashing algorithms use 1,024-bits codes.

Figures 14 and 15 show the recall-time curves of various hashing algorithms using the grouped hamming ranking approach on SIFT1M and GIST1M respectively. A number of interesting points can be found: On SIFT1M, LSH performs significantly better than all the other hashing algorithms.


Fig. 18. The performance of LSH using the grouped hamming ranking approach with various cluster numbers (500, 800, 1,000, 1,500 and 2,000) on (a) SIFT1M and (b) GIST1M.

Fig. 19. The performance of LSH using the grouped hamming ranking approach with various code lengths on (a) SIFT1M and (b) GIST1M.

On GIST1M, LSH is the second best-performing algorithm; it is slightly worse than SpH. Considering the extreme simplicity of LSH, we can conclude that LSH is the best choice among all the eleven compared hashing algorithms. This finding contradicts the conclusions in most of the previous hashing papers.

In the previous and the remaining experiments, we require all the algorithms to return 100 answers, i.e., we fix K = 100. To show that this setting is reasonable, we plot the recall-time curves of various hashing algorithms with K = 1 and K = 10 in Figures 16 and 17. We can see the same trend as for K = 100. For K = 1, 10 and 100, LSH is always the best-performing algorithm on SIFT1M and the second best-performing algorithm on GIST1M.

4.3 Parameters of The Grouped Hamming Ranking

There are two essential parameters in the proposed grouped hamming ranking approach: the number of groups G and the binary code length l. In this Section, we will discuss how these two parameters affect the performance of the grouped hamming ranking approach.

Figure 18 shows the performance of LSH on SIFT1M and GIST1M with various numbers of clusters (500, 800, 1,000, 1,500 and 2,000) using the grouped hamming ranking. We can see that the search performance is very stable. With 1M points, 1,000 clusters means the average number of points in each cluster is 1,000. This is a reasonable choice.

The binary code length is another important parameter. In the previous experiments, we chose this parameter for the various hashing algorithms based on the recall - code length curves (see Figures 12 and 13), and we set the maximum length to 1,024. It turns out that most of the compared hashing algorithms choose this parameter as 1,024. What happens if we further increase the code length? Figure 19 shows the performance of LSH on SIFT1M and GIST1M with various code lengths. On SIFT1M, LSH with grouped hamming ranking reaches the best search performance when the code length is 1,024. The search performance slowly decreases as the code length further increases. On GIST1M, the optimal code length becomes 2,048.

5 COMPARISON WITH OTHER POPULAR ANNS METHODS

Given the surprising performance of random-projection-based LSH, we want to compare it with some popular publicly available ANNS software.


Fig. 20. The recall-time curves of six popular ANNS methods on SIFT1M and GIST1M ((a) SIFT1M, (b) GIST1M).

To be fair, we convert our Matlab code for random-projection-based LSH to C++ and build a complete search program in C++. We name our algorithm RPLSH. The codes are also released on GitHub8.

5.1 Compared Algorithms

We pick five popular ANNS algorithms for comparison, which cover various types such as tree-based, hashing-based, quantization-based and graph-based approaches. They are:

1) FALCONN9 is a well-known ANNS library based on multi-probe locality sensitive hashing. It implements the hash bucket search with multiple tables approach.

2) flann is a well-known ANNS library based on trees [34]. It integrates the Randomized KD-tree [36] and the Kmeans tree [8]. In our experiments, we use the randomized KD-tree and set the number of trees to 32 on SIFT1M and 64 on GIST1M. The code on GitHub can be found at10.

3) annoy is based on a binary search forest index. We use their code on GitHub for comparison11.

4) faiss was recently released12 by Facebook. It contains well implemented code for state-of-the-art product quantization [16] based methods on both CPU and GPU. The CPU version is used for a fair comparison.

5) KGraph has open source code on GitHub at13; it is based on an approximate kNN graph.

The experiments are carried out on a machine with an i7-4790k CPU and 32G memory. The performance in the high recall region is more crucial for an ANNS algorithm in real applications. Thus we sample one percent of the points from each base set as its corresponding validation set and tune all the algorithms on the validation sets to get their best-performing indices in the high recall region. For RPLSH, we use 1,024-bits codes and 1,000 clusters. For all the search experiments, we only evaluate the algorithms on a single thread.

8. https://github.com/ZJULearning/RPLSH
9. https://github.com/FALCONN-LIB/FALCONN
10. https://github.com/mariusmuja/flann
11. https://github.com/spotify/annoy
12. https://github.com/facebookresearch/faiss
13. https://github.com/aaalgo/kgraph

5.2 Results
The recall-time curves of the six compared ANNS methods are shown in Figure 20. A number of interesting conclusions can be drawn:

• On both datasets, RPLSH performs at least second best among all six compared algorithms when the recall is higher than 87%. When the recall is higher than 99% on SIFT1M and higher than 90% on GIST1M, RPLSH performs the best.

• Both RPLSH and FALCONN are LSH-based algorithms. RPLSH significantly outperforms FALCONN on both datasets. This is probably because RPLSH uses the grouped hamming ranking approach while FALCONN uses the hash bucket search with multiple tables approach.

• flann and annoy are tree-based methods. Comparing the performance of the two hashing-based methods with that of the two tree-based methods, we find that tree-based methods are more suitable for low-dimensional data.

• faiss is a product quantization (PQ) [16] based method; it uses short quantization codes to approximate the original features. Thus it is impossible for faiss to achieve a very high recall. The advantage of this approximation is that the original vectors are not needed, which is a huge saving in memory usage [16].

• KGraph is a graph-based algorithm, and it is KGraph's author who claims that "LSH is so hopelessly slow and/or inaccurate"14. However, RPLSH is significantly better than KGraph on GIST1M when the recall is higher than 90%.

6 CONCLUSIONS

We carefully studied the problem of using hashing algorithms for ANNS. We introduced a novel approach (grouped hamming ranking) to search with a hash index. With this new approach, random-projection-based LSH, despite its extreme simplicity, is superior to many other popular hashing algorithms. All the code used in the paper is released on GitHub, which can be used as a testing platform for a fair comparison between various hashing methods.

14. http://www.kgraph.org/


REFERENCES

[1] S. Arya and D. M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the Fourth Annual ACM/SIGACT-SIAM Symposium on Discrete Algorithms, 1993.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems 14, 2001.
[3] B. Harwood and T. Drummond. FANNG: Fast approximate nearest neighbour graphs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 5713–5722, 2016.
[4] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[5] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, pages 380–388, 2002.
[6] X. Chen and D. Cai. Large scale spectral clustering with landmark-based representation. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[7] C. Fu, C. Xiang, C. Wang, and D. Cai. Fast approximate nearest neighbor search with the navigating spreading-out graphs. PVLDB, 12(5):461–474, 2019.
[8] K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 100(7):750–753, 1975.
[9] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[10] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, 1999.
[11] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[12] J. He, W. Liu, and S. Chang. Scalable similarity search with optimized kernel hashing. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining, 2010.
[13] K. He, F. Wen, and J. Sun. K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[14] J. Heo, Y. Lee, J. He, S. Chang, and S. Yoon. Spherical hashing. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[15] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing, 1998.
[16] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):117–128, 2011.
[17] Z. Jin, Y. Hu, Y. Lin, D. Zhang, S. Lin, D. Cai, and X. Li. Complementary projection hashing. In IEEE International Conference on Computer Vision, 2013.
[18] Z. Jin, C. Li, Y. Lin, and D. Cai. Density sensitive hashing. IEEE Trans. Cybernetics, 44(8):1362–1371, 2014.
[19] Z. Jin, D. Zhang, Y. Hu, S. Lin, D. Cai, and X. He. Fast and accurate hashing via iterative nearest neighbors expansion. IEEE Transactions on Cybernetics, 44(11):2167–2177, 2014.
[20] A. Joly and O. Buisson. Random maximum margin hashing. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[21] Y. Kalantidis and Y. S. Avrithis. Locally optimized product quantization for approximate nearest neighbor search. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[22] J. M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, 1997.
[23] W. Kong and W. Li. Isotropic hashing. In Advances in Neural Information Processing Systems 25, 2012.
[24] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Advances in Neural Information Processing Systems 22, NIPS, 2009.
[25] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In IEEE 12th International Conference on Computer Vision, 2009.
[26] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin. Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement (v1.0). arXiv:1610.02455, 2016.
[27] Y. Lin, R. Jin, D. Cai, S. Yan, and X. Li. Compressed hashing. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[28] W. Liu, J. He, and S.-F. Chang. Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[29] W. Liu, C. Mu, S. Kumar, and S. Chang. Discrete graph hashing. In Advances in Neural Information Processing Systems 27, 2014.
[30] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[31] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[32] J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel. Performance measures for information extraction. In Proceedings of DARPA Broadcast News Workshop, 2000.
[33] Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. arXiv:1603.09320, 2016.
[34] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2227–2240, 2014.
[35] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, 2011.
[36] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[37] J. Wang, S. Kumar, and S. Chang. Sequential projection learning for hashing with compact codes. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[38] H. S. Warren. Hacker's Delight, Chapter 5. Addison-Wesley Professional, 2012.
[39] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.
[40] Y. Weiss, R. Fergus, and A. Torralba. Multidimensional spectral hashing. In 12th European Conference on Computer Vision, 2012.
[41] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Advances in Neural Information Processing Systems 21, NIPS, 2008.
[42] B. Xu, J. Bu, Y. Lin, C. Chen, X. He, and D. Cai. Harmonious hashing. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013.
[43] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu. Complementary hashing for approximate nearest neighbor search. In IEEE International Conference on Computer Vision, 2011.
[44] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010.