
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 6, JUNE 2015

Hashing on Nonlinear Manifolds

Fumin Shen, Chunhua Shen, Qinfeng Shi, Anton van den Hengel, Zhenmin Tang, and Heng Tao Shen

Abstract— Learning-based hashing methods have attracted considerable attention due to their ability to greatly increase the scale at which existing algorithms may operate. Most of these methods are designed to generate binary codes preserving the Euclidean similarity in the original space. Manifold learning techniques, in contrast, are better able to model the intrinsic structure embedded in the original high-dimensional data. The complexities of these models, and the problems with out-of-sample data, have previously rendered them unsuitable for application to large-scale embedding, however. In this paper, how to learn compact binary embeddings on their intrinsic manifolds is considered. In order to address the above-mentioned difficulties, an efficient, inductive solution to the out-of-sample data problem, and a process by which nonparametric manifold learning may be used as the basis of a hashing method, are proposed. The proposed approach thus allows the development of a range of new hashing techniques exploiting the flexibility of the wide variety of manifold learning approaches available. It is particularly shown that hashing on the basis of t-distributed stochastic neighbor embedding outperforms state-of-the-art hashing methods on large-scale benchmark data sets, and is very effective for image classification with very short code lengths. It is shown that the proposed framework can be further improved, for example, by minimizing the quantization error with learned orthogonal rotations without much computation overhead. In addition, a supervised inductive manifold hashing framework is developed by incorporating the label information, which is shown to greatly advance the semantic retrieval performance.

Index Terms— Hashing, binary code learning, manifold learning, image retrieval.

Manuscript received August 14, 2014; revised December 6, 2014 and February 9, 2015; accepted February 10, 2015. Date of publication February 24, 2015; date of current version March 27, 2015. This work was supported in part by the Australian Research Council Future Fellowship under Grant FT120100969 and in part by the National Natural Science Foundation of China under Project 61472063 and Project 61473154. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Dacheng Tao.

F. Shen is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610051, China (e-mail: [email protected]).

C. Shen and A. van den Hengel are with the School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia, and also with the Australian Centre for Robotic Vision, Brisbane, QLD 4000, Australia (e-mail: [email protected]; [email protected]).

Q. Shi is with the School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia (e-mail: [email protected]).

Z. Tang is with the School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]).

H. T. Shen is with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2015.2405340

I. INTRODUCTION

One key challenge in many applications based on large-scale image data is how to index and organize the data accurately, but also efficiently. Various hashing techniques have attracted considerable attention in computer vision, information retrieval and machine learning [11], [13], [28], [29], [43], [49], [53], [56], and seem to offer great promise towards this goal. This paper focuses on hashing methods that aim to encode documents or images as a set of short binary codes, while maintaining aspects of the structure of the original data (e.g., similarities between data points). The advantage of these compact binary representations is that pairwise comparisons may be carried out extremely efficiently in the Hamming space. This means that many algorithms which are based on such pairwise comparisons can be made more efficient, and applied to much larger datasets. Due to the flexibility of hash codes, hashing techniques can be applied in many ways. One can, for example, efficiently perform a similarity search by exploring only those data points that fall into buckets close to the query in Hamming distance, or use the binary representations for other tasks such as image classification.

Locality sensitive hashing (LSH) [11] is one of the most well-known data-independent hashing methods, and generates hash codes based on random projections. With the success of LSH, random hash functions have been extended to several similarity measures, including p-norm distances [7], the Mahalanobis metric [26], and kernel similarity [25], [40]. However, the methods belonging to the LSH family normally require relatively long hash codes (compared to the recently developed data-dependent hashing algorithms) and several hash tables to achieve high precision and recall. This leads to a larger storage cost than would otherwise be necessary, and thus limits the scale at which the algorithm may be applied.

Data-dependent or learning-based hashing methods have been developed with the goal of learning more compact hash codes. Directly learning binary embeddings typically results in an optimization problem which is very hard to solve, however. Relaxation is often used to simplify the optimization (see [3], [49]). As in LSH, these methods aim to identify a set of hyperplanes, but now these hyperplanes are learned, rather than randomly selected. For example, PCAH [49], semi-supervised hashing (SSH) [49], iterative quantization (ITQ) [13] and isotropic hashing [23] generate linear hash functions through simple principal component analysis (PCA) projections, while LDAhash [3] is based on Linear Discriminant Analysis (LDA). Extending this idea, there are also methods which learn hash functions in a kernel space, such as binary reconstructive embeddings (BRE) [24], random maximum margin hashing (RMMH) [22] and kernel-based supervised hashing (KSH) [33].

Other representative methods in the literature include the unsupervised locally linear hashing [20], discrete graph hashing (DGH) [32], and the supervised minimal loss hashing (MLH) [38], ranking-based supervised hashing [50], two-step hashing (TSH) [30], FastHash [29], graph cuts coding (GCC) [10], etc. A bilinear form of hash function is adopted in [12] and [35].

Fig. 1. Top 10 retrieved digits for 4 queries (a) on a subset of MNIST with 300 samples. Search is conducted in the original feature space (b, c) and in the nonlinear embedding space produced by t-SNE [47] (d, e), using Euclidean distance (b, d) and Hamming distance (c, e).

In a departure from such methods, however, spectral hashing (SH) [53], one of the most popular learning-based methods, generates hash codes by solving a relaxed mathematical program that is similar to the one in Laplacian eigenmaps [1]. Embedding the original data into a low-dimensional space while simultaneously preserving the inherent neighborhood structure is critical for learning compact, effective hash codes. In general, nonlinear manifold learning methods are more powerful than linear dimensionality reduction techniques, as they are able to more effectively preserve the local structure of the input data without assuming global linearity [44]. The geodesic distance on a manifold has been shown to outperform the Euclidean distance in the high-dimensional space for image retrieval [16], for example. Figure 1 demonstrates that searching using either the Euclidean or Hamming distance after nonlinear embedding results in more semantically accurate neighbors than the same search in the original feature space, and thus that low-dimensional embedding may actually improve retrieval or classification performance. However, the only widely used nonlinear embedding method for hashing is Laplacian eigenmaps (LE) ([34], [53], [55]). Other effective manifold learning approaches (e.g., Locally Linear Embedding (LLE) [41], Elastic Embedding (EE) [4] or t-Distributed Stochastic Neighbor Embedding (t-SNE) [47]) have rarely been explored for hashing. Very recently, the authors of [20] chose to jointly minimize the LLE embedding error and the quantization loss with an orthogonal rotation.

One problem hindering the use of manifold learning for hashing is that these methods do not directly scale to large datasets. For example, constructing the neighborhood graph (or pairwise similarity matrix) in these algorithms for n data points is O(n^2) in time, which is intractable for large datasets. The second problem is that they are typically non-parametric and thus cannot efficiently solve the critical out-of-sample extension problem. This fundamentally limits their application to hashing, as generating codes for new samples is an essential part of the problem. One of the widely used solutions for the methods involving spectral decomposition (e.g., LLE, LE and isometric feature mapping (ISOMap) [45]) is the Nyström extension [2], [44], which solves the problem by learning eigenfunctions of a kernel matrix. As mentioned in [53], however, this is impractical for large-scale hashing since the Nyström extension is as expensive as doing exhaustive nearest neighbor search (O(n)). A more significant problem, however, is that the Nyström extension cannot be directly applied to non-spectral manifold learning methods such as t-SNE.

In order to address the out-of-sample extension problem, this study proposes a new non-parametric regression approach which is both efficient and effective. This method allows rapid assignment of new codes to previously unseen data in a manner which preserves the underlying structure of the manifold. Having solved the out-of-sample extension problem, a method by which a learned manifold may serve as the basis for a binary encoding is introduced. This method is designed so as to generate encodings which reflect the geodesic distances along such manifolds. On this basis, a range of new embedding approaches based on a variety of manifold learning methods are developed. The best performing of these is based on manifolds identified through t-SNE, which has been shown to be effective in discovering semantic manifolds amongst the set of all images [47].

Given the computational complexity of many manifold learning methods, it is shown in this work that it is possible to learn the manifold on the basis of a small subset of the data B (with size m ≪ n), and subsequently to inductively insert the remainder of the data, and any out-of-sample data, into the embedding in O(m) time per point. This process leads to an embedding method labelled Inductive Manifold-Hashing (IMH), which is shown to outperform state-of-the-art methods on several large-scale datasets both quantitatively and qualitatively.

As an extension, this study shows that the proposed IMH framework can be improved by minimizing the quantization error of mapping real-valued data to binary codes, for example through orthogonal rotations. Significant performance gains are achieved by this simple step, as shown in Section V. Based on supervised subspace learning, this study also presents a supervised inductive manifold hashing framework (IMHs), which is shown to significantly advance the semantic retrieval performance of IMH.

The rest of this paper is organized as follows. In Section II, some related representative hashing methods are briefly reviewed. Section III describes the proposed Inductive Manifold-Hashing framework, followed by the experimental results in Section IV. The IMH method is further shown to be improved by learned orthogonal rotations in Section V. Section VI introduces the supervised extension of the inductive manifold hashing framework based on supervised subspace learning.

This paper is an extended version of the work previously published in [42]. Major improvements over [42] include the minimization of quantization errors with learned rotations (Section V) and the extension to the supervised case (Section VI). We made the code available at https://github.com/chhshen/Hashing-on-Nonlinear-Manifolds.

II. RELATED WORK

Learning-based or data-dependent hashing has attracted considerable attention recently in the computer vision, machine learning and information retrieval communities. Many hashing methods have been proposed by applying different learning algorithms, including unsupervised methods [13], [34], [53], [55] and (semi-)supervised methods [24], [29], [33], [49]. In this section, some representative unsupervised hashing methods related to the proposed method are briefly reviewed.

A. Spectral Hashing

Weiss et al. [53] formulated the spectral hashing (SH) problem as

$$\min_{\mathbf{Y}} \sum_{\mathbf{x}_i, \mathbf{x}_j \in \mathbf{X}} w(\mathbf{x}_i, \mathbf{x}_j)\,\|\mathbf{y}_i - \mathbf{y}_j\|^2 \quad \text{s.t. } \mathbf{Y} \in \{-1, 1\}^{n \times r},\ \mathbf{Y}^\top \mathbf{Y} = n\mathbf{I},\ \mathbf{Y}^\top \mathbf{1} = \mathbf{0}. \tag{1}$$

Here y_i ∈ {−1, 1}^r, the i-th row of Y, is the hash code that one wants to learn for x_i ∈ R^d, which is one of the n data points in the training data set X. W ∈ R^{n×n} with W_{ij} = w(x_i, x_j) = exp(−‖x_i − x_j‖²/σ²) is the graph affinity matrix, where σ is the bandwidth parameter. I is the identity matrix. The last two constraints force the learned hash bits to be uncorrelated and balanced, respectively. By removing the first constraint (i.e., spectral relaxation [53]), Y can be easily obtained by spectral decomposition of the Laplacian matrix L = D − W, where D = diag(W1) and 1 is the vector of all ones. However, constructing W is O(dn²) (in time) and calculating the Nyström extension for a new point is O(rn), which are both intractable for large datasets. It is assumed in SH [53], therefore, that the data are sampled from a uniform distribution, which leads to a simple analytical eigenfunction solution of 1-D Laplacians. However, this strong assumption is often not true in practice and the manifold structure of the original data is thus destroyed [34].
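For concreteness, a minimal Python sketch of the spectral relaxation just described is given below. It is only a toy illustration of problem (1) with a small, dense affinity matrix, not the analytical eigenfunction solution SH actually uses at scale; the function name and parameter choices are ours.

import numpy as np
from scipy.spatial.distance import cdist

def spectral_relaxation_codes(X, r, sigma):
    # Toy illustration of the relaxed problem (1): drop the binary
    # constraint, solve the Laplacian eigenproblem, then threshold.
    # Only feasible for small n, since W is dense (O(n^2) memory/time).
    W = np.exp(-cdist(X, X, 'sqeuclidean') / sigma ** 2)
    L = np.diag(W.sum(axis=1)) - W            # L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)
    Y = eigvecs[:, 1:r + 1]                   # skip the trivial eigenvector
    return np.where(Y >= 0, 1, -1).astype(np.int8)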

SH was extended into the tangent space in [6], however, based on the same uniform assumption. The author of [6] also proposed a non-Euclidean SH algorithm based on nonlinear clustering, which is O(n³) for training and O(m + n/m) for testing. Weiss et al. [52] then improved SH by expanding the codes to include the outer-product eigenfunctions instead of only single-dimension eigenfunctions in SH, i.e., multidimensional spectral hashing (MDSH).

B. Graph Based Hashing

To efficiently solve problem (1), anchor graph hashing (AGH) [34] approximates the affinity matrix W by the low-rank matrix Ŵ = ZΛ⁻¹Zᵀ, where Z ∈ R^{n×m} is the normalized affinity matrix (with k non-zeros in each row) between the training samples and m anchors (generated by K-means), and Λ⁻¹ normalizes Ŵ to be doubly stochastic. Then the desired hash functions may be efficiently identified by binarizing the Nyström eigenfunctions [2] with the approximated affinity matrix Ŵ. AGH is thus efficient, in that it has linear training time and constant search time, but as is the case for SH [53], the generalized eigenfunction is derived only for the Laplacian eigenmaps embedding.
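The anchor-based approximation can be sketched as follows. This is an illustrative reading of the construction above (the function name, the Gaussian weighting and the defaults for k and σ are our assumptions), and Ŵ is formed explicitly only for clarity, since doing so is O(n²).

import numpy as np
from scipy.spatial.distance import cdist

def anchor_affinity(X, anchors, k=3, sigma=1.0):
    # Z (n x m): each row keeps Gaussian weights to its k nearest anchors,
    # normalized to sum to one, in the spirit of the AGH construction.
    dist = cdist(X, anchors, 'sqeuclidean')
    Z = np.zeros_like(dist)
    idx = np.argsort(dist, axis=1)[:, :k]
    rows = np.arange(X.shape[0])[:, None]
    Z[rows, idx] = np.exp(-dist[rows, idx] / sigma ** 2)
    Z /= Z.sum(axis=1, keepdims=True)
    col = np.maximum(Z.sum(axis=0), 1e-12)    # guard against unused anchors
    W_hat = Z @ np.diag(1.0 / col) @ Z.T      # low-rank approximation of W
    return Z, W_hat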

Different from SH and AGH, Locally Linear Hashing (LLH) [20] constructs the graph affinity by locality-sensitive sparse coding to better capture the local linearity of manifolds. With the obtained affinity matrix, LLH formulates hashing as a joint optimization problem of LLE embedding error and quantization loss.

C. Self-Taught Hashing

Self-taught Hashing (STH) [55] addressed the out-of-sample problem in a novel way: hash functions are obtained by training a support vector machine (SVM) classifier for each bit using the pre-learned binary codes as class labels. The binary codes were learned by directly solving (1) with a cosine similarity function. This process has prohibitive computational and memory costs, however, and training the SVM can be time consuming for dense data. Very recently, this idea was combined with graph cuts for the binary coding problem to bypass continuous relaxation [10], [29].

III. THE PROPOSED METHOD

A. Inductive Learning for Hashing

Assume that one has the manifold-based embedding Y := {y_1, y_2, ..., y_n} for the entire training data X := {x_1, x_2, ..., x_n}. Given a new data point x_q, one aims to generate an embedding y_q which preserves the local neighborhood relationships among its neighbors N_k(x_q) in X. The following simple objective is utilized:

$$C(\mathbf{y}_q) = \sum_{i=1}^{n} w(\mathbf{x}_q, \mathbf{x}_i)\,\|\mathbf{y}_q - \mathbf{y}_i\|^2, \tag{2}$$

where

$$w(\mathbf{x}_q, \mathbf{x}_i) = \begin{cases} \exp(-\|\mathbf{x}_q - \mathbf{x}_i\|^2/\sigma^2), & \text{if } \mathbf{x}_i \in N_k(\mathbf{x}_q), \\ 0, & \text{otherwise.} \end{cases}$$

Minimizing (2) naturally uncovers an embedding for the new point on the basis of its nearest neighbors on the low-dimensional manifold initially learned on the base set. That is, in the low-dimensional space, the new embedded location for the point should be close to those of the points close to it in the original space.

Differentiating C(y_q) with respect to y_q, one obtains

$$\frac{\partial C(\mathbf{y}_q)}{\partial \mathbf{y}_q}\bigg|_{\mathbf{y}_q = \mathbf{y}_q^\star} = 2\sum_{i=1}^{n} w(\mathbf{x}_q, \mathbf{x}_i)(\mathbf{y}_q^\star - \mathbf{y}_i) = \mathbf{0}, \tag{3}$$

which leads to the optimal solution

$$\mathbf{y}_q^\star = \frac{\sum_{i=1}^{n} w(\mathbf{x}_q, \mathbf{x}_i)\,\mathbf{y}_i}{\sum_{i=1}^{n} w(\mathbf{x}_q, \mathbf{x}_i)}. \tag{4}$$

Equation (4) provides a simple inductive formulation for the embedding: produce the embedding for a new data point by a (sparse) locally linear combination of the base embeddings.
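As a concrete illustration, the inductive rule (4) amounts to the following few lines. This is a sketch under our own naming and parameter defaults, with the Gaussian weights restricted to the k nearest neighbors exactly as defined above.

import numpy as np

def inductive_embed(x_q, X, Y, k=5, sigma=1.0):
    # Equation (4): embed a new point as the weighted average of the
    # embeddings of its k nearest neighbors in the original space.
    d2 = np.sum((X - x_q) ** 2, axis=1)
    nn = np.argsort(d2)[:k]                   # N_k(x_q)
    w = np.exp(-d2[nn] / sigma ** 2)
    return w @ Y[nn] / w.sum()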


The proposed approach here is inspired by Delalleau et al. [8], who focused on non-parametric graph-based learning in semi-supervised classification. The aim of this study is completely different: the present work attempts to scale up the manifold learning process for hashing in an unsupervised manner.

The resulting solution (4) is consistent with the basic smoothness assumption in manifold learning, that close-by data points lie on or close to a locally linear manifold [1], [41], [45]. This local-linearity assumption has also been widely used in semi-supervised learning [8], [54], image coding [51], and similar. This paper proposes to apply this assumption to hash function learning.

However, as aforementioned, (4) does not scale well, either for computing Y (O(n²), e.g., for LE) or for out-of-sample extension (O(n)), which is intractable for large-scale tasks. Next, it is shown that the following prototype algorithm is able to approximate y_q well using only a small base set.

B. The Prototype Algorithm

This prototype algorithm is based on entropy numbers, defined below.

Definition 1 (Entropy Numbers [18]): Given any Y ⊆ R^r and m ∈ N, the m-th entropy number ε_m(Y) of Y is defined as

$$\varepsilon_m(Y) := \inf\{\varepsilon > 0 \mid \mathcal{N}(\varepsilon, Y, \|\cdot - \cdot\|) \le m\},$$

where N is the covering number. This means ε_m(Y) is the smallest radius such that Y can be covered by at most m balls.

Inspired by [18, Th. 27], a prototype algorithm is constructed below. One can use m balls to cover Y, thus obtaining m disjoint nonempty subsets Y_1, Y_2, ..., Y_m such that for any ε > ε_m(Y), ∀ j ∈ {1, ..., m}, ∃ c_j ∈ R^r s.t. ∀ y ∈ Y_j, ‖c_j − y‖ ≤ ε, and ⋃_{j=1}^{m} Y_j = Y. One can see that each Y_j naturally forms a cluster with the center c_j and the index set I_j = {i | y_i ∈ Y_j}.

Let α_i = w(x_q, x_i) / ∑_{j=1}^{n} w(x_q, x_j) and C_j = ∑_{i∈I_j} α_i. For each cluster index set I_j, j = 1, ..., m, ℓ_j = ⌊mC_j + 1⌋ indices are randomly drawn from I_j proportional to their weights α_i. That is, for μ ∈ {1, ..., ℓ_j}, the μ-th randomly drawn index u_{j,μ} satisfies

$$\Pr(u_{j,\mu} = i) = \frac{\alpha_i}{C_j}, \quad \forall j \in \{1, \cdots, m\}.$$

The estimate ŷ_q is constructed as

$$\hat{\mathbf{y}}_q = \sum_{j=1}^{m} \frac{C_j}{\ell_j} \sum_{\mu=1}^{\ell_j} \mathbf{y}_{u_{j,\mu}}. \tag{5}$$

Lemma 1: There are at most 2m unique y_{u_{j,μ}} in ŷ_q.

Proof: ∑_{j=1}^{m} ℓ_j ≤ ∑_{j=1}^{m} (mC_j + 1) = 2m.

The following lemma shows that through the prototype algorithm the mean is preserved and the variance is small.

Lemma 2: The following holds:

$$\mathbb{E}[\hat{\mathbf{y}}_q] = \mathbf{y}_q, \qquad \mathrm{Var}(\hat{\mathbf{y}}_q) \le \frac{\varepsilon^2}{m}. \tag{6}$$

Proof:

$$\mathbb{E}[\hat{\mathbf{y}}_q] = \mathbb{E}\Big[\sum_{j=1}^{m} \frac{C_j}{\ell_j} \sum_{\mu=1}^{\ell_j} \mathbf{y}_{u_{j,\mu}}\Big] = \sum_{j=1}^{m} \frac{C_j}{\ell_j} \sum_{\mu=1}^{\ell_j} \mathbb{E}[\mathbf{y}_{u_{j,\mu}}] = \sum_{j=1}^{m} \frac{C_j}{\ell_j} \sum_{\mu=1}^{\ell_j} \sum_{i \in I_j} \frac{\alpha_i}{C_j}\, \mathbf{y}_i = \sum_{j=1}^{m} \sum_{i \in I_j} \alpha_i \mathbf{y}_i = \mathbf{y}_q.$$

$$\mathrm{Var}(\hat{\mathbf{y}}_q) = \sum_{j=1}^{m} \sum_{\mu=1}^{\ell_j} \mathrm{Var}\Big(\frac{C_j}{\ell_j} \mathbf{y}_{u_{j,\mu}}\Big) \le \sum_{j=1}^{m} \frac{C_j^2}{\ell_j^2} \sum_{\mu=1}^{\ell_j} \varepsilon^2 = \sum_{j=1}^{m} \frac{C_j^2}{\ell_j}\, \varepsilon^2 \le \sum_{j=1}^{m} \frac{C_j^2}{m C_j}\, \varepsilon^2 = \frac{\sum_{j=1}^{m} C_j}{m}\, \varepsilon^2 = \frac{\varepsilon^2}{m}.$$

Theorem 1: For any even number n′ ≤ n, if the prototype algorithm uses n′ non-zero y ∈ Y to express ŷ_q, then

$$\Pr\big[\|\hat{\mathbf{y}}_q - \mathbf{y}_q\| \ge t\big] < \frac{2\big(\varepsilon_{n'/2}(Y)\big)^2}{n' t^2}. \tag{7}$$

Proof: Via Chebyshev's inequality and Lemma 2, for any c > 0 one gets

$$\Pr\Big(\|\hat{\mathbf{y}}_q - \mathbf{y}_q\| \ge c\sqrt{\mathrm{Var}(\hat{\mathbf{y}}_q)}\Big) \le \frac{1}{c^2}.$$

Letting t = c√Var(ŷ_q) and ε → ε_{n′/2}(Y) yields the theorem.

Corollary 1: For an even number n′, any ε > ε_{n′/2}(Y), any δ ∈ (0, 1) and any t > 0, if n′ ≥ 2ε²/(δt²), then with probability at least 1 − δ,

$$\|\hat{\mathbf{y}}_q - \mathbf{y}_q\| < t.$$

Proof: Via Theorem 1, for ε > ε_{n′/2}(Y), Pr[‖ŷ_q − y_q‖ ≥ t] < 2ε²/(n′t²). Letting δ ≥ 2ε²/(n′t²), the following holds: n′ ≥ 2ε²/(δt²).

The quality of the approximation depends on ε_{n′/2}(Y) and n′. If the data have a strong clustering pattern, i.e., the data within each cluster are very close to the cluster center, one will have a small ε_{n′/2}(Y), and hence a better approximation. Likewise, the bigger n′ is, the better the approximation is.
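A sketch of this prototype sampling scheme is given below. Naming is ours; it assumes ℓ_j = ⌊mC_j + 1⌋ draws per cluster as written above, with the clusters supplied by any partitioning method (e.g., K-means).

import numpy as np

def prototype_estimate(alpha, cluster_index_sets, Y, m, rng=np.random.default_rng(0)):
    # alpha: length-n normalized weights alpha_i of the query w.r.t. all points.
    # cluster_index_sets: list of m index arrays I_j partitioning {0,...,n-1}.
    # Returns the estimate of (5).
    y_hat = np.zeros(Y.shape[1])
    for I_j in cluster_index_sets:
        C_j = alpha[I_j].sum()
        if C_j == 0:
            continue
        l_j = int(np.floor(m * C_j + 1))      # number of draws from this cluster
        p = alpha[I_j] / C_j                  # Pr(u_{j,mu} = i) = alpha_i / C_j
        draws = rng.choice(I_j, size=l_j, p=p, replace=True)
        y_hat += (C_j / l_j) * Y[draws].sum(axis=0)
    return y_hat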

C. Approximation of the Prototype Algorithm

For a query point x_q, the prototype algorithm samples from clusters and then constructs ŷ_q. The clusters can be obtained via clustering algorithms such as K-means. For each cluster, the higher C_j = ∑_{i∈I_j} α_i is, the more draws are made. At least one draw is made from each cluster. Since n could be potentially massive, it is impractical to rank (or compute and keep a few top) α_i within each cluster. Moreover, w(x_q, x_j) depends on x_q: for a different query point x_q′, w(x_q′, x_i) may be very small even if w(x_q, x_i) is high. Thus one needs to consider the entire X instead of a single x_q.

Recall that α_i(x_q) = w(x_q, x_i) / ∑_{j=1}^{n} w(x_q, x_j). Ideally, for each cluster, one wants to select the y_i that have high overall weight O_i = ∑_{x_q∈X} α_i(x_q). For large-scale X, the reality is that one does not have access to w(x, x′) for all x, x′ ∈ X. Only limited information is available, such as the cluster centers {c_j, j ∈ {1, ..., m}} and w(c_j, x), x ∈ X. Fortunately, the clustering result gives useful information about O_i. The cluster centers {c_j, j ∈ {1, ..., m}} have the largest overall weight w.r.t. the points from their own cluster, i.e., ∑_{i∈I_j} w(c_j, x_i). This suggests one should select all cluster centers to express ŷ_q. For a base set B and any query point x_q, the embedding is predicted as

$$\mathbf{y}_q = \frac{\sum_{\mathbf{x} \in B} w(\mathbf{x}_q, \mathbf{x})\, \mathbf{y}}{\sum_{\mathbf{x} \in B} w(\mathbf{x}_q, \mathbf{x})}. \tag{8}$$

Following many methods in the area (see [34], [53]), the general inductive hash function is formulated by binarizing the low-dimensional embedding:

$$h(\mathbf{x}) = \mathrm{sgn}\left(\frac{\sum_{j=1}^{m} w(\mathbf{x}, \mathbf{c}_j)\, \mathbf{y}_j}{\sum_{j=1}^{m} w(\mathbf{x}, \mathbf{c}_j)}\right), \tag{9}$$

where sgn(·) is the sign function and Y_B := {y_1, y_2, ..., y_m} is the embedding for the base set B := {c_1, c_2, ..., c_m}, which is the set of cluster centers obtained by K-means. Here the embeddings y_i are assumed to be centered on the origin. The proposed hashing method is termed Inductive Manifold-Hashing (IMH). The inductive hash function provides a means for generalization to new data, which takes constant O(dm + rk) time. With this, the embedding for the training data becomes

$$\mathbf{Y} = \mathbf{W}_{XB} \mathbf{Y}_B, \tag{10}$$

where W_{XB} is defined such that W_{ij} = w(x_i, c_j) / ∑_{j'=1}^{m} w(x_i, c_{j'}), for x_i ∈ X, c_j ∈ B.

Although the objective function (2) is formally related to LE, it is general in preserving local similarity. The embeddings Y_B can be learned by any appropriate manifold learning method which preserves the similarity of interest in the low-dimensional space. Several other embedding methods are empirically evaluated in Section III-F. Actually, as will be shown, some manifold learning methods (e.g., t-SNE, described in Section III-D) can be better choices for learning binary codes, although LE has been widely used. Two methods for learning Y_B will be discussed in the sequel.

Algorithm 1 Inductive Manifold-Hashing (IMH)

The IMH framework is summarized in Algorithm 1. Note that the computational cost is dominated by K-means in the first step, which is O(dmnl) in time (with l the number of iterations). Considering that m (normally a few hundred) is far less than n, and is a function of manifold complexity rather than the volume of data, the total training time is linear in the size of the training set. If the embedding method is LE, for example, then using IMH to compute Y_B requires constructing the small affinity matrix W_B and solving for r eigenvectors of the m × m Laplacian matrix L_B, which is O(dm² + rm). Note that in step 3, to compute W_XB, one needs to compute the distance matrix between B and X, which is the output of K-means, or can be computed additionally in O(dmn) time. The training process on a dataset of 70K items with 784 dimensions can thus be achieved in a few seconds on a standard desktop PC.
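The end-to-end procedure can be sketched as follows. This is a minimal Python reading of Algorithm 1 under our own naming: embed_base stands for whichever manifold learning method is applied to the base set, and the K-means call and parameter defaults are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import cdist

def train_imh(X, m, r, k, sigma, embed_base):
    # Step 1: base set B = cluster centers (K-means).
    B, _ = kmeans2(X, m, minit='++', seed=0)
    # Step 2: low-dimensional embedding of the base set (any manifold
    # learner can be plugged in here; embed_base is a placeholder).
    YB = embed_base(B, r)
    YB = YB - YB.mean(axis=0)                 # center on the origin
    return B, YB

def imh_codes(X_new, B, YB, k, sigma):
    # Inductive hash function (9): sparse locally linear combination
    # of the base embeddings, followed by sign thresholding.
    d2 = cdist(X_new, B, 'sqeuclidean')
    W = np.zeros_like(d2)
    idx = np.argsort(d2, axis=1)[:, :k]
    rows = np.arange(X_new.shape[0])[:, None]
    W[rows, idx] = np.exp(-d2[rows, idx] / sigma ** 2)
    W /= W.sum(axis=1, keepdims=True)         # W_XB as in (10)
    Y = W @ YB
    return np.where(Y >= 0, 1, -1).astype(np.int8)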

Connection to the Nyström Method: As with Equation (4), the Nyström eigenfunction of Bengio et al. [2] also generalizes to a new point by a linear combination of a set of low-dimensional embeddings:

$$\phi(\mathbf{x}) = \sqrt{n} \sum_{j=1}^{n} \tilde{k}(\mathbf{x}, \mathbf{x}_j)\, \mathbf{V}_{jr}\, \boldsymbol{\Sigma}_r^{-1}.$$

For LE, V_r and Σ_r correspond to the top r eigenvectors and eigenvalues of a normalized kernel matrix K̃ with

$$\tilde{K}_{ij} = \tilde{k}(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{n} \frac{w(\mathbf{x}_i, \mathbf{x}_j)}{\sqrt{\mathbb{E}_{\mathbf{x}}[w(\mathbf{x}_i, \mathbf{x})]\, \mathbb{E}_{\mathbf{x}}[w(\mathbf{x}, \mathbf{x}_j)]}}.$$

In AGH [34], the formulated hash function was proved to be the corresponding Nyström eigenfunction with the approximate low-rank affinity matrix. The Laplacian eigenmaps latent variable model (LELVM) [5] also formulated the out-of-sample mappings for LE in a manner similar to (4) by combining latent variable models. Both of these methods, and the proposed one, can thus be seen as applications of the Nyström method. Note, however, that the suggested method differs in that it is not restricted to spectral methods such as LE, and that the present study aims to learn binary hash functions for similarity-based search rather than dimensionality reduction. LELVM [5] cannot be applied to embedding methods other than LE.

D. Stochastic Neighborhood Preserving Hashing

In order to demonstrate the effectiveness of the proposed approach, a hashing method based on t-SNE [47], a non-spectral embedding method, is derived below. t-SNE is a modification of stochastic neighborhood embedding (SNE) [19] which aims to overcome the tendency of that method to crowd points together in one location. t-SNE provides an effective technique for visualizing data and dimensionality reduction, which is capable of preserving local structures in the high-dimensional data while retaining some global structures [47]. These properties make t-SNE a good choice for nearest neighbor search. Moreover, as stated in [48], the cost function of t-SNE in fact maximizes the smoothed recall [48] of query points and their neighbors.

The original t-SNE does not scale well, as it has a time complexity which is quadratic in n. More significantly, however, it has a non-parametric form, which means that there is no simple function which may be applied to out-of-sample data in order to calculate their coordinates in the embedded space. As was proposed in the previous subsection, one first applies t-SNE [47] to the base set B:

$$\min_{\mathbf{Y}_B} \sum_{\mathbf{x}_i \in B} \sum_{\mathbf{x}_j \in B} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right). \tag{11}$$


Here p_{ij} is the symmetrized conditional probability in the high-dimensional space, and q_{ij} is the joint probability defined using the t-distribution in the low-dimensional embedding space. The optimization problem (11) is easily solved by a gradient descent procedure (see [47] for details; a Matlab implementation of t-SNE is provided by the authors of [47] at http://homepage.tudelft.nl/19j49/t-SNE.html). After obtaining the embeddings Y_B of the samples x_i ∈ B, the hash codes for the entire dataset can be easily computed using (9). This method is labelled IMH-tSNE.
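A possible way to obtain the base embedding for IMH-tSNE is sketched below using scikit-learn's TSNE, which is our substitution for the Matlab implementation referenced above; the 'exact' method is required there when the target dimensionality exceeds 3, and all parameter values are illustrative.

from sklearn.manifold import TSNE

def tsne_embed_base(B, r):
    # Embed only the small base set with t-SNE (solving (11) on B).
    return TSNE(n_components=r, method='exact', init='pca',
                random_state=0).fit_transform(B)

# Plugging this into the hypothetical IMH sketch given after Algorithm 1
# yields IMH-tSNE, e.g.:
# B, YB = train_imh(X, m=400, r=32, k=5, sigma=1.0, embed_base=tsne_embed_base)
# codes = imh_codes(X, B, YB, k=5, sigma=1.0)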

E. Hashing With Relaxed Similarity Preservation

As in the last subsection, one can compute Y_B considering local smoothness only within B. Based on equation (4), in this subsection Y_B is alternatively computed by considering the smoothness both within B and between B and X. As in [8], the objective can be easily obtained by modifying (1) as:

$$C(\mathbf{Y}_B) = \underbrace{\sum_{\mathbf{x}_i, \mathbf{x}_j \in B} w(\mathbf{x}_i, \mathbf{x}_j)\, \|\mathbf{y}_i - \mathbf{y}_j\|^2}_{C_{BB}} \;+\; \lambda \underbrace{\sum_{\mathbf{x}_i \in B,\, \mathbf{x}_j \in X} w(\mathbf{x}_i, \mathbf{x}_j)\, \|\mathbf{y}_i - \mathbf{y}_j\|^2}_{C_{BX}} \tag{12}$$

where λ is the trade-off parameter. C_BB enforces smoothness of the learned embeddings within B, while C_BX ensures the smoothness between B and X. This formulation is actually a relaxation of (1), obtained by discarding the part which minimizes the dissimilarity within X (denoted C_XX). C_XX is ignored since computing the similarity matrix within X costs O(n²) time. The smoothness between points in X is implicitly ensured by (10).

Applying equation (10) for y_j, j ∈ X, to (12), one obtains the following problem:

$$\min\;\; \mathrm{trace}\big(\mathbf{Y}_B^\top (\mathbf{D}_B - \mathbf{W}_B)\mathbf{Y}_B\big) + \lambda\, \mathrm{trace}\big(\mathbf{Y}_B^\top (\mathbf{D}_{BX} - \mathbf{W}_{XB}^\top \mathbf{W}_{XB})\mathbf{Y}_B\big), \tag{13}$$

where D_B = diag(W_B 1) and D_BX = diag(W_BX 1) are both m × m diagonal matrices. Taking the constraint in (1), one obtains

$$\min_{\mathbf{Y}_B}\;\; \mathrm{trace}\big(\mathbf{Y}_B^\top (\mathbf{M} + \lambda \mathbf{T})\mathbf{Y}_B\big) \quad \text{s.t. } \mathbf{Y}_B^\top \mathbf{Y}_B = m\mathbf{I}, \tag{14}$$

where M = D_B − W_B and T = D_BX − W_XB^⊤ W_XB. The optimal solution Y_B of the above problem is easily obtained by identifying the r eigenvectors of M + λT corresponding to the smallest eigenvalues (excluding the eigenvalue 0 with respect to the trivial eigenvector 1); the parameter λ is set to 2 in all experiments. This method is named IMH-LE in the following text.
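A compact sketch of solving (14) is given below. It is an illustration under our own assumptions: W_B is the m × m affinity within the base set, W_XB is the n × m affinity of (10) with rows summing to one, and D_BX is computed from the column sums of W_XB.

import numpy as np

def imh_le_base_embedding(W_B, W_XB, r, lam=2.0):
    # Build M = D_B - W_B and T = D_BX - W_XB^T W_XB, then take the
    # eigenvectors of M + lam*T with the smallest nonzero eigenvalues.
    M = np.diag(W_B.sum(axis=1)) - W_B
    T = np.diag(W_XB.sum(axis=0)) - W_XB.T @ W_XB
    eigvals, eigvecs = np.linalg.eigh(M + lam * T)
    # Discard the trivial eigenvector (eigenvalue ~0), keep the next r;
    # scale so that YB^T YB is approximately m*I as required by (14).
    m = W_B.shape[0]
    YB = eigvecs[:, 1:r + 1] * np.sqrt(m)
    return YB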

F. Manifold Learning Methods for Hashing

In this section, different manifold learning methods are compared for hashing within the proposed IMH framework. Figure 2 reports the comparative results in mean average precision (MAP). For comparison, linear PCA is also evaluated within the framework (IMH-PCA in the figure).

Fig. 2. Comparison among different manifold learning methods within the proposed IMH hashing framework on CIFAR-10. IMH with linear PCA (IMH-PCA) and PCAH [49] are also evaluated for comparison. To distinguish it from IMH-LE in Section III-E, IMH with the original LE algorithm on the base set B is termed IMH-LEB. IMH-DM is IMH with the diffusion maps of [27]. The base set size is set to 400.

As can be clearly seen, IMH-tSNE, IMH-SNE and IMH-EE perform slightly better than IMH-LE (Section III-E). This is mainly because these three methods are able to preserve local neighborhood structure while, to some extent, preventing data points from crowding together. It is promising that all of these methods perform better than an exhaustive ℓ2 scan using the uncompressed GIST features.

Figure 2 shows that LE (IMH-LEB in the figure), the most widely used embedding method in hashing, does not perform as well as a variety of other methods (e.g., t-SNE), and in fact performs worse than PCA, which is a linear technique. This is not surprising because LE (and similarly LLE) tends to collapse large portions of the data (and not only nearby samples in the original space) close together in the low-dimensional space. The results are consistent with the analysis in [4] and [47]. Based on the above observations, we argue that manifold learning methods (e.g., t-SNE, EE), which not only preserve local similarity but also force dissimilar data apart in the low-dimensional space, are more effective than the popular LE for hashing. Preserving the global structure of the data is critical for the manifold learning algorithms like t-SNE used in the proposed IMH framework. The previous work on spline regression hashing (SRH) [36] also exploited both the local and global similarity structures of the data via a Laplacian matrix, which can decrease over-fitting, as discussed in [36]. SRH captures the global similarity structure of the data by constructing global non-linear hash functions, while, in the proposed method, capturing the global structure (i.e., by t-SNE) is only applied on the small base set and the hash function is formulated by a locally linear regression model.

It is interesting to see that IMH-PCA outperforms PCAH [49] by a large margin, despite the fact that PCA is performed on the whole training data set by PCAH. This shows that the generalization capability of IMH based on a very small set of data points also works for linear dimensionality reduction methods.

IV. EXPERIMENTAL RESULTS

IMH is evaluated on four large-scale image datasets: CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html), MNIST, SIFT1M [49] and GIST1M (http://corpus-texmex.irisa.fr/).


TABLE I. MAP (%) of different base generating methods: random sampling vs. K-means. The comparison is performed on the CIFAR-10 dataset with code lengths from 32 to 96 and base set size 400. Average results (standard deviation) are given based on 10 runs.

The MNIST dataset consists of 70,000 images, each of 784 dimensions, of handwritten digits from '0' to '9'. As a subset of the well-known 80M tiny image collection [46], CIFAR-10 consists of 60,000 images which are manually labelled as 10 classes with 6,000 samples for each class. Each image in this dataset is represented by a GIST feature vector [39] of dimension 512. For MNIST and CIFAR-10, the whole dataset is split into a test set with 1,000 samples and a training set with all remaining samples.

Nine hashing algorithms are compared, including the proposed IMH-tSNE and IMH-LE and seven other unsupervised state-of-the-art methods: PCAH [49], SH [53], AGH [34], STH [55], BRE [24], ITQ [13] and Spherical Hashing (SpH) [17]. The codes and suggested parameters provided by the authors of these methods are used. Due to the high computational cost of BRE and the high memory cost of STH, 1,000 and 5,000 training points are sampled for these two methods, respectively. The performance is measured by MAP or precision and recall curves for Hamming ranking using 16 to 128 hash bits. The results for hash lookup within a Hamming radius of 2, measured by the F1 score [37], F1 = 2(precision · recall)/(precision + recall), are also reported. Ground truths are defined by the category information for the labeled datasets MNIST and CIFAR-10, and by Euclidean neighbors for SIFT1M and GIST1M.

A. Base Selection

In this section, the CIFAR-10 dataset is taken as an example to compare different base generation methods and different base sizes for the proposed methods. AGH is also evaluated here for comparison. Table I compares three methods for generating base point sets: random sampling, K-medians and K-means on the training data. One can easily see that the performance of the proposed methods using K-means is better at all code lengths than that using the other two methods. It is also clear from Table I that the K-medians algorithm achieves better performance than random sampling, although it is still inferior to K-means.

Different from K-means, which constructs each base point by averaging the data points in the corresponding cluster, random sampling and K-medians generate the base set from real data samples in the training set. However, this property does not help K-medians obtain better hash codes than K-means. This is possibly because K-means produces the cluster centers with minimum quantization distortion in each group of the data points. Another advantage of K-means is that it is much more efficient than K-medians.

Fig. 3. MAP results versus varying base set size m (left, fixing k = 5) and number of nearest base points k (right, fixing m = 400) for the proposed methods and AGH. The comparison is conducted on the CIFAR-10 dataset using 64 bits.

As can also be seen, even with a base set generated by random sampling, the proposed methods outperform AGH in all cases but one. Due to the superior results and high efficiency in practice, the base set is generated by K-means in the following experiments.

From Figure 3, it is clear that the performance of the proposed methods is consistently improved with increasing base set size m, which is consistent with the analysis of the prototype algorithm. One can observe that the performance does not change significantly when varying the number of nearest base points k. It is also clear that IMH-LEB, which only enforces smoothness in the base set, does not perform as well as IMH-LE, which also enforces smoothness between the base set and the training set.

To further investigate the impact of the base set size m, in terms of both performance and efficiency, the proposed methods are evaluated over a wide range of m on CIFAR-10 and MNIST. The results for the proposed IMH-tSNE are shown in Table II. As can be clearly seen, for the CIFAR-10 dataset, the MAP score of IMH-tSNE improves consistently with increasing base set size m when m ≤ 400, while it does not change dramatically with larger m. On MNIST, the performance of IMH-tSNE is also consistently improved when m increases. As on CIFAR-10, the MAP score does not change significantly with larger m.

In terms of computational efficiency, on both datasets the training time increases considerably with larger m. For example, IMH-tSNE takes about 4.3 seconds with 400 base samples, while it takes more than 21 seconds with 1,000 base samples on the CIFAR-10 dataset. For the retrieval task, the testing time is more crucial. As shown in Table II, the testing time in general increases linearly with m. With a small m, the testing is very efficient. With a large m (e.g., m ≥ 3000), IMH-tSNE needs more than one millisecond to compute the binary codes of a query, which is not scalable for large-scale tasks. Taking both performance and computational efficiency into account, for the remainder of this paper the settings m = 400 and k = 5 are used for the proposed methods, unless otherwise specified.

TABLE II. Impact of the base set size m of IMH-tSNE on the retrieval performance and computational efficiency. The base set is generated by K-means. The results with 64 bits are reported, based on 10 independent runs. The experiments are conducted on a desktop PC with a 4-core 3.40GHz CPU and 32G RAM.

Fig. 4. Visualization of the digits in MNIST in 2D by t-SNE: (left) t-SNE embeddings of the base set (400 data points generated by K-means); (right) embeddings of all the 60,000 samples computed by the proposed inductive method. In the right figure, the labels of the digits are indicated with 10 different colors.

Fig. 5. Comparison of different methods on CIFAR-10 based on MAP (left) and F1 (right) for varying code lengths.

Figure 4 shows the t-SNE embeddings of the base set (400 points generated by K-means) and the embeddings of the whole MNIST dataset computed by the proposed inductive method (without binarization). As can be seen from the right figure, most of the points are close to their corresponding clusters (with respect to the 10 digits). This observation shows that, in the low-dimensional embedding space, the proposed method can well preserve the local manifold structure of the entire dataset based on a relatively small base set.

B. Results on CIFAR-10 Dataset

The comparative results based on MAP for Hamming ranking with code lengths from 16 to 128 bits are reported in Figure 5. It can be seen that the proposed IMH-LE and IMH-tSNE perform best in all cases. Among the proposed algorithms, the LE-based IMH-LE is inferior to the t-SNE-based IMH-tSNE. IMH-LE is still much better than AGH and STH, however. ITQ performs better than SpH and BRE on this dataset, but is still inferior to IMH. SH and PCAH perform worst in this case. This is because SH relies upon its uniform data assumption, while PCAH simply generates the hash hyperplanes by PCA directions, which does not explicitly capture the similarity information. The results are consistent with the complete precision and recall curves shown in the supplementary material. The F1 results for hash lookup with Hamming radius 2 are also reported. It can be seen that IMH-LE and IMH-tSNE also outperform all other methods by large margins. BRE and AGH obtain better results than the remaining methods, although the performance of all methods drops as the code length grows.

Fig. 6. Comparison of different methods on CIFAR-10 based on precision (left) and recall (right) using 64 bits. Please refer to the supplementary material for complete results for other code lengths.

Figure 6 shows the precision and recall curves of Hamming ranking for the compared methods. STH and AGH obtain relatively high precision when a small number of samples are returned; however, precision drops significantly as the number of retrieved samples increases. In contrast, IMH-tSNE, IMH-LE and ITQ achieve higher precision with relatively larger numbers of retrieved points.

C. Results on MNIST Dataset

The MAP and F1 scores for the compared methods are reported in Figure 7. As in Figure 5, IMH-tSNE achieves the best results. It is clear that, on this dataset, IMH-tSNE outperforms IMH-LE by a large margin, which increases as the code length increases. This further demonstrates the advantage of t-SNE as a tool for hashing by embedding high-dimensional data into a low-dimensional space. The dimensionality reduction procedure not only preserves the local neighborhood structure, but also reveals important global structure (such as clusters) [47]. Among the four LE-based methods, while IMH-LE shows a small advantage over AGH, both methods achieve much better results than STH and SH. ITQ and BRE obtain high MAPs with longer bit lengths, but they still perform less well for the hash lookup F1. PCAH performs worst in terms of both MAP and the F1 measure. Refer to the supplementary material for the complete precision and recall curves, which validate the observations here.

Fig. 7. Comparison of different methods on the MNIST dataset using MAP (left) and F1 (right) for varying code lengths.

TABLE III. Comparison of training and testing times (in seconds) on MNIST with 70K 784D feature points. The reported results for AGH and IMH include the time taken by the dominating K-means (8.9 seconds), which can be conducted in advance in practice. The experiments are based on a desktop PC with a 4-core 3.07GHz CPU and 8G RAM.

Efficiency: Table III shows training and testing times on the MNIST dataset for various methods, and shows that the linear method, PCAH, is fastest. IMH-tSNE is slower than IMH-LE, AGH and SH in terms of training time; however, all of these methods have relatively low execution times and are much faster than STH and BRE. In terms of test time, both IMH algorithms are comparable to the other methods, except STH, which takes much more time to predict the binary codes by SVM on this non-sparse dataset.

D. Results on SIFT1M and GIST1M

SIFT1M contains one million local SIFT descriptors extracted from a large set of images [49], each of which is represented by a 128D vector of histograms of gradient orientations. GIST1M contains one million GIST features, each represented by a 960D vector. For both of these datasets, one million samples are used as the training set and an additional 10K are used for testing. As in [49], ground truth is defined as the closest 2 percent of points as measured by the Euclidean distance. For these two large datasets, 1,000 points are generated by K-means and k is set to 2 for both IMH and AGH. The comparative results on SIFT1M and GIST1M are summarized in Figure 8 and Figure 9, respectively. Again, IMH consistently achieves superior results in terms of both F1 score and recall with Hamming radius 2. As can be seen, the performance of most of these methods decreases dramatically with increasing code length as the Hamming space becomes more sparse, which makes the hash lookup fail more often. However, IMH-tSNE still achieves relatively high scores with large code lengths. If one looks at Figure 8 (left), ITQ obtains the highest F1 with 16 bits; however, it decreases to near zero at 64 bits. In contrast, IMH-tSNE still manages an F1 of 0.2. Similar results are observed in the recall curves.

Fig. 8. Comparative results on SIFT1M for F1 (left) and recall (right) with Hamming radius 2. Ground truth is defined to be the closest 2 percent of points as measured by the Euclidean distance.

Fig. 9. Comparative results on GIST1M for F1 (left) and recall (right) with Hamming radius 2. Ground truth is defined to be the closest 2 percent of points as measured by the Euclidean distance.

Fig. 10. Classification accuracy (%) on MNIST with binary codes of various hashing methods by linear SVM.


Fig. 11. Evaluation of IMH with ITQ rotations on the MNIST and CIFAR datasets using F1 and recall with Hamming radius 2 for varying code lengths.

E. Classification on Binary Codes

In order to demonstrate classification performance, a linear SVM is trained on the binary codes generated by IMH for the MNIST data set. In order to learn codes with higher bit lengths for IMH and AGH, the base set size is set to 1,000. Accuracies of different binary encodings are shown in Figure 10. Both IMH and AGH achieve high accuracies on this dataset, although IMH performs better with higher code lengths. In contrast, the best results of all other methods, obtained by ITQ, are consistently worse than those for IMH, especially for short code lengths. Note that even with only 128-bit binary features IMH obtains a high accuracy of 94.1%. Interestingly, the same classification rate of 94.1% is obtained by applying the linear SVM to the uncompressed 784D features, which occupy several hundred times as much space as the learned hash codes.
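This experiment can be reproduced in outline as follows; the sketch uses scikit-learn's LinearSVC as a stand-in for whatever SVM implementation the authors used, and the hyperparameter is illustrative.

from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def classify_binary_codes(codes_train, y_train, codes_test, y_test, C=1.0):
    # Train a linear SVM directly on the +/-1 hash codes, as in the
    # experiment described above.
    clf = LinearSVC(C=C).fit(codes_train, y_train)
    return accuracy_score(y_test, clf.predict(codes_test))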

V. MINIMIZING THE QUANTIZATION DISTORTION BY LEARNED ROTATIONS

In the above sections, the binary codes are obtained by directly thresholding the learned embeddings at zero. This simple binarization may cause a large quantization loss. In this section, learned rotations are applied to minimize the quantization error. That is, the data points are normalized such that they are zero-centered in the embedded space, and the normalized embeddings are then orthogonally rotated before binarization. Orthogonal rotation has been adopted by various methods (see [9], [13], [15], [21]) to minimize the quantization error. The simple algorithm in ITQ [13] is used for the proposed method.
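A compact sketch of the alternating procedure in the spirit of ITQ [13] is given below; the function name, the iteration count and the random initialization are our own choices rather than the authors' implementation.

import numpy as np

def learn_rotation(Y, n_iter=50, seed=0):
    # Y: n x r zero-centered real-valued embeddings.
    # Alternate between B = sgn(Y R) and the orthogonal Procrustes
    # solution for R, which minimizes ||B - Y R||_F.
    rng = np.random.default_rng(seed)
    R, _ = np.linalg.qr(rng.standard_normal((Y.shape[1], Y.shape[1])))
    for _ in range(n_iter):
        B = np.where(Y @ R >= 0, 1, -1)
        U, _, Vt = np.linalg.svd(B.T @ Y)
        R = (U @ Vt).T                        # R = V U^T for the current B
    return R, np.where(Y @ R >= 0, 1, -1).astype(np.int8)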

The impact of the rotations applied to the learned embeddings is evaluated on MNIST and CIFAR. Figure 11 clearly shows that the orthogonal rotations achieve significant performance improvements (in terms of both F1 and recall) for IMH-LE and IMH-tSNE. In conjunction with the rotations, the proposed IMH-LE and IMH-tSNE methods perform much better than the PCA-based PCA-ITQ. Again this result demonstrates the advantages of the proposed manifold hashing method.

VI. SEMANTIC HASHING WITH SUPERVISED MANIFOLD LEARNING

The proposed inductive manifold hashing algorithm hasshown to work well on preserving the semantic neighborhoodrelationships without using label information. It is expectedthat the performance can be improved by applying supervisedlearning methods instead of the unsupervised ones to learn thenonlinear embeddings. A straightforward supervised extensionto the proposed IMH algorithm is proposed in this study.First, the base set B := {c1,1, · · · , cm1,1, · · · , c1,t · · · , cmt ,t }is generated by applying K-means on data from each of

Page 11: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 6 ...cfm.uestc.edu.cn/~fshen/TIP2015-Hashing on Nonlinear Manifolds.pdfhashing framework is developed by incorporating the label

SHEN et al.: HASHING ON NONLINEAR MANIFOLDS 1849

Fig. 12. Evaluation of IMH with supervised learning by LDA on the MNIST and CIFAR datasets. Non-linear embeddings of IMH are obtained by t-SNE.Since there are only 10 classes with both these two datasets, the reduced dimensionality by LDA (thereby the binary code lenght) is set to 9.

After the nonlinear embeddings Y_B of B are obtained, supervised subspace learning algorithms are simply applied to Y_B. For a new data point, its binary codes are then obtained by (9). The supervised manifold hashing method is summarized in Algorithm 2.
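A minimal sketch of this pipeline is given below, assuming scikit-learn. It covers the per-class K-means construction of the base set and the LDA projection of the base embeddings; the nonlinear embedding step and the inductive embedding of queries via (9) are assumed to be available and are simply passed in as Y_B and Y_query. All function and variable names, and the choice to threshold at the mean of the projected base embeddings, are illustrative assumptions rather than the released implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def build_base_set(X, labels, m_per_class, seed=0):
    # Run K-means separately on the training data of each class; the cluster
    # centres of all classes form the base set B, kept with their class labels.
    centres, centre_labels = [], []
    for c in np.unique(labels):
        km = KMeans(n_clusters=m_per_class, n_init=10, random_state=seed)
        km.fit(X[labels == c])
        centres.append(km.cluster_centers_)
        centre_labels.append(np.full(m_per_class, c))
    return np.vstack(centres), np.concatenate(centre_labels)

def supervised_codes(Y_B, base_labels, Y_query, n_bits=9):
    # Learn an LDA projection on the nonlinear base embeddings Y_B, apply it to
    # the query embeddings (obtained inductively), and threshold at the mean of
    # the projected base set, which is equivalent to centring and thresholding at zero.
    lda = LinearDiscriminantAnalysis(n_components=n_bits)
    Z_B = lda.fit_transform(Y_B, base_labels)
    Z_q = lda.transform(Y_query)
    return (Z_q > Z_B.mean(axis=0)).astype(np.uint8)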

In this section, linear discriminant analysis (LDA) is taken as an example to verify the efficacy of the proposed supervised manifold hashing algorithm. The proposed method is also compared with several recently proposed supervised hashing approaches, including semi-supervised hashing with sequential projection learning (SSH [49]), kernel-based supervised hashing (KSH [33]), and ITQ with canonical correlation analysis (CCA-ITQ [14]).

Experiments are performed on MNIST and CIFAR. In this experiment, 2,000 labelled samples are randomly selected for supervised learning with SSH and KSH, and 1,000 labelled samples are sampled for the base set of IMHs. All labelled training data are used for the linear CCA-ITQ. Since there are only 10 classes in both of these datasets, the reduced dimensionality produced by LDA (and thereby the binary code length) in the proposed IMHs is fixed at 9. The results are reported in Figure 12. It is clear that the proposed supervised inductive manifold hashing algorithm IMHs significantly improves on the original IMH and the other compared supervised methods in both MAP and the F1 measure. The ITQ rotations (IMHs-ITQ in the figure) further improve IMHs by considerable margins, especially on the CIFAR dataset of natural images. Among the other supervised hashing methods, KSH obtains the highest MAPs on MNIST and CIFAR after IMHs and IMHs-ITQ; however, it needs much larger binary code lengths to achieve performance comparable with the proposed methods. In terms of F1, CCA-ITQ obtains the best results among the compared methods with long codes.

From these results, it is clear that label information is very useful for obtaining semantically effective hash codes. Furthermore, the proposed simple supervised hashing framework can effectively leverage the supervised information within the proposed manifold hashing algorithms. Also note that the proposed method does not assume a specific algorithm such as LDA; any other supervised subspace learning or metric learning algorithm may further improve the performance.

VII. CONCLUSION AND DISCUSSION

We have proposed a simple yet effective hashing framework, which provides a practical connection between manifold learning methods (typically non-parametric and computationally expensive) and hash function learning (which requires high efficiency). By preserving the underlying manifold structure with several non-parametric dimensionality reduction methods, the proposed hashing methods outperform several state-of-the-art methods in terms of both hash lookup and Hamming ranking on several large-scale retrieval datasets. The proposed inductive hashing methods require only linear time (O(n)) to index all of the training data and constant time to search for a novel query. Experiments showed that the hash codes can also achieve promising results on a classification problem even with very short code lengths. The proposed inductive manifold hashing method was then extended by applying orthogonal rotations to the learned nonlinear embeddings to minimize the quantization error, which was shown to achieve significant performance improvements. In addition, this work further extended IMH by adopting supervised subspace learning on the data manifolds, which provides an effective supervised manifold hashing framework.

The proposed hashing methods have been shown to work well on image retrieval and classification tasks. As an efficient and effective nonlinear feature extraction method, the algorithm can also be applied to other problems, especially those that must handle large datasets. For example, the introduced hashing techniques can be applied to large-scale mobile video retrieval [31]. Another useful application of the hashing methods is compressing high-dimensional features into short binary codes, which could significantly speed up downstream tasks such as large-scale ImageNet image classification [12], [29].

In this work, the base set size m was set empirically, which is clearly not optimal. How to set this parameter automatically according to the size and distribution of a specific dataset deserves future study.

REFERENCES

[1] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2001, pp. 585–591.

[2] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet, "Learning eigenfunctions links spectral embedding and kernel PCA," Neural Comput., vol. 16, no. 10, pp. 2197–2219, 2004.

[3] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, "LDAHash: Improved matching with smaller descriptors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 66–78, Jan. 2012.

[4] M. A. Carreira-Perpinán, "The elastic embedding algorithm for dimensionality reduction," in Proc. Int. Conf. Mach. Learn., 2010, pp. 167–174.

[5] M. Carreira-Perpinán and Z. Lu, "The Laplacian eigenmaps latent variable model," in Proc. Int. Conf. Artif. Intell. Statist., 2007, pp. 59–66.

[6] R. Chaudhry and Y. Ivanov, "Fast approximate nearest neighbor methods for non-Euclidean manifolds with applications to human activity analysis in videos," in Proc. 11th Eur. Conf. Comput. Vis., 2010, pp. 735–748.

[7] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proc. 20th Annu. Symp. Comput. Geometry, 2004, pp. 253–262.

[8] O. Delalleau, Y. Bengio, and N. Le Roux, "Efficient non-parametric function induction in semi-supervised learning," in Proc. 10th Int. Workshops Artif. Intell. Statist., 2005, pp. 96–103.

[9] T. Ge, K. He, Q. Ke, and J. Sun, "Optimized product quantization for approximate nearest neighbor search," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2946–2953.

[10] T. Ge, K. He, and J. Sun, "Graph cuts for supervised binary coding," in Proc. 13th Eur. Conf. Comput. Vis., 2014, pp. 250–264.

[11] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proc. 25th Int. Conf. Very Large Data Bases, 1999, pp. 518–529.

[12] Y. Gong, S. Kumar, H. A. Rowley, and S. Lazebnik, "Learning binary codes for high-dimensional data using bilinear projections," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 484–491.

[13] Y. Gong and S. Lazebnik, "Iterative quantization: A procrustean approach to learning binary codes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 817–824.

[14] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, "Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2916–2929, Dec. 2013.

[15] K. He, F. Wen, and J. Sun, "K-means hashing: An affinity-preserving quantization method for learning binary compact codes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2938–2945.

[16] X. He, W.-Y. Ma, and H.-J. Zhang, "Learning an image manifold for retrieval," in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, pp. 17–23.

[17] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, "Spherical hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2957–2964.

[18] R. Herbrich and R. C. Williamson, "Algorithmic luckiness," J. Mach. Learn. Res., vol. 3, pp. 175–212, Sep. 2002.

[19] G. E. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2002, pp. 833–840.

[20] G. Irie, Z. Li, X.-M. Wu, and S.-F. Chang, "Locally linear hashing for extracting non-linear manifolds," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2123–2130.

[21] H. Jégou, M. Douze, C. Schmid, and P. Pérez, "Aggregating local descriptors into a compact image representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3304–3311.

[22] A. Joly and O. Buisson, "Random maximum margin hashing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 873–880.

[23] W. Kong and W.-J. Li, "Isotropic hashing," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran & Associates Inc., 2012, pp. 1646–1654.

[24] B. Kulis and T. Darrell, "Learning to hash with binary reconstructive embeddings," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran & Associates Inc., 2009.

[25] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 2130–2137.

[26] B. Kulis, P. Jain, and K. Grauman, "Fast similarity search for learned metrics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 12, pp. 2143–2157, Dec. 2009.

[27] S. Lafon and A. B. Lee, "Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1393–1403, Sep. 2006.

[28] X. Li, G. Lin, C. Shen, A. van den Hengel, and A. Dick, "Learning hash functions using column generation," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 142–150.

[29] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter, "Fast supervised hashing with decision trees for high-dimensional data," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1971–1978.

[30] G. Lin, C. Shen, D. Suter, and A. van den Hengel, "A general two-step approach to learning-based hashing," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2552–2559.

[31] J. Liu, Z. Huang, H. Cai, H. T. Shen, C. W. Ngo, and W. Wang, "Near-duplicate video retrieval: Current research and future trends," ACM Comput. Surv., vol. 45, no. 4, 2013, Art. ID 44.

[32] W. Liu, C. Mu, S. Kumar, and S.-F. Chang, "Discrete graph hashing," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran & Associates Inc., 2014, pp. 3419–3427.

[33] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, "Supervised hashing with kernels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2074–2081.

[34] W. Liu, J. Wang, S. Kumar, and S.-F. Chang, "Hashing with graphs," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 1–8.

[35] W. Liu, J. Wang, Y. Mu, S. Kumar, and S.-F. Chang, "Compact hyperplane hashing with bilinear functions," in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 17–24.

[36] Y. Liu, F. Wu, Y. Yang, Y. Zhuang, and A. G. Hauptmann, "Spline regression hashing for fast image search," IEEE Trans. Image Process., vol. 21, no. 10, pp. 4480–4491, Oct. 2012.

[37] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge Univ. Press, 2008.

[38] M. Norouzi and D. J. Fleet, "Minimal loss hashing for compact binary codes," in Proc. 28th Int. Conf. Mach. Learn., 2011, pp. 353–360.

[39] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.

[40] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran & Associates Inc., 2009.

[41] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.


[42] F. Shen, C. Shen, Q. Shi, A. van den Hengel, and Z. Tang, "Inductive hashing on manifolds," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1562–1569.

[43] J. Song, Y. Yang, Y. Yang, Z. Huang, and H. T. Shen, "Inter-media hashing for large-scale retrieval from heterogeneous data sources," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2013, pp. 785–796.

[44] A. Talwalkar, S. Kumar, and H. A. Rowley, "Large-scale manifold learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8.

[45] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.

[46] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.

[47] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.

[48] J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski, "Information retrieval perspective to nonlinear dimensionality reduction for data visualization," J. Mach. Learn. Res., vol. 11, pp. 451–490, Feb. 2010.

[49] J. Wang, S. Kumar, and S.-F. Chang, "Semi-supervised hashing for large-scale search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2393–2406, Dec. 2012.

[50] J. Wang, W. Liu, A. X. Sun, and Y.-G. Jiang, "Learning hash codes with listwise supervision," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 3032–3039.

[51] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, "Locality-constrained linear coding for image classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 3360–3367.

[52] Y. Weiss, R. Fergus, and A. Torralba, "Multidimensional spectral hashing," in Proc. 12th Eur. Conf. Comput. Vis., 2012, pp. 340–353.

[53] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran & Associates Inc., 2008.

[54] K. Yu, T. Zhang, and Y. Gong, "Nonlinear learning using local coordinate coding," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran & Associates Inc., 2009.

[55] D. Zhang, J. Wang, D. Cai, and J. Lu, "Self-taught hashing for fast similarity search," in Proc. 33rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., 2010, pp. 18–25.

[56] X. Zhu, Z. Huang, H. Cheng, J. Cui, and H. T. Shen, "Sparse hashing for fast multimedia search," ACM Trans. Inf. Syst., vol. 31, no. 2, 2013, Art. ID 9.

Fumin Shen received the B.S. degree from Shandong University, in 2007, and the Ph.D. degree from the Nanjing University of Science and Technology, China, in 2014. He is currently a Lecturer with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, China. His major research interests include computer vision and machine learning, including face recognition, image analysis, hashing methods, and robust statistics with its applications in computer vision.

Chunhua Shen is currently a Professor with the School of Computer Science, The University of Adelaide. His current research interests include the intersection of computer vision and statistical machine learning. He received the Australian Research Council Future Fellowship in 2012. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS.

Qinfeng Shi received the bachelor's and master's degrees in computer science and technology from Northwestern Polytechnical University, in 2003 and 2006, respectively, and the Ph.D. degree in computer science from Australian National University, in 2011. He is currently a Senior Lecturer with the Australian Centre for Visual Technologies and the School of Computer Science, The University of Adelaide.

Anton van den Hengel received the B.Sc. and L.L.B. degrees, the master's degree in computer science, and the Ph.D. degree in computer vision from The University of Adelaide, in 1991, 1993, 1994, and 2000, respectively. He is the Founding Director of the Australian Centre for Visual Technologies. He is currently a Professor with the School of Computer Science, The University of Adelaide.

Zhenmin Tang received the Ph.D. degree from the Nanjing University of Science and Technology, Nanjing, China. He is currently a Professor and the Head of the School of Computer Science with the Nanjing University of Science and Technology. He has authored over 80 papers. His major research areas include intelligent systems, pattern recognition, image processing, and embedded systems. He is also the Leader of several key programs of the National Natural Science Foundation of China.

Heng Tao Shen received the B.Sc. (Hons.) and Ph.D. degrees from the Department of Computer Science, National University of Singapore, in 2000 and 2004, respectively. He joined The University of Queensland as a Lecturer, Senior Lecturer, and Reader, where he became a Professor in 2011. He is currently a Professor of Computer Science and an ARC Future Fellow with the School of Information Technology and Electrical Engineering, The University of Queensland. He is a Visiting Professor with Nagoya University and the National University of Singapore. His research interests mainly include multimedia/mobile/web search, and big data management on spatial, temporal, multimedia, and social media databases. He received the Chris Wallace Award for outstanding research contribution by the Computing Research and Education Association, Australasia, in 2010. He has published extensively and served on program committees in the most prestigious international publication venues of interest. He is an Associate Editor of the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, and a PC Co-Chair of ACM Multimedia 2015.