
Pairwise Rotation Hashing for High-dimensional Features

Kohta Ishikawa, Ikuro Sato, and Mitsuru Ambai

Denso IT Laboratory, Inc.
[email protected]

arXiv:1501.07422v1 [cs.CV] 29 Jan 2015

Abstract. Binary hashing is widely used for effective approximate nearest neighbors search. Even though various binary hashing methods have been proposed, very few are feasible for the extremely high-dimensional features often used in visual tasks today. We propose a novel, highly sparse linear hashing method based on pairwise rotations. The encoding cost of the proposed algorithm is O(n log n) for n-dimensional features, whereas that of the existing state-of-the-art method is typically O(n²). The proposed method is also remarkably faster in the learning phase. Along with this efficiency, its retrieval accuracy is comparable to, or slightly better than, the state of the art. The pairwise rotations used in our method are formulated from an analytical study of the trade-off between the quantization error and the entropy of binary codes. Although these hashing criteria are widely used in previous research, their analytical behavior has rarely been studied. All building blocks of our algorithm are based on the analytical solution, and it thus provides a fairly simple and efficient procedure.

1 Introduction

Approximate nearest neighbors (ANN) search is widely used in retrieval [1,2,3,4], and the scale of its databases has been increasing rapidly in recent times. Furthermore, to achieve more accurate retrieval results, high-dimensional features such as Fisher Vectors [5,6] and VLAD [7] are being used in the computer vision community. To achieve feasible retrieval with such features, highly efficient ANN search methods are needed.

Vector Quantization based methods are widely used and actively studied for ANN. For high dimensions, Product Quantization [8] and its family are the state-of-the-art methods [9]. These methods decompose the high-dimensional vector space into a direct product of small subspaces, and clustering is then applied in each subspace to obtain representative vectors (quantizers). Although product quantization based methods are applicable to high-dimensional features, it is still not easy to obtain a good quantizer in some cases, and the random rotation often applied before PQ is expensive in high dimensions. Moreover, the floating-point distance calculations needed for retrieval are expensive compared to binary-based methods [10].

Binary hashing is one of the most commonly used techniques for efficient retrieval [11,12,13,14], recognition [15], and other problems [7,16]. It is a family of methods that transform real-valued feature vectors into binary-valued ones. Binary-valued vectors are highly favorable for large-scale or high-dimensional tasks because they provide high memory efficiency and fast Hamming distance computation. Many methods have been proposed; major approaches are categorized as Vector Quantization (VQ) based methods [17,18,19,9], hyperplane-based linear methods [20,21,10], and nonlinear hashing-function methods [22,23,24,25,26].
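To make the speed claim concrete, the following is a minimal sketch (ours, not from the paper) of Hamming distance computation on packed binary codes; the uint8 packing convention is an assumption.

    import numpy as np

    # A binary code in {-1, 1}^n can be packed into n/8 bytes; the Hamming
    # distance between two codes is then an XOR followed by a popcount.
    def hamming(a: np.ndarray, b: np.ndarray) -> int:
        """a, b: uint8 arrays holding the packed bits of two binary codes."""
        return int(np.unpackbits(np.bitwise_xor(a, b)).sum())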

A typical nonlinear method is Spectral Hashing [22], whose hashing functions are nonlinear eigenfunctions derived from the distribution of data. Some families of Locality Sensitive Hashing use nonlinear hashing functions [27], and kernelized approaches have also been proposed [28,29,30]. Spectral Hashing ordinarily assumes a uniform distribution to derive an analytical solution, and its precision is empirically lower than that of ITQ [20] and other state-of-the-art methods on non-uniformly distributed data. To overcome this difficulty, kernel-based approaches have been proposed [16], but they are difficult to apply in high dimensions. The recently proposed Spherical Hashing [31] is an example of a non-kernelized nonlinear method. Since its hashing function is hypersphere-based, it also requires Euclidean distance calculations for hashing, and this computational cost becomes large for high-dimensional data.

Recently, a bilinear hashing method called BPBC, which is feasible in high dimensions, was proposed [10]. To our knowledge, this is the first binary hashing method that can treat 10K dimensions or higher. However, this method folds feature vectors and bilinearly rotates them in the folded space; it cannot treat all of the Special Orthogonal (rotational) group SO(n). There is still no linear high-dimensional binary hashing method that can directly treat SO(n).

In this paper, we propose a new, highly efficient linear binary hashing method. Our method is inspired by Isotropic Hashing [32] and can be seen as a natural extension of it. First, we study the meaning of the isotropic transformation analytically. We then develop an efficient isotropic hashing algorithm and its extension, exploiting the trade-off between isotropy and entropy. The recently proposed Sparse Isotropic Hashing [33] also produces sparse rotation matrices that yield isotropic variances; however, its learning of high-dimensional rotation matrices is not feasible in practice. Our main contributions are:

1: State-of-the-art computational cost and accuracy

Our algorithm takes O(n log n) encoding cost for n-dimensional features. The previously known state-of-the-art method, BPBC, requires O(n²/d + nd) (typically d = 128, in the case of no dimension reduction). We show that the proposed algorithm is more accurate than BPBC. Moreover, it is remarkably faster in the learning phase. The main computational cost of our algorithm is the calculation of a variance-covariance matrix: we need only O(n² log n) computation over the whole learning iteration loop, whereas BPBC requires O(n² log n) in each iteration step. Our algorithm is therefore faster in practice, although the total learning cost of both methods has the same order, O(mn²), for training data of size m.


2: Analytical treatment of hashing criteria

Typical criteria for measuring hashing performance are quantization error, variance of each bit, and entropy. To the authors' knowledge, the analytical treatment of these criteria has not been well studied. We present an analytical result and a set of algorithms derived naturally from it. The analytical calculation is mainly based on the gaussian distribution. That does not mean the proposed algorithm is applicable only to gaussian-distributed data: any non-gaussian distribution can be expanded around a gaussian [34], so assuming a gaussian distribution amounts to taking the lowest order of that expansion. As will be discussed below, this lowest-order approximation provides sufficient hashing accuracy and yields an extremely efficient algorithm.

2 Theoretical Background

2.1 Quantization Error for Binary Hashing

As with clustering methods such as k-means, most binary hashing algorithms aim at minimizing the quantization error between the binarized codes and the original feature vectors [20,10]. In this study, therefore, the properties of the quantization error are first investigated analytically. The result will be used to develop the binary hashing algorithm we propose.

Most linear binary hashing methods consist of a translation and a linear transformation:

\[
b(x) \equiv \mathrm{sgn}\big(A(x - t)\big), \qquad x, t \in \mathbb{R}^n, \quad b(x) \in \{-1, 1\}^m, \quad A \in \mathbb{R}^{m \times n}. \tag{1}
\]
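As a minimal NumPy sketch of the generic hashing map of Eq. (1) (the sign convention at zero is our assumption):

    import numpy as np

    def binary_hash(X, A, t):
        """Linear binary hashing b(x) = sgn(A(x - t)) of Eq. (1).
        X: (N, n) data, A: (m, n) transform, t: (n,) translation.
        Returns codes in {-1, +1}^(N, m)."""
        Z = (X - t) @ A.T
        return np.where(Z >= 0, 1, -1)  # take sgn(0) := +1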

It is assumed in the following discussion that the translation is mean centering. The quantization error is defined as the mean squared Euclidean distance between an original feature vector and its binarized vector,

\[
E_q \equiv \frac{1}{N}\sum_i |x_i - b(x_i)|^2, \tag{2}
\]

where N is the number of data points.
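Empirically, Eq. (2) is a one-liner; this sketch assumes mean-centered features whose scale is comparable to the {-1, +1} codes:

    import numpy as np

    def quantization_error(X, codes):
        """Mean squared distance between features and binary codes, Eq. (2)."""
        return float(np.mean(np.sum((X - codes) ** 2, axis=1)))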

When the data follow an arbitrary distribution function p(x), the mean quantization error can be written down explicitly. For two-dimensional data,

\[
\begin{aligned}
E_q^{N\to\infty} ={}& \int_0^{\infty}\!dx_1\!\int_0^{\infty}\!dx_2\,\big[(x_1-1)^2+(x_2-1)^2\big]\,p(x) \\
&+ \int_{-\infty}^{0}\!dx_1\!\int_0^{\infty}\!dx_2\,\big[(x_1+1)^2+(x_2-1)^2\big]\,p(x) \\
&+ \int_{-\infty}^{0}\!dx_1\!\int_{-\infty}^{0}\!dx_2\,\big[(x_1+1)^2+(x_2+1)^2\big]\,p(x) \\
&+ \int_0^{\infty}\!dx_1\!\int_{-\infty}^{0}\!dx_2\,\big[(x_1-1)^2+(x_2+1)^2\big]\,p(x). \tag{3}
\end{aligned}
\]


Then it is calculated in general as follows:

\[
\begin{aligned}
E_q^{N\to\infty} ={}& \int_{-\infty}^{\infty}\!dx_1\!\int_{-\infty}^{\infty}\!dx_2\,\big[x_1^2+x_2^2+2\big]\,p(x) \\
&- 2\bigg[\int_0^{\infty}\!dx_1\,x_1 p_1(x_1) - \int_{-\infty}^{0}\!dx_1\,x_1 p_1(x_1) + \int_0^{\infty}\!dx_2\,x_2 p_2(x_2) - \int_{-\infty}^{0}\!dx_2\,x_2 p_2(x_2)\bigg], \tag{4}
\end{aligned}
\]

where \(p_1(\cdot)\) and \(p_2(\cdot)\) are the marginal distributions with respect to \(x_1\) and \(x_2\). If the distribution is assumed to be gaussian with mean centering and variance-covariance matrix \(\Sigma\), (4) evaluates to

\[
E_q^{N\to\infty} = 2 + \mathrm{Tr}(\Sigma) - 2\sqrt{\tfrac{2}{\pi}}\big(\sqrt{\sigma_{11}}+\sqrt{\sigma_{22}}\big), \qquad
\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}, \tag{5}
\]

where \(\sqrt{\sigma_{11}}\) and \(\sqrt{\sigma_{22}}\) are the standard deviations of each dimension. These results extend straightforwardly to higher dimensions.
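Eq. (5) is easy to check numerically; the following sketch (with an arbitrary illustrative Σ of our choosing) compares the empirical error of sign binarization against the closed form:

    import numpy as np

    rng = np.random.default_rng(0)
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])  # illustrative covariance
    X = rng.multivariate_normal([0.0, 0.0], Sigma, size=200_000)
    B = np.where(X >= 0, 1, -1)

    empirical = np.mean(np.sum((X - B) ** 2, axis=1))
    analytic = 2 + np.trace(Sigma) - 2 * np.sqrt(2 / np.pi) * (
        np.sqrt(Sigma[0, 0]) + np.sqrt(Sigma[1, 1]))
    # empirical and analytic agree up to sampling noise (both ~1.15 here).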

2.2 Quantization Error Minimization in Gaussian Distribution with Rotational Transformation

Some binary hashing algorithms use orthogonal transformations [20,10]. This means that their purpose is to find a cost-minimizing point in the Special Orthogonal (rotational) group SO(n). We also consider the rotational group in this paper.

Under a rotational transformation, minimizing (5) is equivalent to maximizing \(\sqrt{\sigma_{11}} + \sqrt{\sigma_{22}}\) subject to \(\mathrm{Tr}(\Sigma) = \text{const}\). Since the constancy of \(\mathrm{Tr}(\Sigma)\) constrains the two-dimensional vector \((\sqrt{\sigma_{11}}, \sqrt{\sigma_{22}})\) to a circle, the maximum is attained at \(\sigma_{11} = \sigma_{22}\). This proves that Isotropic Hashing [32] is the quantization-error-minimizing hashing for a gaussian distribution; isotropy can thus be seen as a measure of quantization error.
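In symbols, writing \(s_i \equiv \sqrt{\sigma_{ii}}\), this is the elementary constrained maximization (our restatement of the argument above):

\[
\max_{s_1, s_2 \ge 0}\; s_1 + s_2 \quad \text{s.t.}\quad s_1^2 + s_2^2 = \mathrm{Tr}(\Sigma)
\;\;\Longrightarrow\;\; s_1 = s_2 = \sqrt{\mathrm{Tr}(\Sigma)/2},
\]

since by the Cauchy-Schwarz inequality \(s_1 + s_2 \le \sqrt{2}\,\sqrt{s_1^2 + s_2^2}\), with equality iff \(s_1 = s_2\).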

2.3 Entropy and Quantization Error

Next, the entropy of the binary code is calculated. Here an eigenvalue-and-angle representation is used instead of the variance-covariance matrix, and only the two-dimensional case is treated. In this representation, the elements of the variance-covariance matrix are

\[
\sigma_{11} = \bar{\lambda} + \lambda\cos 2\theta, \quad
\sigma_{22} = \bar{\lambda} - \lambda\cos 2\theta, \quad
\sigma_{12} = \lambda\sin 2\theta, \qquad
\bar{\lambda} = \frac{\lambda_1+\lambda_2}{2}, \quad
\lambda = \frac{\lambda_1-\lambda_2}{2}, \tag{6}
\]

where \(\lambda_1\) and \(\lambda_2\) are the eigenvalues of the variance-covariance matrix (\(\lambda_1 \geq \lambda_2\)), and \(\theta\) is the angle between the \(x_1\)-axis and the major axis of the gaussian ellipse (Fig. 1).


Fig. 1. Schematic illustration of a two-dimensional gaussian distribution. The ellipse describes the shape of the distribution. Left: eigenvalue-and-angle parameterization (Eq. (6)); λ1 and λ2 are the eigenvalues of the variance-covariance matrix. Right: rotation angles for the isotropic and PCA transformations; θiso and θpca are given in (9) and (10).

Fig. 2. Quantization error and entropy of the binary code for λ1 = 2, λ2 = 1. The x-axis is the angle θ of Eq. (6). By the symmetry of the gaussian distribution, it is enough to consider the range θ ∈ [0, π/2].

From the symmetry of the gaussian distribution, it is enough to obtain the probabilities of the binary codes (1, 1) and (−1, 1). These probabilities can be calculated analytically as

\[
\begin{aligned}
p_{(1,1)} &= \int_0^{\infty}\!dx_1\!\int_0^{\infty}\!dx_2\,\frac{1}{2\pi}|\Sigma|^{-1/2}\,e^{-\frac{1}{2}x^{T}\Sigma^{-1}x}
= \frac{1}{2} - \frac{1}{2\pi}\tan^{-1}\!\left(\frac{2}{\gamma\sin 2\theta}\right), \\
p_{(-1,1)} &= \frac{1}{2} - p_{(1,1)} = \frac{1}{2\pi}\tan^{-1}\!\left(\frac{2}{\gamma\sin 2\theta}\right), \qquad \theta \in \Big[0, \frac{\pi}{2}\Big], \tag{7}
\end{aligned}
\]

where \(\gamma\) is defined as \(\sqrt{\lambda_1/\lambda_2} - \sqrt{\lambda_2/\lambda_1}\), which is the maximum value of the correlation between \(x_1\) and \(x_2\) under rotational transformation. The entropy of the two-dimensional binary code is then given as

\[
S(\gamma, \theta) = 2 \times \big(-p_{(1,1)}\log p_{(1,1)} - p_{(-1,1)}\log p_{(-1,1)}\big). \tag{8}
\]

Fig. 2 plots the quantization error and the entropy as functions of the angle θ. When the quantization error is minimized, the entropy is also minimized, and vice versa: the quantization error and the entropy are in a trade-off relationship.


Compatibility of the two factors depends on the "sharpness" of the distribution. When the distribution is sharp (λ1 ≫ λ2), the entropy is heavily damaged by isotropic variances. There is no trade-off if the distribution is circular (λ1 = λ2), but in the general case the two criteria should be balanced.
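The curves of Fig. 2 can be reproduced directly from Eqs. (5)-(8); a sketch for λ1 = 2, λ2 = 1 follows (natural-log entropy and the small endpoint offsets that avoid division by zero are our choices):

    import numpy as np

    lam1, lam2 = 2.0, 1.0
    lam_bar, lam = (lam1 + lam2) / 2, (lam1 - lam2) / 2
    gamma = np.sqrt(lam1 / lam2) - np.sqrt(lam2 / lam1)

    theta = np.linspace(1e-6, np.pi / 2 - 1e-6, 200)
    s11 = lam_bar + lam * np.cos(2 * theta)                 # Eq. (6)
    s22 = lam_bar - lam * np.cos(2 * theta)
    Eq = 2 + (lam1 + lam2) - 2 * np.sqrt(2 / np.pi) * (
        np.sqrt(s11) + np.sqrt(s22))                        # Eq. (5)
    p11 = 0.5 - np.arctan(2 / (gamma * np.sin(2 * theta))) / (2 * np.pi)  # Eq. (7)
    pm11 = 0.5 - p11
    S = 2 * (-p11 * np.log(p11) - pm11 * np.log(pm11))      # Eq. (8)
    # Both Eq and S reach their minimum at theta = pi/4 and their
    # maximum at the endpoints: the trade-off shown in Fig. 2.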

An analysis similar to ours was recently proposed in [9], where the quantization error of Product Quantization was discussed. The authors showed that it is bounded by the determinant of the variance-covariance matrix and proposed an algorithm that minimizes this bound under rotational transformation. The result indicates a trade-off between the quantization error and the entropy of the gaussian distribution. However, since the entropy of the gaussian distribution is invariant under rotation, their method only determines the partition of the entire space into a set of small subspaces; rotations within each subspace are not addressed. By contrast, our analysis can consider rotational optimality within the two-dimensional subspaces because we investigate the entropy of the binary codes directly.

Another example is [35]. The authors proposed two criteria, namely "crossing sparse regions" and "balanced buckets". The first criterion can be interpreted as quantization error minimization, and the second as quantizer entropy maximization. We believe that many existing methods can be interpreted as such trade-off problems between quantization error minimization and entropy maximization.

3 Methods

A binary hashing algorithm based on the theory discussed above is developed as follows. The goal is a very sparse transformation matrix, which substantially decreases the encoding cost.

3.1 Problem Statement

The problem is to obtain the linear transformation matrix A of equation (1). Most existing methods split the transformation into a dimension-reduction projection W ∈ R^{m×n} and a transformation Q ∈ R^{m×m} in the reduced space. PCA is commonly used to reduce the number of dimensions. However, a PCA transformation matrix is dense, so obtaining the transformation and computing the encoding efficiently are difficult in high-dimensional cases. In this paper, dimension reduction is not treated: it is assumed that the original feature vector and the encoded binary vector have the same number of dimensions, and only A = Q ∈ R^{n×n} is treated. Dimension reduction can be done in a similar way to what is discussed below, but a detailed study is left for future work.

3.2 Sequential Pairwise Isotropic Rotation

First, we derive a transformation that makes the variances completely isotropic. Using pairwise isotropic rotations, we can obtain a very sparse isotropic transformation matrix with O(n log2 n) fill-ins, whereas the original Isotropic Hashing [32] needs a dense transformation matrix with O(n²) fill-ins.

Fig. 3. Schematic illustration of basic isotropic rotations (in 8 dimensions). Upper: structure of the transformation matrix; non-zero elements are filled with gray. The sorting matrix is a permutation matrix that sorts variances in descending order. Rotation is performed on pairs of the largest- and smallest-variance dimensions. The basic rotation is applied log2 n times for n dimensions. Lower: behavior of the variances under repeated multiplication by basic isotropic rotations. The graphs show sorted variances during the sequence; the rightmost graph is the (sorted) initial state. One basic isotropic rotation makes the variances isotropic in pairs, and repeated application drives them exponentially toward the globally isotropic state.

In two-dimensional space, there are only two isotropic transformations, corresponding to θ = π/4 and 3π/4 in equation (6). From the symmetry of the gaussian distribution, it is enough to consider θ = π/4. For any two-dimensional variance-covariance matrix Σ, the rotation matrix that makes the variances isotropic is

\[
R = \begin{pmatrix} \cos\theta_{\mathrm{iso}} & -\sin\theta_{\mathrm{iso}} \\ \sin\theta_{\mathrm{iso}} & \cos\theta_{\mathrm{iso}} \end{pmatrix}, \qquad
\theta_{\mathrm{iso}} = \tan^{-1}\!\left(\frac{1}{2}\,\frac{\sigma_{11}-\sigma_{22}}{\sigma_{12}}\right). \tag{9}
\]

To develop an isotropic transformation for the full dimension, we define the "basic isotropic rotation". It consists of three steps. The first step is to sort the dimensions by the diagonal elements of the variance-covariance matrix in descending order. The second step is to create pairs of dimensions as (1, n), (2, n−1), ···. The third step is to apply the isotropic rotation (9) to each pair. This set of operations is expressed by a permutation (sorting) matrix and a rotation matrix with 2n fill-ins. The transformation, which we call the "basic isotropic rotation"¹ in what follows, makes the variances pairwise isotropic. We then apply this transformation sequentially: applying it twice makes the variances isotropic in groups of four, three times in groups of eight (Fig. 3), and so on. Finally, applying the transformation ⌈log2 n⌉ times yields completely isotropic variances.²

¹ When the permutation is odd, the determinant of the transformation matrix is −1. In that case, the transformation is not an element of SO(n) but of O(n). However, we can always obtain an element of SO(n) by applying an odd permutation to the final matrix A; this amounts to a permutation of bits and does not affect retrieval results.


² In a precise sense, completely isotropic variances are obtained only when n is a power of two; for other dimensions, an infinite number of basic isotropic rotations would be needed. In practice, however, sufficiently isotropic variances are obtained with ⌈log2 n⌉ transformations.

The finally developed transformation is a product of sparse matrices with 2n⌈log2 n⌉ fill-ins in total. This factorized form is highly sparse, especially in high dimensions; standard sparse matrix data structures can be used, so memory usage and computational cost are substantially decreased.
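The following is a compact sketch of this learning loop as we read it. The closed form for the pairwise isotropizing angle below is a standard one that equalizes the pair's variances (the parameterization of Eq. (9) may differ by convention), and the matrices are dense for clarity only; a real implementation keeps each factor sparse.

    import numpy as np

    def basic_isotropic_rotation(Sigma):
        """One basic isotropic rotation (Section 3.2): sort dimensions by
        variance, pair the largest with the smallest, and rotate each pair
        so that its two variances become equal."""
        n = Sigma.shape[0]
        order = np.argsort(-np.diag(Sigma))        # descending variances
        G = np.eye(n)                              # ~2n fill-ins when sparse
        for k in range(n // 2):
            i, j = order[k], order[n - 1 - k]      # pair largest with smallest
            s11, s22, s12 = Sigma[i, i], Sigma[j, j], Sigma[i, j]
            th = 0.5 * np.arctan2(s11 - s22, 2.0 * s12)  # isotropizing angle
            c, s = np.cos(th), np.sin(th)
            G[i, i], G[i, j], G[j, i], G[j, j] = c, -s, s, c
        return G

    def learn_prh_isotropic(Sigma, n_rotations=None):
        """Sequential application: ceil(log2 n) basic rotations make the
        variances (nearly) isotropic; A is kept as a product of sparse
        factors, giving O(n log n) encoding cost."""
        n = Sigma.shape[0]
        n_rotations = n_rotations or int(np.ceil(np.log2(n)))
        factors = []
        for _ in range(n_rotations):
            G = basic_isotropic_rotation(Sigma)
            Sigma = G @ Sigma @ G.T                # update the covariance
            factors.append(G)
        return factors                             # encode: x -> G_k ... G_1 x

Encoding then applies the factors in sequence; each sparse factor touches every coordinate once, so the cost is O(n) per factor and O(n log n) in total.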

3.3 Trade-off between Quantization Error and Entropy

A factorized sparse transformation that makes the variances completely isotropic was obtained in the preceding section. However, Section 2 revealed a trade-off between isotropy (quantization error minimization) and entropy maximization. Since entropy reduction degrades retrieval accuracy, a balance between isotropy and entropy should be kept. Accordingly, two balancing methods are proposed hereafter. The first is simpler and does not increase the number of fill-ins. The second uses additional sparse rotation matrices; it increases the number of fill-ins but achieves better accuracy in some cases.

PCA tilting (PCAT) In the first method, each pairwise rotation is "tilted" from the isotropic angle toward the PCA angle (Fig. 1), which corresponds to θ = 0 in equation (6). The tilting increases entropy, since the PCA angle is the entropy-maximizing angle. The rotation matrix is derived as

\[
R = \begin{pmatrix} \cos\theta(\lambda) & -\sin\theta(\lambda) \\ \sin\theta(\lambda) & \cos\theta(\lambda) \end{pmatrix}, \qquad
\theta(\lambda) = \theta_{\mathrm{iso}} + \lambda(\theta_{\mathrm{pca}} - \theta_{\mathrm{iso}}), \quad
\theta_{\mathrm{pca}} = \tan^{-1}\!\left(\frac{1}{2}\,\frac{\sigma_{12}}{\sigma_{11}-\sigma_{22}}\right), \tag{10}
\]

where θiso is given in Eq. (9). λ is a tuning parameter ranging from zero (completely isotropic) to one (completely PCA); it controls the balance between isotropy and entropy. Since PCA tilting does not lead to completely isotropic variances, there is no definite reason to stop applying basic rotations after log2 n times. However, increasing the number of basic rotations is unnecessary, because log2 n applications already yield sufficient accuracy in practice. The number of fill-ins of the transformation matrix therefore does not need to change.
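In code, the tilted angle of Eq. (10) is one line on top of the pairwise statistics; the closed forms below are standard two-dimensional parameterizations, so signs and conventions may differ from the paper's:

    import numpy as np

    def tilted_angle(s11, s22, s12, lam):
        """PCA tilting (Eq. (10)): interpolate each pairwise rotation between
        the isotropic angle (lam = 0) and the PCA angle (lam = 1)."""
        th_iso = 0.5 * np.arctan2(s11 - s22, 2.0 * s12)  # equalizes variances
        th_pca = 0.5 * np.arctan2(2.0 * s12, s11 - s22)  # major-axis angle
        return th_iso + lam * (th_pca - th_iso)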

Random Sparse PCA Rotation (RSPCA) The second method applies additional sparse rotations after a completely isotropic transformation has been obtained. The additional matrices have the same form as the basic isotropic rotation, but the n/2 rotational pairs are chosen at random and the PCA rotation (θpca of Eq. (10)) is applied to each pair. This procedure is called the "basic PCA rotation". Although it is not obvious how many times the basic PCA rotation should be applied, the experiments in the following section show that O(log2 n) rotations attain maximal retrieval accuracy, so the increase in the number of fill-ins of the transformation matrix is very small.
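A sketch of one basic PCA rotation under the stated random-pairing rule (pairing by a random permutation is our reading of "randomly chosen"):

    import numpy as np

    def basic_pca_rotation(Sigma, rng):
        """One basic PCA rotation (RSPCA, Section 3.3): choose n/2 random
        disjoint pairs and rotate each to its PCA (major-axis) angle."""
        n = Sigma.shape[0]
        perm = rng.permutation(n)
        G = np.eye(n)
        for k in range(n // 2):
            i, j = perm[2 * k], perm[2 * k + 1]
            th = 0.5 * np.arctan2(2.0 * Sigma[i, j], Sigma[i, i] - Sigma[j, j])
            c, s = np.cos(th), np.sin(th)
            G[i, i], G[i, j], G[j, i], G[j, j] = c, -s, s, c
        return G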

3.4 Relation with Major Existing Strategies

The proposed algorithm introduces a novel strategy, in which the transformation matrix is expressed as a factored form of pairwise rotation matrices. For constructing each rotation, only the variance-covariance matrix is used. In contrast, some existing linear binary hashing algorithms (such as ITQ) use an objective function that is directly calculated from the data (e.g., the quantization error due to discretization). Such data-dependent objective functions capture non-gaussian properties of the data distribution.

On the other hand, an arbitrary probability distribution function has an expansion series whose lowest-order term is a gaussian distribution; such an expansion is called the Edgeworth expansion [34]. From this viewpoint, our algorithm can be regarded as taking the lowest-order approximation, whereas ITQ and other data-dependent methods account for higher-order non-gaussian terms. Omitting the higher-order terms enables analytical treatment, which provides a simple and computationally efficient binarization procedure. Despite disregarding the higher-order terms, the proposed method still achieves considerably high accuracy, as shown below.

4 Experiments

In the experiments, 128-dimensional gaussian toy data, 128-dimensional SIFT data, and high-dimensional VLAD data of various dimensions are used. The gaussian data are used to evaluate the theoretical behavior of the proposed algorithm; the SIFT data to compare against existing methods that are not feasible in high dimensions; and the VLAD data to evaluate the algorithm against the state-of-the-art high-dimensional method.

4.1 Experimental Protocols

Settings We use Top-10 recall as the performance measure of binary hashing. Euclidean nearest neighbors in the original feature space are used as ground truth. For the gaussian data, 10K data points are used for training, 2K for queries, and 100K for the database. For the SIFT data, we use the SIFT1M dataset [8] and follow the original protocol (100K training set, 10K query set, and 1M database set). For creating the VLAD data, we use the ILSVRC2010 dataset [36]: 25600-dimensional and 64000-dimensional VLAD vectors are computed from the original SIFT data, 20K points are randomly picked for training and 5K for queries, and the rest of the dataset (about 1M points) is used for the database.
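For reference, a sketch of the evaluation measure in the sense commonly used for SIFT1M benchmarks; whether the paper's "Top-10 recall" is exactly this definition is our assumption:

    import numpy as np

    def recall_at_10(hamming_rank, true_nn):
        """Fraction of queries whose true Euclidean nearest neighbor appears
        among the 10 database items closest in Hamming distance.
        hamming_rank: per-query lists of database ids sorted by Hamming
        distance; true_nn: per-query id of the Euclidean nearest neighbor."""
        hits = [true_nn[q] in hamming_rank[q][:10] for q in range(len(true_nn))]
        return float(np.mean(hits))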


Existing methods to be compared We choose counterpart methods as follows. Sparse Random Rotation (SRR): a random method corresponding to our sequence-of-sparse-matrices scheme. The transformation matrix for SRR has the same form as in our method (Fig. 3), but there is no sorting, and the rotation angle for each pair is chosen at random; the number of basic rotations applied is set to ⌈log2 n⌉. Iterative Quantization (ITQ) [20]: one of the most well-known methods, which keeps nearly state-of-the-art performance over a wide range of data; it serves as a reasonable performance counterpart in the low-dimensional case. Isotropic Hashing (ISO) [32]: the original method that generates an orthogonal transformation making the variances completely isotropic. In the high-dimensional case (d ≥ 3) there are in general infinitely many isotropic states for any variance-covariance matrix, and each isotropic state has different retrieval performance because it differs from the others in entropy and higher-order cumulants (non-gaussian effects). ISO serves as a counterpart for measuring the quality of our isotropic transformation; we use the Lift-and-Projection optimization algorithm proposed in [32]. PCA Hashing (PCA): as its name suggests, uses the linear transformation to the PCA basis. As discussed in Section 2.3, the PCA basis is the opposite extreme from the isotropic basis with regard to the trade-off between quantization error and entropy. K-means Hashing (KMH) [19]: a recently proposed state-of-the-art method that uses k-means vector quantization and optimizes the binary code assignment of each cluster center; it is thus a kind of nonlinear method, selected to compare binary hashing performance against nonlinear methods. We use the algorithm parameters b = 4, M = ndim/b and 50 iterations, as defined in [19]. Bilinear Projection-based Binary Codes (BPBC) [10]: the state-of-the-art high-dimensional hashing method using bilinear transformations; it is considered the baseline method. We use the algorithm parameters d1 = 128, d2 = ndim/d1, and 50 iterations.

4.2 Toy-data Experiment

First, artificial gaussian data are used to observe the theoretical behavior of the proposed algorithm discussed above.

A 128-dimensional random variance-covariance matrix is created and used to generate mean-centered gaussian data. To create the variance-covariance matrix, a diagonal matrix with random positive eigenvalues distributed log-normally is generated and then rotated by a random rotation. We consider two different eigenvalue distributions: one uses a log-normal distribution with log-variance of one (sphere-like distribution), and the other a log-normal distribution with log-variance of three (sharp distribution).

Fig. 4 shows the retrieval results. In the case of the sphere-like distribution (upper row), most methods differ little in accuracy because the shape of the distribution is nearly unchanged under rotational transformation. A notable point is that in the case of the sharp distribution (lower row), completely isotropic PRH is clearly inferior, although Isotropic Hashing, which also has completely isotropic variances, achieves reasonable performance. As discussed in Section 4.1, there are an infinite number of isotropic states, and the Lift-and-Projection optimization in Isotropic Hashing tends to find entropically favorable ones; PRH, being extremely simple and sparse, sometimes reaches entropically inferior isotropic states. However, this inferiority is reasonably overcome by PCAT or RSPCA without loss of sparsity.

Fig. 4. Top-10 NN retrieval results for 128-dimensional gaussian data. Panels: (a) sphere-like gaussian, comparison with existing methods; (b) sphere-like gaussian, effect of PCAT; (c) sphere-like gaussian, effect of RSPCA; (d) sharp gaussian, comparison with existing methods; (e) sharp gaussian, effect of PCAT; (f) sharp gaussian, effect of RSPCA. PRH(m, n, λ) denotes the proposed method with m basic isotropic rotations, n basic PCA rotations, and PCAT parameter λ (Eq. (10)). The upper row is for data with log-variance of one, the lower row for log-variance of three (Section 4.2). Abbreviated legends in plot (d) are the same as in plot (a).

The lower middle plot of Fig. 4 indicates that an almost-PCA angle (λ ∼ 1) at each pairwise rotation leads to good performance for the sharp gaussian distribution. It is important to distinguish our sequential pairwise almost-PCA rotation from the PCA hashing rotation: to obtain the exact PCA basis, all n(n − 1)/2 possible pairs must be accounted for, whereas PCAT deals with only O(n log2 n) pairs.

4.3 Real Datasets


Fig. 5. Top-10 NN retrieval results for 128-dimensional SIFT1M data. Panels: (a) comparison with existing methods; (b) effect of PCAT; (c) effect of RSPCA. The meaning of PRH(m, n, λ) is given in Fig. 4.

Fig. 6. Top-10 NN retrieval results for SIFT1M data with PCA dimension reduction. Panels: (a) 64-bit (1/2 dimension reduction) results; (b) 32-bit (1/4 dimension reduction) results; (c) PRHs for different degrees of reduction. The meaning of PRH(m, n, λ) is given in Fig. 4.

Low Dimensional Case The SIFT1M case is considered next; Fig. 5 shows the retrieval results. As in the gaussian case, the completely isotropic PRH (PRH(7, 0, 0.0)) yields unfavorable accuracy, while PCAT and RSPCA attain good performance. In particular, RSPCA achieves remarkably better retrieval results than the other methods (Fig. 5(c)).

Relation to dimension reduction Although a sparse dimension-reduction scheme is not devised in this study, we examine the effect of dimension reduction on the performance of the proposed algorithm. As with the existing methods, the PCA basis is tentatively used for dimension reduction in PRH and SRR; note that this does not preserve the sparsity of the transformation. Fig. 6 shows the results on SIFT1M. The proposed algorithm clearly maintains higher performance than the other methods at each level of dimension reduction.


Fig. 7. Top-10 NN retrieval results for 64000-dimensional (upper) and 25600-dimensional (lower) VLAD data. Panels, per row: (a, d) comparison with the baseline method; (b, e) effect of PCAT; (c, f) effect of RSPCA. The meaning of PRH(m, n, λ) is given in Fig. 4.

High-dimensional Case The high-dimensional case, the main target of our algorithm, is examined next. Fig. 7 shows the retrieval results for 64000-dimensional and 25600-dimensional VLAD features computed from the ILSVRC2010 dataset. PCAT achieves state-of-the-art retrieval accuracy, and it attains high performance at each dimensionality. In this experiment, contrary to the lower-dimensional case, RSPCA is inferior to PCAT; we attribute this to improper random pairing in RSPCA, since the number of possible pairings is O(n²).

4.4 Computational Cost

Since our implementation uses the MATLAB sparse matrix datatype, it is difficult to evaluate the encoding cost fairly against methods that use optimized dense matrix operations; we use a tentative environment for the evaluation. We show a comparison of the number of product operations and the encoding speed improvement ratio over BPBC (Fig. 8). The improvement ratio is measured with a naive C implementation of dense/sparse matrix operations and compared with the theoretical ratio based on the number of product operations. The number of sum operations is also reduced in our method: we need only one sum operation per two product operations, whereas BPBC needs almost the same number of sum and product operations. This would explain why the measured speed-up of the naive C implementation exceeds the theoretical ratio.
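The per-vector product counts compared in Fig. 8 follow directly from the two structures; a sketch with the typical d1 = 128 (the exact counting conventions here are ours):

    import math

    def prh_products(n):
        """PRH: ceil(log2 n) sparse factors with about 2n nonzeros each."""
        return 2 * n * math.ceil(math.log2(n))

    def bpbc_products(n, d1=128):
        """BPBC: fold x into a d1 x d2 matrix and rotate both sides,
        costing n*d1 + n*d2 products, i.e. nd + n^2/d for d = d1."""
        d2 = n // d1
        return n * d1 + n * d2

For n = 64000 these give roughly 2.0M products for PRH versus roughly 40M for BPBC.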


Table 1. Learning time for each method. Left: SIFT1M dataset. Right: 25600-dimensional VLAD.

    Method          Learning time (s)        Method            Learning time (s)
    PRH(7,0,0.5)    0.11                     PRH(15,0,0.3)     344
    PRH(7,7,0.0)    0.10                     PRH(15,15,0.0)    527
    SRR             0.015                    SRR               24
    ITQ             12.0                     BPBC              1740
    ISO             3.90
    PCA             0.09
    KMH             402

Fig. 8. Encoding cost. Left: comparison of the number of product operations. Right: encoding speed improvement ratio of PRH over BPBC.


Table 1 shows the learning-time comparison. PRH learns very fast in each case. For the high-dimensional case, the learning time is reported for 25600 dimensions so that all of the data can be stored in memory. Our implementation is not optimized (more efficient treatment of sparse and symmetric matrices is possible).

5 Conclusion

We have proposed Pairwise Rotation Hashing (PRH), a linear binary hashing algorithm with O(n log n) encoding cost. PRH is based on a two-dimensional analytical study of the trade-off between quantization error and entropy. The proposed algorithm is also fast in the learning phase because it needs only O(n log n) computations inside the iteration loop. It shows high hashing accuracy in retrieval tasks at both low and high dimensions, and in particular achieves state-of-the-art performance at high dimensions (10K or higher).

We still have room for improvement. In this study, a dimension-reduction scheme compatible with the pairwise concept was excluded. Dimension reduction could again be done in a pairwise fashion, i.e., by dropping minor components in the pairwise PCA of Eq. (10). A key issue is to find an appropriate pairing method for the pairwise PCA part. Even though RSPCA demonstrated high performance, it could be further improved if components were paired selectively rather than randomly: there is potential for a more sophisticated pairing scheme that favorably balances isotropy with entropy. However, an exhaustive search over the O(n²) possible pairings would substantially degrade the learning speed, so an efficient non-random pairing scheme is needed.

References

1. Shakhnarovich, G., Darrell, T., Indyk, P.: Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press (March 2006)
2. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: ICCV. (2003) 1470–1477 vol. 2
3. Torralba, A., Fergus, R., Freeman, W.T.: 80 Million Tiny Images: a Large Data Set for Nonparametric Object and Scene Recognition. TPAMI 30(11) (November 2008) 1958–1970
4. Aiger, D., Kokiopoulou, E., Rivlin, E.: Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search. In: ICCV. (2013) 3471–3478
5. Perronnin, F., Dance, C.: Fisher Kernels on Visual Vocabularies for Image Categorization. In: CVPR. (2007) 1–8
6. Perronnin, F., Liu, Y., Sanchez, J., Poirier, H.: Large-scale image retrieval with compressed Fisher vectors. In: CVPR. (2010) 3384–3391
7. Jegou, H., Perronnin, F., Douze, M., Sanchez, J., Perez, P., Schmid, C.: Aggregating local image descriptors into compact codes. TPAMI 34(9) (September 2012) 1704–1716
8. Jegou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. TPAMI 33(1) (January 2011) 117–128
9. Ge, T., He, K., Ke, Q., Sun, J.: Optimized product quantization. TPAMI 36(4) (2014) 744–755
10. Gong, Y., Kumar, S., Rowley, H.A., Lazebnik, S.: Learning Binary Codes for High-Dimensional Data Using Bilinear Projections. In: CVPR. (2013) 484–491
11. Jain, P., Kulis, B., Grauman, K.: Fast Image Search for Learned Metrics. In: CVPR. (2008) 1–8
12. Wang, J., Kumar, S., Chang, S.F.: Sequential projection learning for hashing with compact codes. In: ICML. (2010) 1127–1134
13. Wang, J., Kumar, S., Chang, S.F.: Semi-supervised hashing for scalable image retrieval. In: CVPR. (2010) 3424–3431
14. He, J., Feng, J., Liu, X., Cheng, T., Lin, T.H., Chung, H., Chang, S.F.: Mobile product search with Bag of Hash Bits and boundary reranking. In: CVPR. (2012) 3005–3012
15. Torralba, A., Fergus, R., Weiss, Y.: Small codes and large image databases for recognition. In: CVPR. (2008) 1–8
16. Chaudhry, R., Ivanov, Y.: Fast Approximate Nearest Neighbor Methods for Non-Euclidean Manifolds with Applications to Human Activity Analysis in Videos. In: ECCV. (2010) 735–748
17. Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained Linear Coding for image classification. In: CVPR. (2010) 3360–3367
18. Norouzi, M., Fleet, D.J.: Cartesian k-means. In: CVPR. (2013) 3017–3024
19. He, K., Wen, F., Sun, J.: K-Means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes. In: CVPR. (2013) 2938–2945
20. Gong, Y., Lazebnik, S., Gordo, A., Perronnin, F.: Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. TPAMI 35(12) (December 2013) 2916–2929
21. Liu, W., Wang, J., Mu, Y., Kumar, S., Chang, S.F.: Compact Hyperplane Hashing with Bilinear Functions. In: ICML. (2012)
22. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS. (2008) 1753–1760
23. Joly, A., Buisson, O.: Random maximum margin hashing. In: CVPR. (2011) 873–880
24. Weiss, Y., Fergus, R., Torralba, A.: Multidimensional spectral hashing. In: ECCV. (2012) 340–353
25. Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with kernels. In: CVPR. (2012) 2074–2081
26. Fan, L.: Supervised Binary Hash Code Learning With Jensen Shannon Divergence. In: ICCV. (2013) 2616–2623
27. Pauleve, L., Jegou, H., Amsaleg, L.: Locality sensitive hashing: a comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31(11) (2010) 1348–1358
28. Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: ICCV. (2009) 2130–2137
29. Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In: NIPS. (2009) 1042–1050
30. Raginsky, M., Lazebnik, S.: Locality-sensitive binary codes from shift-invariant kernels. In: NIPS. (2009) 1509–1517
31. Heo, J.P., Lee, Y., He, J., Chang, S.F., Yoon, S.E.: Spherical hashing. In: CVPR. (2012) 2957–2964
32. Kong, W., Li, W.J.: Isotropic hashing. In: NIPS. (2012) 1655–1663
33. Sato, I., Ambai, M., Suzuki, K.: Sparse isotropic hashing. IPSJ Transactions on Computer Vision and Applications 5 (2013) 40–44
34. Kolassa, J.E.: Series Approximation Methods in Statistics (3rd ed.). Springer, New York (2006)
35. Jin, Z., Hu, Y., Lin, Y., Zhang, D., Lin, S., Cai, D., Li, X.: Complementary Projection Hashing. In: ICCV. (2013) 257–264
36. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR. (2009)