


Introduction

Similarity search, or Nearest Neighbor (NN) retrieval, is the backbone of a variety of our day-to-day applications; some examples are shown below. Image data are generally high-dimensional (e.g., SIFT is 128D, GIST is 960D), and NN retrieval becomes computationally challenging in such high dimensions due to the curse of dimensionality.

The algorithm developed in this poster is not limited to computer vision applications; it can be applied to similar problems in other domains such as document search, bioinformatics, and data mining.

Our approach is motivated by recent advances in compressive sensing and sparse coding. Briefly, if a given signal has full support in one domain, it is highly likely to have a sparse representation in some other basis. For example, a sine wave is dense in the spatial domain but can be represented by a single spike in the frequency domain. Finding such a sparse basis is generally non-trivial for real-world data descriptors (such as SIFT), so we propose to use Dictionary Learning (DL) to find it.

Assume data vectors v_i and a basis dictionary B = {b_1, b_2, ..., b_n}, and let a_i (to be found) be the sparse code associated with each v_i. Dictionary learning and sparse coding then jointly minimize the reconstruction error ||v_i - B a_i||_2^2 plus a sparsity penalty lambda * ||a_i||_1 on each code, subject to a unit-norm constraint on the dictionary atoms b_j.
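As a concrete sketch, the per-vector sparse coding step (with the dictionary held fixed) can be solved by iterative soft-thresholding (ISTA); the solver and the toy dictionary below are illustrative, not the poster's actual implementation:

```python
import numpy as np

def sparse_code(v, B, lam=0.1, n_iter=300):
    """Solve min_a 0.5*||v - B a||_2^2 + lam*||a||_1 by ISTA
    (iterative soft-thresholding)."""
    a = np.zeros(B.shape[1])
    L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = a - B.T @ (B @ a - v) / L      # gradient step on the data term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

# Toy check: a vector built from two dictionary atoms gets a 2-sparse code.
rng = np.random.default_rng(0)
B = rng.standard_normal((64, 256))
B /= np.linalg.norm(B, axis=0)            # unit-norm atoms (the constraint)
v = 2.0 * B[:, 3] + 1.5 * B[:, 10]
a = sparse_code(v, B, lam=0.05)
print(np.flatnonzero(np.abs(a) > 0.5))    # dominant atoms: indices 3 and 10
```

Learning B itself alternates this coding step with dictionary updates; only the coding step is sketched here.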

When dealing with real data, one has to account for noise, which can lead to different sparse codes for neighboring data vectors under the above framework. The regularization parameter balances the sparsity term against the data representation term to absorb such noise. But since the amount of noise cannot be estimated in practice, we propose to sparse code with multiple regularizations, leading to our Multi-Regularization Sparse Coding (MRSC) algorithm. This means each data vector is associated with multiple sparse codes.
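A minimal sketch of this multi-regularization idea, reusing an ISTA solver; the grid of regularization values shown is hypothetical, since the poster does not list the values actually used:

```python
import numpy as np

def ista(v, B, lam, n_iter=300):
    """l1 sparse coding of v against dictionary B (ISTA solver)."""
    a = np.zeros(B.shape[1])
    L = np.linalg.norm(B, 2) ** 2
    for _ in range(n_iter):
        z = a - B.T @ (B @ a - v) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

def mrsc_codes(v, B, lams=(0.05, 0.1, 0.2, 0.4)):
    """One sparse code per regularization value; return the active-support
    index tuples, which serve as keys when indexing a hash table."""
    supports = []
    for lam in lams:
        a = ista(v, B, lam)
        supports.append(tuple(np.flatnonzero(np.abs(a) > 1e-3)))
    return supports

rng = np.random.default_rng(1)
B = rng.standard_normal((64, 256))
B /= np.linalg.norm(B, axis=0)
v = 2.0 * B[:, 5] + 1.0 * B[:, 40]
for s in mrsc_codes(v, B):
    print(s)                               # larger lam -> sparser support
```

Coding the same vector at several regularization levels yields a nested family of supports, so two noisy copies of a vector are likely to agree on at least one of them.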

Multi-Regularization Sparse Coding Algorithm

Experiments and Results

Computational Challenges

Large scale image search experiments require storing large numbers of data descriptors. To give an idea of our computational requirements, suppose we use SIFT (128D) descriptors: each descriptor takes 128 x 4 = 512 bytes, a single image typically generates more than 20K such descriptors, and we use thousands of such images. In our experiments we use approximately 400 million SIFT descriptors, requiring more than 200 GB of storage (400 x 10^6 x 512 bytes is roughly 205 GB). In addition, sparse coding a d-dimensional descriptor using a dictionary of dimension d x n takes O(dn^2) time, requiring fast multi-core processors and processing clusters.

Qualitative Image Retrieval Results

Here we present a few qualitative results from image search operations using the MRSC framework. Results are from the Tiny Images dataset [7] and the Notre Dame image set [1]. The Tiny Images dataset consists of 80M small images, each of size 32x32. We used 10M images from this dataset, each resized by 1/4 to form 768D data vectors, and a dictionary of size 768x3072 for the MRSC algorithm. For the Notre Dame dataset, we used the same framework as before.

Conclusions

In this project, we developed a novel framework, Multi-Regularization Sparse Coding, for nearest neighbor retrieval on large image databases using the paradigms of sparse coding and dictionary learning. The algorithm was shown to outperform the state of the art in accuracy, speed of retrieval, scalability, and robustness. Working with millions of data points, as in a real-world web-scale image search application, demands substantial storage and compute resources; to this end, we effectively deployed the resources provided by the University of Minnesota Supercomputing Institute for our application.

Acknowledgements

Large Scale Image Search via Sparse Coding

Anoop Cherian and Nikolaos Papanikolopoulos

Dept. of Computer Science and Engineering, University of Minnesota

{cherian, npapas}@cs.umn.edu

Figure 1: A collage of computer vision applications built on top of nearest neighbors (middle column: Microsoft Photosynth [1] and Video Google [2], respectively).

Our goal in this project is to develop an NN algorithm that is computationally tractable in high dimensions while providing state-of-the-art performance in: (i) accuracy, (ii) speed of search, (iii) robustness to data distortions, (iv) storage efficiency, and (v) scalability.

The dictionary learning and sparse coding formulation:

    min over B, {a_i} of  sum_i [ ||v_i - B a_i||_2^2 + lambda * ||a_i||_1 ],  subject to ||b_j||_2 <= 1 for all j

Here ||v_i - B a_i||_2^2 enforces representational similarity to the data, lambda * ||a_i||_1 enforces representational sparsity, and the constraint ||b_j||_2 <= 1 acts as a normalizer for numerical stability.

Once such a basis dictionary is learned from a subset of the data using the above formulation, it can be used to sparse code the entire dataset. The indices of the active support of each sparse code form a new short descriptor for the data, which can then be used to index a hash table, leading to fast search. In addition, we need to store only the sparse coefficients for retrieval, saving memory.
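The support-based indexing described above can be sketched as follows (a minimal in-memory hash table with hand-made sparse codes; the actual key design in the poster may differ):

```python
from collections import defaultdict
import numpy as np

def support_key(a, tol=1e-3):
    """Hash key = tuple of active-support indices of sparse code a."""
    return tuple(np.flatnonzero(np.abs(a) > tol))

index = defaultdict(list)      # support key -> list of descriptor ids

def add(desc_id, a):
    index[support_key(a)].append(desc_id)

def query(a):
    """Candidate NNs: descriptors whose codes share the same support."""
    return index.get(support_key(a), [])

# Toy usage with hand-made sparse codes over an 8-atom dictionary.
c1 = np.zeros(8); c1[[2, 5]] = [0.9, -0.4]
c2 = np.zeros(8); c2[[2, 5]] = [0.8, -0.5]   # same support as c1
c3 = np.zeros(8); c3[[1, 6]] = [1.0, 0.3]
add(0, c1); add(1, c2); add(2, c3)
print(query(c1))   # -> [0, 1]
```

Only the bucket sharing the query's support is scanned, so search cost depends on bucket size rather than database size.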

Figure 2: Dictionary Learning and Sparse Coding framework (dataset of images -> extract features -> dictionary learning -> sparse coding -> indexing).

We used two different datasets for performance evaluation of our system: (1) SIFT (128D) from the INRIA Holidays images [3], and (2) Spin images (400D) from the SHREC 3D object recognition dataset [4]. Dictionaries of sizes 128 x 2048 and 400 x 1024 were learned for the respective databases.

Sparse codes of length {2, 3, 4, 5} were generated using the MRSC algorithm for each descriptor in each dataset. This cuts the storage requirement to a maximum of 55 bits per descriptor (ignoring the magnitudes of the coefficients), a 20-fold compression for SIFT and a 60-fold compression for Spin images.
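The 55-bit figure is consistent with the dictionary sizes: an index into a 2048-atom dictionary takes ceil(log2 2048) = 11 bits, and the longest code has 5 active atoms. A quick check of the arithmetic (how indices are actually packed is our assumption):

```python
import math

def support_bits(dict_atoms: int, code_len: int) -> int:
    """Bits to store code_len atom indices from a dictionary of dict_atoms atoms."""
    return code_len * math.ceil(math.log2(dict_atoms))

print(support_bits(2048, 5))   # SIFT dictionary (128 x 2048) -> 55 bits
print(support_bits(1024, 5))   # Spin dictionary (400 x 1024) -> 50 bits
```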

Next, we compare the performance of our algorithm (MRSC) against state-of-the-art algorithms: Product Quantization (PQ, 2011), KD-Tree (KDT), Locality Sensitive Hashing (E2LSH, 1998), Kernelized LSH (KLSH, 2009), Spectral Hashing (SH, 2008), and Shift Invariant Kernel Hashing (SIKH, 2009).

Figure 3: Average recall against increasing number of retrieved points.

Figure 4: (left) Mean average precision for SIFT and Spin images for increasing number of retrieved points, (right) average hash bucket size for increasing database size.

Figure 5: (a) Search time per query for increasing database size, (b) accuracy of retrieval for an increasing database size (that is, an increasing number of distracting neighbors).

Robustness of MRSC against Image Distortions

Figure 6: Each image set on the left undergoes a specific type of distortion. The challenge is to recover SIFT correspondences between the first image and subsequent images in the set. See [6] for more details of this experiment.

Figure 7: Perceptual similarity search on the Tiny Images dataset. The first column in each block is the query, the second column shows the first four NNs, and the last column is the ground truth.

Figure 8: SIFT-based image search on the Notre Dame dataset. The first column is the query and the remaining columns show the first three NNs. The dataset consists of 1500 images, each generating approximately 10K SIFT descriptors. Images containing the maximum number of SIFT matches (using the MRSC algorithm) are shown as the NN images to the query.

References

[1] N. Snavely, S. Seitz, and R. Szeliski: Photo tourism: Exploring photo collections in 3D, ACM, 2006.

[2] J. Sivic and A. Zisserman: Video Google: A text retrieval approach to object matching in videos, ICCV, 2003.

[3] http://lear.inrialpes.fr/~jegou/data.php

[4] E. Boyer, A. Bronstein, M. Bronstein, B. Bustos, T. Darom, R. Horaud, I. Hotz, Y. Keller, J. Keustermans, A. Kovnatsky, et al.: SHREC 2011: Robust feature detection and description benchmark, arXiv:1102.4258, 2011.

[5] A. Cherian, V. Morellas, and N. Papanikolopoulos: Efficient Similarity Search via Sparse Coding, Transactions on PAMI, 2012 (under review).

[6] A. Cherian, V. Morellas, and N. Papanikolopoulos: Efficient Similarity Search via Sparse Coding, University of Minnesota, Tech. Report, 2011.

[7] A. Torralba, R. Fergus, and Y. Weiss: Small codes and large image databases for recognition, CVPR, 2008.

This work was supported in part by the National Science Foundation through grants #IIP-0443945, #IIP-0726109, #CNS-0708344, #CNS-0821474, #IIP-0934327, #CNS-1039741, #IIS-1017344, #IIP-1032018, and #SMA-1028076. We acknowledge the resources provided by the University of Minnesota Supercomputing Institute, without which it would have been difficult to realize our experiments. This poster is based on the work in [5], submitted to Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2012; see [6] for the tech report version.