arXiv:1809.03867v1 [cs.MM] 8 Sep 2018




Noname manuscript No. (will be inserted by the editor)

Efficient Multimedia Similarity Measurement Using Similar Elements

Chengyuan Zhang† · Yunwu Lin† · Lei Zhu† · Zuping Zhang† · XinPan Yuan‡ · Fang Huang†

Received: date / Accepted: date

Abstract Online social networking techniques and large-scale multimedia systems are developing rapidly, which has not only brought great convenience to our daily life, but also generated, collected, and stored large-scale multimedia data. This trend has put forward higher requirements and greater challenges for massive multimedia data retrieval. In this paper, we investigate the problem of image similarity measurement, which is used in lots of applications. At first we propose the definition of similarity measurement of images and the related notions. Based on it, we present a novel basic method of similarity measurement named SMIN. To improve the performance of the calculation, we propose a novel indexing structure called SMI Temp Index (SMII for short). Besides, we establish an index of potential similar visual words off-line to solve the problem that the index cannot be reused. Experimental evaluations on two real image datasets demonstrate that our solution outperforms the state-of-the-art method.

Keywords Image similarity · SMI · SMI Temp Index · PSMI

Chengyuan Zhang, E-mail: [email protected]

Yunwu Lin, E-mail: [email protected]

Lei Zhu, E-mail: [email protected]

Zuping Zhang, E-mail: [email protected]

XinPan Yuan, E-mail: [email protected]

Fang Huang, E-mail: [email protected]

† School of Information Science and Engineering, Central South University, PR China
‡ School of Computer, Hunan University of Technology, China



1 Introduction

In recent years, online social networking techniques and large-scale multimedia systems [36,33,31,35] have been developing rapidly, which has not only brought great convenience to our daily life, but also generated, collected, and stored large-scale multimedia data [32], such as text, image [34], audio, video [45] and 3D data. For example, in China, Weibo (https://weibo.com/) is the largest online social networking service, which has 376 million active users, and more than 100 million micro-blogs containing short text, images, or short videos are posted on it. Facebook (https://facebook.com/), the most famous social networking platform in the world, reported 350 million images uploaded every day at the end of November 2013. More than 400 million tweets with texts and images have been generated by 140 million users on Twitter (http://www.twitter.com/), another popular social networking web site. Another type of common application on the Internet is multimedia data sharing services. Flickr (https://www.flickr.com/) is one of the most famous photo sharing web sites around the world; more than 3.5 million new images were uploaded to this platform every day in March 2013. More than 14 million articles are clicked every day on Pinterest, an attractive image social networking web site. More than 2 billion videos in total were stored in YouTube (https://www.youtube.com/), the most famous video sharing platform, by the end of 2013, and every minute 100 hours of video were uploaded to this service. The total watch time exceeded 42 billion minutes on IQIYI (http://www.iqiyi.com/), the most famous online video sharing service in China, and the number of its independent users is more than 230 million monthly. For audio sharing services, the total amount of audio on Himalaya (https://www.ximalaya.com/) had exceeded 15 million items as of December 2015.
Other web services like Wikipedia (https://en.wikipedia.org/), the largest and most popular free encyclopedia on the Internet, contain more than 40 million articles with pictures in 301 different languages. Other mobile applications such as WeChat, Instagram, etc., provide great convenience for us to share multimedia data. Thanks to these rich multimedia services and applications, multimedia techniques [40,46] are changing every aspect of our lives. On the other hand, the emergence of massive multimedia data [38] and applications puts forward greater challenges for information retrieval techniques.
Motivation. Textual similarity measurement is a classical issue in the communities of information retrieval and data mining, and lots of approaches have been proposed. Guo et al. [8] proposed to use vectors as basic elements, and the edit distance and Jaccard coefficient are used to calculate sentence similarity. Li et al. [17] proposed the use of word vectors to represent the meaning of words, and considered the influence of multiple factors such as word meaning, word order and sentence length on the calculation of sentence similarity. Unlike these studies of textual similarity measurement, in this paper we investigate the problem of image similarity measurement, which is a widely applied technique in lots of application scenarios such as image retrieval [37,44,41] and image similarity calculation and matching [39,47]. Two examples, shown in Figure 1 and Figure 2, describe this problem in a clearer way.

Example 1 In Figure 1, a user has a photo and she wants to find other pictures on the Internet which are highly similar to it. She can submit an image query containing this photo to the multimedia retrieval system. The system measures the visual content similarity between this photo and the images in the database, and then a set of similar images is returned.


Fig. 1: An example of multimedia retrieval via similarity measurement

Fig. 2: An example of multimedia retrieval via similarity measurement

Example 2 Figure 2 demonstrates another application of image similarity measurement. A user wants to measure the similarity between two pictures in a dataset quantitatively. She selects two pictures from the image dataset and inputs them into the similarity measurement system. According to the image similarity measurement algorithm, the system calculates the value of similarity between these images (e.g., 90%).

In order to improve the efficiency and accuracy of image similarity measurement, we present the definition of similarity measurement of images and the relevant notions. We introduce the measurement of similar visual words named SMI Naive (SMIN for short), which is the basic method for similarity measurement, and then propose the SMIN algorithm. After that, to optimize this method, we design a novel indexing structure named SMI Temp Index to reduce the time complexity of the calculation. In addition, another technique named index of potential similar visual words is proposed to solve the problem that the index cannot be reused: we can search this index to perform the measurement of similar visual words without having to repeatedly create a temporary index.
Contributions. Our main contributions can be summarized as follows:

– Firstly, we introduce the definition of similarity measurement of images and the related concepts. The image similarity calculation function is designed.

– We introduce the basic method of image similarity measurement, called SMI Naive (SMIN for short). In order to improve the performance of similarity measurement, based on it we design two indexing techniques named SMI Temp Index (SMII for short) and Index of Potential Similar Visual Words (PSMI for short).

– We have conducted extensive experiments on two real image datasets. Experimentalresults demonstrate that our solution outperforms the state-of-the-art method.

Roadmap. In the remainder of this paper, Section 2 presents the related work about image similarity measurement and image retrieval. In Section 3, the definition of image similarity measurement and related concepts are proposed. We present the basic similarity measurement method named SMIN and two improved indexing techniques and algorithms in Section 4. Our experimental results are presented in Section 5. Finally, we conclude the paper in Section 6.

2 Related Work

In this section, we present the related work on image similarity measurement and image retrieval, which is relevant to this study.
Image Similarity Measurement. In recent years, image similarity measurement has become a hot issue in the communities of multimedia systems [43] and information retrieval, since massive image data can be accessed on the Internet. On the other hand, like textual similarity measurement, image similarity measurement is an important technique which can be applied in lots of applications, such as image retrieval, image matching, image recognition and classification, computer vision, etc. Many researchers work on this issue and numerous approaches have been proposed. For example, Coltuc et al. [5] studied the usefulness of the normalized compression distance (NCD for short) for image similarity detection. In their work, they considered the correlation between NCD-based feature vectors extracted for each image. Albanesi et al. [2] proposed a novel class of image similarity metrics based on a wavelet decomposition. They investigated the theoretical relationship between this novel class of metrics and the well-known structural similarity index. Abe et al. [1] studied similarity retrieval of trademark images represented by vector graphics. To improve the performance of the system, they introduced centroid distance into the feature extraction. Cicconet et al. [4] studied the problem of detecting duplication of scientific images. They introduced a data-driven solution based on a 3-branch Siamese Convolutional Neural Network which can serve to narrow down the pool of images. For multi-label image retrieval, Zhang et al. [53] proposed a novel deep hashing method named ISDH, in which an instance-similarity definition was applied to quantify the pairwise similarity of images holding multiple class labels. Kato et al. [12] proposed a novel solution for the problem of selecting image pairs that are more likely to match in Structure from Motion. They used Jaccard similarity and bag-of-visual-words in addition to tf-idf to measure the similarity between images. Wang et al. [30] designed a regularized distance metric framework named semantic discriminative metric learning (SDML for short). This framework combines geometric mean with normalized divergences and separates images from different classes simultaneously. Guha et al. [7] proposed a new approach called Sparse SNR (SSNR for short) to measure the similarity between two images using sparse reconstruction. Their measurement does not need any prior knowledge about the data type or the application. Khan et al. [14] proposed two halftoning methods to improve efficiency in generating structurally similar halftone images using the Structural Similarity Index Measurement. Their Method I improves efficiency as well as image quality, and Method II reaches a better image quality with fewer evaluations than the pixel-swapping algorithm used in Method I.

Near-duplicate image detection is another problem related to image similarity measurement. To solve the problem of near-duplicate image retrieval, Wang et al. [42] developed a novel spatial descriptor embedding method which encodes the relationship of the SIFT dominant orientation and the exact spatial position between local features and their context. Gadeski et al. [54] proposed an effective algorithm based on the MapReduce framework to identify near duplicates of images in large-scale image sets. Nian et al. [24] investigated this type of problem and presented an effective and efficient local-based representation method named Local-based Binary Representation to encode an image as a binary vector. Zlabinger et al. [55] developed a semi-automatic duplicate detection approach in which single-image duplicates are detected between sub-images based on a connected component approach, and duplicates between images are detected by a min-hashing method. Hsieh et al. [9] designed a novel framework that adopts multiple hash tables in a novel way for quick image matching and efficient duplicate image detection. Based on a hierarchical model, Li et al. [16] introduced an automatic NDIG mining approach utilizing adaptive global feature clustering and local feature refinement to solve the problem of mining near-duplicate image groups. Liu et al. [18] presented a variable-length signature to address the problem of near-duplicate image matching. They used the earth mover's distance to handle variable-length signatures. Yao et al. [51] developed a novel contextual descriptor which measures the contextual similarity of visual words to immediately discard mismatches and reduce the number of candidate images. For large-scale near-duplicate image retrieval, Fedorov et al. [6] introduced a feature representation combining three local descriptors, which is reproducible and highly discriminative. To improve the efficiency of near-duplicate image retrieval, Yıldız et al. [52] proposed a novel interest point selection method in which the distinctive subset is created with a ranking according to a density map.

Image Retrieval. Content-based image retrieval (CBIR for short) [10,15,50] retrieves images by analyzing visual contents, and therefore image representation [28,44] plays an important role in this task. In recent years, the task of CBIR has attracted more and more attention in the multimedia [47,48] and computer vision communities [41,39]. Many techniques have been proposed to support efficient multimedia queries and image recognition. Scale Invariant Feature Transform (SIFT for short) [20,21] is a classical method to extract visual features, which transforms an image into a large collection of local feature vectors. SIFT includes four main steps: (1) scale-space extrema detection; (2) keypoint localization; (3) orientation assignment; (4) keypoint descriptor. It is widely applied in lots of research works and applications. For example, Ke et al. [13] proposed a novel image descriptor named PCA-SIFT which combines SIFT techniques and the principal components analysis (PCA for short) method. Mortensen et al. [22] proposed a feature descriptor that augments SIFT with a global context vector. This approach adds curvilinear shape information from a much larger neighborhood to reduce mismatches. Liu et al. [19] proposed a novel image fusion method for multi-focus images with dense SIFT. This dense SIFT descriptor can not only


Notation                      Definition
D_I                           A given database of images
I_i                           The i-th image
W_i                           A visual word set
|W|                           The number of visual words in W
w^i_1                         The first visual word in the visual word set W_i
λ_k                           The similarity of two visual words
P_k = (w^i_k, w^j_k)          A similar visual word pair (SVWP)
⊗                             The operator that generates the set of SVWPs
λ                             The predefined similarity threshold
Ξ_i                           The set of visual word weights
Sim_I(I_i(W_i), I_j(W_j))     The image similarity measurement
μ_i                           The similarity of a visual word

Table 1: The summary of notations

be employed as the activity level measurement, but can also be used to match the mis-registered pixels between multiple source images to improve the quality of the fused image. Su et al. [27] designed a horizontal or vertical mirror reflection invariant binary descriptor named MBR-SIFT to solve the problem of image matching. Nam et al. [23] introduced a SIFT-feature-based blind watermarking algorithm to address the issue of copyright protection for DIBR 3D images. Charfi et al. [3] developed a bimodal hand identification system based on SIFT descriptors which are extracted from hand shape and palmprint modalities.

The bag-of-visual-words (BoVW for short) model [26,41,49] is another popular technique for CBIR and image recognition, which was first used in textual classification. This model transforms images into sparse hierarchical vectors by using visual words, so that a large number of images can be manipulated. Santos et al. [25] presented the first method based on the signature-based bag of visual words (S-BoVW for short) paradigm that considers texture information to generate textual signatures of image blocks for representing images. Karakasis et al. [11] presented an image retrieval framework that uses affine image moment invariants as descriptors of local image areas through the BoVW representation. Wang et al. [29] presented an improved practical spatial weighting for BoV (PSW-BoV for short) to alleviate this effect while keeping the efficiency.
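To make the BoVW idea concrete, the following Python sketch quantizes a handful of toy 2-D local descriptors against a hand-made codebook and produces the normalized visual-word histogram. The codebook, descriptor values, and function name are illustrative assumptions, not taken from any cited system:

```python
import math
from collections import Counter

def bovw_histogram(descriptors, codebook):
    """Quantize local descriptors (e.g. SIFT vectors) against a visual-word
    codebook and return an L1-normalized bag-of-visual-words histogram."""
    def nearest(d):
        # Index of the codebook centroid closest to descriptor d.
        return min(range(len(codebook)),
                   key=lambda k: math.dist(d, codebook[k]))
    counts = Counter(nearest(d) for d in descriptors)
    total = len(descriptors)
    return [counts.get(k, 0) / total for k in range(len(codebook))]

# Toy codebook of 3 visual words and 4 two-dimensional "descriptors".
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descs = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (5.1, 4.9)]
print(bovw_histogram(descs, codebook))  # [0.25, 0.5, 0.25]
```

In a real system the codebook would come from clustering (e.g. k-means over SIFT descriptors), and the histogram would be the sparse vector the text describes.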

3 Preliminaries

In this section, we first propose the definition of the image object and the related notions, then present the similarity measurement of two image objects. Besides, we review the techniques of image retrieval which are the basis of our work. Table 1 summarizes the notations frequently used throughout this paper to facilitate the discussion.

3.1 Problem Definition

Definition 1 (Image object) Let D_I be an image dataset and I_i and I_j be two images, I_i, I_j ∈ D_I. We define the image objects represented by the bag-of-visual-words model as I_i(W_i) and I_j(W_j), wherein W_i = {w^i_1, w^i_2, ..., w^i_m} and W_j = {w^j_1, w^j_2, ..., w^j_n} are the visual word sets generated by low-level feature extraction from I_i and I_j, and |W_i| = m and |W_j| = n are the numbers of visual words in these two sets respectively. In this study, we utilize the image object as the representation model of images for the task of image similarity measurement.

Definition 2 (Similarity of visual word) Given two image objects I_i(W_i) and I_j(W_j), wherein W_i = {w^i_1, w^i_2, ..., w^i_m} and W_j = {w^j_1, w^j_2, ..., w^j_n} are the visual word sets. The similarity of two visual words w^i_k ∈ W_i and w^j_k ∈ W_j is represented by λ_k = Sim_W(w^i_k, w^j_k), λ_k ∈ [0, 1], and if these visual words are identical, i.e., w^i_k = w^j_k, then λ_k = 1.

Definition 3 (Similar visual word pair) Given two visual words w^i_k ∈ W_i and w^j_k ∈ W_j whose similarity is λ_k = Sim_W(w^i_k, w^j_k), and let λ be the predefined similarity threshold. If λ_k > λ, this visual word pair is called a similar visual word pair (SVWP for short), represented as P_k = (w^i_k, w^j_k).

Definition 4 (Similarity measurement of two image objects) Given two image objects I_i(W_i) and I_j(W_j), let the operation W_i ⊗ W_j = {P_1, P_2, ..., P_l} generate the set of SVWPs which contain the visual words in W_i and W_j, where l = |W_i ⊗ W_j|, and let the similarity set of these pairs be denoted as Λ = {λ_1, λ_2, ..., λ_l}, ∀λ_i ∈ Λ, λ_i > λ. Let ξ^i_k and ξ^j_k be the weights of the visual words w^i_k and w^j_k. For the image objects I_i(W_i) and I_j(W_j), the sets of their visual word weights are denoted as Ξ_i = {ξ^i_1, ξ^i_2, ..., ξ^i_l} and Ξ_j = {ξ^j_1, ξ^j_2, ..., ξ^j_l}. The defining equation of the similarity between I_i(W_i) and I_j(W_j) is as follows:

Sim_I(I_i(W_i), I_j(W_j)) = F(m, n, l, Λ, Ξ_i, Ξ_j)    (1)

where m and n are the numbers of visual words of I_i(W_i) and I_j(W_j) respectively. Clearly, Sim_I(I_i(W_i), I_j(W_j)) can meet the systematic similarity measurement criterion.

Theorem 1 (Monotonicity of the similarity function) The similarity measurement Sim_I(I_i(W_i), I_j(W_j)) satisfies the following five monotonicity conditions:

– Sim_I(I_i(W_i), I_j(W_j)) is a monotonically increasing function of the weights of the visual words in SVWPs, i.e., for any ξ^i_x ∈ Ξ_i and ξ^j_y ∈ Ξ_j, increasing ξ^i_x or ξ^j_y increases F(m, n, l, Λ, Ξ_i, Ξ_j).
– Sim_I(I_i(W_i), I_j(W_j)) is a monotonically increasing function of the similarities of the SVWPs Λ = {λ_1, λ_2, ..., λ_l}, i.e., increasing any λ_x ∈ Λ increases F(m, n, l, Λ, Ξ_i, Ξ_j).
– Sim_I(I_i(W_i), I_j(W_j)) is a monotonically increasing function of the number of SVWPs l, i.e., ∀l_1, l_2 ∈ N+, if l_1 > l_2, then F(m, n, l_1, Λ, Ξ_i, Ξ_j) > F(m, n, l_2, Λ, Ξ_i, Ξ_j).
– Sim_I(I_i(W_i), I_j(W_j)) is a monotonically decreasing function of the weights of the visual words which are not in SVWPs.
– Sim_I(I_i(W_i), I_j(W_j)) is a monotonically decreasing function of the number of visual words which are not in SVWPs, i.e., if r_1 = m + n − l_1 and r_2 = m + n − l_2, then r_1 > r_2 → l_1 < l_2 and F(m, n, l_1, Λ, Ξ_i, Ξ_j) < F(m, n, l_2, Λ, Ξ_i, Ξ_j).

According to Definition 4 and Theorem 1, the similarity measurement for two image objects is proposed, which is described formally as follows.

Given two image objects I_i(W_i) and I_j(W_j), with m = |W_i| and n = |W_j|, the sets of their visual word weights are Ξ_i = {ξ^i_1, ξ^i_2, ..., ξ^i_l} and Ξ_j = {ξ^j_1, ξ^j_2, ..., ξ^j_l}. The SVWP set of I_i(W_i) and I_j(W_j) is {P_1, P_2, ..., P_l}, l ≤ min(m, n), and the similarity set of these pairs is Λ = {λ_1, λ_2, ..., λ_l}. The similarity measurement function Sim_I(I_i(W_i), I_j(W_j)) is:

Sim_I(I_i(W_i), I_j(W_j)) =
  ( ∑_{k=1}^{l} λ_k ξ^i_k ξ^j_k ) / ( √( ∑_{k=1}^{m} ξ^i_k · ∑_{k=1}^{n} ξ^j_k ) · √( ∑_{k=1}^{l} λ_k² ξ^i_k ξ^j_k + ∑_{k=l+1}^{m} ξ^i_k · ∑_{k=l+1}^{n} ξ^j_k ) )    (2)

Equation (2) clearly satisfies the monotonicity described in Theorem 1. On the other hand, if the two image objects are identical, i.e., I_i(W_i) = I_j(W_j), W_i = W_j, m = n = l, and ξ^i_k = ξ^j_k, then Sim_I(I_i(W_i), I_j(W_j)) = 1.
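For concreteness, here is a small Python sketch of one possible reading of Equation (2). The fraction grouping, the convention that the first l entries of each weight list correspond to the l SVWPs, and all numbers are our assumptions for illustration, not values from the paper:

```python
import math

def sim_images(lam, xi_i, xi_j):
    """One reading of Equation (2).
    lam  -- similarities of the l SVWPs
    xi_i -- weights of the m visual words of I_i (first l entries are paired)
    xi_j -- weights of the n visual words of I_j (first l entries are paired)"""
    l = len(lam)
    # Numerator: sum of lambda_k * xi^i_k * xi^j_k over the l SVWPs.
    num = sum(lam[k] * xi_i[k] * xi_j[k] for k in range(l))
    # Denominator: two square-root factors, the second penalizing the
    # weights of visual words left unpaired (indices l..m-1 and l..n-1).
    d1 = math.sqrt(sum(xi_i) * sum(xi_j))
    d2 = math.sqrt(sum(lam[k] ** 2 * xi_i[k] * xi_j[k] for k in range(l))
                   + sum(xi_i[l:]) * sum(xi_j[l:]))
    return num / (d1 * d2)

# Two SVWPs (l = 2), three words in I_i (m = 3), two in I_j (n = 2).
print(round(sim_images([0.9, 0.8], [0.5, 0.4, 0.3], [0.6, 0.5]), 4))
```

The unpaired-weight term in the second square root is what produces the penalty effect on non-similar visual elements discussed below Theorem 2.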

Theorem 2 (Non-commutativity) The similarity measurement Sim_I(I_i(W_i), I_j(W_j)) does not satisfy the commutative law, i.e.,

Sim_I(I_i(W_i), I_j(W_j)) ≠ Sim_I(I_j(W_j), I_i(W_i))

In general, some visual words (e.g., noise words) in image objects have negative or reverse effects on the expression of the whole image. The SMI has a penalty effect on non-similar visual elements according to Theorem 1. This feature gives the SMI high accuracy for the similarity measurement of images.

4 Image Similarity Measurement Algorithm

4.1 The Measurement of Similar Visual Words

The SMI is subject to the time complexity of the calculation of similar visual words. μ_i represents the similarity of a similar visual word, as shown in the following formula:

μ_i = { max_{b_j ∈ S_B} Sim_I(a_i, b_j),  if this maximum is ≥ μ_0
      { 0,                                otherwise    (3)

where Sim_I(a_i, b_j) represents the cosine of the angle between the two vectors as the measurement of similarity, and μ_0 is the similarity threshold for the judgment.

We give an intuitive way to measure similar visual words. The pseudo-code of the algorithm is shown in Algorithm 1. In this work, this double-loop cosine calculation method is called SMI Naive (SMIN for short).

4.2 The Optimization of Calculating Similar Visual Words

In the context of massive multimedia data, a multimedia retrieval system or image similarity measurement system requires an efficient similarity measurement algorithm, so the optimization of the SMI's time complexity focuses on the calculation of similar visual words.
SMI Temp Index. To reduce the double-loop cosine calculation to one loop, a further approach is to construct an index γ_i of S_B for each vector a_i in S_A. According to experience, the dimension of the visual word vector is generally 200-300 dimensions to get better results.

For a vector a_i in S_A, we search for the vector b_j with the highest similarity in the temp index γ_i, so that the process requires only one similarity calculation. The n calculations of similar visual words ⟨a_i, b_j⟩ are reduced to vector searching, thereby


Algorithm 1 SMIN Algorithm

Input: S_A, S_B, μ_0.
Output: μ.
1: Initializing: μ ← ∅;
2: Initializing: S ← ∅;
3: Initializing: NS ← ∅;
4: Initializing: maxsim ← 0;
5: for each W_i ∈ S_A do
6:   for each W'_j ∈ S_B do
7:     if cos(W_i, W'_j) > maxsim then
8:       maxsim ← cos(W_i, W'_j);
9:     end if
10:    if maxsim ≥ μ_0 then
11:      S.Add(W_i);
12:      μ.Add(maxsim);
13:    else
14:      NS.Add(W_i);
15:      μ.Add(0);
16:    end if
17:  end for
18: end for
19: return μ;

reducing the execution time of the SMI. However, there is a flaw: every time the similar elements of a query are calculated, a temp index needs to be built, and the index cannot be reused. This temp index approach is called SMI Temp Index (SMII for short), as shown in Figure 3.

Index of Potential Similar Visual Words. In order to solve the problem that the index cannot be reused, we establish an index of potential similar visual words off-line in the process of word vector training. We can search this index to perform the measurement of similar visual words without having to repeatedly create a temporary index. The main steps for constructing the index of potential similar visual words are as follows:

– Establish an index for the whole visual word vector set using the trained word vector model.
– For any vector v, traverse the index to get a return set. In this set, the potential similar visual words whose similarity is greater than the threshold μ_0 are obtained, in descending order of similarity.
– The physical indexing structure of potential similar visual words can be implemented by a Huffman tree.
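The off-line construction step can be sketched as follows. For clarity the sketch uses a brute-force dictionary in place of the Huffman-tree structure, and all word ids and vectors are illustrative assumptions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def build_psw_index(vocab, mu0):
    """Off-line step: for every visual word, store the other words whose
    similarity exceeds mu0, sorted in descending order of similarity.
    `vocab` maps word id -> vector; a real system would back this with
    the Huffman-tree structure described in the text."""
    index = {}
    for w, vw in vocab.items():
        sims = [(u, cosine(vw, vu)) for u, vu in vocab.items() if u != w]
        index[w] = sorted((p for p in sims if p[1] > mu0),
                          key=lambda p: -p[1])
    return index

# Toy vocabulary of 2-D word vectors (illustrative values).
vocab = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}
idx = build_psw_index(vocab, 0.8)
print([u for u, _ in idx["a"]])  # ['b']
```

Because this index is built once off-line, the on-line measurement only performs lookups, which is the point of PSMI.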

According to the hierarchical Softmax strategy in Word2Vec, an original Word2Vec Huffman tree is constructed on the basis of visual word frequency, and each node (except the root node) represents a visual word and its corresponding vector.

We try to replace the vector with the potential similar visual words. Thus each node of the tree represents a visual word and its corresponding potential similar visual words. The index structure is illustrated in Figure 4.

We call the method using the global index of potential similar visual words PSMI. Algorithm 2 illustrates the pseudo-code of PSMI.

Algorithm 2 demonstrates the processing of the PSMI algorithm. Firstly, for each visual word vector W_i ∈ S_A, the algorithm executes the procedure HuffmanSearch(W_i) to get the node of the Huffman tree which contains W_i and stores it in P. Then, for each W'_j ∈ S_B, the algorithm selects each p_k from P and checks whether W'_j is equal to p_k.vector. If they


[Figure shows the words W_1, ..., W_n of S_A matched through the SMII against the words W'_1, ..., W'_n of S_B, returning similar visual word pairs ⟨W_i, W'_j⟩ with arg min ‖v_i − f_j‖² as the result.]
Fig. 3: The processing of similarity measurement via SMII

[Figure shows each visual word vector v_i being replaced by its pool of potential similar visual words.]
Fig. 4: The index structure of potential similar visual words


Algorithm 2 PSMI Algorithm

Input: S_A, S_B, μ_0.
Output: μ.
1: Initializing: μ ← ∅;
2: Initializing: S ← ∅;
3: Initializing: NS ← ∅;
4: Initializing: P ← ∅;
5: Initializing: maxsim ← 0;
6: for each W_i ∈ S_A do
7:   P ← HuffmanSearch(W_i);
8:   for each W'_j ∈ S_B do
9:     for each p_k ∈ P do
10:      if W'_j.equal(p_k.vector) then
11:        S.Add(W_i);
12:        μ.Add(p_k.sim);
13:        Break to loop W_i;
14:      end if
15:    end for
16:  end for
17:  NS.Add(W_i);
18:  μ.Add(0);
19: end for
20: return μ;

are equal, the algorithm adds W_i into the set S and adds p_k.sim into μ, then breaks to the outer loop over W_i. If no p_k.vector equals any W'_j, the algorithm adds W_i into the set NS and adds 0 into μ.
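The on-line lookup of Algorithm 2 can be sketched as follows, with a plain dictionary standing in for HuffmanSearch over the Huffman tree; the word ids and similarity values are made up for illustration:

```python
def psmi(SA, SB, psw_index):
    """PSMI sketch: mu[i] is the similarity recorded for the i-th word of SA.
    psw_index maps a word to its precomputed list of (similar_word, sim)
    pairs in descending similarity order, as built off-line."""
    mu = []
    for wi in SA:
        P = psw_index.get(wi, [])          # stand-in for HuffmanSearch(W_i)
        found = False
        for wj in SB:
            for pk_word, pk_sim in P:
                if wj == pk_word:          # W'_j equals p_k.vector
                    mu.append(pk_sim)
                    found = True
                    break                  # "break to loop W_i"
            if found:
                break
        if not found:                      # no potential similar word matched
            mu.append(0.0)
    return mu

index = {"cat": [("kitten", 0.92), ("dog", 0.71)], "car": [("truck", 0.85)]}
print(psmi(["cat", "car"], ["kitten", "bus"], index))  # [0.92, 0.0]
```

No cosine is computed on-line: the similarity values were fixed when the index was built, which is why the index can be reused across queries.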

5 Performance Evaluation

In this section, we present the results of a comprehensive performance study on the real image datasets Flickr and ImageNet to evaluate the efficiency and scalability of the proposed techniques. Specifically, we evaluate the effectiveness of the following techniques for image similarity measurement.

– WJ: the Word2Vec technique proposed in https://github.com/jsksxs360/Word2Vec.
– WMD: the Word2Vec technique based on word moving distance, proposed in https://github.com/crtomirmajer/wmd4j.
– SMIN: the double-loop cosine calculation technique proposed in Section 4.
– SMII: the advanced technique based on SMIN, which is proposed in Section 4.
– PSMI: the potential similar visual words technique based on SMII, which is also proposed in Section 4.

Datasets. The performance of the various algorithms is evaluated on two real image datasets. We first evaluate these algorithms on Flickr, which is obtained by crawling millions of images from the photo-sharing site Flickr (http://www.flickr.com/). For the scalability and performance evaluation, we randomly sampled five sub-datasets whose sizes vary from 200,000 to 1,000,000 from the image dataset. Similarly, another image dataset, ImageNet, which is widely used in image processing and computer vision, is used to evaluate the performance of these algorithms. The ImageNet dataset not only includes 14,197,122 images, but also contains 1.2 million images with SIFT features. We generate ImageNet datasets with sizes varying from 20K to 1M.

Workload. A workload for the image similarity query consists of 100 queries. The accuracy of the algorithms and the query response time are employed to evaluate the


[Panels: (a) hit rate (%) vs. number of visual words on Flickr for WJ, SMI, WMD; (b) the same on ImageNet.]
Fig. 5: Evaluation on the number of visual words on Flickr and ImageNet

[Panels: (a) response time (ms) vs. number of visual words on Flickr for SMIN, SMII, PSMI, WMD; (b) the same on ImageNet.]
Fig. 6: Evaluation on the number of visual words on Flickr and ImageNet

performance of the algorithms. The image dataset size grows from 0.2M to 1M; the number of query visual words varies from 20 to 100 on Flickr and from 50 to 250 on ImageNet. By default, the image dataset size, the number of query visual words on Flickr, and the number of query visual words on ImageNet are set to 0.2M, 40, and 100, respectively. Experiments are run on a PC with dual Intel Xeon 2.60GHz CPUs and 16GB of memory running Ubuntu. All algorithms in the experiments are implemented in Java. Note that we only consider WJ, SMI, and WMD in the accuracy comparison, because SMIN, SMII, and PSMI have the same error tolerance.
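The hit rate reported below can be computed over the 100-query workload roughly as follows; the assumption that a query "hits" when its returned answer equals the ground-truth answer is ours, since the paper does not spell out the metric.

```python
def hit_rate(results, ground_truth):
    # Percentage of workload queries whose returned answer matches
    # the ground truth (assumed definition of a "hit").
    hits = sum(1 for r, g in zip(results, ground_truth) if r == g)
    return 100.0 * hits / len(results)
```

For example, 75 correct answers out of 100 queries yields a hit rate of 75%.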

Evaluating hit rate on the number of visual words. We evaluate the hit rate against the number of query visual words on the Flickr and ImageNet datasets, as shown in Figure 5. The experiment on Flickr is shown in Figure 5(a). It is clear that the hit rates of WJ, SMI, and WMD decrease as the number of visual words rises. Notably, the hit rate of our method, SMI, is the highest all the time: it descends slowly from around 90% to about


Fig. 7: Evaluation on the number of images on Flickr and ImageNet — (a) Flickr, (b) ImageNet; hit rate (%) vs. number of images (M) for WJ, SMI, and WMD.

85%. On the other hand, the hit rates of WJ and WMD are very close. In the interval [20, 40] they go down rapidly, and after that their decrease becomes moderate. At 100, the hit rate of WJ is a little higher than WMD's, and both are much lower than SMI's. In Figure 5(b), all of the decreasing trends are similar. Apparently, the hit rate of SMI is the highest; it goes down gradually as the number of visual words increases. On the ImageNet dataset, the hit rate of WMD is a little higher than WJ's all the time.

Evaluating response time on the number of visual words. We evaluate the response time against the number of visual words on the Flickr and ImageNet datasets, as shown in Figure 6. In Figure 6(a), as the number of visual words increases, the response time of PSMI grows slightly and remains the lowest among these methods. The increasing trend of SMII is very moderate too, but it is slightly inferior to PSMI. Like PSMI and SMII, SMIN shows only a moderate growth as the number of visual words rises. Although its response time is higher than the former two, it is much lower than that of WMD, which grows fast over the interval [20, 100]. Figure 6(b) illustrates that the efficiency of PSMI is almost unchanged as the number of visual words increases, and it is the best among these four methods. As in the experiment on Flickr, the response times of both SMII and SMIN increase gradually, and both methods are much better than WMD.

Evaluating hit rate on the number of images. We evaluate the hit rate against the number of images on the Flickr and ImageNet datasets, as shown in Figure 7. Figure 7(a) demonstrates clearly that the hit rate of SMI is much higher than those of WJ and WMD; as the number of images increases, it fluctuates only slightly. The hit rate of WMD is almost unchanged as the number of images grows. On the other hand, the hit rate of WJ shows moderate growth in the interval [0.2, 0.6], after which it drops and is a little lower than WMD's. Clearly, the performance of SMI is the best. Figure 7(b) shows that the hit rate of SMI grows slightly in [0.2, 0.6] and then declines weakly, remaining higher than the other two. Like SMI, the hit rate of WMD reaches its maximum at 0.6 and then decreases over the interval [0.6, 0.8]. The opposite holds for WJ: its hit rate decreases moderately in [0.2, 0.6] and rises after 0.6.

Evaluating response time on the number of images. We evaluate the response time for different dataset sizes on the Flickr and ImageNet datasets, as shown in Figure 8. We can see from Figure 8(a) that the response times of PSMI and SMII increase slowly with the


Fig. 8: Evaluation on the number of images on Flickr and ImageNet — (a) Flickr, (b) ImageNet; response time (ms) vs. size of dataset (M) for SMIN, SMII, PSMI, and WMD.

growth of the dataset size. Both of them are much better than the others. The growth rate of SMIN is a little higher than the former two. The response time of WMD is the worst: it grows rapidly, exceeding 30,000 ms at 1.0M. In Figure 8(b), we see that the growth of WMD is again the fastest. As on Flickr, WMD performs the worst among them. By comparison, the upward trends of SMII and PSMI are much more moderate, and PSMI shows the best performance.

6 Conclusion

In this paper, we investigate the problem of image similarity measurement, which is a significant issue in many applications. First, we propose the definitions of image objects and of the similarity measurement of two images, along with related notions. We then present a basic method of image similarity measurement, named SMIN, based on Word2Vec. To improve the performance of the similarity calculation, we refine this method and propose the SMI Temp Index (SMII). To solve the problem that this index cannot be reused, we develop a novel indexing technique called the Index of Potential Similar Visual Words (PSMI). The experimental evaluation on two real image datasets shows that our solution outperforms the state-of-the-art method.

Acknowledgments: This work was supported in part by the National Natural Science Foundation of China (61702560), projects (2018JJ3691, 2016JC2011) of the Science and Technology Plan of Hunan Province, and the Research and Innovation Project of Central South University Graduate Students (2018zzts177, 2018zzts588).
